Quail VAD

The ai-coustics SDK includes Voice Activity Detection models and model-internal VAD post-processors for real-time applications. For new Voice AI pipelines, we recommend the standalone Quail VAD 2.0 model, or Quail Voice Focus VAD 2.0 when you need to detect only the primary speaker.

What is VAD?

A Voice Activity Detector is a system that identifies the presence of human speech in an audio stream. Voice agents use this signal for turn-taking, endpointing, and deciding when to send audio to downstream Speech-to-Text (STT) systems.

Quail VAD 2.0 is a standalone VAD model trained for noisy, multi-speaker environments. It predicts speech activity directly from the input audio and does not require Quail Voice Focus, Quail, or Rook enhancement to run. This makes it the recommended choice when you need a modular VAD component in a Voice AI pipeline.

Quail Voice Focus VAD 2.0 is the primary-speaker counterpart. It predicts speech activity for the targeted primary speaker only, ignoring interfering and background speech, mirroring how Quail Voice Focus isolates the primary speaker for enhancement. Use it when turn-taking should be driven by a single target speaker rather than any audible voice.

Use Cases

Integrating the VAD into your pipeline can significantly improve your application’s performance and user experience.

Improved Turn-Taking: In voice agent or conversational AI applications, the VAD provides a reliable signal for detecting the end of a user’s turn.
Cost Reduction: By processing audio only when speech is present, you can reduce computational load and downstream processing costs (e.g., for STT).
Enhanced STT Accuracy: Sending audio to your Speech-to-Text engine only when speech is present can reduce insertions and substitutions caused by background noise or interfering speech.
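
As a rough illustration of the cost-reduction and STT use cases, the sketch below groups per-frame VAD decisions into contiguous speech segments, so that only those spans need to be forwarded downstream. The helper name frames_to_segments and the list-of-booleans input are assumptions made for this example; they are not part of the ai-coustics SDK.

```python
from typing import Iterable, List, Tuple


def frames_to_segments(decisions: Iterable[bool]) -> List[Tuple[int, int]]:
    """Group per-frame speech flags into (start_frame, end_frame) spans.

    Hypothetical helper, not an SDK function. end_frame is exclusive.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # first frame index of the segment currently open, if any
    count = 0     # number of frames seen so far
    for index, is_speech in enumerate(decisions):
        count = index + 1
        if is_speech and start is None:
            start = index                    # speech begins: open a segment
        elif not is_speech and start is not None:
            segments.append((start, index))  # speech ends: close the segment
            start = None
    if start is not None:
        segments.append((start, count))      # stream ended during speech
    return segments


# One flag per processed frame, e.g. collected while querying the VAD.
print(frames_to_segments([False, True, True, False, True]))  # [(1, 3), (4, 5)]
```

Multiplying the frame indices by the model window length converts each span into start and end times in seconds.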

How it Works

The basic workflow is the same for standalone VAD models and model-internal VAD modes:

Create a processor with a VAD-capable model
Create a VAD instance from the processor
Process your audio
Query the VAD to see if speech was detected in the last processed frame

For standalone VAD use cases, create the processor with quail-vad-2.0-xxs-16khz, or quail-vf-vad-2.0-s-16khz to detect only the primary speaker. The model runs independently and produces a frame-level speech probability that the SDK compares against the configured sensitivity threshold.

Model Differences

The SDK supports three VAD options:

Approach	How it works	Dependency	Recommended use
Quail VAD 2.0	A standalone VAD model predicts speech probability directly from the input audio.	Requires only the Quail VAD model.	New Voice AI pipelines, modular LiveKit or Pipecat setups, and noisy environments where background speech, music, or impulsive noise can trigger false positives.
Quail Voice Focus VAD 2.0	A standalone VAD model predicts speech probability for the primary speaker only.	Requires only the Quail Voice Focus VAD model.	Single-target-speaker turn-taking where interfering speakers should not trigger the VAD.
Model-internal VAD (Quail VAD 1.0)	A speech enhancement model first suppresses non-speech components, then the SDK estimates speech activity from the remaining signal energy.	Requires an enhancement model such as Quail Voice Focus, Quail, or Rook.	Existing pipelines that already run ai-coustics enhancement and want a VAD signal from the same processor.

Quail VAD 2.0 and Quail Voice Focus VAD 2.0 are the preferred option when VAD should be managed independently from enhancement. Model-internal VAD remains useful when enhancement is already part of the signal chain and you want to avoid an additional model.

VAD Parameters

You can fine-tune the VAD’s behavior using the following parameters.

Sensitivity

Controls the sensitivity of the VAD.

There are two kinds of VADs offered by the SDK:

VAD models (e.g. Quail VAD 2.0): These are models specifically trained for voice activity detection. They output a probability of speech presence for each processed audio buffer, with 1.0 meaning the model is certain speech is present and 0.0 meaning the model is certain speech is not present. The probability is compared against the sensitivity threshold to determine if speech is detected.
Energy-based VAD of speech enhancement models (e.g. Quail, Rook): These models filter out background noise and enhance speech, but they do not explicitly output a VAD decision. To provide VAD functionality, the SDK determines whether speech is present based on how much energy is left in the signal after enhancement, since the model suppresses non-speech components. For these models, the sensitivity parameter controls the energy threshold for detecting speech presence. The formula for the energy threshold is $10^{-\text{sensitivity}}$, so higher sensitivity values require less energy in the signal and therefore result in more aggressive speech detection.

A value above the threshold triggers a speech detected decision.

Range:

On VAD models: 0.0 to 1.0
On energy-based VADs: 1.0 to 15.0

Default: model-specific
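
To make the two threshold types concrete, here is a minimal sketch of the arithmetic described above. The function names are hypothetical and not part of the SDK. It assumes that, for VAD models, the sensitivity value is used directly as the probability threshold, and it leaves open how the SDK measures the remaining signal energy.

```python
def energy_threshold(sensitivity: float) -> float:
    """Energy threshold for energy-based VADs: 10 ** (-sensitivity)."""
    if not 1.0 <= sensitivity <= 15.0:
        raise ValueError("energy-based VAD sensitivity must be in [1.0, 15.0]")
    return 10.0 ** (-sensitivity)


def speech_detected(value: float, threshold: float) -> bool:
    """A value strictly above the threshold counts as speech."""
    return value > threshold


# Energy-based VAD: a higher sensitivity gives a lower threshold, so less
# remaining energy is enough to report speech.
for sensitivity in (1.0, 6.0, 15.0):
    print(f"sensitivity {sensitivity:>4}: threshold {energy_threshold(sensitivity):.0e}")

# VAD model (assumption: sensitivity is used directly as the threshold).
print(speech_detected(0.8, 0.5))  # True
print(speech_detected(0.3, 0.5))  # False
```

Because the energy threshold is exponential in the sensitivity, each step of 1.0 changes it by a factor of ten.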

Speech Hold Duration

Controls how long the VAD continues to report speech after the audio signal no longer contains speech. This affects the stability of speech detected -> not detected transitions.

The VAD reports speech detected if the audio signal contained speech in at least 50% of the frames processed in the last speech_hold_duration * 2 seconds. For example, if speech_hold_duration is set to 0.5 seconds and the VAD stops detecting speech in the audio signal, the VAD continues to report speech for 0.5 seconds, assuming it does not detect speech again during that period. If a few frames of speech are detected during that period, those frames count toward the 50% calculation, which extends the speech detection period until the 50% threshold is no longer met.

The VAD returns a value per processed buffer, so this duration is rounded to the closest multiple of the model window length. For example, if the model has a processing window length of 10 ms, the VAD rounds the value up or down to the closest multiple of 10 ms. Because of this, reading the parameter back may return a different value than the one it was last set to.

Unit: Seconds
Range: 0.0 to 300 × the model window length
Default: 0.03 (30 ms)
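
The 50% rule and the rounding to the model window length can be sketched as follows. This illustrates the behavior described above and is not the SDK's implementation. The class and function names are hypothetical. The sketch assumes that, before the history is full, the fraction is taken over the frames seen so far, and it uses Python's built-in round(), whose tie-breaking may differ from the SDK's.

```python
from collections import deque


def duration_to_frames(duration_s: float, window_s: float) -> int:
    """Round a duration to the closest whole number of model windows."""
    return round(duration_s / window_s)


class SpeechHold:
    """Report speech while at least 50% of the recent frames contained speech."""

    def __init__(self, speech_hold_duration_s: float, window_s: float) -> None:
        hold_frames = duration_to_frames(speech_hold_duration_s, window_s)
        # The history spans 2 * speech_hold_duration. With a hold of 0 it
        # keeps a single frame, so the raw decision passes through unchanged.
        self._history = deque(maxlen=max(1, 2 * hold_frames))

    def update(self, raw_speech: bool) -> bool:
        """Add the raw per-frame decision and return the held decision."""
        self._history.append(raw_speech)
        return 2 * sum(self._history) >= len(self._history)


# 10 ms windows with a 50 ms hold give a 100 ms (10-frame) history.
hold = SpeechHold(speech_hold_duration_s=0.05, window_s=0.01)
for _ in range(10):
    hold.update(True)                        # continuous speech fills the history
print([hold.update(False) for _ in range(6)])
# [True, True, True, True, True, False]
```

After speech stops, the held decision stays true for five more frames (50 ms) and drops on the sixth, matching the configured hold duration.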

Minimum Speech Duration

Controls how long speech needs to be present in the audio signal before the VAD considers it speech. This affects the stability of speech not detected -> detected transitions.

Unit: Seconds
Range: 0.0 to 1.0
Default: 0.0
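
One way to picture this parameter is as an onset gate, sketched below. The class name is hypothetical, and the sketch assumes that "present" means an uninterrupted run of speech frames; the SDK may count frames differently.

```python
class MinimumSpeechGate:
    """Report speech only after it has lasted for a minimum duration."""

    def __init__(self, minimum_speech_duration_s: float, window_s: float) -> None:
        if not 0.0 <= minimum_speech_duration_s <= 1.0:
            raise ValueError("minimum speech duration must be in [0.0, 1.0] seconds")
        # Consecutive speech frames required before reporting speech. A
        # duration of 0.0 still needs one frame, so the first speech frame
        # is reported immediately.
        self._required = max(1, round(minimum_speech_duration_s / window_s))
        self._run = 0  # length of the current run of speech frames

    def update(self, raw_speech: bool) -> bool:
        """Add the raw per-frame decision and return the gated decision."""
        self._run = self._run + 1 if raw_speech else 0
        return self._run >= self._required


# 10 ms windows with a 30 ms minimum: speech is reported from the third
# consecutive speech frame onward.
gate = MinimumSpeechGate(minimum_speech_duration_s=0.03, window_s=0.01)
print([gate.update(v) for v in (True, True, True, True, False, True)])
# [False, False, True, True, False, False]
```

A larger value suppresses short bursts such as clicks or coughs, at the cost of reporting the start of real speech later.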

Best Practices

Use Quail VAD 2.0 by default: It is the standalone VAD model and is the recommended option for new Voice AI integrations.
Tune Sensitivity: The optimal sensitivity may vary depending on your audio source, environment, and turn-taking behavior. Start with the default and adjust against real production audio.
Primary-speaker VAD: Use Quail Voice Focus VAD to trigger the VAD only when the primary speaker is talking, instead of chaining Quail Voice Focus enhancement with Quail VAD.

Third-Party VADs

The enhanced audio output from our models will have reduced noise and reverberation, which can improve the accuracy of downstream VADs that may struggle in noisy conditions. However, when using a speech enhancement model with an enhancement level lower than 1.0, the output will have a noise component which helps improve ASR accuracy, but it may harm downstream VAD models if they are not robust to noise.

For new integrations, use Quail VAD 2.0 instead of adding a third-party VAD when possible. It is noise-robust, runs in the ai-coustics SDK, and does not require Torch or ONNX runtime dependencies.

If ASR accuracy is not a concern, you can still use ai-coustics enhancement models as a pre-processing step for third-party VADs:

If you need foreground speaker isolation, use one of our Quail Voice Focus models before the VAD to suppress background noise and interfering speech, so a generic VAD only triggers on the primary speaker.
If you need general speech detection, use one of our Rook models, which will preserve both foreground and background speech while suppressing non-speech sounds. Use an enhancement level of 1.0 (100%) for best VAD performance.
