The ai-coustics SDK includes Voice Activity Detection (VAD) models and model-internal VAD post-processors for real-time applications. For new Voice AI pipelines, we recommend using the standalone Quail VAD 2.0 model.
What is VAD?
A Voice Activity Detector (VAD) is a system that identifies the presence of human speech in an audio stream. Voice agents use this signal for turn-taking, endpointing, and deciding when to send audio to downstream Speech-to-Text (STT) systems.
Quail VAD 2.0 is a standalone VAD model trained for noisy, multi-speaker environments. It predicts speech activity directly from the input audio and does not require Quail Voice Focus, Quail, or Rook enhancement to run. This makes it the recommended choice when you need a modular VAD component in a Voice AI pipeline.
Use Cases
Integrating the VAD into your pipeline can improve your application’s performance and user experience:
- Improved Turn-Taking: In voice agent or conversational AI applications, the VAD provides a reliable signal for detecting the end of a user’s turn.
- Cost Reduction: By processing audio only when speech is present, you can reduce computational load and downstream processing costs (e.g., for STT).
- Enhanced STT Accuracy: Sending audio to your Speech-to-Text engine only when speech is present can reduce insertions and substitutions caused by background noise or interfering speech.
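The cost-reduction and STT-accuracy points both come down to forwarding audio only while the VAD reports speech. The following is a minimal sketch of that gating step. It does not use the SDK: `gate_frames` and the toy frame data are hypothetical, and the per-frame speech flags are assumed to come from whatever VAD you run.

```python
from typing import Iterable, Iterator, Tuple


def gate_frames(frames: Iterable[Tuple[bytes, bool]]) -> Iterator[bytes]:
    """Yield only the audio frames whose VAD flag is True."""
    for audio, speech_detected in frames:
        if speech_detected:
            yield audio


# Toy input: four frames paired with hypothetical VAD decisions.
frames = [(b"f0", False), (b"f1", True), (b"f2", True), (b"f3", False)]

forwarded = list(gate_frames(frames))              # frames that would go to STT
saved_fraction = 1 - len(forwarded) / len(frames)  # share of audio not sent
```

In a real pipeline the forwarded frames would feed your STT client, and the fraction of audio dropped gives a rough idea of the downstream savings.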
How it Works
The basic workflow is the same for standalone VAD models and model-internal VAD modes:
- Create a processor with a VAD-capable model
- Create a VAD instance from the processor
- Process your audio
- Query the VAD to see if speech was detected in the last processed frame
For Quail VAD 2.0, the model identifier is quail-vad-2.0-xxs-16khz.
The model runs independently and produces a frame-level speech probability that the SDK compares against the configured sensitivity threshold.
Model Differences
The SDK supports two VAD approaches:

| Approach | How it works | Dependency | Recommended use |
|---|---|---|---|
| Quail VAD 2.0 | A standalone VAD model predicts speech probability directly from the input audio. | Requires only the Quail VAD model. | New Voice AI pipelines, modular LiveKit or Pipecat setups, and noisy environments where background speech, music, or impulsive noise can trigger false positives. |
| Model-internal VAD (Quail VAD 1.0) | A speech enhancement model first suppresses non-speech components, then the SDK estimates speech activity from the remaining signal energy. | Requires an enhancement model such as Quail Voice Focus, Quail, or Rook. | Existing pipelines that already run ai-coustics enhancement and want a VAD signal from the same processor. |
VAD Parameters
You can fine-tune the VAD’s behavior using the following parameters.
Sensitivity
Controls the sensitivity of the VAD. The SDK offers two kinds of VAD:
- VAD models (e.g. Quail VAD 2.0): These are models specifically trained for voice activity detection. They output a probability of speech presence for each processed audio buffer, with 1.0 meaning the model is certain speech is present and 0.0 meaning the model is certain speech is not present. The probability is compared against the sensitivity threshold to determine if speech is detected.
- Energy-based VAD of speech enhancement models (e.g. Quail, Rook): These models filter out background noise and enhance speech, but they do not explicitly output a VAD decision. To provide VAD functionality, the SDK determines whether speech is present from how much energy is left in the signal after enhancement, since the model suppresses non-speech components. For these models, the sensitivity parameter controls the energy threshold for detecting speech presence. The energy threshold decreases as sensitivity increases, so higher sensitivity values require less energy in the signal and therefore result in more aggressive speech detection.
- Range on VAD models: 0.0 to 1.0
- Range on energy-based VADs: 1.0 to 15.0
Speech Hold Duration
Controls how long the VAD continues to report speech after the audio signal no longer contains speech. This affects the stability of speech detected -> not detected transitions.
The VAD reports speech detected if the audio signal contained speech in at least 50% of the frames processed in the last speech_hold_duration * 2 seconds.
For example, if speech_hold_duration is set to 0.5 seconds and the VAD stops detecting speech in the audio signal, the VAD will continue to report speech for 0.5 seconds, assuming it does not detect speech again during that period. If a few frames of speech are detected during that period, those frames are included in the 50% calculation, which extends the speech detection period until the 50% threshold is no longer met.
The VAD returns a value per processed buffer, so this duration is rounded to the closest multiple of the model window length. For example, if the model has a processing window length of 10 ms, the duration is rounded up or down to the closest multiple of 10 ms. Because of this, reading the parameter back may return a different value than the one it was last set to.
- Unit: Seconds
- Range: 0.0 to 300 × the model window length
- Default: 0.03 (30 ms)
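The 50% rule and the rounding to the model window length can be illustrated with a small simulation. This is a sketch under stated assumptions, not the SDK's implementation: `duration_to_frames` and `apply_speech_hold` are hypothetical helpers, ties in the rounding are assumed to round up, frames before the start of the stream are assumed to count as silence, and a frame that itself contains speech is assumed to be reported immediately (minimum speech duration is ignored here).

```python
from collections import deque
from typing import List, Sequence


def duration_to_frames(duration_ms: int, window_ms: int) -> int:
    """Round a duration to the nearest whole number of model windows.

    Integer arithmetic avoids float error; ties round up (an assumption).
    """
    return (2 * duration_ms + window_ms) // (2 * window_ms)


def apply_speech_hold(raw_flags: Sequence[bool], hold_frames: int) -> List[bool]:
    """Report speech while at least 50% of the last 2 * hold_frames frames had speech."""
    window_len = 2 * hold_frames
    recent = deque(maxlen=window_len)  # sliding window of raw per-frame decisions
    reported = []
    for flag in raw_flags:
        recent.append(bool(flag))
        # Compare against the full window length, so unseen frames count as silence.
        held = window_len > 0 and 2 * sum(recent) >= window_len
        reported.append(bool(flag) or held)
    return reported


# A 0.5 s hold with a 10 ms model window corresponds to 50 frames.
hold_frames = duration_to_frames(500, 10)

# One second of speech followed by 0.6 s of silence.
raw = [True] * 100 + [False] * 60
reported = apply_speech_hold(raw, hold_frames)
held_after_speech = sum(reported[100:])  # silent frames still reported as speech
```

Under these assumptions, the silent frames that are still reported as speech cover exactly the hold duration, which matches the 0.5 second example above. A short burst of speech inside that period raises the count in the window and pushes the end of the reported speech later.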
Minimum Speech Duration
Controls how long speech needs to be present in the audio signal before the VAD considers it speech. This affects the stability of speech not detected -> detected transitions.
The VAD returns a value per processed buffer, so this duration is rounded to the closest multiple of the model window length. For example, if the model has a processing window length of 10 ms, the duration is rounded up or down to the closest multiple of 10 ms. Because of this, reading the parameter back may return a different value than the one it was last set to.
- Unit: Seconds
- Range: 0.0 to 1.0
- Default: 0.0
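To show how a minimum speech duration can suppress short false triggers, here is a small sketch. It is not the SDK's implementation: `apply_min_speech` is a hypothetical helper, and it assumes that speech must be present in consecutive frames for the configured duration before it is reported, which the description above does not spell out.

```python
from typing import List, Sequence


def apply_min_speech(raw_flags: Sequence[bool], min_frames: int) -> List[bool]:
    """Report speech only once it has lasted min_frames consecutive frames."""
    run = 0  # length of the current uninterrupted run of speech frames
    reported = []
    for flag in raw_flags:
        run = run + 1 if flag else 0
        # With min_frames of 0 or 1, a speech frame is reported immediately.
        reported.append(bool(flag) and run >= max(min_frames, 1))
    return reported


# 3 frames corresponds to 30 ms with a 10 ms model window.
raw = [False, True, True, False, True, True, True, True]
gated = apply_min_speech(raw, 3)
```

In this toy input the two-frame burst is ignored and speech is first reported on the third frame of the longer run, so a larger value trades onset latency for fewer spurious detections.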
Best Practices
- Use Quail VAD 2.0 by default: It is the standalone VAD model and is the recommended option for new Voice AI integrations.
- Tune Sensitivity: The optimal sensitivity may vary depending on your audio source, environment, and turn-taking behavior. Start with the default and adjust against real production audio.