The ai-coustics SDK includes Voice Activity Detection models and model-internal VAD post-processors for real-time applications. For new Voice AI pipelines, we recommend using the standalone Quail VAD 2.0 model.

What is VAD?

A Voice Activity Detector is a system that identifies the presence of human speech in an audio stream. Voice agents use this signal for turn-taking, endpointing, and deciding when to send audio to downstream Speech-to-Text (STT) systems. Quail VAD 2.0 is a standalone VAD model trained for noisy, multi-speaker environments. It predicts speech activity directly from the input audio and does not require Quail Voice Focus, Quail, or Rook enhancement to run. This makes it the recommended choice when you need a modular VAD component in a Voice AI pipeline.

Use Cases

Integrating the VAD into your pipeline can improve your application’s turn-taking, processing cost, and transcription accuracy.
  • Improved Turn-Taking: In voice agent or conversational AI applications, the VAD provides a reliable signal for detecting the end of a user’s turn.
  • Cost Reduction: By processing audio only when speech is present, you can reduce computational load and downstream processing costs (e.g., for STT).
  • Enhanced STT Accuracy: Sending audio to your Speech-to-Text engine only when speech is present can reduce insertions and substitutions caused by background noise or interfering speech.

How it Works

The basic workflow is the same for standalone VAD models and model-internal VAD modes:
  1. Create a processor with a VAD-capable model
  2. Create a VAD instance from the processor
  3. Process your audio
  4. Query the VAD to see if speech was detected in the last processed frame
For standalone VAD use cases, create the processor with quail-vad-2.0-xxs-16khz. The model runs independently and produces a frame-level speech probability that the SDK compares against the configured sensitivity threshold.
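The shape of that loop can be sketched in plain Python. This is a minimal illustration, not the ai-coustics SDK API: `StandInVad`, `toy_predictor`, `gate_frames`, and the 0.5 threshold are all hypothetical names and values invented for this example, and the "model" is a toy amplitude measure.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

Frame = List[float]  # one buffer of mono samples


@dataclass
class StandInVad:
    """Hypothetical stand-in for a VAD handle; NOT the ai-coustics SDK API."""

    predict: Callable[[Frame], float]  # returns a speech probability in [0, 1]
    sensitivity: float = 0.5           # illustrative threshold, not an SDK default
    _last_probability: float = 0.0

    def process(self, frame: Frame) -> None:
        # Step 3: process one buffer and remember the model output.
        self._last_probability = self.predict(frame)

    def is_speech_detected(self) -> bool:
        # Step 4: report the decision for the last processed frame.
        return self._last_probability > self.sensitivity


def toy_predictor(frame: Frame) -> float:
    # Toy "model": mean absolute amplitude, clipped to [0, 1].
    # A trained VAD model is far more robust than this.
    return min(1.0, sum(abs(s) for s in frame) / len(frame))


def gate_frames(vad: StandInVad, frames: Iterable[Frame]) -> List[Frame]:
    """Keep only the frames flagged as speech, e.g. to forward to STT."""
    kept = []
    for frame in frames:
        vad.process(frame)
        if vad.is_speech_detected():
            kept.append(frame)
    return kept


# Four 160-sample buffers: silence, two loud buffers, near-silence.
frames = [[0.0] * 160, [0.9] * 160, [0.8] * 160, [0.01] * 160]
vad = StandInVad(predict=toy_predictor)
speech_frames = gate_frames(vad, frames)
print(f"forwarded {len(speech_frames)} of {len(frames)} frames")
```

In a real integration, the processor and VAD objects come from the SDK and the per-frame query replaces `is_speech_detected` here. The point of the sketch is only the order of operations: process a buffer first, then query the decision for that buffer.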

Model Differences

The SDK supports two VAD approaches:
| Approach | How it works | Dependency | Recommended use |
| --- | --- | --- | --- |
| Quail VAD 2.0 | A standalone VAD model predicts speech probability directly from the input audio. | Requires only the Quail VAD model. | New Voice AI pipelines, modular LiveKit or Pipecat setups, and noisy environments where background speech, music, or impulsive noise can trigger false positives. |
| Model-internal VAD (Quail VAD 1.0) | A speech enhancement model first suppresses non-speech components, then the SDK estimates speech activity from the remaining signal energy. | Requires an enhancement model such as Quail Voice Focus, Quail, or Rook. | Existing pipelines that already run ai-coustics enhancement and want a VAD signal from the same processor. |
Quail VAD 2.0 is the preferred option when VAD should be managed independently from enhancement. Model-internal VAD remains useful when enhancement is already part of the signal chain and you want to avoid an additional model.

VAD Parameters

You can fine-tune the VAD’s behavior using the following parameters.
Controls the sensitivity of the VAD.
There are two kinds of VADs offered by the SDK:
  • VAD models (e.g. Quail VAD 2.0): These are models specifically trained for voice activity detection. They output a probability of speech presence for each processed audio buffer, with 1.0 meaning the model is certain speech is present and 0.0 meaning the model is certain speech is not present. The probability is compared against the sensitivity threshold to determine if speech is detected.
  • Energy-based VAD of speech enhancement models (e.g. Quail, Rook): These models filter out background noise and enhance speech, but they do not explicitly output a VAD decision. To provide VAD functionality, the SDK determines whether speech is present based on how much energy is left in the signal after enhancement, since the model suppresses non-speech components. For these models, the sensitivity parameter controls the energy threshold for detecting speech presence. The formula for the energy threshold is $10^{-\text{sensitivity}}$, so higher sensitivity values require less energy in the signal and therefore result in more aggressive speech detection.
A value above the threshold triggers a "speech detected" decision.
Range:
  • On VAD models: 0.0 to 1.0
  • On energy-based VADs: 1.0 to 15.0
Default: model-specific
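The two threshold rules above can be written out as a short sketch. The function names are hypothetical and chosen for this example; they are not SDK calls. How the SDK measures the residual energy (units, averaging) is not specified here, so `residual_energy` is just a placeholder number.

```python
def energy_threshold(sensitivity: float) -> float:
    """Energy threshold for energy-based VADs: 10 ** (-sensitivity)."""
    if not 1.0 <= sensitivity <= 15.0:
        raise ValueError("energy-based VAD sensitivity must be in [1.0, 15.0]")
    return 10.0 ** (-sensitivity)


def speech_from_probability(probability: float, sensitivity: float) -> bool:
    """VAD models: the speech probability must exceed the threshold."""
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("VAD model sensitivity must be in [0.0, 1.0]")
    return probability > sensitivity


def speech_from_energy(residual_energy: float, sensitivity: float) -> bool:
    """Energy-based VADs: the residual energy must exceed 10 ** (-sensitivity)."""
    return residual_energy > energy_threshold(sensitivity)


# Higher sensitivity means a smaller energy threshold.
for s in (1.0, 6.0, 15.0):
    print(f"sensitivity={s:>4}: threshold={energy_threshold(s):.0e}")
```

Read literally, the two kinds of VAD move in opposite directions: on an energy-based VAD, raising the value lowers the threshold and detects speech more readily, while on a VAD model, raising the value raises the probability the model must exceed.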
Controls how long the VAD continues to report speech after the audio signal no longer contains speech. This affects the stability of speech detected -> not detected transitions. The VAD reports speech detected if the audio signal contained speech in at least 50% of the frames processed in the last speech_hold_duration * 2 seconds. For example, if speech_hold_duration is set to 0.5 seconds and the VAD stops detecting speech in the audio signal, the VAD will continue to report speech for 0.5 seconds, assuming it does not detect speech again during that period. If a few frames of speech are detected during that period, those frames are included in the 50% calculation, which extends the speech detection period until the 50% threshold is no longer met.
The VAD returns a value per processed buffer, so this duration is rounded to the closest multiple of the model window length. For example, if the model has a processing window length of 10 ms, the value is rounded up or down to the closest multiple of 10 ms. Because of this, reading the parameter back may return a different value than the one you last set.
  • Unit: Seconds
  • Range: 0.0 to 300 × the model window length (for example, 3.0 seconds with a 10 ms window)
  • Default: 0.03 (30 ms)
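The 50% rule and the rounding can be simulated on a list of raw per-frame decisions. This is a sketch under stated assumptions, not the SDK's implementation: the function names are hypothetical, frames missing at start-up count as silence, and a raw speech frame is reported immediately so that only the falling transition is held.

```python
from collections import deque
from typing import Iterable, List


def round_to_window(duration_s: float, window_s: float) -> float:
    """Round a duration to the closest multiple of the model window length."""
    return round(duration_s / window_s) * window_s


def apply_speech_hold(raw: Iterable[bool], hold_s: float, window_s: float) -> List[bool]:
    """Hold the speech decision using the 50% rule over the last 2 * hold_s seconds."""
    hold_frames = round(hold_s / window_s)
    if hold_frames == 0:
        return list(raw)  # no hold: pass the raw decisions through
    window = 2 * hold_frames
    history = deque(maxlen=window)  # most recent raw decisions
    smoothed = []
    for has_speech in raw:
        history.append(has_speech)
        # Assumption: frames not yet seen count as silence (fixed denominator).
        held = sum(history) >= 0.5 * window
        # Assumption: a raw speech frame is reported at once; only the
        # speech -> no speech transition is delayed by the hold.
        smoothed.append(has_speech or held)
    return smoothed


# 10 ms window, default hold of 0.03 s: 6 frames of speech, then silence.
raw = [True] * 6 + [False] * 6
held = apply_speech_hold(raw, hold_s=0.03, window_s=0.01)
print(held.index(False) - raw.index(False), "extra frames reported as speech")
```

With these inputs the history spans 6 frames, so the decision stays on for 3 frames (30 ms) after the raw signal drops, matching the configured hold. A speech frame arriving during that period raises the count again and postpones the drop.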
Controls how long speech needs to be present in the audio signal before the VAD reports it as speech. This affects the stability of speech not detected -> detected transitions.
The VAD returns a value per processed buffer, so this duration is rounded to the closest multiple of the model window length. For example, if the model has a processing window length of 10 ms, the value is rounded up or down to the closest multiple of 10 ms. Because of this, reading the parameter back may return a different value than the one you last set.
  • Unit: Seconds
  • Range: 0.0 to 1.0
  • Default: 0.0
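The onset behavior can be sketched the same way. The function name is hypothetical, and the sketch assumes that "present" means an unbroken run of raw speech frames. Whether the SDK requires consecutive frames is not stated here.

```python
from typing import Iterable, List


def apply_minimum_speech_duration(
    raw: Iterable[bool], min_s: float, window_s: float
) -> List[bool]:
    """Report speech only once it has lasted at least min_s seconds."""
    min_frames = round(min_s / window_s)
    run = 0  # length of the current unbroken run of raw speech frames
    out = []
    for has_speech in raw:
        run = run + 1 if has_speech else 0
        # Assumption: the run must be consecutive; any gap resets the count.
        out.append(has_speech and run >= min_frames)
    return out


# 10 ms window, 30 ms minimum: a one-frame blip followed by a longer burst.
raw = [True, False, True, True, True, True, False]
print(apply_minimum_speech_duration(raw, min_s=0.03, window_s=0.01))
```

With a 30 ms minimum the isolated first frame is suppressed and the burst is reported from its third frame onward. With the default of 0.0 the raw decisions pass through unchanged.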

Best Practices

  • Use Quail VAD 2.0 by default: It is the standalone VAD model and is the recommended option for new Voice AI integrations.
  • Tune Sensitivity: The optimal sensitivity may vary depending on your audio source, environment, and turn-taking behavior. Start with the default and adjust against real production audio.

Third-Party VADs

For new integrations, use Quail VAD 2.0 instead of adding a third-party VAD when possible. It runs in the ai-coustics SDK and does not require Torch or ONNX runtime dependencies.
You can still use ai-coustics enhancement models as a pre-processing step for third-party VADs. The enhanced audio output has reduced noise and reverberation, which can improve the accuracy of downstream VADs that struggle in noisy conditions. If a third-party VAD is required, choose the enhancement model based on what you need to detect:
  • Foreground speaker isolation: Use one of our Quail Voice Focus 2.1 models with an enhancement level of 100% before the third-party VAD. This suppresses background noise and background speech more aggressively, which can improve VAD accuracy but may also harm downstream STT performance.
  • General speech detection: Use one of our Rook models, which preserve both foreground and background speech while suppressing non-speech sounds. Use an enhancement level of 100% for best third-party VAD performance.