What is VAD?
A Voice Activity Detector is a system that identifies the presence of human speech in an audio stream. The ai-coustics VAD is tightly integrated with our Quail enhancement models, allowing it to make highly accurate predictions even in noisy environments or with interfering speakers. While our standard Quail models handle most environments effectively, the Quail Voice Focus model is the best choice for pure speaker isolation. It specifically targets the primary speaker, preventing background chatter from triggering false positives.
Use Cases
Integrating the VAD into your pipeline can significantly improve your application’s performance and user experience.
- Improved Turn-Taking: In voice agent or conversational AI applications, the VAD provides a reliable signal for detecting the end of a user’s turn.
- Cost Reduction: By processing audio only when speech is present, you can reduce computational load and downstream processing costs (e.g., for STT).
- Enhanced STT Accuracy: Providing a clean, speech-only audio stream to your Speech-to-Text engine can reduce errors like insertions and substitutions caused by background noise or interfering speech.
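As a rough illustration of the cost-reduction and turn-taking points, the sketch below forwards only speech frames to a downstream consumer and marks the end of a turn after a run of non-speech frames. It operates on plain `(audio, is_speech)` pairs, where `is_speech` stands in for the VAD decision for that frame. The names `gate_frames` and `end_of_turn_frames` are hypothetical and not part of the ai-coustics SDK.

```python
from typing import Iterable, List, Tuple


def gate_frames(
    frames: Iterable[Tuple[bytes, bool]],
    end_of_turn_frames: int = 3,
) -> Tuple[List[bytes], List[int]]:
    """Keep only speech frames and record where each user turn ends.

    `frames` yields (audio, is_speech) pairs. A turn is considered finished
    once `end_of_turn_frames` consecutive non-speech frames follow speech.
    """
    forwarded: List[bytes] = []  # frames that would be sent on to STT
    turn_ends: List[int] = []    # frame indices at which a turn ended
    in_turn = False
    silent_run = 0

    for index, (audio, is_speech) in enumerate(frames):
        if is_speech:
            forwarded.append(audio)
            in_turn = True
            silent_run = 0
        elif in_turn:
            silent_run += 1
            if silent_run >= end_of_turn_frames:
                turn_ends.append(index)
                in_turn = False
                silent_run = 0
    return forwarded, turn_ends
```

In a real pipeline the boolean would come from querying the VAD after each processed frame, and the silence threshold would be chosen to match the pacing of your conversation flow.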
How it Works
The VAD is not a standalone component; it is created from an existing model instance. It analyzes the enhanced audio from the model to make its prediction, benefiting from the noise and reverb removal already performed. The basic workflow is:
- Create a processor with a model
- Create a VAD instance from the processor
- Process your audio
- Query the VAD to see if speech was detected in the last processed frame
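The four steps above can be sketched as follows. This example deliberately does not use the real SDK: `StubProcessor`, `StubVad`, `is_speech_detected`, the model name, and the fixed energy threshold are all placeholders invented for illustration, so consult the SDK reference for the actual class and method names. The point is only the order of operations: the VAD is bound to a processor and reports on the most recently processed frame.

```python
from typing import List


class StubProcessor:
    """Stand-in for an enhancement processor; NOT the real SDK class."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name
        self.last_energy = 0.0

    def process(self, frame: List[float]) -> List[float]:
        # A real processor would enhance the frame here; this stub passes it
        # through unchanged and only records its mean energy.
        self.last_energy = sum(s * s for s in frame) / max(len(frame), 1)
        return frame


class StubVad:
    """Stand-in VAD bound to a processor; NOT the real SDK class."""

    def __init__(self, processor: StubProcessor, threshold: float = 1e-3) -> None:
        self._processor = processor
        self._threshold = threshold

    def is_speech_detected(self) -> bool:
        # Reports on the most recently processed frame only.
        return self._processor.last_energy > self._threshold


# 1. Create a processor with a model (the model name is a placeholder).
processor = StubProcessor("placeholder-model")
# 2. Create a VAD instance from the processor.
vad = StubVad(processor)

# 3. Process the audio frame by frame, and
# 4. query the VAD after each frame.
frames = [[0.0] * 160, [0.5, -0.5] * 80, [0.0] * 160]
decisions = []
for frame in frames:
    processor.process(frame)
    decisions.append(vad.is_speech_detected())

print(decisions)  # [False, True, False]
```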
VAD Parameters
You can fine-tune the VAD’s behavior using the following parameters.
Sensitivity
- Description: Controls the sensitivity (energy threshold) of the VAD. Higher values make the VAD more likely to classify audio as speech.
- Formula: Energy Threshold =
- Range: 0.0 to 15.0
- Default: 6.0
Speech Hold Duration
- Description: Controls how long the VAD continues to report speech after the audio signal no longer contains speech. This is useful for bridging short pauses within a sentence.
- Unit: Seconds
- Range: 0.0 to 20 times the model window length
- Default: 0.05
Minimum Speech Duration
- Description: Controls how long speech must be present in the audio signal before the VAD reports it as speech. This helps filter out short, non-speech sounds.
- Unit: Seconds
- Range: 0.0 to 1.0
- Default: 0.0
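To make the interplay between Speech Hold Duration and Minimum Speech Duration more concrete, here is a generic sketch of how a hold time and a minimum-duration gate can be applied to raw per-frame decisions. It is an illustration of the two concepts only and is not the ai-coustics implementation; `smooth_decisions` and `frame_seconds` are hypothetical names, and durations are rounded to whole frames for simplicity.

```python
from typing import List


def smooth_decisions(
    raw: List[bool],
    frame_seconds: float,
    speech_hold_duration: float = 0.05,
    minimum_speech_duration: float = 0.0,
) -> List[bool]:
    """Apply a minimum-duration gate and a hold time to raw frame decisions."""
    if frame_seconds <= 0:
        raise ValueError("frame_seconds must be positive")

    # Work in whole frames to avoid floating-point accumulation errors.
    hold_frames = round(speech_hold_duration / frame_seconds)
    min_frames = round(minimum_speech_duration / frame_seconds)

    out: List[bool] = []
    speech_run = 0  # consecutive raw speech frames seen so far
    hold_left = 0   # hold frames still available after confirmed speech
    for is_speech in raw:
        speech_run = speech_run + 1 if is_speech else 0
        if is_speech and speech_run >= min_frames:
            # Speech has lasted long enough: report it and re-arm the hold.
            out.append(True)
            hold_left = hold_frames
        elif hold_left > 0:
            # Recently confirmed speech: keep reporting it during the hold.
            out.append(True)
            hold_left -= 1
        else:
            out.append(False)
    return out


# 10 ms frames, 20 ms hold, 20 ms minimum speech duration.
raw = [True, False, False, True, True, True, False, False, False, False]
print(smooth_decisions(raw, 0.01, 0.02, 0.02))
# [False, False, False, False, True, True, True, True, False, False]
```

In this run the isolated one-frame blip at the start is suppressed by the minimum duration, while the hold keeps the output active for two frames after the speech ends. A longer hold bridges pauses at the cost of a later end-of-speech signal, and a longer minimum duration rejects more clicks at the cost of a later speech onset.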
Best Practices
- Tune Sensitivity: The optimal sensitivity may vary depending on your audio source and environment. Start with the default and adjust as needed.
- Use with Quail Voice Focus: For applications with multiple speakers, using the VAD with the Voice Focus model provides the best results for isolating the primary speaker.