ai-coustics provides different speech enhancement and voice activity detection model families for real-time SDK use cases. Use this guide to choose a model that fits your needs. All models are language-agnostic and can be used for any language, as they operate on the acoustic properties of the audio signal rather than processing its linguistic content.

Overview

Speech Enhancement for Voice AI

Best for: Improving Voice AI Agents and Speech-to-Text (STT) accuracy. Includes Quail Voice Focus.
  • Model Family: Quail
  • Real-time

Perceptual Speech Enhancement

Best for: Removing background noise and reverb in real-time communication use cases.
  • Model Family: Rook
  • Real-time

Voice Activity Detection

Best for: Detecting speech in noisy Voice AI pipelines. Includes Quail VAD.
  • Model Family: Quail VAD
  • Real-time

Quail

SDK, Real-time, Human-to-Machine
The Quail models are purpose-built for Voice AI Agents and human-to-machine interactions. Unlike standard noise suppression, Quail is tuned to improve the performance of downstream Speech-to-Text (STT) engines.
Quail Voice Focus is optimized for near-field voice interactions. It prioritizes speech that sounds close to the microphone and suppresses speech that sounds distant, along with background noise. This makes it ideal for single-user, close-talk use cases (e.g., headsets or handheld devices).
Voice Focus 2.1 listens for a moment at the start of each session before applying suppression. During this warm-up, audio may sound closer to the original. Once a clear primary speaker is detected (typically within a few seconds), full suppression kicks in. The sooner the primary speaker talks, the shorter the warm-up period. On very short clips where the primary speaker doesn’t get a chance to speak, suppression may not fully activate. This is by design: Voice Focus 2.1 prioritizes accuracy over speed and will not suppress a speaker it hasn’t confidently identified yet.
When used with an enhancement level of 100%, Quail Voice Focus 2.1 can also serve as a pre-processing step for third-party VADs that do not perform well in noisy conditions. At this level it suppresses background noise and speech more aggressively, which can improve VAD accuracy but may also harm ASR performance.
Quail, in contrast, is designed for far-field and multi-speaker environments. It does not suppress distant-sounding speech, making it better suited for speakerphone setups, meeting rooms, or situations with multiple participants spread across a space.
The Quail models are designed to enhance the performance of Voice AI Agents and STT systems, and may not always produce the most natural-sounding audio for human listeners. It is expected that some noise and reverberation may remain in the output, as these can actually help improve STT accuracy by providing additional acoustic context. If your primary goal is to improve the listening experience for humans, we recommend using the Rook models instead.
Take a look at our ASR optimization guide
  • ID: quail-vf-2.1-l-16khz
  • File size: 20 MB
  • Window length: 15 ms
  • Optimal sample rate: 16 kHz
  • Optimal num frames: 240
  • Minimal algorithmic delay: 30 ms
  • ID: quail-vf-2.1-s-16khz
  • File size: 5.3 MB
  • Window length: 15 ms
  • Optimal sample rate: 16 kHz
  • Optimal num frames: 240
  • Minimal algorithmic delay: 30 ms
  • ID: quail-l-16khz
  • File size: 35 MB
  • Window length: 10 ms
  • Optimal sample rate: 16 kHz
  • Optimal num frames: 160
  • Minimal algorithmic delay: 30 ms
  • ID: quail-l-8khz
  • File size: 33.4 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms
  • ID: quail-s-16khz
  • File size: 8.88 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms
  • ID: quail-s-8khz
  • File size: 8.43 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms
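
The frame counts in the list above follow from the window length and the sample rate. The sketch below reproduces that arithmetic in Python. It assumes that one frame means one sample per channel, which is consistent with the listed figures; the function names are hypothetical helpers, not part of the ai-coustics SDK.

```python
def frames_per_window(window_ms: float, sample_rate_hz: int) -> int:
    """Frames (assumed: samples per channel) in one processing window."""
    return round(window_ms * sample_rate_hz / 1000)


def delay_in_frames(delay_ms: float, sample_rate_hz: int) -> int:
    """Express a delay given in milliseconds as a frame count."""
    return round(delay_ms * sample_rate_hz / 1000)


# Figures taken from the model list above.
quail_vf_frames = frames_per_window(15, 16_000)  # quail-vf-2.1-l-16khz
quail_l_frames = frames_per_window(10, 16_000)   # quail-l-16khz
quail_s_8k_frames = frames_per_window(10, 8_000)  # quail-s-8khz

print(quail_vf_frames, quail_l_frames, quail_s_8k_frames)
```

Feeding audio in buffers of this size matches the model's window, which is presumably why the list calls these values optimal or native.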

Quail VAD

SDK, Real-time, Voice Activity Detection
Quail VAD 2.0 is a standalone, noise-robust Voice Activity Detection model for real-time Voice AI pipelines. It predicts speech activity directly from the input audio and, unlike Quail VAD 1.0, does not require a larger speech enhancement model to run alongside it.
  • ID: quail-vad-2.0-xxs-16khz
  • File size: 630 kB
  • Window length: 15 ms
  • Optimal sample rate: 16 kHz
  • Optimal num frames: 240
  • Minimal algorithmic delay: 30 ms
Quail VAD 1.0 continues to be available for use. This VAD is paired with a speech enhancement model, analyzing the energy in the enhanced audio to make its prediction.
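
To illustrate what an energy-based decision on enhanced audio can look like, here is a generic RMS-threshold detector in Python. This is a textbook technique shown for intuition only, not the actual Quail VAD 1.0 algorithm; the function names and the threshold value are assumptions.

```python
import math
from typing import Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(enhanced_frame: Sequence[float], threshold: float = 0.01) -> bool:
    """Flag a frame as speech when its energy exceeds a threshold.

    The threshold is an arbitrary illustrative value (assumption).
    """
    return frame_rms(enhanced_frame) >= threshold


# One 15 ms window at 16 kHz (240 samples): silence vs. a 200 Hz tone.
silence = [0.0] * 240
tone = [0.1 * math.sin(2 * math.pi * 200 * n / 16_000) for n in range(240)]

print(is_speech(silence), is_speech(tone))
```

A detector like this only works well when the noise has already been removed, which is why an energy-based approach needs an enhancement model in front of it.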

Rook

SDK, Real-time, Human-to-Human
The Rook models are specifically optimized for human-to-human interaction in real-time constrained systems (e.g. voice calls). They reduce background noise and reverberation while preserving speech naturalness and intelligibility for human perception. In contrast to the Quail models, Rook will suppress any sound that it does not recognize as speech. This also makes Rook suitable as a pre-processing step for third-party VADs that do not perform well in noisy conditions. However, note that Rook preserves both foreground and background speech.
  • ID: rook-l-48khz
  • File size: 35.1 MB
  • Window length: 10 ms
  • Native sample rate: 48 kHz
  • Native num frames: 480
  • Minimal algorithmic delay: 30 ms
  • ID: rook-l-16khz
  • File size: 35 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms
  • ID: rook-l-8khz
  • File size: 33.4 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms
  • ID: rook-s-48khz
  • File size: 8.96 MB
  • Window length: 10 ms
  • Native sample rate: 48 kHz
  • Native num frames: 480
  • Minimal algorithmic delay: 30 ms
  • ID: rook-s-16khz
  • File size: 8.88 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms
  • ID: rook-s-8khz
  • File size: 8.43 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms
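
The Rook variants above differ only in size (l or s) and native sample rate, so choosing one can be expressed as a simple lookup. The following Python sketch uses the model IDs from the list; the table and function names are hypothetical and not part of the SDK, and returning None for a non-native rate is an illustrative choice.

```python
from typing import Optional

# Model IDs copied from the list above, keyed by (size, native sample rate).
ROOK_MODELS = {
    ("l", 48_000): "rook-l-48khz",
    ("l", 16_000): "rook-l-16khz",
    ("l", 8_000): "rook-l-8khz",
    ("s", 48_000): "rook-s-48khz",
    ("s", 16_000): "rook-s-16khz",
    ("s", 8_000): "rook-s-8khz",
}


def pick_rook_model(size: str, sample_rate_hz: int) -> Optional[str]:
    """Return the Rook model ID whose native rate matches the audio.

    Returns None when no variant has that native rate; the caller then
    has to decide which non-native model to fall back to.
    """
    return ROOK_MODELS.get((size, sample_rate_hz))


print(pick_rook_model("s", 16_000))
print(pick_rook_model("l", 44_100))
```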

Using models with non-native sample rates

Our models are trained for specific sample rates (8 kHz, 16 kHz, and 48 kHz). However, the ai-coustics SDK allows you to use any model with audio at non-native sample rates by internally resampling the input. The model always processes the audio at its native sample rate. You can choose any sample rate between 8 kHz and 192 kHz when calling the processor’s initialize function in the SDK, regardless of the model being used.
Higher-than-native sample rate (e.g. 48 kHz audio with a 16 kHz model): In this case, the SDK removes the frequency content above the model’s native Nyquist frequency (half the model’s native sample rate) before feeding the audio to the model. The enhanced audio is then upsampled back to the original sample rate. The mixback (enhancement_level) is applied at the higher sample rate, so the output contains the full frequency range of the original audio, but the model’s enhancement is applied only to frequencies below the model’s native Nyquist frequency.
Lower-than-native sample rate (e.g. 8 kHz audio with a 16 kHz model): When the input audio sample rate is lower than the model’s native sample rate, compute resources are effectively “wasted” on higher frequencies that contain no signal (the model is just processing zeros). Therefore, if a model matching your audio’s sample rate is available, we recommend using that model to avoid unnecessary compute and ensure optimal performance.
In both cases, delay and CPU consumption are not affected by the input sample rate.
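
The two resampling cases come down to comparing Nyquist frequencies. The Python sketch below classifies a given input rate and model rate and reports the band in which enhancement can take effect. It only models the arithmetic described above; the class and function names are hypothetical, and nothing here calls the ai-coustics SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResamplingPlan:
    case: str                     # "native", "higher-than-native" or "lower-than-native"
    input_bandwidth_hz: float     # Nyquist frequency of the input audio
    enhanced_bandwidth_hz: float  # upper edge of the band the model can enhance


def describe_resampling(input_rate_hz: int, native_rate_hz: int) -> ResamplingPlan:
    """Classify an input/model sample rate pair."""
    # Range taken from the text: 8 kHz to 192 kHz is accepted.
    if not 8_000 <= input_rate_hz <= 192_000:
        raise ValueError("input sample rate must be between 8 kHz and 192 kHz")

    input_nyquist = input_rate_hz / 2
    native_nyquist = native_rate_hz / 2

    if input_rate_hz > native_rate_hz:
        case = "higher-than-native"
    elif input_rate_hz < native_rate_hz:
        case = "lower-than-native"
    else:
        case = "native"

    # Enhancement is limited by whichever Nyquist frequency is lower.
    return ResamplingPlan(case, input_nyquist, min(input_nyquist, native_nyquist))


print(describe_resampling(48_000, 16_000))
print(describe_resampling(8_000, 16_000))
```

For 48 kHz audio with a 16 kHz model, the input spans 0 to 24 kHz but only the band up to 8 kHz is enhanced; for 8 kHz audio with a 16 kHz model, the signal ends at 4 kHz and the remainder of the model's band carries no signal.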
Learn more about performance here.

Compatibility

All models can be used in any of our available integrations, including all of our SDK language bindings and the Pipecat filter. The LiveKit plugin has a limited selection of models available; for more information, see here. For details on model compatibility across our integrations, see the compatibility matrix.