This page contains details about the models compatible with the current version of the SDK.
All models are language-agnostic and can be used for any language, because they operate on the acoustic properties of the audio signal rather than on its linguistic content. All models can also be used with audio at any sample rate between 8 kHz and 192 kHz through our SDK. For details, see "Using models with non-native sample rates" below.

Quail

Far-field Speech Enhancement for Voice AI

Quail is designed for far-field and multi-speaker environments. It does not suppress distant-sounding speech, making it better suited for speakerphone setups, meeting rooms, or situations with multiple participants spread across a space.
  • ID: quail-l-16khz
  • File size: 35 MB
  • Window length: 10 ms
  • Optimal sample rate: 16 kHz
  • Optimal num frames: 160
  • Minimal algorithmic delay: 30 ms

  • ID: quail-l-8khz
  • File size: 33.4 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms

  • ID: quail-s-16khz
  • File size: 8.88 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms

  • ID: quail-s-8khz
  • File size: 8.43 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms
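
The "num frames" values listed for each model appear to follow directly from the window length and the sample rate (frames = sample rate × window length). The sketch below checks that relationship with plain arithmetic. The helper name `frames_per_window` is hypothetical and is not part of the SDK.

```python
def frames_per_window(sample_rate_hz: int, window_ms: float) -> int:
    """Number of audio frames (samples per channel) in one processing window.

    Hypothetical helper for illustration only; not an SDK function.
    """
    return round(sample_rate_hz * window_ms / 1000)


# Values taken from the model listings on this page.
print(frames_per_window(16_000, 10))  # 10 ms window at 16 kHz
print(frames_per_window(8_000, 10))   # 10 ms window at 8 kHz
```

For the 10 ms Quail models this reproduces the listed 160 frames at 16 kHz and 80 frames at 8 kHz.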

Quail Voice Focus

Near-field Speech Enhancement for Voice AI

Quail Voice Focus is optimized for near-field voice interactions. It prioritizes speech that sounds close to the microphone and suppresses speech that sounds distant, along with background noise. This makes it ideal for single-user, close-talk use cases (e.g., headsets or handheld devices).
  • ID: quail-vf-2.1-l-16khz
  • File size: 20 MB
  • Window length: 15 ms
  • Optimal sample rate: 16 kHz
  • Optimal num frames: 240
  • Minimal algorithmic delay: 30 ms

  • ID: quail-vf-2.1-s-16khz
  • File size: 5.3 MB
  • Window length: 15 ms
  • Optimal sample rate: 16 kHz
  • Optimal num frames: 240
  • Minimal algorithmic delay: 30 ms

Quail VAD

Noise-robust Voice Activity Detection

Quail VAD is a standalone, noise-robust Voice Activity Detection model for real-time Voice AI pipelines.
  • ID: quail-vad-2.0-xxs-16khz
  • File size: 630 kB
  • Window length: 15 ms
  • Optimal sample rate: 16 kHz
  • Optimal num frames: 240
  • Minimal algorithmic delay: 30 ms

Tyto

Audio Insight for Voice AI

Tyto is an audio intelligence model that predicts whether an audio signal is likely to cause failures in the downstream models that consume it (VAD, turn-taking, STT, and speech-to-speech).
  • ID: tyto-l-16khz
  • File size: 19.8 MB
  • Window length: 5 s
  • Native sample rate: 16 kHz

Rook

Speech Enhancement for Human Intelligibility

Rook reduces background noise and reverberation while preserving speech naturalness and intelligibility for human perception.
  • ID: rook-l-48khz
  • File size: 35.1 MB
  • Window length: 10 ms
  • Native sample rate: 48 kHz
  • Native num frames: 480
  • Minimal algorithmic delay: 30 ms

  • ID: rook-l-16khz
  • File size: 35 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms

  • ID: rook-l-8khz
  • File size: 33.4 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms

  • ID: rook-s-48khz
  • File size: 8.96 MB
  • Window length: 10 ms
  • Native sample rate: 48 kHz
  • Native num frames: 480
  • Minimal algorithmic delay: 30 ms

  • ID: rook-s-16khz
  • File size: 8.88 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms

  • ID: rook-s-8khz
  • File size: 8.43 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms
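
Because each Rook variant has a fixed native sample rate, one way to choose a model ID is to pick the variant whose native rate matches the audio. The sketch below is only an illustration: `pick_rook_model` is a hypothetical helper, not an SDK function, and the fallback policy for non-matching rates (smallest native rate that still covers the input, otherwise the largest available) is an assumption rather than documented guidance.

```python
# Native sample rates of the Rook variants listed above.
ROOK_NATIVE_RATES_HZ = (8_000, 16_000, 48_000)


def pick_rook_model(input_rate_hz: int, size: str = "l") -> str:
    """Return a Rook model ID for the given input sample rate.

    Hypothetical helper. `size` is "l" (large) or "s" (small), matching the
    suffixes used in the model IDs on this page.
    """
    if size not in ("l", "s"):
        raise ValueError("size must be 'l' or 's'")
    # Assumed policy: prefer the smallest native rate that is >= the input rate.
    covering = [rate for rate in ROOK_NATIVE_RATES_HZ if rate >= input_rate_hz]
    native = min(covering) if covering else max(ROOK_NATIVE_RATES_HZ)
    return f"rook-{size}-{native // 1000}khz"


print(pick_rook_model(16_000))       # exact match on the native rate
print(pick_rook_model(44_100, "s"))  # no exact match, falls back per the policy
```

An exact match returns the corresponding ID from the list, e.g. `rook-l-16khz` for 16 kHz audio.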

Using models with non-native sample rates

Our models are trained for specific sample rates (8 kHz, 16 kHz, and 48 kHz). However, the ai-coustics SDK lets you use any model with audio at a non-native sample rate by resampling the input internally. The model always processes the audio at its native sample rate. You can choose any sample rate between 8 kHz and 192 kHz when calling the processor’s initialize function in the SDK, regardless of the model being used.

Higher-than-native sample rate (e.g. 48 kHz audio with a 16 kHz model): The SDK removes the frequency content above the model’s native Nyquist frequency (half the model’s native sample rate) before feeding the audio to the model. The output is then upsampled back to the original sample rate. The mixback (enhancement_level) stays at the higher sample rate, so the output contains the full frequency range of the original audio, but the model’s enhancement is applied only to frequencies below the model’s native Nyquist frequency.

Lower-than-native sample rate (e.g. 8 kHz audio with a 16 kHz model): When the input sample rate is lower than the model’s native sample rate, compute resources are effectively “wasted” on the higher frequencies, which contain no signal (the model is just processing zeros). If a model matching your audio’s sample rate is available, we recommend using that model to avoid unnecessary compute and ensure optimal performance.

In both cases, CPU consumption is only marginally affected by the input sample rate.
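
The two cases above can be made concrete with a little arithmetic on Nyquist frequencies. The following is a minimal sketch, not SDK code: the function names are hypothetical, and the only facts taken from this page are the 8 kHz to 192 kHz input range and the rule that the model always runs at its native rate.

```python
MIN_INPUT_RATE_HZ = 8_000     # lower bound stated on this page
MAX_INPUT_RATE_HZ = 192_000   # upper bound stated on this page


def enhanced_bandwidth_hz(input_rate_hz: int, model_rate_hz: int) -> float:
    """Upper edge of the band that both carries signal and is seen by the model.

    Hypothetical helper: the smaller of the input and model Nyquist frequencies.
    """
    if not MIN_INPUT_RATE_HZ <= input_rate_hz <= MAX_INPUT_RATE_HZ:
        raise ValueError("input sample rate must be between 8 kHz and 192 kHz")
    return min(input_rate_hz, model_rate_hz) / 2


def empty_model_band_hz(input_rate_hz: int, model_rate_hz: int) -> float:
    """Width of the model's band that contains no input signal (0 if none)."""
    return max(model_rate_hz - input_rate_hz, 0) / 2


# 48 kHz audio with a 16 kHz model: enhancement reaches up to 8 kHz.
print(enhanced_bandwidth_hz(48_000, 16_000))
# 8 kHz audio with a 16 kHz model: the model's upper 4 kHz holds no signal.
print(empty_model_band_hz(8_000, 16_000))
```

A non-zero result from `empty_model_band_hz` is a hint that a model with a lower native sample rate, if one exists, would avoid processing an empty band.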
Learn more about performance here.

Compatibility

All models can be used in any of our available integrations, including all of our SDK language bindings and the Pipecat filter. The LiveKit plugin offers only a limited selection of models. For more information about model compatibility with our different integrations, see the compatibility matrix.