This page lists the models compatible with the current version of the SDK.
The SDK can run every model at any sample rate; the SDK documentation on sample rates covers this in more detail.

Quail

Far-field Speech Enhancement for Voice AI

Quail is designed for far-field and multi-speaker environments. It does not suppress distant-sounding speech, making it better suited for speakerphone setups, meeting rooms, or situations with multiple participants spread across a space.
  • ID: quail-l-16khz
  • File size: 35 MB
  • Window length: 10 ms
  • Optimal sample rate: 16 kHz
  • Optimal num frames: 160
  • Minimal algorithmic delay: 30 ms
  • ID: quail-l-8khz
  • File size: 33.4 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms
  • ID: quail-s-16khz
  • File size: 8.88 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms
  • ID: quail-s-8khz
  • File size: 8.43 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms
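The frame counts listed above follow from the window length and the sample rate: frames per window = sample rate × window length. The sketch below checks that arithmetic against the Quail values on this page. The helper `frames_per_window` is a hypothetical name used for illustration only and is not part of the SDK.

```python
def frames_per_window(sample_rate_hz: int, window_ms: int) -> int:
    """Return the number of frames (samples per channel) in one window.

    Hypothetical helper for illustration; not an SDK function.
    """
    frames, remainder = divmod(sample_rate_hz * window_ms, 1000)
    if remainder:
        raise ValueError("window does not hold a whole number of frames")
    return frames


# Values copied from the list above:
# (model ID, sample rate in Hz, window length in ms, listed num frames)
LISTED = [
    ("quail-l-16khz", 16_000, 10, 160),
    ("quail-l-8khz", 8_000, 10, 80),
    ("quail-s-16khz", 16_000, 10, 160),
    ("quail-s-8khz", 8_000, 10, 80),
]

for model_id, rate_hz, window_ms, listed_frames in LISTED:
    # Each listed frame count should match the computed one.
    assert frames_per_window(rate_hz, window_ms) == listed_frames, model_id
```

The same relation accounts for the other entries on this page, for example 15 ms at 16 kHz giving 240 frames.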

Quail Voice Focus

Near-field Speech Enhancement for Voice AI

Quail Voice Focus is optimized for near-field voice interactions. It prioritizes speech that sounds close to the microphone and suppresses speech that sounds distant, along with background noise. This makes it ideal for single-user, close-talk use cases (e.g., headsets or handheld devices).
  • ID: quail-vf-2.1-l-16khz
  • File size: 20 MB
  • Window length: 15 ms
  • Optimal sample rate: 16 kHz
  • Optimal num frames: 240
  • Minimal algorithmic delay: 30 ms
  • ID: quail-vf-2.1-s-16khz
  • File size: 5.3 MB
  • Window length: 15 ms
  • Optimal sample rate: 16 kHz
  • Optimal num frames: 240
  • Minimal algorithmic delay: 30 ms

Quail VAD

Noise-robust Voice Activity Detection

Quail VAD is a standalone, noise-robust Voice Activity Detection model for real-time Voice AI pipelines.
  • ID: quail-vad-2.0-xxs-16khz
  • File size: 630 kB
  • Window length: 15 ms
  • Optimal sample rate: 16 kHz
  • Optimal num frames: 240
  • Minimal algorithmic delay: 30 ms
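In a real-time pipeline, audio usually has to be cut into windows of the listed size before it reaches the model. The sketch below splits a mono buffer into windows of 240 frames, the optimal num frames listed above for 16 kHz. `split_into_windows` is a hypothetical helper, and zero-padding the final partial window is an assumption made for this example, not documented SDK behavior.

```python
from typing import Iterator, List, Sequence


def split_into_windows(
    samples: Sequence[float],
    num_frames: int = 240,
    pad_value: float = 0.0,
) -> Iterator[List[float]]:
    """Yield consecutive windows of exactly `num_frames` samples.

    Hypothetical helper for illustration; not an SDK function.
    The last window is padded with `pad_value` if the buffer does not
    divide evenly (an assumption, other strategies are possible).
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    for start in range(0, len(samples), num_frames):
        window = list(samples[start:start + num_frames])
        if len(window) < num_frames:
            window.extend([pad_value] * (num_frames - len(window)))
        yield window


# One second of silent mono audio at 16 kHz.
one_second = [0.0] * 16_000
windows = list(split_into_windows(one_second))

# 16,000 samples do not divide evenly by 240, so the last window is padded.
assert all(len(w) == 240 for w in windows)
```

A streaming application might instead hold the leftover samples until the next buffer arrives rather than padding; which is preferable depends on the latency budget.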

Tyto

Audio Insight for Voice AI

Tyto is an audio intelligence model that predicts whether an audio signal is likely to cause failures in the downstream models that consume it (VAD, turn-taking, STT, and speech-to-speech).
  • ID: tyto-l-16khz
  • File size: 19.8 MB
  • Window length: 5 s
  • Native sample rate: 16 kHz

Rook

Speech Enhancement for Human Intelligibility

Rook reduces background noise and reverberation while preserving speech naturalness and intelligibility for human perception.
  • ID: rook-l-48khz
  • File size: 35.1 MB
  • Window length: 10 ms
  • Native sample rate: 48 kHz
  • Native num frames: 480
  • Minimal algorithmic delay: 30 ms
  • ID: rook-l-16khz
  • File size: 35 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms
  • ID: rook-l-8khz
  • File size: 33.4 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms
  • ID: rook-s-48khz
  • File size: 8.96 MB
  • Window length: 10 ms
  • Native sample rate: 48 kHz
  • Native num frames: 480
  • Minimal algorithmic delay: 30 ms
  • ID: rook-s-16khz
  • File size: 8.88 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms
  • ID: rook-s-8khz
  • File size: 8.43 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms
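Since Rook ships in two sizes and three native sample rates, an application may want to pick a model ID from the input sample rate. The sketch below builds a lookup from the IDs listed above. The selection policy (the smallest native rate that covers the input rate, otherwise the highest available) is an assumption made for illustration, not a recommendation from the SDK, and `pick_rook_model` is a hypothetical helper.

```python
# Model IDs copied from the list above, keyed by (size, native sample rate in Hz).
ROOK_MODELS = {
    ("l", 48_000): "rook-l-48khz",
    ("l", 16_000): "rook-l-16khz",
    ("l", 8_000): "rook-l-8khz",
    ("s", 48_000): "rook-s-48khz",
    ("s", 16_000): "rook-s-16khz",
    ("s", 8_000): "rook-s-8khz",
}


def pick_rook_model(size: str, input_rate_hz: int) -> str:
    """Pick a Rook model ID for the given size and input sample rate.

    Hypothetical helper; the policy below is an assumption, not SDK guidance.
    """
    rates = sorted(rate for (s, rate) in ROOK_MODELS if s == size)
    if not rates:
        raise ValueError(f"unknown model size: {size!r}")
    # Prefer the smallest native rate that is at least the input rate.
    for rate in rates:
        if rate >= input_rate_hz:
            return ROOK_MODELS[(size, rate)]
    # Input rate exceeds every native rate: fall back to the highest one.
    return ROOK_MODELS[(size, rates[-1])]


# A 16 kHz input maps directly onto the 16 kHz model of the chosen size.
assert pick_rook_model("l", 16_000) == "rook-l-16khz"
```

Under this policy a 44.1 kHz input would select the 48 kHz model, so that no part of the input bandwidth lies above the model's native rate.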