ai-coustics offers complementary speech enhancement model families for SDK and API users. Use this guide to choose a model that fits your needs.

Overview

Quail

SDK, Real-time, Human-to-Machine

The Quail models are purpose-built for Voice AI Agents and human-to-machine interactions. Unlike standard noise suppression, Quail is tuned to improve the performance of downstream Speech-to-Text (STT) engines.

Quail Voice Focus is optimized for near-field voice interactions. It prioritizes speech that sounds close to the microphone and suppresses speech that sounds distant, along with background noise. This makes it ideal for single-user, close-talk use cases (e.g., headsets or handheld devices). Unlike diarization-based systems, Quail Voice Focus is signal-based rather than speaker-based: it enhances whichever speech signal is dominant in the foreground, so multiple near-field speakers can be enhanced without locking onto a single voice.
For optimal performance, the foreground speaker should typically fall within a level range of -35 to -10 LUFS (integrated) at the model input. We recommend tuning the input gain to satisfy this range. If the foreground speaker is too quiet, the model may classify it as background speech and suppress it.
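To sanity-check input levels against that range, a rough sketch like the following can help. Note this is illustrative only, not SDK code: it uses a plain RMS-based dBFS estimate as a coarse stand-in for integrated LUFS, which per ITU-R BS.1770 additionally applies K-weighting and gating.

```python
import math

RANGE_DB = (-35.0, -10.0)   # recommended input range from this guide
TARGET_DB = -20.0           # a point comfortably inside that range

def rms_dbfs(samples):
    """Coarse level estimate in dBFS from the RMS of float samples in [-1, 1].
    True integrated LUFS (ITU-R BS.1770) adds K-weighting and gating; for
    steady speech this RMS figure is only an approximation."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12))

def gain_to_target(samples, target_db=TARGET_DB):
    """Linear gain factor that moves the signal's RMS level to target_db."""
    return 10.0 ** ((target_db - rms_dbfs(samples)) / 20.0)

# Example: one second of a quiet 1 kHz tone at 16 kHz, far below the range.
quiet = [0.005 * math.sin(2 * math.pi * 1000 * n / 16000) for n in range(16000)]
gain = gain_to_target(quiet)
boosted = [s * gain for s in quiet]
```

Applying the computed gain before the model input keeps the foreground speaker from being misclassified as background speech.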
Quail, in contrast, is designed for far-field and multi-speaker environments. It does not suppress distant-sounding speech, making it better suited for speakerphone setups, meeting rooms, or situations with multiple participants spread across a space.
The Quail models are designed to enhance the performance of Voice AI Agents and STT systems, and may not always produce the most natural-sounding audio for human listeners. Some noise and reverberation is expected to remain in the output; this residual acoustic context can actually improve STT accuracy. If your primary goal is to improve the listening experience for humans, we recommend using the Sparrow models instead.
  • ID: quail-vf-2.0-l-16khz
  • File size: 35 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms

  • ID: quail-l-16khz
  • File size: 35 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms

  • ID: quail-l-8khz
  • File size: 33.4 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms

  • ID: quail-s-16khz
  • File size: 8.88 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms

  • ID: quail-s-8khz
  • File size: 8.43 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms
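The "num frames" figures above follow directly from the 10 ms window length: frames = sample rate × window length. A minimal sketch (illustrative only, not SDK code):

```python
def frames_per_window(sample_rate_hz, window_ms=10):
    """Number of audio frames (samples per channel) in one 10 ms model window."""
    return sample_rate_hz * window_ms // 1000

# These match the per-model figures listed above.
print(frames_per_window(8_000))   # 8 kHz models
print(frames_per_window(16_000))  # 16 kHz models
print(frames_per_window(48_000))  # 48 kHz models
```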

Sparrow

SDK, Real-time, Human-to-Human

The Sparrow models are specifically optimized for human-to-human interaction in real-time, latency-constrained systems (e.g., voice calls). They reduce background noise and reverberation while preserving speech naturalness and intelligibility for human perception.
  • ID: sparrow-l-48khz
  • File size: 35.1 MB
  • Window length: 10 ms
  • Native sample rate: 48 kHz
  • Native num frames: 480
  • Minimal algorithmic delay: 30 ms

  • ID: sparrow-l-16khz
  • File size: 35 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms

  • ID: sparrow-l-8khz
  • File size: 33.4 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms

  • ID: sparrow-s-48khz
  • File size: 8.96 MB
  • Window length: 10 ms
  • Native sample rate: 48 kHz
  • Native num frames: 480
  • Minimal algorithmic delay: 30 ms

  • ID: sparrow-s-16khz
  • File size: 8.88 MB
  • Window length: 10 ms
  • Native sample rate: 16 kHz
  • Native num frames: 160
  • Minimal algorithmic delay: 30 ms

  • ID: sparrow-s-8khz
  • File size: 8.43 MB
  • Window length: 10 ms
  • Native sample rate: 8 kHz
  • Native num frames: 80
  • Minimal algorithmic delay: 30 ms
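Variant choice usually comes down to matching the native sample rate and picking a size tier. A hypothetical selection helper (not part of the SDK; the model IDs are taken from the tables above, and the fallback to the 48 kHz model for other rates relies on the SDK's internal resampling described below):

```python
# Illustrative helper, not SDK code: pick a Sparrow model ID from the
# tables above by input sample rate and a rough size budget.
SPARROW_MODELS = {
    (48_000, "l"): "sparrow-l-48khz",
    (16_000, "l"): "sparrow-l-16khz",
    (8_000,  "l"): "sparrow-l-8khz",
    (48_000, "s"): "sparrow-s-48khz",
    (16_000, "s"): "sparrow-s-16khz",
    (8_000,  "s"): "sparrow-s-8khz",
}

def pick_sparrow(sample_rate_hz, small=False):
    """Return the Sparrow variant whose native rate matches the input.
    For any other rate, fall back to the 48 kHz model and let the SDK's
    internal resampling handle the conversion."""
    rate = sample_rate_hz if sample_rate_hz in (8_000, 16_000, 48_000) else 48_000
    return SPARROW_MODELS[(rate, "s" if small else "l")]
```

The small ("s") variants trade some quality headroom for roughly a quarter of the file size, which matters on constrained devices.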

Finch 2

API, File-based, Subtractive

Finch 2 is our updated voice isolation model, designed to remove undesired sounds (noise, reverb) while preserving the original speaker’s identity.
  • Best for: Strong background noise, heavy reverb, distant speakers, voice isolation needs
  • Strengths: Improved de-noising/de-reverb, fewer artifacts, more robust, faster and more energy-efficient
  • Parameter: enhancement_model: "FINCH" (maps to Finch 2)

Lark 2

API, File-based, Reconstructive

Lark 2 is our reconstructive model that goes beyond isolation to repair degraded audio (e.g., compression, band-limiting) and restore a full, modern studio sound while keeping the authentic voice.
  • Best for: Old/phone/Zoom recordings, clipped or compressed audio, bandwidth-limited sources
  • Strengths: Better denoising and reverb removal, robust across complex real-world distortions, anti-hallucination training
  • Parameter: enhancement_model: "LARK_V2" (selects Lark 2); "LARK" selects the legacy model
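Only the enhancement_model parameter values above come from this guide; a minimal sketch of assembling a request body with that field follows. Any endpoint URL, authentication, and other fields your integration needs must come from the API reference, so they are deliberately omitted here.

```python
import json

# Values documented in this guide; any other request fields are out of scope.
ALLOWED_MODELS = {"FINCH", "LARK_V2", "LARK"}  # FINCH maps to Finch 2; LARK is legacy

def build_enhancement_body(model):
    """Build a JSON request body selecting a file-based enhancement model."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown enhancement_model: {model!r}")
    return json.dumps({"enhancement_model": model})

body = build_enhancement_body("FINCH")
```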

Using models with non-native sample rates

Our models are trained for specific sample rates (8 kHz, 16 kHz, and 48 kHz). However, the ai-coustics SDK allows you to use any model with audio at non-native sample rates by internally resampling the input; the model always processes the audio at its native sample rate. You can choose any sample rate between 8 kHz and 192 kHz when calling the processor’s initialize function in the SDK, regardless of the model being used.

Higher-than-native sample rate (e.g. 48 kHz audio with a 16 kHz model): The SDK cuts away the frequency content above the model’s native Nyquist frequency (everything above half the native sample rate) before feeding the audio to the model, then upsamples the SDK output back to the original sample rate. The mixback (enhancement_level) stays at the higher sample rate, so the output contains the full frequency range of the original audio, but the model’s enhancement is only applied to frequencies below the model’s native Nyquist frequency.

Lower-than-native sample rate (e.g. 8 kHz audio with a 16 kHz model): Compute resources are effectively “wasted” processing higher frequencies where no signal is contained (the model is just processing zeros). Therefore, if a model matching your audio’s sample rate is available, we recommend using it to avoid unnecessary compute and ensure optimal performance.

In both cases, delay and CPU consumption are not affected by the input sample rate.
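To see which band actually receives enhancement at a non-native rate, compare the input's Nyquist frequency with the model's. An illustrative calculation (not SDK code):

```python
def enhanced_band_hz(input_rate_hz, model_native_rate_hz):
    """Upper edge of the frequency band the model can enhance: the lower of
    the input's and the model's Nyquist frequencies (half the sample rate).
    Content above this edge is preserved only via the mixback
    (enhancement_level), not enhanced."""
    return min(input_rate_hz, model_native_rate_hz) // 2

# 48 kHz audio through a 16 kHz model: enhancement applies up to 8 kHz;
# the 8-24 kHz band passes through via the mixback unenhanced.
print(enhanced_band_hz(48_000, 16_000))
```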