Models - ai-coustics Docs

ai-coustics offers complementary speech enhancement model families for SDK and API users. Use this guide to choose a model that fits your needs.

Overview

Speech Enhancement for Voice AI

Best for: Improving Voice AI Agents and Speech-to-Text (STT) accuracy. Includes Quail Voice Focus.

Model Family: Quail
Platform: SDK
Real-time

Perceptual Speech Enhancement

Best for: Removing background noise and reverb in real-time communication use cases.

Model Family: Sparrow
Platform: SDK
Real-time

Voice Isolation & Clarity

Best for: Removing background noise and reverb while preserving the speaker’s identity.

Model: Finch 2
Platform: API
File-based

Audio Repair & Restoration

Best for: Repairing degraded audio (phone calls, old recordings) to studio quality.

Model: Lark 2
Platform: API
File-based

Quail

SDK, Real-time, Human-to-Machine

The Quail models are purpose-built for Voice AI Agents and human-to-machine interactions. Unlike standard noise suppression, Quail is tuned to improve the performance of downstream Speech-to-Text (STT) engines.

Quail Voice Focus is optimized for near-field voice interactions. It prioritizes speech that sounds close to the microphone and suppresses speech that sounds distant, along with background noise. This makes it ideal for single-user, close-talk use cases (e.g., headsets or handheld devices).

Unlike diarization-based systems, Quail Voice Focus is signal-based rather than speaker-based. It enhances whichever speech signal is dominant in the foreground, allowing multiple near-field speakers to be enhanced without locking onto a single voice.

For optimal performance, the foreground speaker should typically fall within a level range of -35 to -10 LUFS (integrated) at the model input. We recommend tuning the input gain to satisfy this range. If the foreground speaker is too quiet, the model may classify it as background speech and suppress it.
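As a rough illustration of the gain tuning described above, the sketch below computes how much gain would move a measured integrated loudness to the nearest edge of the recommended range. It assumes the loudness has already been measured with an external loudness meter; the function name and the choice to target the nearest range bound are illustrative and not part of the SDK.

```python
from typing import Tuple

LUFS_MIN = -35.0  # lower bound of the recommended input range
LUFS_MAX = -10.0  # upper bound of the recommended input range


def gain_to_recommended_range(measured_lufs: float) -> Tuple[float, float]:
    """Return (gain_db, linear_factor) that brings the measured loudness
    to the nearest bound of the recommended range (hypothetical helper).

    A linear gain of G dB shifts integrated loudness by G LU, so the
    required gain is simply the distance to the nearest bound.
    """
    if measured_lufs < LUFS_MIN:
        gain_db = LUFS_MIN - measured_lufs   # too quiet: boost
    elif measured_lufs > LUFS_MAX:
        gain_db = LUFS_MAX - measured_lufs   # too loud: attenuate
    else:
        gain_db = 0.0                        # already within range
    linear_factor = 10.0 ** (gain_db / 20.0)  # multiply samples by this
    return gain_db, linear_factor
```

In practice you would likely aim somewhat inside the range rather than exactly at its edge, to leave headroom for level variation.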

Quail, in contrast, is designed for far-field and multi-speaker environments. It does not suppress distant-sounding speech, making it better suited for speakerphone setups, meeting rooms, or situations with multiple participants spread across a space.

The Quail models are designed to enhance the performance of Voice AI Agents and STT systems, and may not always produce the most natural-sounding audio for human listeners. It is expected that some noise and reverberation may remain in the output, as these can actually help improve STT accuracy by providing additional acoustic context. If your primary goal is to improve the listening experience for humans, we recommend using the Sparrow models instead.

Quail Voice Focus 2.0 L (16 kHz)

ID: quail-vf-2.0-l-16khz
File size: 35 MB
Window length: 10 ms
Optimal sample rate: 16 kHz
Optimal num frames: 160
Minimal algorithmic delay: 30 ms

Quail L (16 kHz)

ID: quail-l-16khz
File size: 35 MB
Window length: 10 ms
Optimal sample rate: 16 kHz
Optimal num frames: 160
Minimal algorithmic delay: 30 ms

Quail L (8 kHz)

ID: quail-l-8khz
File size: 33.4 MB
Window length: 10 ms
Native sample rate: 8 kHz
Native num frames: 80
Minimal algorithmic delay: 30 ms

Quail S (16 kHz)

ID: quail-s-16khz
File size: 8.88 MB
Window length: 10 ms
Native sample rate: 16 kHz
Native num frames: 160
Minimal algorithmic delay: 30 ms

Quail S (8 kHz)

ID: quail-s-8khz
File size: 8.43 MB
Window length: 10 ms
Native sample rate: 8 kHz
Native num frames: 80
Minimal algorithmic delay: 30 ms
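The frame counts in the listings above follow from the window length and the sample rate: frames per window = sample rate × window length. A small sketch of that arithmetic (the function name is illustrative, not an SDK call):

```python
def frames_per_window(sample_rate_hz: int, window_ms: int = 10) -> int:
    """Number of audio frames in one processing window.

    With the 10 ms window listed for these models, 8 kHz audio gives
    80 frames and 16 kHz audio gives 160 frames per window.
    """
    return sample_rate_hz * window_ms // 1000


if __name__ == "__main__":
    for rate in (8_000, 16_000):
        print(rate, frames_per_window(rate))
```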

Model files available for download here.

Learn more about the models’ sample rate.

Sparrow

SDK, Real-time, Human-to-Human

The Sparrow models are specifically optimized for human-to-human interaction in real-time constrained systems (e.g. voice calls). They reduce background noise and reverberation while preserving speech naturalness and intelligibility for human perception.

Sparrow L (48 kHz)

ID: sparrow-l-48khz
File size: 35.1 MB
Window length: 10 ms
Native sample rate: 48 kHz
Native num frames: 480
Minimal algorithmic delay: 30 ms

Sparrow L (16 kHz)

ID: sparrow-l-16khz
File size: 35 MB
Window length: 10 ms
Native sample rate: 16 kHz
Native num frames: 160
Minimal algorithmic delay: 30 ms

Sparrow L (8 kHz)

ID: sparrow-l-8khz
File size: 33.4 MB
Window length: 10 ms
Native sample rate: 8 kHz
Native num frames: 80
Minimal algorithmic delay: 30 ms

Sparrow S (48 kHz)

ID: sparrow-s-48khz
File size: 8.96 MB
Window length: 10 ms
Native sample rate: 48 kHz
Native num frames: 480
Minimal algorithmic delay: 30 ms

Sparrow S (16 kHz)

ID: sparrow-s-16khz
File size: 8.88 MB
Window length: 10 ms
Native sample rate: 16 kHz
Native num frames: 160
Minimal algorithmic delay: 30 ms

Sparrow S (8 kHz)

ID: sparrow-s-8khz
File size: 8.43 MB
Window length: 10 ms
Native sample rate: 8 kHz
Native num frames: 80
Minimal algorithmic delay: 30 ms

Model files available for download here.

Learn more about the models’ sample rate.

Finch 2

API, File-based, Subtractive

Finch 2 is our updated voice isolation model designed to remove undesired sounds (noise, reverb) while preserving the original speaker’s identity.

Best for: Strong background noise, heavy reverb, distant speakers, voice isolation needs
Strengths: Improved de-noising/de-reverb, fewer artifacts, more robust, faster and more energy-efficient
Parameter: enhancement_model: "FINCH" (maps to Finch 2)

Lark 2

API, File-based, Reconstructive

Lark 2 is our reconstructive model that goes beyond isolation to repair degraded audio (e.g., compression, band-limiting) and restore a full, modern studio sound while keeping the authentic voice.

Best for: Old/phone/Zoom recordings, clipped or compressed audio, bandwidth-limited sources
Strengths: Better denoising and reverb removal, robust across complex real-world distortions, anti-hallucination training
Parameter: enhancement_model: "LARK_V2" (Lark 2). LARK is legacy.
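A minimal sketch of how the enhancement_model values listed for Finch 2 and Lark 2 might be selected in client code. Only the parameter name and the values "FINCH" and "LARK_V2" come from this page; the helper name, the dictionary keys, and the idea of building a parameter dictionary are assumptions, and the actual request format should be taken from the API reference.

```python
# Values taken from this page; the keys are hypothetical labels.
ENHANCEMENT_MODELS = {
    "finch-2": "FINCH",    # voice isolation (subtractive)
    "lark-2": "LARK_V2",   # repair and restoration (reconstructive)
}


def enhancement_params(model: str) -> dict:
    """Build the enhancement_model parameter for the chosen model.

    Hypothetical helper: it rejects unknown names, including the
    legacy "LARK" value, so that typos fail early.
    """
    try:
        return {"enhancement_model": ENHANCEMENT_MODELS[model]}
    except KeyError:
        known = ", ".join(sorted(ENHANCEMENT_MODELS))
        raise ValueError(f"unknown model {model!r}; expected one of: {known}") from None
```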

Using models with non-native sample rates

Our models are trained for specific sample rates (8 kHz, 16 kHz, and 48 kHz). However, the ai-coustics SDK allows you to use any model with audio at non-native sample rates by internally resampling the input. The model always processes the audio at its native sample rate. You can choose any sample rate between 8 kHz and 192 kHz when calling the processor’s initialize function in the SDK, regardless of the model being used.

Higher-than-native sample rate (e.g. 48 kHz audio with a 16 kHz model): The SDK removes the frequency content above the model’s native Nyquist frequency (half the model’s native sample rate) before feeding the audio to the model. The enhanced audio is then upsampled back to the original sample rate. The mixback (enhancement_level) stays at the higher sample rate, so the output contains the full frequency range of the original audio, but the model’s enhancement is applied only to frequencies below the model’s native Nyquist frequency.

Lower-than-native sample rate (e.g. 8 kHz audio with a 16 kHz model): When the input sample rate is lower than the model’s native sample rate, compute resources are effectively “wasted” on higher frequencies that contain no signal (the model is just processing zeros). Therefore, if a model matching your audio’s sample rate is available, we recommend using that model to avoid unnecessary compute and ensure optimal performance.

In both cases, delay and CPU consumption are not affected by the input sample rate.
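The resampling behavior described above can be summarized with a little arithmetic. The sketch below is not SDK code: the function names are illustrative, and it only encodes what this section states, namely the accepted 8 kHz to 192 kHz range, the Nyquist limit of the enhanced band, and the recommendation to prefer a model whose native rate matches the audio.

```python
from typing import Optional, Sequence

MIN_RATE_HZ = 8_000     # lowest sample rate accepted, per this section
MAX_RATE_HZ = 192_000   # highest sample rate accepted, per this section
NATIVE_RATES_HZ = (8_000, 16_000, 48_000)  # rates the models are trained for


def enhanced_bandwidth_hz(input_rate_hz: int, model_rate_hz: int) -> float:
    """Upper edge of the band in which enhancement can have an effect.

    The band is limited by the lower of the two Nyquist frequencies:
    above the model's Nyquist the model is not applied, and above the
    input's Nyquist the audio contains no signal.
    """
    if not MIN_RATE_HZ <= input_rate_hz <= MAX_RATE_HZ:
        raise ValueError(f"sample rate {input_rate_hz} Hz is outside 8-192 kHz")
    return min(input_rate_hz, model_rate_hz) / 2


def matching_native_rate(
    input_rate_hz: int, native_rates: Sequence[int] = NATIVE_RATES_HZ
) -> Optional[int]:
    """Return the native rate equal to the input rate, or None if no
    model rate matches (in which case the SDK resamples internally)."""
    return input_rate_hz if input_rate_hz in native_rates else None
```

For example, 48 kHz audio through a 16 kHz model is enhanced up to 8 kHz, while 8 kHz audio through the same model only carries signal up to 4 kHz.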
