Overview
Speech Enhancement for Voice AI
Best for: Improving Voice AI Agents and Speech-to-Text (STT) accuracy. Includes Quail Voice Focus.
- Model Family: Quail
- Platform: SDK
- Real-time
Perceptual Speech Enhancement
Best for: Removing background noise and reverb in real-time communication use cases.
- Model Family: Sparrow
- Platform: SDK
- Real-time
Voice Isolation & Clarity
Best for: Removing background noise and reverb while preserving the speaker’s identity.
- Model: Finch 2
- Platform: API
- File-based
Audio Repair & Restoration
Best for: Repairing degraded audio (phone calls, old recordings) to studio quality.
- Model: Lark 2
- Platform: API
- File-based
Quail
SDK, Real-time, Human-to-Machine
The Quail models are purpose-built for Voice AI Agents and human-to-machine interactions. Unlike standard noise suppression, Quail is tuned to improve the performance of downstream Speech-to-Text (STT) engines.
Quail Voice Focus is optimized for near-field voice interactions. It prioritizes speech that sounds close to the microphone and suppresses speech that sounds distant, along with background noise. This makes it ideal for single-user, close-talk use cases (e.g., headsets or handheld devices).
Unlike diarization-based systems, Quail Voice Focus is signal-based rather than speaker-based. It enhances whichever speech signal is dominant in the foreground, allowing multiple near-field speakers to be enhanced without locking onto a single voice.
For optimal performance, the foreground speaker should typically fall within a level range of -35 to -10 LUFS (integrated) at the model input.
We recommend tuning the input gain to satisfy this range. If the foreground speaker is too quiet, the model may classify it as background speech and suppress it.
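The gain adjustment can be sketched as follows. This is a minimal illustration, not part of the SDK: it assumes you have already measured the integrated loudness of the foreground speech with a loudness meter of your choice, and that a linear gain of X dB shifts the measured level by approximately X dB. The function names are hypothetical.

```python
# Recommended foreground speech range from the text above (integrated LUFS).
MIN_LUFS = -35.0
MAX_LUFS = -10.0


def gain_to_recommended_range(measured_lufs: float) -> float:
    """Return the gain in dB that moves a measured level into the range.

    Returns 0.0 when the level is already inside [MIN_LUFS, MAX_LUFS].
    Hypothetical helper for illustration only.
    """
    if measured_lufs < MIN_LUFS:
        return MIN_LUFS - measured_lufs  # positive value: boost the input
    if measured_lufs > MAX_LUFS:
        return MAX_LUFS - measured_lufs  # negative value: attenuate the input
    return 0.0


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)


if __name__ == "__main__":
    gain_db = gain_to_recommended_range(-42.0)
    print(f"Apply {gain_db:+.1f} dB (x{db_to_linear(gain_db):.2f} amplitude)")
```

In practice you may prefer to aim for a level comfortably inside the range rather than exactly at its edge, so that quieter passages are not treated as background speech.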
Quail Voice Focus 2.0 L (16 kHz)
- ID: quail-vf-2.0-l-16khz
- File size: 35 MB
- Window length: 10 ms
- Optimal sample rate: 16 kHz
- Optimal num frames: 160
- Minimal algorithmic delay: 30 ms
Quail L (16 kHz)
- ID: quail-l-16khz
- File size: 35 MB
- Window length: 10 ms
- Optimal sample rate: 16 kHz
- Optimal num frames: 160
- Minimal algorithmic delay: 30 ms
Quail L (8 kHz)
- ID: quail-l-8khz
- File size: 33.4 MB
- Window length: 10 ms
- Native sample rate: 8 kHz
- Native num frames: 80
- Minimal algorithmic delay: 30 ms
Quail S (16 kHz)
- ID: quail-s-16khz
- File size: 8.88 MB
- Window length: 10 ms
- Native sample rate: 16 kHz
- Native num frames: 160
- Minimal algorithmic delay: 30 ms
Quail S (8 kHz)
- ID: quail-s-8khz
- File size: 8.43 MB
- Window length: 10 ms
- Native sample rate: 8 kHz
- Native num frames: 80
- Minimal algorithmic delay: 30 ms
Sparrow
SDK, Real-time, Human-to-Human
The Sparrow models are specifically optimized for human-to-human interaction in real-time constrained systems (e.g., voice calls). They reduce background noise and reverberation while preserving speech naturalness and intelligibility for human perception.
Sparrow L (48 kHz)
- ID: sparrow-l-48khz
- File size: 35.1 MB
- Window length: 10 ms
- Native sample rate: 48 kHz
- Native num frames: 480
- Minimal algorithmic delay: 30 ms
Sparrow L (16 kHz)
- ID: sparrow-l-16khz
- File size: 35 MB
- Window length: 10 ms
- Native sample rate: 16 kHz
- Native num frames: 160
- Minimal algorithmic delay: 30 ms
Sparrow L (8 kHz)
- ID: sparrow-l-8khz
- File size: 33.4 MB
- Window length: 10 ms
- Native sample rate: 8 kHz
- Native num frames: 80
- Minimal algorithmic delay: 30 ms
Sparrow S (48 kHz)
- ID: sparrow-s-48khz
- File size: 8.96 MB
- Window length: 10 ms
- Native sample rate: 48 kHz
- Native num frames: 480
- Minimal algorithmic delay: 30 ms
Sparrow S (16 kHz)
- ID: sparrow-s-16khz
- File size: 8.88 MB
- Window length: 10 ms
- Native sample rate: 16 kHz
- Native num frames: 160
- Minimal algorithmic delay: 30 ms
Sparrow S (8 kHz)
- ID: sparrow-s-8khz
- File size: 8.43 MB
- Window length: 10 ms
- Native sample rate: 8 kHz
- Native num frames: 80
- Minimal algorithmic delay: 30 ms
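The frame counts listed for the Quail and Sparrow models are consistent with one 10 ms window of audio at each model's sample rate. The short sketch below shows that arithmetic. The helper name is hypothetical and is not an SDK function.

```python
def frames_per_window(sample_rate_hz: int, window_ms: int = 10) -> int:
    """Number of audio frames in one processing window.

    Hypothetical helper: frames = sample rate (Hz) * window length (s).
    """
    return sample_rate_hz * window_ms // 1000


if __name__ == "__main__":
    # Sample rates that appear in the model lists above.
    for rate in (8_000, 16_000, 48_000):
        print(f"{rate} Hz -> {frames_per_window(rate)} frames per 10 ms window")
```

This can be useful when sizing audio buffers so that each buffer holds exactly one window for the chosen model.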
Finch 2
API, File-based, Subtractive
Finch 2 is our updated voice isolation model designed to remove undesired sounds (noise, reverb) while preserving the original speaker’s identity.
- Best for: Strong background noise, heavy reverb, distant speakers, voice isolation needs
- Strengths: Improved de-noising/de-reverb, fewer artifacts, more robust, faster and more energy-efficient
- Parameter: enhancement_model: "FINCH" (maps to Finch 2)
Lark 2
API, File-based, Reconstructive
Lark 2 is our reconstructive model that goes beyond isolation to repair degraded audio (e.g., compression, band-limiting) and restore a full, modern studio sound while keeping the authentic voice.
- Best for: Old/phone/Zoom recordings, clipped or compressed audio, bandwidth-limited sources
- Strengths: Better denoising and reverb removal, robust across complex real-world distortions, anti-hallucination training
- Parameter: enhancement_model: "LARK_V2" (Lark 2). LARK is legacy.
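As a small illustration of the enhancement_model values listed for Finch 2 and Lark 2, the sketch below builds the corresponding field as JSON. Only the parameter name and its two values come from this page. The helper and mapping names are hypothetical, and how the field is sent (endpoint, authentication, other request fields) is not covered here.

```python
import json

# Model name -> enhancement_model value, as listed above.
# "LARK" is described as legacy and is intentionally left out.
ENHANCEMENT_MODEL_VALUES = {
    "Finch 2": "FINCH",
    "Lark 2": "LARK_V2",
}


def enhancement_model_field(model_name: str) -> dict:
    """Return the enhancement_model field for a model name.

    Hypothetical helper for illustration; raises KeyError for unknown names.
    """
    return {"enhancement_model": ENHANCEMENT_MODEL_VALUES[model_name]}


if __name__ == "__main__":
    print(json.dumps(enhancement_model_field("Lark 2")))
```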
Using models with non-native sample rates
Our models are trained for specific sample rates (8 kHz, 16 kHz, and 48 kHz). However, the ai-coustics SDK allows you to use any model with audio at non-native sample rates by internally resampling the input. The model always processes the audio at its native sample rate. You can choose any sample rate between 8 kHz and 192 kHz when calling the processor’s initialize function in the SDK, regardless of the model being used.
Higher-than-native sample rate (e.g. 48 kHz audio with a 16 kHz model):
In this case, the SDK removes the frequency content above the model’s native Nyquist frequency (half the model’s native sample rate) before feeding the audio to the model.
The SDK then upsamples the output back to the original sample rate. The mixback (enhancement_level) stays at the higher sample rate, so the output will contain the full frequency range of the original audio, but the model’s enhancement will only be applied to frequencies below the model’s native Nyquist frequency.
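To make the Nyquist relationship concrete, the sketch below computes the upper frequency up to which enhancement can apply for a given combination of input and model sample rates. It is an illustration of the arithmetic only, with hypothetical names, and does not call or reproduce the SDK's resampler. The 8 kHz to 192 kHz bounds are the supported input range stated above.

```python
# Supported input sample-rate range stated in the text above.
MIN_RATE_HZ = 8_000
MAX_RATE_HZ = 192_000


def enhanced_band_limit_hz(input_rate_hz: int, model_rate_hz: int) -> float:
    """Upper frequency (Hz) of the band in which enhancement can apply.

    Hypothetical helper: the limit is the Nyquist frequency of whichever
    rate is lower, the input rate or the model's native rate.
    """
    if not MIN_RATE_HZ <= input_rate_hz <= MAX_RATE_HZ:
        raise ValueError("input sample rate must be between 8 kHz and 192 kHz")
    return min(input_rate_hz, model_rate_hz) / 2


if __name__ == "__main__":
    # 48 kHz audio with a 16 kHz model: the output spans up to 24 kHz,
    # but enhancement is limited to the band below 8 kHz.
    print(enhanced_band_limit_hz(48_000, 16_000))
```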
Lower-than-native sample rate (e.g. 8 kHz audio with a 16 kHz model):
When the input audio sample rate is lower than the model’s native sample rate, compute resources are effectively “wasted” on higher frequencies that contain no signal (the model is just processing zeros).
Therefore, if there is a model available matching your audio’s sample rate, we recommend using that model to avoid unnecessary compute and ensure optimal performance.
In both cases, delay and CPU consumption are not affected by the input sample rate.
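One way to apply the recommendation above is to pick the native rate that matches the input, and otherwise fall back to a heuristic. The sketch below uses the native rates listed on this page. The fallback rule (smallest native rate that still covers the input, else the highest available) is an assumption made for illustration, not an official selection policy, and the names are hypothetical.

```python
# Native sample rates per model family, as listed on this page.
NATIVE_RATES_HZ = {
    "quail": (8_000, 16_000),
    "sparrow": (8_000, 16_000, 48_000),
}


def pick_native_rate(family: str, input_rate_hz: int) -> int:
    """Choose a model native rate for a given input sample rate.

    Assumed heuristic: prefer an exact match; otherwise take the smallest
    native rate at or above the input rate, so no input bandwidth is cut;
    if none exists, take the highest native rate in the family.
    """
    rates = sorted(NATIVE_RATES_HZ[family])
    for rate in rates:
        if rate >= input_rate_hz:
            return rate
    return rates[-1]


if __name__ == "__main__":
    print(pick_native_rate("sparrow", 44_100))  # no exact match available
    print(pick_native_rate("quail", 8_000))     # exact match
```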