Working with Real-World Audio
Commercial STT APIs vary significantly in how they handle noisy, real-world audio. Some are more robust to background noise but struggle with competing speakers. Others handle overlapping speech well but are sensitive to reverb. There is no single audio preprocessing strategy that is optimal across all STT providers.
Optimizing For Your STT Provider
The Quail models expose a tunable enhancement_level parameter, which allows you to optimize for your specific STT engine, deployment environment, and user experience requirements.
The enhancement_level parameter controls how aggressively the model suppresses noise and, with Quail Voice Focus, background and competing speech as well.
It accepts values from 0.0 to 1.0.
For voice agent applications where a single near-field speaker needs to be isolated, Quail Voice Focus is the recommended model.
Using Quail Voice Focus
The enhancement_level parameter behaves differently depending on the model. On Quail Voice Focus, it provides fine-grained control over foreground isolation.
How it works
The model uses an internal probabilistic confidence estimate for its foreground isolation decisions. The enhancement_level parameter modulates how the model acts on this confidence signal:
- Lower values bias the model toward preserving ambiguous speech. Foreground speech is always kept, but more background leakage may pass through.
- Higher values shift the model toward stricter decisions under uncertainty. Competing speech and echo are attenuated more aggressively, but there is an increased risk of suppressing low-energy foreground speech.
Recommended Settings
| Value | Behavior | When to use |
|---|---|---|
| 0.5 (default) | Conservative. Foreground speech is always preserved. | When minimizing any risk of speech deletion is the top priority. |
| 0.8 | Balanced. Optimal word error rate on challenging data. | Best starting point for most Voice AI deployments. Slightly higher chance of over-suppression in edge cases. |
| 1.0 | Aggressive. Maximum suppression of interfering speech. | When reducing insertions from background speakers is critical. Higher risk of suppressing quiet foreground speech. |
Tune Your Input Gain
Unlike diarization-based systems, Quail Voice Focus is signal-based rather than speaker-based. It enhances whichever speech signal is dominant in the foreground, allowing multiple near-field speakers to be enhanced without locking onto a single voice. For optimal performance, the foreground speaker should typically fall within a level range of -35 to -10 LUFS (integrated) at the model input. We recommend tuning the input gain to satisfy this range. If the foreground speaker is too quiet, the model may classify it as background speech and suppress it.
Using Quail
On Quail, the enhancement_level parameter acts as a simpler mix-back between the enhanced and original signals. For these models, an enhancement_level of 1.0 is recommended.
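As a rough illustration of what a mix-back between two signals means, the sketch below blends them linearly. This is an assumption made for illustration only: how Quail actually combines the signals is not specified here, and mix_back is a hypothetical helper, not part of any Quail API.

```python
from typing import List, Sequence


def mix_back(
    original: Sequence[float],
    enhanced: Sequence[float],
    enhancement_level: float,
) -> List[float]:
    """Linearly blend two equal-length signals (illustrative assumption).

    0.0 returns the original signal unchanged, 1.0 returns the fully
    enhanced signal, and values in between interpolate sample by sample.
    """
    if not 0.0 <= enhancement_level <= 1.0:
        raise ValueError("enhancement_level must be between 0.0 and 1.0")
    if len(original) != len(enhanced):
        raise ValueError("signals must have the same length")
    return [
        enhancement_level * e + (1.0 - enhancement_level) * o
        for o, e in zip(original, enhanced)
    ]
```

Under this reading, an enhancement_level of 1.0 passes only the enhanced signal downstream, which is consistent with the recommendation above.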
Best Practices
- Tune per STT provider. Different engines respond differently to the same audio. Run evaluations with your specific STT model to find the optimal enhancement_level.
- Use Quail Voice Focus to isolate the main speaker. It provides the best foreground isolation for headset and handheld use cases, eliminating interfering background speech and noise.
- Monitor both insertions and deletions. Increasing the enhancement level reduces false insertions from background speech but may increase deletions of quiet foreground speech. Find the right balance for your application.
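One way to track insertions and deletions separately is a word-level edit-distance alignment between a reference transcript and the STT output. The sketch below is a minimal version, assuming whitespace tokenization and lowercase normalization. The count_errors helper and the sample transcripts are hypothetical and not part of any Quail or STT provider API.

```python
from typing import Dict, Union


def count_errors(reference: str, hypothesis: str) -> Dict[str, Union[int, float]]:
    """Count word-level substitutions, insertions, and deletions.

    Uses a standard Levenshtein alignment over words. When several
    alignments have the same total cost, the split between error types
    may differ from other tools, but the total (and the WER) will not.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # dp[i][j] = (total_cost, substitutions, insertions, deletions)
    # for aligning ref[:i] with hyp[:j].
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, 0, i)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, j, 0)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost, subs, ins, dels = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (cost, subs, ins, dels)
            else:
                diagonal = (cost + 1, subs + 1, ins, dels)
            cost, subs, ins, dels = dp[i][j - 1]
            insertion = (cost + 1, subs, ins + 1, dels)
            cost, subs, ins, dels = dp[i - 1][j]
            deletion = (cost + 1, subs, ins, dels + 1)
            dp[i][j] = min(diagonal, insertion, deletion, key=lambda t: t[0])

    cost, subs, ins, dels = dp[len(ref)][len(hyp)]
    return {
        "substitutions": subs,
        "insertions": ins,
        "deletions": dels,
        "wer": cost / max(len(ref), 1),
    }


# Hypothetical transcripts of the same utterance at two enhancement levels.
reference = "turn on the lights"
outputs = {
    0.5: "turn on the kitchen lights",  # background speech leaked in
    1.0: "turn on lights",              # a quiet foreground word was dropped
}
for level, hypothesis in outputs.items():
    print(level, count_errors(reference, hypothesis))
```

Running this over a representative test set for each candidate enhancement_level shows whether a higher setting trades insertions for deletions with your STT engine.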