Speech-to-Text (STT) engines are a critical component of any Voice AI system. Their accuracy directly impacts agent performance, user experience, and downstream task completion. However, real-world audio is rarely clean: background noise, room reverb, and competing speakers all degrade STT accuracy. The ai-coustics Quail models are purpose-built to address this. By enhancing audio before it reaches your STT engine, Quail reduces word error rate (WER) and minimizes hallucinated transcriptions caused by interfering speech.

Working with Real-World Audio

Commercial STT APIs vary significantly in how they handle noisy, real-world audio. Some are more robust to background noise but struggle with competing speakers. Others handle overlapping speech well but are sensitive to reverb. There is no single audio preprocessing strategy that is optimal across all STT providers.

Optimizing For Your STT Provider

The Quail models expose a tunable enhancement_level parameter which allows you to optimize for your specific STT engine, deployment environment, and user experience requirements. The enhancement_level parameter controls how aggressively the model suppresses noise, and with Quail Voice Focus, also background and competing speech. It accepts values from 0.0 to 1.0. For voice agent applications where a single near-field speaker needs to be isolated, Quail Voice Focus is the recommended model.
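As a concrete illustration, a request configuration carrying this parameter can be validated before being sent. Note that the field names, model identifier, and helper below are hypothetical placeholders, not the actual ai-coustics request schema; consult the API reference for the real shape.

```python
def build_quail_config(model: str = "quail-voice-focus",
                       enhancement_level: float = 0.5) -> dict:
    """Build an enhancement-session configuration.

    NOTE: field names and the model identifier are illustrative
    placeholders, not the real ai-coustics API schema.
    """
    # enhancement_level is defined on the closed interval [0.0, 1.0].
    if not 0.0 <= enhancement_level <= 1.0:
        raise ValueError("enhancement_level must be within [0.0, 1.0]")
    return {"model": model, "enhancement_level": enhancement_level}
```

Validating the range client-side surfaces configuration mistakes early instead of at request time.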

Using Quail Voice Focus

The enhancement_level parameter behaves differently depending on the model. On Quail Voice Focus, it provides fine-grained control over foreground isolation.
The model uses an internal probabilistic confidence estimate for its foreground isolation decisions. The enhancement_level parameter modulates how the model acts on this confidence signal:
  • Lower values bias the model toward preserving ambiguous speech. Foreground speech is always kept, but more background leakage may pass through.
  • Higher values shift the model toward stricter decisions under uncertainty. Competing speech and echo are attenuated more aggressively, but there is an increased risk of suppressing low-energy foreground speech.
| Value | Behavior | When to use |
| --- | --- | --- |
| 0.5 (default) | Conservative. Foreground speech is always preserved. | When minimizing any risk of speech deletion is the top priority. |
| 0.8 | Balanced. Optimal word error rate on challenging data. | Best starting point for most Voice AI deployments. Slightly higher chance of over-suppression in edge cases. |
| 1.0 | Aggressive. Maximum suppression of interfering speech. | When reducing insertions from background speakers is critical. Higher risk of suppressing quiet foreground speech. |
Start with the default value of 0.5 and increase toward 0.8 if you observe STT errors caused by competing speech or background noise. The ideal setting depends on your STT provider, language, environment, and pre-processing pipeline.
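Such a tuning pass can be scripted as a simple sweep: run the same evaluation set through your enhancement + STT pipeline at each candidate level and compare mean WER. The `transcribe` callable below is a stand-in for your own pipeline; the word-level WER implementation is a minimal sketch of the standard edit-distance formulation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def sweep_enhancement_levels(transcribe, eval_set, levels=(0.5, 0.8, 1.0)):
    """Return {level: mean WER}. `transcribe(audio, level)` is your
    enhancement + STT pipeline; eval_set is (audio, reference) pairs."""
    results = {}
    for level in levels:
        wers = [word_error_rate(ref, transcribe(audio, level))
                for audio, ref in eval_set]
        results[level] = sum(wers) / len(wers)
    return results
```

Pick the level with the lowest mean WER, then sanity-check it against your error-type priorities (insertions vs. deletions).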

Tune Your Input Gain

Unlike diarization-based systems, Quail Voice Focus is signal-based rather than speaker-based. It enhances whichever speech signal is dominant in the foreground, allowing multiple near-field speakers to be enhanced without locking onto a single voice. For optimal performance, the foreground speaker should typically fall within a level range of -35 to -10 LUFS (integrated) at the model input. We recommend tuning the input gain to satisfy this range. If the foreground speaker is too quiet, the model may classify it as background speech and suppress it.
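The gain-tuning step can be sketched as follows. True integrated LUFS requires K-weighting and gating per ITU-R BS.1770 (use a proper loudness meter in production); this sketch uses plain RMS in dBFS as a rough proxy just to show the adjustment logic, and the target/window constants mirror the recommendation above.

```python
import math

LOW_DB, HIGH_DB = -35.0, -10.0  # recommended input window
TARGET_DB = -23.0               # aim near the middle of the window

def rms_dbfs(samples):
    """RMS level in dBFS for float samples in [-1.0, 1.0].
    Plain RMS is only a rough proxy for integrated LUFS,
    which requires K-weighting and gating (ITU-R BS.1770)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -120.0 if rms == 0 else 20.0 * math.log10(rms)

def gain_to_target(samples, target_db=TARGET_DB):
    """Linear gain that moves the measured level to target_db,
    or 1.0 if the signal already sits inside the window."""
    level = rms_dbfs(samples)
    if LOW_DB <= level <= HIGH_DB:
        return 1.0
    return 10.0 ** ((target_db - level) / 20.0)
```

Applying the returned gain before enhancement keeps quiet foreground speakers out of the range where they risk being classified as background.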

Using Quail

On Quail, the enhancement_level parameter acts as a simpler mix-back between the enhanced and original signal. For these models, an enhancement_level of 1.0 is recommended.
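Conceptually, this mix-back is a linear blend. The snippet below only illustrates that behavior; the actual blend happens inside the Quail model, not in client code, and its internal implementation may differ.

```python
def mix_back(original, enhanced, enhancement_level=1.0):
    """Linear blend between original and enhanced samples:
    1.0 -> fully enhanced, 0.0 -> fully original.
    Illustrative only; Quail performs this internally."""
    a = enhancement_level
    return [a * e + (1.0 - a) * o for o, e in zip(original, enhanced)]
```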

Best Practices

  • Tune per STT provider. Different engines respond differently to the same audio. Run evaluations with your specific STT model to find the optimal enhancement_level.
  • Use Quail Voice Focus to isolate the main speaker. It provides the best foreground isolation for headset and handheld use cases, eliminating interfering background speech and noise.
  • Monitor both insertions and deletions. Increasing the enhancement level reduces false insertions from background speech but may increase deletions of quiet foreground speech. Find the right balance for your application.
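The insertion/deletion balance above can be monitored directly by aligning reference and hypothesis transcripts. A minimal word-level edit-distance alignment is sketched below; production evaluations typically use an established STT evaluation toolkit instead.

```python
def error_counts(reference: str, hypothesis: str):
    """Return (substitutions, insertions, deletions) between a
    reference transcript and an STT hypothesis, via word-level
    edit distance with per-type bookkeeping."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (subs, ins, dels) for ref[:i] vs hyp[:j]
    dp = [[(0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (0, 0, i)   # hypothesis empty: all deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (0, j, 0)   # reference empty: all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no cost
                continue
            s, ins, de = dp[i - 1][j - 1]
            sub = (s + 1, ins, de)
            s, ins, de = dp[i][j - 1]
            insert = (s, ins + 1, de)
            s, ins, de = dp[i - 1][j]
            delete = (s, ins, de + 1)
            dp[i][j] = min(sub, insert, delete, key=sum)
    return dp[len(ref)][len(hyp)]
```

Tracking these counts separately across enhancement levels shows whether a higher setting is trading background-speech insertions for foreground deletions.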