Speech-to-Text (STT) engines are a critical component of any Voice AI system. Their accuracy directly impacts agent performance, user experience, and downstream task completion. However, real-world audio is rarely clean: background noise, room reverb, and competing speakers all degrade STT accuracy. The ai-coustics Quail models are purpose-built to address this. By enhancing audio before it reaches your STT engine, Quail reduces word error rate (WER) and minimizes hallucinated transcriptions caused by interfering speech.

Working with Real-World Audio

Commercial STT APIs vary significantly in how they handle noisy, real-world audio. Some are more robust to background noise but struggle with competing speakers. Others handle overlapping speech well but are sensitive to reverb. There is no single audio preprocessing strategy that is optimal across all STT providers.

Tuning the enhancement_level

The Quail models expose a tunable enhancement_level parameter that lets you optimize for your specific STT engine, deployment environment, and user experience requirements. It accepts values from 0.0 to 1.0. A value of 0.0 passes the original audio through unchanged, while a value of 1.0 outputs the fully clean signal predicted by the model. On Quail, the enhancement_level parameter acts as a simple mix-back between the enhanced and original signals, so intermediate values blend the two. For these models, an enhancement_level of 1.0 is recommended. For voice agent applications where a single near-field speaker needs to be isolated, use Quail Voice Focus instead. See Improving ASR with Voice Focus.
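To build intuition for the mix-back, the sketch below blends two signals sample by sample. It assumes a plain linear interpolation, which this page does not confirm as the exact formula used by the models. The function name mix_back and the list-of-floats audio representation are hypothetical and not part of the ai-coustics SDK.

```python
def mix_back(original, enhanced, enhancement_level):
    """Blend original and enhanced samples.

    Assumption: a linear mix. The actual Quail mixing rule may differ.
    """
    if not 0.0 <= enhancement_level <= 1.0:
        raise ValueError("enhancement_level must be between 0.0 and 1.0")
    if len(original) != len(enhanced):
        raise ValueError("signals must have the same number of samples")
    level = enhancement_level
    # level = 0.0 returns the original samples, level = 1.0 the enhanced ones.
    return [(1.0 - level) * o + level * e for o, e in zip(original, enhanced)]


if __name__ == "__main__":
    noisy = [1.0, 2.0, -1.0]   # illustrative sample values, not real audio
    clean = [0.0, 0.0, 0.0]
    for level in (0.0, 0.5, 1.0):
        print(level, mix_back(noisy, clean, level))
```

Under this assumption, a value such as 0.5 keeps half of the original signal, including half of its noise, which is one way to see why 1.0 is the recommended setting when the goal is the cleanest input for the STT engine.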

Best Practices

  • Tune per STT provider. Different engines respond differently to the same audio. Run evaluations with your specific STT model to find the optimal enhancement_level.
  • Monitor both insertions and deletions. Increasing the enhancement level reduces false insertions from background speech but may increase deletions of quiet foreground speech. Find the right balance for your application.
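One way to follow both practices is to transcribe the same evaluation clip at several enhancement levels and break the word errors down by type. The sketch below is a minimal, dependency-free word-level alignment that counts substitutions, insertions, and deletions. The helper names count_errors and summarize_sweep are hypothetical, and the transcripts in the usage section are invented for illustration, not measured results from any STT engine.

```python
def count_errors(reference, hypothesis):
    """Count word-level substitutions, insertions, and deletions (Levenshtein alignment)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] holds (total_cost, substitutions, insertions, deletions)
    # for aligning the first i reference words with the first j hypothesis words.
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, 0, i)          # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, j, 0)          # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # words match, no extra cost
                continue
            c, s, n, d = dp[i - 1][j - 1]
            substitution = (c + 1, s + 1, n, d)
            c, s, n, d = dp[i][j - 1]
            insertion = (c + 1, s, n + 1, d)
            c, s, n, d = dp[i - 1][j]
            deletion = (c + 1, s, n, d + 1)
            # Tuples compare by total cost first.
            dp[i][j] = min(substitution, insertion, deletion)
    cost, subs, ins, dels = dp[-1][-1]
    wer = cost / len(ref) if ref else 0.0
    return {"substitutions": subs, "insertions": ins, "deletions": dels, "wer": wer}


def summarize_sweep(reference, hypotheses_by_level):
    """Map each enhancement_level to its error breakdown."""
    return {level: count_errors(reference, hyp)
            for level, hyp in sorted(hypotheses_by_level.items())}


if __name__ == "__main__":
    reference = "book a table for two"
    # Invented transcripts, one per enhancement_level, for illustration only.
    hypotheses = {
        0.0: "book a table for two please yes",
        1.0: "book table two",
    }
    for level, stats in summarize_sweep(reference, hypotheses).items():
        print(level, stats)
```

In a real evaluation, the hypotheses would come from running your STT provider on audio processed at each level, aggregated over a representative test set rather than a single utterance.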