Voice Focus for Voice AI Systems

For voice agent applications where a single near-field speaker needs to be isolated, Quail Voice Focus is the recommended model for improving downstream Speech-to-Text (STT) accuracy. On Quail Voice Focus, the enhancement_level parameter provides fine-grained control over foreground isolation. It accepts values from 0.0 to 1.0.

How it works

The model uses an internal probabilistic confidence estimate for its foreground isolation decisions. The enhancement_level parameter modulates how the model acts on this confidence signal:

Lower values bias the model toward preserving ambiguous speech. Foreground speech is always kept, but more background leakage may pass through.
Higher values shift the model toward stricter decisions under uncertainty. Competing speech and echo are attenuated more aggressively, but there is an increased risk of suppressing low-energy foreground speech.

Recommended Settings

Value	Behavior	When to use
`0.5` (default)	Conservative. Foreground speech is always preserved.	When minimizing any risk of speech deletion is the top priority.
`0.8`	Balanced. Optimal word error rate on challenging data.	Best starting point for most Voice AI deployments. Slightly higher chance of over-suppression in edge cases.
`1.0`	Aggressive. Maximum suppression of interfering speech.	When reducing insertions from background speakers is critical. Higher risk of suppressing quiet foreground speech.

The default value of 0.5 is the safest place to start. Increase toward 0.8, the recommended setting for most Voice AI deployments, if you observe STT errors caused by competing speech or background noise. The ideal setting depends on your STT provider, language, environment, and pre-processing pipeline.
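
As a minimal sketch of how the three settings in the table might be handled in application code, the helper below maps a preset name or a raw number to a validated value in the documented 0.0 to 1.0 range. The names `PRESETS` and `resolve_enhancement_level` are hypothetical and not part of any SDK; how the resulting value is passed to the model depends on your integration and is not shown here.

```python
# Hypothetical preset names for the three values in the table above.
PRESETS = {
    "conservative": 0.5,  # default: prioritizes keeping foreground speech
    "balanced": 0.8,      # suggested for most Voice AI deployments
    "aggressive": 1.0,    # maximum suppression of interfering speech
}


def resolve_enhancement_level(value):
    """Return a float in [0.0, 1.0] from a preset name or a number."""
    if isinstance(value, str):
        if value not in PRESETS:
            raise ValueError(f"unknown preset: {value!r}")
        return PRESETS[value]
    level = float(value)
    if not 0.0 <= level <= 1.0:
        raise ValueError(f"enhancement_level must be in [0.0, 1.0], got {level}")
    return level
```

Rejecting out-of-range values early keeps a tuning sweep from silently running with a setting the parameter does not accept.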

Tune Your Input Gain

Unlike diarization-based systems, Quail Voice Focus is signal-based rather than speaker-based. It enhances whichever speech signal is dominant in the foreground, so multiple near-field speakers can be enhanced without the model locking onto a single voice. Because the decision depends on the signal itself, input level matters: for optimal performance, the foreground speaker should typically fall within -35 to -10 LUFS (integrated) at the model input. We recommend tuning the input gain so that the foreground speaker stays within this range. If the foreground speaker is too quiet, the model may classify it as background speech and suppress it.
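
The gain adjustment can be sketched as simple arithmetic. The example assumes you have already measured the integrated loudness of the foreground speaker with a loudness meter of your choice, and it relies on the fact that applying a gain of G dB shifts the measured loudness by roughly G LU. The function names are hypothetical; only the -35 and -10 LUFS bounds come from the text above.

```python
# Level range for the foreground speaker at the model input (from the text above).
MIN_LUFS = -35.0
MAX_LUFS = -10.0


def gain_db_for_range(measured_lufs, low=MIN_LUFS, high=MAX_LUFS):
    """Return the smallest gain in dB that moves the loudness into [low, high]."""
    if measured_lufs < low:
        return low - measured_lufs   # positive: boost a quiet input
    if measured_lufs > high:
        return high - measured_lufs  # negative: attenuate a loud input
    return 0.0                       # already inside the range


def db_to_linear(gain_db):
    """Convert a gain in dB to a linear amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)
```

For example, a speaker measured at -45 LUFS needs at least 10 dB of gain to reach the lower bound. In practice you may prefer to aim a few dB inside the range rather than exactly at its edge, and to watch for clipping when boosting.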

Best Practices

Use Quail Voice Focus to isolate the main speaker. It provides the best foreground isolation for headset and handheld use cases, suppressing interfering background speech and noise.
Tune per STT provider. Different engines respond differently to the same audio. Run evaluations with your specific STT model to find the optimal enhancement_level.
Monitor both insertions and deletions. Increasing the enhancement level reduces false insertions from background speech but may increase deletions of quiet foreground speech. Find the right balance for your application.
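
To monitor insertions and deletions separately, you can align each STT transcript against a reference transcript at the word level. The sketch below uses a standard Levenshtein alignment with a backtrace; the function name `edit_counts` is hypothetical, and text normalization (casing, punctuation) is left out and assumed to be handled beforehand.

```python
def edit_counts(reference, hypothesis):
    """Count word-level substitutions, deletions, and insertions."""
    ref = reference.split()
    hyp = hypothesis.split()
    rows, cols = len(ref) + 1, len(hyp) + 1

    # cost[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the end to classify each edit on one optimal path.
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1]:
                counts["substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["deletions"] += 1   # reference word missing from the hypothesis
            i -= 1
        else:
            counts["insertions"] += 1  # extra word in the hypothesis
            j -= 1
    return counts
```

Running this over the same test set at several enhancement_level values shows the trade-off directly: insertions should fall as the level increases, while a rise in deletions signals that quiet foreground speech is being suppressed.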
