Tyto is a lightweight audio-insight model. It listens to the audio stream flowing from a human user into a Voice AI stack and tells you whether that audio is likely to cause failures in the downstream models that consume it, and which kind of degradation is responsible. Tyto runs on CPU, either in real time over live audio streams or offline over static audio files. Tyto answers two questions:
  1. Will my Voice AI model fail on this audio? - Tyto reports the Tyto Risk Score, a single number in [0, 1] predicting the likelihood of downstream model failure.
  2. If so, why does it fail? - Six audio quality Tyto Dimensions, each ranging in [0, 1], that classify the type of degradation present: noise, speaker reverb, speaker loudness, interfering speech, background media speech and packet loss.
Tyto is designed to predict failures in downstream models including voice activity detection (VAD), turn-taking, speech-to-text (STT) and speech-to-speech (S2S) models. The Tyto Risk Score is independent of any particular STT vendor or model: it is a property of the audio, not of the recognition system. Swapping your STT provider does not warp the metric, which makes it suitable as a stable, long-lived KPI. Tyto works even with end-to-end speech models.

The Tyto Risk Score

What it measures: The headline audio score: the likelihood that this audio will cause failures in downstream models (STT, VAD, turn-taking or speech-to-speech). Lower indicates less problematic audio. Range in [0, 1].
How to use:
  • Flag problematic user audio and trigger in-call interventions for Voice Agent use cases.
  • Rank calls / user interactions for review: Sort descending, review the top N.
  • As a Service Level Objective-style KPI: “Fraction of call-minutes with Tyto Risk Score ≥ 0.60”.
How to interpret: Tyto flags problematic audio according to the indicative bands below:
| Band | Range | Reading |
| --- | --- | --- |
| 🟢 Good | < 0.35 | No meaningful degradation; downstream models should be unaffected |
| 🟡 Warn | 0.35 - 0.60 | Noticeable degradation; expect elevated error rates |
| 🔴 Bad | > 0.60 | Severe degradation; downstream failure likely; flag the call / intervene |
These bands are sensible defaults, not concrete rules. Our recommendation:
  • For real-time usage, we recommend calibrating these thresholds against the distribution of your own data.
  • For offline or post-hoc triage, percentile ranking within your own traffic (“review the worst 1%”) is more useful than absolute thresholds.
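The two usage patterns above (absolute bands for real-time use, percentile ranking for offline triage) can be sketched as follows. This is a minimal illustration, not part of Tyto itself: the function names `risk_band` and `worst_fraction` are hypothetical, the default thresholds are the indicative bands from the table, and treating a score of exactly 0.60 as "warn" is an assumption based on the "> 0.60" boundary.

```python
import math
from typing import List, Sequence


def risk_band(score: float, warn: float = 0.35, bad: float = 0.60) -> str:
    """Map a Tyto Risk Score in [0, 1] to an indicative band.

    The defaults mirror the indicative bands; calibrate them on your own data.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score < warn:
        return "good"
    if score <= bad:  # assumption: exactly 0.60 still counts as "warn"
        return "warn"
    return "bad"


def worst_fraction(scores: Sequence[float], fraction: float = 0.01) -> List[int]:
    """Return indices of the worst `fraction` of calls, highest score first."""
    if not scores:
        return []
    n = max(1, math.ceil(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]
```

For example, `worst_fraction(call_scores, 0.01)` selects the "review the worst 1%" set without relying on any absolute threshold.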
Caveats:
  • Tyto was constructed using English speech; on other languages the Tyto Risk Score remains valid but thresholds and performance may differ.
  • Tyto works on 5-second, fixed-length audio snippets. Short or truncated utterances carry too little context for a meaningful score. Analyze them as part of a longer stream rather than as sub-second clips.

Tyto Dimensions

Tyto returns six measures of audio quality in addition to the Tyto Risk Score. The six Tyto dimensions are:
  1. Noise: Ambient or environmental non-speech noise behind the speaker, relative to the speaker’s level.
  2. Speaker Reverb: Speaker distance and room reverberance. Low scores indicate dry, near-field audio (close mic); high scores indicate reverberant, far-field audio.
  3. Speaker Loudness: The loudness level of the main speaker. Note: This is the one neutral dimension. It is a level meter, not a degradation score.
  4. Interfering Speech: Interference from additional live speakers audible in the audio.
  5. Background Media Speech: Interfering speech content from media devices such as TVs, radios or smartphones.
  6. Packet Loss: Audio dropouts or discontinuities in the audio stream or file such as from network packet loss, jitter, frame erasure or CPU overload.
The Tyto Dimensions can be used to identify and track audio quality issues independently of the headline Tyto Risk Score. For example, high packet loss may indicate network overload even in cases where Voice Agents do not necessarily fail. The qualitative dimensions are near-orthogonal. For example, a call can be free of background noise yet severely degraded on packet loss. All six dimensions return values ranging in [0, 1]. In all cases except loudness, higher values denote more problematic audio.

Noise (noise)

What it measures: Ambient or environmental noise behind the speaker, relative to the speaker’s level (effectively a content-aware SNR estimate). Low values indicate clean audio; high values mean the noise is strong compared to the speaker. Range [0, 1].
Sounds like: Clean / clear / noiseless ↔ noisy / hissing / buzzy / roaring. Think of call-center babble, traffic, kitchen appliances, wind, air conditioning or engine noise.
Likely root causes: The caller’s environment. This could constitute noise from cars, kitchens, drive-throughs, call centers or be street noise. Occasionally device-side noise may trigger this dimension: fan noise, electrical hum or aggressive mix gain.
How to use it:
  • Real-time: Use sustained high noise values to (a) warn the user their environment is a problem for a Voice Agent, (b) suggest switching from automatic to manual turn-taking or (c) disable barge-in so the noise can’t keep interrupting the agent in Voice Agent use cases.
  • Offline: Segment user audio quality by environment and root-cause low-conversion calls.

Speaker Reverb (speaker_reverb)

What it measures: Speaker distance and room reverberance. Low scores indicate dry, near-field audio (close mic); high scores indicate reverberant, far-field audio. Higher values indicate more echoey, problematic audio. Range [0, 1].
Sounds like: Direct / close / dry ↔ distant / echoey / hollow / “tunnel sound” / speakerphone-across-the-room.
Likely root causes: Speakerphone use, laptops/phones at arm’s length or further, hard-walled or acoustically reflective rooms (kitchens, bathrooms, warehouses), in-car hands-free systems mounted far from the speaker.
How to use it:
  • Real-time: Sustained high speaker reverb indicates a reverberant, far-field audio scene or a distant speaker, as may occur with smart speakers. Consider prompting the user (“you sound far away, could you move closer to the phone?”).
  • Offline: Correlate with STT error complaints. Reverb smears word boundaries and degrades STT in ways that are hard to hear but show up clearly in transcription accuracy. This can also induce failure of VAD endpointing (trailing reverb tails delay end-of-speech) interfering with turn-taking for Voice Agent applications.

Speaker Loudness (speaker_loudness)

What it measures: The loudness level of the main speaker. This is a neutral dimension: it is a level meter and does not proxy audio quality. Low values indicate quiet speech, high values indicate loud speech. Range [0, 1].
Sounds like: At the low end: quiet but audible speech. At the high end: strong, present, possibly over-driven audio.
Likely root causes (of consistently low values): Quiet talkers, mic far from the speaker, attenuation in the chain, failed or missing automatic gain control, vulnerable/elderly callers on poor handsets.
How to use it:
  • Consistently low values indicate a possible failure mode: very quiet speech might be missed by VAD or STT models, or may point to poorly calibrated gain control.
  • High values are usually fine. Don’t alert on them in isolation.

Interfering Speech (interfering_speech)

What it measures: Interference from additional live speakers audible in the audio. Lower indicates less problematic audio. Range [0, 1].
Sounds like: One clear voice ↔ multiple overlapping speakers: a colleague at the next desk or family members in the same room.
Likely root causes: Open-plan offices, call centers, shared homes, public places; side conversations during the call (“Ask your father where he put it…”).
How to use it:
  • Real-time: For Voice Agents, this dimension catches agents answering or transcribing the wrong person. High interfering speech values warn that VAD and turn-taking may trigger on background voices or that STTs may splice additional speakers’ words into the user’s utterance. Disable barge-in, or have the agent confirm before acting on unexpected input.
  • Offline: Filter transcripts for likely cross-talk or “cocktail party” corruption before using them for analytics or fine-tuning; flag calls where “the agent went off the rails” coincides with high interfering speech.

Background Media Speech (media_speech)

What it measures: Interfering speech content from media devices such as televisions, radios or smartphones playing speech content in the background. Higher values indicate more problematic audio. Range [0, 1].
Sounds like: A second “voice” with broadcast character: distorted, constant level, dialogue that doesn’t react to a Voice Agent call.
Likely root causes: TV, tablet or smartphone playing (the dominant case for consumer-facing agents), radio in cars and kitchens, speakerphone callers near a screen.
How to use it:
  • Real-time: Unlike most degradations, the user can fix this through an in-call intervention such as a request or prompt to the user (“I’m hearing a TV in the background. Could you turn it down please?”). Feeding the score into a live agent or LLM’s context enables interventions based on this dimension.
  • Offline: Media speech is a transcript contaminant (STTs transcribe speech content from background devices). Use it to explain “hallucinated” user turns and to inform analytics or exclude contaminated calls from datasets that you curate in-house.
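One way to feed the score into an LLM’s context, as the real-time bullet above suggests, is to turn a sustained high value into a short note for the agent. The sketch below is illustrative only: the function name `build_audio_context`, the 0.6 threshold and the wording of the note are all assumptions, and the input is assumed to be a mapping keyed by the variable names used in this document.

```python
from typing import Mapping, Optional


def build_audio_context(
    scores: Mapping[str, float], threshold: float = 0.6
) -> Optional[str]:
    """Return a short note for the LLM context when media speech is high.

    `scores` is assumed to hold (ideally smoothed) Tyto values keyed by
    variable name. Returns None when no intervention is warranted.
    """
    media = scores.get("media_speech", 0.0)
    if media < threshold:  # hypothetical threshold; calibrate on your data
        return None
    return (
        f"[audio-quality] media_speech={media:.2f}: background media speech "
        "is likely. Consider politely asking the user to turn down the TV "
        "or radio."
    )
```

Passing a smoothed value rather than a raw per-window score helps avoid prompting the user because of a single aberrant window.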

Packet Loss (packet_loss)

What it measures: Audio discontinuities in the stream from network packet loss, jitter, frame erasure or CPU overload in your Voice Agent stack. Lower indicates less problematic audio. Range [0, 1].
Sounds like: Steady / smooth / continuous ↔ shaky / choppy / uneven, syllables clipped out, robotic concealment artifacts, words arriving “frayed”.
Likely root causes: Caller-side network (mobile dead zones, congested Wi-Fi), carrier/PSTN impairments, jitter-buffer underruns, overloaded CPU or upstream media servers.
How to use it:
  • Real-time: When content is missing, Voice Agents should leverage this dimension to confirm user inputs, especially for names, (email) addresses and numbers, which are unrecoverable once dropped. High packet loss is also a signal to relax end-of-speech timeouts (gaps may be transport artifacts, not the user finishing).
  • Offline: Spikes correlated with geography, carrier, time-of-day or your own deploys point at network/capacity issues; per-call evidence of transport loss is a clear “the audio never arrived intact” artifact when attributing a failed call.
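The real-time advice above (confirm critical inputs and relax end-of-speech timeouts under high packet loss) could be wired up roughly as follows. Every number here is a hypothetical placeholder (the 700 ms base timeout, the 800 ms maximum extension, the 0.35 and 0.6 thresholds), and the linear scaling is just one plausible choice, not a documented Tyto behavior.

```python
def end_of_speech_timeout_ms(
    packet_loss: float,
    base_ms: int = 700,
    max_extra_ms: int = 800,
    threshold: float = 0.35,
) -> int:
    """Lengthen the end-of-speech timeout as packet loss rises.

    Below `threshold` the base timeout is used unchanged. Above it, the
    timeout grows linearly up to base_ms + max_extra_ms at packet_loss = 1.
    """
    packet_loss = min(1.0, max(0.0, packet_loss))  # clamp to [0, 1]
    if packet_loss < threshold:
        return base_ms
    severity = (packet_loss - threshold) / (1.0 - threshold)
    return int(round(base_ms + severity * max_extra_ms))


def should_confirm_input(packet_loss: float, threshold: float = 0.6) -> bool:
    """Ask the user to confirm names, addresses and numbers when loss is high."""
    return packet_loss >= threshold
```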

Qualitative Dimensions Quick Reference

| Output | Variable | High Value Means | Primary Downstream Risk |
| --- | --- | --- | --- |
| Tyto Risk Score | risk_score | Downstream failure likely | VAD, turn-taking, STT, S2S |
| Noise | noise | Loud ambient noise relative to the speaker (low SNR, content-aware) | VAD false triggers, STT errors |
| Speaker Reverb | speaker_reverb | Distant, reverberant or echoey, far-field speaker | STT errors, word-boundary smearing |
| Speaker Loudness | speaker_loudness | Loud main speaker | Neutral dimension for informative use; extreme values do not imply problematic audio |
| Interfering Speech | interfering_speech | One or more competing live voices in the background | VAD false triggers, STT transcribing the wrong person, S2S interruption |
| Background Media Speech | media_speech | Speech from a TV / radio / phone in the background | VAD false triggers, STT inserting media content, S2S interruption |
| Packet Loss | packet_loss | Dropouts / discontinuities in the stream or file | Missed or mangled words, STT errors |

Real-time vs Offline Usage

Tyto can monitor live audio quality or be run offline on static audio files. Below is a starter guide to using Tyto in each of these usage settings:
| | Real-time | Offline |
| --- | --- | --- |
| What it is | The agent reacts to audio conditions during a human-agent call | Score every call, then triage, rank and analyze |
| Typical actions | Warn the user (“Disruptive Background Noise”); switch to manual turn-taking; disable barge-in; prompt the user to reduce TV volume; pass scores to the LLM as context | Rank worst calls for review; attribute failures to audio vs. agent logic; track quality by geography/device/campaign; alert on regressions |
| What matters | Per-window scores, latency, smoothing | Per-call aggregates (mean, p95, fraction-of-call-degraded) |
| Replaces | Hand-tuned RMS or energy thresholds | Manual call listening, sampled re-transcription with an expensive offline STT, LLM-as-judge on a subset of calls |
Because Tyto is small and runs on-premise (no API calls, no audio leaving your infrastructure), it is cheap enough to run on 100% of calls. This makes it possible to follow a “score everything, then look at the worst” workflow over your traffic or static audio data.

Technical Reference

Input requirements

  • Sample rate: 16 kHz internally; resample before scoring.
  • Window length: Tyto operates on fixed 5.0 s windows of audio, returning one set of values per window.
  • Score the user channel: Tyto is intended for the audio your Voice AI model hears: the human-to-model or human-to-agent direction. Though Tyto is not a TTS-output quality monitor, it may still be useful on model output in some applications. For example, Tyto’s packet_loss dimension may be used to flag “stutter” in TTS systems arising under CPU overload.
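The input requirements above imply that each window holds 16,000 × 5 = 80,000 samples. The following sketch shows one way to cut an already-resampled mono signal into such windows. The function name `split_into_windows` is hypothetical, and how to treat a trailing partial window (dropped by default here, optionally zero-padded) is an assumption rather than documented Tyto behavior.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 16_000
WINDOW_SECONDS = 5.0
WINDOW_SAMPLES = int(SAMPLE_RATE_HZ * WINDOW_SECONDS)  # 80,000 samples


def split_into_windows(
    samples: Sequence[float], pad_last: bool = False
) -> List[List[float]]:
    """Split mono 16 kHz audio into consecutive 5-second windows.

    A trailing partial window is dropped unless `pad_last` is True, in
    which case it is zero-padded to full length (an assumption).
    """
    windows: List[List[float]] = []
    for start in range(0, len(samples), WINDOW_SAMPLES):
        chunk = list(samples[start:start + WINDOW_SAMPLES])
        if len(chunk) < WINDOW_SAMPLES:
            if not pad_last:
                break
            chunk.extend([0.0] * (WINDOW_SAMPLES - len(chunk)))
        windows.append(chunk)
    return windows
```

With 12 seconds of audio this yields two full windows, or three when `pad_last=True`.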

Smoothing for Real-time Use Cases

Raw per-window scores are intentionally responsive and therefore jumpy. For real-time use cases where an intervention is triggered e.g. switching a Voice Agent to manual turn-taking, we recommend acting on a smoothed value before triggering an intervention to prevent a single aberrant window resulting in a high rate of interventions. Alternatively, your application can require N consecutive windows to be flagged as problematic by Tyto before acting. We recommend applying an exponential moving average (EMA) to smooth the values Tyto returns. This can be implemented as follows:

$$\text{EMA}(s_t) = \alpha \cdot s_t + (1 - \alpha) \cdot \text{EMA}(s_{t-1})$$

We recommend setting $\alpha = 0.3$.
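Both smoothing strategies described in this section, the EMA and the N-consecutive-windows rule, can be sketched in a few lines. The class names are hypothetical; seeding the EMA with the first observed score, and the example threshold of 0.6 with N = 3, are assumptions rather than documented defaults.

```python
from typing import Optional


class EmaSmoother:
    """Exponential moving average over per-window scores."""

    def __init__(self, alpha: float = 0.3) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must lie in (0, 1]")
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, score: float) -> float:
        if self.value is None:
            # Assumption: seed the average with the first score.
            self.value = score
        else:
            self.value = self.alpha * score + (1.0 - self.alpha) * self.value
        return self.value


class ConsecutiveTrigger:
    """Fire only after `n` consecutive windows at or above `threshold`."""

    def __init__(self, threshold: float = 0.6, n: int = 3) -> None:
        self.threshold = threshold
        self.n = n
        self.run = 0  # length of the current run of flagged windows

    def update(self, score: float) -> bool:
        self.run = self.run + 1 if score >= self.threshold else 0
        return self.run >= self.n
```

A lower alpha reacts more slowly but suppresses more of the window-to-window jitter. The consecutive-window rule trades a fixed delay of N windows (N × 5 seconds) for a simpler, stateless-looking condition.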

Aggregating Over Streams, Calls or Static Audio Files

To aggregate per-window scores over long audio to collect summary features, we suggest the following approaches:
  • Mean: Overall call quality; simplest default for dashboards.
  • p95 / max: Worst moments; good for triage ranking.
  • Fraction of windows ≥ 0.35 (noticeable degradation): “how much of this call was bad”; the most robust single triage feature, since a 30-second noise burst in a 10-minute call barely moves the mean.
  • Per-dimension argmax: Which dimension was worst. This is the label to group by (“show me non-converted calls caused by noise”).
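The aggregates listed above can be computed together in one pass over a call’s windows. In this sketch, each window is assumed to be a mapping keyed by the variable names used in this document. The function name `summarize_call`, the nearest-rank p95, taking the argmax over per-dimension means, and excluding the neutral speaker_loudness dimension from the argmax are all assumptions.

```python
import math
from typing import Dict, Mapping, Sequence

# speaker_loudness is excluded because it is a neutral level meter.
DEGRADATION_DIMENSIONS = (
    "noise",
    "speaker_reverb",
    "interfering_speech",
    "media_speech",
    "packet_loss",
)


def summarize_call(
    windows: Sequence[Mapping[str, float]], threshold: float = 0.35
) -> Dict[str, object]:
    """Aggregate per-window Tyto outputs into call-level features."""
    if not windows:
        raise ValueError("need at least one window")
    risk = sorted(w["risk_score"] for w in windows)
    n = len(risk)
    p95_index = max(0, math.ceil(0.95 * n) - 1)  # nearest-rank percentile
    dim_means = {
        d: sum(w.get(d, 0.0) for w in windows) / n
        for d in DEGRADATION_DIMENSIONS
    }
    return {
        "mean": sum(risk) / n,
        "p95": risk[p95_index],
        "max": risk[-1],
        "fraction_degraded": sum(r >= threshold for r in risk) / n,
        "worst_dimension": max(dim_means, key=dim_means.get),
    }
```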

Using Tyto to Flag Problematic Calls or Audio Files

Once each call is summarized via a call-level score, use these to triage your call or audio traffic:
  • Flag problematic calls by thresholding a call-level aggregate of risk_score, such as the mean or the fraction of windows ≥ 0.35.
  • Identify why using the per-dimension argmax: which dimension drove the score. This is the label to group by across calls.
  • Surface patterns by grouping flagged calls by their worst dimension - this separates systemic issues from one-off events (“show me all flagged calls driven by background media” to flag customers calling from near a TV or radio).
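The three triage steps above (flag, identify why, group) might look like this in practice. The input is assumed to be a mapping from a call identifier to a call-level summary holding a `fraction_degraded` value and a `worst_dimension` label. Those key names, the function name `triage` and the 0.25 flagging threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Mapping


def triage(
    summaries: Mapping[str, Mapping[str, object]],
    min_fraction_degraded: float = 0.25,
) -> Dict[str, List[str]]:
    """Flag calls and group the flagged ones by their worst dimension.

    Within each group, calls are ordered worst first.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for call_id, summary in summaries.items():
        if summary["fraction_degraded"] >= min_fraction_degraded:
            groups[str(summary["worst_dimension"])].append(call_id)
    return {
        dim: sorted(
            ids,
            key=lambda c: summaries[c]["fraction_degraded"],
            reverse=True,
        )
        for dim, ids in groups.items()
    }
```

A dimension that collects many flagged calls points at a systemic issue, whereas a group containing a single call is more likely a one-off event.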