This guide explains the factors that influence audio latency (output delay) in the ai-coustics SDK and provides best practices for optimization.

Terminology

It is important to distinguish between Inference Latency and Audio Latency when optimizing your application:
| Term | Meaning | Example |
| --- | --- | --- |
| Inference Latency | Time it takes to execute the model's forward pass (CPU dependent). | "2 ms inference latency" |
| Audio Latency | Time offset of the output audio relative to the input audio. | "30 ms audio latency" |
These are independent values. For example, a model might compute its result in 2 ms (inference latency), yet the audio output is still shifted by 30 ms (audio latency) relative to the input because of the model's architecture. To avoid confusion, this guide refers to the audio latency as the output delay of the audio processor.

Understanding Output Delay

In real-time audio applications, output delay is the time offset between the input signal and the processed output. The primary function to determine the total end-to-end delay is aic_processor_get_output_delay.
  • Output: The function returns the delay in samples.
  • Conversion: To convert the delay to milliseconds, use the following formula:
$$\text{delay\_ms} = \frac{\text{delay\_samples}}{\text{configured\_sample\_rate}} \times 1000$$
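The conversion above can be sketched as a small helper. This is an illustrative function, not part of the SDK; only aic_processor_get_output_delay (which reports the delay in samples) comes from the SDK itself.

```c
#include <assert.h>

/* Convert an output delay reported in samples (as returned by
 * aic_processor_get_output_delay) to milliseconds at the
 * processor's configured sample rate. */
double delay_to_ms(unsigned delay_samples, unsigned sample_rate) {
    return (double)delay_samples / (double)sample_rate * 1000.0;
}
```

For instance, a 30 ms delay corresponds to 1440 samples at 48 kHz and 1323 samples at 44.1 kHz; the duration is the same, only the sample count differs.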

Components of Output Delay

The total returned value includes three potential sources of delay:
| Component | When it applies | Avoidable? |
| --- | --- | --- |
| Algorithmic delay | Always | No - inherent to the model architecture |
| Adapter delay | When not using optimal_num_frames | Yes - use the optimal number of frames |
| Buffering delay | When allow_variable_frames = true | Yes - use a fixed number of frames |
Sample Rate Impact: Sample rate has no effect on the output delay. Neither the model’s native sample rate nor the processor’s configured sample rate changes the delay duration.

Factors Affecting Delay

Several factors affect the overall delay of the SDK:

1. Algorithmic Delay (Model Choice)

This is the minimum output delay inherent to the model’s architecture. Currently, all of our models have an algorithmic delay of 30 ms. See the model documentation for exact values per model.

2. Adapter Delay (Frame Size)

The SDK uses an internal adapter to handle differences between your application’s buffer size and the model’s internal processing window (typically 10 ms).
  • Optimal Frame Size: Each model is trained to process a fixed-length buffer of audio at a given sample rate on each forward pass; this is the model's native window size and sample rate. The optimal frame size is the number of frames required to fill the model's native window at a given sample rate. You can get this value by calling aic_model_get_optimal_num_frames with your target sample rate.
  • Non-Optimal Frame Size: If you initialize the SDK with a frame size different from the optimal one, the adapter introduces buffering to match the model’s window, which increases delay.
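The relationship between sample rate and the native window can be sketched as below. This is an illustration of the concept only; in practice, query the SDK via aic_model_get_optimal_num_frames rather than computing the value yourself. The 10 ms window is the typical value mentioned above.

```c
#include <assert.h>

/* Number of frames needed to fill a model's native processing
 * window at a given sample rate (illustrative sketch; the SDK
 * exposes the real value via aic_model_get_optimal_num_frames). */
unsigned window_frames(unsigned sample_rate, unsigned window_ms) {
    return sample_rate * window_ms / 1000;
}
```

For a 10 ms window this yields 480 frames at 48 kHz and 160 frames at 16 kHz. If your application delivers, say, 256-frame buffers at 48 kHz, the adapter must accumulate input until a full 480-frame window is available, which is where the extra delay comes from.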

3. Buffering Delay (Variable Frames)

The aic_processor_initialize function has a boolean parameter allow_variable_frames.
  • false (Default): The SDK expects a fixed frame size for every process call. This is the lowest latency mode.
  • true: The SDK allows you to send smaller frame sizes than the one specified at initialization. This flexibility comes at the cost of increased delay due to additional buffering.
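Why variable frame sizes add delay can be shown with a minimal accumulator sketch (not SDK code): the processor consumes fixed windows, so smaller, variable-size chunks must be buffered until a full window is available.

```c
#include <assert.h>

#define WINDOW 480  /* e.g. a 10 ms window at 48 kHz */

static unsigned buffered = 0;  /* frames waiting for a full window */

/* Push n frames into the buffer; returns how many complete
 * windows are now ready for processing. Frames left over after
 * extracting full windows stay buffered, which is the source of
 * the extra buffering delay. */
unsigned push_frames(unsigned n) {
    buffered += n;
    unsigned windows = buffered / WINDOW;
    buffered %= WINDOW;
    return windows;
}
```

Pushing two 256-frame chunks yields no window after the first (256 frames buffered) and one window after the second (512 frames total, 32 left over). With a fixed frame size equal to the optimal one, every call lines up exactly with a window and no such residue accumulates.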