This guide explains the factors that influence audio latency (output delay) in the ai-coustics SDK and provides best practices for optimization.
## Terminology
It is important to distinguish between Inference Latency and Audio Latency when optimizing your application:
| Term | Meaning | Example |
|---|---|---|
| Inference Latency | Time it takes to execute the model’s forward pass (CPU dependent). | “2 ms inference latency” |
| Audio Latency | Time offset of the output audio relative to the input audio. | “30 ms audio latency” |
These are independent values. For example, a model might compute a result in 2 ms (inference latency), yet the output audio is still shifted by 30 ms (audio latency) relative to the input because of the model’s architecture. To avoid confusion, we refer to the audio latency as the output delay of the audio processor.
## Understanding Output Delay
In real-time audio applications, output delay is the time offset between the input signal and the processed output.
The primary function to determine the total end-to-end delay is aic_processor_get_output_delay.
- Output: The function returns the delay in samples.
- Conversion: To convert the delay to milliseconds, use the following formula:
delay_ms = (delay_frames / configured_sample_rate) × 1000
## Components of Output Delay
The total returned value includes three potential sources of delay:
| Component | When it applies | Avoidable? |
|---|---|---|
| Algorithmic delay | Always | No - inherent to model architecture |
| Adapter delay | When not using optimal_num_frames | Yes - use optimal number of frames |
| Buffering delay | When allow_variable_frames = true | Yes - use fixed number of frames |
**Sample Rate Impact:** Sample rate has no effect on the output delay. Neither the model’s native sample rate nor the processor’s configured sample rate changes the delay duration.
## Factors Affecting Delay
Several factors affect the overall delay of the SDK:
### 1. Algorithmic Delay (Model Choice)
This is the minimum output delay inherent to the model’s architecture. Currently, all of our models have an algorithmic delay of 30 ms.
See the model documentation for exact values per model.
### 2. Adapter Delay (Frame Size)
The SDK uses an internal adapter to handle differences between your application’s buffer size and the model’s internal processing window (typically 10 ms).
- Optimal Frame Size: Each model is trained to process a fixed-length buffer of audio at a given sample rate on each forward pass; this is referred to as the model’s native window size and sample rate. The optimal frame size is the number of frames required to fill the model’s native window at a given sample rate. You can get this value by calling aic_model_get_optimal_num_frames with your target sample rate.
- Non-Optimal Frame Size: If you initialize the SDK with a frame size different from the optimal one, the adapter introduces buffering to match the model’s window, which increases delay.
### 3. Buffering Delay (Variable Frames)
The aic_processor_initialize function has a boolean parameter allow_variable_frames.
- false (Default): The SDK expects a fixed frame size for every process call. This is the lowest-latency mode.
- true: The SDK allows you to send smaller frame sizes than the one specified at initialization. This flexibility comes at the cost of increased delay due to additional buffering.