Understand and optimize the latency and performance of the SDK.
This guide explains the factors that influence the audio latency (output delay) and performance (CPU usage) of the ai-coustics SDK, and provides best practices for optimization.
It is important to distinguish between Inference Latency and Audio Latency when optimizing your application:
| Term | Meaning | Example |
| --- | --- | --- |
| Inference Latency | Time it takes to execute the model’s forward pass (CPU dependent). | “2 ms inference latency” |
| Audio Latency | Time offset of the output audio relative to the input audio. | “30 ms audio latency” |
These are independent values. For example, a model might compute its result in 2 ms (inference latency), yet the output audio is shifted by 30 ms (audio latency) relative to the input because of the model’s architecture. To avoid confusion, we refer to the audio latency as the output delay of the audio processor.
In real-time audio applications, the output delay is the time offset between the input signal and the processed output. The primary function for determining the total end-to-end delay is `aic_processor_get_output_delay`.
Output: The function returns the delay in samples.
Conversion: To convert the delay to milliseconds, use the following formula:

`delay_ms = delay_samples / sample_rate * 1000`
The total returned value includes three potential sources of delay:
| Component | When it applies | Avoidable? |
| --- | --- | --- |
| Algorithmic delay | Always | No — inherent to the model architecture |
| Adapter delay | When not using `optimal_num_frames` | Yes — use the optimal number of frames |
| Buffering delay | When `allow_variable_frames = true` | Yes — use a fixed number of frames |
Sample Rate Impact: Sample rate has no effect on the output delay. Neither the model’s native sample rate nor the processor’s configured sample rate changes the delay duration.
The SDK uses an internal adapter to handle differences between your application’s buffer size and the model’s internal processing window (typically 10 ms).
Optimal Frame Size: Each model is trained to process a fixed-length buffer of audio at a given sample rate on each forward pass; this is referred to as the model’s native window size and sample rate. The optimal frame size is the number of frames required to fill the model’s native window at a given sample rate. You can retrieve this value by calling `aic_model_get_optimal_num_frames` with your target sample rate.
Non-Optimal Frame Size: If you initialize the SDK with a frame size different from the optimal one, the adapter introduces buffering to match the model’s window, which increases delay.
The `aic_processor_initialize` function has a boolean parameter `allow_variable_frames`.

- `false` (default): The SDK expects a fixed frame size for every process call. This is the lowest-latency mode.
- `true`: The SDK allows you to send frame sizes smaller than the one specified at initialization. This flexibility comes at the cost of increased delay due to additional buffering.
For real-time processing to work without dropouts, the inference latency (execution time) must be lower than the duration of your audio buffer:

`inference latency ≤ audio buffer length`

For example, if you call the process function with 10 ms buffers, the function must complete execution in under 10 ms on your CPU. This depends entirely on your system load and hardware capabilities.
Model Complexity: More complex models like Sparrow L variants consume significantly more CPU than simpler models like Sparrow S or Sparrow XS.
Sample Rate Impact: The configured sample rate does not affect CPU usage when using the same model; for instance, processing at 48 kHz requires the same computational resources as 16 kHz.