This guide explains the factors that influence the audio latency (output delay) and performance (CPU usage) of the ai-coustics SDK, and provides best practices for optimization.

Terminology

It is important to distinguish between Inference Latency and Audio Latency when optimizing your application:
| Term | Meaning | Example |
|---|---|---|
| Inference Latency | Time it takes to execute the model's forward pass (CPU dependent). | "2 ms inference latency" |
| Audio Latency | Time offset of the output audio relative to the input audio. | "30 ms audio latency" |
These are independent values. For example, a model might compute its result in 2 ms (inference latency), yet the audio output will still be shifted by 30 ms (audio latency) relative to the input due to the model's architecture. To avoid confusion, we refer to the audio latency as the output delay of the audio processor.

Understanding Output Delay

In real-time audio applications, output delay is the time offset between the input signal and the processed output. The primary function to determine the total end-to-end delay is aic_processor_get_output_delay.
  • Output: The function returns the delay in frames (samples per channel).
  • Conversion: To convert the delay to milliseconds, use the following formula:
\text{delay\_ms} = \left( \frac{\text{delay\_frames}}{\text{configured\_sample\_rate}} \right) \times 1000

Components of Output Delay

The total returned value includes three potential sources of delay:
| Component | When it applies | Avoidable? |
|---|---|---|
| Algorithmic delay | Always | No — inherent to model architecture |
| Adapter delay | When not using `optimal_num_frames` | Yes — use optimal number of frames |
| Buffering delay | When `allow_variable_frames = true` | Yes — use fixed number of frames |
Sample Rate Impact: Sample rate has no effect on the output delay. Neither the model’s native sample rate nor the processor’s configured sample rate changes the delay duration.

Factors Affecting Delay

Several factors affect the overall delay of the SDK:

1. Algorithmic Delay (Model Choice)

This is the minimum output delay inherent to the model’s architecture. Different models have different algorithmic delays:
| Model | Algorithmic Delay |
|---|---|
| Sparrow L, Sparrow S, Quail | 30 ms |
| Sparrow XS, Sparrow XXS | 10 ms |
See the model documentation for exact values per model.

2. Adapter Delay (Frame Size)

The SDK uses an internal adapter to handle differences between your application’s buffer size and the model’s internal processing window (typically 10 ms).
  • Optimal Frame Size: Each model was trained to process a fixed-length buffer of audio at a given sample rate on each forward pass; this is the model's native window size and sample rate. The optimal frame size is the number of frames required to fill the model's native window at a given sample rate. You can get this value by calling aic_model_get_optimal_num_frames with your target sample rate.
  • Non-Optimal Frame Size: If you initialize the SDK with a frame size different from the optimal one, the adapter introduces buffering to match the model’s window, which increases delay.

3. Buffering Delay (Variable Frames)

The aic_processor_initialize function has a boolean parameter allow_variable_frames.
  • false (Default): The SDK expects a fixed frame size on every process call. This is the lowest-latency mode.
  • true: The SDK allows you to send smaller frame sizes than the one specified at initialization. This flexibility comes at the cost of increased delay due to additional buffering.

Real-Time Guarantees

For real-time processing to work without dropouts, the inference latency (execution time) must not exceed the duration of your audio buffer:

\text{inference latency} \leq \text{audio buffer length}

For example, if you call the process function with 10 ms buffers, the function must complete execution in under 10 ms on your CPU. This depends entirely on your system load and hardware capabilities.

Performance (CPU Usage)

CPU usage is affected by the following factors:
  • Model Complexity: More complex models like Sparrow L variants consume significantly more CPU than simpler models like Sparrow S or Sparrow XS.

Sample Rate Impact: The configured sample rate does not affect CPU usage when using the same model; for instance, processing at 48 kHz requires the same computational resources as processing at 16 kHz.

Parallel Processing

If you need to process multiple audio streams (e.g., multiple speakers) simultaneously, you must create one processor instance per stream.
| Processor | Threading | Parallel Streams |
|---|---|---|
| Processor | Same thread | ❌ Blocks — cannot run concurrently |
| AsyncProcessor | Separate thread | ✅ Runs in parallel |
With the synchronous Processor, each call blocks the main thread. You cannot process multiple streams simultaneously on a single thread.