This guide explains the difference between latency and delay in the ai-coustics SDK, and how to optimize performance for real-time applications.

Terminology

Term | Meaning | Example
Latency | Time to execute the model’s forward pass (CPU dependent) | “2 ms inference latency”
Delay | Time offset of the output audio relative to the input | “30 ms output delay”
These are independent values. A model with 2 ms latency on your CPU and 30 ms delay means: the computation completes in 2 ms, but the output audio is shifted by 30 ms relative to the input.

Understanding output_delay

The output_delay value (retrieved via aic_processor_get_output_delay) represents the total audio delay in frames. Convert to milliseconds:
delay_ms = (delay_frames / configured_sample_rate) * 1000
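As a minimal sketch of this conversion, with illustrative values (a 1440-frame delay reported at a configured sample rate of 48 kHz works out to 30 ms):
# Illustrative values; the frame count would come from aic_processor_get_output_delay (or your language's equivalent).
delay_frames = 1440
configured_sample_rate = 48_000
delay_ms = delay_frames / configured_sample_rate * 1000
print(delay_ms)  # 30.0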

Components of Output Delay

Component | When it applies | Avoidable?
Algorithmic delay | Always | No — inherent to model architecture
Adapter delay | When not using optimal_num_frames | Yes — use optimal frame size
Buffering delay | When allow_variable_frames = true | Yes — use fixed frame size
Sample rate has no effect on the output delay in ms. Neither the model’s native sample rate nor the processor’s configured sample rate changes the delay.

Algorithmic Delay

Our models typically have an algorithmic delay of 10 ms or 30 ms; see the model documentation for exact values per model. This delay is inherent to the model architecture and is therefore the minimum output delay of the processor when it is configured with ProcessorConfig::optimal.

Model Processing Window

Models operate on fixed-size audio chunks called the processing window. Our models typically use 10 ms windows; see the model documentation for exact values per model. The window size can be determined with the Model::get_optimal_num_frames function and depends on the sample rate you operate at.
window_frames = model.get_optimal_num_frames(sample_rate)
window_ms = (window_frames / sample_rate) * 1000
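As a sketch, assuming a loaded model object in the Python SDK (construction omitted), the window can be queried and converted like this:
# Query the processing window for the sample rate you operate at.
sample_rate = 48_000
window_frames = model.get_optimal_num_frames(sample_rate)  # e.g. 480 frames
window_ms = window_frames / sample_rate * 1000             # e.g. 10.0 ms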

Adapter Delay

The adapter bridges the gap between your incoming audio buffers and the model’s processing window. If the two frame sizes are configured to be the same, no adapter delay is introduced; otherwise the delay depends on the relationship between the two frame sizes, as illustrated in the sketch below.
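A minimal sketch of that rule, assuming a loaded model object; the host buffer size is illustrative:
# No adapter delay is added when your buffer size equals the model's processing window.
sample_rate = 48_000
host_buffer_frames = 480                                   # illustrative: 10 ms at 48 kHz
optimal_frames = model.get_optimal_num_frames(sample_rate)
if host_buffer_frames == optimal_frames:
    print("Frame sizes match: no adapter delay")
else:
    print("Frame sizes differ: adapter delay is introduced")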

Buffering Delay

If the adapter is configured with allow_variable_frames = true, each process call can use a different number of frames, as long as it does not exceed the configured num_frames. This is necessary because some audio hosts (e.g. Audacity) only report the maximum number of frames they will call process with. When variable frames are enabled, a buffering delay equal to the length of the model’s processing window is introduced.
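Since the extra delay equals one processing window, it can be estimated from the window size (sketch assuming a loaded model object):
# Buffering delay added when allow_variable_frames = true: one processing window.
sample_rate = 48_000
window_frames = model.get_optimal_num_frames(sample_rate)
buffering_delay_ms = window_frames / sample_rate * 1000    # e.g. 10.0 ms for a 10 ms window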

Real-Time Guarantees

For real-time processing to work, the inference latency (execution time) of the model has to be lower than the duration of the incoming audio buffers (your configured num_frames). This depends on your CPU and system load.
inference latency < audio buffer length
If you call the process function with 10 ms buffers, the function must complete in under 10 ms.
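A minimal sketch for checking this on your machine, assuming a processor object whose process call accepts a NumPy buffer (the exact Python signature may differ):
import time
import numpy as np
sample_rate = 48_000
num_frames = 480                                  # 10 ms buffer at 48 kHz
buffer = np.zeros(num_frames, dtype=np.float32)   # silent test buffer; shape is illustrative
start = time.perf_counter()
processor.process(buffer)                         # assumed call name
latency_ms = (time.perf_counter() - start) * 1000
buffer_ms = num_frames / sample_rate * 1000
print(latency_ms < buffer_ms)                     # must be True for real-time operation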

CPU Usage

CPU usage depends only on model complexity. More complex models (e.g., Sparrow L) require more CPU than simpler ones (e.g., Sparrow XS).
Sample rate has no effect on CPU usage. Neither the model’s native sample rate nor the processor’s configured sample rate changes performance.

Parallel Processing

For multiple audio streams (e.g., multiple speakers), create one processor per stream.
Processor | Threading | Parallel Streams
Processor | Same thread | ❌ Blocks — cannot run concurrently
AsyncProcessor | Separate thread | ✅ Runs in parallel
# Multiple speakers — use AsyncProcessor
processor_speaker1 = AsyncProcessor(model)
processor_speaker2 = AsyncProcessor(model)
# These process concurrently on different threads
With the synchronous Processor, each call blocks the main thread. You cannot process multiple streams simultaneously.
The AsyncProcessor is currently only available in the Python SDK. For other languages, you’ll need to manage threading manually.