vllm.multimodal.audio
Classes:
- AudioResampler – Resample audio data to a target sample rate.
- AudioSpec – Specification for target audio format.
- ChannelReduction – Method to reduce multi-channel audio to target channels.
Functions:
- find_split_point – Find the best point to split audio by looking for silence or low amplitude.
- get_audio_duration – Get the duration of an audio array in seconds.
- normalize_audio – Normalize audio to the specified format.
- resample_audio_pyav – Resample audio using PyAV (libswresample via FFmpeg).
- split_audio – Split audio into chunks with intelligent split points.
AudioResampler
Resample audio data to a target sample rate.
Source code in vllm/multimodal/audio.py
AudioSpec dataclass
Specification for target audio format.
This dataclass defines the expected audio format for a model's feature extractor. It is used to normalize audio data before processing.
Attributes:
- target_channels (int | None) – Number of output channels. None means passthrough (no normalization). 1 = mono, 2 = stereo, etc.
- channel_reduction (ChannelReduction) – Method to reduce channels when input has more channels than target. Only used when reducing channels.
needs_normalization property
Whether audio normalization is needed.
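As a rough illustration of how the attributes above could drive this property, here is a minimal sketch. `SpecSketch` is a hypothetical stand-in, not the vLLM class, and the rule that normalization is needed whenever `target_channels` is set is an assumption drawn from the "None means passthrough" wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecSketch:
    """Hypothetical stand-in for AudioSpec; not the vLLM class."""

    # None means passthrough; 1 = mono, 2 = stereo, etc.
    target_channels: Optional[int]
    # Placeholder for a ChannelReduction value; a plain string is used here.
    channel_reduction: str

    @property
    def needs_normalization(self) -> bool:
        # Assumption: normalization is needed whenever a target channel count is set.
        return self.target_channels is not None


mono = SpecSketch(target_channels=1, channel_reduction="mean")
passthrough = SpecSketch(target_channels=None, channel_reduction="mean")
```

Under this assumption, `mono.needs_normalization` is true and `passthrough.needs_normalization` is false.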
ChannelReduction
Method to reduce multi-channel audio to target channels.
find_split_point(wav, start_idx, end_idx, min_energy_window)
Find the best point to split audio by looking for silence or low amplitude.
Searches for the quietest region within a specified range by calculating RMS energy in sliding windows.
Parameters:
- wav (ndarray) – Audio array. Can be 1D or multi-dimensional.
- start_idx (int) – Start index of search region (inclusive).
- end_idx (int) – End index of search region (exclusive).
- min_energy_window (int) – Window size in samples for energy calculation.
Returns:
- int – Index of the quietest point within the search region. This is the recommended split point to minimize audio artifacts.
Example
>>> import numpy as np
>>> from vllm.multimodal.audio import find_split_point
>>> audio = np.random.randn(32000)
>>> # Insert quiet region
>>> audio[16000:17600] = 0.01
>>> split_idx = find_split_point(
...     wav=audio,
...     start_idx=0,
...     end_idx=32000,
...     min_energy_window=1600,
... )
>>> 16000 <= split_idx <= 17600
True
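The RMS search described above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the vLLM implementation: `quietest_index` is a hypothetical name, the windows are stepped without overlap, and the time axis is taken to be the last one.

```python
import numpy as np


def quietest_index(wav: np.ndarray, start_idx: int, end_idx: int, window: int) -> int:
    """Return the start index of the lowest-RMS window in [start_idx, end_idx)."""
    best_idx = start_idx
    best_energy = float("inf")
    # Assumption: non-overlapping windows; a real implementation may use another stride.
    for pos in range(start_idx, end_idx - window + 1, window):
        chunk = wav[..., pos:pos + window]
        energy = float(np.sqrt(np.mean(chunk ** 2)))  # RMS energy of this window
        if energy < best_energy:
            best_idx, best_energy = pos, energy
    return best_idx


rng = np.random.default_rng(0)
audio = rng.standard_normal(32000)
audio[16000:17600] = 0.01  # quiet region, exactly one window long
split_idx = quietest_index(audio, start_idx=0, end_idx=32000, window=1600)
```

If the search region is shorter than one window, this sketch falls back to `start_idx`.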
get_audio_duration(*, y, sr=22050)
Get the duration of an audio array in seconds.
Parameters:
- y (NDArray[floating]) – Audio time series. Can be 1D (samples,) or 2D (channels, samples).
- sr (float, default: 22050) – Sample rate of the audio in Hz.
Returns:
- float – Duration of the audio in seconds.
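The duration is the number of samples divided by the sample rate. A minimal sketch, assuming the time axis is the last one for both accepted shapes; `duration_seconds` is a hypothetical name used only for illustration.

```python
import numpy as np


def duration_seconds(y: np.ndarray, sr: float = 22050) -> float:
    """Duration in seconds for (samples,) or (channels, samples) audio."""
    # Assumption: the last axis holds the samples in both layouts.
    return y.shape[-1] / sr


mono_duration = duration_seconds(np.zeros(44100))
stereo_duration = duration_seconds(np.zeros((2, 16000)), sr=16000)
```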
normalize_audio(audio, spec)
Normalize audio to the specified format.
This function handles channel reduction for multi-channel audio, supporting both numpy arrays and torch tensors.
Parameters:
- audio (NDArray[floating] | Tensor) – Input audio data. Can be:
  - 1D array/tensor: (time,), already mono
  - 2D array/tensor: (channels, time), the standard format from torchaudio
  - 2D array/tensor: (time, channels), the format from soundfile (auto-detected and transposed if time > channels)
- spec (AudioSpec) – AudioSpec defining the target format.
Returns:
- NDArray[floating] | Tensor – Normalized audio in the same type as input (numpy or torch). For mono output (target_channels=1), returns 1D array/tensor.
Raises:
- ValueError – If audio has unsupported dimensions or channel expansion is requested (e.g., mono to stereo).
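To make the layout handling concrete, the following sketch reduces NumPy audio to mono by averaging channels. It is a simplified illustration, not the library function: `to_mono_mean` is a hypothetical name, averaging is only one possible reduction, torch tensors are not handled, and the transpose rule (more rows than columns means (time, channels)) is taken from the parameter description above.

```python
import numpy as np


def to_mono_mean(audio: np.ndarray) -> np.ndarray:
    """Reduce 1D or 2D audio to a 1D mono signal by averaging channels."""
    if audio.ndim == 1:
        return audio  # already mono
    if audio.ndim != 2:
        raise ValueError(f"Unsupported audio dimensions: {audio.ndim}")
    # Heuristic: more rows than columns suggests (time, channels), so transpose.
    if audio.shape[0] > audio.shape[1]:
        audio = audio.T
    return audio.mean(axis=0)  # average over the channel axis


rng = np.random.default_rng(0)
channels_first = rng.standard_normal((2, 1000))  # torchaudio-style layout
mono_a = to_mono_mean(channels_first)
mono_b = to_mono_mean(channels_first.T)  # soundfile-style layout
```

Both layouts give the same mono signal here because the heuristic transposes the second input back to (channels, time).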
resample_audio_pyav(audio, *, orig_sr, target_sr)
Resample audio using PyAV (libswresample via FFmpeg).
Parameters:
- audio (NDArray[floating]) – Input audio. Can be:
  - 1D array (samples,): mono audio
  - 2D array (channels, samples): stereo audio
- orig_sr (float) – Original sample rate in Hz.
- target_sr (float) – Target sample rate in Hz.
Returns:
split_audio(audio_data, sample_rate, max_clip_duration_s, overlap_duration_s, min_energy_window_size)
Split audio into chunks with intelligent split points.
Splits long audio into smaller chunks at low-energy regions to minimize cutting through speech. Uses overlapping windows to find quiet moments for splitting.
Parameters:
- audio_data (ndarray) – Audio array to split. Can be 1D (mono) or multi-dimensional. Splits along the last dimension (time axis).
- sample_rate (int) – Sample rate of the audio in Hz.
- max_clip_duration_s (float) – Maximum duration of each chunk in seconds.
- overlap_duration_s (float) – Overlap duration in seconds between consecutive chunks. Used to search for optimal split points.
- min_energy_window_size (int) – Window size in samples for finding low-energy regions.
Returns:
- list[ndarray] – List of audio chunks. Each chunk is a numpy array with the same shape as the input except for the last (time) dimension.
Example
>>> import numpy as np
>>> from vllm.multimodal.audio import split_audio
>>> audio = np.random.randn(1040000)  # 65 seconds at 16kHz
>>> chunks = split_audio(
...     audio_data=audio,
...     sample_rate=16000,
...     max_clip_duration_s=30.0,
...     overlap_duration_s=1.0,
...     min_energy_window_size=1600,
... )
>>> len(chunks)
3
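One way such chunking could work is sketched below in plain NumPy. It is not the vLLM implementation, and `quietest_index` and `split_sketch` are hypothetical names. The sketch assumes the split point is searched for in the final `overlap_s` seconds of each candidate chunk, and that the overlap is shorter than the clip.

```python
import numpy as np


def quietest_index(wav, start_idx, end_idx, window):
    """Start index of the lowest-RMS window in [start_idx, end_idx), last axis."""
    best_idx, best_energy = start_idx, float("inf")
    for pos in range(start_idx, end_idx - window + 1, window):
        energy = float(np.sqrt(np.mean(wav[..., pos:pos + window] ** 2)))
        if energy < best_energy:
            best_idx, best_energy = pos, energy
    return best_idx


def split_sketch(audio, sample_rate, max_clip_s, overlap_s, window):
    """Cut audio into chunks of at most max_clip_s seconds at quiet points."""
    chunk = int(sample_rate * max_clip_s)
    overlap = int(sample_rate * overlap_s)
    if overlap >= chunk:
        raise ValueError("overlap must be shorter than the clip duration")
    total = audio.shape[-1]
    chunks, start = [], 0
    while start < total:
        if start + chunk >= total:
            chunks.append(audio[..., start:])  # remainder fits in one chunk
            break
        # Search the last `overlap` samples of the candidate chunk for a quiet spot.
        split = quietest_index(audio, start + chunk - overlap, start + chunk, window)
        chunks.append(audio[..., start:split])
        start = split
    return chunks


rng = np.random.default_rng(0)
audio = rng.standard_normal(1_040_000)  # 65 seconds at 16 kHz
chunks = split_sketch(audio, 16000, 30.0, 1.0, 1600)
```

Because each chunk ends where the next begins, concatenating the chunks along the time axis reproduces the input in this sketch.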