vllm.config.speech_to_text
Classes:
- SpeechToTextConfig – Configuration for speech-to-text models.
- SpeechToTextParams – All parameters consumed by get_generation_prompt().
SpeechToTextConfig
Configuration for speech-to-text models.
Attributes:
- max_audio_clip_s (int | None) – Maximum duration in seconds for a single audio clip without chunking.
- min_energy_split_window_size (int | None) – Window size in samples for finding low-energy (quiet) regions to split audio chunks.
- overlap_chunk_second (int) – Overlap duration in seconds between consecutive audio chunks when splitting long audio.
- sample_rate (float) – Sample rate (Hz) to resample input audio to. Most speech models expect 16 kHz audio input.
Source code in vllm/config/speech_to_text.py
max_audio_clip_s = 30 class-attribute instance-attribute
Maximum duration in seconds for a single audio clip without chunking. Audio longer than this is split into smaller chunks if allow_audio_chunking evaluates to True; otherwise it is rejected. None means the audio duration is unlimited and the audio is never chunked.
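The decision described above can be sketched as follows. This is a minimal illustration, not the vLLM implementation: SttConfigSketch and clip_action are hypothetical names, and the rule used for allow_audio_chunking (both the clip limit and the split window must be set) is an assumption inferred from the attribute descriptions on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SttConfigSketch:
    """Hypothetical stand-in mirroring the documented defaults (not the vLLM class)."""

    max_audio_clip_s: Optional[int] = 30
    min_energy_split_window_size: Optional[int] = 1600

    @property
    def allow_audio_chunking(self) -> bool:
        # Assumption: chunking needs both a clip limit and a split window.
        return (
            self.max_audio_clip_s is not None
            and self.min_energy_split_window_size is not None
        )


def clip_action(duration_s: float, cfg: SttConfigSketch) -> str:
    """Return 'process', 'chunk', or 'reject' for a clip of the given duration."""
    if cfg.max_audio_clip_s is None:
        return "process"  # unlimited duration, never chunked
    if duration_s <= cfg.max_audio_clip_s:
        return "process"  # short enough to handle in one piece
    return "chunk" if cfg.allow_audio_chunking else "reject"
```

With the default values, a 45-second clip would be chunked, while the same clip would be rejected if min_energy_split_window_size were None.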
min_energy_split_window_size = 1600 class-attribute instance-attribute
Window size in samples for finding low-energy (quiet) regions at which to split audio into chunks. The algorithm looks for the quietest moment within this window to minimize cutting through speech. The default of 1600 samples is ≈ 100 ms at 16 kHz. If None, no chunking is done.
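To make the window arithmetic and the idea of a low-energy split point concrete, here is a small pure-Python sketch. The function names are hypothetical, energy is taken to be the sum of squared samples, and the exhaustive sliding search is an assumption; the exact search strategy used by vLLM is not described on this page.

```python
from typing import Sequence


def window_duration_ms(window_size: int, sample_rate: float) -> float:
    """Convert a window size in samples to milliseconds."""
    return 1000.0 * window_size / sample_rate


def quietest_window_start(samples: Sequence[float], window_size: int) -> int:
    """Return the start index of the lowest-energy window of the given size.

    Energy is the sum of squared samples; the window slides one sample at a time.
    """
    if window_size <= 0 or window_size > len(samples):
        raise ValueError("window_size must be between 1 and len(samples)")
    energy = sum(s * s for s in samples[:window_size])
    best_start, best_energy = 0, energy
    for start in range(1, len(samples) - window_size + 1):
        # Slide the window: add the sample entering, drop the one leaving.
        energy += samples[start + window_size - 1] ** 2 - samples[start - 1] ** 2
        if energy < best_energy:
            best_start, best_energy = start, energy
    return best_start
```

A chunk boundary placed inside the returned window falls in the quietest stretch of the searched region, which is the stated goal of minimizing cuts through speech.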
overlap_chunk_second = 1 class-attribute instance-attribute
Overlap duration in seconds between consecutive audio chunks when splitting long audio. This helps maintain context across chunk boundaries and improves transcription quality at split points.
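The effect of the overlap on chunk boundaries can be illustrated with a short sketch. It assumes fixed-length chunks and ignores the low-energy split search, so the real boundaries may differ; chunk_bounds is a hypothetical helper, not part of vLLM.

```python
from typing import List, Tuple


def chunk_bounds(
    total_samples: int,
    sample_rate: float,
    max_clip_s: float,
    overlap_s: float,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for overlapping fixed-length chunks."""
    chunk = int(max_clip_s * sample_rate)
    overlap = int(overlap_s * sample_rate)
    if chunk <= overlap:
        raise ValueError("chunk length must exceed the overlap")
    step = chunk - overlap  # each chunk starts `overlap` samples before the previous end
    bounds = []
    start = 0
    while start < total_samples:
        end = min(start + chunk, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break
        start += step
    return bounds
```

Each chunk after the first repeats the last overlap_s seconds of its predecessor, so words near a boundary appear in full in at least one chunk.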
sample_rate = 16000 class-attribute instance-attribute
Sample rate (Hz) to resample input audio to. Most speech models expect 16kHz audio input. The input audio will be automatically resampled to this rate before processing.
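As a rough picture of what resampling to the target rate means for the sample count, the sketch below uses naive linear interpolation in pure Python. This is an illustration only: the resampling method vLLM actually uses is not stated on this page, resample_linear is a hypothetical name, and a production resampler would also apply an anti-aliasing filter when downsampling.

```python
from typing import List, Sequence


def resample_linear(
    samples: Sequence[float],
    orig_rate: float,
    target_rate: float = 16000.0,
) -> List[float]:
    """Resample by linear interpolation (no anti-aliasing; illustration only)."""
    if not samples:
        return []
    n_out = max(1, round(len(samples) * target_rate / orig_rate))
    ratio = orig_rate / target_rate  # input samples per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = min(i * ratio, last)  # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

One second of 48 kHz audio (48000 samples) comes out as 16000 samples, which is the length downstream code would see at the default sample_rate.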
SpeechToTextParams dataclass
All parameters consumed by get_generation_prompt().
TranscriptionRequest.build_stt_params() constructs this object, mapping API-level fields into typed attributes. Models only receive this object, so new parameters can be added here without changing the get_generation_prompt signature.
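The single-object pattern described above can be sketched with a stand-in dataclass. Everything here is hypothetical: SttParamsSketch carries only a subset of the documented fields, and the dictionary returned by this toy get_generation_prompt is an invented shape, not what vLLM models return.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SttParamsSketch:
    """Hypothetical stand-in with a subset of the documented fields."""

    audio: Sequence[float]
    language: Optional[str] = None
    task_type: str = "transcribe"
    request_prompt: str = ""
    to_language: Optional[str] = None
    hotwords: Optional[str] = None


def get_generation_prompt(params: SttParamsSketch) -> dict:
    """Toy model hook: the signature takes only the params object."""
    if params.task_type not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task_type: {params.task_type!r}")
    return {
        "task": params.task_type,
        "language": params.language,
        "prompt": params.request_prompt,
        "num_samples": len(params.audio),
    }
```

Because the hook takes one object, adding a field with a default value to the dataclass leaves existing model implementations untouched, which is the benefit the description points to.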
Attributes:
- audio (ndarray) – Resampled audio waveform for a single chunk.
- hotwords (str | None) – Important words or phrases that the model should pay extra attention to during transcription.
- language (str | None) – ISO 639-1 language code (validated / auto-detected).
- model_config (ModelConfig) – Model configuration.
- request_prompt (str) – Optional text prompt to guide the model.
- stt_config (SpeechToTextConfig) – Server-level speech-to-text configuration.
- task_type (str) – "transcribe" or "translate".
- to_language (str | None) – Target language for translation (model-dependent).
Source code in vllm/config/speech_to_text.py
audio instance-attribute
Resampled audio waveform for a single chunk.
hotwords = None class-attribute instance-attribute
Important words or phrases that the model should pay extra attention to during transcription.
language = None class-attribute instance-attribute
ISO 639-1 language code (validated / auto-detected).
model_config instance-attribute
Model configuration.
request_prompt = '' class-attribute instance-attribute
Optional text prompt to guide the model.
stt_config instance-attribute
Server-level speech-to-text configuration.
task_type = 'transcribe' class-attribute instance-attribute
"transcribe" or "translate".
to_language = None class-attribute instance-attribute
Target language for translation (model-dependent).