`vllm.model_executor.models` ¶

Modules:

AXK1 –

Inference-only A.X K1 model.
adapters –
afmoe –

Inference-only AfMoE model compatible with HuggingFace weights.
apertus –

Inference-only Apertus model compatible with HuggingFace weights.
arcee –
arctic –

Inference-only Snowflake Arctic model.
aria –
audioflamingo3 –
aya_vision –
bagel –

Inference-only BAGEL model compatible with HuggingFace weights.
baichuan –

Inference-only BaiChuan model compatible with HuggingFace weights.
bailing_moe –

Inference-only BailingMoE model compatible with HuggingFace weights.
bailing_moe_linear –
bamba –

Inference-only Bamba model.
bee –
bert –
blip –

Minimal implementation of BlipVisionModel intended to be only used
blip2 –
bloom –

Inference-only BLOOM model compatible with HuggingFace weights.
chameleon –
chatglm –

Inference-only ChatGLM model compatible with THUDM weights.
cheers –

Inference-only Cheers (UMM) model compatible with HuggingFace weights.
clip –
cohere2_moe –
cohere2_vision –

Command-A-Vision (Cohere2Vision) multimodal model implementation for vLLM.
cohere_asr –
cohere_eagle –
colbert –

ColBERT late interaction model for retrieval and reranking.
colmodernvbert –

ColModernVBERT: multimodal late-interaction retrieval model.
colpali –

ColPali late interaction model for multi-modal retrieval and reranking.
colqwen3 –

ColQwen3 late interaction model for multi-modal retrieval and reranking.
colqwen3_5 –

ColQwen3.5 late interaction model for multi-modal retrieval and reranking.
commandr –

PyTorch Cohere model.
config –
conformer_encoder –

Shared Conformer encoder components for FireRedASR2 and FireRedLID.
cosmos3 –
dbrx –
deepencoder –
deepencoder2 –
deepseek_eagle3 –

Eagle3 speculative decoding model for DeepseekV2/V3 with MLP (no MoE).
deepseek_mtp –
deepseek_ocr –

Inference-only Deepseek-OCR model compatible with HuggingFace weights.
deepseek_ocr2 –

Inference-only Deepseek-OCR model compatible with HuggingFace weights.
deepseek_v2 –

Inference-only DeepseekV2/DeepseekV3 model.
deepseek_vl2 –

Inference-only Deepseek-VL2 model compatible with HuggingFace weights.
dots1 –

Inference-only dots1 model.
dots_ocr –
eagle2_5_vl –
ernie45 –

Inference-only Ernie model compatible with HuggingFace weights.
ernie45_moe –

Inference-only ErnieMoE model compatible with HuggingFace weights.
ernie45_vl –

Inference-only Ernie VL model compatible with HuggingFace weights.
ernie45_vl_moe –

Inference-only Ernie VL model compatible with HuggingFace weights.
ernie_mtp –

Inference-only Ernie-MTP model.
exaone –

Inference-only Exaone model compatible with HuggingFace weights.
exaone4 –

Inference-only Exaone model compatible with HuggingFace weights.
exaone4_5 –

Inference-only EXAONE-4.5 model compatible with HuggingFace weights.
exaone4_5_mtp –

Inference-only EXAONE-4_5 MTP model.
exaone_moe –

Inference-only K-EXAONE-236B-A22B model compatible with HuggingFace weights.
exaone_moe_mtp –

Inference-only ExaoneMoe MTP model.
extract_hidden_states –

Hidden States Extractor Model.
fairseq2_llama –

Llama model for fairseq2 weights.
falcon –

PyTorch Falcon model.
falcon_h1 –

Inference-only FalconH1 model.
fireredasr2 –
fireredlid –

FireRedLID – Language Identification model adapted for vLLM.
flex_olmo –

Inference-only FlexOlmo model compatible with HuggingFace weights.
funasr –
funaudiochat –

Inference-only FunAudioChat model compatible with HuggingFace weights.
fuyu –

PyTorch Fuyu model.
gemma –

Inference-only Gemma model compatible with HuggingFace weights.
gemma3_mm –
gemma3n –
gemma3n_audio_utils –

Lightweight utility functions for Gemma3n audio processing.
gemma3n_mm –
gemma4 –

Gemma 4 model implementation for vLLM.
gemma4_mm –

Gemma 4 multimodal model (image + audio + video support).
gemma4_mtp –

Inference-only Gemma4 MTP (Multi-Token Prediction) model.
gemma4_unified –

Gemma 4 Unified multimodal model (encoder-free image + audio + video).
glm –

Inference-only HF format GLM-4 model compatible with THUDM weights.
glm4 –

Inference-only GLM-4-0414 model compatible with HuggingFace weights.
glm4_1v –

Inference-only GLM-4.1V & GLM-4.6V-Flash, AutoGLM-Phone-9B model
glm4_moe –

Inference-only GLM-4.5, GLM-4.6, GLM-4.7 model
glm4_moe_lite –

Inference-only GLM-4.7-Flash model compatible with HuggingFace weights.
glm4_moe_lite_mtp –

Inference-only GLM-4.7-Flash MTP model compatible with HuggingFace weights.
glm4_moe_mtp –

Inference-only GLM-4.5, GLM-4.6, GLM-4.7 MTP
glm4v –

Inference-only CogAgent model compatible with THUDM weights.
glm_ocr –

Inference-only GLM-OCR model compatible with HuggingFace weights.
glm_ocr_mtp –

Inference-only GLM-OCR MTP model compatible with HuggingFace weights.
glmasr –
glmasr_utils –
gpt2 –

Inference-only GPT-2 model compatible with HuggingFace weights.
gpt_bigcode –

Inference-only GPTBigCode model compatible with HuggingFace weights.
gpt_j –

Inference-only GPT-J model compatible with HuggingFace weights.
gpt_neox –

Inference-only GPT-NeoX model compatible with HuggingFace weights.
granite –

Inference-only IBM Granite model compatible with HuggingFace weights.
granite4_vision –

vLLM implementation of Granite 4 Vision.
granite_speech –

Inference-only IBM Granite speech model.
granite_speech_plus –

Inference-only IBM Granite Speech Plus model.
granitemoe –

Inference-only GraniteMoe model.
granitemoehybrid –

Inference-only GraniteMoeHybrid model.
granitemoeshared –

Inference-only GraniteMoeShared model.
gritlm –
grok1 –

Inference-only Grok (Grok1/Grok2) model.
hunyuan_v1 –

Inference-only HunYuan model compatible with HuggingFace weights.
hunyuan_vision –

Inference-only HunYuan-VL model compatible with HuggingFace weights.
hy_v3 –

Inference-only HY model compatible with HuggingFace weights.
hy_v3_mtp –

Inference-only HY V3 MTP model compatible with HuggingFace weights.
hyperclovax –

Inference-only HyperCLOVAX model compatible with HuggingFace weights.
hyperclovax_vision –
hyperclovax_vision_v2 –

HyperCLOVAX V2 (32B Think Model) Implementation.
idefics2_vision_model –

PyTorch Idefics2 model.
idefics3 –

Inference-only Idefics3 model compatible with HuggingFace weights.
interfaces –
interfaces_base –
intern_vit –
interns1 –
interns1_pro –

Inference-only InternS1Pro model compatible with HuggingFace weights.
interns1_vit –
internvl –
iquest_loopcoder –

Inference-only LoopCoder model compatible with HuggingFace weights.
isaac –
jais2 –

Inference-only Jais2 model compatible with HuggingFace weights.
jamba –

Inference-only Jamba model.
jina –
kanana_v –
keye –
keye_vl1_5 –
kimi_audio –

Inference-only Kimi-Audio model compatible with HuggingFace weights.
kimi_k25 –

Kimi-K2.5 Model Implementation for vLLM.
kimi_k25_vit –

Vision tower implementation for Kimi-K2.5 model.
kimi_linear –
kimi_vl –
laguna –

Inference-only Laguna model compatible with HuggingFace weights.
lfm2 –
lfm2_moe –
lfm2_siglip2 –

Implementation of Siglip2VisionModel intended to be only used
lfm2_vl –
llama –

Inference-only LLaMA model compatible with HuggingFace weights.
llama4 –

Inference-only LLaMA model compatible with HuggingFace weights.
llama4_eagle –
llava –
llava_next –
llava_next_video –
llava_onevision –
longcat_flash –

Inference-only Flash model compatible with HuggingFace weights.
longcat_flash_mtp –
mamba –

PyTorch MAMBA model.
mamba2 –

PyTorch MAMBA2 model.
medusa –
mellum –
midashenglm –

Inference-only MiDashengLM model compatible with HuggingFace weights.
mimo –

Inference-only MiMo model compatible with HuggingFace weights.
mimo_audio –

MiMo audio: tokenizer, encoding utilities, and audio encoder.
mimo_mtp –

Inference-only MiMo-MTP model.
mimo_v2_mtp –

Inference-only MiMo-V2 MTP (Multi-Token Prediction) draft model.
mimo_v2_omni –
minicpm –

Inference-only MiniCPM model compatible with HuggingFace weights.
minicpm3 –

Inference-only MiniCPM3 model compatible with HuggingFace weights.
minicpm_eagle –

Inference-only EagleMiniCPM model compatible with HuggingFace weights.
minicpmo –

Inference-only MiniCPM-O model compatible with HuggingFace weights.
minicpmv –

Inference-only MiniCPM-V model compatible with HuggingFace weights.
minicpmv4_6 –

Inference-only MiniCPM-V 4.6 model (MiniCPMV4_6ForConditionalGeneration).
minimax_m2 –

Inference-only MiniMaxM2 model.
minimax_text_01 –

Inference-only MiniMaxText01 model.
minimax_vl_01 –
mistral –

Mistral adaptation of the LLaMA architecture.
mistral3 –
mistral_large_3 –
mixtral –

Inference-only Mixtral model.
mllama4 –
mlp_speculator –
molmo –
molmo2 –
moondream3 –

Inference-only Moondream3 model implementation.
moonvit –
musicflamingo –
nano_nemotron_vl –
nemotron –

Inference-only Nemotron model compatible with HuggingFace weights.
nemotron_h –

Inference-only NemotronH model.
nemotron_h_mtp –

NemotronH-MTP model with attention layers.
nemotron_nas –

Inference-only deci model compatible with HuggingFace weights.
nemotron_parse –
nemotron_vl –
olmo –

Inference-only OLMo model compatible with HuggingFace weights.
olmo2 –

Inference-only OLMo2 model compatible with HuggingFace weights.
olmo_hybrid –

Inference-only OLMo Hybrid model compatible with HuggingFace weights.
olmoe –

Inference-only OLMoE model compatible with HuggingFace weights.
opencua –

Inference-only OpenCUA-7B model compatible with HuggingFace weights.
openpangu_mtp –
openpangu_vl –
openvla –
opt –

Inference-only OPT model compatible with HuggingFace weights.
orion –

Inference-only Orion-14B model compatible with HuggingFace weights.
ouro –

Inference-only Ouro model compatible with HuggingFace weights.
ovis –

PyTorch Ovis model.
ovis2_5 –

PyTorch Ovis model.
paddleocr_vl –
paligemma –
parakeet –

Modules below used for the audio encoder component in: models/nano_nemotron_vl.py
param2moe –
persimmon –

Inference-only persimmon model compatible with HuggingFace weights.
phi –

Inference-only Phi-1.5 model compatible with HuggingFace weights.
phi3 –

Inference-only Phi3 model; the code inherits from llama.py
phi3v –
phi4mm –
phi4mm_audio –
phi4mm_utils –
phi4siglip –

vLLM support for microsoft/Phi-4-reasoning-vision-15B.
phimoe –

Inference-only PhiMoE model.
pixtral –
plamo2 –

Inference-only PLaMo2 model.
plamo3 –

Inference-only PLaMo3 model.
qianfan_ocr –
qwen –

Inference-only QWen model compatible with HuggingFace weights.
qwen2 –

Inference-only Qwen2 model compatible with HuggingFace weights.
qwen2_5_omni_thinker –

Inference-only Qwen2.5-Omni model (thinker part).
qwen2_5_vl –

Inference-only Qwen2.5-VL model compatible with HuggingFace weights.
qwen2_audio –

Inference-only Qwen2-Audio model compatible with HuggingFace weights.
qwen2_moe –

Inference-only Qwen2MoE model compatible with HuggingFace weights.
qwen2_rm –

Inference-only Qwen2-RM model compatible with HuggingFace weights.
qwen2_vl –

Inference-only Qwen2-VL model compatible with HuggingFace weights.
qwen3 –

Inference-only Qwen3 model compatible with HuggingFace weights.
qwen3_5 –

Inference-only Qwen3.5 Series compatible with HuggingFace weights.
qwen3_5_mtp –

Inference-only Qwen3_5 MTP model.
qwen3_asr –

Inference-only Qwen3-ASR model.
qwen3_asr_forced_aligner –

Inference-only Qwen3-ASR ForcedAligner model (token classification).
qwen3_asr_realtime –

Inference-only Qwen3-ASR realtime model.
qwen3_dflash –
qwen3_moe –

Inference-only Qwen3MoE model compatible with HuggingFace weights.
qwen3_next –

Inference-only Qwen3Next model.
qwen3_next_mtp –

Inference-only Qwen3Next MTP model.
qwen3_omni_moe_thinker –

Inference-only Qwen3-Omni-Moe model (thinker part).
qwen3_vl –

Inference-only Qwen3VL model compatible with HuggingFace weights.
qwen3_vl_moe –

Inference-only Qwen3-VL-MoE model compatible with HuggingFace weights.
qwen_vl –

Inference-only Qwen-VL model compatible with HuggingFace weights.
radio –
registry –

Whenever you add an architecture to this page, please also update
roberta –
sarvam –
seed_oss –

Inference-only SeedOss model compatible with HuggingFace weights.
siglip –
siglip2navit –

Implementation of SiglipVisionModel intended to be only used
skyworkr1v –
solar –

Inference-only Solar model compatible with HuggingFace weights.
stablelm –

Inference-only StableLM (https://github.com/Stability-AI/StableLM)
starcoder2 –

PyTorch Starcoder2 model.
step1 –

Shared Step decoder blocks and the Step1 text model.
step3_text –

Inference-only Jurassic model.
step3_vl –
step3p5 –

Inference-only Jurassic model.
step3p5_mtp –
step3p7 –

Inference-only Jurassic model.
step_vl –

This is basically a copy from perception_models/core/vision_encoder/pe.py
tarsier –
terratorch –

Wrapper around Terratorch models
transformers –

Wrapper around transformers models
ultravox –

PyTorch Ultravox model.
utils –
vision –
voxtral –
voxtral_realtime –
voyage –
whisper –
whisper_causal –
zamba2 –

PyTorch Zamba2 model implementation for vLLM.

Classes:

HasInnerState –

The interface required for all models that have inner state.
SupportsLoRA –

The interface required for all models that support LoRA.
SupportsMRoPE –

The interface required for all models that support M-RoPE.
SupportsMultiModal –

The interface required for all multi-modal models.
SupportsPP –

The interface required for all models that support pipeline parallel.
SupportsTranscription –

The interface required for all models that support transcription.
VllmModelForPooling –

The interface required for all pooling models in vLLM.
VllmModelForTextGeneration –

The interface required for all generative models in vLLM.

`HasInnerState` ¶

Bases: Protocol

The interface required for all models that have inner state.

Attributes:

has_inner_state (Literal[True]) –

A flag that indicates this model has inner state.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class HasInnerState(Protocol):
    """The interface required for all models that have inner state."""

    has_inner_state: ClassVar[Literal[True]] = True
    """
        A flag that indicates this model has inner state.
        Models that have inner state usually need access to the scheduler_config
        for max_num_seqs, etc. True for e.g. both Mamba and Jamba.
    """

`has_inner_state = True` `class-attribute` ¶

A flag that indicates this model has inner state. Models that have inner state usually need access to the scheduler_config for max_num_seqs, etc. True for e.g. both Mamba and Jamba.
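The flag-style interface above can be illustrated with a small, self-contained sketch. `HasInnerStateSketch`, `ToyStatefulModel`, `ToyStatelessModel`, and `needs_state_setup` are hypothetical stand-ins written for this example, not vLLM names; the sketch only assumes the standard `typing` behaviour of runtime-checkable protocols.

```python
from typing import ClassVar, Literal, Protocol, runtime_checkable


@runtime_checkable
class HasInnerStateSketch(Protocol):
    """Local stand-in that mirrors the shape of the interface above."""

    has_inner_state: ClassVar[Literal[True]] = True


class ToyStatefulModel(HasInnerStateSketch):
    """Inherits the flag through the MRO; nothing needs to be redefined."""


class ToyStatelessModel:
    """Does not declare the flag at all."""


def needs_state_setup(model: object) -> bool:
    # isinstance() on a runtime-checkable protocol only checks that the
    # attribute exists; it does not validate the attribute's value or type.
    return isinstance(model, HasInnerStateSketch)
```

A caller could use such a check to decide whether extra scheduler-related setup is needed; how vLLM itself consumes the flag is not shown on this page.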

`SupportsLoRA` ¶

Bases: Protocol

The interface required for all models that support LoRA.

Attributes:

supports_lora (Literal[True]) –

A flag that indicates this model supports LoRA.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsLoRA(Protocol):
    """The interface required for all models that support LoRA."""

    supports_lora: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports LoRA.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """
    is_3d_moe_weight: ClassVar[bool] = False
    is_non_gated_moe: ClassVar[bool] = False
    # The `embedding_modules` and `embedding_padding_modules`
    # are empty by default.
    embedding_modules: ClassVar[dict[str, str]] = {}
    packed_modules_mapping: dict[str, list[str]] = {}
    # Module prefixes to skip during LoRA loading (e.g., ["mtp."] for MTP layers)
    lora_skip_prefixes: ClassVar[list[str]] = []

`supports_lora = True` `class-attribute` ¶

A flag that indicates this model supports LoRA.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.
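The note about the MRO can be sketched as follows. `SupportsLoRASketch` is a local stand-in that copies a few fields from the listing above, and `ToyLoRAModel` and `should_skip` are hypothetical names; in particular, treating `lora_skip_prefixes` as a plain `str.startswith` match is an assumption made for illustration.

```python
from typing import ClassVar, Literal, Protocol, runtime_checkable


@runtime_checkable
class SupportsLoRASketch(Protocol):
    """Local stand-in with a subset of the fields listed above."""

    supports_lora: ClassVar[Literal[True]] = True
    packed_modules_mapping: dict[str, list[str]] = {}
    lora_skip_prefixes: ClassVar[list[str]] = []


class ToyLoRAModel(SupportsLoRASketch):
    # Only model-specific values are overridden here; `supports_lora`
    # is inherited from the protocol through the MRO.
    packed_modules_mapping = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]}
    lora_skip_prefixes = ["mtp."]


def should_skip(model_cls: type, module_name: str) -> bool:
    """Hypothetical helper: skip modules whose name starts with a listed prefix."""
    return any(module_name.startswith(p) for p in model_cls.lora_skip_prefixes)
```

Because the protocol class sits in `ToyLoRAModel.__mro__`, the model class never has to restate `supports_lora = True`.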

`SupportsMRoPE` ¶

Bases: Protocol

The interface required for all models that support M-RoPE.

Methods:

get_mrope_input_positions –

Get M-RoPE input positions and delta value for this specific model.

Attributes:

supports_mrope (Literal[True]) –

A flag that indicates this model supports M-RoPE.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsMRoPE(Protocol):
    """The interface required for all models that support M-RoPE."""

    supports_mrope: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports M-RoPE.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    def get_mrope_input_positions(
        self,
        input_tokens: list[int],
        mm_features: list["MultiModalFeatureSpec"],
    ) -> tuple[torch.Tensor, int]:
        """
        Get M-RoPE input positions and delta value for this specific model.

        This method should be implemented by each model that supports M-RoPE
        to provide model-specific logic for computing input positions.

        Args:
            input_tokens: List of input token IDs
            mm_features: Information about each multi-modal data item

        Returns:
            Tuple of `(llm_positions, mrope_position_delta)`
            - llm_positions: Tensor of shape `[3, num_tokens]` with T/H/W positions
            - mrope_position_delta: Delta for position calculations
        """
        ...

`supports_mrope = True` `class-attribute` ¶

A flag that indicates this model supports M-RoPE.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

`get_mrope_input_positions(input_tokens, mm_features)` ¶

Get M-RoPE input positions and delta value for this specific model.

This method should be implemented by each model that supports M-RoPE to provide model-specific logic for computing input positions.

Parameters:

input_tokens ¶
(list[int]) –

List of input token IDs
mm_features ¶
(list[MultiModalFeatureSpec]) –

Information about each multi-modal data item

Returns:

tuple[Tensor, int] –

Tuple of (llm_positions, mrope_position_delta):
- llm_positions: Tensor of shape [3, num_tokens] with T/H/W positions
- mrope_position_delta: Delta for position calculations

Source code in vllm/model_executor/models/interfaces.py

def get_mrope_input_positions(
    self,
    input_tokens: list[int],
    mm_features: list["MultiModalFeatureSpec"],
) -> tuple[torch.Tensor, int]:
    """
    Get M-RoPE input positions and delta value for this specific model.

    This method should be implemented by each model that supports M-RoPE
    to provide model-specific logic for computing input positions.

    Args:
        input_tokens: List of input token IDs
        mm_features: Information about each multi-modal data item

    Returns:
        Tuple of `(llm_positions, mrope_position_delta)`
        - llm_positions: Tensor of shape `[3, num_tokens]` with T/H/W positions
        - mrope_position_delta: Delta for position calculations
    """
    ...
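To make the return contract concrete, here is a minimal sketch for a text-only prompt. It uses nested Python lists in place of a `torch.Tensor` so that it runs without extra dependencies. The function name `text_only_mrope_positions` is hypothetical, and the delta convention (largest position plus one, minus the number of tokens) is an assumption about how implementations typically compute it, not something this page states.

```python
def text_only_mrope_positions(
    input_tokens: list[int],
) -> tuple[list[list[int]], int]:
    """Return ([3, num_tokens] positions, delta) for a prompt with no images."""
    num_tokens = len(input_tokens)
    index = list(range(num_tokens))

    # Three rows (T, H, W). For pure text all three axes advance together,
    # so every row is simply 0, 1, ..., num_tokens - 1.
    positions = [index.copy() for _ in range(3)]

    # Assumed convention: delta = (largest position + 1) - number of tokens.
    max_pos = max((max(row) for row in positions if row), default=-1)
    delta = max_pos + 1 - num_tokens
    return positions, delta
```

A real implementation would also consume `mm_features` to assign distinct T/H/W coordinates to image or video tokens, which usually makes the delta non-zero.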

`SupportsMultiModal` ¶

Bases: Protocol

The interface required for all multi-modal models.

Methods:

configure_mm_token_handling –

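Check if any multimodal tokens are out of vocabulary. If so, we will explicitly mask all multimodal tokens out when computing text embeddings, since the multimodal embeddings will be scattered over the results.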
embed_input_ids –

Apply token embeddings to input_ids.
embed_multimodal –

Returns multimodal embeddings generated from multimodal kwargs to be merged with text embeddings.
get_language_model –

Returns the underlying language model used for text generation.
get_num_mm_connector_tokens –

Implement this function to enable LoRA support for the connector module of the multi-modal model.
get_num_mm_encoder_tokens –

Implement this function to enable LoRA support for the tower module of the multi-modal model.
get_placeholder_str –

Get the placeholder text for the ith modality item in the prompt.

Attributes:

requires_raw_input_tokens (bool) –

A flag that indicates this model processes input id tokens in their raw form and not input embeddings.
supports_encoder_tp_data (bool) –

A flag that indicates whether this model supports multimodal_config.mm_encoder_tp_mode="data".
supports_multimodal (Literal[True]) –

A flag that indicates this model supports multi-modal inputs.
supports_multimodal_raw_input_only (bool) –

A flag that indicates this model supports multi-modal inputs and processes them in their raw form and not embeddings.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsMultiModal(Protocol):
    """The interface required for all multi-modal models."""

    supports_multimodal: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports multi-modal inputs.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    supports_multimodal_raw_input_only: ClassVar[bool] = False
    """
    A flag that indicates this model supports multi-modal inputs and processes
    them in their raw form and not embeddings.
    """

    supports_encoder_tp_data: ClassVar[bool] = False
    """
    A flag that indicates whether this model supports
    `multimodal_config.mm_encoder_tp_mode="data"`.
    """

    requires_raw_input_tokens: ClassVar[bool] = False
    """
    A flag that indicates this model processes input id tokens
    in their raw form and not input embeddings.
    """

    _processor_factory: ClassVar[_ProcessorFactories]
    """
    Set internally by `MultiModalRegistry.register_processor`.
    """

    _language_model_names: list[str] = []
    """
    Set internally by `_mark_language_model`.
    """

    _tower_model_names: list[str] = []
    """
    Set internally by `_mark_tower_model`.
    """

    _has_oov_mm_tokens: bool = False
    """
    In general, this should be set at init time by invoking
    `configure_mm_token_handling` and passing all potentially
    OOV multimodal tokens.
    """

    @classmethod
    def get_placeholder_str(cls, modality: str, i: int) -> str | None:
        """
        Get the placeholder text for the `i`th `modality` item in the prompt.
        """
        ...

    def embed_multimodal(self, **kwargs: object) -> MultiModalEmbeddings:
        """
        Returns multimodal embeddings generated from multimodal kwargs
        to be merged with text embeddings.

        Note:
            The returned multimodal embeddings must be in the same order as
            the appearances of their corresponding multimodal data item in the
            input prompt.
        """
        ...

    def configure_mm_token_handling(self, vocab_size: int, mm_token_ids: list[int]):
        """Check if any multimodal tokens are out of vocabulary. If so, we will
        explicitly mask all multimodal tokens out when computing text embeddings,
        since the multimodal embeddings will be scattered over the results.
        """
        self._has_oov_mm_tokens = any(tok_id >= vocab_size for tok_id in mm_token_ids)
        logger.info(
            "Contains out of vocabulary multimodal tokens? %s",
            self._has_oov_mm_tokens,
        )

    def get_language_model(self) -> VllmModel:
        """
        Returns the underlying language model used for text generation.

        This is typically the `torch.nn.Module` instance responsible for
        processing the merged multimodal embeddings and producing hidden states.

        Returns:
            torch.nn.Module: The core language model component.
        """
        # Cached
        if self in _language_model_by_module:
            return _language_model_by_module[self]

        if self._language_model_names:
            mod = self
            for attr in common_prefix(
                [name.split(".") for name in self._language_model_names]
            ):
                if attr:
                    mod = getattr(mod, attr)

            if mod is not self and hasattr(mod, "embed_input_ids"):
                _language_model_by_module[self] = mod
                return mod

        # Fallback
        for mod in self.children():
            if hasattr(mod, "embed_input_ids"):
                _language_model_by_module[self] = mod
                return mod

        raise NotImplementedError(
            f"No language model found in {type(self).__name__}! "
            "You should initialize it via `_mark_language_model`, "
            "and make sure `embed_input_ids` is implemented."
        )

    @contextmanager
    def _mark_language_model(
        self,
        vllm_config: VllmConfig,
        *,
        targets: type[nn.Module] | tuple[type[nn.Module], ...] | None = None,
    ):
        """
        Mark each child module that was assigned to this model during this context
        as a language model component.

        Language model components are automatically skipped in `--mm-encoder-only`
        mode.

        If `targets` is set, instead include descendants that are an instance
        of `targets`, even if they aren't direct children.
        """
        from .utils import StageMissingLayer, collect_children, no_init_weights

        mm_config = vllm_config.model_config.multimodal_config

        with collect_children(self, targets=targets) as children_names:  # noqa: SIM117
            with (
                no_init_weights(
                    self,
                    lambda mod: StageMissingLayer("language_model", mod),
                    targets=targets,
                )
                if mm_config.mm_encoder_only
                else nullcontext()
            ):
                yield

        self._language_model_names = children_names

    @contextmanager
    def _mark_tower_model(
        self,
        vllm_config: VllmConfig,
        modalities: set[str] | str,
        *,
        targets: type[nn.Module] | tuple[type[nn.Module], ...] | None = None,
    ):
        """
        Mark each child module that was assigned to this model during this context
        as a tower model component.

        Tower model components are automatically skipped when `--limit-mm-per-prompt`
        is set to zero for all of their modalities.

        If `targets` is set, instead include descendants that are an instance
        of `targets`, even if they aren't direct children.
        """
        from .utils import StageMissingLayer, collect_children, no_init_weights

        if isinstance(modalities, str):
            modalities = {modalities}

        if modalities == {"image", "video"}:
            stage_name = "vision_tower"
        else:
            stage_name = "_".join([*modalities, "tower"])

        mm_config = vllm_config.model_config.multimodal_config

        with collect_children(self, targets=targets) as children_names:  # noqa: SIM117
            with (
                no_init_weights(
                    self,
                    lambda mod: StageMissingLayer(stage_name, mod),
                    targets=targets,
                )
                if all(mm_config.get_limit_per_prompt(m) == 0 for m in modalities)
                else nullcontext()
            ):
                yield

        self._tower_model_names = children_names

    @contextmanager
    def _mark_composite_model(
        self,
        vllm_config: VllmConfig,
        *,
        language_targets: type[nn.Module] | tuple[type[nn.Module], ...],
        tower_targets: dict[str, type[nn.Module] | tuple[type[nn.Module], ...]],
    ):
        """
        Composite wrapper over `_mark_language_model` and
        `_mark_tower_model` by modality.
        """
        with ExitStack() as stack:
            stack.enter_context(
                self._mark_language_model(
                    vllm_config,
                    targets=language_targets,
                )
            )

            for modality, modality_targets in tower_targets.items():
                stack.enter_context(
                    self._mark_tower_model(
                        vllm_config,
                        modality,
                        targets=modality_targets,
                    )
                )

            yield

    def get_num_mm_encoder_tokens(self, num_image_tokens: int) -> int:
        """
        Implement this function to enable LoRA support
        for the tower module of the multi-modal model.
        Given the number of image tokens, output the number of
        multi-modal encoder tokens.
        """
        ...

    def get_num_mm_connector_tokens(self, num_vision_tokens: int) -> int:
        """
        Implement this function to enable LoRA support
        for the connector module of the multi-modal model.
        Given the number of vision tokens, output the number of
        multi-modal connector tokens.
        """
        ...

    @overload
    def embed_input_ids(self, input_ids: Tensor) -> Tensor: ...

    @overload
    def embed_input_ids(
        self,
        input_ids: Tensor,
        multimodal_embeddings: MultiModalEmbeddings,
        *,
        is_multimodal: torch.Tensor,
    ) -> Tensor: ...

    def _embed_text_input_ids(
        self,
        input_ids: Tensor,
        embed_input_ids: Callable[[Tensor], Tensor],
        *,
        is_multimodal: Tensor | None,
    ) -> Tensor:
        if is_multimodal is not None and self._has_oov_mm_tokens:
            # Force all input IDs to be in vocab; we do this instead of squeezing
            # to ensure that any external configuration requiring offset tracking,
            # e.g., LoRA, are applied correctly regardless of whether or not
            # we have multimodal tokens.
            in_vocab_ids = input_ids.masked_fill(
                is_multimodal.to(device=input_ids.device, non_blocking=True), 0
            )
            return embed_input_ids(in_vocab_ids)

        return embed_input_ids(input_ids)

    def embed_input_ids(
        self,
        input_ids: Tensor,
        multimodal_embeddings: MultiModalEmbeddings | None = None,
        *,
        is_multimodal: Tensor | None = None,
    ) -> Tensor:
        """
        Apply token embeddings to `input_ids`.

        If `multimodal_embeddings` is passed, scatter them into
        `input_ids` according to the mask `is_multimodal`.

        NOTE: If this model has multimodal tokens that are out of vocabulary
        (i.e., self._has_oov_mm_tokens=True), the input_ids will be copied
        and masked to 0 during the forward pass for the text embeddings.
        """
        from .utils import _merge_multimodal_embeddings

        # Get text embeddings first; multimodal embeddings will clobber
        # any invalid contents in the indices of multimodal embeddings
        # for the in vocabulary and out of vocabulary case.
        inputs_embeds = self._embed_text_input_ids(
            input_ids,
            self.get_language_model().embed_input_ids,
            is_multimodal=is_multimodal,
        )

        if multimodal_embeddings is None or len(multimodal_embeddings) == 0:
            return inputs_embeds

        return _merge_multimodal_embeddings(
            inputs_embeds=inputs_embeds,
            multimodal_embeddings=multimodal_embeddings,
            is_multimodal=_require_is_multimodal(is_multimodal),
        )
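The scatter step that `embed_input_ids` delegates to can be sketched in plain Python. `merge_embeddings_sketch` is a hypothetical stand-in for the internal merge helper and works on lists of rows instead of tensors; it only illustrates the documented contract that multimodal embeddings replace the masked positions in prompt order.

```python
def merge_embeddings_sketch(
    text_embeds: list[list[float]],
    mm_embeds: list[list[float]],
    is_multimodal: list[bool],
) -> list[list[float]]:
    """Overwrite masked rows of `text_embeds` with rows of `mm_embeds`, in order."""
    num_slots = sum(is_multimodal)
    if num_slots != len(mm_embeds):
        # The number of placeholder positions must match the number of
        # multimodal embedding rows, otherwise the scatter is ill-defined.
        raise ValueError(
            f"expected {num_slots} multimodal rows, got {len(mm_embeds)}"
        )

    mm_iter = iter(mm_embeds)
    return [
        next(mm_iter) if flag else row
        for row, flag in zip(text_embeds, is_multimodal)
    ]
```

Rows at unmasked positions are left untouched, which is why the text embeddings can be computed first even when some placeholder IDs were masked to 0.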

`_has_oov_mm_tokens = False` `class-attribute` `instance-attribute` ¶

In general, this should be set at init time by invoking configure_mm_token_handling and passing all potentially OOV multimodal tokens.
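The out-of-vocabulary check and the masking it enables can be sketched without tensors. The first function mirrors the `any(tok_id >= vocab_size ...)` expression in the source listing; `mask_mm_ids` is a hypothetical list-based analogue of the `masked_fill(..., 0)` call in `_embed_text_input_ids`.

```python
def has_oov_mm_tokens(vocab_size: int, mm_token_ids: list[int]) -> bool:
    """True if any multimodal token ID falls outside the embedding table."""
    return any(tok_id >= vocab_size for tok_id in mm_token_ids)


def mask_mm_ids(input_ids: list[int], is_multimodal: list[bool]) -> list[int]:
    """Replace IDs at multimodal positions with 0 so the lookup stays in range."""
    return [0 if flag else tok for tok, flag in zip(input_ids, is_multimodal)]
```

Masking instead of dropping the positions keeps the sequence length unchanged, so anything that tracks per-token offsets still lines up.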

`_language_model_names = []` `class-attribute` `instance-attribute` ¶

Set internally by _mark_language_model.
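The way `get_language_model` turns these recorded names into a single module can be sketched as follows. `common_prefix_sketch` is a stand-in written for this example (the real `common_prefix` utility is imported elsewhere in vLLM and may differ), and `SimpleNamespace` objects play the role of modules.

```python
from types import SimpleNamespace


def common_prefix_sketch(paths: list[list[str]]) -> list[str]:
    """Longest leading run of components shared by every path."""
    prefix: list[str] = []
    for parts in zip(*paths):
        if all(p == parts[0] for p in parts):
            prefix.append(parts[0])
        else:
            break
    return prefix


def resolve_by_names(root: object, names: list[str]) -> object:
    """Walk from `root` along the dotted prefix common to all `names`."""
    mod = root
    for attr in common_prefix_sketch([name.split(".") for name in names]):
        if attr:
            mod = getattr(mod, attr)
    return mod
```

With names such as `language_model.model` and `language_model.lm_head`, the walk stops at `language_model`, which is the object that would be expected to provide `embed_input_ids`.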

`_processor_factory` `class-attribute` ¶

Set internally by MultiModalRegistry.register_processor.

`_tower_model_names = []` `class-attribute` `instance-attribute` ¶

Set internally by _mark_tower_model.
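The stage name that `_mark_tower_model` derives from its `modalities` argument follows a small rule that can be isolated for illustration. `tower_stage_name` is a hypothetical free function that copies the branch from the source listing.

```python
from typing import Union


def tower_stage_name(modalities: Union[set[str], str]) -> str:
    """Derive the stage label used for a skipped tower component."""
    if isinstance(modalities, str):
        modalities = {modalities}

    # Image and video together are treated as a single vision tower.
    if modalities == {"image", "video"}:
        return "vision_tower"

    # Otherwise join the modality names and append "tower".
    return "_".join([*modalities, "tower"])
```

For a set with several members other than the image and video pair, the joined order follows Python's set iteration order, so the label is only predictable for single-modality inputs.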

`requires_raw_input_tokens = False` `class-attribute` ¶

A flag that indicates this model processes input id tokens in their raw form and not input embeddings.

`supports_encoder_tp_data = False` `class-attribute` ¶

A flag that indicates whether this model supports multimodal_config.mm_encoder_tp_mode="data".

`supports_multimodal = True` `class-attribute` ¶

A flag that indicates this model supports multi-modal inputs.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

`supports_multimodal_raw_input_only = False` `class-attribute` ¶

A flag that indicates this model supports multi-modal inputs and processes them in their raw form and not embeddings.

`_mark_composite_model(vllm_config, *, language_targets, tower_targets)` ¶

Composite wrapper over _mark_language_model and _mark_tower_model by modality.

Source code in vllm/model_executor/models/interfaces.py

@contextmanager
def _mark_composite_model(
    self,
    vllm_config: VllmConfig,
    *,
    language_targets: type[nn.Module] | tuple[type[nn.Module], ...],
    tower_targets: dict[str, type[nn.Module] | tuple[type[nn.Module], ...]],
):
    """
    Composite wrapper over `_mark_language_model` and
    `_mark_tower_model` by modality.
    """
    with ExitStack() as stack:
        stack.enter_context(
            self._mark_language_model(
                vllm_config,
                targets=language_targets,
            )
        )

        for modality, modality_targets in tower_targets.items():
            stack.enter_context(
                self._mark_tower_model(
                    vllm_config,
                    modality,
                    targets=modality_targets,
                )
            )

        yield
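
The composite wrapper relies on `contextlib.ExitStack` to enter a variable number of context managers and to exit all of them, in reverse order, when the block ends. The following is a minimal standard-library sketch of that pattern; `mark`, `mark_composite`, and the `events` list are hypothetical stand-ins for illustration, not vLLM APIs.

```python
from contextlib import ExitStack, contextmanager

events: list[str] = []


@contextmanager
def mark(name: str):
    # Hypothetical stand-in for a marking context manager.
    events.append(f"enter {name}")
    try:
        yield
    finally:
        events.append(f"exit {name}")


@contextmanager
def mark_composite(language: str, towers: dict[str, str]):
    # Enter one context for the language model, then one per modality.
    with ExitStack() as stack:
        stack.enter_context(mark(language))
        for modality, tower in towers.items():
            stack.enter_context(mark(f"{modality}:{tower}"))
        yield  # all contexts stay open while the caller's block runs


with mark_composite("language_model", {"image": "vit", "audio": "whisper"}):
    events.append("build submodules")
```

After the `with` block finishes, `events` records the contexts being closed in the reverse of the order in which they were entered.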

`_mark_language_model(vllm_config, *, targets=None)` ¶

Mark each child module that was assigned to this model during this context as a language model component.

Language model components are automatically skipped in --mm-encoder-only mode.

If targets is set, instead include descendants that are an instance of targets, even if they aren't direct children.

Source code in vllm/model_executor/models/interfaces.py

@contextmanager
def _mark_language_model(
    self,
    vllm_config: VllmConfig,
    *,
    targets: type[nn.Module] | tuple[type[nn.Module], ...] | None = None,
):
    """
    Mark each child module that was assigned to this model during this context
    as a language model component.

    Language model components are automatically skipped in `--mm-encoder-only`
    mode.

    If `targets` is set, instead include descendants that are an instance
    of `targets`, even if they aren't direct children.
    """
    from .utils import StageMissingLayer, collect_children, no_init_weights

    mm_config = vllm_config.model_config.multimodal_config

    with collect_children(self, targets=targets) as children_names:  # noqa: SIM117
        with (
            no_init_weights(
                self,
                lambda mod: StageMissingLayer("language_model", mod),
                targets=targets,
            )
            if mm_config.mm_encoder_only
            else nullcontext()
        ):
            yield

    self._language_model_names = children_names

`_mark_tower_model(vllm_config, modalities, *, targets=None)` ¶

Mark each child module that was assigned to this model during this context as a tower model component.

Tower model components are automatically skipped when --limit-mm-per-prompt is set to zero for all of their modalities.

If targets is set, instead include descendants that are an instance of targets, even if they aren't direct children.

Source code in vllm/model_executor/models/interfaces.py

@contextmanager
def _mark_tower_model(
    self,
    vllm_config: VllmConfig,
    modalities: set[str] | str,
    *,
    targets: type[nn.Module] | tuple[type[nn.Module], ...] | None = None,
):
    """
    Mark each child module that was assigned to this model during this context
    as a tower model component.

    Tower model components are automatically skipped when `--limit-mm-per-prompt`
    is set to zero for all of their modalities.

    If `targets` is set, instead include descendants that are an instance
    of `targets`, even if they aren't direct children.
    """
    from .utils import StageMissingLayer, collect_children, no_init_weights

    if isinstance(modalities, str):
        modalities = {modalities}

    if modalities == {"image", "video"}:
        stage_name = "vision_tower"
    else:
        stage_name = "_".join([*modalities, "tower"])

    mm_config = vllm_config.model_config.multimodal_config

    with collect_children(self, targets=targets) as children_names:  # noqa: SIM117
        with (
            no_init_weights(
                self,
                lambda mod: StageMissingLayer(stage_name, mod),
                targets=targets,
            )
            if all(mm_config.get_limit_per_prompt(m) == 0 for m in modalities)
            else nullcontext()
        ):
            yield

    self._tower_model_names = children_names

`configure_mm_token_handling(vocab_size, mm_token_ids)` ¶

Check if any multimodal tokens are out of vocabulary. If so, we will explicitly mask all multimodal tokens out when computing text embeddings, since the multimodal embeddings will be scattered over the results.

Source code in vllm/model_executor/models/interfaces.py

def configure_mm_token_handling(self, vocab_size: int, mm_token_ids: list[int]):
    """Check if any multimodal tokens are out of vocabulary. If so, we will
    explicitly mask all multimodal tokens out when computing text embeddings,
    since the multimodal embeddings will be scattered over the results.
    """
    self._has_oov_mm_tokens = any(tok_id >= vocab_size for tok_id in mm_token_ids)
    logger.info(
        "Contains out of vocabulary multimodal tokens? %s",
        self._has_oov_mm_tokens,
    )
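
The check itself is a single comparison: a token ID is out of vocabulary when it is greater than or equal to the vocabulary size, since valid IDs run from 0 to `vocab_size - 1`. A standalone sketch follows; the function name and the numeric values are made up for illustration.

```python
def has_oov_mm_tokens(vocab_size: int, mm_token_ids: list[int]) -> bool:
    # Valid embedding indices are 0 .. vocab_size - 1.
    return any(tok_id >= vocab_size for tok_id in mm_token_ids)


# Hypothetical values: a 32000-entry vocabulary and two placeholder token IDs.
in_vocab = has_oov_mm_tokens(32000, [31998, 31999])
out_of_vocab = has_oov_mm_tokens(32000, [31999, 32000])
```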

`embed_input_ids(input_ids, multimodal_embeddings=None, *, is_multimodal=None)` ¶

embed_input_ids(input_ids: Tensor) -> Tensor

embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings: MultiModalEmbeddings,
    *,
    is_multimodal: torch.Tensor,
) -> Tensor

Apply token embeddings to input_ids.

If multimodal_embeddings is passed, scatter them into input_ids according to the mask is_multimodal.

NOTE: If this model has multimodal tokens that are out of vocabulary (i.e., self._has_oov_mm_tokens=True), the input_ids will be copied and masked to 0 during the forward pass for the text embeddings.

Source code in vllm/model_executor/models/interfaces.py

def embed_input_ids(
    self,
    input_ids: Tensor,
    multimodal_embeddings: MultiModalEmbeddings | None = None,
    *,
    is_multimodal: Tensor | None = None,
) -> Tensor:
    """
    Apply token embeddings to `input_ids`.

    If `multimodal_embeddings` is passed, scatter them into
    `input_ids` according to the mask `is_multimodal`.

    NOTE: If this model has multimodal tokens that are out of vocabulary
    (i.e., self._has_oov_mm_tokens=True), the input_ids will be copied
    and masked to 0 during the forward pass for the text embeddings.
    """
    from .utils import _merge_multimodal_embeddings

    # Get text embeddings first; multimodal embeddings will clobber
    # any invalid contents in the indices of multimodal embeddings
    # for the in vocabulary and out of vocabulary case.
    inputs_embeds = self._embed_text_input_ids(
        input_ids,
        self.get_language_model().embed_input_ids,
        is_multimodal=is_multimodal,
    )

    if multimodal_embeddings is None or len(multimodal_embeddings) == 0:
        return inputs_embeds

    return _merge_multimodal_embeddings(
        inputs_embeds=inputs_embeds,
        multimodal_embeddings=multimodal_embeddings,
        is_multimodal=_require_is_multimodal(is_multimodal),
    )
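
To make the mask-then-scatter behavior concrete, here is a plain-Python analogue that uses nested lists instead of tensors. It is a sketch of the behavior described in the docstring, not the vLLM implementation: `embed_text`, `merge_multimodal`, the embedding table, and all values are hypothetical.

```python
def embed_text(input_ids, table, is_multimodal=None, has_oov_mm_tokens=False):
    # When placeholder IDs may exceed the table size, replace them with 0
    # so that the lookup never indexes out of range.
    if has_oov_mm_tokens and is_multimodal is not None:
        input_ids = [0 if mm else tok for tok, mm in zip(input_ids, is_multimodal)]
    return [list(table[tok]) for tok in input_ids]


def merge_multimodal(inputs_embeds, multimodal_embeddings, is_multimodal):
    # Flatten the per-item embeddings into one sequence of rows, in prompt order.
    mm_rows = [row for item in multimodal_embeddings for row in item]
    if len(mm_rows) != sum(is_multimodal):
        raise ValueError("number of multimodal rows must match the mask")
    rows = iter(mm_rows)
    # Overwrite each masked position with the next multimodal row.
    return [
        next(rows) if mm else text_row
        for text_row, mm in zip(inputs_embeds, is_multimodal)
    ]


table = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # vocabulary of size 3
input_ids = [1, 9, 9, 2]                      # 9 is an OOV placeholder ID
mask = [False, True, True, False]
image_embeds = [[[7.0, 7.0], [8.0, 8.0]]]     # one item that spans two positions

text = embed_text(input_ids, table, is_multimodal=mask, has_oov_mm_tokens=True)
merged = merge_multimodal(text, image_embeds, mask)
```

The rows written at the masked positions by the text lookup are throwaway values, since the merge step overwrites them.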

`embed_multimodal(**kwargs)` ¶

Returns multimodal embeddings generated from multimodal kwargs to be merged with text embeddings.

Note

The returned multimodal embeddings must be in the same order as the appearances of their corresponding multimodal data item in the input prompt.

Source code in vllm/model_executor/models/interfaces.py

def embed_multimodal(self, **kwargs: object) -> MultiModalEmbeddings:
    """
    Returns multimodal embeddings generated from multimodal kwargs
    to be merged with text embeddings.

    Note:
        The returned multimodal embeddings must be in the same order as
        the appearances of their corresponding multimodal data item in the
        input prompt.
    """
    ...

`get_language_model()` ¶

Returns the underlying language model used for text generation.

This is typically the torch.nn.Module instance responsible for processing the merged multimodal embeddings and producing hidden states.

Returns:

VllmModel –

torch.nn.Module: The core language model component.

Source code in vllm/model_executor/models/interfaces.py

def get_language_model(self) -> VllmModel:
    """
    Returns the underlying language model used for text generation.

    This is typically the `torch.nn.Module` instance responsible for
    processing the merged multimodal embeddings and producing hidden states.

    Returns:
        torch.nn.Module: The core language model component.
    """
    # Cached
    if self in _language_model_by_module:
        return _language_model_by_module[self]

    if self._language_model_names:
        mod = self
        for attr in common_prefix(
            [name.split(".") for name in self._language_model_names]
        ):
            if attr:
                mod = getattr(mod, attr)

        if mod is not self and hasattr(mod, "embed_input_ids"):
            _language_model_by_module[self] = mod
            return mod

    # Fallback
    for mod in self.children():
        if hasattr(mod, "embed_input_ids"):
            _language_model_by_module[self] = mod
            return mod

    raise NotImplementedError(
        f"No language model found in {type(self).__name__}! "
        "You should initialize it via `_mark_language_model`, "
        "and make sure `embed_input_ids` is implemented."
    )
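
The lookup walks down the attribute path shared by every recorded language model name. The sketch below re-implements that idea with a hypothetical `common_prefix` helper (vLLM's own helper may differ in detail) and uses `SimpleNamespace` objects in place of real modules.

```python
from types import SimpleNamespace


def common_prefix(paths: list[list[str]]) -> list[str]:
    # Hypothetical re-implementation: the longest shared leading run of parts.
    prefix: list[str] = []
    for parts in zip(*paths):
        if all(part == parts[0] for part in parts):
            prefix.append(parts[0])
        else:
            break
    return prefix


# Stand-in object tree; the attribute names are made up for illustration.
root = SimpleNamespace(
    language_model=SimpleNamespace(model="decoder", lm_head="head"),
)
names = ["language_model.model", "language_model.lm_head"]

mod = root
for attr in common_prefix([name.split(".") for name in names]):
    mod = getattr(mod, attr)
```

Here both names share only the `language_model` component, so `mod` ends up pointing at `root.language_model`.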

`get_num_mm_connector_tokens(num_vision_tokens)` ¶

Implement this function to enable LoRA support for the connector module of the multi-modal model. Given the number of vision tokens, output the number of multi-modal connector tokens.

Source code in vllm/model_executor/models/interfaces.py

def get_num_mm_connector_tokens(self, num_vision_tokens: int) -> int:
    """
    Implement this function to enable LoRA support
    for the connector module of the multi-modal model.
    Given the number of vision tokens, output the number of
    multi-modal connector tokens.
    """
    ...

`get_num_mm_encoder_tokens(num_image_tokens)` ¶

Implement this function to enable LoRA support for the tower module of the multi-modal model. Given the number of image tokens, output the number of multi-modal encoder tokens.

Source code in vllm/model_executor/models/interfaces.py

def get_num_mm_encoder_tokens(self, num_image_tokens: int) -> int:
    """
    Implement this function to enable LoRA support
    for the tower module of the multi-modal model.
    Given the number of image tokens, output the number of
    multi-modal encoder tokens.
    """
    ...

`get_placeholder_str(modality, i)` `classmethod` ¶

Get the placeholder text for the ith modality item in the prompt.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_placeholder_str(cls, modality: str, i: int) -> str | None:
    """
    Get the placeholder text for the `i`th `modality` item in the prompt.
    """
    ...

`SupportsPP` ¶

Bases: Protocol

The interface required for all models that support pipeline parallel.

Methods:

forward –

Accept IntermediateTensors when PP rank > 0.
make_empty_intermediate_tensors –

Called when PP rank > 0 for profiling purposes.

Attributes:

supports_pp (Literal[True]) –

A flag that indicates this model supports pipeline parallel.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsPP(Protocol):
    """The interface required for all models that support pipeline parallel."""

    supports_pp: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports pipeline parallel.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    def make_empty_intermediate_tensors(
        self,
        batch_size: int,
        dtype: torch.dtype,
        device: torch.device,
    ) -> IntermediateTensors:
        """Called when PP rank > 0 for profiling purposes."""
        ...

    def forward(
        self,
        input_ids: Tensor | None,
        positions: Tensor,
        *,
        intermediate_tensors: IntermediateTensors | None,
    ) -> IntermediateTensors | None:
        """
        Accept [`IntermediateTensors`][vllm.sequence.IntermediateTensors] when
        PP rank > 0.

        Return [`IntermediateTensors`][vllm.sequence.IntermediateTensors] only
        for the last PP rank.
        """
        ...

`supports_pp = True` `class-attribute` ¶

A flag that indicates this model supports pipeline parallel.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

`forward(input_ids, positions, *, intermediate_tensors)` ¶

Accept IntermediateTensors when PP rank > 0.

Return IntermediateTensors only for the last PP rank.

Source code in vllm/model_executor/models/interfaces.py

def forward(
    self,
    input_ids: Tensor | None,
    positions: Tensor,
    *,
    intermediate_tensors: IntermediateTensors | None,
) -> IntermediateTensors | None:
    """
    Accept [`IntermediateTensors`][vllm.sequence.IntermediateTensors] when
    PP rank > 0.

    Return [`IntermediateTensors`][vllm.sequence.IntermediateTensors] only
    for the last PP rank.
    """
    ...

`make_empty_intermediate_tensors(batch_size, dtype, device)` ¶

Called when PP rank > 0 for profiling purposes.

Source code in vllm/model_executor/models/interfaces.py

def make_empty_intermediate_tensors(
    self,
    batch_size: int,
    dtype: torch.dtype,
    device: torch.device,
) -> IntermediateTensors:
    """Called when PP rank > 0 for profiling purposes."""
    ...

`SupportsTranscription` ¶

Bases: Protocol

The interface required for all models that support transcription.

Methods:

get_generation_prompt –

Get the prompt for the ASR model.
get_language_detection_prompt –

Return a prompt that triggers language detection.
get_language_token_ids –

Return token IDs that represent valid language tokens.
get_num_audio_tokens –

Map from audio duration to number of audio tokens produced by the ASR model, without running a forward pass.
get_speech_to_text_config –

Get the speech to text config for the ASR model.
parse_language_detection_output –

Parse the detected language from model output token IDs.
post_process_output –

Post-process the raw model output text.
validate_language –

Ensure the language specified in the transcription request is a valid ISO 639-1 language code.

Attributes:

no_space_languages (set[str]) –

Languages that don't need a space between words.
supports_explicit_language_detection (bool) –

Transcription models that require an explicit language detection step should set this to True.
supports_segment_timestamp (bool) –

Enables the segment timestamp option for supported models by setting this to True.
supports_transcription_only (bool) –

Transcription models can opt out of text generation by setting this to True.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsTranscription(Protocol):
    """The interface required for all models that support transcription."""

    # Mapping from ISO639_1 language codes: language names
    supported_languages: ClassVar[Mapping[str, str]]

    supports_transcription: ClassVar[Literal[True]] = True

    supports_transcription_only: ClassVar[bool] = False
    """
    Transcription models can opt out of text generation by setting this to
    `True`.
    """
    supports_segment_timestamp: ClassVar[bool] = False
    """
    Enables the segment timestamp option for supported models by setting this to `True`.
    """

    supports_explicit_language_detection: ClassVar[bool] = False
    """
    Transcription models that require an explicit language detection step
    (e.g. Whisper needs a separate forward pass to predict the language
    token) should set this to ``True`` and implement
    :meth:`get_language_detection_prompt` and
    :meth:`parse_language_detection_output` and
    :meth:`get_language_token_ids`.
    """

    no_space_languages: ClassVar[set[str]] = {"ja", "zh"}
    """
    Languages that don't need a space between words.
    For example, Japanese (ja) and Chinese (zh) don't need a space between words.
    """

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # language codes in supported_languages
        # that don't exist in the full language map
        invalid = set(cls.supported_languages) - set(LANGUAGES.keys())
        if invalid:
            raise ValueError(
                f"{cls.__name__}.supported_languages contains invalid "
                f"language codes: {sorted(invalid)}.\n"
                f"Valid choices are: {sorted(LANGUAGES.keys())}"
            )

    @classmethod
    def get_generation_prompt(
        cls,
        stt_params: SpeechToTextParams,
    ) -> PromptType:
        """Get the prompt for the ASR model.
        The model has control over the construction, as long as it
        returns a valid PromptType."""
        ...

    @classmethod
    def get_other_languages(cls) -> Mapping[str, str]:
        # other possible language codes from the whisper map
        return {k: v for k, v in LANGUAGES.items() if k not in cls.supported_languages}

    @classmethod
    def validate_language(cls, language: str | None) -> str | None:
        """
        Ensure the language specified in the transcription request
        is a valid ISO 639-1 language code. If the request language is
        valid, but not natively supported by the model, trigger a
        warning (but not an exception).
        """
        if language is None or language in cls.supported_languages:
            return language
        elif language in cls.get_other_languages():
            logger.warning(
                "Language %r is not natively supported by %s; "
                "results may be less accurate. Supported languages: %r",
                language,
                cls.__name__,
                list(cls.supported_languages.keys()),
            )
            return language
        else:
            raise ValueError(
                f"Unsupported language: {language!r}.  Must be one of "
                f"{list(cls.supported_languages.keys())}."
            )

    @classmethod
    def get_speech_to_text_config(
        cls, model_config: ModelConfig, task_type: Literal["transcribe", "translate"]
    ) -> SpeechToTextConfig:
        """Get the speech to text config for the ASR model."""
        ...

    @classmethod
    def get_num_audio_tokens(
        cls,
        audio_duration_s: float,
        stt_config: SpeechToTextConfig,
        model_config: ModelConfig,
    ) -> int | None:
        """
        Map from audio duration to number of audio tokens produced by the ASR
        model, without running a forward pass.
        This is used for estimating the amount of processing for this audio.
        """
        return None

    @classmethod
    def post_process_output(cls, text: str) -> str:
        """
        Post-process the raw model output text.

        Some ASR models output structured formats (e.g., language tags,
        special tokens) that need to be stripped before returning to the user.

        Args:
            text: Raw decoded text from the model.

        Returns:
            Cleaned transcription text.
        """
        return text

    @classmethod
    def get_language_detection_prompt(
        cls,
        audio: np.ndarray,
        stt_config: SpeechToTextConfig,
    ) -> PromptType:
        """Return a prompt that triggers language detection.

        Only needs to be implemented when
        ``supports_explicit_language_detection`` is ``True``.
        """
        raise NotImplementedError

    @classmethod
    def parse_language_detection_output(
        cls,
        token_ids: list[int],
        tokenizer: object,
    ) -> str:
        """Parse the detected language from model output token IDs.

        Only needs to be implemented when
        ``supports_explicit_language_detection`` is ``True``.
        """
        raise NotImplementedError

    @classmethod
    def get_language_token_ids(
        cls,
        tokenizer: object,
    ) -> list[int] | None:
        """Return token IDs that represent valid language tokens.

        Used to constrain language detection to only produce valid language tokens.

        Only needs to be implemented when
        ``supports_explicit_language_detection`` is ``True``.
        """
        raise NotImplementedError
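
The `__init_subclass__` hook in the listing above validates `supported_languages` once, at class definition time, so a misconfigured model fails on import and not on the first request. A self-contained sketch of that pattern follows; `LANGUAGES` here is a small hypothetical subset, and the class names are illustrative.

```python
# Hypothetical subset of the full language map.
LANGUAGES = {"en": "english", "de": "german", "ja": "japanese"}


class TranscriptionBase:
    supported_languages: dict[str, str] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Codes declared by the subclass that are missing from the full map.
        invalid = set(cls.supported_languages) - set(LANGUAGES)
        if invalid:
            raise ValueError(
                f"{cls.__name__}.supported_languages contains invalid "
                f"language codes: {sorted(invalid)}"
            )


class GoodModel(TranscriptionBase):
    supported_languages = {"en": "english"}


try:
    class BadModel(TranscriptionBase):
        supported_languages = {"xx": "unknown"}
except ValueError as exc:
    error_message = str(exc)
else:
    error_message = None
```

Defining `BadModel` raises before the class object is ever bound to a name, which is why the check runs inside a `try` block here.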

`no_space_languages = {'ja', 'zh'}` `class-attribute` ¶

Languages that don't need a space between words. For example, Japanese (ja) and Chinese (zh) don't need a space between words.

`supports_explicit_language_detection = False` `class-attribute` ¶

Transcription models that require an explicit language detection step (e.g. Whisper needs a separate forward pass to predict the language token) should set this to True and implement get_language_detection_prompt, parse_language_detection_output, and get_language_token_ids.

`supports_segment_timestamp = False` `class-attribute` ¶

Enables the segment timestamp option for supported models by setting this to True.

`supports_transcription_only = False` `class-attribute` ¶

Transcription models can opt out of text generation by setting this to True.

`get_generation_prompt(stt_params)` `classmethod` ¶

Get the prompt for the ASR model. The model has control over the construction, as long as it returns a valid PromptType.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_generation_prompt(
    cls,
    stt_params: SpeechToTextParams,
) -> PromptType:
    """Get the prompt for the ASR model.
    The model has control over the construction, as long as it
    returns a valid PromptType."""
    ...

`get_language_detection_prompt(audio, stt_config)` `classmethod` ¶

Return a prompt that triggers language detection.

Only needs to be implemented when supports_explicit_language_detection is True.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_language_detection_prompt(
    cls,
    audio: np.ndarray,
    stt_config: SpeechToTextConfig,
) -> PromptType:
    """Return a prompt that triggers language detection.

    Only needs to be implemented when
    ``supports_explicit_language_detection`` is ``True``.
    """
    raise NotImplementedError

`get_language_token_ids(tokenizer)` `classmethod` ¶

Return token IDs that represent valid language tokens.

Used to constrain language detection to only produce valid language tokens.

Only needs to be implemented when supports_explicit_language_detection is True.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_language_token_ids(
    cls,
    tokenizer: object,
) -> list[int] | None:
    """Return token IDs that represent valid language tokens.

    Used to constrain language detection to only produce valid language tokens.

    Only needs to be implemented when
    ``supports_explicit_language_detection`` is ``True``.
    """
    raise NotImplementedError

`get_num_audio_tokens(audio_duration_s, stt_config, model_config)` `classmethod` ¶

Map from audio duration to number of audio tokens produced by the ASR model, without running a forward pass. This is used for estimating the amount of processing for this audio.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_num_audio_tokens(
    cls,
    audio_duration_s: float,
    stt_config: SpeechToTextConfig,
    model_config: ModelConfig,
) -> int | None:
    """
    Map from audio duration to number of audio tokens produced by the ASR
    model, without running a forward pass.
    This is used for estimating the amount of processing for this audio.
    """
    return None
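
The default returns None, meaning no estimate is available. A model that emits audio tokens at a fixed rate could override it with simple arithmetic, as in the sketch below. The function name and the rate of 50 tokens per second are assumptions chosen for illustration, not values taken from vLLM or any specific model.

```python
import math


def estimate_audio_tokens(
    audio_duration_s: float,
    tokens_per_second: float = 50.0,  # hypothetical fixed encoder rate
) -> int:
    # Round up so that a partial final frame still counts as one token.
    return math.ceil(audio_duration_s * tokens_per_second)


thirty_seconds = estimate_audio_tokens(30.0)
short_clip = estimate_audio_tokens(0.01)
```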

`get_speech_to_text_config(model_config, task_type)` `classmethod` ¶

Get the speech to text config for the ASR model.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_speech_to_text_config(
    cls, model_config: ModelConfig, task_type: Literal["transcribe", "translate"]
) -> SpeechToTextConfig:
    """Get the speech to text config for the ASR model."""
    ...

`parse_language_detection_output(token_ids, tokenizer)` `classmethod` ¶

Parse the detected language from model output token IDs.

Only needs to be implemented when supports_explicit_language_detection is True.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def parse_language_detection_output(
    cls,
    token_ids: list[int],
    tokenizer: object,
) -> str:
    """Parse the detected language from model output token IDs.

    Only needs to be implemented when
    ``supports_explicit_language_detection`` is ``True``.
    """
    raise NotImplementedError

`post_process_output(text)` `classmethod` ¶

Post-process the raw model output text.

Some ASR models output structured formats (e.g., language tags, special tokens) that need to be stripped before returning to the user.

Parameters:

text ¶
(str) –

Raw decoded text from the model.

Returns:

str –

Cleaned transcription text.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def post_process_output(cls, text: str) -> str:
    """
    Post-process the raw model output text.

    Some ASR models output structured formats (e.g., language tags,
    special tokens) that need to be stripped before returning to the user.

    Args:
        text: Raw decoded text from the model.

    Returns:
        Cleaned transcription text.
    """
    return text
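
The default implementation returns the text unchanged. As a hedged illustration of an override, the sketch below strips tags of the form `<|...|>` with a regular expression. The tag format and the sample string are assumptions, since each ASR model defines its own output structure.

```python
import re

# Hypothetical tag format: anything wrapped in "<|" and "|>".
_SPECIAL_TAG = re.compile(r"<\|[^|]*\|>")


def post_process_output(text: str) -> str:
    # Drop the tags, then trim the whitespace they leave behind.
    return _SPECIAL_TAG.sub("", text).strip()


cleaned = post_process_output("<|en|><|transcribe|> Hello world.<|endoftext|>")
```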

`validate_language(language)` `classmethod` ¶

Ensure the language specified in the transcription request is a valid ISO 639-1 language code. If the request language is valid, but not natively supported by the model, trigger a warning (but not an exception).

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def validate_language(cls, language: str | None) -> str | None:
    """
    Ensure the language specified in the transcription request
    is a valid ISO 639-1 language code. If the request language is
    valid, but not natively supported by the model, trigger a
    warning (but not an exception).
    """
    if language is None or language in cls.supported_languages:
        return language
    elif language in cls.get_other_languages():
        logger.warning(
            "Language %r is not natively supported by %s; "
            "results may be less accurate. Supported languages: %r",
            language,
            cls.__name__,
            list(cls.supported_languages.keys()),
        )
        return language
    else:
        raise ValueError(
            f"Unsupported language: {language!r}.  Must be one of "
            f"{list(cls.supported_languages.keys())}."
        )

`VllmModelForPooling` ¶

Bases: VllmModel[T_co], Protocol[T_co]

The interface required for all pooling models in vLLM.

Attributes:

attn_type (AttnTypeStr) –

Indicates the vllm.config.model.ModelConfig.attn_type to use by default.
default_seq_pooling_type (SequencePoolingType) –

Indicates the vllm.config.pooler.PoolerConfig.seq_pooling_type
default_tok_pooling_type (TokenPoolingType) –

Indicates the vllm.config.pooler.PoolerConfig.tok_pooling_type
is_pooling_model (Literal[True]) –

A flag that indicates this model supports pooling.
pooler (Pooler) –

The pooler is only called on TP rank 0.
score_type (ScoreType) –

Indicates the vllm.config.model.ModelConfig.score_type to use by default.

Source code in vllm/model_executor/models/interfaces_base.py

@runtime_checkable
class VllmModelForPooling(VllmModel[T_co], Protocol[T_co]):
    """The interface required for all pooling models in vLLM."""

    is_pooling_model: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports pooling.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    default_seq_pooling_type: ClassVar[SequencePoolingType] = "LAST"
    """
    Indicates the [vllm.config.pooler.PoolerConfig.seq_pooling_type][]
    to use by default.

    You can use the
    [vllm.model_executor.models.interfaces_base.default_pooling_type][]
    decorator to conveniently set this field.
    """

    default_tok_pooling_type: ClassVar[TokenPoolingType] = "ALL"
    """
    Indicates the [vllm.config.pooler.PoolerConfig.tok_pooling_type][]
    to use by default.

    You can use the
    [vllm.model_executor.models.interfaces_base.default_pooling_type][]
    decorator to conveniently set this field.
    """

    attn_type: ClassVar[AttnTypeStr] = "decoder"
    """
    Indicates the
    [vllm.config.model.ModelConfig.attn_type][]
    to use by default.

    You can use the
    [vllm.model_executor.models.interfaces_base.attn_type][]
    decorator to conveniently set this field.
    """

    score_type: ClassVar[ScoreType] = "bi-encoder"
    """
    Indicates the
    [vllm.config.model.ModelConfig.score_type][]
    to use by default.

    Scoring API handles score/rerank for:\n
    - "classify" task (score_type: cross-encoder models)\n
    - "embed" task (score_type: bi-encoder models)\n
    - "token_embed" task (score_type: late interaction models)\n

    score_type defaults to bi-encoder, then the Score API uses the "embed" task.\n
    If you set score_type to cross-encoder via 
    [vllm.model_executor.models.interfaces.SupportsCrossEncoding][], 
    then the Score API uses the "score" task.\n
    If you set score_type to late-interaction via 
    [vllm.model_executor.models.interfaces.SupportsLateInteraction][], 
    then the Score API uses the "token_embed" task.\n
    """

    pooler: Pooler
    """The pooler is only called on TP rank 0."""

`attn_type = 'decoder'` `class-attribute` ¶

Indicates the vllm.config.model.ModelConfig.attn_type to use by default.

You can use the vllm.model_executor.models.interfaces_base.attn_type decorator to conveniently set this field.

`default_seq_pooling_type = 'LAST'` `class-attribute` ¶

Indicates the vllm.config.pooler.PoolerConfig.seq_pooling_type to use by default.

You can use the vllm.model_executor.models.interfaces_base.default_pooling_type decorator to conveniently set this field.

`default_tok_pooling_type = 'ALL'` `class-attribute` ¶

Indicates the vllm.config.pooler.PoolerConfig.tok_pooling_type to use by default.

You can use the vllm.model_executor.models.interfaces_base.default_pooling_type decorator to conveniently set this field.

`is_pooling_model = True` `class-attribute` ¶

A flag that indicates this model supports pooling.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

`pooler` `instance-attribute` ¶

The pooler is only called on TP rank 0.

`score_type = 'bi-encoder'` `class-attribute` ¶

Indicates the vllm.config.model.ModelConfig.score_type to use by default.

Scoring API handles score/rerank for:

"classify" task (score_type: cross-encoder models)
"embed" task (score_type: bi-encoder models)
"token_embed" task (score_type: late interaction models)

score_type defaults to bi-encoder, in which case the Score API uses the "embed" task.

If you set score_type to cross-encoder via vllm.model_executor.models.interfaces.SupportsCrossEncoding, then the Score API uses the "score" task.

If you set score_type to late-interaction via vllm.model_executor.models.interfaces.SupportsLateInteraction, then the Score API uses the "token_embed" task.

`VllmModelForTextGeneration` ¶

Bases: VllmModel[T], Protocol[T]

The interface required for all generative models in vLLM.

Methods:

compute_logits –

Return None if TP rank > 0.

Source code in vllm/model_executor/models/interfaces_base.py

@runtime_checkable
class VllmModelForTextGeneration(VllmModel[T], Protocol[T]):
    """The interface required for all generative models in vLLM."""

    def compute_logits(
        self,
        hidden_states: T,
    ) -> T | None:
        """Return `None` if TP rank > 0."""
        ...

`compute_logits(hidden_states)` ¶

Return None if TP rank > 0.

Source code in vllm/model_executor/models/interfaces_base.py

def compute_logits(
    self,
    hidden_states: T,
) -> T | None:
    """Return `None` if TP rank > 0."""
    ...


default_seq_pooling_type = 'LAST' class-attribute ¶

default_tok_pooling_type = 'ALL' class-attribute ¶

is_pooling_model = True class-attribute ¶

pooler instance-attribute ¶

score_type = 'bi-encoder' class-attribute ¶

VllmModelForTextGeneration ¶

compute_logits(hidden_states) ¶
