
vllm.model_executor.models

Modules:

  • AXK1

    Inference-only A.X K1 model.

  • adapters
  • afmoe

    Inference-only AfMoE model compatible with HuggingFace weights.

  • apertus

    Inference-only Apertus model compatible with HuggingFace weights.

  • arcee
  • arctic

    Inference-only Snowflake Arctic model.

  • aria
  • audioflamingo3
  • aya_vision
  • bagel

    Inference-only BAGEL model compatible with HuggingFace weights.

  • baichuan

    Inference-only BaiChuan model compatible with HuggingFace weights.

  • bailing_moe

    Inference-only BailingMoE model compatible with HuggingFace weights.

  • bailing_moe_linear
  • bamba

    Inference-only Bamba model.

  • bee
  • bert
  • blip

    Minimal implementation of BlipVisionModel intended to be only used

  • blip2
  • bloom

    Inference-only BLOOM model compatible with HuggingFace weights.

  • chameleon
  • chatglm

    Inference-only ChatGLM model compatible with THUDM weights.

  • cheers

    Inference-only Cheers (UMM) model compatible with HuggingFace weights.

  • clip
  • cohere2_moe
  • cohere2_vision

    Command-A-Vision (Cohere2Vision) multimodal model implementation for vLLM.

  • cohere_asr
  • cohere_eagle
  • colbert

    ColBERT late interaction model for retrieval and reranking.

  • colmodernvbert

    ColModernVBERT: multimodal late-interaction retrieval model.

  • colpali

    ColPali late interaction model for multi-modal retrieval and reranking.

  • colqwen3

    ColQwen3 late interaction model for multi-modal retrieval and reranking.

  • colqwen3_5

    ColQwen3.5 late interaction model for multi-modal retrieval and reranking.

  • commandr

    PyTorch Cohere model.

  • config
  • conformer_encoder

    Shared Conformer encoder components for FireRedASR2 and FireRedLID.

  • cosmos3
  • dbrx
  • deepencoder
  • deepencoder2
  • deepseek_eagle3

    Eagle3 speculative decoding model for DeepseekV2/V3 with MLP (no MoE).

  • deepseek_mtp
  • deepseek_ocr

    Inference-only Deepseek-OCR model compatible with HuggingFace weights.

  • deepseek_ocr2

    Inference-only Deepseek-OCR model compatible with HuggingFace weights.

  • deepseek_v2

    Inference-only DeepseekV2/DeepseekV3 model.

  • deepseek_vl2

    Inference-only Deepseek-VL2 model compatible with HuggingFace weights.

  • dots1

    Inference-only dots1 model.

  • dots_ocr
  • eagle2_5_vl
  • ernie45

    Inference-only Ernie model compatible with HuggingFace weights.

  • ernie45_moe

    Inference-only ErnieMoE model compatible with HuggingFace weights.

  • ernie45_vl

    Inference-only Ernie VL model compatible with HuggingFace weights.

  • ernie45_vl_moe

    Inference-only Ernie VL model compatible with HuggingFace weights.

  • ernie_mtp

    Inference-only Ernie-MTP model.

  • exaone

    Inference-only Exaone model compatible with HuggingFace weights.

  • exaone4

    Inference-only Exaone model compatible with HuggingFace weights.

  • exaone4_5

    Inference-only EXAONE-4.5 model compatible with HuggingFace weights.

  • exaone4_5_mtp

    Inference-only EXAONE-4_5 MTP model.

  • exaone_moe

    Inference-only K-EXAONE-236B-A22B model compatible with HuggingFace weights.

  • exaone_moe_mtp

    Inference-only ExaoneMoe MTP model.

  • extract_hidden_states

    Hidden States Extractor Model.

  • fairseq2_llama

    Llama model for fairseq2 weights.

  • falcon

    PyTorch Falcon model.

  • falcon_h1

    Inference-only FalconH1 model.

  • fireredasr2
  • fireredlid

    FireRedLID – Language Identification model adapted for vLLM.

  • flex_olmo

    Inference-only FlexOlmo model compatible with HuggingFace weights.

  • funasr
  • funaudiochat

    Inference-only FunAudioChat model compatible with HuggingFace weights.

  • fuyu

    PyTorch Fuyu model.

  • gemma

    Inference-only Gemma model compatible with HuggingFace weights.

  • gemma3_mm
  • gemma3n
  • gemma3n_audio_utils

    Lightweight utility functions for Gemma3n audio processing.

  • gemma3n_mm
  • gemma4

    Gemma 4 model implementation for vLLM.

  • gemma4_mm

    Gemma 4 multimodal model (image + audio + video support).

  • gemma4_mtp

    Inference-only Gemma4 MTP (Multi-Token Prediction) model.

  • gemma4_unified

    Gemma 4 Unified multimodal model (encoder-free image + audio + video).

  • glm

    Inference-only HF format GLM-4 model compatible with THUDM weights.

  • glm4

    Inference-only GLM-4-0414 model compatible with HuggingFace weights.

  • glm4_1v

    Inference-only GLM-4.1V & GLM-4.6V-Flash, AutoGLM-Phone-9B model

  • glm4_moe

    Inference-only GLM-4.5, GLM-4.6, GLM-4.7 model

  • glm4_moe_lite

    Inference-only GLM-4.7-Flash model compatible with HuggingFace weights.

  • glm4_moe_lite_mtp

    Inference-only GLM-4.7-Flash MTP model compatible with HuggingFace weights.

  • glm4_moe_mtp

    Inference-only GLM-4.5, GLM-4.6, GLM-4.7 MTP

  • glm4v

    Inference-only CogAgent model compatible with THUDM weights.

  • glm_ocr

    Inference-only GLM-OCR model compatible with HuggingFace weights.

  • glm_ocr_mtp

    Inference-only GLM-OCR MTP model compatible with HuggingFace weights.

  • glmasr
  • glmasr_utils
  • gpt2

    Inference-only GPT-2 model compatible with HuggingFace weights.

  • gpt_bigcode

    Inference-only GPTBigCode model compatible with HuggingFace weights.

  • gpt_j

    Inference-only GPT-J model compatible with HuggingFace weights.

  • gpt_neox

    Inference-only GPT-NeoX model compatible with HuggingFace weights.

  • granite

    Inference-only IBM Granite model compatible with HuggingFace weights.

  • granite4_vision

    vLLM implementation of Granite 4 Vision.

  • granite_speech

    Inference-only IBM Granite speech model.

  • granite_speech_plus

    Inference-only IBM Granite Speech Plus model.

  • granitemoe

    Inference-only GraniteMoe model.

  • granitemoehybrid

    Inference-only GraniteMoeHybrid model.

  • granitemoeshared

    Inference-only GraniteMoeShared model.

  • gritlm
  • grok1

    Inference-only Grok (Grok1/Grok2) model.

  • hunyuan_v1

    Inference-only HunYuan model compatible with HuggingFace weights.

  • hunyuan_vision

    Inference-only HunYuan-VL model compatible with HuggingFace weights.

  • hy_v3

    Inference-only HY model compatible with HuggingFace weights.

  • hy_v3_mtp

    Inference-only HY V3 MTP model compatible with HuggingFace weights.

  • hyperclovax

    Inference-only HyperCLOVAX model compatible with HuggingFace weights.

  • hyperclovax_vision
  • hyperclovax_vision_v2

    HyperCLOVAX V2 (32B Think Model) Implementation.

  • idefics2_vision_model

    PyTorch Idefics2 model.

  • idefics3

    Inference-only Idefics3 model compatible with HuggingFace weights.

  • interfaces
  • interfaces_base
  • intern_vit
  • interns1
  • interns1_pro

    Inference-only InternS1Pro model compatible with HuggingFace weights.

  • interns1_vit
  • internvl
  • iquest_loopcoder

    Inference-only LoopCoder model compatible with HuggingFace weights.

  • isaac
  • jais2

    Inference-only Jais2 model compatible with HuggingFace weights.

  • jamba

    Inference-only Jamba model.

  • jina
  • kanana_v
  • keye
  • keye_vl1_5
  • kimi_audio

    Inference-only Kimi-Audio model compatible with HuggingFace weights.

  • kimi_k25

    Kimi-K2.5 Model Implementation for vLLM.

  • kimi_k25_vit

    Vision tower implementation for Kimi-K2.5 model.

  • kimi_linear
  • kimi_vl
  • laguna

    Inference-only Laguna model compatible with HuggingFace weights.

  • lfm2
  • lfm2_moe
  • lfm2_siglip2

    Implementation of Siglip2VisionModel intended to be only used

  • lfm2_vl
  • llama

    Inference-only LLaMA model compatible with HuggingFace weights.

  • llama4

    Inference-only LLaMA model compatible with HuggingFace weights.

  • llama4_eagle
  • llava
  • llava_next
  • llava_next_video
  • llava_onevision
  • longcat_flash

    Inference-only Flash model compatible with HuggingFace weights.

  • longcat_flash_mtp
  • mamba

    PyTorch MAMBA model.

  • mamba2

    PyTorch MAMBA2 model.

  • medusa
  • mellum
  • midashenglm

    Inference-only MiDashengLM model compatible with HuggingFace weights.

  • mimo

    Inference-only MiMo model compatible with HuggingFace weights.

  • mimo_audio

    MiMo audio: tokenizer, encoding utilities, and audio encoder.

  • mimo_mtp

    Inference-only MiMo-MTP model.

  • mimo_v2_mtp

    Inference-only MiMo-V2 MTP (Multi-Token Prediction) draft model.

  • mimo_v2_omni
  • minicpm

    Inference-only MiniCPM model compatible with HuggingFace weights.

  • minicpm3

    Inference-only MiniCPM3 model compatible with HuggingFace weights.

  • minicpm_eagle

    Inference-only EagleMiniCPM model compatible with HuggingFace weights.

  • minicpmo

    Inference-only MiniCPM-O model compatible with HuggingFace weights.

  • minicpmv

    Inference-only MiniCPM-V model compatible with HuggingFace weights.

  • minicpmv4_6

    Inference-only MiniCPM-V 4.6 model (MiniCPMV4_6ForConditionalGeneration).

  • minimax_m2

    Inference-only MiniMaxM2 model.

  • minimax_text_01

    Inference-only MiniMaxText01 model.

  • minimax_vl_01
  • mistral

    Mistral adaptation of the LLaMA architecture.

  • mistral3
  • mistral_large_3
  • mixtral

    Inference-only Mixtral model.

  • mllama4
  • mlp_speculator
  • molmo
  • molmo2
  • moondream3

    Inference-only Moondream3 model implementation.

  • moonvit
  • musicflamingo
  • nano_nemotron_vl
  • nemotron

    Inference-only Nemotron model compatible with HuggingFace weights.

  • nemotron_h

    Inference-only NemotronH model.

  • nemotron_h_mtp

    NemotronH-MTP model with attention layers.

  • nemotron_nas

    Inference-only deci model compatible with HuggingFace weights.

  • nemotron_parse
  • nemotron_vl
  • olmo

    Inference-only OLMo model compatible with HuggingFace weights.

  • olmo2

    Inference-only OLMo2 model compatible with HuggingFace weights.

  • olmo_hybrid

    Inference-only OLMo Hybrid model compatible with HuggingFace weights.

  • olmoe

    Inference-only OLMoE model compatible with HuggingFace weights.

  • opencua

    Inference-only OpenCUA-7B model compatible with HuggingFace weights.

  • openpangu_mtp
  • openpangu_vl
  • openvla
  • opt

    Inference-only OPT model compatible with HuggingFace weights.

  • orion

    Inference-only Orion-14B model compatible with HuggingFace weights.

  • ouro

    Inference-only Ouro model compatible with HuggingFace weights.

  • ovis

    PyTorch Ovis model.

  • ovis2_5

    PyTorch Ovis model.

  • paddleocr_vl
  • paligemma
  • parakeet

    Modules below used for the audio encoder component in: models/nano_nemotron_vl.py

  • param2moe
  • persimmon

    Inference-only persimmon model compatible with HuggingFace weights.

  • phi

    Inference-only Phi-1.5 model compatible with HuggingFace weights.

  • phi3

    Inference-only Phi3 model; the code inherits from Llama.py

  • phi3v
  • phi4mm
  • phi4mm_audio
  • phi4mm_utils
  • phi4siglip

    vLLM support for microsoft/Phi-4-reasoning-vision-15B.

  • phimoe

    Inference-only PhiMoE model.

  • pixtral
  • plamo2

    Inference-only PLaMo2 model.

  • plamo3

    Inference-only PLaMo3 model.

  • qianfan_ocr
  • qwen

    Inference-only QWen model compatible with HuggingFace weights.

  • qwen2

    Inference-only Qwen2 model compatible with HuggingFace weights.

  • qwen2_5_omni_thinker

    Inference-only Qwen2.5-Omni model (thinker part).

  • qwen2_5_vl

    Inference-only Qwen2.5-VL model compatible with HuggingFace weights.

  • qwen2_audio

    Inference-only Qwen2-Audio model compatible with HuggingFace weights.

  • qwen2_moe

    Inference-only Qwen2MoE model compatible with HuggingFace weights.

  • qwen2_rm

    Inference-only Qwen2-RM model compatible with HuggingFace weights.

  • qwen2_vl

    Inference-only Qwen2-VL model compatible with HuggingFace weights.

  • qwen3

    Inference-only Qwen3 model compatible with HuggingFace weights.

  • qwen3_5

    Inference-only Qwen3.5 Series compatible with HuggingFace weights.

  • qwen3_5_mtp

    Inference-only Qwen3_5 MTP model.

  • qwen3_asr

    Inference-only Qwen3-ASR model.

  • qwen3_asr_forced_aligner

    Inference-only Qwen3-ASR ForcedAligner model (token classification).

  • qwen3_asr_realtime

    Inference-only Qwen3-ASR realtime model.

  • qwen3_dflash
  • qwen3_moe

    Inference-only Qwen3MoE model compatible with HuggingFace weights.

  • qwen3_next

    Inference-only Qwen3Next model.

  • qwen3_next_mtp

    Inference-only Qwen3Next MTP model.

  • qwen3_omni_moe_thinker

    Inference-only Qwen3-Omni-Moe model (thinker part).

  • qwen3_vl

    Inference-only Qwen3VL model compatible with HuggingFace weights.

  • qwen3_vl_moe

    Inference-only Qwen3-VL-MoE model compatible with HuggingFace weights.

  • qwen_vl

    Inference-only Qwen-VL model compatible with HuggingFace weights.

  • radio
  • registry

    Whenever you add an architecture to this page, please also update

  • roberta
  • sarvam
  • seed_oss

    Inference-only SeedOss model compatible with HuggingFace weights.

  • siglip
  • siglip2navit

    Implementation of SiglipVisionModel intended to be only used

  • skyworkr1v
  • solar

    Inference-only Solar model compatible with HuggingFace weights.

  • stablelm

    Inference-only StableLM (https://github.com/Stability-AI/StableLM)

  • starcoder2

    PyTorch Starcoder2 model.

  • step1

    Shared Step decoder blocks and the Step1 text model.

  • step3_text

    Inference-only Jurassic model.

  • step3_vl
  • step3p5

    Inference-only Jurassic model.

  • step3p5_mtp
  • step3p7

    Inference-only Jurassic model.

  • step_vl

    This is basically a copy from perception_models/core/vision_encoder/pe.py

  • tarsier
  • terratorch

    Wrapper around Terratorch models

  • transformers

    Wrapper around transformers models

  • ultravox

    PyTorch Ultravox model.

  • utils
  • vision
  • voxtral
  • voxtral_realtime
  • voyage
  • whisper
  • whisper_causal
  • zamba2

    PyTorch Zamba2 model implementation for vLLM.

Classes:

HasInnerState

Bases: Protocol

The interface required for all models that have inner state.

Attributes:

  • has_inner_state

    A flag that indicates this model has inner state.

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class HasInnerState(Protocol):
    """The interface required for all models that has inner state."""

    has_inner_state: ClassVar[Literal[True]] = True
    """
        A flag that indicates this model has inner state.
        Models that has inner state usually need access to the scheduler_config
        for max_num_seqs, etc. True for e.g. both Mamba and Jamba.
    """

has_inner_state = True class-attribute

A flag that indicates this model has inner state. Models that have inner state usually need access to the scheduler_config for max_num_seqs, etc. This is true for both Mamba and Jamba, for example.
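
As a rough illustration of how such a flag can be consumed, the sketch below defines a stand-in protocol with the same shape as the one listed above instead of importing it from vLLM. `ToyStatefulModel`, `ToyStatelessModel` and `model_has_inner_state` are hypothetical names used only for this example.

```python
from typing import ClassVar, Literal, Protocol, runtime_checkable


@runtime_checkable
class HasInnerState(Protocol):
    """Stand-in mirroring the protocol shown above (not imported from vLLM)."""

    has_inner_state: ClassVar[Literal[True]] = True


class ToyStatefulModel(HasInnerState):
    """Hypothetical model; it inherits the flag through the MRO."""


class ToyStatelessModel:
    """Hypothetical model that does not declare the flag."""


def model_has_inner_state(model: object) -> bool:
    # Hypothetical helper: read the flag and default to False when absent.
    return getattr(model, "has_inner_state", False)
```

Because the protocol is decorated with `@runtime_checkable`, `isinstance(model, HasInnerState)` also works on instances; it only checks that the attribute exists, not its value.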

SupportsLoRA

Bases: Protocol

The interface required for all models that support LoRA.

Attributes:

  • supports_lora

    A flag that indicates this model supports LoRA.

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class SupportsLoRA(Protocol):
    """The interface required for all models that support LoRA."""

    supports_lora: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports LoRA.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """
    is_3d_moe_weight: ClassVar[bool] = False
    is_non_gated_moe: ClassVar[bool] = False
    # The `embedding_module` and `embedding_padding_modules`
    # are empty by default.
    embedding_modules: ClassVar[dict[str, str]] = {}
    packed_modules_mapping: dict[str, list[str]] = {}
    # Module prefixes to skip during LoRA loading (e.g., ["mtp."] for MTP layers)
    lora_skip_prefixes: ClassVar[list[str]] = []

supports_lora = True class-attribute

A flag that indicates this model supports LoRA.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.
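
The following sketch shows one way a model class might fill in the class-level fields from the listing above. It uses a trimmed stand-in protocol rather than the vLLM class, and the values in `ToyLoRAModel` as well as the `should_skip` helper are assumptions made for illustration; in particular, treating `lora_skip_prefixes` as a plain `startswith` match is a guess based on the comment in the source.

```python
from typing import ClassVar, Literal, Protocol, runtime_checkable


@runtime_checkable
class SupportsLoRA(Protocol):
    """Trimmed stand-in for the protocol shown above (not imported from vLLM)."""

    supports_lora: ClassVar[Literal[True]] = True
    packed_modules_mapping: dict[str, list[str]] = {}
    lora_skip_prefixes: ClassVar[list[str]] = []


class ToyLoRAModel(SupportsLoRA):
    """Hypothetical model: inherits `supports_lora`, overrides the rest."""

    # Assumed example values: a fused projection and the layers it packs.
    packed_modules_mapping = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]}
    lora_skip_prefixes = ["mtp."]


def should_skip(model: SupportsLoRA, module_name: str) -> bool:
    # Hypothetical helper: a plain prefix match against the skip list.
    return any(module_name.startswith(p) for p in model.lora_skip_prefixes)
```

Note that `ToyLoRAModel` never assigns `supports_lora` itself, which matches the note above about the flag being inherited through the MRO.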

SupportsMRoPE

Bases: Protocol

The interface required for all models that support M-RoPE.

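Methods:

  • get_mrope_input_positions

    Get M-RoPE input positions and delta value for this specific model.

Attributes:

  • supports_mrope

    A flag that indicates this model supports M-RoPE.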

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class SupportsMRoPE(Protocol):
    """The interface required for all models that support M-RoPE."""

    supports_mrope: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports M-RoPE.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    def get_mrope_input_positions(
        self,
        input_tokens: list[int],
        mm_features: list["MultiModalFeatureSpec"],
    ) -> tuple[torch.Tensor, int]:
        """
        Get M-RoPE input positions and delta value for this specific model.

        This method should be implemented by each model that supports M-RoPE
        to provide model-specific logic for computing input positions.

        Args:
            input_tokens: List of input token IDs
            mm_features: Information about each multi-modal data item

        Returns:
            Tuple of `(llm_positions, mrope_position_delta)`
            - llm_positions: Tensor of shape `[3, num_tokens]` with T/H/W positions
            - mrope_position_delta: Delta for position calculations
        """
        ...

supports_mrope = True class-attribute

A flag that indicates this model supports M-RoPE.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

get_mrope_input_positions(input_tokens, mm_features)

Get M-RoPE input positions and delta value for this specific model.

This method should be implemented by each model that supports M-RoPE to provide model-specific logic for computing input positions.

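Parameters:

  • input_tokens (list[int])

    List of input token IDs

  • mm_features (list[MultiModalFeatureSpec])

    Information about each multi-modal data item

Returns:

  • tuple[Tensor, int]

    Tuple of (llm_positions, mrope_position_delta)

    • llm_positions: Tensor of shape [3, num_tokens] with T/H/W positions
    • mrope_position_delta: Delta for position calculations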
Source code in vllm/model_executor/models/interfaces.py
def get_mrope_input_positions(
    self,
    input_tokens: list[int],
    mm_features: list["MultiModalFeatureSpec"],
) -> tuple[torch.Tensor, int]:
    """
    Get M-RoPE input positions and delta value for this specific model.

    This method should be implemented by each model that supports M-RoPE
    to provide model-specific logic for computing input positions.

    Args:
        input_tokens: List of input token IDs
        mm_features: Information about each multi-modal data item

    Returns:
        Tuple of `(llm_positions, mrope_position_delta)`
        - llm_positions: Tensor of shape `[3, num_tokens]` with T/H/W positions
        - mrope_position_delta: Delta for position calculations
    """
    ...
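
To make the return shape concrete, here is a minimal sketch of the simplest case, a prompt with no multi-modal items. It is not taken from any vLLM model: nested lists stand in for the `[3, num_tokens]` tensor, the function name is hypothetical, and the delta convention (largest position plus one, minus the token count) is an assumption. A real implementation would return a `torch.Tensor` and use `mm_features` to lay out image or video positions.

```python
def text_only_mrope_positions(
    input_tokens: list[int],
) -> tuple[list[list[int]], int]:
    """Hypothetical text-only case: the T, H and W rows all count 0..n-1."""
    num_tokens = len(input_tokens)
    row = list(range(num_tokens))
    # Three rows (T/H/W) of num_tokens columns, mirroring [3, num_tokens].
    llm_positions = [row.copy() for _ in range(3)]
    # Assumed delta convention: (max position + 1) - num_tokens.
    max_position = max(row, default=-1)
    mrope_position_delta = max_position + 1 - num_tokens
    return llm_positions, mrope_position_delta
```

With text only, the three rows are identical and the delta is zero under this convention; multi-modal items are what make the rows diverge.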

SupportsMultiModal

Bases: Protocol

The interface required for all multi-modal models.

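Methods:

  • _mark_composite_model

    Composite wrapper over _mark_language_model and _mark_tower_model by modality.

  • _mark_language_model

    Mark each child module that was assigned to this model during this context as a language model component.

  • _mark_tower_model

    Mark each child module that was assigned to this model during this context as a tower model component.

  • configure_mm_token_handling

    Check if any multimodal tokens are out of vocabulary.

  • embed_input_ids

    Apply token embeddings to input_ids.

  • embed_multimodal

    Returns multimodal embeddings generated from multimodal kwargs to be merged with text embeddings.

  • get_language_model

    Returns the underlying language model used for text generation.

  • get_num_mm_connector_tokens

    Implement this function to enable LoRA support for the connector module of the multi-modal model.

  • get_num_mm_encoder_tokens

    Implement this function to enable LoRA support for the tower module of the multi-modal model.

  • get_placeholder_str

    Get the placeholder text for the ith modality item in the prompt.

Attributes:

  • _has_oov_mm_tokens
  • _language_model_names
  • _processor_factory
  • _tower_model_names
  • requires_raw_input_tokens
  • supports_encoder_tp_data
  • supports_multimodal
  • supports_multimodal_raw_input_only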

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class SupportsMultiModal(Protocol):
    """The interface required for all multi-modal models."""

    supports_multimodal: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports multi-modal inputs.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    supports_multimodal_raw_input_only: ClassVar[bool] = False
    """
    A flag that indicates this model supports multi-modal inputs and processes
    them in their raw form and not embeddings.
    """

    supports_encoder_tp_data: ClassVar[bool] = False
    """
    A flag that indicates whether this model supports
    `multimodal_config.mm_encoder_tp_mode="data"`.
    """

    requires_raw_input_tokens: ClassVar[bool] = False
    """
    A flag that indicates this model processes input id tokens
    in their raw form and not input embeddings.
    """

    _processor_factory: ClassVar[_ProcessorFactories]
    """
    Set internally by `MultiModalRegistry.register_processor`.
    """

    _language_model_names: list[str] = []
    """
    Set internally by `_mark_language_model`.
    """

    _tower_model_names: list[str] = []
    """
    Set internally by `_mark_tower_model`.
    """

    _has_oov_mm_tokens: bool = False
    """
    In general, this should be set at init time by invoking
    `configure_mm_token_handling` models & passing all potentially
    OOV multimodal tokens.
    """

    @classmethod
    def get_placeholder_str(cls, modality: str, i: int) -> str | None:
        """
        Get the placeholder text for the `i`th `modality` item in the prompt.
        """
        ...

    def embed_multimodal(self, **kwargs: object) -> MultiModalEmbeddings:
        """
        Returns multimodal embeddings generated from multimodal kwargs
        to be merged with text embeddings.

        Note:
            The returned multimodal embeddings must be in the same order as
            the appearances of their corresponding multimodal data item in the
            input prompt.
        """
        ...

    def configure_mm_token_handling(self, vocab_size: int, mm_token_ids: list[int]):
        """Check if any multimodal tokens are out of vocabulary. If so, we will
        explicitly mask all multimodal tokens out when computing text embeddings,
        since the multimodal embeddings will be scattered over the results.
        """
        self._has_oov_mm_tokens = any(tok_id >= vocab_size for tok_id in mm_token_ids)
        logger.info(
            "Contains out of vocabulary multimodal tokens? %s",
            self._has_oov_mm_tokens,
        )

    def get_language_model(self) -> VllmModel:
        """
        Returns the underlying language model used for text generation.

        This is typically the `torch.nn.Module` instance responsible for
        processing the merged multimodal embeddings and producing hidden states

        Returns:
            torch.nn.Module: The core language model component.
        """
        # Cached
        if self in _language_model_by_module:
            return _language_model_by_module[self]

        if self._language_model_names:
            mod = self
            for attr in common_prefix(
                [name.split(".") for name in self._language_model_names]
            ):
                if attr:
                    mod = getattr(mod, attr)

            if mod is not self and hasattr(mod, "embed_input_ids"):
                _language_model_by_module[self] = mod
                return mod

        # Fallback
        for mod in self.children():
            if hasattr(mod, "embed_input_ids"):
                _language_model_by_module[self] = mod
                return mod

        raise NotImplementedError(
            f"No language model found in {type(self).__name__}! "
            "You should initialize it via `_mark_language_model`, "
            "and make sure `embed_input_ids` is implemented."
        )

    @contextmanager
    def _mark_language_model(
        self,
        vllm_config: VllmConfig,
        *,
        targets: type[nn.Module] | tuple[type[nn.Module], ...] | None = None,
    ):
        """
        Mark each child module that was assigned to this model during this context
        as a language model component.

        Language model components are automatically skipped in `--mm-encoder-only`
        mode.

        If `targets` is set, instead include descendants that are an instance
        of `targets`, even if they aren't direct children.
        """
        from .utils import StageMissingLayer, collect_children, no_init_weights

        mm_config = vllm_config.model_config.multimodal_config

        with collect_children(self, targets=targets) as children_names:  # noqa: SIM117
            with (
                no_init_weights(
                    self,
                    lambda mod: StageMissingLayer("language_model", mod),
                    targets=targets,
                )
                if mm_config.mm_encoder_only
                else nullcontext()
            ):
                yield

        self._language_model_names = children_names

    @contextmanager
    def _mark_tower_model(
        self,
        vllm_config: VllmConfig,
        modalities: set[str] | str,
        *,
        targets: type[nn.Module] | tuple[type[nn.Module], ...] | None = None,
    ):
        """
        Mark each child module that was assigned to this model during this context
        as a tower model component.

        Tower model components are automatically skipped when `--limit-mm-per-prompt`
        is set to zero for all of their modalities.

        If `targets` is set, instead include descendants that are an instance
        of `targets`, even if they aren't direct children.
        """
        from .utils import StageMissingLayer, collect_children, no_init_weights

        if isinstance(modalities, str):
            modalities = {modalities}

        if modalities == {"image", "video"}:
            stage_name = "vision_tower"
        else:
            stage_name = "_".join([*modalities, "tower"])

        mm_config = vllm_config.model_config.multimodal_config

        with collect_children(self, targets=targets) as children_names:  # noqa: SIM117
            with (
                no_init_weights(
                    self,
                    lambda mod: StageMissingLayer(stage_name, mod),
                    targets=targets,
                )
                if all(mm_config.get_limit_per_prompt(m) == 0 for m in modalities)
                else nullcontext()
            ):
                yield

        self._tower_model_names = children_names

    @contextmanager
    def _mark_composite_model(
        self,
        vllm_config: VllmConfig,
        *,
        language_targets: type[nn.Module] | tuple[type[nn.Module], ...],
        tower_targets: dict[str, type[nn.Module] | tuple[type[nn.Module], ...]],
    ):
        """
        Composite wrapper over `_mark_language_model` and
        `_mark_tower_model` by modality.
        """
        with ExitStack() as stack:
            stack.enter_context(
                self._mark_language_model(
                    vllm_config,
                    targets=language_targets,
                )
            )

            for modality, modality_targets in tower_targets.items():
                stack.enter_context(
                    self._mark_tower_model(
                        vllm_config,
                        modality,
                        targets=modality_targets,
                    )
                )

            yield

    def get_num_mm_encoder_tokens(self, num_image_tokens: int) -> int:
        """
        Implement this function to enable LoRA support
        for the tower module of the multi-modal model.
        Given the number of image tokens, output the number of
        multi-modal encoder tokens.
        """
        ...

    def get_num_mm_connector_tokens(self, num_vision_tokens: int) -> int:
        """
        Implement this function to enable LoRA support
        for the connector module of the multi-modal model.
        Given the number of vision tokens, output the number of
        multi-modal connector tokens.
        """
        ...

    @overload
    def embed_input_ids(self, input_ids: Tensor) -> Tensor: ...

    @overload
    def embed_input_ids(
        self,
        input_ids: Tensor,
        multimodal_embeddings: MultiModalEmbeddings,
        *,
        is_multimodal: torch.Tensor,
    ) -> Tensor: ...

    def _embed_text_input_ids(
        self,
        input_ids: Tensor,
        embed_input_ids: Callable[[Tensor], Tensor],
        *,
        is_multimodal: Tensor | None,
    ) -> Tensor:
        if is_multimodal is not None and self._has_oov_mm_tokens:
            # Force all input IDs to be in vocab; we do this instead of squeezing
            # to ensure that any external configuration requiring offset tracking,
            # e.g., LoRA, are applied correctly regardless of whether or not
            # we have multimodal tokens.
            in_vocab_ids = input_ids.masked_fill(
                is_multimodal.to(device=input_ids.device, non_blocking=True), 0
            )
            return embed_input_ids(in_vocab_ids)

        return embed_input_ids(input_ids)

    def embed_input_ids(
        self,
        input_ids: Tensor,
        multimodal_embeddings: MultiModalEmbeddings | None = None,
        *,
        is_multimodal: Tensor | None = None,
    ) -> Tensor:
        """
        Apply token embeddings to `input_ids`.

        If `multimodal_embeddings` is passed, scatter them into
        `input_ids` according to the mask `is_multimodal`.

        NOTE: If this model has multimodal tokens that are out of vocabulary
        (i.e., self._has_oov_mm_tokens=True), the input_ids will be copied
        and masked to 0 during the forward pass for the text embeddings.
        """
        from .utils import _merge_multimodal_embeddings

        # Get text embeddings first; multimodal embeddings will clobber
        # any invalid contents in the indices of multimodal embeddings
        # for the in vocabulary and out of vocabulary case.
        inputs_embeds = self._embed_text_input_ids(
            input_ids,
            self.get_language_model().embed_input_ids,
            is_multimodal=is_multimodal,
        )

        if multimodal_embeddings is None or len(multimodal_embeddings) == 0:
            return inputs_embeds

        return _merge_multimodal_embeddings(
            inputs_embeds=inputs_embeds,
            multimodal_embeddings=multimodal_embeddings,
            is_multimodal=_require_is_multimodal(is_multimodal),
        )

_has_oov_mm_tokens = False class-attribute instance-attribute

In general, this should be set at init time by calling configure_mm_token_handling with all potentially out-of-vocabulary (OOV) multimodal token IDs.
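
The check and the masking it enables can be sketched with plain Python lists. The first function repeats the comparison from `configure_mm_token_handling` in the class listing above; the second is a list-based analogue of the `masked_fill` call in `_embed_text_input_ids`. Both function names are hypothetical, and the real code operates on tensors.

```python
def has_oov_mm_tokens(vocab_size: int, mm_token_ids: list[int]) -> bool:
    # Same comparison as in `configure_mm_token_handling`: an ID is out of
    # vocabulary when it is greater than or equal to the vocabulary size.
    return any(tok_id >= vocab_size for tok_id in mm_token_ids)


def mask_mm_positions(input_ids: list[int], is_multimodal: list[bool]) -> list[int]:
    # List-based analogue of `input_ids.masked_fill(is_multimodal, 0)`:
    # multimodal positions become token 0 so the text embedding lookup
    # never sees an out-of-range ID.
    return [0 if is_mm else tok for tok, is_mm in zip(input_ids, is_multimodal)]
```

The masked positions still occupy their slots in the sequence, so offsets stay aligned; their placeholder embeddings are later overwritten by the multimodal embeddings.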

_language_model_names = [] class-attribute instance-attribute

Set internally by _mark_language_model.

_processor_factory class-attribute

Set internally by MultiModalRegistry.register_processor.

_tower_model_names = [] class-attribute instance-attribute

Set internally by _mark_tower_model.

requires_raw_input_tokens = False class-attribute

A flag that indicates this model processes input id tokens in their raw form and not input embeddings.

supports_encoder_tp_data = False class-attribute

A flag that indicates whether this model supports multimodal_config.mm_encoder_tp_mode="data".

supports_multimodal = True class-attribute

A flag that indicates this model supports multi-modal inputs.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

supports_multimodal_raw_input_only = False class-attribute

A flag that indicates this model supports multi-modal inputs and processes them in their raw form and not embeddings.

_mark_composite_model(vllm_config, *, language_targets, tower_targets)

Composite wrapper over _mark_language_model and _mark_tower_model by modality.

Source code in vllm/model_executor/models/interfaces.py
@contextmanager
def _mark_composite_model(
    self,
    vllm_config: VllmConfig,
    *,
    language_targets: type[nn.Module] | tuple[type[nn.Module], ...],
    tower_targets: dict[str, type[nn.Module] | tuple[type[nn.Module], ...]],
):
    """
    Composite wrapper over `_mark_language_model` and
    `_mark_tower_model` by modality.
    """
    with ExitStack() as stack:
        stack.enter_context(
            self._mark_language_model(
                vllm_config,
                targets=language_targets,
            )
        )

        for modality, modality_targets in tower_targets.items():
            stack.enter_context(
                self._mark_tower_model(
                    vllm_config,
                    modality,
                    targets=modality_targets,
                )
            )

        yield

_mark_language_model(vllm_config, *, targets=None)

Mark each child module that was assigned to this model during this context as a language model component.

Language model components are automatically skipped in --mm-encoder-only mode.

If targets is set, instead include descendants that are an instance of targets, even if they aren't direct children.

Source code in vllm/model_executor/models/interfaces.py
@contextmanager
def _mark_language_model(
    self,
    vllm_config: VllmConfig,
    *,
    targets: type[nn.Module] | tuple[type[nn.Module], ...] | None = None,
):
    """
    Mark each child module that was assigned to this model during this context
    as a language model component.

    Language model components are automatically skipped in `--mm-encoder-only`
    mode.

    If `targets` is set, instead include descendants that are an instance
    of `targets`, even if they aren't direct children.
    """
    from .utils import StageMissingLayer, collect_children, no_init_weights

    mm_config = vllm_config.model_config.multimodal_config

    with collect_children(self, targets=targets) as children_names:  # noqa: SIM117
        with (
            no_init_weights(
                self,
                lambda mod: StageMissingLayer("language_model", mod),
                targets=targets,
            )
            if mm_config.mm_encoder_only
            else nullcontext()
        ):
            yield

    self._language_model_names = children_names
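
The "assigned during this context" behaviour can be illustrated with a plain-Python sketch. It assumes nothing about how vLLM's `collect_children` is implemented; `ToyModel` and `mark_children` are hypothetical names, and ordinary attributes stand in for `nn.Module` children.

```python
from contextlib import contextmanager


class ToyModel:
    """Plain-Python stand-in for a model class (hypothetical, not a vLLM type)."""

    def __init__(self) -> None:
        self.marked_names = []
        # Only attributes assigned inside the block are recorded.
        with self.mark_children():
            self.language_model = object()
        # Assigned outside the block, so it is not recorded.
        self.projector = object()

    @contextmanager
    def mark_children(self):
        before = set(vars(self))
        yield
        # Runs on normal exit from the `with` block.
        self.marked_names = sorted(set(vars(self)) - before)


model = ToyModel()
print(model.marked_names)  # ['language_model']
```

As in the listing above, the names are stored only after the block exits normally, so they are not available while the block is still running.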

_mark_tower_model(vllm_config, modalities, *, targets=None)

Mark each child module that was assigned to this model during this context as a tower model component.

Tower model components are automatically skipped when --limit-mm-per-prompt is set to zero for all of their modalities.

If targets is set, instead include descendants that are an instance of targets, even if they aren't direct children.

Source code in vllm/model_executor/models/interfaces.py
@contextmanager
def _mark_tower_model(
    self,
    vllm_config: VllmConfig,
    modalities: set[str] | str,
    *,
    targets: type[nn.Module] | tuple[type[nn.Module], ...] | None = None,
):
    """
    Mark each child module that was assigned to this model during this context
    as a tower model component.

    Tower model components are automatically skipped when `--limit-mm-per-prompt`
    is set to zero for all of their modalities.

    If `targets` is set, instead include descendants that are an instance
    of `targets`, even if they aren't direct children.
    """
    from .utils import StageMissingLayer, collect_children, no_init_weights

    if isinstance(modalities, str):
        modalities = {modalities}

    if modalities == {"image", "video"}:
        stage_name = "vision_tower"
    else:
        stage_name = "_".join([*modalities, "tower"])

    mm_config = vllm_config.model_config.multimodal_config

    with collect_children(self, targets=targets) as children_names:  # noqa: SIM117
        with (
            no_init_weights(
                self,
                lambda mod: StageMissingLayer(stage_name, mod),
                targets=targets,
            )
            if all(mm_config.get_limit_per_prompt(m) == 0 for m in modalities)
            else nullcontext()
        ):
            yield

    self._tower_model_names = children_names
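
The stage naming and the skip condition in the listing above can be tried in isolation. The following is a minimal sketch: `tower_stage_name` and `tower_is_skipped` are hypothetical helper names, and a plain dict stands in for `mm_config.get_limit_per_prompt`.

```python
def tower_stage_name(modalities):
    """Mirror the naming branch from the listing (helper name is hypothetical)."""
    if isinstance(modalities, str):
        modalities = {modalities}
    if modalities == {"image", "video"}:
        return "vision_tower"
    return "_".join([*modalities, "tower"])


def tower_is_skipped(modalities, limits):
    """True when every modality of the tower has a per-prompt limit of zero.

    `limits` is a plain dict used here in place of the multimodal config.
    """
    if isinstance(modalities, str):
        modalities = {modalities}
    return all(limits[m] == 0 for m in modalities)


print(tower_stage_name("audio"))             # audio_tower
print(tower_stage_name({"video", "image"}))  # vision_tower
print(tower_is_skipped({"image", "video"}, {"image": 0, "video": 0}))  # True
print(tower_is_skipped({"image", "video"}, {"image": 0, "video": 1}))  # False
```

For a set with several modalities other than exactly image and video, the joined name follows set iteration order, which Python does not guarantee to be stable across runs.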

configure_mm_token_handling(vocab_size, mm_token_ids)

Check if any multimodal tokens are out of vocabulary. If so, we will explicitly mask all multimodal tokens out when computing text embeddings, since the multimodal embeddings will be scattered over the results.

Source code in vllm/model_executor/models/interfaces.py
def configure_mm_token_handling(self, vocab_size: int, mm_token_ids: list[int]):
    """Check if any multimodal tokens are out of vocabulary. If so, we will
    explicitly mask all multimodal tokens out when computing text embeddings,
    since the multimodal embeddings will be scattered over the results.
    """
    self._has_oov_mm_tokens = any(tok_id >= vocab_size for tok_id in mm_token_ids)
    logger.info(
        "Contains out of vocabulary multimodal tokens? %s",
        self._has_oov_mm_tokens,
    )
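
The check itself is a single comparison against the vocabulary size. A standalone sketch, with a hypothetical function name and illustrative vocabulary size and token IDs:

```python
def has_oov_mm_tokens(vocab_size, mm_token_ids):
    """Return True if any multimodal token ID falls outside the vocabulary."""
    return any(tok_id >= vocab_size for tok_id in mm_token_ids)


print(has_oov_mm_tokens(32000, [31998, 31999]))  # False: both IDs are valid indices
print(has_oov_mm_tokens(32000, [31999, 32000]))  # True: 32000 is one past the last index
```

Valid indices run from 0 to `vocab_size - 1`, which is why the comparison is `>=` rather than `>`.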

embed_input_ids(input_ids, multimodal_embeddings=None, *, is_multimodal=None)

embed_input_ids(input_ids: Tensor) -> Tensor
embed_input_ids(
    input_ids: Tensor,
    multimodal_embeddings: MultiModalEmbeddings,
    *,
    is_multimodal: torch.Tensor,
) -> Tensor

Apply token embeddings to input_ids.

If multimodal_embeddings is passed, scatter them into input_ids according to the mask is_multimodal.

NOTE: If this model has multimodal tokens that are out of vocabulary (i.e., self._has_oov_mm_tokens=True), the input_ids will be copied and masked to 0 during the forward pass for the text embeddings.

Source code in vllm/model_executor/models/interfaces.py
def embed_input_ids(
    self,
    input_ids: Tensor,
    multimodal_embeddings: MultiModalEmbeddings | None = None,
    *,
    is_multimodal: Tensor | None = None,
) -> Tensor:
    """
    Apply token embeddings to `input_ids`.

    If `multimodal_embeddings` is passed, scatter them into
    `input_ids` according to the mask `is_multimodal`.

    NOTE: If this model has multimodal tokens that are out of vocabulary
    (i.e., self._has_oov_mm_tokens=True), the input_ids will be copied
    and masked to 0 during the forward pass for the text embeddings.
    """
    from .utils import _merge_multimodal_embeddings

    # Get text embeddings first; multimodal embeddings will clobber
    # any invalid contents in the indices of multimodal embeddings
    # for the in vocabulary and out of vocabulary case.
    inputs_embeds = self._embed_text_input_ids(
        input_ids,
        self.get_language_model().embed_input_ids,
        is_multimodal=is_multimodal,
    )

    if multimodal_embeddings is None or len(multimodal_embeddings) == 0:
        return inputs_embeds

    return _merge_multimodal_embeddings(
        inputs_embeds=inputs_embeds,
        multimodal_embeddings=multimodal_embeddings,
        is_multimodal=_require_is_multimodal(is_multimodal),
    )

embed_multimodal(**kwargs)

Returns multimodal embeddings generated from multimodal kwargs to be merged with text embeddings.

Note

The returned multimodal embeddings must be in the same order as the appearances of their corresponding multimodal data item in the input prompt.

Source code in vllm/model_executor/models/interfaces.py
def embed_multimodal(self, **kwargs: object) -> MultiModalEmbeddings:
    """
    Returns multimodal embeddings generated from multimodal kwargs
    to be merged with text embeddings.

    Note:
        The returned multimodal embeddings must be in the same order as
        the appearances of their corresponding multimodal data item in the
        input prompt.
    """
    ...

get_language_model()

Returns the underlying language model used for text generation.

This is typically the torch.nn.Module instance responsible for processing the merged multimodal embeddings and producing hidden states.

Returns:

  • VllmModel

    torch.nn.Module: The core language model component.

Source code in vllm/model_executor/models/interfaces.py
def get_language_model(self) -> VllmModel:
    """
    Returns the underlying language model used for text generation.

    This is typically the `torch.nn.Module` instance responsible for
    processing the merged multimodal embeddings and producing hidden states.

    Returns:
        torch.nn.Module: The core language model component.
    """
    # Cached
    if self in _language_model_by_module:
        return _language_model_by_module[self]

    if self._language_model_names:
        mod = self
        for attr in common_prefix(
            [name.split(".") for name in self._language_model_names]
        ):
            if attr:
                mod = getattr(mod, attr)

        if mod is not self and hasattr(mod, "embed_input_ids"):
            _language_model_by_module[self] = mod
            return mod

    # Fallback
    for mod in self.children():
        if hasattr(mod, "embed_input_ids"):
            _language_model_by_module[self] = mod
            return mod

    raise NotImplementedError(
        f"No language model found in {type(self).__name__}! "
        "You should initialize it via `_mark_language_model`, "
        "and make sure `embed_input_ids` is implemented."
    )
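
The lookup based on `_language_model_names` walks the longest attribute path shared by all recorded names. A minimal sketch of that step follows; the `common_prefix` here is a stand-in written for this example (the real helper is imported elsewhere in vLLM), and `SimpleNamespace` objects replace modules.

```python
from types import SimpleNamespace


def common_prefix(paths):
    """Longest shared leading run of path components (stand-in helper)."""
    prefix = []
    for parts in zip(*paths):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return prefix


def resolve(root, names):
    """Follow the shared attribute path of `names`, starting at `root`."""
    mod = root
    for attr in common_prefix([name.split(".") for name in names]):
        if attr:
            mod = getattr(mod, attr)
    return mod


inner = SimpleNamespace(model=object(), lm_head=object())
root = SimpleNamespace(language_model=inner, vision_tower=object())

found = resolve(root, ["language_model.model", "language_model.lm_head"])
print(found is inner)  # True

# With no shared prefix the walk stays at the root, which is the case
# where the listing above falls through to its fallback loop.
print(resolve(root, ["language_model.model", "vision_tower"]) is root)  # True
```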

get_num_mm_connector_tokens(num_vision_tokens)

Implement this function to enable LoRA support for the connector module of the multi-modal model. Given the number of vision tokens, output the number of multi-modal connector tokens.

Source code in vllm/model_executor/models/interfaces.py
def get_num_mm_connector_tokens(self, num_vision_tokens: int) -> int:
    """
    Implement this function to enable LoRA support
    for the connector module of the multi-modal model.
    Given the number of vision tokens, output the number of
    multi-modal connector tokens.
    """
    ...

get_num_mm_encoder_tokens(num_image_tokens)

Implement this function to enable LoRA support for the tower module of the multi-modal model. Given the number of image tokens, output the number of multi-modal encoder tokens.

Source code in vllm/model_executor/models/interfaces.py
def get_num_mm_encoder_tokens(self, num_image_tokens: int) -> int:
    """
    Implement this function to enable LoRA support
    for the tower module of the multi-modal model.
    Given the number of image tokens, output the number of
    multi-modal encoder tokens.
    """
    ...

get_placeholder_str(modality, i) classmethod

Get the placeholder text for the ith modality item in the prompt.

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def get_placeholder_str(cls, modality: str, i: int) -> str | None:
    """
    Get the placeholder text for the `i`th `modality` item in the prompt.
    """
    ...
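
A model-side implementation might look like the sketch below. The class name and the placeholder strings are hypothetical; each real model defines the text its own prompt format expects, and returning `None` is used here for modalities the toy model has no placeholder for.

```python
from typing import Optional


class ToyVLM:
    """Hypothetical model class used only to illustrate the hook."""

    @classmethod
    def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
        if modality.startswith("image"):
            return "<image>"       # same text for every image item
        if modality.startswith("video"):
            return f"<video_{i}>"  # placeholder that depends on the item index
        return None


print(ToyVLM.get_placeholder_str("image", 0))  # <image>
print(ToyVLM.get_placeholder_str("video", 2))  # <video_2>
print(ToyVLM.get_placeholder_str("audio", 0))  # None
```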

SupportsPP

Bases: Protocol

The interface required for all models that support pipeline parallel.

Attributes:

  • supports_pp (Literal[True]) –

    A flag that indicates this model supports pipeline parallel.

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class SupportsPP(Protocol):
    """The interface required for all models that support pipeline parallel."""

    supports_pp: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports pipeline parallel.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    def make_empty_intermediate_tensors(
        self,
        batch_size: int,
        dtype: torch.dtype,
        device: torch.device,
    ) -> IntermediateTensors:
        """Called when PP rank > 0 for profiling purposes."""
        ...

    def forward(
        self,
        input_ids: Tensor | None,
        positions: Tensor,
        *,
        intermediate_tensors: IntermediateTensors | None,
    ) -> IntermediateTensors | None:
        """
        Accept [`IntermediateTensors`][vllm.sequence.IntermediateTensors] when
        PP rank > 0.

        Return [`IntermediateTensors`][vllm.sequence.IntermediateTensors] only
        for the last PP rank.
        """
        ...
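
How a `runtime_checkable` protocol with a class-level flag behaves can be seen with a toy protocol. This sketch uses only the standard `typing` module; `ToySupportsPP` and the model classes are hypothetical, and it makes no claim about how vLLM itself tests for pipeline-parallel support.

```python
from typing import ClassVar, Literal, Protocol, runtime_checkable


@runtime_checkable
class ToySupportsPP(Protocol):
    supports_pp: ClassVar[Literal[True]] = True

    def make_empty_intermediate_tensors(self, batch_size: int) -> dict: ...


class StructuralModel:
    """Matches the protocol structurally, so it must define the flag itself."""

    supports_pp = True

    def make_empty_intermediate_tensors(self, batch_size: int) -> dict:
        return {"hidden_states": [0.0] * batch_size}


class InheritingModel(ToySupportsPP):
    """Has the protocol in its MRO, so the flag is inherited."""

    def make_empty_intermediate_tensors(self, batch_size: int) -> dict:
        return {"hidden_states": [0.0] * batch_size}


class PlainModel:
    pass


print(isinstance(StructuralModel(), ToySupportsPP))  # True
print(isinstance(InheritingModel(), ToySupportsPP))  # True
print(InheritingModel.supports_pp)                   # True
print(isinstance(PlainModel(), ToySupportsPP))       # False
```

The runtime check only confirms that the members exist; it does not verify signatures or return types.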

supports_pp = True class-attribute

A flag that indicates this model supports pipeline parallel.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

forward(input_ids, positions, *, intermediate_tensors)

Accept IntermediateTensors when PP rank > 0.

Return IntermediateTensors only for the last PP rank.

Source code in vllm/model_executor/models/interfaces.py
def forward(
    self,
    input_ids: Tensor | None,
    positions: Tensor,
    *,
    intermediate_tensors: IntermediateTensors | None,
) -> IntermediateTensors | None:
    """
    Accept [`IntermediateTensors`][vllm.sequence.IntermediateTensors] when
    PP rank > 0.

    Return [`IntermediateTensors`][vllm.sequence.IntermediateTensors] only
    for the last PP rank.
    """
    ...

make_empty_intermediate_tensors(batch_size, dtype, device)

Called when PP rank > 0 for profiling purposes.

Source code in vllm/model_executor/models/interfaces.py
def make_empty_intermediate_tensors(
    self,
    batch_size: int,
    dtype: torch.dtype,
    device: torch.device,
) -> IntermediateTensors:
    """Called when PP rank > 0 for profiling purposes."""
    ...

SupportsTranscription

Bases: Protocol

The interface required for all models that support transcription.

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class SupportsTranscription(Protocol):
    """The interface required for all models that support transcription."""

    # Mapping from ISO 639-1 language codes to language names
    supported_languages: ClassVar[Mapping[str, str]]

    supports_transcription: ClassVar[Literal[True]] = True

    supports_transcription_only: ClassVar[bool] = False
    """
    Transcription models can opt out of text generation by setting this to
    `True`.
    """
    supports_segment_timestamp: ClassVar[bool] = False
    """
    Enables the segment timestamp option for supported models by setting this to `True`.
    """

    supports_explicit_language_detection: ClassVar[bool] = False
    """
    Transcription models that require an explicit language detection step
    (e.g. Whisper needs a separate forward pass to predict the language
    token) should set this to ``True`` and implement
    :meth:`get_language_detection_prompt` and
    :meth:`parse_language_detection_output` and
    :meth:`get_language_token_ids`.
    """

    no_space_languages: ClassVar[set[str]] = {"ja", "zh"}
    """
    Languages that don't need a space between words.
    For example, Japanese (ja) and Chinese (zh) don't need a space between words.
    """

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # language codes in supported_languages
        # that don't exist in the full language map
        invalid = set(cls.supported_languages) - set(LANGUAGES.keys())
        if invalid:
            raise ValueError(
                f"{cls.__name__}.supported_languages contains invalid "
                f"language codes: {sorted(invalid)}.\n"
                f"Valid choices are: {sorted(LANGUAGES.keys())}"
            )

    @classmethod
    def get_generation_prompt(
        cls,
        stt_params: SpeechToTextParams,
    ) -> PromptType:
        """Get the prompt for the ASR model.
        The model has control over the construction, as long as it
        returns a valid PromptType."""
        ...

    @classmethod
    def get_other_languages(cls) -> Mapping[str, str]:
        # other possible language codes from the whisper map
        return {k: v for k, v in LANGUAGES.items() if k not in cls.supported_languages}

    @classmethod
    def validate_language(cls, language: str | None) -> str | None:
        """
        Ensure the language specified in the transcription request
        is a valid ISO 639-1 language code. If the request language is
        valid, but not natively supported by the model, trigger a
        warning (but not an exception).
        """
        if language is None or language in cls.supported_languages:
            return language
        elif language in cls.get_other_languages():
            logger.warning(
                "Language %r is not natively supported by %s; "
                "results may be less accurate. Supported languages: %r",
                language,
                cls.__name__,
                list(cls.supported_languages.keys()),
            )
            return language
        else:
            raise ValueError(
                f"Unsupported language: {language!r}.  Must be one of "
                f"{list(cls.supported_languages.keys())}."
            )

    @classmethod
    def get_speech_to_text_config(
        cls, model_config: ModelConfig, task_type: Literal["transcribe", "translate"]
    ) -> SpeechToTextConfig:
        """Get the speech to text config for the ASR model."""
        ...

    @classmethod
    def get_num_audio_tokens(
        cls,
        audio_duration_s: float,
        stt_config: SpeechToTextConfig,
        model_config: ModelConfig,
    ) -> int | None:
        """
        Map from audio duration to number of audio tokens produced by the ASR
        model, without running a forward pass.
        This is used for estimating the amount of processing for this audio.
        """
        return None

    @classmethod
    def post_process_output(cls, text: str) -> str:
        """
        Post-process the raw model output text.

        Some ASR models output structured formats (e.g., language tags,
        special tokens) that need to be stripped before returning to the user.

        Args:
            text: Raw decoded text from the model.

        Returns:
            Cleaned transcription text.
        """
        return text

    @classmethod
    def get_language_detection_prompt(
        cls,
        audio: np.ndarray,
        stt_config: SpeechToTextConfig,
    ) -> PromptType:
        """Return a prompt that triggers language detection.

        Only needs to be implemented when
        ``supports_explicit_language_detection`` is ``True``.
        """
        raise NotImplementedError

    @classmethod
    def parse_language_detection_output(
        cls,
        token_ids: list[int],
        tokenizer: object,
    ) -> str:
        """Parse the detected language from model output token IDs.

        Only needs to be implemented when
        ``supports_explicit_language_detection`` is ``True``.
        """
        raise NotImplementedError

    @classmethod
    def get_language_token_ids(
        cls,
        tokenizer: object,
    ) -> list[int] | None:
        """Return token IDs that represent valid language tokens.

        Used to constrain language detection to only produce valid language tokens.

        Only needs to be implemented when
        ``supports_explicit_language_detection`` is ``True``.
        """
        raise NotImplementedError
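
The `__init_subclass__` hook in the listing rejects a subclass at class-definition time if it declares an unknown language code. A reduced sketch of that pattern, with a hypothetical base class and an illustrative three-entry `LANGUAGES` map in place of the full one:

```python
LANGUAGES = {"en": "english", "de": "german", "ja": "japanese"}  # illustrative subset


class ToyTranscriptionBase:
    supported_languages: dict = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Codes declared by the subclass that the full map does not know.
        invalid = set(cls.supported_languages) - set(LANGUAGES)
        if invalid:
            raise ValueError(
                f"{cls.__name__}.supported_languages contains invalid "
                f"language codes: {sorted(invalid)}"
            )


class GoodModel(ToyTranscriptionBase):
    supported_languages = {"en": "english"}


try:
    class BadModel(ToyTranscriptionBase):
        supported_languages = {"xx": "unknown"}
except ValueError as exc:
    error = str(exc)
else:
    error = None

print(error)
```

Because the check runs when the class body is evaluated, a bad mapping fails at import time rather than on the first request.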

no_space_languages = {'ja', 'zh'} class-attribute

Languages that don't need a space between words. For example, Japanese (ja) and Chinese (zh) don't need a space between words.

supports_explicit_language_detection = False class-attribute

Transcription models that require an explicit language detection step (e.g. Whisper needs a separate forward pass to predict the language token) should set this to True and implement get_language_detection_prompt, parse_language_detection_output, and get_language_token_ids.

supports_segment_timestamp = False class-attribute

Enables the segment timestamp option for supported models by setting this to True.

supports_transcription_only = False class-attribute

Transcription models can opt out of text generation by setting this to True.

get_generation_prompt(stt_params) classmethod

Get the prompt for the ASR model. The model has control over the construction, as long as it returns a valid PromptType.

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def get_generation_prompt(
    cls,
    stt_params: SpeechToTextParams,
) -> PromptType:
    """Get the prompt for the ASR model.
    The model has control over the construction, as long as it
    returns a valid PromptType."""
    ...

get_language_detection_prompt(audio, stt_config) classmethod

Return a prompt that triggers language detection.

Only needs to be implemented when supports_explicit_language_detection is True.

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def get_language_detection_prompt(
    cls,
    audio: np.ndarray,
    stt_config: SpeechToTextConfig,
) -> PromptType:
    """Return a prompt that triggers language detection.

    Only needs to be implemented when
    ``supports_explicit_language_detection`` is ``True``.
    """
    raise NotImplementedError

get_language_token_ids(tokenizer) classmethod

Return token IDs that represent valid language tokens.

Used to constrain language detection to only produce valid language tokens.

Only needs to be implemented when supports_explicit_language_detection is True.

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def get_language_token_ids(
    cls,
    tokenizer: object,
) -> list[int] | None:
    """Return token IDs that represent valid language tokens.

    Used to constrain language detection to only produce valid language tokens.

    Only needs to be implemented when
    ``supports_explicit_language_detection`` is ``True``.
    """
    raise NotImplementedError

get_num_audio_tokens(audio_duration_s, stt_config, model_config) classmethod

Map from audio duration to number of audio tokens produced by the ASR model, without running a forward pass. This is used for estimating the amount of processing for this audio.

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def get_num_audio_tokens(
    cls,
    audio_duration_s: float,
    stt_config: SpeechToTextConfig,
    model_config: ModelConfig,
) -> int | None:
    """
    Map from audio duration to number of audio tokens produced by the ASR
    model, without running a forward pass.
    This is used for estimating the amount of processing for this audio.
    """
    return None
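
An override could compute the estimate from the duration alone. The sketch below assumes a hypothetical model that emits a fixed number of audio tokens per second; the function name and the rate of 50 are illustrative values, not taken from any real model or from vLLM's config objects.

```python
import math


def estimate_num_audio_tokens(audio_duration_s: float, tokens_per_second: int = 50) -> int:
    """Estimate the token count from the duration, without a forward pass.

    `tokens_per_second` is a made-up rate for illustration.
    """
    return math.ceil(audio_duration_s * tokens_per_second)


print(estimate_num_audio_tokens(1.5))   # 75
print(estimate_num_audio_tokens(0.01))  # 1
```

The default implementation shown above returns `None`, which signals that no such estimate is available.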

get_speech_to_text_config(model_config, task_type) classmethod

Get the speech to text config for the ASR model.

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def get_speech_to_text_config(
    cls, model_config: ModelConfig, task_type: Literal["transcribe", "translate"]
) -> SpeechToTextConfig:
    """Get the speech to text config for the ASR model."""
    ...

parse_language_detection_output(token_ids, tokenizer) classmethod

Parse the detected language from model output token IDs.

Only needs to be implemented when supports_explicit_language_detection is True.

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def parse_language_detection_output(
    cls,
    token_ids: list[int],
    tokenizer: object,
) -> str:
    """Parse the detected language from model output token IDs.

    Only needs to be implemented when
    ``supports_explicit_language_detection`` is ``True``.
    """
    raise NotImplementedError

post_process_output(text) classmethod

Post-process the raw model output text.

Some ASR models output structured formats (e.g., language tags, special tokens) that need to be stripped before returning to the user.

Parameters:

  • text

    (str) –

    Raw decoded text from the model.

Returns:

  • str

    Cleaned transcription text.

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def post_process_output(cls, text: str) -> str:
    """
    Post-process the raw model output text.

    Some ASR models output structured formats (e.g., language tags,
    special tokens) that need to be stripped before returning to the user.

    Args:
        text: Raw decoded text from the model.

    Returns:
        Cleaned transcription text.
    """
    return text
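
An override might strip structured markers from the decoded text. The following sketch assumes a hypothetical `<|...|>` tag format; the class name and the regular expression are illustrative and not taken from any specific model.

```python
import re

# Matches tags such as <|en|> or <|endoftext|> (assumed format).
_TAG = re.compile(r"<\|[^|>]*\|>")


class ToyASR:
    """Hypothetical model class used only to illustrate the hook."""

    @classmethod
    def post_process_output(cls, text: str) -> str:
        # Drop every tag, then trim the whitespace left at either end.
        return _TAG.sub("", text).strip()


raw = "<|en|><|transcribe|> hello world <|endoftext|>"
print(ToyASR.post_process_output(raw))  # hello world
```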

validate_language(language) classmethod

Ensure the language specified in the transcription request is a valid ISO 639-1 language code. If the request language is valid, but not natively supported by the model, trigger a warning (but not an exception).

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def validate_language(cls, language: str | None) -> str | None:
    """
    Ensure the language specified in the transcription request
    is a valid ISO 639-1 language code. If the request language is
    valid, but not natively supported by the model, trigger a
    warning (but not an exception).
    """
    if language is None or language in cls.supported_languages:
        return language
    elif language in cls.get_other_languages():
        logger.warning(
            "Language %r is not natively supported by %s; "
            "results may be less accurate. Supported languages: %r",
            language,
            cls.__name__,
            list(cls.supported_languages.keys()),
        )
        return language
    else:
        raise ValueError(
            f"Unsupported language: {language!r}.  Must be one of "
            f"{list(cls.supported_languages.keys())}."
        )
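
The three outcomes (accepted, accepted with a warning, rejected) can be reproduced with a standalone function. This is a sketch with illustrative data: `LANGUAGES` and `SUPPORTED` are small hypothetical maps, and the function is module-level rather than a classmethod.

```python
import logging

logger = logging.getLogger(__name__)

LANGUAGES = {"en": "english", "de": "german", "ja": "japanese"}  # illustrative subset
SUPPORTED = {"en": "english"}  # what the toy model handles natively


def validate_language(language):
    if language is None or language in SUPPORTED:
        return language
    if language in LANGUAGES:
        # Known code, but not natively supported: warn and accept.
        logger.warning(
            "Language %r is not natively supported; results may be less accurate.",
            language,
        )
        return language
    raise ValueError(f"Unsupported language: {language!r}")


print(validate_language(None))  # None
print(validate_language("en"))  # en
print(validate_language("de"))  # de (after logging a warning)
```

Only a code that is absent from the full language map raises `ValueError`.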

VllmModelForPooling

Bases: VllmModel[T_co], Protocol[T_co]

The interface required for all pooling models in vLLM.

Source code in vllm/model_executor/models/interfaces_base.py
@runtime_checkable
class VllmModelForPooling(VllmModel[T_co], Protocol[T_co]):
    """The interface required for all pooling models in vLLM."""

    is_pooling_model: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports pooling.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    default_seq_pooling_type: ClassVar[SequencePoolingType] = "LAST"
    """
    Indicates the [vllm.config.pooler.PoolerConfig.seq_pooling_type][]
    to use by default.

    You can use the
    [vllm.model_executor.models.interfaces_base.default_pooling_type][]
    decorator to conveniently set this field.
    """

    default_tok_pooling_type: ClassVar[TokenPoolingType] = "ALL"
    """
    Indicates the [vllm.config.pooler.PoolerConfig.tok_pooling_type][]
    to use by default.

    You can use the
    [vllm.model_executor.models.interfaces_base.default_pooling_type][]
    decorator to conveniently set this field.
    """

    attn_type: ClassVar[AttnTypeStr] = "decoder"
    """
    Indicates the
    [vllm.config.model.ModelConfig.attn_type][]
    to use by default.

    You can use the
    [vllm.model_executor.models.interfaces_base.attn_type][]
    decorator to conveniently set this field.
    """

    score_type: ClassVar[ScoreType] = "bi-encoder"
    """
    Indicates the
    [vllm.config.model.ModelConfig.score_type][]
    to use by default.

    Scoring API handles score/rerank for:\n
    - "classify" task (score_type: cross-encoder models)\n
    - "embed" task (score_type: bi-encoder models)\n
    - "token_embed" task (score_type: late interaction models)\n

    score_type defaults to bi-encoder, then the Score API uses the "embed" task.\n
    If you set score_type to cross-encoder via 
    [vllm.model_executor.models.interfaces.SupportsCrossEncoding][], 
    then the Score API uses the "score" task.\n
    If you set score_type to late-interaction via 
    [vllm.model_executor.models.interfaces.SupportsLateInteraction][], 
    then the Score API uses the "token_embed" task.\n
    """

    pooler: Pooler
    """The pooler is only called on TP rank 0."""

attn_type = 'decoder' class-attribute

Indicates the vllm.config.model.ModelConfig.attn_type to use by default.

You can use the vllm.model_executor.models.interfaces_base.attn_type decorator to conveniently set this field.

default_seq_pooling_type = 'LAST' class-attribute

Indicates the vllm.config.pooler.PoolerConfig.seq_pooling_type to use by default.

You can use the vllm.model_executor.models.interfaces_base.default_pooling_type decorator to conveniently set this field.

default_tok_pooling_type = 'ALL' class-attribute

Indicates the vllm.config.pooler.PoolerConfig.tok_pooling_type to use by default.

You can use the vllm.model_executor.models.interfaces_base.default_pooling_type decorator to conveniently set this field.
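
A class decorator that sets such defaults can be sketched in a few lines. The decorator below is a toy written for this example: its name, its keyword parameters, and the "CLS" value are assumptions, and it should not be read as the signature of the real `default_pooling_type` decorator.

```python
def toy_default_pooling_type(*, seq_pooling_type="LAST", tok_pooling_type="ALL"):
    """Return a class decorator that sets both default pooling fields."""

    def decorate(cls):
        cls.default_seq_pooling_type = seq_pooling_type
        cls.default_tok_pooling_type = tok_pooling_type
        return cls

    return decorate


@toy_default_pooling_type(seq_pooling_type="CLS")
class ToyEmbeddingModel:
    pass


print(ToyEmbeddingModel.default_seq_pooling_type)  # CLS
print(ToyEmbeddingModel.default_tok_pooling_type)  # ALL
```

Setting the fields through a decorator keeps the defaults next to the class definition without repeating the attribute names in each model.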

is_pooling_model = True class-attribute

A flag that indicates this model supports pooling.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

pooler instance-attribute

The pooler is only called on TP rank 0.

score_type = 'bi-encoder' class-attribute

Indicates the vllm.config.model.ModelConfig.score_type to use by default.

Scoring API handles score/rerank for:

  • "classify" task (score_type: cross-encoder models)

  • "embed" task (score_type: bi-encoder models)

  • "token_embed" task (score_type: late interaction models)

score_type defaults to bi-encoder, in which case the Score API uses the "embed" task.

If you set score_type to cross-encoder via vllm.model_executor.models.interfaces.SupportsCrossEncoding, then the Score API uses the "score" task.

If you set score_type to late-interaction via vllm.model_executor.models.interfaces.SupportsLateInteraction, then the Score API uses the "token_embed" task.

VllmModelForTextGeneration

Bases: VllmModel[T], Protocol[T]

The interface required for all generative models in vLLM.

Source code in vllm/model_executor/models/interfaces_base.py
@runtime_checkable
class VllmModelForTextGeneration(VllmModel[T], Protocol[T]):
    """The interface required for all generative models in vLLM."""

    def compute_logits(
        self,
        hidden_states: T,
    ) -> T | None:
        """Return `None` if TP rank > 0."""
        ...

compute_logits(hidden_states)

Return None if TP rank > 0.

Source code in vllm/model_executor/models/interfaces_base.py
def compute_logits(
    self,
    hidden_states: T,
) -> T | None:
    """Return `None` if TP rank > 0."""
    ...