`vllm.inputs.engine` ¶

Schema and utilities for inputs to the engine client (LLMEngine/AsyncLLM).

Classes:

- `EmbedsInput` – Represents embeddings-based input to the engine.
- `EncoderDecoderInput` – A rendered `EncoderDecoderPrompt`.
- `MultiModalEncDecInput` – Represents multi-modal input to the engine for encoder-decoder models.
- `MultiModalInput` – Represents multi-modal input to the engine.
- `TokensInput` – Represents token-based input to the engine.

Functions:

- `embeds_input` – Construct `EmbedsInput`.
- `tokens_input` – Construct `TokensInput`.

Attributes:

- `DecoderEngineInput` (TypeAlias) – A rendered `DecoderPrompt`.
- `DecoderOnlyEngineInput` (TypeAlias) – A rendered `DecoderOnlyPrompt`.
- `EncoderInput` (TypeAlias) – A rendered `EncoderPrompt`.
- `EngineInput` (TypeAlias) – A rendered `PromptType`.
- `MultiModalHashes` (TypeAlias) – A dictionary containing per-item hashes for each modality.
- `MultiModalPlaceholders` (TypeAlias) – A dictionary containing per-item placeholder ranges for each modality.
- `SingletonInput` (TypeAlias) – A rendered `SingletonPrompt`.

`DecoderEngineInput = TokensInput | MultiModalInput` `module-attribute` ¶

A rendered DecoderPrompt which can be passed to LLMEngine.add_request or AsyncLLM.add_request.

`DecoderOnlyEngineInput = TokensInput | EmbedsInput | MultiModalInput` `module-attribute` ¶

A rendered DecoderOnlyPrompt which can be passed to LLMEngine.add_request or AsyncLLM.add_request.

`EncoderInput = TokensInput | MultiModalEncDecInput` `module-attribute` ¶

A rendered EncoderPrompt which can be passed to LLMEngine.add_request or AsyncLLM.add_request.

`EngineInput = DecoderOnlyEngineInput | EncoderDecoderInput` `module-attribute` ¶

A rendered PromptType which can be passed to LLMEngine.add_request or AsyncLLM.add_request.

`MultiModalHashes = Mapping[str, list[str]]` `module-attribute` ¶

A dictionary containing per-item hashes for each modality.

`MultiModalPlaceholders = Mapping[str, Sequence['PlaceholderRange']]` `module-attribute` ¶

A dictionary containing per-item placeholder ranges for each modality.

`SingletonInput = DecoderOnlyEngineInput | MultiModalEncDecInput` `module-attribute` ¶

A rendered SingletonPrompt which can be passed to LLMEngine.add_request or AsyncLLM.add_request.
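Every member of these unions is a `TypedDict`, so at runtime an input is an ordinary `dict`, and `isinstance` checks against the classes raise `TypeError`. Code that handles an `EngineInput` therefore has to branch on its keys. Below is a minimal sketch that assumes only the `type` literals appearing on this page (`"token"`, `"embeds"`, `"multimodal"`, `"enc_dec"`). `describe_input` is a hypothetical helper, not a vLLM function, and plain dicts stand in for real inputs so the sketch runs without vLLM installed.

```python
from __future__ import annotations

from typing import Any


def describe_input(engine_input: dict[str, Any]) -> str:
    """Hypothetical helper: summarize an EngineInput-shaped dict by its `type` key."""
    kind = engine_input["type"]
    if kind == "token":
        return f"token[{len(engine_input['prompt_token_ids'])}]"
    if kind == "embeds":
        # A mixed-mode EmbedsInput carries both prompt_token_ids and is_token_ids.
        if "is_token_ids" in engine_input:
            return "embeds (mixed-mode)"
        return "embeds"
    if kind == "multimodal":
        # MultiModalEncDecInput inherits type="multimodal" from MultiModalInput,
        # so the extra encoder key is what distinguishes the two.
        if "encoder_prompt_token_ids" in engine_input:
            return "multimodal (encoder-decoder)"
        return "multimodal"
    if kind == "enc_dec":
        # EncoderDecoderInput nests one encoder input and one decoder input.
        enc = describe_input(engine_input["encoder_prompt"])
        dec = describe_input(engine_input["decoder_prompt"])
        return f"enc_dec[{enc} | {dec}]"
    raise ValueError(f"unknown input type: {kind!r}")


tokens = {"type": "token", "prompt_token_ids": [1, 2, 3]}
print(describe_input(tokens))  # token[3]

enc_dec = {
    "type": "enc_dec",
    "encoder_prompt": tokens,
    "decoder_prompt": {"type": "token", "prompt_token_ids": [0]},
}
print(describe_input(enc_dec))  # enc_dec[token[3] | token[1]]
```

Because `MultiModalEncDecInput` does not override `type`, checking `type` alone cannot separate it from a plain `MultiModalInput`.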

`EmbedsInput` ¶

Bases: _InputOptions

Represents embeddings-based input to the engine.

Attributes:

- `is_token_ids` (NotRequired[list[bool]]) – Per-position mask for mixed-mode inputs. `True` means the position is a real token ID; `False` means it uses a pre-computed embedding row from `prompt_embeds`.
- `prompt` (NotRequired[str]) – The prompt text corresponding to the token IDs, if available.
- `prompt_embeds` (Tensor) – The embeddings of the prompt.
- `prompt_token_ids` (NotRequired[list[int]]) – Token IDs of the rendered prompt. Only set for mixed-mode inputs (chat completion with `prompt_embeds` content parts).
- `type` (Literal['embeds']) – The type of input.

Source code in vllm/inputs/engine.py

```python
class EmbedsInput(_InputOptions):
    """Represents embeddings-based input to the engine."""

    type: Literal["embeds"]
    """The type of input."""

    prompt_embeds: "torch.Tensor"
    """The embeddings of the prompt."""

    prompt: NotRequired[str]
    """The prompt text corresponding to the token IDs, if available."""

    prompt_token_ids: NotRequired[list[int]]
    """Token IDs of the rendered prompt. Only set for mixed-mode inputs
    (chat completion with `prompt_embeds` content parts). When present,
    `is_token_ids` MUST also be present and have the same length.
    For pure-embeds inputs this field is absent."""

    is_token_ids: NotRequired[list[bool]]
    """Per-position mask for mixed-mode inputs. `True` means the position
    is a real token ID (use the model's embedding layer); `False` means
    the position uses a pre-computed embedding row from `prompt_embeds`.
    Length MUST equal `len(prompt_token_ids)`.
    For pure-embeds inputs this field is absent."""
```
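Read together, the two docstrings above imply that `prompt_token_ids` and `is_token_ids` are either both present (mixed mode) or both absent (pure embeds), and that when present they have the same length. Below is a minimal sketch of a check for exactly those invariants. `check_mixed_mode` is a hypothetical helper, not part of vLLM. It accepts any mapping, so `None` stands in for the `prompt_embeds` tensor. It deliberately leaves `prompt_embeds` unchecked, because its expected shape in mixed mode is not specified here.

```python
from collections.abc import Mapping
from typing import Any


def check_mixed_mode(embeds_input: Mapping[str, Any]) -> bool:
    """Hypothetical validator for the EmbedsInput mixed-mode invariants.

    Returns True for a mixed-mode input, False for a pure-embeds input,
    and raises ValueError if the documented invariants are violated.
    """
    has_ids = "prompt_token_ids" in embeds_input
    has_mask = "is_token_ids" in embeds_input
    if has_ids != has_mask:
        raise ValueError("prompt_token_ids and is_token_ids must be set together")
    if not has_ids:
        return False  # pure-embeds input: both fields absent
    n_ids = len(embeds_input["prompt_token_ids"])
    n_mask = len(embeds_input["is_token_ids"])
    if n_ids != n_mask:
        raise ValueError(f"length mismatch: {n_ids} token IDs vs {n_mask} mask entries")
    return True


pure = {"type": "embeds", "prompt_embeds": None}  # None stands in for a tensor
mixed = {
    "type": "embeds",
    "prompt_embeds": None,
    # IDs at the False positions are arbitrary values for this sketch.
    "prompt_token_ids": [101, 0, 0, 102],
    "is_token_ids": [True, False, False, True],
}
print(check_mixed_mode(pure))   # False
print(check_mixed_mode(mixed))  # True
```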

`is_token_ids` `instance-attribute` ¶

Per-position mask for mixed-mode inputs. `True` means the position is a real token ID (use the model's embedding layer); `False` means the position uses a pre-computed embedding row from `prompt_embeds`. Length MUST equal `len(prompt_token_ids)`. For pure-embeds inputs this field is absent.

`prompt` `instance-attribute` ¶

The prompt text corresponding to the token IDs, if available.

`prompt_embeds` `instance-attribute` ¶

The embeddings of the prompt.

`prompt_token_ids` `instance-attribute` ¶

Token IDs of the rendered prompt. Only set for mixed-mode inputs (chat completion with `prompt_embeds` content parts). When present, `is_token_ids` MUST also be present and have the same length. For pure-embeds inputs this field is absent.

`type` `instance-attribute` ¶

The type of input.

`EncoderDecoderInput` ¶

Bases: TypedDict

A rendered EncoderDecoderPrompt which can be passed to LLMEngine.add_request or AsyncLLM.add_request.

Attributes:

- `arrival_time` (NotRequired[float]) – The time when the input was received (before rendering).
- `decoder_prompt` (DecoderEngineInput) – The inputs for the decoder portion.
- `encoder_prompt` (EncoderInput) – The inputs for the encoder portion.
- `type` (Literal['enc_dec']) – The type of input.

Source code in vllm/inputs/engine.py

```python
class EncoderDecoderInput(TypedDict):
    """
    A rendered [`EncoderDecoderPrompt`][vllm.inputs.llm.EncoderDecoderPrompt]
    which can be passed to `LLMEngine.add_request` or `AsyncLLM.add_request`.
    """

    type: Literal["enc_dec"]

    encoder_prompt: EncoderInput
    """The inputs for the encoder portion."""

    decoder_prompt: DecoderEngineInput
    """The inputs for the decoder portion."""

    arrival_time: NotRequired[float]
    """The time when the input was received (before rendering)."""
```

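Below is a minimal sketch of assembling an `EncoderDecoderInput`-shaped dict by hand. `make_enc_dec_input` is hypothetical, since this page documents no constructor for `EncoderDecoderInput`. Both halves are kept `TokensInput`-shaped for simplicity. The optional `arrival_time` key is set only when a value is given, following the convention of `tokens_input` and `embeds_input`. Per the note on `MultiModalEncDecInput`, a real text-only encoder half may instead be a `MultiModalEncDecInput`.

```python
from __future__ import annotations

from typing import Any


def make_enc_dec_input(
    encoder_token_ids: list[int],
    decoder_token_ids: list[int],
    *,
    arrival_time: float | None = None,
) -> dict[str, Any]:
    """Hypothetical helper: build an EncoderDecoderInput-shaped dict.

    Both halves are TokensInput-shaped here; per the type aliases, the encoder
    half may also be a MultiModalEncDecInput and the decoder half a MultiModalInput.
    """
    inputs: dict[str, Any] = {
        "type": "enc_dec",
        "encoder_prompt": {"type": "token", "prompt_token_ids": encoder_token_ids},
        "decoder_prompt": {"type": "token", "prompt_token_ids": decoder_token_ids},
    }
    # Only set the NotRequired key when a value is given.
    if arrival_time is not None:
        inputs["arrival_time"] = arrival_time
    return inputs


enc_dec = make_enc_dec_input([10, 11, 12], [2], arrival_time=1700000000.0)
print(enc_dec["type"], enc_dec["decoder_prompt"]["prompt_token_ids"])  # enc_dec [2]
```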
`arrival_time` `instance-attribute` ¶

The time when the input was received (before rendering).

`decoder_prompt` `instance-attribute` ¶

The inputs for the decoder portion.

`encoder_prompt` `instance-attribute` ¶

The inputs for the encoder portion.

`MultiModalEncDecInput` ¶

Bases: MultiModalInput

Represents multi-modal input to the engine for encoder-decoder models.

Note

Even text-only encoder-decoder models are currently implemented as multi-modal models for convenience. (Example: https://github.com/vllm-project/bart-plugin)

Attributes:

- `encoder_prompt` (NotRequired[str]) – The prompt text corresponding to the encoder token IDs, if available.
- `encoder_prompt_token_ids` (list[int]) – The processed token IDs of the encoder prompt.

Source code in vllm/inputs/engine.py

```python
class MultiModalEncDecInput(MultiModalInput):
    """
    Represents multi-modal input to the engine for encoder-decoder models.

    Note:
        Even text-only encoder-decoder models are currently implemented
        as multi-modal models for convenience.
        (Example: https://github.com/vllm-project/bart-plugin)
    """

    encoder_prompt_token_ids: list[int]
    """The processed token IDs of the encoder prompt."""

    encoder_prompt: NotRequired[str]
    """The prompt text corresponding to the encoder token IDs, if available."""
```

`encoder_prompt` `instance-attribute` ¶

The prompt text corresponding to the encoder token IDs, if available.

`encoder_prompt_token_ids` `instance-attribute` ¶

The processed token IDs of the encoder prompt.

`MultiModalInput` ¶

Bases: _InputOptions

Represents multi-modal input to the engine.

Attributes:

- `mm_hashes` (MultiModalHashes) – The hashes of the multi-modal data.
- `mm_kwargs` (MultiModalKwargsOptionalItems) – Keyword arguments to be directly passed to the model after batching.
- `mm_placeholders` (MultiModalPlaceholders) – For each modality, information about the placeholder tokens in `prompt_token_ids`.
- `prompt` (NotRequired[str]) – The prompt text corresponding to the token IDs, if available.
- `prompt_token_ids` (list[int]) – The processed token IDs, including placeholder tokens.
- `type` (Literal['multimodal']) – The type of input.

Source code in vllm/inputs/engine.py

```python
class MultiModalInput(_InputOptions):
    """Represents multi-modal input to the engine."""

    type: Literal["multimodal"]
    """The type of input."""

    prompt_token_ids: list[int]
    """The processed token IDs which includes placeholder tokens."""

    prompt: NotRequired[str]
    """The prompt text corresponding to the token IDs, if available."""

    mm_kwargs: "MultiModalKwargsOptionalItems"
    """Keyword arguments to be directly passed to the model after batching."""

    mm_hashes: MultiModalHashes
    """The hashes of the multi-modal data."""

    mm_placeholders: MultiModalPlaceholders
    """
    For each modality, information about the placeholder tokens in
    `prompt_token_ids`.
    """
```

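`mm_placeholders` locates each multi-modal item inside `prompt_token_ids`, while `mm_hashes` holds one hash per item. Below is a minimal sketch of a consistency check over those two fields. It makes two assumptions: each `PlaceholderRange` exposes integer `offset` and `length` fields (modeled here by the stand-in `PlaceholderRangeStub`), and both mappings list the same number of items per modality. `check_placeholders` is a hypothetical helper, not a vLLM API.

```python
from collections.abc import Mapping, Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceholderRangeStub:
    """Stand-in for vLLM's PlaceholderRange; assumes offset/length fields."""

    offset: int
    length: int


def check_placeholders(
    prompt_token_ids: Sequence[int],
    mm_placeholders: Mapping[str, Sequence[PlaceholderRangeStub]],
    mm_hashes: Mapping[str, Sequence[str]],
) -> None:
    """Hypothetical consistency check for a MultiModalInput-shaped record."""
    n_tokens = len(prompt_token_ids)
    for modality, ranges in mm_placeholders.items():
        # Assumption: one hash per item, so the per-modality counts match.
        n_hashes = len(mm_hashes.get(modality, ()))
        if n_hashes != len(ranges):
            raise ValueError(
                f"{modality}: {len(ranges)} placeholder ranges but {n_hashes} hashes"
            )
        for rng in ranges:
            # Each range must lie entirely within prompt_token_ids.
            if rng.offset < 0 or rng.length < 0 or rng.offset + rng.length > n_tokens:
                raise ValueError(f"{modality}: {rng} exceeds prompt of length {n_tokens}")


# Illustrative values: positions 1-3 hold the placeholder tokens for one image.
ids = [1, 9, 9, 9, 2]
check_placeholders(
    ids,
    {"image": [PlaceholderRangeStub(offset=1, length=3)]},
    {"image": ["hash-0"]},
)
```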
`mm_hashes` `instance-attribute` ¶

The hashes of the multi-modal data.

`mm_kwargs` `instance-attribute` ¶

Keyword arguments to be directly passed to the model after batching.

`mm_placeholders` `instance-attribute` ¶

For each modality, information about the placeholder tokens in `prompt_token_ids`.

`prompt` `instance-attribute` ¶

The prompt text corresponding to the token IDs, if available.

`prompt_token_ids` `instance-attribute` ¶

The processed token IDs, including placeholder tokens.

`type` `instance-attribute` ¶

The type of input.

`TokensInput` ¶

Bases: _InputOptions

Represents token-based input to the engine.

Attributes:

- `prompt` (NotRequired[str]) – The prompt text corresponding to the token IDs, if available.
- `prompt_token_ids` (list[int]) – The token IDs of the prompt.
- `type` (Literal['token']) – The type of input.

Source code in vllm/inputs/engine.py

```python
class TokensInput(_InputOptions):
    """Represents token-based input to the engine."""

    type: Literal["token"]
    """The type of input."""

    prompt_token_ids: list[int]
    """The token IDs of the prompt."""

    prompt: NotRequired[str]
    """The prompt text corresponding to the token IDs, if available."""
```

`prompt` `instance-attribute` ¶

The prompt text corresponding to the token IDs, if available.

`prompt_token_ids` `instance-attribute` ¶

The token IDs of the prompt.

`type` `instance-attribute` ¶

The type of input.

`_InputOptions` ¶

Bases: TypedDict

Additional options available to all SingletonInput types.

Attributes:

- `arrival_time` (NotRequired[float]) – The time when the input was received (before rendering).
- `cache_salt` (NotRequired[str]) – Optional cache salt to be used for prefix caching.

Source code in vllm/inputs/engine.py

```python
class _InputOptions(TypedDict):
    """
    Additional options available to all
    [`SingletonInput`][vllm.inputs.engine.SingletonInput] types.
    """

    arrival_time: NotRequired[float]
    """The time when the input was received (before rendering)."""

    cache_salt: NotRequired[str]
    """Optional cache salt to be used for prefix caching."""
```

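`_InputOptions` contributes its keys to every `SingletonInput` class through `TypedDict` inheritance. Below is a minimal sketch that mirrors this layering with stand-in classes and uses the standard `__required_keys__` and `__optional_keys__` attributes to show which keys end up required and which optional. The class names are hypothetical. The sketch also uses `total=False` classes in place of `NotRequired` so that it runs on Python versions before 3.11.

```python
from typing import Literal, TypedDict


class InputOptionsSketch(TypedDict, total=False):
    """Mirror of _InputOptions; total=False stands in for NotRequired."""

    arrival_time: float
    cache_salt: str


class _TokensRequiredSketch(InputOptionsSketch):
    # Keys declared in a total=True subclass are required.
    type: Literal["token"]
    prompt_token_ids: list[int]


class TokensInputSketch(_TokensRequiredSketch, total=False):
    # Optional key, like `prompt: NotRequired[str]` in TokensInput.
    prompt: str


print(sorted(TokensInputSketch.__required_keys__))  # ['prompt_token_ids', 'type']
print(sorted(TokensInputSketch.__optional_keys__))  # ['arrival_time', 'cache_salt', 'prompt']

# Calling a TypedDict class just builds a plain dict, as tokens_input does.
t = TokensInputSketch(type="token", prompt_token_ids=[1, 2])
```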
`arrival_time` `instance-attribute` ¶

The time when the input was received (before rendering).

`cache_salt` `instance-attribute` ¶

Optional cache salt to be used for prefix caching.

`_prepare_decoder_input_ids_for_generation(decoder_input_ids, decoder_start_token_id)` ¶

Prepare `decoder_input_ids` for generation with encoder-decoder models, according to `GenerationMixin._prepare_decoder_input_ids_for_generation()`: `decoder_start_token_id` is prepended unless the list is non-empty and already begins with it.

Source: https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/generation/utils.py

Source code in vllm/inputs/engine.py

```python
def _prepare_decoder_input_ids_for_generation(
    decoder_input_ids: list[int],
    decoder_start_token_id: int,
) -> list[int]:
    """
    Prepare `decoder_input_ids` for generation with encoder-decoder models,
    according to `GenerationMixin._prepare_decoder_input_ids_for_generation()`.

    Source:
    https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/generation/utils.py
    """
    if len(decoder_input_ids) == 0 or decoder_input_ids[0] != decoder_start_token_id:
        decoder_input_ids = [decoder_start_token_id] + decoder_input_ids

    return decoder_input_ids
```
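Below is a small worked example of the three cases the function handles: an empty list, a list missing the start token, and a list that already begins with it. To keep the snippet self-contained, the logic is copied into `prepare_decoder_ids`, a local stand-in name. The start token ID `2` is illustrative only.

```python
def prepare_decoder_ids(decoder_input_ids: list[int], decoder_start_token_id: int) -> list[int]:
    # Same logic as _prepare_decoder_input_ids_for_generation above.
    if len(decoder_input_ids) == 0 or decoder_input_ids[0] != decoder_start_token_id:
        decoder_input_ids = [decoder_start_token_id] + decoder_input_ids
    return decoder_input_ids


START = 2  # illustrative decoder start token ID
print(prepare_decoder_ids([], START))      # [2]
print(prepare_decoder_ids([7, 8], START))  # [2, 7, 8]

ids = [2, 7, 8]
out = prepare_decoder_ids(ids, START)
print(out, out is ids)                     # [2, 7, 8] True
```

If the start token is already present, the function returns the caller's list object itself, so mutating the result also mutates the input. If the start token is prepended, a new list is built and the input is left untouched.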

`embeds_input(prompt_embeds, *, prompt=None, cache_salt=None, prompt_token_ids=None, is_token_ids=None)` ¶

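Construct an `EmbedsInput` from `prompt_embeds`, setting each optional key (`prompt`, `cache_salt`, `prompt_token_ids`, `is_token_ids`) only when its argument is not `None`.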

Source code in vllm/inputs/engine.py

```python
def embeds_input(
    prompt_embeds: "torch.Tensor",
    *,
    prompt: str | None = None,
    cache_salt: str | None = None,
    prompt_token_ids: list[int] | None = None,
    is_token_ids: list[bool] | None = None,
) -> EmbedsInput:
    """
    Construct [`EmbedsInput`][vllm.inputs.engine.EmbedsInput]
    from optional values.
    """
    inputs = EmbedsInput(type="embeds", prompt_embeds=prompt_embeds)

    if prompt is not None:
        inputs["prompt"] = prompt
    if cache_salt is not None:
        inputs["cache_salt"] = cache_salt
    if prompt_token_ids is not None:
        inputs["prompt_token_ids"] = prompt_token_ids
    if is_token_ids is not None:
        inputs["is_token_ids"] = is_token_ids

    return inputs
```

`tokens_input(prompt_token_ids, *, prompt=None, cache_salt=None)` ¶

Construct a `TokensInput` from `prompt_token_ids`, setting each optional key (`prompt`, `cache_salt`) only when its argument is not `None`.

Source code in vllm/inputs/engine.py

```python
def tokens_input(
    prompt_token_ids: list[int],
    *,
    prompt: str | None = None,
    cache_salt: str | None = None,
) -> TokensInput:
    """
    Construct [`TokensInput`][vllm.inputs.engine.TokensInput]
    from optional values.
    """
    inputs = TokensInput(type="token", prompt_token_ids=prompt_token_ids)

    if prompt is not None:
        inputs["prompt"] = prompt
    if cache_salt is not None:
        inputs["cache_salt"] = cache_salt

    return inputs
```
