vllm.inputs.engine ¶
Schema and utilities for inputs to the engine client (LLMEngine/AsyncLLM).
Classes:
- EmbedsInput – Represents embeddings-based input to the engine.
- EncoderDecoderInput – A rendered EncoderDecoderPrompt.
- MultiModalEncDecInput – Represents multi-modal input to the engine for encoder-decoder models.
- MultiModalInput – Represents multi-modal input to the engine.
- TokensInput – Represents token-based input to the engine.
Functions:
- embeds_input – Construct EmbedsInput from optional values.
- tokens_input – Construct TokensInput from optional values.
Attributes:
- DecoderEngineInput (TypeAlias) – A rendered DecoderPrompt.
- DecoderOnlyEngineInput (TypeAlias) – A rendered DecoderOnlyPrompt.
- EncoderInput (TypeAlias) – A rendered EncoderPrompt.
- EngineInput (TypeAlias) – A rendered PromptType.
- MultiModalHashes (TypeAlias) – A dictionary containing per-item hashes for each modality.
- MultiModalPlaceholders (TypeAlias) – A dictionary containing per-item placeholder ranges for each modality.
- SingletonInput (TypeAlias) – A rendered SingletonPrompt.
DecoderEngineInput = TokensInput | MultiModalInput module-attribute ¶
A rendered DecoderPrompt which can be passed to LLMEngine.add_request or AsyncLLM.add_request.
DecoderOnlyEngineInput = TokensInput | EmbedsInput | MultiModalInput module-attribute ¶
A rendered DecoderOnlyPrompt which can be passed to LLMEngine.add_request or AsyncLLM.add_request.
EncoderInput = TokensInput | MultiModalEncDecInput module-attribute ¶
A rendered EncoderPrompt which can be passed to LLMEngine.add_request or AsyncLLM.add_request.
EngineInput = DecoderOnlyEngineInput | EncoderDecoderInput module-attribute ¶
A rendered PromptType which can be passed to LLMEngine.add_request or AsyncLLM.add_request.
MultiModalHashes = Mapping[str, list[str]] module-attribute ¶
A dictionary containing per-item hashes for each modality.
MultiModalPlaceholders = Mapping[str, Sequence['PlaceholderRange']] module-attribute ¶
A dictionary containing per-item placeholder ranges for each modality.
SingletonInput = DecoderOnlyEngineInput | MultiModalEncDecInput module-attribute ¶
A rendered SingletonPrompt which can be passed to LLMEngine.add_request or AsyncLLM.add_request.
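Because these aliases are unions of TypedDicts, a consumer usually has to narrow them at runtime. The following is a minimal sketch of one way to do that with plain dictionaries, not code from vLLM. The helper name `describe_engine_input` is hypothetical, and the sketch assumes that only EncoderDecoderInput carries a `decoder_prompt` key and that MultiModalEncDecInput can be recognized by its `encoder_prompt_token_ids` key.

```python
from typing import Any, Mapping


def describe_engine_input(engine_input: Mapping[str, Any]) -> str:
    """Return a short label for a rendered engine input (illustrative helper)."""
    # Assumption: only EncoderDecoderInput carries a "decoder_prompt" key.
    if "decoder_prompt" in engine_input:
        encoder = describe_engine_input(engine_input["encoder_prompt"])
        decoder = describe_engine_input(engine_input["decoder_prompt"])
        return f"encoder-decoder({encoder}, {decoder})"

    # Singleton inputs are told apart by their documented "type" literal.
    kind = engine_input["type"]
    if kind == "token":
        return "tokens"
    if kind == "embeds":
        return "embeds"
    if kind == "multimodal":
        # MultiModalEncDecInput extends MultiModalInput with encoder token IDs.
        if "encoder_prompt_token_ids" in engine_input:
            return "multimodal-enc-dec"
        return "multimodal"
    raise ValueError(f"unknown input type: {kind!r}")
```

Checking for the encoder-decoder shape first keeps the `type` lookup from failing on a dictionary that only holds the two nested prompts.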
EmbedsInput ¶
Bases: _InputOptions
Represents embeddings-based input to the engine.
Attributes:
- is_token_ids (NotRequired[list[bool]]) – Per-position mask for mixed-mode inputs. True means the position is a real token ID; False means the position uses a pre-computed embedding row from prompt_embeds.
- prompt (NotRequired[str]) – The prompt text corresponding to the token IDs, if available.
- prompt_embeds (Tensor) – The embeddings of the prompt.
- prompt_token_ids (NotRequired[list[int]]) – Token IDs of the rendered prompt. Only set for mixed-mode inputs (chat completion with prompt_embeds content parts).
- type (Literal['embeds']) – The type of input.
Source code in vllm/inputs/engine.py
is_token_ids instance-attribute ¶
Per-position mask for mixed-mode inputs. True means the position is a real token ID (use the model's embedding layer); False means the position uses a pre-computed embedding row from prompt_embeds. Length MUST equal len(prompt_token_ids). For pure-embeds inputs this field is absent.
prompt instance-attribute ¶
The prompt text corresponding to the token IDs, if available.
prompt_embeds instance-attribute ¶
The embeddings of the prompt.
prompt_token_ids instance-attribute ¶
Token IDs of the rendered prompt. Only set for mixed-mode inputs (chat completion with prompt_embeds content parts). When present, is_token_ids MUST also be present and have the same length. For pure-embeds inputs this field is absent.
type instance-attribute ¶
The type of input.
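The constraints on prompt_token_ids and is_token_ids (both present or both absent, and equal in length) can be checked mechanically. The sketch below is illustrative only: `is_mixed_mode` is a hypothetical helper rather than part of vLLM, nested lists stand in for the Tensor in prompt_embeds, and the filler ID `0` at the embedding position is an arbitrary assumption.

```python
from typing import Any, Mapping


def is_mixed_mode(embeds_input: Mapping[str, Any]) -> bool:
    """Return True for a mixed-mode input, False for a pure-embeds input.

    Raises ValueError when the documented field invariants are violated.
    """
    token_ids = embeds_input.get("prompt_token_ids")
    mask = embeds_input.get("is_token_ids")

    # Pure-embeds inputs carry neither field.
    if token_ids is None and mask is None:
        return False
    # The two fields must appear together.
    if token_ids is None or mask is None:
        raise ValueError("prompt_token_ids and is_token_ids must be set together")
    # The mask must cover exactly the positions of the token IDs.
    if len(token_ids) != len(mask):
        raise ValueError("is_token_ids must have len(prompt_token_ids) entries")
    return True


# Nested lists stand in for the Tensor; the helper does not inspect its shape.
pure = {"type": "embeds", "prompt_embeds": [[0.1, 0.2], [0.3, 0.4]]}
mixed = {
    "type": "embeds",
    "prompt_embeds": [[0.1, 0.2], [0.3, 0.4]],
    "prompt_token_ids": [101, 0, 102],  # 0 is an arbitrary filler (assumption)
    "is_token_ids": [True, False, True],
}
```

Here `is_mixed_mode(pure)` returns False and `is_mixed_mode(mixed)` returns True, while an input that sets only one of the two fields raises ValueError.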
EncoderDecoderInput ¶
Bases: TypedDict
A rendered EncoderDecoderPrompt which can be passed to LLMEngine.add_request or AsyncLLM.add_request.
Attributes:
- arrival_time (NotRequired[float]) – The time when the input was received (before rendering).
- decoder_prompt (DecoderEngineInput) – The inputs for the decoder portion.
- encoder_prompt (EncoderInput) – The inputs for the encoder portion.
Source code in vllm/inputs/engine.py
MultiModalEncDecInput ¶
Bases: MultiModalInput
Represents multi-modal input to the engine for encoder-decoder models.
Note
Even text-only encoder-decoder models are currently implemented as multi-modal models for convenience. (Example: https://github.com/vllm-project/bart-plugin)
Attributes:
- encoder_prompt (NotRequired[str]) – The prompt text corresponding to the encoder token IDs, if available.
- encoder_prompt_token_ids (list[int]) – The processed token IDs of the encoder prompt.
Source code in vllm/inputs/engine.py
MultiModalInput ¶
Bases: _InputOptions
Represents multi-modal input to the engine.
Attributes:
- mm_hashes (MultiModalHashes) – The hashes of the multi-modal data.
- mm_kwargs (MultiModalKwargsOptionalItems) – Keyword arguments to be directly passed to the model after batching.
- mm_placeholders (MultiModalPlaceholders) – For each modality, information about the placeholder tokens in prompt_token_ids.
- prompt (NotRequired[str]) – The prompt text corresponding to the token IDs, if available.
- prompt_token_ids (list[int]) – The processed token IDs, which include placeholder tokens.
- type (Literal['multimodal']) – The type of input.
Source code in vllm/inputs/engine.py
mm_hashes instance-attribute ¶
The hashes of the multi-modal data.
mm_kwargs instance-attribute ¶
Keyword arguments to be directly passed to the model after batching.
mm_placeholders instance-attribute ¶
For each modality, information about the placeholder tokens in prompt_token_ids.
prompt instance-attribute ¶
The prompt text corresponding to the token IDs, if available.
prompt_token_ids instance-attribute ¶
The processed token IDs, which include placeholder tokens.
type instance-attribute ¶
The type of input.
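To make the relationship between mm_placeholders and prompt_token_ids concrete, here is a small sketch that slices the placeholder tokens out of a prompt. It is not code from vLLM. `Range` is a hypothetical stand-in for PlaceholderRange and assumes that type exposes `offset` and `length` fields. The token ID `9999` and the helper `placeholder_tokens` are likewise invented for illustration.

```python
from typing import Mapping, NamedTuple, Sequence


class Range(NamedTuple):
    """Stand-in for PlaceholderRange; the offset/length fields are assumed."""
    offset: int
    length: int


def placeholder_tokens(
    prompt_token_ids: Sequence[int],
    mm_placeholders: Mapping[str, Sequence[Range]],
) -> dict:
    """Return, per modality, the token slices covered by each placeholder range."""
    return {
        modality: [list(prompt_token_ids[r.offset:r.offset + r.length]) for r in ranges]
        for modality, ranges in mm_placeholders.items()
    }


IMAGE_PAD = 9999  # hypothetical placeholder token ID
prompt_token_ids = [1, 15, IMAGE_PAD, IMAGE_PAD, IMAGE_PAD, 27, 2]
mm_placeholders = {"image": [Range(offset=2, length=3)]}
spans = placeholder_tokens(prompt_token_ids, mm_placeholders)
```

Each modality maps to one range per item, so a prompt with two images would list two ranges under "image".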
TokensInput ¶
Bases: _InputOptions
Represents token-based input to the engine.
Attributes:
- prompt (NotRequired[str]) – The prompt text corresponding to the token IDs, if available.
- prompt_token_ids (list[int]) – The token IDs of the prompt.
- type (Literal['token']) – The type of input.
Source code in vllm/inputs/engine.py
_InputOptions ¶
Bases: TypedDict
Additional options available to all SingletonInput types.
Attributes:
- arrival_time (NotRequired[float]) – The time when the input was received (before rendering).
- cache_salt (NotRequired[str]) – Optional cache salt to be used for prefix caching.
Source code in vllm/inputs/engine.py
_prepare_decoder_input_ids_for_generation(decoder_input_ids, decoder_start_token_id) ¶
Prepare decoder_input_ids for generation with encoder-decoder models, according to GenerationMixin._prepare_decoder_input_ids_for_generation().
Source: https://github.com/huggingface/transformers/blob/v5.1.0/src/transformers/generation/utils.py
Source code in vllm/inputs/engine.py
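As a rough illustration of the convention this helper follows, the sketch below ensures that a decoder token sequence begins with the decoder start token. It is an assumption based on the Hugging Face behavior named above, not the vLLM source. The function name `prepare_decoder_input_ids` is hypothetical, and the real function's handling of edge cases may differ.

```python
def prepare_decoder_input_ids(
    decoder_input_ids: list,
    decoder_start_token_id: int,
) -> list:
    """Return decoder IDs that are guaranteed to begin with the start token."""
    # An empty decoder prompt becomes just the start token.
    if not decoder_input_ids:
        return [decoder_start_token_id]
    # Prepend the start token only when it is not already in first position.
    if decoder_input_ids[0] != decoder_start_token_id:
        return [decoder_start_token_id, *decoder_input_ids]
    # Already well-formed: return a copy so the caller's list is not aliased.
    return list(decoder_input_ids)
```

Under these assumptions the operation is idempotent: applying it to its own output leaves the sequence unchanged.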
embeds_input(prompt_embeds, *, prompt=None, cache_salt=None, prompt_token_ids=None, is_token_ids=None) ¶
Construct EmbedsInput from optional values.
Source code in vllm/inputs/engine.py
tokens_input(prompt_token_ids, *, prompt=None, cache_salt=None) ¶
Construct TokensInput from optional values.
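"From optional values" suggests that keys marked NotRequired are only set when a value is supplied. The sketch below shows that pattern with a plain dictionary. `tokens_input_sketch` is a hypothetical stand-in that copies the documented signature, and its omit-when-None behavior is an assumption rather than a confirmed detail of the vLLM implementation.

```python
from typing import Any, Optional


def tokens_input_sketch(
    prompt_token_ids: list,
    *,
    prompt: Optional[str] = None,
    cache_salt: Optional[str] = None,
) -> dict:
    """Build a TokensInput-shaped dict, omitting optional keys that are None."""
    inputs: dict = {"type": "token", "prompt_token_ids": prompt_token_ids}
    # NotRequired keys are added only when the caller supplied a value.
    if prompt is not None:
        inputs["prompt"] = prompt
    if cache_salt is not None:
        inputs["cache_salt"] = cache_salt
    return inputs


bare = tokens_input_sketch([1, 2, 3])
salted = tokens_input_sketch([1, 2, 3], prompt="hello", cache_salt="tenant-a")
```

Leaving absent keys out, rather than storing None, keeps the result consistent with the NotRequired annotations on the TypedDict.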