vllm.v1.outputs ¶
Classes:
- AsyncModelRunnerOutput
- LogprobsTensors
- ModelRunnerOutput
- RoutedExpertsLists – CPU-side routed experts, in the form that RoutedExpertsManager.store_batch consumes.
- RoutedExpertsTensors – Device-side snapshot of routed experts data, pending async D2H.
Functions:
- make_empty_encoder_model_runner_output – Create a ModelRunnerOutput stub that contains the correct per-request bookkeeping but no generated data yet.
AsyncModelRunnerOutput ¶
Bases: ABC
Methods:
- get_output – Get the ModelRunnerOutput for this async output.
Source code in vllm/v1/outputs.py
get_output() abstractmethod ¶
Get the ModelRunnerOutput for this async output.
This is a blocking call that waits until the results are ready, which might involve copying device tensors to the host. This method should only be called once per AsyncModelRunnerOutput.
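The blocking, call-once contract described above can be sketched as follows. This is a minimal, self-contained illustration and not vLLM's implementation: `AsyncOutputSketch`, `FutureBackedOutput`, `_fake_copy`, the dictionary result, and the use of a `concurrent.futures.Future` as a stand-in for a pending device-to-host copy are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from concurrent.futures import Future, ThreadPoolExecutor


class AsyncOutputSketch(ABC):
    """Hypothetical stand-in for the abstract base described above."""

    @abstractmethod
    def get_output(self) -> dict:
        """Block until the result is ready and return it."""


class FutureBackedOutput(AsyncOutputSketch):
    """Hypothetical subclass: a Future plays the role of a pending copy."""

    def __init__(self, pending: Future) -> None:
        self._pending = pending
        self._consumed = False

    def get_output(self) -> dict:
        # Enforce the "call once" contract explicitly in this sketch.
        if self._consumed:
            raise RuntimeError("get_output() was already called")
        self._consumed = True
        # Blocks until the background work has finished.
        return self._pending.result()


def _fake_copy() -> dict:
    # Placeholder for work such as copying device tensors to the host.
    return {"req_ids": ["req-0"], "sampled_token_ids": [[42]]}


with ThreadPoolExecutor(max_workers=1) as pool:
    async_output = FutureBackedOutput(pool.submit(_fake_copy))
    result = async_output.get_output()
```

Raising on a second call is a design choice of this sketch; the text above only states that the method should be called once.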
LogprobsTensors ¶
Bases: NamedTuple
Methods:
- empty_cpu – Create empty LogprobsTensors on CPU.
- filter – Filter the logprobs tensors with the given bool mask.
empty_cpu(num_positions, num_tokens_per_position) staticmethod ¶
Create empty LogprobsTensors on CPU.
filter(mask) ¶
Filter the logprobs tensors with the given bool mask.
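As a rough illustration of filtering by a boolean mask, the sketch below keeps only the rows whose mask entry is true. It uses plain Python lists rather than tensors, and the class name `LogprobsSketch` and the field names `token_ids` and `logprobs` are assumptions for the example, not the actual fields of LogprobsTensors.

```python
from __future__ import annotations

from itertools import compress
from typing import NamedTuple


class LogprobsSketch(NamedTuple):
    """Hypothetical stand-in: one list row per position."""

    token_ids: list[list[int]]
    logprobs: list[list[float]]

    def filter(self, mask: list[bool]) -> "LogprobsSketch":
        # Keep row i of every field only where mask[i] is True.
        return LogprobsSketch(
            token_ids=list(compress(self.token_ids, mask)),
            logprobs=list(compress(self.logprobs, mask)),
        )


rows = LogprobsSketch(
    token_ids=[[11, 12], [21, 22], [31, 32]],
    logprobs=[[-0.1, -2.0], [-0.3, -1.5], [-0.2, -1.9]],
)
kept = rows.filter([True, False, True])
```

The same mask is applied to every field so that the rows stay aligned with each other.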
ModelRunnerOutput dataclass ¶
Methods:
- with_kv_conn_output_only – Return ModelRunnerOutput containing the provided KVConnectorOutput, otherwise empty.
with_kv_conn_output_only(kv_connector_output) staticmethod ¶
Return ModelRunnerOutput containing the provided KVConnectorOutput, otherwise empty. Returns None if kv_connector_output is passed as None.
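The None-propagating behaviour can be sketched with two small stand-in dataclasses. `KVConnectorOutputSketch`, `RunnerOutputSketch`, and their fields are hypothetical and far smaller than the real classes; only the shape of the static method follows the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KVConnectorOutputSketch:
    """Hypothetical stand-in; the real KVConnectorOutput has its own fields."""

    finished_sending: set[str] = field(default_factory=set)


@dataclass
class RunnerOutputSketch:
    """Hypothetical stand-in for ModelRunnerOutput with only two fields."""

    req_ids: list[str] = field(default_factory=list)
    kv_connector_output: Optional[KVConnectorOutputSketch] = None

    @staticmethod
    def with_kv_conn_output_only(
        kv_connector_output: Optional[KVConnectorOutputSketch],
    ) -> Optional["RunnerOutputSketch"]:
        # No connector output means there is nothing to report.
        if kv_connector_output is None:
            return None
        # Every other field keeps its empty default.
        return RunnerOutputSketch(kv_connector_output=kv_connector_output)


nothing = RunnerOutputSketch.with_kv_conn_output_only(None)
only_kv = RunnerOutputSketch.with_kv_conn_output_only(
    KVConnectorOutputSketch(finished_sending={"req-0"})
)
```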
RoutedExpertsLists ¶
Bases: NamedTuple
CPU-side routed experts, in the form that RoutedExpertsManager.store_batch consumes.
Batched per scheduler step: the leading dim is the number of tokens scheduled across all requests in this step (total_num_scheduled_tokens), not per-request tokens. slot_mapping[i] tells the scheduler which physical KV-cache slot row i of routing_data belongs to.
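The relationship between routing_data and slot_mapping can be illustrated with a small sketch. It uses plain lists and a dictionary keyed by slot; `RoutedExpertsListsSketch` and `store_batch_sketch` are hypothetical names, and the real store_batch method may take different arguments and use a different storage layout.

```python
from __future__ import annotations

from typing import NamedTuple


class RoutedExpertsListsSketch(NamedTuple):
    """Hypothetical stand-in using plain lists instead of arrays."""

    routing_data: list[list[int]]  # one row of expert ids per scheduled token
    slot_mapping: list[int]        # physical KV-cache slot for each row


def store_batch_sketch(
    store: dict[int, list[int]], batch: RoutedExpertsListsSketch
) -> None:
    # Row i of routing_data is recorded under slot_mapping[i].
    for slot, experts in zip(batch.slot_mapping, batch.routing_data):
        store[slot] = experts


# Three tokens scheduled in one step, possibly from different requests.
batch = RoutedExpertsListsSketch(
    routing_data=[[0, 3], [1, 2], [3, 0]],
    slot_mapping=[17, 4, 18],
)
slot_store: dict[int, list[int]] = {}
store_batch_sketch(slot_store, batch)
```

Both fields share the same leading dimension (here 3), which is the step-level token count rather than any single request's length.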
RoutedExpertsTensors ¶
Bases: NamedTuple
Device-side snapshot of routed experts data, pending async D2H.
Produced by GPUModelRunner at the end of each async-scheduled step. The copy stream waits on the default stream, then issues a non-blocking D2H copy via to_cpu_nonblocking into a pinned CPU buffer; AsyncGPUModelRunnerOutput.get_output synchronizes the copy before the scheduler reads it.
Sliced to total_num_scheduled_tokens (step-level, across all requests, NOT per-request). Both routing_data and slot_mapping must be private clones when sourced from shared capturer / prepare-input buffers, so that the next forward pass or _prepare_inputs call on the default stream does not race with a D2H copy still pending on the copy stream.
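The reason for private clones can be shown with a CPU-only analogy. The sketch below uses Python lists and no CUDA streams, so it only models the aliasing hazard, not the actual race; `shared_buffer` and `run_step` are hypothetical names. Note that slicing a Python list copies it, whereas slicing a PyTorch tensor returns a view, which is why an explicit clone is needed for tensors.

```python
# Stand-in for a shared buffer that every step overwrites in place.
shared_buffer = [0, 0, 0, 0]


def run_step(values: list) -> None:
    """Hypothetical step: writes this step's rows into the shared buffer."""
    shared_buffer[: len(values)] = values


run_step([5, 6, 7])
num_scheduled = 3

aliased_snapshot = shared_buffer                  # no copy: still the shared buffer
private_snapshot = shared_buffer[:num_scheduled]  # list slicing copies the rows

# The next step runs before the (simulated) pending copy has been consumed.
run_step([9, 9, 9])

print(aliased_snapshot[:num_scheduled])  # overwritten by the next step
print(private_snapshot)                  # still holds the first step's rows
```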
Methods:
- to_cpu_nonblocking – Issue non-blocking D2H on the current stream.
- tolists – Convert to the numpy-backed form consumed by the scheduler.
to_cpu_nonblocking() ¶
Issue non-blocking D2H on the current stream.
NOTE: non_blocking=True only delivers true overlap when the CPU target is pinned. The current fallback here allocates a new pageable CPU tensor per call, which silently degrades to a synchronous copy; acceptable because the sync happens on the dedicated copy stream, not the default stream.
tolists() ¶
Convert to the numpy-backed form consumed by the scheduler.
.cpu() is a no-op when the tensor is already on CPU, so this is cheap for the post-D2H case; for raw device tensors it will synchronously block, which is only reached in tests.
make_empty_encoder_model_runner_output(scheduler_output) ¶
Create a ModelRunnerOutput stub that contains the correct per-request bookkeeping but no generated data yet.