vllm.utils.multi_stream_utils
Functions:
- execute_in_parallel – Run default_fn on the current stream and aux_fns concurrently on aux_streams.
- maybe_execute_in_parallel – Run two functions potentially in parallel on separate CUDA streams.
execute_in_parallel(default_fn, aux_fns, start_event, done_events, aux_streams=None, enable=False)
Run default_fn on the current stream and aux_fns concurrently on aux_streams.
Generalizes maybe_execute_in_parallel to N aux callables. Slots where aux_fns[i] is None are skipped (no stream switch, no event record); their corresponding entry in the returned aux_results list is None.
start_event fans out from the current stream to every launched aux stream; done_events[i] is recorded after aux_fns[i] so the current stream joins before returning. Falls back to sequential execution on the current stream when aux_streams is None or enable is False; in that case default_fn runs first, then aux_fns in order.
Parameters:
- default_fn (Callable[[], Any]) – Callable for the default (current) stream.
- aux_fns (list[Callable[[], Any] | None]) – Per-aux callables; entries may be None to skip.
- start_event (Event) – CUDA event recorded on the current stream before default_fn so each launched aux stream can wait on it.
- done_events (list[Event]) – One CUDA event per aux slot, recorded after the corresponding aux_fn. Length must match aux_fns.
- aux_streams (list[Stream] | None, default: None) – Per-aux CUDA streams. Length must match aux_fns. Multi-stream is disabled when None.
- enable (bool, default: False) – Opt-in switch for the multi-stream path. Because it defaults to False, callers that pass aux_streams must also pass enable=True (typically gated by an env var) to actually overlap. When False, execution falls back to sequential on the current stream.
Returns:
- tuple[Any, list[Any]] – Tuple of (default_result, aux_results), where aux_results[i] is the result of aux_fns[i] (or None when skipped).
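The control flow described for execute_in_parallel can be sketched as follows. This is an illustrative reimplementation written from the description above, not the vLLM source: the name sketch_execute_in_parallel is hypothetical, the host-side launch order on the multi-stream path (default_fn first, then the aux callables) is an assumption, and that path assumes PyTorch's torch.cuda.stream context manager and torch.cuda.Event.record/wait. torch is imported lazily, so the sequential fallback runs without a GPU.

```python
from typing import Any, Callable, List, Optional, Tuple


def sketch_execute_in_parallel(
    default_fn: Callable[[], Any],
    aux_fns: List[Optional[Callable[[], Any]]],
    start_event: Any,
    done_events: List[Any],
    aux_streams: Optional[List[Any]] = None,
    enable: bool = False,
) -> Tuple[Any, List[Any]]:
    """Hypothetical sketch of the documented behavior (not the vLLM code)."""
    # Sequential fallback: default_fn first, then aux_fns in order.
    # Skipped (None) slots yield None in aux_results.
    if aux_streams is None or not enable:
        default_result = default_fn()
        aux_results = [fn() if fn is not None else None for fn in aux_fns]
        return default_result, aux_results

    # Multi-stream path. Assumption: torch.cuda events and streams.
    import torch

    assert len(aux_fns) == len(done_events) == len(aux_streams)

    # Record on the current stream so aux streams can wait on this point.
    start_event.record()
    default_result = default_fn()

    aux_results: List[Any] = [None] * len(aux_fns)
    launched = []
    for i, fn in enumerate(aux_fns):
        if fn is None:
            continue  # no stream switch, no event record for skipped slots
        with torch.cuda.stream(aux_streams[i]):
            start_event.wait()        # aux stream waits for the fan-out point
            aux_results[i] = fn()
            done_events[i].record()   # marks completion of this aux slot
        launched.append(i)

    # Join: the current stream waits on every launched aux slot.
    for i in launched:
        done_events[i].wait()
    return default_result, aux_results


# Sequential fallback demo: enable defaults to False, so no CUDA is needed.
result = sketch_execute_in_parallel(
    lambda: "default",
    [lambda: "aux0", None, lambda: "aux2"],
    start_event=None,
    done_events=[None, None, None],
)
print(result)  # ('default', ['aux0', None, 'aux2'])
```

Note that passing aux_streams alone is not enough to overlap work in this sketch, which mirrors the documented opt-in: the multi-stream branch is only taken when enable is also True.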
Source code in vllm/utils/multi_stream_utils.py
maybe_execute_in_parallel(fn0, fn1, event0, event1, aux_stream=None)
Run two functions potentially in parallel on separate CUDA streams.
When aux_stream is provided, fn0 runs on the current (default) stream and fn1 runs on aux_stream, synchronized via CUDA events. When aux_stream is None, both functions execute sequentially on the current stream.
This design follows TensorRT-LLM's maybe_execute_in_parallel pattern (tensorrt_llm/_torch/modules/multi_stream_utils.py).
Parameters:
- fn0 (Callable[[], Any]) – Callable for the default stream.
- fn1 (Callable[[], Any]) – Callable for the auxiliary stream.
- event0 (Event) – CUDA event recorded before fn0 so aux_stream can wait.
- event1 (Event) – CUDA event recorded after fn1 so default stream can wait.
- aux_stream (Stream | None, default: None) – The second CUDA stream for fn1. Multi-stream is disabled when aux_stream is None.
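The two-function event handshake described above can be sketched as follows. This is a hedged illustration rather than the vLLM or TensorRT-LLM source: the name sketch_maybe_execute_in_parallel is hypothetical, returning the pair (fn0 result, fn1 result) is an assumption, and the multi-stream branch assumes PyTorch's torch.cuda.stream context manager and torch.cuda.Event.record/wait. torch is imported lazily, so the sequential branch runs on a machine without CUDA.

```python
from typing import Any, Callable, Optional, Tuple


def sketch_maybe_execute_in_parallel(
    fn0: Callable[[], Any],
    fn1: Callable[[], Any],
    event0: Any,
    event1: Any,
    aux_stream: Optional[Any] = None,
) -> Tuple[Any, Any]:
    """Hypothetical sketch of the documented behavior (not the vLLM code)."""
    # No aux stream: run both callables sequentially on the current stream.
    if aux_stream is None:
        return fn0(), fn1()

    # Assumption: event0/event1 are torch.cuda.Event objects and
    # aux_stream is a torch.cuda.Stream.
    import torch

    event0.record()       # recorded before fn0 so aux_stream can wait on it
    result0 = fn0()       # enqueued on the current (default) stream
    with torch.cuda.stream(aux_stream):
        event0.wait()     # aux stream waits for work queued before event0
        result1 = fn1()   # enqueued on the aux stream
        event1.record()   # recorded after fn1
    event1.wait()         # default stream waits for fn1 to finish
    return result0, result1


# Sequential demo: aux_stream is None, so the events are never touched.
print(sketch_maybe_execute_in_parallel(lambda: 1 + 1, lambda: "b", None, None))
# (2, 'b')
```

Both callables only enqueue GPU work in the multi-stream case; the events order that work across the two streams without blocking the host thread.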
Returns: