vllm.model_executor.offloader.base ¶
Base classes for model parameter offloading.
Classes:
- BaseOffloader – Base class for model parameter offloading strategies.
- NoopOffloader – No-op offloader that returns modules as-is without any offloading.
Functions:
- create_offloader – Create an offloader based on the offload configuration.
- get_offloader – Get the global offloader instance.
- set_offloader – Set the global offloader instance.
- should_pin_memory – Check if pinned memory should be used for weight offloading.
BaseOffloader ¶
Bases: ABC
Base class for model parameter offloading strategies.
Offloaders control how model parameters are stored and loaded during inference. Different strategies trade memory for compute/transfer time.
Methods:
- join_after_forward – Join streams after forward. Override in subclasses.
- post_init – Called after model construction completes.
- sync_prev_onload – Sync previous onload operations. Override in subclasses.
- wrap_modules – Wrap modules with offloading logic.
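The method list above can be pictured as a small abstract interface. The following is a minimal sketch, not vLLM's implementation: `SketchOffloader`, `RecordingOffloader`, and the `events` list are hypothetical names, plain Python objects stand in for `torch.nn.Module`, and both the list return type of `wrap_modules` and the do-nothing defaults for the optional hooks are assumptions.

```python
from abc import ABC, abstractmethod
from typing import Generator, List


class SketchOffloader(ABC):
    """Hypothetical stand-in that mirrors the documented method names."""

    @abstractmethod
    def wrap_modules(
        self, modules_generator: Generator[object, None, None]
    ) -> List[object]:
        """Wrap modules with offloading logic."""

    # Assumption: the optional hooks do nothing unless a subclass overrides them.
    def post_init(self) -> None:
        pass

    def sync_prev_onload(self) -> None:
        pass

    def join_after_forward(self) -> None:
        pass


class RecordingOffloader(SketchOffloader):
    """Hypothetical subclass that only records the order of calls."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def wrap_modules(self, modules_generator):
        # Consume the generator; a real strategy would attach hooks here.
        modules = list(modules_generator)
        self.events.append(f"wrap:{len(modules)}")
        return modules

    def post_init(self) -> None:
        self.events.append("post_init")

    def join_after_forward(self) -> None:
        self.events.append("join_after_forward")


offloader = RecordingOffloader()
layers = offloader.wrap_modules(name for name in ["layer0", "layer1"])
offloader.post_init()
offloader.sync_prev_onload()  # inherited no-op, records nothing
offloader.join_after_forward()
```

Only `wrap_modules` is abstract in this sketch, so a strategy that needs no stream synchronization can leave the other hooks untouched.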
Source code in vllm/model_executor/offloader/base.py
_start_prefetch(layer_idx) ¶
_wait_for_layer(layer_idx) ¶
join_after_forward() ¶
post_init() ¶
Called after model construction completes.
Offloaders can use this to:
- Finalize parameter storage
- Start initial prefetching
- Allocate shared resources
sync_prev_onload() ¶
wrap_modules(modules_generator) abstractmethod ¶
Wrap modules with offloading logic.
Parameters:
- modules_generator (Generator[Module, None, None]) – Generator yielding modules to potentially offload.
Returns:
NoopOffloader ¶
Bases: BaseOffloader
No-op offloader that returns modules as-is without any offloading.
Methods:
- wrap_modules – Return modules unchanged.
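As an illustration of "return modules unchanged", a pass-through offloader might look like the sketch below. `NoopSketch` is a hypothetical stand-in, plain objects replace `torch.nn.Module`, and returning a list (rather than another generator) is an assumption.

```python
from typing import Generator, List


class NoopSketch:
    """Hypothetical pass-through offloader."""

    def wrap_modules(
        self, modules_generator: Generator[object, None, None]
    ) -> List[object]:
        # Consume the generator and hand back the very same objects.
        return list(modules_generator)


modules = [object(), object()]
wrapped = NoopSketch().wrap_modules(m for m in modules)
```

The returned items are the original objects, not copies, so nothing about the model changes when offloading is disabled.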
create_offloader(offload_config) ¶
Create an offloader based on the offload configuration.
Uses the explicit offload_backend selector. When offload_backend is set to "auto", the backend is inferred from the configuration: prefetch if offload_group_size > 0, otherwise UVA if cpu_offload_gb > 0, otherwise noop.
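The selection rule can be sketched as a small pure function. This is an illustration under stated assumptions, not the library's code: `select_backend` is a hypothetical helper, the three settings are passed as plain arguments rather than read from a config object, and the lowercase backend strings are assumed spellings.

```python
def select_backend(
    offload_backend: str, offload_group_size: int, cpu_offload_gb: float
) -> str:
    """Hypothetical helper mirroring the documented "auto" rule."""
    if offload_backend != "auto":
        # An explicit selector is used as given.
        return offload_backend
    if offload_group_size > 0:
        return "prefetch"
    if cpu_offload_gb > 0:
        return "uva"
    return "noop"


choice = select_backend("auto", offload_group_size=0, cpu_offload_gb=4.0)
```

In this sketch the prefetch check comes first, so it wins when both `offload_group_size` and `cpu_offload_gb` are positive.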
get_offloader() ¶
set_offloader(instance) ¶
Set the global offloader instance.
should_pin_memory() ¶
Check if pinned memory should be used for weight offloading.
Combines the platform capability check with the user override env var. On unified-memory systems (e.g. GH200) pinned memory eats into GPU memory, so users can disable it via VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY.
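The combination of the two checks can be sketched as follows. Only the environment variable name comes from the text above; `should_pin_memory_sketch`, the `platform_supports_pinning` argument, and the rule that "1" or "true" disables pinning are assumptions made for illustration.

```python
import os
from typing import Mapping

ENV_VAR = "VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY"


def should_pin_memory_sketch(
    platform_supports_pinning: bool, env: Mapping[str, str] = os.environ
) -> bool:
    """Hypothetical combination of platform capability and user override."""
    # Assumption: the values "1" or "true" (case-insensitive) disable pinning.
    disabled = env.get(ENV_VAR, "0").strip().lower() in {"1", "true"}
    return platform_supports_pinning and not disabled


pinned_by_default = should_pin_memory_sketch(True, env={})
pinned_with_override = should_pin_memory_sketch(True, env={ENV_VAR: "1"})
```

On a unified-memory machine a user would set the variable before launching, and the override then takes effect even though the platform itself supports pinning.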