vllm.v1.simple_kv_offload.cuda_mem_ops ¶
Low-level CUDA/HIP memory helpers: pinning and batch DMA transfers.
Functions:
- copy_blocks – Copy blocks via cuMemcpyBatchAsync / hipMemcpyBatchAsync.
- pin_tensor – Pin a CPU tensor via cudaHostRegister.
_resolve_batch_memcpy() ¶
Resolve the platform batch-memcpy entry point (one-time).
- CUDA: cuMemcpyBatchAsync via cuGetProcAddress (uses srcAccessOrder=STREAM via one attributes entry).
- ROCm: hipMemcpyBatchAsync from libamdhip64 (ROCm 7.1+). ROCm 7.2.1 or 7.2.2 rejects any call with numAttrs > 0 (see ROCm/clr @ rocm-7.2.1 hipamd/src/hip_memory.cpp:2819-2822), so we call with numAttrs=0.
Raises RuntimeError if the symbol is unavailable (older CUDA driver, ROCm < 7.1, unusual install). The connector requires the batch API.
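The one-time resolution and the RuntimeError on a missing symbol can be illustrated with a small ctypes sketch. This is not the module's actual code: the helper name `resolve_symbol_sketch`, the default library file name `libamdhip64.so`, and the use of `functools.cache` for the one-time behaviour are all assumptions made for illustration, and only the ROCm-style "load the shared library, look up the symbol" path is shown.

```python
import ctypes
import functools


@functools.cache  # assumption: caching stands in for the "one-time" resolution
def resolve_symbol_sketch(lib_name="libamdhip64.so", symbol="hipMemcpyBatchAsync"):
    """Hypothetical helper: return a ctypes handle to `symbol` in `lib_name`.

    Raises RuntimeError when the library or the symbol cannot be found,
    mirroring the documented behaviour of _resolve_batch_memcpy().
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError as exc:
        raise RuntimeError(f"cannot load {lib_name}") from exc
    try:
        # Attribute lookup on a CDLL resolves the exported symbol lazily.
        return getattr(lib, symbol)
    except AttributeError as exc:
        raise RuntimeError(f"{symbol} not found in {lib_name}") from exc
```

Because `functools.cache` does not store raised exceptions, a failed lookup is retried on the next call, while a successful one is resolved only once.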
Source code in vllm/v1/simple_kv_offload/cuda_mem_ops.py
copy_blocks(src_block_ids, dst_block_ids, params) ¶
Copy blocks via cuMemcpyBatchAsync / hipMemcpyBatchAsync.
Source code in vllm/v1/simple_kv_offload/cuda_mem_ops.py
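Batch memcpy entry points of this kind take parallel arrays of destination pointers, source pointers, and sizes. The following sketch shows how two lists of block IDs could be turned into such arrays. The contents of `params` are not documented here, so the helper name `build_batch_args_sketch`, the base addresses, and the assumption of contiguous fixed-size blocks are hypothetical.

```python
import ctypes


def build_batch_args_sketch(src_base, dst_base, block_bytes,
                            src_block_ids, dst_block_ids):
    """Hypothetical helper: build pointer and size arrays for a batch copy.

    Assumes block i lives at base + i * block_bytes in each buffer.
    """
    if len(src_block_ids) != len(dst_block_ids):
        raise ValueError("source and destination block lists differ in length")
    n = len(src_block_ids)
    # One entry per block: source address, destination address, byte count.
    srcs = (ctypes.c_uint64 * n)(*[src_base + i * block_bytes for i in src_block_ids])
    dsts = (ctypes.c_uint64 * n)(*[dst_base + i * block_bytes for i in dst_block_ids])
    sizes = (ctypes.c_size_t * n)(*[block_bytes] * n)
    return srcs, dsts, sizes


# Example: copy source blocks 0 and 2 into destination blocks 5 and 1.
srcs, dsts, sizes = build_batch_args_sketch(0x1000, 0x9000, 256, [0, 2], [5, 1])
```

Arrays like these would then be handed to the resolved batch function in a single call, rather than issuing one copy per block.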
pin_tensor(tensor) ¶
Pin a CPU tensor via cudaHostRegister.
This bypasses PyTorch's CUDACachingHostAllocator, which rounds every pin_memory=True allocation up to the next power of 2 (e.g. 100 GB becomes 128 GB).
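The rounding cost can be checked with a few lines of arithmetic. This sketch only reproduces the power-of-2 rounding described above; the helper name `next_power_of_two` is hypothetical and is not part of PyTorch or vLLM.

```python
GIB = 2**30


def next_power_of_two(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


requested = 100 * GIB
rounded = next_power_of_two(requested)
wasted_gib = (rounded - requested) // GIB  # extra pinned memory from rounding
```

For a 100 GB request the rounded size is 128 GB, so 28 GB of pinned host memory would go unused. Registering the existing buffer in place avoids that overhead.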