vllm.v1.kv_cache_interface ¶
Classes:
- ChunkedLocalAttentionSpec
- CrossAttentionSpec – KV cache spec for cross-attention layers in encoder-decoder models.
- FullAttentionSpec – When the hybrid allocator is disabled and the model contains both full attention and sliding window attention layers, this spec is used for the sliding window layers too, with the window size recorded.
- HiddenStateCacheSpec – Marker for hidden-state cache layers used by extract_hidden_states.
- KVCacheConfig – The KV cache configuration of a model.
- KVCacheGroupSpec – Represents a group of model layers that share the same KV cache block table.
- KVCacheSpec – A base class for specifying the KV cache format of one layer.
- KVCacheTensor – A class for specifying how the workers should initialize the KV cache.
- KVQuantMode – KV cache quantization mode.
- SinkFullAttentionSpec
- SlidingWindowMLASpec – Sliding window attention with MLA cache format.
- SlidingWindowSpec
- TQFullAttentionSpec – FullAttentionSpec with TQ-aware page size.
- UniformTypeKVCacheSpecs – A KV cache spec for multiple layers with the same type of attention.
Functions:
- get_kv_quant_mode – Map a kv_cache_dtype string to a KVQuantMode.
- kv_cache_uses_per_token_head_scales – Return True if kv_cache_dtype needs per-token-head scales.
ChunkedLocalAttentionSpec dataclass ¶
Bases: AttentionSpec
Methods:
- max_admission_blocks_per_request – Per-request admission cap, in blocks.
Source code in vllm/v1/kv_cache_interface.py
max_admission_blocks_per_request(max_num_batched_tokens, max_model_len) ¶
Per-request admission cap, in blocks.
Single source of truth for both startup pool sizing (max_memory_usage_bytes) and the runtime admission gate, so requests admitted by startup can also be admitted at runtime.
CrossAttentionSpec dataclass ¶
Bases: AttentionSpec
KV cache spec for cross-attention layers in encoder-decoder models.
FullAttentionSpec dataclass ¶
Bases: AttentionSpec
When the hybrid allocator is disabled and the model contains both full attention layers and sliding window attention layers, the sliding window attention layers are treated as full attention in the KV cache manager (blocks are allocated for all tokens) but are still computed as sliding window attention in the model runner. In this case, FullAttentionSpec is used and the sliding window size is recorded.
Methods:
- merge – Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
Attributes:
- sliding_window (int | None) – Defaults to None, meaning sliding window attention is not used.
sliding_window = None class-attribute instance-attribute ¶
Defaults to None, meaning sliding window attention is not used.
merge(specs) classmethod ¶
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
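The idea of such a merge can be sketched as follows. This is a minimal illustration using a hypothetical ToyFullAttentionSpec, not the real class: the field names, the rule that specs must agree on everything except sliding_window, and the rule that at most one distinct window size is allowed are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyFullAttentionSpec:
    # Hypothetical stand-in for FullAttentionSpec; field names are assumptions.
    block_size: int
    num_kv_heads: int
    head_size: int
    sliding_window: int | None = None

    @classmethod
    def merge(cls, specs: list[ToyFullAttentionSpec]) -> ToyFullAttentionSpec:
        """Collapse several per-layer specs into one shared spec."""
        if not specs:
            raise ValueError("need at least one spec to merge")
        # Compare everything except sliding_window by blanking that field.
        base = replace(specs[0], sliding_window=None)
        for spec in specs[1:]:
            if replace(spec, sliding_window=None) != base:
                raise ValueError("specs differ in more than sliding_window")
        # Assumed rule: every layer that sets a window must use the same size.
        windows = {s.sliding_window for s in specs if s.sliding_window is not None}
        if len(windows) > 1:
            raise ValueError("conflicting sliding window sizes")
        return replace(base, sliding_window=windows.pop() if windows else None)


full = ToyFullAttentionSpec(block_size=16, num_kv_heads=8, head_size=128)
windowed = ToyFullAttentionSpec(16, 8, 128, sliding_window=4096)
merged = ToyFullAttentionSpec.merge([full, windowed])
```

In this toy version the merged spec keeps the recorded window size, which mirrors the description above of a full attention spec that remembers the sliding window of the layers it stands in for.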
HiddenStateCacheSpec dataclass ¶
Bases: MLAAttentionSpec
Marker for hidden-state cache layers used by extract_hidden_states.
KVCacheConfig dataclass ¶
The KV cache configuration of a model.
Attributes:
- kv_cache_groups (list[KVCacheGroupSpec]) – The KV cache groups of the model.
- kv_cache_tensors (list[KVCacheTensor]) – How the model runner should initialize the KV cache tensors for each layer.
- num_blocks (int) – The number of KV cache blocks.
kv_cache_groups instance-attribute ¶
The KV cache groups of the model. For models with only one type of attention, there is a single group that contains all layers. For models with multiple types of attention, there are multiple groups; see _get_kv_cache_config_uniform_page_size for more details.
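A rough sketch of the grouping idea, under simplifying assumptions: each layer's spec is represented by a hypothetical hashable tuple, and layers with identical tuples land in the same group. The real logic in _get_kv_cache_config_uniform_page_size may differ considerably; group_layers_by_spec and the layer names are made up for illustration.

```python
from collections import defaultdict


def group_layers_by_spec(layer_specs: dict[str, tuple]) -> list[list[str]]:
    """Group layer names whose (hashable) spec descriptions are identical."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for layer_name, spec in layer_specs.items():
        groups[spec].append(layer_name)
    # Dicts keep insertion order, so groups appear in first-seen order.
    return list(groups.values())


# Hypothetical hybrid model: alternating full and sliding window layers.
hybrid = {
    "layers.0.attn": ("full", None),
    "layers.1.attn": ("sliding_window", 1024),
    "layers.2.attn": ("full", None),
    "layers.3.attn": ("sliding_window", 1024),
}
hybrid_groups = group_layers_by_spec(hybrid)

# Hypothetical model with a single attention type: one group with all layers.
uniform = {f"layers.{i}.attn": ("full", None) for i in range(4)}
uniform_groups = group_layers_by_spec(uniform)
```

Each resulting group corresponds to what the KV cache manager would treat as one layer sharing a block table.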
kv_cache_tensors instance-attribute ¶
How the model runner should initialize the KV cache tensors for each layer.
num_blocks instance-attribute ¶
The number of KV cache blocks.
KVCacheGroupSpec dataclass ¶
Represents a group of model layers that share the same KV cache block table. These layers are regarded as one layer in the KV cache manager.
KVCacheSpec dataclass ¶
A base class for specifying the KV cache format of one layer.
Methods:
- copy_with_new_block_size – Create a new KVCacheSpec from self but replacing the block size.
- is_uniform_with_collection – Whether this KVCacheSpec is uniform with all specs of all layers.
- max_memory_usage_bytes – The maximum possible memory usage of this KV cache in bytes.
- merge – Merge a list of KVCacheSpec objects into a single KVCacheSpec object.
Attributes:
- page_size_bytes (int) – The size of a page with block_size tokens in bytes.
page_size_bytes property ¶
copy_with_new_block_size(block_size) ¶
Create a new KVCacheSpec from self but replacing the block size.
is_uniform_with_collection(kv_cache_specs) ¶
Whether this KVCacheSpec is uniform with all specs of all layers.
max_memory_usage_bytes(vllm_config) ¶
The maximum possible memory usage of this KV cache in bytes.
Returns:
- int – The KV cache size in bytes.
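To make the relationship between page_size_bytes and max_memory_usage_bytes concrete, here is a small sketch for a plain full attention layer. ToyAttentionSpec, its fields, and both formulas are assumptions chosen for illustration (separate K and V vectors per token per KV head, and a worst case of one request filling the whole context); the toy method takes max_model_len directly rather than a vllm_config object.

```python
from dataclasses import dataclass


def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative integers."""
    return -(-a // b)


@dataclass(frozen=True)
class ToyAttentionSpec:
    # Hypothetical fields; the real spec classes may carry different ones.
    block_size: int    # tokens per block (page)
    num_kv_heads: int
    head_size: int
    dtype_bytes: int   # bytes per element, e.g. 2 for a 16-bit dtype

    @property
    def page_size_bytes(self) -> int:
        # Assumed layout: one K and one V vector per token per KV head.
        return 2 * self.block_size * self.num_kv_heads * self.head_size * self.dtype_bytes

    def max_memory_usage_bytes(self, max_model_len: int) -> int:
        # Assumed worst case: a single request spanning the full context.
        return cdiv(max_model_len, self.block_size) * self.page_size_bytes


spec = ToyAttentionSpec(block_size=16, num_kv_heads=8, head_size=128, dtype_bytes=2)
page = spec.page_size_bytes                      # 2 * 16 * 8 * 128 * 2 bytes
worst_case = spec.max_memory_usage_bytes(4096)   # 256 pages for this layer
```

Because the block count is rounded up, a context length that is not a multiple of block_size costs one extra page.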
merge(specs) classmethod ¶
Merge a list of KVCacheSpec objects into a single KVCacheSpec object.
KVCacheTensor dataclass ¶
A class for specifying how the workers should initialize the KV cache.
KVQuantMode ¶
Bases: IntEnum
KV cache quantization mode.
Used by attention backends and kernels to dispatch quantization logic without string matching on kv_cache_dtype.
Attributes:
- is_nvfp4 (bool) – True for NVFP4 packed quantization mode.
- is_per_token_head (bool) – True for any per-token-head quantization mode.
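The pattern of resolving the dtype string once and then dispatching on an enum can be sketched like this. Everything here is hypothetical: ToyQuantMode, its members, and the "toy_..." dtype strings are invented for the example and do not reflect the actual members of KVQuantMode or the strings vLLM accepts.

```python
from enum import IntEnum


class ToyQuantMode(IntEnum):
    # Hypothetical members; the real KVQuantMode values are not listed here.
    NONE = 0
    PER_TENSOR = 1
    PER_TOKEN_HEAD_A = 2
    PER_TOKEN_HEAD_B = 3

    @property
    def is_per_token_head(self) -> bool:
        """True for any of the (toy) per-token-head modes."""
        return self in (ToyQuantMode.PER_TOKEN_HEAD_A, ToyQuantMode.PER_TOKEN_HEAD_B)


# Hypothetical dtype strings, mapped to a mode in exactly one place.
_TOY_DTYPE_TO_MODE = {
    "auto": ToyQuantMode.NONE,
    "toy_per_tensor": ToyQuantMode.PER_TENSOR,
    "toy_per_token_head_a": ToyQuantMode.PER_TOKEN_HEAD_A,
    "toy_per_token_head_b": ToyQuantMode.PER_TOKEN_HEAD_B,
}


def toy_get_quant_mode(kv_cache_dtype: str) -> ToyQuantMode:
    # Unknown strings fall back to NONE in this sketch (an assumption).
    return _TOY_DTYPE_TO_MODE.get(kv_cache_dtype, ToyQuantMode.NONE)


def toy_uses_per_token_head_scales(kv_cache_dtype: str) -> bool:
    return toy_get_quant_mode(kv_cache_dtype).is_per_token_head


mode = toy_get_quant_mode("toy_per_token_head_a")
```

Downstream code can then branch on mode or mode.is_per_token_head instead of repeating string comparisons in every backend.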
SinkFullAttentionSpec dataclass ¶
Bases: FullAttentionSpec
Methods:
- merge – Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
merge(specs) classmethod ¶
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
SlidingWindowMLASpec dataclass ¶
Bases: SlidingWindowSpec
Sliding window attention with MLA cache format.
SlidingWindowSpec dataclass ¶
Bases: AttentionSpec
Methods:
- max_admission_blocks_per_request – Per-request admission cap, in blocks.
max_admission_blocks_per_request(max_num_batched_tokens, max_model_len) ¶
Per-request admission cap, in blocks.
Single source of truth for both startup pool sizing (max_memory_usage_bytes) and the runtime admission gate. Per-request real-held blocks plateau at this bound because SlidingWindowManager.remove_skipped_blocks runs from allocate_slots before each chunk's get_num_blocks_to_allocate.
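The plateau described above can be illustrated with a toy bound. The formula below is an assumption made for the sketch (window history plus one scheduling chunk, clamped to the context length, plus one block because the window may start mid-block); it is not taken from the source, and toy_sliding_window_block_cap is a hypothetical helper.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative integers."""
    return -(-a // b)


def toy_sliding_window_block_cap(
    sliding_window: int,
    block_size: int,
    max_num_batched_tokens: int,
    max_model_len: int,
) -> int:
    # Assumed bound on live tokens: the window's history plus one chunk of
    # newly scheduled tokens, never more than the context length.
    num_tokens = min(sliding_window - 1 + max_num_batched_tokens, max_model_len)
    # Assumed +1: the window may begin in the middle of a block.
    return cdiv(num_tokens, block_size) + 1


# With a long context, the cap depends on the window and chunk size only.
cap_32k = toy_sliding_window_block_cap(1024, 16, 2048, 32768)
cap_128k = toy_sliding_window_block_cap(1024, 16, 2048, 131072)

# With a short context, the context length is the limiting factor.
cap_short = toy_sliding_window_block_cap(1024, 16, 2048, 512)
```

Using one such function for both pool sizing and the admission gate is what keeps the two decisions consistent: any request that fits at startup also passes the runtime check.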
TQFullAttentionSpec dataclass ¶
Bases: FullAttentionSpec
FullAttentionSpec with TQ-aware page size.
Python equivalent of the C++ TQ4FullAttentionSpec. Overrides real_page_size_bytes to use TQ slot bytes instead of the raw head_size * dtype formula.
UniformTypeKVCacheSpecs dataclass ¶
Bases: KVCacheSpec
A KV cache spec for multiple layers with the same type of attention. Here, "same type" means the layers always need the same number of token slots. For example, sliding window attention layers with different window sizes are not the same type and should not be merged into one UniformTypeKVCacheSpecs.
Methods:
- from_specs – Return a UniformTypeKVCacheSpecs object if all layers have the same type of KV cache spec. Return None if not.
- is_uniform_type – Whether all layers have the same type of KV cache spec.
from_specs(kv_cache_specs) classmethod ¶
Return a UniformTypeKVCacheSpecs object if all layers have the same type of KV cache spec. Return None if not.
is_uniform_type(kv_cache_specs) classmethod ¶
Whether all layers have the same type of KV cache spec.
Uses the registry to determine grouping base classes, so custom specs that inherit from FullAttentionSpec are treated as full attention.
get_kv_quant_mode(kv_cache_dtype) ¶
Map a kv_cache_dtype string to a KVQuantMode.
kv_cache_uses_per_token_head_scales(kv_cache_dtype) ¶
Return True if kv_cache_dtype needs per-token-head scales.