vllm.config.attention
Classes:
- AttentionConfig – Configuration for attention mechanisms in vLLM.
AttentionConfig
Configuration for attention mechanisms in vLLM.
Methods:
- compute_hash – Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph.
- validate_backend_before – Enable parsing of the backend enum type from string.
- validate_mla_prefill_backend_before – Enable parsing of the mla_prefill_backend enum type from string.
Attributes:
- backend (AttentionBackendEnum | None) – Attention backend to use. Use "auto" or None for automatic selection.
- disable_flashinfer_q_quantization (bool) – If set, when using fp8 kv, do not quantize Q to fp8.
- flash_attn_max_num_splits_for_cuda_graph (int) – Maximum number of Flash Attention splits for CUDA graph decode.
- flash_attn_version (Literal[2, 3, 4] | None) – Force vLLM to use a specific flash-attention version (2, 3, or 4).
- flex_attn_block_m (int | None) – Triton kernel BLOCK_M tile size for flex attention.
- flex_attn_block_n (int | None) – Triton kernel BLOCK_N tile size for flex attention.
- flex_attn_kv_block_size (int | None) – Logical KV block size for the flex attention block mask.
- flex_attn_q_block_size (int | None) – Logical Q block size for the flex attention block mask.
- mla_prefill_backend (MLAPrefillBackendEnum | None) – MLA prefill backend to use. If None, will be selected automatically.
- tq_max_kv_splits_for_cuda_graph (int) – Maximum NUM_KV_SPLITS for TurboQuant during CUDA graph decode.
- use_fp4_indexer_cache (bool) – If set, use the fp4 indexer cache for dsv32-family models (not supported yet).
- use_non_causal (bool) – Whether to use non-causal (bidirectional) attention.
- use_prefill_decode_attention (bool) – Use separate prefill and decode kernels for attention instead of the unified Triton kernel.
- use_prefill_query_quantization (bool) – If set, quantize query for attention in prefill.
- use_trtllm_attention (bool | None) – If set to True/False, use or don't use the TRTLLM attention backend in flashinfer.
Source code in vllm/config/attention.py
backend = None class-attribute instance-attribute
Attention backend to use. Use "auto" or None for automatic selection.
disable_flashinfer_q_quantization = False class-attribute instance-attribute
If set, when using fp8 kv, do not quantize Q to fp8.
flash_attn_max_num_splits_for_cuda_graph = 32 class-attribute instance-attribute
Maximum number of Flash Attention splits for CUDA graph decode.
flash_attn_version = None class-attribute instance-attribute
Force vLLM to use a specific flash-attention version (2, 3, or 4). Only valid when using the flash-attention backend.
flex_attn_block_m = None class-attribute instance-attribute
Triton kernel BLOCK_M tile size for flex attention. Must be a power of 2 >= 16. If None and VLLM_BATCH_INVARIANT=1, defaults to 16.
flex_attn_block_n = None class-attribute instance-attribute
Triton kernel BLOCK_N tile size for flex attention. Must be a power of 2 >= 16. If None and VLLM_BATCH_INVARIANT=1, defaults to 16.
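The constraint on BLOCK_M and BLOCK_N (a power of 2, at least 16) can be checked with a single bit test. The following is a minimal sketch, not vLLM's actual validation code: `is_valid_flex_block` and `resolve_flex_block` are hypothetical helper names, and the behavior when the value is None and VLLM_BATCH_INVARIANT is not set is not stated above, so the sketch simply passes None through in that case.

```python
from typing import Optional


def is_valid_flex_block(size: int) -> bool:
    """Return True if size is a power of two and at least 16."""
    # A positive power of two has exactly one bit set, so size & (size - 1) == 0.
    return size >= 16 and (size & (size - 1)) == 0


def resolve_flex_block(size: Optional[int], batch_invariant: bool) -> Optional[int]:
    """Hypothetical resolver: validate an explicit size or apply the documented default."""
    if size is None:
        # Documented: None with VLLM_BATCH_INVARIANT=1 defaults to 16.
        # The non-batch-invariant default is not documented here (assumption: leave as None).
        return 16 if batch_invariant else None
    if not is_valid_flex_block(size):
        raise ValueError(f"block size must be a power of 2 >= 16, got {size}")
    return size
```

Values such as 16, 32, and 64 pass the check, while 8 (too small) and 24 (not a power of 2) are rejected.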
flex_attn_kv_block_size = None class-attribute instance-attribute
Logical KV block size for the flex attention block mask. Must be a power of 2 and divisible by flex_attn_block_n. If None, uses the default (kv_cache_block_size on PyTorch >= 2.9, 128 otherwise).
flex_attn_q_block_size = None class-attribute instance-attribute
Logical Q block size for the flex attention block mask. Must be a power of 2 and divisible by flex_attn_block_m. If None, uses the default (16 on PyTorch >= 2.9, 128 otherwise).
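The logical block sizes carry two constraints (power of 2, divisible by the matching tile size) and a PyTorch-version-dependent default. A rough sketch of how these rules combine follows; the function names are hypothetical, the PyTorch version is passed in as a `(major, minor)` tuple rather than read from `torch`, and none of this is taken from vLLM's source.

```python
from typing import Optional, Tuple


def is_power_of_two(n: int) -> bool:
    # Exactly one bit set means n is a power of two.
    return n > 0 and (n & (n - 1)) == 0


def default_q_block_size(torch_version: Tuple[int, int]) -> int:
    # Documented default: 16 on PyTorch >= 2.9, 128 otherwise.
    return 16 if torch_version >= (2, 9) else 128


def default_kv_block_size(torch_version: Tuple[int, int], kv_cache_block_size: int) -> int:
    # Documented default: kv_cache_block_size on PyTorch >= 2.9, 128 otherwise.
    return kv_cache_block_size if torch_version >= (2, 9) else 128


def check_logical_block(logical: int, tile: Optional[int]) -> None:
    """Raise ValueError if a logical block size violates the documented rules."""
    if not is_power_of_two(logical):
        raise ValueError(f"logical block size {logical} is not a power of 2")
    # "Divisible by the tile size" means the tile size divides the logical size evenly.
    if tile is not None and logical % tile != 0:
        raise ValueError(f"logical block size {logical} is not divisible by tile size {tile}")
```

For example, a logical Q block size of 128 is compatible with `flex_attn_block_m=16`, whereas a logical size of 16 with a tile size of 32 fails the divisibility check.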
mla_prefill_backend = None class-attribute instance-attribute
MLA prefill backend to use. If None, will be selected automatically. Valid options: FLASH_ATTN (FA3/FA4), FLASHINFER, TRTLLM_RAGGED.
tq_max_kv_splits_for_cuda_graph = 32 class-attribute instance-attribute
Maximum NUM_KV_SPLITS for TurboQuant during CUDA graph decode. Fixing the split count keeps grid dimensions constant across captures and lets buffers be pre-allocated, which avoids inflating the memory estimate.
use_fp4_indexer_cache = False class-attribute instance-attribute
If set, use the fp4 indexer cache for dsv32-family models (not supported yet).
use_non_causal = False class-attribute instance-attribute
Whether to use non-causal (bidirectional) attention.
use_prefill_decode_attention = False class-attribute instance-attribute
Use separate prefill and decode kernels for attention instead of the unified Triton kernel.
use_prefill_query_quantization = False class-attribute instance-attribute
If set, quantize query for attention in prefill.
use_trtllm_attention = None class-attribute instance-attribute
If set to True/False, use or don't use the TRTLLM attention backend in flashinfer. If None, auto-detect the attention backend in flashinfer.
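Because this attribute is tri-state (True, False, or None), code that consumes it has to distinguish "explicitly disabled" from "not specified". The sketch below illustrates that pattern only; `should_use_trtllm` and its `auto_detected` argument are hypothetical and stand in for whatever detection logic the backend actually performs.

```python
from typing import Optional


def should_use_trtllm(use_trtllm_attention: Optional[bool], auto_detected: bool) -> bool:
    """Resolve a tri-state flag against an auto-detected fallback (illustrative only)."""
    # Compare against None explicitly: `if not flag` would wrongly treat
    # an explicit False and an unset None as the same case.
    if use_trtllm_attention is None:
        return auto_detected
    return use_trtllm_attention
```

With this shape, an explicit False always wins over auto-detection, and None defers to it.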
compute_hash()
Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph from input ids/embeddings to the final hidden states, excluding anything before input ids/embeddings and after the final hidden states.
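One common way to build such a hash is to collect the fields that influence the computation graph into an ordered list and digest its string form. The following is a toy sketch of that idea, not vLLM's implementation: `ToyAttentionConfig` is a hypothetical stand-in with only three fields, and which fields the real `compute_hash` includes, as well as the digest it uses, are not stated above.

```python
import hashlib
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToyAttentionConfig:
    """Hypothetical stand-in for a config class; not vLLM's AttentionConfig."""

    backend: Optional[str] = None
    flash_attn_version: Optional[int] = None
    use_non_causal: bool = False

    def compute_hash(self) -> str:
        # Collect (name, value) pairs in declaration order so the digest is stable.
        # Assumption: every field here affects the graph; a real config may exclude some.
        factors = [(f.name, getattr(self, f.name)) for f in fields(self)]
        return hashlib.sha256(repr(factors).encode("utf-8")).hexdigest()
```

Two configs with identical field values produce the same digest, and changing any included field changes it, which is the property needed for keying compiled-graph caches.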
validate_backend_before(value) classmethod
Enable parsing of the backend enum type from string.
The special value "auto" is treated as None, which triggers automatic backend selection.
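The string-to-enum parsing described here can be sketched as follows. This is an illustration under stated assumptions, not the actual validator: `ToyBackend` is a hypothetical two-member enum standing in for `AttentionBackendEnum`, `parse_backend` is a hypothetical function name, and the case-insensitive matching is an assumption rather than documented behavior.

```python
from enum import Enum
from typing import Optional, Union


class ToyBackend(Enum):
    """Hypothetical stand-in enum; the real AttentionBackendEnum has its own members."""

    FLASH_ATTN = "FLASH_ATTN"
    FLASHINFER = "FLASHINFER"


def parse_backend(value: Union[str, ToyBackend, None]) -> Optional[ToyBackend]:
    """Convert a user-supplied value to an enum member, or None for automatic selection."""
    if value is None or isinstance(value, ToyBackend):
        # Already in its final form; nothing to parse.
        return value
    if value.lower() == "auto":
        # Documented: "auto" is treated as None, which triggers automatic selection.
        return None
    # Look up by member name; an unknown name raises KeyError.
    # Upper-casing the input is an assumption of this sketch.
    return ToyBackend[value.upper()]
```

Running the parser before field validation lets users pass plain strings (for example from a CLI flag) while the rest of the code only ever sees an enum member or None.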
validate_mla_prefill_backend_before(value) classmethod
Enable parsing of the mla_prefill_backend enum type from string.