vllm.v1.attention.ops.triton_unified_attention ¶
_cast_kv_tile(data, Q, tensor_scale, KV_QUANT_MODE) ¶
Cast a loaded KV tile to Q's dtype, dequantizing if needed.
Modes handled inside the core kernel:
* KV_QUANT_MODE == 0 (NONE), 2 (INT8 per-token-head), and 3 (FP8 per-token-head): plain cast. Per-token-head modes apply their scales separately on S/P inside the loop.
* KV_QUANT_MODE == 1 (FP8 per-tensor): dequantize using the tensor-wide scale, unless Q is also FP8 and the caller folds the scales into the attention score and output accumulator.
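The mode dispatch described above can be modeled on the host with NumPy. This is a minimal sketch, not the Triton kernel: the function name `cast_kv_tile_reference` and the `q_is_fp8` flag are hypothetical, the constant names are illustrative labels for the mode numbers listed above, and an int8 array stands in for quantized data because NumPy has no FP8 dtype.

```python
import numpy as np

# Illustrative names for the mode numbers listed in the docstring.
KV_QUANT_NONE = 0
KV_QUANT_FP8_PER_TENSOR = 1
KV_QUANT_INT8_PER_TOKEN_HEAD = 2
KV_QUANT_FP8_PER_TOKEN_HEAD = 3


def cast_kv_tile_reference(data, q_dtype, tensor_scale, kv_quant_mode, q_is_fp8=False):
    """Hypothetical host-side model of the cast; not the kernel itself."""
    if kv_quant_mode == KV_QUANT_FP8_PER_TENSOR and not q_is_fp8:
        # Dequantize with the single tensor-wide scale, then cast to Q's dtype.
        return (data.astype(np.float32) * np.float32(tensor_scale)).astype(q_dtype)
    # Modes 0, 2, 3 (and mode 1 with FP8 Q): plain cast only.
    # Any scales are applied elsewhere by the caller.
    return data.astype(q_dtype)


tile = np.array([[1, -2], [3, 4]], dtype=np.int8)  # stand-in for quantized values
dequantized = cast_kv_tile_reference(tile, np.float16, 0.5, KV_QUANT_FP8_PER_TENSOR)
plain = cast_kv_tile_reference(tile, np.float16, 0.5, KV_QUANT_INT8_PER_TOKEN_HEAD)
```

Here `dequantized` holds the scaled values, while `plain` holds the raw values converted to Q's dtype, leaving the per-token-head scale to be applied later.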
Source code in vllm/v1/attention/ops/triton_unified_attention.py
_get_tile_size(head_size, sliding_window, element_size, is_prefill) ¶
Select tile size with Gemma3-specific optimization.
_is_gemma3_attention(head_size, sliding_window) ¶
Detect Gemma3 models via unique (head_size, sliding_window) signature.
Gemma3 models are the only ones using sliding_window=1024 with head_size 128 (27B) or 256 (1B, 4B, 12B). Other SWA models use different window sizes (Mistral=4096, Phi-3=2047).
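The signature check described above amounts to a two-condition predicate. The following is a minimal sketch that assumes the check uses exactly the values quoted in the docstring; the function name `is_gemma3_attention_sketch` is hypothetical and the real helper may differ in detail.

```python
def is_gemma3_attention_sketch(head_size: int, sliding_window: int) -> bool:
    """Return True for the (head_size, sliding_window) pairs the docstring names."""
    # Assumption: window 1024 combined with head size 128 or 256 is the whole test.
    return sliding_window == 1024 and head_size in (128, 256)


# Window sizes of other SWA models quoted above do not match.
gemma3_27b = is_gemma3_attention_sketch(128, 1024)
gemma3_small = is_gemma3_attention_sketch(256, 1024)
mistral_like = is_gemma3_attention_sketch(128, 4096)
```

Because the detection relies on shape parameters alone, any future model that happens to share this signature would also take the Gemma3 path.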
_load_kv_tile_td(cache_ptr, physical_block_idx_scalar, kv_head_idx, offset_in_block, stride_cache_0, stride_cache_1, stride_cache_2, stride_cache_3, BLOCK_SIZE, TILE_SIZE, HEAD_SIZE, HEAD_SIZE_PADDED) ¶
Load a KV cache tile via tensor descriptor.
Returns shape (TILE_SIZE, HEAD_SIZE_PADDED). Caller transposes for K. Tensor descriptors zero-pad reads beyond the shape boundary, so HEAD_SIZE_PADDED > HEAD_SIZE is handled correctly.
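The zero-padding behavior can be illustrated with a NumPy emulation. This is a sketch under simplifying assumptions: it models a single cache block for a single KV head as a 2D array (the real cache has more dimensions and is addressed through the strides), and `load_kv_tile_reference` is a hypothetical name.

```python
import numpy as np


def load_kv_tile_reference(cache_block, offset_in_block, tile_size, head_size_padded):
    """Emulate a descriptor load: elements outside the source shape read as zero."""
    block_size, head_size = cache_block.shape
    tile = np.zeros((tile_size, head_size_padded), dtype=cache_block.dtype)
    # Only the in-bounds rows and columns are copied; the rest stays zero.
    rows = max(0, min(tile_size, block_size - offset_in_block))
    cols = min(head_size, head_size_padded)
    tile[:rows, :cols] = cache_block[offset_in_block:offset_in_block + rows, :cols]
    return tile


# Assumed sizes for illustration: BLOCK_SIZE=16, HEAD_SIZE=96, HEAD_SIZE_PADDED=128.
block = np.arange(16 * 96, dtype=np.float32).reshape(16, 96)
tile = load_kv_tile_reference(block, offset_in_block=0, tile_size=16, head_size_padded=128)
k_for_qk = tile.T  # the caller transposes the K tile
```

The padded columns contribute zeros to the Q·K dot product, which is why a padded head size does not change the attention scores.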
_load_q_td(query_ptr, q_block_local_len, query_stride_0, query_stride_1, cur_batch_in_all_start_index, q_block_local_idx, kv_head_idx, num_queries_per_kv, BLOCK_Q, BLOCK_M, HEAD_SIZE, HEAD_SIZE_PADDED) ¶
Load Q via a 2D tensor descriptor.
Caller guarantees (via the wrapper's use_td_qo gate):
* HEAD_SIZE == HEAD_SIZE_PADDED (head_size is a power of 2),
* num_queries_per_kv is a power of 2,
* the num_queries_per_kv heads of the current KV group are contiguous in memory (query_stride_1 == HEAD_SIZE, which is the default vLLM query layout).
Under those preconditions the inner two axes are flattened into one row of size num_queries_per_kv * HEAD_SIZE with stride 1, which avoids the non-power-of-2 block_shape error from the Triton tensor-descriptor validator. Returns (BLOCK_M, HEAD_SIZE_PADDED).
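The flattening argument can be checked with plain offset arithmetic. The sketch below assumes a dense (tokens, heads, head_size) query layout and small illustrative sizes; the helper names `offset_3d` and `offset_flat_row` are hypothetical and exist only to compare the two addressing schemes.

```python
NUM_QUERIES_PER_KV = 4  # power of 2, per the preconditions
HEAD_SIZE = 8           # power of 2, so HEAD_SIZE == HEAD_SIZE_PADDED
NUM_Q_HEADS = 8         # two KV groups (illustrative)
NUM_TOKENS = 3

query_stride_1 = HEAD_SIZE                # heads are contiguous
query_stride_0 = NUM_Q_HEADS * HEAD_SIZE  # assumes a dense layout


def offset_3d(token, head, dim):
    """Element offset using the original (token, head, dim) addressing."""
    return token * query_stride_0 + head * query_stride_1 + dim


def offset_flat_row(token, kv_head_idx, col):
    """Element offset in the 2D view: one row per token, stride-1 columns."""
    group_base = kv_head_idx * NUM_QUERIES_PER_KV * HEAD_SIZE
    return token * query_stride_0 + group_base + col


kv_head_idx = 1
mismatches = 0
for token in range(NUM_TOKENS):
    for j in range(NUM_QUERIES_PER_KV):
        head = kv_head_idx * NUM_QUERIES_PER_KV + j
        for dim in range(HEAD_SIZE):
            flat = offset_flat_row(token, kv_head_idx, j * HEAD_SIZE + dim)
            if offset_3d(token, head, dim) != flat:
                mismatches += 1
```

Every element of the KV group lands at the same offset under both schemes, so a single row of width num_queries_per_kv * HEAD_SIZE covers all heads of the group for one token. If query_stride_1 were larger than HEAD_SIZE, the two offsets would diverge and the flattening would not be valid.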
_store_output_td(base_ptr, acc, q_block_local_len, stride_token, stride_head, num_queries_per_kv, BLOCK_Q, HEAD_SIZE, HEAD_SIZE_PADDED) ¶
Store an output tile via a tensor descriptor.
The 2D and 3D epilogues differ only in base_ptr and the (stride_token, stride_head) pair: the 2D epilogue writes directly to the flat output buffer, while the 3D epilogue writes to a single per-segment slice of segm_output_ptr. The descriptor shape, block_shape, and reshape are the same in both modes, so both epilogues share this one helper.