vllm.v1.attention.ops.common
Classes:
- CPTritonContext – The CPTritonContext is used to avoid recompilation of the Triton JIT.
Functions:
- correct_attn_out – Correct the attention output using the all-gathered lses.
- cp_lse_ag_out_ar – cp_attn_out: [ B, H, D ]
- cp_lse_ag_out_rs – cp_attn_out: [ B, H, D ]
- pack_seq_triton – Pack sequences of different lengths into a batched tensor.
- unpack_seq_triton – Unpack a packed decode query tensor back to the original format.
CPTritonContext
The CPTritonContext is used to avoid recompilation of the Triton JIT.
Source code in vllm/v1/attention/ops/common.py
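The general idea of avoiding recompilation can be illustrated with a compile-once cache. The sketch below is not the vLLM implementation: `KernelCache`, `call`, and `fake_compile` are hypothetical names, and the "compilation" is a plain Python stand-in for a Triton JIT step. It only shows the pattern of compiling on the first call and reusing the handle afterwards.

```python
class KernelCache:
    """Hypothetical stand-in for a context object that compiles only once."""

    def __init__(self):
        self.compiled = None      # handle produced by the first compilation
        self.compile_count = 0    # bookkeeping for this illustration only

    def call(self, compile_fn, *args):
        # Compile on first use, then reuse the cached handle on later calls.
        if self.compiled is None:
            self.compiled = compile_fn()
            self.compile_count += 1
        return self.compiled(*args)


def fake_compile():
    # Stand-in for an expensive JIT compilation step (assumption).
    return lambda a, b: a + b


cache = KernelCache()
results = [cache.call(fake_compile, i, 1) for i in range(3)]
```

Passing the same context object to repeated calls is what lets the cached handle be reused.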
_correct_attn_cp_out_kernel(outputs_ptr, new_output_ptr, lses_ptr, vlse_ptr, outputs_stride_B, outputs_stride_H, outputs_stride_D, lses_stride_N, lses_stride_B, lses_stride_H, lse_idx, HEAD_DIM, N_ROUNDED, IS_BASE_E)
Apply the all-gathered lses to correct each local rank's attention output. A cross-rank reduction is still needed afterwards to obtain the final attention output.
Parameters:
- outputs_ptr (PointerType) – Pointer to input tensor of shape [ B, H, D ]
- lses_ptr (PointerType) – Pointer to input tensor of shape [ N, B, H ]
- new_output_ptr (PointerType) – Pointer to output tensor of shape [ B, H, D ]
- vlse_ptr (PointerType) – Pointer to output tensor of shape [ B, H ]
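The correction for a single (batch, head) entry can be sketched in plain Python. This is a reference for the underlying math under stated assumptions, not the Triton kernel: the function name `correct_local_output` is hypothetical, ranks with no keys are assumed to report `-inf`, and `base_e=False` is assumed to mean the lses are base-2 logarithms. The global lse is the log-sum-exp of the per-rank lses, and each rank scales its output by `exp(lse_rank - lse_global)`.

```python
import math


def correct_local_output(out, lses, rank, base_e=True):
    """Rescale one rank's partial attention output for one (batch, head).

    out:  list of D floats, this rank's locally normalised output
    lses: list of N floats, the all-gathered log-sum-exp values
    Returns (corrected_out, global_lse).
    """
    log = math.log if base_e else math.log2
    power = math.exp if base_e else (lambda v: 2.0 ** v)

    # Assumption: non-finite entries mark ranks that contribute nothing.
    finite = [l for l in lses if math.isfinite(l)]
    if not finite:
        return [0.0] * len(out), float("-inf")

    # Numerically stable log-sum-exp over the ranks.
    m = max(finite)
    global_lse = m + log(sum(power(l - m) for l in finite))

    # Weight of this rank's softmax mass relative to the global mass.
    weight = power(lses[rank] - global_lse)
    return [weight * o for o in out], global_lse


corrected, lse = correct_local_output([2.0, 4.0], [0.0, 0.0], rank=0)
```

Summing the corrected outputs of all ranks then yields the attention output over the full key set, which is why a cross-rank reduction follows.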
_cp_lse_common(cp_attn_out, cp_attn_lse, cp_group, ctx=None, is_lse_base_on_e=True)
cp_attn_out: [ B, H, D ]
cp_attn_lse: [ B, H ]
correct_attn_out(out, lses, cp_rank, ctx, is_lse_base_on_e=True)
Correct the attention output using the all-gathered lses.
Parameters:
- out (Tensor) – Tensor of shape [ B, H, D ]
- lses (Tensor) – Tensor of shape [ N, B, H ]
- cp_rank (int) – Current rank in the context-parallel group
- ctx (CPTritonContext) – Triton context to avoid recompilation
Returns:
cp_lse_ag_out_ar(cp_attn_out, cp_attn_lse, cp_group, ctx=None, return_lse=False, is_lse_base_on_e=True)
cp_attn_out: [ B, H, D ]
cp_attn_lse: [ B, H ]
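The function name suggests an all-gather of the lses followed by an all-reduce of the corrected outputs; that reading is an assumption here. The sketch below simulates it on one process for a single query, with hypothetical helper names (`local_attention`, `simulate_lse_ag_out_ar`), and shows that the combined result matches attention computed over the unsplit keys.

```python
import math


def local_attention(scores, values):
    """Softmax attention over one key/value shard for a single query.
    Returns (normalised output, log-sum-exp of the scores)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    dim = len(values[0])
    out = [sum(e * v[d] for e, v in zip(exps, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)


def simulate_lse_ag_out_ar(shards):
    """shards: list of (scores, values), one entry per simulated rank."""
    partial = [local_attention(s, v) for s, v in shards]
    lses = [lse for _, lse in partial]        # simulated all-gather of lses
    m = max(lses)
    global_lse = m + math.log(sum(math.exp(l - m) for l in lses))
    dim = len(partial[0][0])
    reduced = [0.0] * dim
    for out, lse in partial:                  # simulated all-reduce (sum)
        weight = math.exp(lse - global_lse)
        for d in range(dim):
            reduced[d] += weight * out[d]
    return reduced, global_lse


scores = [0.1, 1.5, -0.3, 2.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
full, full_lse = local_attention(scores, values)
combined, combined_lse = simulate_lse_ag_out_ar(
    [(scores[:2], values[:2]), (scores[2:], values[2:])]
)
```

With an all-reduce, every rank ends up holding the same full [ B, H, D ] result.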
cp_lse_ag_out_rs(cp_attn_out, cp_attn_lse, cp_group, ctx=None, return_lse=False, is_lse_base_on_e=True)
cp_attn_out: [ B, H, D ]
cp_attn_lse: [ B, H ]
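Here the name suggests that the corrected outputs are combined with a reduce-scatter rather than an all-reduce. The sketch below illustrates reduce-scatter semantics on nested lists, assuming the scatter runs along the head dimension H and that H is divisible by the number of ranks; both are assumptions, and `reduce_scatter_heads` is a hypothetical helper, not part of vLLM.

```python
def reduce_scatter_heads(per_rank_out):
    """per_rank_out[r][b][h] is a list of D floats held by rank r.

    Returns one [B, H // N, D] result per rank: rank r keeps the
    element-wise sum over all ranks for its own contiguous block of heads.
    """
    n = len(per_rank_out)
    batch = len(per_rank_out[0])
    num_heads = len(per_rank_out[0][0])
    assert num_heads % n == 0, "H must be divisible by the number of ranks"
    chunk = num_heads // n

    result = []
    for r in range(n):
        rank_out = []
        for b in range(batch):
            heads = []
            for h in range(r * chunk, (r + 1) * chunk):
                dim = len(per_rank_out[0][b][h])
                heads.append([
                    sum(per_rank_out[k][b][h][d] for k in range(n))
                    for d in range(dim)
                ])
            rank_out.append(heads)
        result.append(rank_out)
    return result


# Two ranks, B = 1, H = 2, D = 1.
rank0 = [[[1.0], [2.0]]]
rank1 = [[[10.0], [20.0]]]
scattered = reduce_scatter_heads([rank0, rank1])
```

Compared with an all-reduce, each rank receives only its slice of the summed result, which reduces the data every rank has to hold afterwards.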
pack_seq_triton(x, lengths, pad_value=-float('inf'), block_t=64, block_d=64)
Pack sequences of different lengths into a batched tensor.
Supports any float dtype (padded via fp32) and torch.uint8 (padded with an exact byte value, e.g. for MXFP4 packed nibbles or ue8m0 scale bytes). For uint8 inputs, pad_value must be an integer in [0, 255].
Parameters:
- x (Tensor) – [N, ...] - input tensor, where N is the total number of tokens.
- lengths (Tensor) – [B] - sequence lengths for each batch.
- pad_value (float | int, default: -float('inf')) – value to use for padding. Defaults to -inf, which is only sensible for float dtypes; pass 0 (or any byte) for uint8 inputs.
- block_t (int, default: 64) – block size for time dimension.
- block_d (int, default: 64) – block size for feature dimension.
Returns:
- packed (Tensor) – [B, Lmax, ...] - packed tensor.
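The shape contract above can be checked against a plain-Python reference. This sketch assumes the sequences are stored back to back in `x` in batch order and that padding is appended on the right of each sequence; `pack_seq_reference` is a hypothetical name and operates on nested lists rather than tensors.

```python
def pack_seq_reference(x, lengths, pad_value=float("-inf")):
    """x: list of N token rows (each a list of numbers); lengths: B ints.

    Returns a [B, Lmax, width] nested list, right-padded with pad_value.
    """
    assert sum(lengths) == len(x), "lengths must sum to the token count"
    l_max = max(lengths)
    width = len(x[0])

    packed, start = [], 0
    for length in lengths:
        # Copy this sequence's tokens, then pad up to the longest sequence.
        rows = [list(row) for row in x[start:start + length]]
        rows += [[pad_value] * width for _ in range(l_max - length)]
        packed.append(rows)
        start += length
    return packed


tokens = [[1.0], [2.0], [3.0]]
packed = pack_seq_reference(tokens, [2, 1], pad_value=0.0)
```

For byte-valued inputs the same layout applies, with `pad_value` chosen as an integer in [0, 255].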
unpack_seq_triton(packed_tensor, lengths, block_t=64, block_d=64)
Unpack a packed decode query tensor back to the original format. Efficient Triton implementation.
Parameters:
- packed_tensor (Tensor) – [B, Lmax, ...] - packed tensor from pack_seq_triton
- lengths (Tensor) – [B] - sequence lengths for each batch
- block_t (int, default: 64) – block size for time dimension
- block_d (int, default: 64) – block size for feature dimension
Returns:
- unpacked_tensor (Tensor) – [N, ...] where N = sum(lengths)
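As a plain-Python reference for the inverse operation, the sketch below drops the padding rows and concatenates the valid tokens of each sequence. It assumes right-padded input in batch order; `unpack_seq_reference` is a hypothetical name working on nested lists, not the Triton implementation.

```python
def unpack_seq_reference(packed, lengths):
    """packed: [B, Lmax, width] nested list; lengths: B ints.

    Returns a flat list of N = sum(lengths) token rows.
    """
    unpacked = []
    for rows, length in zip(packed, lengths):
        # Keep only the first `length` rows; the rest is padding.
        unpacked.extend(rows[:length])
    return unpacked


packed = [[[1.0], [2.0]], [[3.0], [float("-inf")]]]
lengths = [2, 1]
unpacked = unpack_seq_reference(packed, lengths)
```

The padding value never appears in the result, so the choice of `pad_value` at pack time does not affect the round trip.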