vllm.models.deepseek_v4.common.ops.fused_inv_rope_fp8_quant
Fused inverse RoPE + block-scaled FP8 quantization kernel for DeepseekV4 attention.
The output scales are written directly in the pre-transformed layout that fp8_einsum expects (MN-major, TMA-aligned; FP32 on SM90, INT32-packed UE8M0 on SM100), so fp8_einsum can skip transform_sf_into_required_layout.
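To make the two scale encodings concrete, here is a minimal pure-Python sketch of how one block's scale could be derived and stored. The helper names are hypothetical, and the zero guard, the round-up-to-power-of-two rule, and the little-endian byte order are assumptions for illustration; the kernel's exact choices may differ.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def fp32_block_scale(block):
    """Scale that maps the block's largest magnitude onto the FP8 range."""
    amax = max(abs(x) for x in block)
    # Assumed guard against all-zero blocks; the kernel's epsilon may differ.
    return max(amax, 1e-10) / FP8_E4M3_MAX


def ue8m0_exponent(scale):
    """Round a positive scale up to a power of two and return its biased exponent.

    UE8M0 stores only an 8-bit exponent with bias 127, so the decoded
    value is 2 ** (byte - 127).
    """
    exp = math.ceil(math.log2(scale))
    return max(0, min(254, exp + 127))


def pack_ue8m0(exponents):
    """Pack four exponent bytes into one 32-bit word (assumed little-endian order).

    Returns the unsigned bit pattern; a real INT32 tensor would reinterpret it.
    """
    assert len(exponents) == 4
    word = 0
    for i, e in enumerate(exponents):
        word |= (e & 0xFF) << (8 * i)
    return word
```

With the FP32 format each block keeps its full-precision scale; with UE8M0 the scale is coarsened to a power of two, and four neighbouring block scales share one 32-bit word.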
Functions:
- fused_inv_rope_fp8_quant – Fused inverse RoPE + block-scaled FP8 quantization.
fused_inv_rope_fp8_quant(o, positions, cos_sin_cache, n_groups, heads_per_group, nope_dim=448, rope_dim=64, quant_group_size=128, tma_aligned_scales=False)
Fused inverse RoPE + block-scaled FP8 quantization.
Parameters:
- o (Tensor) – Attention output, shape [num_tokens, num_heads, head_dim], bf16.
- positions (Tensor) – Token positions, shape [num_tokens], int64.
- cos_sin_cache (Tensor) – Precomputed cache of shape [max_pos, rope_dim]; each row is cos || sin concatenated.
- n_groups (int) – Number of output groups.
- heads_per_group (int) – Heads per group.
- nope_dim (int, default: 448) – Non-RoPE dimensions per head.
- rope_dim (int, default: 64) – RoPE dimensions per head.
- quant_group_size (int, default: 128) – FP8 quantization block size.
- tma_aligned_scales (bool, default: False) – If True, output INT32-packed UE8M0 scales (SM100); if False, output FP32 scales (SM90).
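As an illustration of what "inverse RoPE" means for a single head, the following pure-Python sketch rotates each pair of RoPE elements by the negative of its cached angle. It assumes that the RoPE elements are the trailing rope_dim entries of the head, that pairs are interleaved (x[2i], x[2i+1]), and that a cache row holds rope_dim/2 cosines followed by rope_dim/2 sines. The page only states "cos||sin", so these conventions and the helper names are assumptions, not the kernel's confirmed behavior.

```python
import math


def rotate_pairs(x, cos, sin, inverse=False):
    """Rotate interleaved pairs (x[2i], x[2i+1]) by +theta_i, or by -theta_i if inverse."""
    sign = -1.0 if inverse else 1.0
    out = list(x)
    for i in range(len(x) // 2):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = cos[i], sign * sin[i]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def inv_rope_head(head, cache_row, nope_dim=448, rope_dim=64):
    """Undo RoPE on the trailing rope_dim elements; the nope part passes through."""
    half = rope_dim // 2
    cos, sin = cache_row[:half], cache_row[half:]
    rope_part = head[nope_dim:nope_dim + rope_dim]
    return list(head[:nope_dim]) + rotate_pairs(rope_part, cos, sin, inverse=True)


# Round trip on a tiny head (nope_dim=2, rope_dim=4) to show the inverse property.
angles = [0.3, 1.1]
row = [math.cos(a) for a in angles] + [math.sin(a) for a in angles]
head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rotated = head[:2] + rotate_pairs(head[2:], row[:2], row[2:])
restored = inv_rope_head(rotated, row, nope_dim=2, rope_dim=4)
```

In this sketch, `restored` matches `head` up to floating-point rounding. The fused kernel would perform the equivalent rotation per token (indexing the cache by positions) and quantize the result in the same pass.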
Returns:
- o_fp8 (Tensor) – [T, G, D] float8_e4m3fn, strides (D, T*D, 1).
- o_scale (Tensor) – Pre-transformed scale tensor for fp8_einsum.
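The strides (D, T*D, 1) imply that o_fp8 is a [T, G, D] view over memory that is stored group-major, i.e. the same bytes as a contiguous [G, T, D] buffer. A small sketch, with hypothetical helper names, that checks this offset arithmetic:

```python
def element_offset(t, g, d, T, D):
    """Offset of o_fp8[t, g, d] in elements, given strides (D, T*D, 1)."""
    return t * D + g * (T * D) + d


def group_major_offset(t, g, d, T, D):
    """Offset of the same element in a contiguous [G, T, D] buffer."""
    return (g * T + t) * D + d


# Walking the buffer group by group, then token by token, visits
# consecutive memory locations.
T, G, D = 3, 2, 4
offsets = [
    element_offset(t, g, d, T, D)
    for g in range(G)
    for t in range(T)
    for d in range(D)
]
```

Here `offsets` enumerates 0 through T*G*D - 1 in order, so each group's [T, D] slice is a dense, row-major matrix.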