vllm.models.deepseek_v4.nvidia.ops.o_proj
Functions:
- compute_fp8_einsum_recipe – fp8_einsum recipe + scale layout for the current GPU arch.
- deep_gemm_fp8_o_proj – O projection: inverse RoPE + FP8 quant + einsum + wo_b.
compute_fp8_einsum_recipe()
fp8_einsum recipe + scale layout for the current GPU arch.
SM90: FP32 block scales stay [g, r/128, d/128] → sfb_gran_mn=128. SM100: INT32 packed scales become [g, r, ...] → sfb_gran_mn=1.
Returns (einsum_recipe, tma_aligned_scales) for deep_gemm_fp8_o_proj.
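The SM90 scale layout and the per-architecture granularity described above can be sketched in a few lines. This is an illustrative sketch only, not the vLLM implementation: the helper names `sm90_block_scale_shape` and `sfb_gran_mn_for` are hypothetical, the use of ceiling division for dimensions that are not multiples of 128 is an assumption, and the SM100 packed layout is not reproduced because its trailing dimensions are not spelled out here.

```python
import math

BLOCK = 128  # block size implied by the [g, r/128, d/128] layout


def sm90_block_scale_shape(g: int, r: int, d: int, block: int = BLOCK) -> tuple[int, int, int]:
    """Shape of the FP32 block scales for a [g, r, d] tensor on SM90.

    One scale covers a block x block tile. Ceiling division is an
    assumption for sizes that are not multiples of the block size.
    """
    return (g, math.ceil(r / block), math.ceil(d / block))


def sfb_gran_mn_for(sm_major: int) -> int:
    """Scale granularity per architecture, as stated in the docstring.

    sm_major is the major compute capability (9 for SM90, 10 for SM100).
    """
    if sm_major == 9:
        return 128  # FP32 block scales, one per 128 rows
    if sm_major == 10:
        return 1    # INT32 packed scales, one entry per row
    raise ValueError(f"unsupported architecture: SM{sm_major}0")


print(sm90_block_scale_shape(g=8, r=1024, d=4096))
print(sfb_gran_mn_for(9), sfb_gran_mn_for(10))
```

With `r` and `d` both multiples of 128, the SM90 scale tensor holds one FP32 value per 128 x 128 tile of each group.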
Source code in vllm/models/deepseek_v4/nvidia/ops/o_proj.py
deep_gemm_fp8_o_proj(o, positions, cos_sin_cache, wo_a, wo_b, *, n_groups, heads_per_group, nope_dim, rope_dim, o_lora_rank, einsum_recipe, tma_aligned_scales)
O projection: inverse RoPE + FP8 quant + einsum + wo_b.
Shared by the FlashMLA and FlashInfer CUDA backends. einsum_recipe / tma_aligned_scales come from compute_fp8_einsum_recipe.
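To make the "inverse RoPE" step concrete, here is a minimal pure-Python sketch of undoing a rotary embedding on one head vector. It is not the fused kernel used by `deep_gemm_fp8_o_proj`. Several things are assumptions for illustration: the helper names are hypothetical, the rotation pairs adjacent elements (the real kernel may pair elements differently), and the rotary part is assumed to be the last `rope_dim` elements after `nope_dim` untouched ones, as the parameter names suggest.

```python
import math


def rope_rotate_pairs(x, cos, sin, inverse=False):
    """Rotate each adjacent pair (x[2i], x[2i+1]) by the angle whose
    cosine and sine are cos[i] and sin[i]. inverse=True rotates by the
    negated angle, which undoes the forward rotation."""
    sign = -1.0 if inverse else 1.0
    out = []
    for i in range(0, len(x), 2):
        c, s = cos[i // 2], sign * sin[i // 2]
        a, b = x[i], x[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def inverse_rope_on_head(head, nope_dim, cos, sin):
    """Leave the first nope_dim elements alone and inverse-rotate the
    rest (assumed layout: [nope part | rope part])."""
    return list(head[:nope_dim]) + rope_rotate_pairs(
        head[nope_dim:], cos, sin, inverse=True
    )


# Round trip: apply forward RoPE to the rotary part, then invert it.
angles = [0.3, 1.1]
cos = [math.cos(t) for t in angles]
sin = [math.sin(t) for t in angles]
head = [1.0, 2.0, 3.0, -4.0, 0.5, 6.0]  # nope_dim=2, rope_dim=4

rotated = head[:2] + rope_rotate_pairs(head[2:], cos, sin)
restored = inverse_rope_on_head(rotated, nope_dim=2, cos=cos, sin=sin)
print(all(abs(a - b) < 1e-9 for a, b in zip(head, restored)))
```

Because a rotation by the negated angle is the exact inverse of the forward rotation, the round trip recovers the original vector up to floating-point error.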