vllm.model_executor.layers.quantization.utils.flashinfer_fp4_moe ¶
Utility helpers for the NVFP4 + FlashInfer fused-MoE path.
Functions:
- reorder_w1w3_to_w3w1: Re-order the concatenated [w1, w3] tensors to [w3, w1]
interleave_linear_and_gate(x, group_size=64, dim=-1) ¶
Interleave gate and linear weight rows for CuteDSL wrapper.
Source code in vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
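The function body is not reproduced on this page, so the following is only a minimal sketch of what group-wise interleaving could look like, not the vLLM implementation. It assumes the input holds two equally sized halves along the chosen dimension and that the output alternates blocks of `group_size` rows from each half. Which half is the gate and which comes first in the output are assumptions here. Plain Python lists stand in for tensor rows, and `interleave_halves` is a hypothetical name.

```python
def interleave_halves(rows, group_size=64):
    """Alternate blocks of `group_size` rows taken from each half of `rows`.

    Hypothetical illustration only: the real helper operates on tensors and
    its exact block ordering is not documented on this page.
    """
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of rows")
    half = len(rows) // 2
    if half % group_size != 0:
        raise ValueError("each half must be a whole number of groups")

    first, second = rows[:half], rows[half:]
    out = []
    for start in range(0, half, group_size):
        # One block from the first half, then the matching block
        # from the second half.
        out.extend(first[start:start + group_size])
        out.extend(second[start:start + group_size])
    return out


# Tiny example with group_size=2 so the pattern is visible.
rows = ["a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3"]
interleaved = interleave_halves(rows, group_size=2)
```

With `group_size=2`, the two halves `a0..a3` and `b0..b3` come out as `a0, a1, b0, b1, a2, a3, b2, b3`. When `group_size` equals the size of one half, the order is unchanged.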
is_flashinfer_fp4_cutlass_moe_available() ¶
Return True when FlashInfer CUTLASS NV-FP4 kernels can be used.
prepare_nvfp4_moe_layer_for_flashinfer_cutedsl(layer, w13, w13_scale, w13_scale_2, a13_scale, w2, w2_scale, w2_scale_2, a2_scale) ¶
Prepare weights for the CuteDSL wrapper-based NVFP4 MoE backend.
Converts weight scale factors to the MMA layout expected by CuteDslMoEWrapper and interleaves the w13 gate/linear rows.
reorder_w1w3_to_w3w1(weight, scale, dim=-2) ¶
Re-order the concatenated [w1, w3] tensors to [w3, w1]
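As a rough illustration of the re-ordering described above, the sketch below swaps the two halves of a weight and of its scale along the row axis. It is not the vLLM source: plain Python lists stand in for tensors, the row axis stands in for `dim=-2`, and the names `swap_halves` and `reorder_sketch` are hypothetical.

```python
def swap_halves(rows):
    """Return `rows` with its first and second halves exchanged."""
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of rows")
    half = len(rows) // 2
    # [w1 | w3] becomes [w3 | w1].
    return rows[half:] + rows[:half]


def reorder_sketch(weight, scale):
    """Apply the same half-swap to a weight and its matching scale."""
    return swap_halves(weight), swap_halves(scale)


# Rows 0-1 play the role of w1, rows 2-3 the role of w3.
weight = [[1, 1], [2, 2], [3, 3], [4, 4]]
scale = [[0.1], [0.2], [0.3], [0.4]]
new_weight, new_scale = reorder_sketch(weight, scale)
```

The weight and the scale are swapped together so that each scale row stays aligned with the weight row it belongs to.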