vllm.model_executor.layers.quantization.utils.nvfp4_utils ¶
Functions:
- pad_nvfp4_activation_for_cutlass – Pad packed FP4 activations to match the K-dimension padding applied to weights.
- pad_nvfp4_weight_for_cutlass – Pad packed NVFP4 weights so that both N (rows) and K (columns) satisfy the alignment constraints required by CUTLASS / FlashInfer FP4 kernels.
- slice_nvfp4_output – Slice the output tensor to remove padding in the N dimension if the weight was padded.
- swizzle_blockscale – Pad and block-interleave the FP4 block-scales so that they match the data layout expected by the CUTLASS / FlashInfer kernels.
pad_nvfp4_activation_for_cutlass(x_fp4, weights_padding_bytes) ¶
Pad packed FP4 activations to match the K-dimension padding applied to weights. weights_padding_bytes is measured in bytes along the packed tensor dimension (two FP4 values per byte), not in FP4 elements.
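As a rough illustration of padding in bytes rather than FP4 elements, the sketch below models each packed activation row as a Python `bytes` object. The helper name `pad_packed_rows` is hypothetical, and zero-filling the padded bytes is an assumption, not something this page states. On real tensors, `torch.nn.functional.pad` on the last dimension would be the natural equivalent.

```python
def pad_packed_rows(rows: list[bytes], padding_bytes: int) -> list[bytes]:
    """Append `padding_bytes` zero bytes to every packed row (hypothetical helper).

    Each byte holds two FP4 values, so padding by N bytes extends K by 2 * N
    FP4 elements. Zero-filling is an assumption made for this sketch.
    """
    if padding_bytes <= 0:
        return list(rows)  # nothing to pad
    return [row + bytes(padding_bytes) for row in rows]


# Two rows with 2 packed bytes each (4 FP4 values), padded by 2 bytes.
padded = pad_packed_rows([b"\x12\x34", b"\x56\x78"], 2)
print([len(row) for row in padded])
```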
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_utils.py
pad_nvfp4_weight_for_cutlass(weight, alignment=32) ¶
Pad packed NVFP4 weights so that both N (rows) and K (columns) satisfy the alignment constraints required by CUTLASS / FlashInfer FP4 kernels.
The CUTLASS FP4 kernel requires both the K and N matrix dimensions to be divisible by 32 for aligned memory access and efficient tensor core operations.
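The alignment arithmetic can be sketched in plain Python. This is only an illustration under stated assumptions: the packed weight is taken to have shape (N, K / 2) with two FP4 values per byte along K, and the helper names `round_up` and `nvfp4_weight_padding` are hypothetical rather than part of this module.

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple of `multiple`."""
    return (x + multiple - 1) // multiple * multiple


def nvfp4_weight_padding(n_rows: int, k_bytes: int, alignment: int = 32) -> tuple[int, int]:
    """Return (pad_rows, pad_k_bytes) for a packed weight of shape (n_rows, k_bytes).

    Assumes two FP4 values per byte along K, so K in elements is k_bytes * 2.
    """
    pad_rows = round_up(n_rows, alignment) - n_rows
    k_elements = k_bytes * 2
    pad_k_elements = round_up(k_elements, alignment) - k_elements
    # k_elements and alignment are both even, so the element padding is even
    # and converts to whole bytes.
    return pad_rows, pad_k_elements // 2


# N = 100 rows, K = 112 FP4 elements stored in 56 bytes per row.
print(nvfp4_weight_padding(100, 56))  # (28, 8)
```

Under these assumptions the second value is the kind of byte count that would be passed on as the K-dimension padding for activations.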
slice_nvfp4_output(out, output_size) ¶
Slice the output tensor to remove padding in the N dimension if the weight was padded.
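A minimal sketch of the slicing step, assuming the output is two-dimensional with N as its last dimension and modelling it as a list of rows. The helper name `slice_padded_output` is hypothetical. On a tensor the same idea would be a slice along the last dimension.

```python
def slice_padded_output(out: list[list[float]], output_size: int) -> list[list[float]]:
    """Keep only the first `output_size` columns of each row (hypothetical helper)."""
    if out and len(out[0]) == output_size:
        return out  # the weight was not padded in N, nothing to remove
    return [row[:output_size] for row in out]


# N was padded from 3 to 4, so the last column of every row is padding.
print(slice_padded_output([[1.0, 2.0, 3.0, 0.0], [4.0, 5.0, 6.0, 0.0]], 3))
```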
swizzle_blockscale(scale) ¶
Pad and block-interleave the FP4 block-scales so that they match the data layout expected by the CUTLASS / FlashInfer kernels.
Parameters¶
scale: torch.Tensor
Returns¶
torch.Tensor: The swizzled tensor with the same logical shape as scale.
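To make "pad and block-interleave" more concrete, the sketch below computes where a scale at row m, column k would land in a flat swizzled buffer. It assumes the commonly used layout in which rows are padded to a multiple of 128, columns to a multiple of 4, and each 128 x 4 tile is stored as 32 x 4 x 4. Those tile sizes and the helper names are assumptions for illustration, not details confirmed by this page.

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple of `multiple`."""
    return (x + multiple - 1) // multiple * multiple


def swizzled_offset(m: int, k: int, k_padded: int) -> int:
    """Flat offset of scale (m, k) under the assumed 128 x 4 tile layout."""
    tile_row, r = divmod(m, 128)   # which 128-row tile, and the row inside it
    r4, r32 = divmod(r, 32)        # split the row into a 4 x 32 sub-index
    tile_col, c4 = divmod(k, 4)    # which 4-column tile, and the column inside it
    tiles_per_row = k_padded // 4
    tile_index = tile_row * tiles_per_row + tile_col
    # Each tile holds 512 scales, ordered as (r32, r4, c4).
    return tile_index * 512 + r32 * 16 + r4 * 4 + c4


def swizzle(scale: list[list[int]]) -> list[int]:
    """Zero-pad a 2-D scale matrix and return it as a flat swizzled list."""
    m_rows, k_cols = len(scale), len(scale[0])
    m_padded, k_padded = round_up(m_rows, 128), round_up(k_cols, 4)
    flat = [0] * (m_padded * k_padded)
    for m, row in enumerate(scale):
        for k, value in enumerate(row):
            flat[swizzled_offset(m, k, k_padded)] = value
    return flat


flat = swizzle([[1, 2], [3, 4]])
print(len(flat), flat[0], flat[1], flat[16], flat[17])
```

In this layout, neighbouring rows sit 16 entries apart while rows 32 apart sit 4 entries apart, which is the interleaving the kernels are assumed to expect.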