vllm.model_executor.layers.quantization.utils.mxfp8_utils
Functions:
- dequant_mxfp8_to_bf16 – Dequantize MXFP8 tensor to BF16.
- mxfp8_e4m3_quantize_fake – Fake implementation for torch.compile tracing.
- swizzle_mxfp8_scale – Swizzle MXFP8 scales from row-major 2D to F8_128x4 layout.
_mxfp8_e4m3_quantize_torch(x, is_sf_swizzled_layout=False)
Naive MXFP8 quantization. For each block of 32 elements along the last dimension, compute a shared e8m0 scale (the biased exponent of the block-wise amax) and quantize each element to float8_e4m3fn.
Returns a tuple (quantized_values, scales). quantized_values has the same shape as x in fp8; scales is a uint8 tensor whose shape depends on is_sf_swizzled_layout:
- False -> [..., K//32] (row-major 2D)
- True -> flat swizzled 1D
Source code in vllm/model_executor/layers/quantization/utils/mxfp8_utils.py
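The block-scaling step described above can be illustrated with a minimal pure-Python sketch. It is not the vLLM implementation: the helper names (`block_scale_e8m0`, `quantize_row`), the rounding convention for the shared exponent (floor of log2 of amax minus the e4m3 maximum exponent, as in the OCP MX convention), the handling of all-zero blocks, and the saturation to ±448 are all assumptions made for illustration. The final cast of each scaled element to float8_e4m3fn is omitted; the sketch only shows how one uint8 scale per 32 elements is derived and applied.

```python
import math

BLOCK_SIZE = 32    # elements sharing one scale (from the docstring)
E8M0_BIAS = 127    # e8m0 stores a biased power-of-two exponent
E4M3_MAX = 448.0   # largest finite float8_e4m3fn value
E4M3_EMAX = 8      # 448 = 1.75 * 2**8


def block_scale_e8m0(block):
    """Return the biased e8m0 exponent (0..254) for one block (assumed convention)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        # Assumption: an all-zero block gets a scale of 2**0.
        return E8M0_BIAS
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    unbiased = (e - 1) - E4M3_EMAX
    # 255 encodes NaN in e8m0, so clamp to the 0..254 range.
    return max(0, min(254, unbiased + E8M0_BIAS))


def quantize_row(row):
    """Scale each 32-element block; returns (scaled_values, scales)."""
    assert len(row) % BLOCK_SIZE == 0, "last dimension must be a multiple of 32"
    scaled, scales = [], []
    for start in range(0, len(row), BLOCK_SIZE):
        block = row[start:start + BLOCK_SIZE]
        s = block_scale_e8m0(block)
        factor = 2.0 ** (s - E8M0_BIAS)
        scales.append(s)
        # Saturate to the e4m3 range; rounding to fp8 itself is not modeled here.
        scaled.extend(max(-E4M3_MAX, min(E4M3_MAX, v / factor)) for v in block)
    return scaled, scales


row = [1.0] * 32 + [0.5] * 31 + [448.0]
scaled, scales = quantize_row(row)
print(scales)  # one uint8-range scale per block of 32
```

A row of length K yields K // 32 scales, which matches the row-major scale shape [..., K//32] given above.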
dequant_mxfp8_to_bf16(x, scales)
Dequantize MXFP8 tensor to BF16.
Source code in vllm/model_executor/layers/quantization/utils/mxfp8_utils.py
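As a rough illustration of what dequantization involves, the sketch below multiplies each element by the power-of-two scale of its 32-element block. It assumes row-major (non-swizzled) scales, e8m0 scales with a bias of 127, and fp8 values that have already been decoded to Python floats; the helper name `dequant_row` is hypothetical, and the conversion to BF16 is not modeled.

```python
BLOCK_SIZE = 32   # elements sharing one scale
E8M0_BIAS = 127   # bias of the e8m0 exponent encoding (assumed)


def dequant_row(values, scales):
    """Multiply each value by 2**(scale - 127) for its block (illustrative only)."""
    assert len(values) == len(scales) * BLOCK_SIZE, "one scale per 32 elements"
    out = []
    for i, v in enumerate(values):
        # Integer division maps element index i to the index of its block's scale.
        exponent = scales[i // BLOCK_SIZE] - E8M0_BIAS
        out.append(v * 2.0 ** exponent)
    return out


values = [256.0] * 32 + [448.0] * 32
restored = dequant_row(values, [119, 127])
print(restored[0], restored[-1])
```

Because every scale is a power of two, the multiplication only shifts the exponent of each element and leaves its significand unchanged.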
mxfp8_e4m3_quantize_fake(x, is_sf_swizzled_layout=False, alignment=0)
Fake implementation for torch.compile tracing.
Source code in vllm/model_executor/layers/quantization/utils/mxfp8_utils.py
swizzle_mxfp8_scale(sf, M, K)
Swizzle MXFP8 scales from row-major 2D to F8_128x4 layout.