vllm.model_executor.layers.quantization.utils.nvfp4_emulation_utils ¶
Functions:
- dequantize_to_dtype: Dequantize the fp4 tensor back to high precision.
_dequantize_nvfp4_kernel(fp4_ptr, scale_ptr, global_scale_ptr, output_ptr, rows_per_batch, num_blocks, BLOCK_SIZE, has_batch_global_scale, TILE_BLOCKS) ¶
Triton kernel for NVFP4 dequantization (swizzle=False).
Optimized with 2D tile processing plus interleaving for coalesced stores.
Source code in vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py
_e2m1_inline(magnitude) ¶
Inline E2M1 lookup using a binary decision tree: 3 levels of comparisons instead of 7 sequential ones.
Maps a 3-bit magnitude to a float: [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]. Uses bit decomposition to reduce the number of comparisons.
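As a rough illustration of the bit decomposition, the sketch below decodes a 3-bit magnitude in plain Python by splitting it into two exponent bits and one mantissa bit. The name `e2m1_decode` is hypothetical and not part of this module; the Triton helper may arrange its comparisons differently.

```python
# Table quoted from the docstring: 3-bit magnitude -> float value.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def e2m1_decode(magnitude: int) -> float:
    """Decode a 3-bit E2M1 magnitude (hypothetical helper, not the vLLM kernel)."""
    exponent = (magnitude >> 1) & 0b11  # upper two bits
    mantissa = magnitude & 0b1          # lowest bit
    if exponent == 0:
        # Subnormal range: 0.0 or 0.5
        return 0.5 * mantissa
    # Normal range: 2^(exponent - 1) * (1 + mantissa / 2)
    return float(2 ** (exponent - 1)) * (1.0 + 0.5 * mantissa)


decoded = [e2m1_decode(m) for m in range(8)]
```

Decoding all eight magnitudes this way reproduces the table, which is a convenient sanity check for any alternative lookup.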
_e2m1_lookup(magnitude) ¶
Lookup E2M1 float value from 3-bit magnitude.
_nvfp4_quant_dequant_kernel(input_ptr, output_ptr, global_scale_ptr, k, num_blocks, BLOCK_SIZE, FP4_MAX_RECIPROCAL, TILE_BLOCKS) ¶
Fused NVFP4 quantize-dequantize kernel.
Uses a 2D grid (rows x tiles) to parallelize across both rows and quantization groups within a row. Each program handles TILE_BLOCKS groups at once using vectorized 2D operations.
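To make the rows x tiles layout concrete, here is a small plain-Python sketch of how such a launch grid could be sized and which quantization groups a given program would cover. The helper names and the default `tile_blocks=4` are assumptions for illustration; the values the kernel actually uses are not stated here.

```python
import math


def launch_grid(rows: int, k: int, block_size: int = 16, tile_blocks: int = 4):
    """Return (grid, num_blocks) for a rows x tiles launch (illustrative only)."""
    num_blocks = k // block_size                 # quantization groups per row
    tiles = math.ceil(num_blocks / tile_blocks)  # programs needed per row
    return (rows, tiles), num_blocks


def blocks_for_program(tile_id: int, num_blocks: int, tile_blocks: int = 4):
    """Group indices one program covers; the last tile may be partial."""
    start = tile_id * tile_blocks
    return list(range(start, min(start + tile_blocks, num_blocks)))


grid, num_blocks = launch_grid(rows=8, k=160)
last_tile = blocks_for_program(tile_id=2, num_blocks=num_blocks)
```

With 160 columns and groups of 16, each row has 10 groups, so three tiles are needed per row and the last one covers only two groups. A real kernel would mask out the missing groups in that partial tile.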
_round_to_fp4(x) ¶
Round float values to the nearest E2M1 representable value.
Matches the thresholds in the Python cast_to_fp4 exactly.
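The rounding can be sketched in plain Python as a threshold table placed at the midpoints between adjacent E2M1 values. This is an illustration only: the tie handling (midpoints resolve toward the value with an even mantissa bit) is an assumption about `cast_to_fp4`, and `round_to_fp4` is a hypothetical name.

```python
# (upper bound, bound is inclusive, rounded value).
# Assumption: ties go to the neighbour with an even mantissa bit.
_THRESHOLDS = [
    (0.25, True, 0.0),
    (0.75, False, 0.5),
    (1.25, True, 1.0),
    (1.75, False, 1.5),
    (2.5, True, 2.0),
    (3.5, False, 3.0),
    (5.0, True, 4.0),
]


def round_to_fp4(x: float) -> float:
    """Round a float to the nearest E2M1 value (hypothetical sketch)."""
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    for bound, inclusive, value in _THRESHOLDS:
        if a < bound or (inclusive and a == bound):
            return sign * value
    return sign * 6.0  # everything above 5.0 saturates at the FP4 maximum


samples = [round_to_fp4(v) for v in (0.25, 0.75, 2.5, 5.0, 5.1, -1.3, 100.0)]
```

Under these assumptions, exact midpoints such as 0.75 and 2.5 land on 1.0 and 2.0 respectively, and any magnitude above 5.0 saturates at 6.0.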
_triton_dequantize_nvfp4(tensor_fp4, tensor_sf, global_scale, dtype, block_size=16) ¶
Dequantize NVFP4 using Triton (swizzle=False only).
Supports both 2D and 3D inputs:
- 2D: [m, packed_k] -> [m, k]
- 3D: [dim0, m, packed_k] -> [dim0, m, k]
_triton_nvfp4_quant_dequant(x, global_scale, block_size) ¶
Triton-accelerated NVFP4 quantize-dequantize.
dequantize_to_dtype(tensor_fp4, tensor_sf, global_scale, dtype, block_size=16, swizzle=True) ¶
Dequantize the fp4 tensor back to high precision.
Supports both 2D and 3D inputs:
- 2D: [m, packed_k] -> [m, k]
- 3D: [dim0, m, packed_k] -> [dim0, m, k]
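The [m, packed_k] -> [m, k] shape change follows from each byte holding two 4-bit values, so k = 2 * packed_k. The plain-Python sketch below unpacks one row and rescales it. It rests on several assumptions that this page does not confirm: the low nibble holds the even-indexed element, bit 3 of each nibble is the sign, and the effective scale is the block scale divided by the global scale. `dequantize_row` is a hypothetical helper; the real function operates on tensors.

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def dequantize_row(packed, block_scales, global_scale, block_size=16):
    """Unpack one row of FP4 bytes and rescale it (illustrative sketch)."""
    values = []
    for byte in packed:
        # Assumption: low nibble first, then high nibble.
        for nibble in (byte & 0x0F, byte >> 4):
            magnitude = E2M1_VALUES[nibble & 0x07]
            values.append(-magnitude if nibble & 0x08 else magnitude)
    # Each group of block_size values shares one scale (assumed convention).
    return [
        v * (block_scales[i // block_size] / global_scale)
        for i, v in enumerate(values)
    ]


# block_size=2 keeps the example small; the documented default is 16.
row = dequantize_row(bytes([0x21, 0x7F]), [4.0, 1.0], global_scale=2.0, block_size=2)
```

Two packed bytes yield four output values here, matching the doubling of the last dimension described above.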
ref_nvfp4_quant_dequant(x, global_scale, block_size) ¶
NVFP4 quantize-dequantize operation.
global_scale is expected to have a single element.
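For intuition, a quantize-dequantize round trip over a single block can be sketched in plain Python as below. This is a simplified illustration, not the reference implementation: it skips rounding the block scale to an 8-bit float, assumes the block scale is derived from the block's absolute maximum and the FP4 maximum of 6.0, and assumes ties round toward the even mantissa bit. Both helper names are hypothetical.

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0


def round_to_e2m1(x: float) -> float:
    """Nearest E2M1 value; ties prefer the even table index (assumption)."""
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP4_MAX)
    idx = min(range(8), key=lambda i: (abs(E2M1_VALUES[i] - a), i % 2))
    return sign * E2M1_VALUES[idx]


def quant_dequant_block(block, global_scale: float):
    """Quantize one block to E2M1 and immediately dequantize it (sketch)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Stored block scale; a faithful emulation would also round this value
    # to the scale's storage format, which this sketch omits.
    scale = global_scale * amax / FP4_MAX
    step = scale / global_scale  # real-valued size of one FP4 unit
    return [round_to_e2m1(v / step) * step for v in block]


out = quant_dequant_block([12.0, -6.0, 1.0, 0.2], global_scale=1.0)
```

In this example the block maximum of 12.0 maps to the FP4 maximum, so one FP4 unit is worth 2.0 and the small value 0.2 is flushed to zero.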