vllm.model_executor.layers.quantization.utils.quant_utils
This file is used for /tests and /benchmarks
Classes:
- GroupShape – This class describes the quantization group shape.
- QuantKey – Class for identifying the type of quantization.
- ScaleDesc – Class for describing a single quantization scaling factor.
Functions:
- convert_bf16_scales_to_fp8 – Convert a BF16 scale tensor into the pair of (fp8_scales, channel_scales) expected by W4A8 GEMM kernels.
- convert_packed_uint4b8_to_signed_int4_inplace – Convert uint4b8 values (packed into int32) to signed int4, in place.
- get_and_maybe_dequant_weights – Return the layer's unquantized weights in [out, in] layout.
- get_fp8_min_max – Get the min and max values for FP8 quantization.
- prep_scale_for_group_broadcast – Prepare the input quantization scale for group broadcasting.
- scaled_quantize – Quantize a tensor to a target dtype using the given quantization group shape.
GroupShape
Bases: _GroupShape
This class describes the quantization group shape. It includes static members for common shapes (per-tensor, per-token).
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
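As a rough illustration of what a group shape expresses, the sketch below uses a stand-in named tuple rather than the vLLM class. The field names `row` and `col`, the convention that -1 means "span the whole dimension", and the helper `num_groups` are all assumptions made for this example.

```python
from typing import NamedTuple


class SketchGroupShape(NamedTuple):
    """Hypothetical stand-in for a quantization group shape (rows x cols)."""
    row: int
    col: int


# Assumed convention for this sketch: -1 means "span the whole dimension".
PER_TENSOR = SketchGroupShape(-1, -1)  # one scale for the entire tensor
PER_TOKEN = SketchGroupShape(1, -1)    # one scale per row (token)


def num_groups(shape: tuple[int, int], group: SketchGroupShape) -> tuple[int, int]:
    """Return how many groups (and therefore scales) tile a 2D tensor."""
    rows, cols = shape
    g_rows = rows if group.row == -1 else group.row
    g_cols = cols if group.col == -1 else group.col
    if rows % g_rows or cols % g_cols:
        raise ValueError("tensor shape must be divisible by the group shape")
    return rows // g_rows, cols // g_cols
```

Under these assumptions, a (4, 256) activation needs a single scale per tensor, four scales per token, or a 4 x 2 grid of scales for groups of shape (1, 128).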
QuantKey dataclass
Class for identifying the type of quantization.
- dtype: quantized data type
- scale: scale descriptor
- scale2: second-level scale descriptor
- symmetric: symmetric if True, asymmetric if False
ScaleDesc dataclass
Class for describing a single quantization scaling factor.
- dtype: data type of the scale
- static: static scale if True, dynamic if False
- group_shape: group shape of the scale
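The fields listed for QuantKey and ScaleDesc can be pictured with the stand-in dataclasses below. This is a sketch, not the vLLM definitions: dtypes are plain strings so the example runs without PyTorch, and the `frozen=True` setting, the default values, and the kernel name are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SketchScaleDesc:
    dtype: str                    # data type of the scale, e.g. "float32"
    static: bool                  # True: static scale, False: dynamic
    group_shape: Tuple[int, int]  # group shape the scale applies to


@dataclass(frozen=True)
class SketchQuantKey:
    dtype: str                                # quantized data type
    scale: SketchScaleDesc                    # scale descriptor
    scale2: Optional[SketchScaleDesc] = None  # second-level scale (assumed optional)
    symmetric: bool = True                    # symmetric if True (assumed default)


# Dynamic per-token FP8: one float32 scale per row, computed at runtime.
per_token_scale = SketchScaleDesc("float32", static=False, group_shape=(1, -1))
fp8_dynamic_per_token = SketchQuantKey("float8_e4m3fn", per_token_scale)

# Frozen dataclasses are hashable, so a key can index a lookup table.
kernels = {fp8_dynamic_per_token: "hypothetical_per_token_fp8_kernel"}
```

Because two keys with equal fields compare and hash equal, a quantization scheme described this way can be matched against a table of supported configurations.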
convert_bf16_scales_to_fp8(quant_fp8, scales)
Convert a BF16 scale tensor into the pair of (fp8_scales, channel_scales) expected by W4A8 GEMM kernels.
convert_packed_uint4b8_to_signed_int4_inplace(t)
Convert uint4b8 values (4-bit unsigned with a bias of 8, packed into int32) to signed int4, in place.
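The arithmetic behind this conversion can be sketched on a single 32-bit word using plain Python integers. This is an illustration of one way to do it, not the vLLM implementation, which operates in place on a tensor; the function names are hypothetical. A uint4b8 nibble `u` encodes the value `u - 8`, and the two's-complement nibble for that value is `(u - 8) mod 16`, which equals `u ^ 8`.

```python
MASK32 = 0xFFFFFFFF


def uint4b8_to_int4_packed(word: int) -> int:
    """Convert eight bias-8 nibbles in a 32-bit word to two's-complement int4."""
    # (u - 8) mod 16 == u ^ 8 for every nibble u in 0..15, so flipping the
    # top bit of each nibble converts all eight values at once.
    return (word ^ 0x88888888) & MASK32


def decode_uint4b8(word: int) -> list[int]:
    """Decode eight bias-8 nibbles, lowest nibble first."""
    return [((word >> (4 * i)) & 0xF) - 8 for i in range(8)]


def decode_int4(word: int) -> list[int]:
    """Decode eight two's-complement nibbles, lowest nibble first."""
    out = []
    for i in range(8):
        nib = (word >> (4 * i)) & 0xF
        out.append(nib - 16 if nib >= 8 else nib)
    return out
```

Decoding a word as uint4b8 before the conversion and as signed int4 after it should give the same eight values.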
get_and_maybe_dequant_weights(layer, out_dtype=torch.float32)
Return the layer's unquantized weights in [out, in] layout, dequantizing them first if the layer is quantized.
get_fp8_min_max()
Get the min and max values for FP8 quantization.
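For orientation, the finite range of one common FP8 format, float8_e4m3fn, can be derived from its bit layout as sketched below. Which FP8 variant this function reports, and whether it adjusts the range on particular platforms, is not stated here, so treat the helper `e4m3fn_min_max` as a hypothetical illustration of the e4m3fn case only.

```python
def e4m3fn_min_max() -> tuple[float, float]:
    """Derive the finite range of float8_e4m3fn from its bit layout."""
    exp_bits, man_bits, bias = 4, 3, 7
    # In e4m3fn the all-ones exponent is still a normal binade; only the
    # all-ones mantissa within it encodes NaN, so the largest finite
    # mantissa field is 0b110.
    max_exp = (2**exp_bits - 1) - bias                   # 15 - 7 = 8
    max_mantissa = 1 + (2**man_bits - 2) / 2**man_bits   # 1 + 6/8 = 1.75
    fp8_max = max_mantissa * 2.0**max_exp                # 1.75 * 256
    return -fp8_max, fp8_max
```

The result is the symmetric range (-448.0, 448.0), which is the clamp range typically used when quantizing to this format.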
prep_scale_for_group_broadcast(scale, x, group_shape)
Prepare the input quantization scale for group broadcasting.
Parameters:
- scale (Tensor) – The scale tensor (scalar or 1D).
- x (Tensor) – Target tensor whose shape determines broadcast dimensions.
- group_shape (GroupShape | None) – GroupShape to broadcast over.
Returns:
- Tensor – scale reshaped for correct broadcasting.
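The shape logic can be sketched without tensors by working on shape tuples alone. The helper below is hypothetical and only approximates the idea: it assumes a (rows, cols) group shape in which -1 means "the full dimension", and it assumes a 1D scale holds one entry per row or one entry per column of `x`.

```python
def broadcast_scale_shape(scale_shape, x_shape, group_shape):
    """Return the shape a scalar or 1D scale needs to broadcast against x."""
    numel = 1
    for d in scale_shape:
        numel *= d
    if numel == 1:
        return scale_shape  # a single scale broadcasts as-is
    if len(scale_shape) != 1:
        return scale_shape  # already multi-dimensional; left unchanged
    rows, cols = x_shape[-2], x_shape[-1]
    g_rows = rows if group_shape[0] == -1 else group_shape[0]
    g_cols = cols if group_shape[1] == -1 else group_shape[1]
    if g_rows == rows:
        # Groups span all rows: one scale per column group, so a row vector.
        return (1, scale_shape[0])
    if g_cols == cols:
        # Groups span all columns: one scale per row group, so a column vector.
        return (scale_shape[0], 1)
    raise ValueError("a 1D scale must span all rows or all columns of x")
```

For a (4, 8) input, a per-row scale of shape (4,) becomes (4, 1) and a per-column scale of shape (8,) becomes (1, 8), so that elementwise multiplication or division lines up with the intended groups.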
scaled_quantize(x, group_shape, quant_dtype, compute_dtype=None)
Quantize a tensor to quant_dtype using the given quantization group shape.
Parameters:
- x (Tensor) – Input tensor to quantize.
- group_shape (GroupShape) – Shape of quantization groups.
- quant_dtype (dtype) – Target quantized dtype (e.g., torch.float8_e4m3fn).
- compute_dtype (dtype | None, default: None) – Optional dtype for intermediate computations. If None, uses the input dtype. Use torch.float32 for higher precision.
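A minimal sketch of group-wise symmetric quantization, assuming absmax scaling, is shown below. It is not the vLLM implementation: it works on nested Python lists, uses int8-style rounding with `qmax=127` as a stand-in for the cast to `quant_dtype`, assumes -1 in the group shape means "the full dimension", and does not model `compute_dtype` because Python floats are always double precision. The name `group_quantize` is hypothetical.

```python
def group_quantize(x, group_shape, qmax=127):
    """Quantize a 2D list group by group; return (quantized, scales)."""
    rows, cols = len(x), len(x[0])
    g_rows = rows if group_shape[0] == -1 else group_shape[0]
    g_cols = cols if group_shape[1] == -1 else group_shape[1]
    if rows % g_rows or cols % g_cols:
        raise ValueError("tensor shape must be divisible by the group shape")

    q = [[0] * cols for _ in range(rows)]
    scales = [[0.0] * (cols // g_cols) for _ in range(rows // g_rows)]
    for gi in range(rows // g_rows):
        for gj in range(cols // g_cols):
            r0, c0 = gi * g_rows, gj * g_cols
            cells = [(r, c) for r in range(r0, r0 + g_rows)
                     for c in range(c0, c0 + g_cols)]
            # One scale per group, chosen so the largest magnitude maps to qmax.
            amax = max(abs(x[r][c]) for r, c in cells)
            scale = amax / qmax if amax > 0 else 1.0
            scales[gi][gj] = scale
            for r, c in cells:
                # Round to the nearest level, then clamp to the target range.
                q[r][c] = max(-qmax, min(qmax, round(x[r][c] / scale)))
    return q, scales
```

Multiplying each quantized value by its group's scale recovers the input to within half a quantization step. Smaller groups give each scale fewer values to cover, which usually lowers the error at the cost of storing more scales.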