vllm.model_executor.layers.quantization.qutlass_utils
Functions:
- to_blocked – Rearrange a large matrix by breaking it into blocks and applying the rearrangement pattern.
- triton_mx_block_rearrange – Rearranges an E8M0 tensor scale from row-major format to block-scaled swizzle format.
- triton_scale_swizzle – Rearranges tensor data from row-major to block-scaled swizzle format.
to_blocked(input_matrix, backend='triton')
Rearrange a large matrix by breaking it into blocks and applying the rearrangement pattern.
See: https://docs.nvidia.com/cuda/cublas/index.html#d-block-scaling-factors-layout
Parameters:
- input_matrix (Tensor) – Input tensor of shape (H, W).
- backend (Literal['torch', 'triton'], default: 'triton') – "torch" (PyTorch path) or "triton" (Triton kernel).
Returns:
- Tensor – Rearranged tensor of shape (32 * cdiv(H, 128), 16 * cdiv(W, 4)), where cdiv denotes ceiling division.
Source code in vllm/model_executor/layers/quantization/qutlass_utils.py
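The documented return shape can be checked without running either backend. The following is a minimal sketch in plain Python; `cdiv` and `blocked_shape` are hypothetical helper names introduced here for illustration and are not part of `qutlass_utils`.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def blocked_shape(h: int, w: int) -> tuple[int, int]:
    """Shape documented for to_blocked, given an (H, W) input.

    The formula rounds H up to a multiple of 128 and W up to a multiple
    of 4, so every 128 x 4 tile accounts for 32 x 16 = 512 output elements.
    """
    return 32 * cdiv(h, 128), 16 * cdiv(w, 4)


if __name__ == "__main__":
    print(blocked_shape(128, 4))  # (32, 16): exactly one tile
    print(blocked_shape(129, 5))  # (64, 32): one extra row tile and one extra column tile
```

Both dimensions round up, so a (129, 5) input yields the same documented shape as a (256, 8) input.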
triton_mx_block_rearrange(scale_tensor)
Rearranges an E8M0 tensor scale from row-major format to block-scaled swizzle format.
This format is suitable for Tmem (tensor memory), as described in the NVIDIA cuBLAS documentation: https://docs.nvidia.com/cuda/cublas/index.html#d-block-scaling-factors-layout
Parameters:
- scale_tensor (Tensor) – Input E8M0 scale tensor in row-major format.
Returns:
- Tensor – Rearranged tensor in block-scaled swizzle format.
Source code in vllm/model_executor/layers/quantization/qutlass_utils.py
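For orientation, the swizzle inside a single tile can be written as one index formula. The sketch below assumes the 128 x 4 tile layout described in the cuBLAS documentation linked above, in which the scale at tile-local position (row, col) lands at offset (row % 32) * 16 + (row // 32) * 4 + col. `swizzle_offset` and the two constants are hypothetical names used only for illustration; the Triton kernel itself is not reproduced here.

```python
TILE_ROWS = 128  # rows per tile (assumed, following the cuBLAS layout description)
TILE_COLS = 4    # columns per tile (assumed)


def swizzle_offset(row: int, col: int) -> int:
    """Offset of tile-local element (row, col) inside a 512-element swizzled tile."""
    if not (0 <= row < TILE_ROWS and 0 <= col < TILE_COLS):
        raise ValueError("position lies outside the 128 x 4 tile")
    # Rows are grouped modulo 32; the four groups of 32 rows are interleaved
    # in steps of 4 elements, and the column selects within those 4.
    return (row % 32) * 16 + (row // 32) * 4 + col


if __name__ == "__main__":
    print(swizzle_offset(0, 0))    # 0
    print(swizzle_offset(1, 0))    # 16
    print(swizzle_offset(32, 0))   # 4
    print(swizzle_offset(127, 3))  # 511
```

Under this assumption the mapping is a permutation of the 512 positions in a tile, so no scale value is lost or duplicated.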
triton_scale_swizzle(scale_ptr, scale_rows, scale_cols, output_ptr, input_row_stride, output_block_stride, BLOCK_ROWS, BLOCK_COLS)
Rearranges tensor data from row-major to block-scaled swizzle format.
Parameters:
- scale_ptr (Tensor) – Pointer to the input scale tensor.
- scale_rows (int) – Number of rows in the scale tensor.
- scale_cols (int) – Number of columns in the scale tensor.
- output_ptr (Tensor) – Pointer to the output tensor.
- input_row_stride (int) – Stride between rows in the input tensor.
- output_block_stride (int) – Stride between blocks in the output tensor.
- BLOCK_ROWS (constexpr) – Number of rows in a tile (compile-time constant).
- BLOCK_COLS (constexpr) – Number of columns in a tile (compile-time constant).
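To show how these parameters could fit together, here is a pure-Python emulation that processes one tile per (pid_row, pid_col) pair, the way a tiled kernel launch might. It is a sketch under several assumptions and not the Triton source: `emulate_scale_swizzle`, the `pid_row`/`pid_col` names, the zero fill for out-of-range elements, and the reading of `output_block_stride` as the number of output elements in one row of tiles are all assumptions. The hard-coded index formula only applies when BLOCK_ROWS is 128 and BLOCK_COLS is 4.

```python
def emulate_scale_swizzle(scale, scale_rows, scale_cols, input_row_stride,
                          output_block_stride, BLOCK_ROWS=128, BLOCK_COLS=4):
    """Emulate a row-major to swizzled copy over a flat Python list.

    `scale` is a flat row-major buffer; element (r, c) is read from
    scale[r * input_row_stride + c]. Returns the flat output buffer.
    """
    n_row_blocks = -(-scale_rows // BLOCK_ROWS)  # ceiling division
    n_col_blocks = -(-scale_cols // BLOCK_COLS)
    tile_numel = BLOCK_ROWS * BLOCK_COLS
    out = [0] * (n_row_blocks * n_col_blocks * tile_numel)

    for pid_row in range(n_row_blocks):        # one "program" per tile
        for pid_col in range(n_col_blocks):
            base = pid_row * output_block_stride + pid_col * tile_numel
            for r in range(BLOCK_ROWS):
                for c in range(BLOCK_COLS):
                    global_r = pid_row * BLOCK_ROWS + r
                    global_c = pid_col * BLOCK_COLS + c
                    # Masked load: positions past the matrix edge read as 0.
                    in_bounds = global_r < scale_rows and global_c < scale_cols
                    value = scale[global_r * input_row_stride + global_c] if in_bounds else 0
                    # Assumed 128 x 4 -> 32 x 16 swizzle within the tile.
                    out[base + (r % 32) * 16 + (r // 32) * 4 + c] = value
    return out


if __name__ == "__main__":
    # A 2 x 2 scale matrix padded into a single 128 x 4 tile.
    swizzled = emulate_scale_swizzle([1, 2, 3, 4], scale_rows=2, scale_cols=2,
                                     input_row_stride=2, output_block_stride=512)
    print(len(swizzled))                # 512
    print(swizzled[0], swizzled[1])     # 1 2
    print(swizzled[16], swizzled[17])   # 3 4
```

Passing the strides explicitly, instead of deriving them inside the function, mirrors the parameter list above: the caller decides how the input rows and output tiles are laid out in memory.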