vllm.model_executor.layers.quantization.utils.marlin_utils
Functions:
- marlin_moe_intermediate_size: Given Marlin packed weight matrices w1_packed and w2_packed, return the MoE intermediate size N.
- moe_packed_to_marlin_zero_points: Convert compressed-tensors packed zero points to Marlin format.
marlin_moe_intermediate_size(w1_packed, w2_packed)
Given Marlin packed weight matrices w1_packed and w2_packed, return the MoE intermediate size N.
Source code in vllm/model_executor/layers/quantization/utils/marlin_utils.py
moe_packed_to_marlin_zero_points(q_zp_packed, size_k, size_n, num_bits, is_a_8bit=False)
Convert compressed-tensors packed zero points to Marlin format.
Unlike AWQ, compressed-tensors uses standard bit packing without interleaving, so the zero points only need to be unpacked and then put through the Marlin permutation directly.
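To make "standard bit packing without interleaving" concrete, here is a minimal NumPy sketch of the unpack step only. It is not the vLLM implementation: the helper names `pack_cols_sketch` and `unpack_cols_sketch` are hypothetical, the layout (values packed along columns into 32-bit words, lowest bits first) is an assumption made for illustration, and the Marlin permutation that follows unpacking is deliberately left out.

```python
import numpy as np


def pack_cols_sketch(values: np.ndarray, num_bits: int) -> np.ndarray:
    """Hypothetical helper: pack small unsigned ints along the last axis
    into uint32 words, lowest bits first, with no interleaving (assumed layout)."""
    pack_factor = 32 // num_bits
    rows, cols = values.shape
    assert cols % pack_factor == 0, "column count must be a multiple of pack_factor"
    grouped = values.astype(np.uint32).reshape(rows, cols // pack_factor, pack_factor)
    shifts = np.arange(pack_factor, dtype=np.uint32) * np.uint32(num_bits)
    # OR together the shifted fields of each group to form one 32-bit word.
    return np.bitwise_or.reduce(grouped << shifts, axis=-1)


def unpack_cols_sketch(packed: np.ndarray, num_bits: int) -> np.ndarray:
    """Hypothetical helper: inverse of pack_cols_sketch. Field i of each word
    becomes column i of its group, so no reordering is needed."""
    pack_factor = 32 // num_bits
    mask = np.uint32((1 << num_bits) - 1)
    shifts = np.arange(pack_factor, dtype=np.uint32) * np.uint32(num_bits)
    fields = (packed[..., None] >> shifts) & mask
    return fields.reshape(packed.shape[0], -1)


# Two rows of 4-bit zero points, eight values per 32-bit word.
zp = np.array(
    [[0, 1, 2, 3, 4, 5, 6, 7],
     [15, 14, 13, 12, 11, 10, 9, 8]],
    dtype=np.uint32,
)
packed = pack_cols_sketch(zp, num_bits=4)
restored = unpack_cols_sketch(packed, num_bits=4)
```

Because the fields sit in natural order inside each word, unpacking is a plain shift-and-mask; an interleaved layout such as AWQ's would need an extra reordering step before any further permutation could be applied.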