vllm.model_executor.layers.fused_moe.utils
Functions:
- count_expert_num_tokens – Count the number of tokens assigned to each expert.
_fp8_quantize(A, A_scale, per_act_token, block_shape=None)
Perform fp8 quantization on the inputs. If a block_shape is provided, the output will be blocked.
Source code in vllm/model_executor/layers/fused_moe/utils.py
_int8_quantize(A, A_scale, per_act_token, block_shape=None)
Perform int8 quantization on the inputs. If a block_shape is provided, the output will be blocked.
Source code in vllm/model_executor/layers/fused_moe/utils.py
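The difference between per-token and blocked quantization described for `_fp8_quantize` and `_int8_quantize` can be illustrated with a small pure-Python sketch. It is not vLLM's implementation, which operates on `torch.Tensor` inputs. The helper name `quantize_int8_groups` is hypothetical, and the sketch assumes that "blocked" means one scale per contiguous group of elements along the hidden dimension, using symmetric absmax scaling.

```python
def quantize_int8_groups(
    row: list[float], group_size: int
) -> tuple[list[int], list[float]]:
    """Hypothetical reference: symmetric int8 quantization of one token's
    activations, with one scale per contiguous group of `group_size` values."""
    assert group_size > 0 and len(row) % group_size == 0
    quantized: list[int] = []
    scales: list[float] = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        absmax = max(abs(x) for x in group)
        # Avoid dividing by zero for an all-zero group.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer and clamp to the int8 range.
        quantized.extend(max(-128, min(127, round(x / scale))) for x in group)
    return quantized, scales


row = [0.0, -1.0, 4.0, 4.0]
# One scale for the whole token (per-token quantization).
per_token_q, per_token_scales = quantize_int8_groups(row, group_size=len(row))
# One scale per group of two values (blocked quantization).
blocked_q, blocked_scales = quantize_int8_groups(row, group_size=2)
```

With blocked scaling, the small values in the first group get their own scale and therefore use the full int8 range, which is the usual motivation for quantizing in blocks.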
_resize_cache(x, v)
Shrink the given tensor and apply the given view to it. This is used to resize the intermediate fused_moe caches.
Source code in vllm/model_executor/layers/fused_moe/utils.py
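The "shrink, then view" idea can be sketched with the standard library alone. This is only an analogy to the tensor operation: the helper name `resize_cache_ref` is hypothetical, and the sketch assumes that shrinking keeps the leading elements of the flattened buffer and that the requested shape must fit inside it. A `memoryview` stands in for a tensor view, so no data is copied.

```python
from array import array
from math import prod


def resize_cache_ref(buf: array, shape: tuple[int, ...]) -> memoryview:
    """Hypothetical stdlib analogue: reinterpret the first prod(shape)
    elements of a flat buffer as an array of the given shape."""
    n = prod(shape)
    assert n <= len(buf), "requested view is larger than the cache"
    flat_bytes = memoryview(buf).cast("B")          # flatten to raw bytes
    shrunk = flat_bytes[: n * buf.itemsize]         # keep the leading elements
    return shrunk.cast(buf.typecode, shape=list(shape))  # apply the new shape


cache = array("f", [float(i) for i in range(12)])  # preallocated workspace
view = resize_cache_ref(cache, (2, 3))              # uses only 6 of 12 slots
rows_before = view.tolist()

cache[0] = 9.0                                      # write through the buffer
rows_after = view.tolist()                          # the view sees the change
```

Reusing one large preallocated buffer this way avoids allocating a new intermediate for every batch size.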
count_expert_num_tokens(topk_ids, num_local_experts, expert_map)
Count the number of tokens assigned to each expert.
Parameters:
- topk_ids (torch.Tensor): Tensor mapping each token to its list of experts.
- num_local_experts (int): Number of experts on this rank.
- expert_map (Optional[torch.Tensor]): A tensor mapping expert indices from the global expert space to the local expert space of the expert parallel shard.
Returns: A tensor of size num_local_experts, where tensor[i] holds the number of tokens assigned to the ith expert.
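The counting semantics can be sketched in plain Python over nested lists. This is a reference illustration rather than the library's implementation, and the name `count_expert_num_tokens_ref` is hypothetical. It assumes that `expert_map` is indexed by global expert id and holds `-1` for experts that do not live on this rank, and that negative entries in `topk_ids` are skipped.

```python
from typing import Optional, Sequence


def count_expert_num_tokens_ref(
    topk_ids: Sequence[Sequence[int]],
    num_local_experts: int,
    expert_map: Optional[Sequence[int]] = None,
) -> list[int]:
    """Hypothetical reference: count how many token slots land on each
    local expert."""
    counts = [0] * num_local_experts
    for token_experts in topk_ids:
        for expert_id in token_experts:
            if expert_id < 0:
                continue  # assumed: negative ids mark unused slots
            if expert_map is not None:
                # Translate global expert id to local id (-1 if not local).
                expert_id = expert_map[expert_id]
            if 0 <= expert_id < num_local_experts:
                counts[expert_id] += 1
    return counts


# Three tokens, each routed to two of four global experts.
topk_ids = [[0, 2], [2, 3], [1, 3]]

# No expert parallelism: all four experts are local.
all_counts = count_expert_num_tokens_ref(topk_ids, 4)

# This rank owns global experts 2 and 3 as local experts 0 and 1.
local_counts = count_expert_num_tokens_ref(topk_ids, 2, [-1, -1, 0, 1])
```

Under these assumptions, assignments to experts hosted on other ranks are simply not counted, so the local counts can sum to less than the total number of token-to-expert assignments.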