vllm.model_executor.warmup.deep_gemm_warmup ¶
Warm up deep_gemm kernels. DeepGEMM JIT-compiles its kernels, so the warmup aims to compile, ahead of time, every kernel that would be used during model execution.
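To make the motivation concrete, here is a minimal, self-contained sketch of the warmup idea. It does not use DeepGEMM or vLLM; `ToyJitGemm`, its `run` method, the block-size rule, and `warmup` are all hypothetical stand-ins. It assumes only that a JIT library compiles a kernel the first time a new configuration is requested and reuses it afterwards.

```python
class ToyJitGemm:
    """Hypothetical stand-in for a JIT-compiling GEMM library (not DeepGEMM's API)."""

    def __init__(self):
        self.cache = {}        # kernel configuration -> "compiled" kernel
        self.compilations = 0  # how many times we paid the compile cost

    def _config_for(self, m, n, k):
        # Illustrative assumption: the kernel configuration depends on M only
        # through a block size picked from M.
        block_m = 64 if m <= 64 else 128
        return (block_m, n, k)

    def run(self, m, n, k):
        key = self._config_for(m, n, k)
        if key not in self.cache:
            # First use of this configuration: simulate the JIT compile.
            self.compilations += 1
            self.cache[key] = f"kernel{key}"
        return self.cache[key]


def warmup(gemm, m_values, n, k):
    """Touch every M value once so later calls only hit the cache."""
    for m in m_values:
        gemm.run(m, n, k)


gemm = ToyJitGemm()
warmup(gemm, [1, 64, 65, 4096], n=512, k=1024)
compiled_during_warmup = gemm.compilations

# A later "serving" call with an M value covered by the warmup compiles nothing new.
gemm.run(100, 512, 1024)
compiled_after_serving = gemm.compilations
```

In this toy model the cost of compilation is paid entirely inside `warmup`, which is the effect the module aims for with the real kernels.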
_extract_data_from_fused_moe_module(m_) ¶
Extract weights, weight scales and num_topk from a FusedMoE module.
Source code in vllm/model_executor/warmup/deep_gemm_warmup.py
_extract_data_from_linear_base_module(m) ¶
Extract weights, weight scales and quantization block sizes from the given LinearBase module.
_fp8_linear_may_use_deep_gemm(module) ¶
Return True if the input module/layer could be processed with DeepGEMM.
_generate_optimal_warmup_m_values(max_tokens, n, device) ¶
Generate M values that cover all possible DeepGEMM kernel configurations. Reference: https://github.com/deepseek-ai/DeepGEMM/blob/79f48ee15a82dd5fad5cd9beaa393c1f755e6b55/csrc/jit_kernels/heuristics/common.hpp
Parameters:
- max_tokens (int) – Maximum number of tokens to warm up for.
- n (int) – The actual N dimension from the weight tensor.
- device (device) – The torch device to get properties from.
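As a rough illustration of what "covering kernel configurations" can mean, the sketch below picks M values at tile boundaries and near wave boundaries. It is not the vLLM implementation: the function name `sketch_warmup_m_values`, the candidate block sizes, the `max_waves` limit, and the small-M list are all assumptions made for the example, and it takes the SM count as a plain integer (`num_sms`) instead of a torch device so that it runs without a GPU.

```python
def ceil_div(a, b):
    return (a + b - 1) // b


def sketch_warmup_m_values(max_tokens, n, num_sms,
                           block_ms=(64, 128, 256), max_waves=4):
    """Hypothetical sketch: choose M values likely to hit distinct tilings.

    block_ms, the block_n step of 16, and max_waves are illustrative
    assumptions, not values taken from DeepGEMM or vLLM.
    """
    # Small batch sizes, plus the largest M we expect to see.
    m_values = {m for m in (1, 2, 4, 8, 16, 32) if m <= max_tokens}
    m_values.add(max_tokens)

    block_ns = range(16, min(256, n) + 1, 16)
    for block_m in block_ms:
        # Tile boundaries: the number of M tiles changes at each multiple.
        m_values.update(range(block_m, max_tokens + 1, block_m))
        for block_n in block_ns:
            n_tiles = ceil_div(n, block_n)
            for wave in range(1, max_waves + 1):
                # Approximate M at which the total tile count reaches
                # `wave` full waves across all SMs.
                m = wave * num_sms * block_m // n_tiles
                if 1 <= m <= max_tokens:
                    m_values.add(m)
    return sorted(m_values)


vals = sketch_warmup_m_values(max_tokens=1024, n=4096, num_sms=132)
```

The result is a sorted list of unique M values in the range 1 to `max_tokens`. In a real setting the SM count would come from the device properties, which is presumably why the documented function takes a `device` argument.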
_get_fp8_gemm_nt_m_values(w, max_tokens) ¶
Get the M values to warm up for a given weight tensor.