vllm.model_executor.layers.fused_moe.router.gate_linear
Classes:
- GateLinear – MoE gate linear layer with multi-tier GEMM dispatch.
Functions:
- fp32_router_gemm_dispatch_impl – Dynamically run fp32 specialized gemm if num_tokens <= FP32_MAX_TOKENS, otherwise fall back to F.linear.
GateLinear
Bases: ReplicatedLinear
MoE gate linear layer with multi-tier GEMM dispatch:
- DSV3 specialized kernel (SM90+, fp32 out, M<=16, H=7168, E=256/384)
- fp32 specialized kernel (SM90+, bf16/fp32 in, fp32 out, M<=32, H=3072, E=256)
- cuBLAS bf16×bf16→fp32 (SM90+, bf16 weight, fp32 out_dtype)
- F.linear via ReplicatedLinear (ultimate fallback)
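The tier list above can be read as a chain of guards. The following is a minimal sketch of that reading, not the vLLM implementation: `GateShape`, `select_tier`, and the returned tier names are hypothetical, dtypes are plain strings, and the top-to-bottom evaluation order is an assumption inferred from the list. The list does not state an input dtype for the DSV3 kernel, so the sketch does not check one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateShape:
    """Hypothetical bundle of the facts the tier list conditions on."""
    sm: int            # CUDA compute capability, e.g. 90 for SM90
    in_dtype: str      # activation dtype, "bf16" or "fp32"
    weight_dtype: str  # gate weight dtype
    out_dtype: str     # requested router-logit dtype
    num_tokens: int    # M
    hidden_size: int   # H
    num_experts: int   # E


def select_tier(g: GateShape) -> str:
    """Return the first tier whose listed conditions all hold (assumed order)."""
    # Every specialized tier in the list requires SM90+ and fp32 output.
    hopper_fp32_out = g.sm >= 90 and g.out_dtype == "fp32"
    if (hopper_fp32_out and g.num_tokens <= 16
            and g.hidden_size == 7168 and g.num_experts in (256, 384)):
        return "dsv3"
    if (hopper_fp32_out and g.in_dtype in ("bf16", "fp32")
            and g.num_tokens <= 32 and g.hidden_size == 3072
            and g.num_experts == 256):
        return "fp32_specialized"
    if hopper_fp32_out and g.weight_dtype == "bf16":
        return "cublas_bf16_fp32"
    # Nothing specialized applies: plain F.linear via the base class.
    return "f_linear"
```

Under these assumptions, a decode-sized batch on SM90 with H=7168 and E=256 selects the DSV3 tier, while the same shapes on an older GPU fall through to `F.linear`.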
The out_dtype attribute is mutable and can be set after init (e.g. when the required dtype depends on the expert quantization method, which is only known later).
Methods:
- set_out_dtype – Set output dtype for the router logits after init.
Source code in vllm/model_executor/layers/fused_moe/router/gate_linear.py
set_out_dtype(out_dtype)
Set output dtype for the router logits after init.
Useful when the required dtype depends on the expert quantization method, which is only known after the gate is constructed.
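The construct-first, set-dtype-later pattern can be sketched with a toy stand-in. Everything here is hypothetical and only illustrates the call order: `ToyGate` is not `GateLinear`, dtypes are plain strings rather than `torch.dtype` values, and `pick_router_dtype` with its `"needs_fp32_logits"` label is an invented policy, not vLLM's.

```python
class ToyGate:
    """Hypothetical stand-in for a gate layer; dtypes are plain strings."""

    def __init__(self, out_dtype=None):
        # None means "no explicit output dtype requested yet" (assumption).
        self.out_dtype = out_dtype

    def set_out_dtype(self, out_dtype):
        # Same role as the documented setter: update the dtype after init.
        self.out_dtype = out_dtype


def pick_router_dtype(quant_method):
    # Invented policy for illustration: some expert quantization methods
    # are assumed to want fp32 router logits, others keep bf16.
    return "fp32" if quant_method == "needs_fp32_logits" else "bf16"


# 1) The gate is built before the expert quantization method is known.
gate = ToyGate()
# 2) Later, once the experts are configured, the dtype is filled in.
gate.set_out_dtype(pick_router_dtype("needs_fp32_logits"))
```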
Source code in vllm/model_executor/layers/fused_moe/router/gate_linear.py
fp32_router_gemm_dispatch_impl(x, weight)
Dynamically run the fp32 specialized GEMM if num_tokens <= FP32_MAX_TOKENS; otherwise fall back to F.linear. This function must be wrapped in a custom op because our torch.compile integration does not support runtime dispatching on num_tokens.
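The runtime branch described above can be sketched in plain Python. This is an illustration under stated assumptions, not the vLLM function: the value `32` for `FP32_MAX_TOKENS` is assumed from the "M<=32" condition in the GateLinear tier list, `matmul_xwT` and `router_gemm_dispatch` are hypothetical names, tensors are nested lists, and both paths use the same reference matmul so that only the dispatch decision differs. No custom op is registered here; the sketch only shows the token-count branch that the real code hides from torch.compile.

```python
FP32_MAX_TOKENS = 32  # assumed value, inferred from the "M<=32" tier condition


def matmul_xwT(x, weight):
    """Reference x @ weight.T on nested lists (rows of x are tokens)."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in weight]
            for row in x]


def router_gemm_dispatch(x, weight,
                         specialized=matmul_xwT, fallback=matmul_xwT):
    """Pick a path from the runtime token count and return (path, logits)."""
    num_tokens = len(x)
    if num_tokens <= FP32_MAX_TOKENS:
        # Small batch: stands in for the fp32 specialized GEMM.
        return "specialized", specialized(x, weight)
    # Large batch: stands in for F.linear.
    return "fallback", fallback(x, weight)


# One token, hidden size 2, three experts.
path, logits = router_gemm_dispatch([[1.0, 2.0]],
                                    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because the branch depends on `len(x)` at call time, a tracer that specializes on shapes would capture only one side of it, which is the situation the docstring's custom-op wrapper is meant to avoid.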