vllm.utils.flashinfer ¶
Compatibility wrapper for FlashInfer API changes.
Users of vLLM should always import only these wrappers.
Functions:
- can_use_trtllm_attention – Check if the current configuration supports TRTLLM attention.
- has_flashinfer – Return True if flashinfer-python package is available.
- has_flashinfer_b12x_gemm – Return True if FlashInfer b12x FP4 GEMM backend is available (SM120+).
- has_flashinfer_b12x_moe – Return True if FlashInfer CuteDSL SM12x fused MoE is available.
- has_flashinfer_comm – Return True if FlashInfer comm module is available.
- has_flashinfer_cutedsl_grouped_gemm_nt_masked – Return True if FlashInfer CuteDSL grouped_gemm_nt_masked is available.
- has_flashinfer_cutedsl_moe_nvfp4 – Return True if FlashInfer cute_dsl_fused_moe_nvfp4 is available.
- has_flashinfer_cutlass_fused_moe – Return True if FlashInfer CUTLASS fused MoE is available.
- has_flashinfer_fp8_blockscale_gemm – Return True if FlashInfer block-scale FP8 GEMM is available.
- has_flashinfer_moe – Return True if FlashInfer MoE module is available.
- has_flashinfer_nvlink_one_sided – Return True if FlashInfer trtllm_moe_alltoall module is available.
- has_flashinfer_nvlink_two_sided – Return True if FlashInfer mnnvl all2all is available.
- has_nvidia_artifactory – Return True if NVIDIA's artifactory is accessible.
- is_flashinfer_cudnn_fp8_prefill_attn_supported – Check if FP8 ViT attention is supported on this platform.
- is_flashinfer_fp8_blockscale_gemm_supported – Return True if FlashInfer block-scale FP8 GEMM is supported.
- supports_trtllm_attention – TRTLLM attention is supported if the platform is SM100, NVIDIA artifactory is accessible, and batch-invariant mode is not enabled.
- use_trtllm_attention – Return True if TRTLLM attention is used.
_flashinfer_concat_mla_k(k, k_nope, k_pe) ¶
Custom op wrapper for flashinfer's concat_mla_k.
This is an in-place operation that concatenates k_nope and k_pe into k.
The kernel is optimized for DeepSeek V3 dimensions:
- num_heads=128
- nope_dim=128
- rope_dim=64
Key optimizations:
- Warp-based processing with software pipelining
- Vectorized memory access (int2 for nope, int for rope)
- L2 prefetching for next row while processing current
- Register reuse for rope values across all heads
Parameters:
- k (Tensor) – Output tensor, shape [num_tokens, num_heads, nope_dim + rope_dim]. Modified in-place.
- k_nope (Tensor) – The nope part of k, shape [num_tokens, num_heads, nope_dim].
- k_pe (Tensor) – The rope part of k (shared), shape [num_tokens, 1, rope_dim]. This is broadcast to all heads.
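The in-place semantics described by these parameters can be illustrated with a small pure-Python reference. This is only a sketch under stated assumptions: nested lists stand in for tensors, `concat_mla_k_reference` is a hypothetical name, and none of the kernel's optimizations are modelled.

```python
def concat_mla_k_reference(k, k_nope, k_pe):
    """Fill k in place so that k[t][h] == k_nope[t][h] + k_pe[t][0].

    Nested lists stand in for tensors with shapes
    k:      [num_tokens, num_heads, nope_dim + rope_dim]
    k_nope: [num_tokens, num_heads, nope_dim]
    k_pe:   [num_tokens, 1, rope_dim]
    """
    for token_out, token_nope, token_pe in zip(k, k_nope, k_pe):
        shared_rope = token_pe[0]  # one rope row per token, shared by all heads
        for head_out, head_nope in zip(token_out, token_nope):
            # Slice assignment mutates the existing row instead of rebinding it.
            head_out[:] = list(head_nope) + list(shared_rope)
    return k


# Tiny illustrative shapes (the documented kernel targets 128 heads, 128 + 64 dims).
num_tokens, num_heads, nope_dim, rope_dim = 2, 3, 4, 2
k_nope = [[[float(t * 100 + h * 10 + d) for d in range(nope_dim)]
           for h in range(num_heads)] for t in range(num_tokens)]
k_pe = [[[-1.0 - t] * rope_dim] for t in range(num_tokens)]
k = [[[0.0] * (nope_dim + rope_dim) for _ in range(num_heads)]
     for _ in range(num_tokens)]

concat_mla_k_reference(k, k_nope, k_pe)
```

After the call, every head of a token carries its own nope values followed by the same rope values, which is the broadcast behaviour described for `k_pe`.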
Source code in vllm/utils/flashinfer.py
_get_submodule(module_name) ¶
Safely import a submodule and return it, or None if not available.
_lazy_import_wrapper(module_name, attr_name, fallback_fn=_missing) ¶
Create a lazy import wrapper for a specific function.
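A lazy import wrapper of this kind can be sketched with the standard library alone. The names below (`missing`, `get_submodule`, `lazy_import_wrapper`) are simplified stand-ins chosen for illustration, not the module's actual implementation, and the demo resolves `math.sqrt` rather than a FlashInfer function.

```python
import importlib
from typing import Any, Callable, Optional


def missing(*_: Any, **__: Any) -> Any:
    """Fallback that fails loudly when the optional backend is absent."""
    raise RuntimeError("Optional backend is not available.")


def get_submodule(module_name: str) -> Optional[Any]:
    """Import a module by name, returning None instead of raising."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # also covers ModuleNotFoundError
        return None


def lazy_import_wrapper(module_name: str, attr_name: str,
                        fallback_fn: Callable[..., Any] = missing
                        ) -> Callable[..., Any]:
    """Return a callable that resolves module_name.attr_name on first use."""
    resolved: list = []  # holds the resolved callable after the first call

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if not resolved:
            module = get_submodule(module_name)
            impl = getattr(module, attr_name, None) if module is not None else None
            resolved.append(impl if impl is not None else fallback_fn)
        return resolved[0](*args, **kwargs)

    return wrapper


lazy_sqrt = lazy_import_wrapper("math", "sqrt")
lazy_absent = lazy_import_wrapper("no_such_module_for_demo", "run")

print(lazy_sqrt(16.0))  # 4.0
try:
    lazy_absent()
except RuntimeError as exc:
    print("fallback raised:", exc)
```

Deferring the import to the first call means that merely importing the wrapper module never fails when the optional dependency is absent; the error surfaces only if the missing function is actually invoked.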
_missing(*_, **__) ¶
Placeholder for unavailable FlashInfer backend.
can_use_trtllm_attention(num_qo_heads, num_kv_heads) ¶
Check if the current configuration supports TRTLLM attention.
flashinfer_mm_mxfp8(a, b, block_scale_a, block_scale_b, out_dtype, backend='cutlass') ¶
MXFP8 MM helper - mirrors flashinfer_scaled_fp4_mm API.
Takes non-transposed weights and handles transpose internally.
CRITICAL: mm_mxfp8 CUTLASS kernel requires SWIZZLED 1D scales for optimal performance and accuracy. Both input and weight scales should be in swizzled format from FlashInfer's mxfp8_quantize(is_sf_swizzled_layout=True).
force_use_trtllm_attention() ¶
This function should only be called during the initialization stage, once the vLLM config is set. It returns None if --attention-config.use_trtllm_attention is not set, True if TRTLLM attention is forced on, and False if TRTLLM attention is forced off.
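Because the return value is tri-state, callers need to distinguish None from False. The sketch below shows one way to do that; `resolve_trtllm_choice` and the idea of falling back to an automatic decision when the flag is unset are assumptions made for illustration, not the module's documented behaviour.

```python
from typing import Optional


def resolve_trtllm_choice(forced: Optional[bool], auto_choice: bool) -> bool:
    """Combine a tri-state override with an automatic decision.

    forced is None  -> no override, use auto_choice
    forced is True  -> always use TRTLLM attention
    forced is False -> never use TRTLLM attention
    """
    if forced is None:  # compare with `is None`; `not forced` would also match False
        return auto_choice
    return forced


print(resolve_trtllm_choice(None, auto_choice=True))    # True
print(resolve_trtllm_choice(False, auto_choice=True))   # False
print(resolve_trtllm_choice(True, auto_choice=False))   # True
```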
has_flashinfer() cached ¶
Return True if flashinfer-python package is available.
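A cached availability check like this one can be approximated with `functools.cache` and `importlib.util.find_spec`. This is a minimal sketch, assuming that availability simply means the top-level package can be located; `has_package` is a hypothetical helper and the real check may test more than this.

```python
import functools
import importlib.util


@functools.cache
def has_package(name: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    # find_spec returns None for a missing top-level name; dotted names with a
    # missing parent would raise instead, so only top-level names are used here.
    return importlib.util.find_spec(name) is not None


print(has_package("json"))                  # stdlib module, found
print(has_package("json"))                  # second call is served from the cache
print(has_package("no_such_package_demo"))  # not installed
```

Caching matters because the answer cannot change during the life of the process, and these probes tend to be called from hot dispatch paths.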
has_flashinfer_b12x_gemm() cached ¶
Return True if FlashInfer b12x FP4 GEMM backend is available (SM120+).
has_flashinfer_b12x_moe() cached ¶
Return True if FlashInfer CuteDSL SM12x fused MoE is available.
has_flashinfer_comm() cached ¶
Return True if FlashInfer comm module is available.
has_flashinfer_cubin() cached ¶
Return True if flashinfer-cubin package is available.
has_flashinfer_cutedsl() cached ¶
Return True if FlashInfer cutedsl module is available.
has_flashinfer_cutedsl_grouped_gemm_nt_masked() cached ¶
Return True if FlashInfer CuteDSL grouped_gemm_nt_masked is available.
has_flashinfer_cutedsl_moe_nvfp4() cached ¶
Return True if FlashInfer cute_dsl_fused_moe_nvfp4 is available.
has_flashinfer_cutlass_fused_moe() cached ¶
Return True if FlashInfer CUTLASS fused MoE is available.
has_flashinfer_fp8_blockscale_gemm() cached ¶
Return True if FlashInfer block-scale FP8 GEMM is available.
has_flashinfer_moe() cached ¶
Return True if FlashInfer MoE module is available.
has_flashinfer_nvlink_one_sided() cached ¶
Return True if FlashInfer trtllm_moe_alltoall module is available.
has_flashinfer_nvlink_two_sided() cached ¶
Return True if FlashInfer mnnvl all2all is available.
has_flashinfer_trtllm_fused_moe() cached ¶
Return True if FlashInfer TRTLLM fused MoE is available.
has_nvidia_artifactory() cached ¶
Return True if NVIDIA's artifactory is accessible.
This checks connectivity to the kernel inference library artifactory, which is required for downloading certain cubin kernels such as TRTLLM FMHA.
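A connectivity probe of this kind might look like the sketch below. It is an illustration only: `is_url_reachable`, the timeout, the injectable `opener` parameter, and the placeholder URL are all assumptions, and the actual artifactory address is deliberately not shown. The demo runs offline by passing a stub opener that always fails.

```python
import urllib.request
from typing import Any, Callable


def is_url_reachable(url: str, timeout: float = 5.0,
                     opener: Callable[..., Any] = urllib.request.urlopen) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # URLError and HTTPError derive from OSError; ValueError covers bad URLs.
        return False


def failing_opener(url: str, timeout: float) -> Any:
    """Stub that simulates an unreachable host without touching the network."""
    raise OSError("simulated network failure")


print(is_url_reachable("https://example.invalid/", opener=failing_opener))  # False
```

Treating every network error as "not accessible" keeps the probe safe to call in air-gapped environments, where the caller can then fall back to kernels that need no download.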
is_flashinfer_cudnn_fp8_prefill_attn_supported() cached ¶
Check if FP8 ViT attention is supported on this platform.
Requires native FP8 hardware support, the FlashInfer cuDNN backend, and cuDNN >= 9.17.1.
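The minimum-version requirement is the kind of check that is easy to get wrong with string comparison. The sketch below compares versions as integer tuples; it assumes the cuDNN version is available as a dotted string, and `parse_version` and `meets_min_cudnn` are hypothetical helpers rather than functions from this module.

```python
from typing import Tuple

MIN_CUDNN: Tuple[int, int, int] = (9, 17, 1)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '9.17.1' into (9, 17, 1) so components compare numerically."""
    return tuple(int(part) for part in text.split("."))


def meets_min_cudnn(version_text: str) -> bool:
    """Return True if the dotted version is at least 9.17.1."""
    return parse_version(version_text) >= MIN_CUDNN


print(meets_min_cudnn("9.17.1"))  # True
print(meets_min_cudnn("9.9.0"))   # False
print("9.9.0" >= "9.17.1")        # True: plain string comparison gets this wrong
```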
is_flashinfer_fp8_blockscale_gemm_supported() cached ¶
Return True if FlashInfer block-scale FP8 GEMM is supported.
supports_trtllm_attention() cached ¶
TRTLLM attention is supported if the platform is SM100, NVIDIA artifactory is accessible, and batch-invariant mode is not enabled.
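The three conditions combine as a plain conjunction, which can be written as a pure function for clarity. This sketch assumes that "SM100" corresponds to a compute capability of (10, 0) and that the other two conditions are already known as booleans; `supports_trtllm_sketch` is a hypothetical name and the real check may query the platform differently.

```python
from typing import Tuple


def supports_trtllm_sketch(capability: Tuple[int, int],
                           artifactory_ok: bool,
                           batch_invariant: bool) -> bool:
    """All three documented conditions must hold at once."""
    is_sm100 = capability == (10, 0)  # assumption: SM100 means capability 10.0
    return is_sm100 and artifactory_ok and not batch_invariant


print(supports_trtllm_sketch((10, 0), artifactory_ok=True, batch_invariant=False))  # True
print(supports_trtllm_sketch((9, 0), artifactory_ok=True, batch_invariant=False))   # False
print(supports_trtllm_sketch((10, 0), artifactory_ok=True, batch_invariant=True))   # False
```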
use_trtllm_attention(num_qo_heads, num_kv_heads, num_tokens, max_seq_len, dcp_world_size, kv_cache_dtype, q_dtype, is_prefill, force_use_trtllm=None, has_sinks=False, has_spec=False) ¶
Return True if TRTLLM attention is used.