vllm.compilation.passes.fusion.sequence_parallelism ¶
Classes:
- SequenceParallelismPass – This pass enables sequence parallelism for models.
Functions:
- get_sequence_parallelism_threshold – Calculate the minimum token threshold for applying sequence parallelism.
SequenceParallelismPass ¶
Bases: VllmPatternMatcherPass
This pass enables sequence parallelism for models. It identifies patterns where an AllReduce operation is followed by an RMSNorm (or RMSNorm and then Quantization) operation. These patterns are replaced with a ReduceScatter operation, followed by a local RMSNorm/Quantization, and then an AllGather operation.
The general transformation is:
Input -> AllReduce -> RMSNorm -> Output
becomes
Input -> ReduceScatter -> RMSNorm -> AllGather -> Output
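The equivalence of the two pipelines can be checked with a small single-process simulation. This is only an illustrative sketch, not vLLM code: the collectives are modelled as plain Python functions over per-rank lists, and every helper name here (`rms_norm`, `all_reduce`, `reduce_scatter`, `all_gather`) is a hypothetical stand-in. It assumes the token dimension divides evenly by the TP size.

```python
import math

def rms_norm(rows, weight, eps=1e-6):
    # RMSNorm normalizes each token (row) independently over the hidden dim.
    out = []
    for row in rows:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v * inv_rms * w for v, w in zip(row, weight)])
    return out

def all_reduce(per_rank):
    # Every rank ends up with the element-wise sum over all ranks.
    tokens, hidden = len(per_rank[0]), len(per_rank[0][0])
    total = [[sum(t[i][j] for t in per_rank) for j in range(hidden)]
             for i in range(tokens)]
    return [total for _ in per_rank]

def reduce_scatter(per_rank):
    # Each rank keeps only its contiguous slice of tokens from the sum.
    total = all_reduce(per_rank)[0]
    chunk = len(total) // len(per_rank)
    return [total[r * chunk:(r + 1) * chunk] for r in range(len(per_rank))]

def all_gather(shards):
    # Every rank receives the concatenation of all token shards.
    full = [row for shard in shards for row in shard]
    return [full for _ in shards]

tp_size, num_tokens, hidden_size = 2, 4, 3
weight = [1.0, 0.5, 2.0]
# Deterministic per-rank partial results (e.g. outputs of a row-parallel GEMM).
inputs = [[[float(r + i * hidden_size + j + 1) for j in range(hidden_size)]
           for i in range(num_tokens)] for r in range(tp_size)]

# Original graph: AllReduce -> RMSNorm on every rank.
baseline = [rms_norm(t, weight) for t in all_reduce(inputs)]
# Rewritten graph: ReduceScatter -> local RMSNorm -> AllGather.
rewritten = all_gather([rms_norm(s, weight) for s in reduce_scatter(inputs)])

matches = all(
    math.isclose(a, b, rel_tol=1e-12)
    for rank_a, rank_b in zip(baseline, rewritten)
    for row_a, row_b in zip(rank_a, rank_b)
    for a, b in zip(row_a, row_b)
)
```

The rewrite is valid because RMSNorm acts on each token separately, so normalizing a token shard gives the same rows as normalizing the full tensor. Each rank normalizes only `num_tokens / tp_size` tokens between the two collectives.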
While this pass itself does not directly yield performance improvements, it lays the groundwork for subsequent fusion passes, such as GEMM + ReduceScatter and AllGather + GEMM fusions. These fusions can significantly reduce communication overhead and improve overall model performance.
This pass is only supported when compiling the whole graph (fullgraph mode, i.e. using Inductor graph partition or empty splitting_ops). Piecewise compilation is not supported because the residual tensor gets split across TP ranks, causing size mismatches at subgraph boundaries.
This pass splits up the residual tensor across TP ranks and hence divides its size. The pattern matcher starts at the end of the graph (last layer first), so when each replacement inserts a residual slice, the preceding layer has not been replaced yet and the slice is correct. Once the preceding layer IS replaced, its residual output shrinks and the slice becomes semantically incorrect (out-of-bounds indices for rank > 0). The graph is never executed in this intermediate state — NoOpEliminationPass removes these slices based on symbolic shape equality (input shape == output shape) before the graph is compiled.
Methods:
- is_applicable_for_range – Determines if sequence parallelism should be applied for the given compile range.
Source code in vllm/compilation/passes/fusion/sequence_parallelism.py
is_applicable_for_range(compile_range) ¶
Determines if sequence parallelism should be applied for the given compile range.
SP is only beneficial for larger batch sizes where the communication overhead is amortized. For small batches, the overhead of splitting and gathering tensors across TP ranks outweighs the benefits.
Returns False (SP disabled) when:
- min_token_num is None (SP disabled for this device/config)
- The compile range starts below the minimum token threshold
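The two conditions above can be sketched as a standalone function. This is a minimal illustration, not the method's actual body: `CompileRange` is a hypothetical stand-in for whatever range type vLLM passes in, and the assumption that it exposes an integer `start` attribute is mine.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CompileRange:
    # Hypothetical stand-in: token counts covered by one compiled graph.
    start: int
    end: int

def sp_applicable(min_token_num: Optional[int], compile_range: CompileRange) -> bool:
    # Condition 1: no threshold means SP is disabled for this device/config.
    if min_token_num is None:
        return False
    # Condition 2: the range must not start below the minimum token threshold.
    return compile_range.start >= min_token_num

small = CompileRange(start=1, end=2048)
large = CompileRange(start=4096, end=16384)
decisions = (
    sp_applicable(None, large),
    sp_applicable(4096, small),
    sp_applicable(4096, large),
)
```

Checking the start of the range, rather than the end, means a range is accepted only if every token count it covers meets the threshold.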
_SequenceParallelPatternHelper ¶
Helper for sequence parallelism patterns.
get_sequence_parallelism_threshold(hidden_size, tp_size, element_size) ¶
Calculate the minimum token threshold for applying sequence parallelism.
Returns None if sequence parallelism should not be applied based on model size.
Branching logic based on device capability:
- Check if hidden_size >= SP_MIN_HIDDEN_SIZE[device_capability]
- If not, returns None (SP disabled for small models on this device)
- If yes, calculates threshold based on per-GPU size:
min_token_num = (min_per_gpu_size_mb * tp_size * MiB) // (hidden_size * element_size)
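A worked instance of the formula helps show the scale of the threshold. The sketch below is not the library function: `sketch_sp_threshold` is a hypothetical name, and it takes the per-device values (`min_hidden_size`, `min_per_gpu_size_mb`) as explicit arguments because the real lookup tables are not shown here. The numbers used (8192, 8 MiB) are assumptions chosen for easy arithmetic, not vLLM defaults.

```python
from typing import Optional

MiB = 1024 * 1024

def sketch_sp_threshold(
    hidden_size: int,
    tp_size: int,
    element_size: int,
    min_hidden_size: int,        # assumed per-device minimum hidden size
    min_per_gpu_size_mb: int,    # assumed per-GPU activation size floor, in MiB
) -> Optional[int]:
    # Small models on this device: sequence parallelism is disabled.
    if hidden_size < min_hidden_size:
        return None
    # Tokens needed so each rank's shard reaches min_per_gpu_size_mb MiB.
    return (min_per_gpu_size_mb * tp_size * MiB) // (hidden_size * element_size)

# bf16 activations (2 bytes), hidden_size 8192, TP=8, assumed 8 MiB per GPU:
# (8 * 8 * 1048576) // (8192 * 2) = 67108864 // 16384 = 4096 tokens.
threshold = sketch_sp_threshold(8192, 8, 2, min_hidden_size=8192, min_per_gpu_size_mb=8)

# A model below the assumed hidden-size cutoff gets no threshold at all.
disabled = sketch_sp_threshold(4096, 8, 2, min_hidden_size=8192, min_per_gpu_size_mb=8)
```

The threshold rises with `tp_size` and falls as `hidden_size` or `element_size` grows, since wider or higher-precision activations reach the per-GPU size floor with fewer tokens.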