vllm.distributed.eplb.eplb_utils ¶
Utility functions for EPLB (Expert Parallel Load Balancing).
Classes:
- CpuGpuEvent – Combines a CUDA event with a CPU threading event to enforce record->wait ordering across two threads.
Functions:
- override_envs_for_eplb – Override environment variables for EPLB when specific conditions are met.
CpuGpuEvent ¶
Combines a CUDA event with a CPU threading event to enforce record->wait ordering across two threads.
This class is designed for exactly two threads: one producer that calls record() and one consumer that calls wait(). Using it with more than two threads is not supported and will produce undefined behavior.
CUDA events alone are insufficient for cross-thread synchronization because waiting on an unrecorded CUDA event is a no-op: the wait returns immediately instead of blocking. This class adds a threading.Event so that the waiting thread blocks on the CPU side until record() has been called. At that point the CUDA event is guaranteed to be in flight, so event.wait() correctly synchronizes the GPU stream.
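The ordering described above can be illustrated with a minimal, CPU-only sketch. It is not the vLLM implementation: `SketchCpuGpuEvent` and `FakeCudaEvent` are hypothetical names introduced here, and the fake event only logs calls in place of a real `torch.cuda.Event` so the example runs without a GPU. The sketch is also single-use; how the real class handles reuse is not covered by this page.

```python
import threading


class FakeCudaEvent:
    """Hypothetical stand-in for a CUDA event; it only logs the calls it receives."""

    def __init__(self, log):
        self.log = log

    def record(self, stream=None):
        self.log.append("record")

    def wait(self, stream=None):
        self.log.append("wait")


class SketchCpuGpuEvent:
    """Illustrative pairing of a GPU-style event with a threading.Event."""

    def __init__(self, event):
        self._event = event
        self._recorded = threading.Event()

    def record(self, stream=None):
        # Producer side: record the GPU event first, then release the consumer.
        self._event.record(stream)
        self._recorded.set()

    def wait(self, stream=None):
        # Consumer side: block on the CPU until record() has run,
        # and only then wait on the GPU event.
        self._recorded.wait()
        self._event.wait(stream)


def run_demo():
    log = []
    ev = SketchCpuGpuEvent(FakeCudaEvent(log))
    consumer = threading.Thread(target=ev.wait)
    consumer.start()  # started first; blocks until record() is called
    ev.record()       # producer thread (here, the main thread)
    consumer.join()
    return log


order = run_demo()
```

Even though the consumer thread starts before the producer calls record(), the CPU-side event forces the underlying wait to happen only after the underlying record.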
Methods:
- record – Unblocks the waiting thread after calling event.record().
- wait – Blocks the calling thread until record finishes. Used to guarantee that the record kernel is called before wait.
Source code in vllm/distributed/eplb/eplb_utils.py
record(stream=None) ¶
Unblocks the waiting thread after calling event.record().
Should only be called by the main thread.
wait(stream=None) ¶
Blocks the calling thread until record finishes. Used to guarantee that the record kernel is called before wait.
Should only be called by the async EPLB thread.
override_envs_for_eplb(parallel_config, moe_backend=None) ¶
Override environment variables for EPLB when specific conditions are met.
Parameters:
- parallel_config (ParallelConfig) – The parallel configuration object.
- moe_backend (str | None, default: None) – The configured MoE backend (e.g. deep_gemm_mega_moe).