vllm.config.kv_transfer
Classes:
- KVTransferConfig – Configuration for distributed KV cache transfer.
KVTransferConfig
Configuration for distributed KV cache transfer.
Methods:
- compute_hash – Provide a hash that identifies all the configs that affect the structure of the computation graph.
Attributes:
- enable_permute_local_kv (bool) – Experimental feature flag to enable HND-to-NHD KV transfer.
- engine_id (str | None) – The engine ID for KV transfers.
- kv_buffer_device (str) – The device used by the KV connector to buffer the KV cache. Choices are 'cuda', 'cpu', and 'xpu'.
- kv_buffer_size (float) – The buffer size for TorchDistributedConnector, measured in bytes.
- kv_connector (str | None) – The KV connector for vLLM to transmit KV caches between vLLM instances.
- kv_connector_extra_config (dict[str, Any]) – Any extra configuration the connector may need.
- kv_connector_module_path (str | None) – The Python module path to dynamically load the KV connector from.
- kv_ip (str) – The KV connector IP address, used to build the distributed connection.
- kv_load_failure_policy (Literal['recompute', 'fail']) – Policy for handling KV cache load failures.
- kv_parallel_size (int) – The number of parallel instances for KV cache transfer.
- kv_port (int) – The KV connector port, used to build the distributed connection.
- kv_rank (int | None) – The rank of this vLLM instance in the KV cache transfer. Typical values: 0 for the prefill instance, 1 for the decode instance.
- kv_role (KVRole | None) – Whether this vLLM instance produces KV cache, consumes it, or both. Choices are 'kv_producer', 'kv_consumer', and 'kv_both'.
Source code in vllm/config/kv_transfer.py
enable_permute_local_kv = False class-attribute instance-attribute
Experimental feature flag to enable HND-to-NHD KV transfer.
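The source does not define HND or NHD. As a rough illustration, the sketch below assumes HND means a (heads, tokens, head_dim) layout and NHD means a (tokens, heads, head_dim) layout, and shows the axis permutation on nested Python lists. The function name `hnd_to_nhd` is hypothetical and is not part of vLLM.

```python
def hnd_to_nhd(hnd: list[list[list[int]]]) -> list[list[list[int]]]:
    """Swap the first two axes: (heads, tokens, dim) -> (tokens, heads, dim).

    Assumption: this is what "HND to NHD" refers to; the source does not say.
    """
    num_heads = len(hnd)
    num_tokens = len(hnd[0])
    return [[hnd[h][n] for h in range(num_heads)] for n in range(num_tokens)]


# Toy tensor with 2 heads, 3 tokens, head_dim 2.
# Each value encodes its own (head, token, dim) index for easy checking.
H, N, D = 2, 3, 2
hnd = [[[100 * h + 10 * n + d for d in range(D)] for n in range(N)] for h in range(H)]

nhd = hnd_to_nhd(hnd)
print(len(nhd), len(nhd[0]), len(nhd[0][0]))
```

Every element keeps its value; only the indexing order changes, so `nhd[n][h][d]` equals `hnd[h][n][d]`.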
engine_id = None class-attribute instance-attribute
The engine id for KV transfers.
kv_buffer_device = field(default_factory=kv_buffer_device_default_factory) class-attribute instance-attribute
The device used by the KV connector to buffer the KV cache. Choices are 'cuda', 'cpu', and 'xpu'.
kv_buffer_size = 1000000000.0 class-attribute instance-attribute
The buffer size for TorchDistributedConnector, measured in bytes. Recommended value: 1e9 (about 1 GB).
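To get a feel for what a 1e9-byte buffer holds, the sketch below estimates how many tokens' worth of KV cache would fit. The model dimensions (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are assumptions chosen for illustration, and the calculation ignores any connector overhead.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int
) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 accounts for storing both a key and a value.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Hypothetical model shape, not taken from the source.
per_token = kv_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2
)

kv_buffer_size = 1e9  # the documented default, in bytes
tokens_in_buffer = int(kv_buffer_size // per_token)

print(per_token)         # 131072
print(tokens_in_buffer)  # 7629
```

Under these assumptions the default buffer covers roughly 7,600 tokens, so long prompts or larger models may call for a larger value.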
kv_connector = None class-attribute instance-attribute
The KV connector for vLLM to transmit KV caches between vLLM instances.
kv_connector_extra_config = field(default_factory=dict) class-attribute instance-attribute
Any extra configuration the connector may need.
kv_connector_module_path = None class-attribute instance-attribute
The Python module path to dynamically load the KV connector from. Only supported in V1.
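The source does not show how vLLM resolves this path. As a generic illustration of loading a class from a module path at runtime, the sketch below uses the standard library's `importlib`. The helper `load_class` is hypothetical, and a standard-library class stands in for a connector so the example runs anywhere.

```python
import importlib


def load_class(module_path: str, class_name: str) -> type:
    """Import `module_path` and return the class named `class_name` from it."""
    module = importlib.import_module(module_path)  # raises ImportError if missing
    obj = getattr(module, class_name)              # raises AttributeError if absent
    if not isinstance(obj, type):
        raise TypeError(f"{module_path}.{class_name} is not a class")
    return obj


# Stand-in for a connector: a class from the standard library.
loaded = load_class("collections", "OrderedDict")
print(loaded.__name__)
```

With a real deployment, `module_path` would be the value of `kv_connector_module_path` and the class name would presumably come from `kv_connector`; that pairing is an assumption here.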
kv_ip = '127.0.0.1' class-attribute instance-attribute
The KV connector IP address, used to build the distributed connection.
kv_load_failure_policy = 'fail' class-attribute instance-attribute
Policy for handling KV cache load failures:
- 'recompute': reschedule the request to recompute the failed blocks.
- 'fail': immediately fail the request with an error finish reason (default).
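The difference between the two policies can be sketched as a small dispatch function. This is not vLLM's scheduler code: `handle_load_failure`, the returned dictionary shape, and the `"error"` finish-reason string are all hypothetical stand-ins that only mirror the behavior described above.

```python
from typing import Any, Literal

LoadFailurePolicy = Literal["recompute", "fail"]


def handle_load_failure(
    policy: LoadFailurePolicy, failed_block_ids: set[int]
) -> dict[str, Any]:
    """Decide what happens to a request whose KV blocks failed to load."""
    if policy == "recompute":
        # Put the request back in the queue and recompute only the failed blocks.
        return {"action": "reschedule", "recompute_blocks": sorted(failed_block_ids)}
    if policy == "fail":
        # End the request right away with an error finish reason.
        return {"action": "finish", "finish_reason": "error"}
    raise ValueError(f"unknown kv_load_failure_policy: {policy!r}")


recompute_result = handle_load_failure("recompute", {7, 3})
fail_result = handle_load_failure("fail", {7, 3})
print(recompute_result)
print(fail_result)
```

'recompute' trades extra prefill work for resilience, while 'fail' surfaces the problem to the caller immediately.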
kv_parallel_size = 1 class-attribute instance-attribute
The number of parallel instances for KV cache transfer.
kv_port = 14579 class-attribute instance-attribute
The KV connector port, used to build the distributed connection.
kv_rank = None class-attribute instance-attribute
The rank of this vLLM instance in the KV cache transfer. Typical values: 0 for the prefill instance, 1 for the decode instance. Currently only 1P1D (one prefill instance, one decode instance) is supported.
kv_role = None class-attribute instance-attribute
Whether this vLLM instance produces KV cache, consumes it, or both. Choices are 'kv_producer', 'kv_consumer', and 'kv_both'.
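As a rough illustration of how the role, rank, and connection fields combine in a 1P1D setup, the sketch below builds one producer and one consumer configuration as plain dictionaries and serializes them to JSON, the form accepted by vLLM's `--kv-transfer-config` option. The connector name `ExampleConnector` is a placeholder rather than a real connector, `kv_parallel_size=2` is an assumption, and the helper `make_kv_transfer_config` is hypothetical.

```python
import json

ALLOWED_ROLES = {"kv_producer", "kv_consumer", "kv_both"}


def make_kv_transfer_config(role: str, rank: int) -> dict:
    """Build a dict whose keys mirror KVTransferConfig attribute names."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"kv_role must be one of {sorted(ALLOWED_ROLES)}")
    return {
        "kv_connector": "ExampleConnector",  # placeholder, not a real connector
        "kv_role": role,
        "kv_rank": rank,
        "kv_parallel_size": 2,  # assumption: one prefill plus one decode instance
        "kv_ip": "127.0.0.1",   # documented default
        "kv_port": 14579,       # documented default
    }


# Rank 0 for the prefill (producer) side, rank 1 for the decode (consumer) side.
prefill_json = json.dumps(make_kv_transfer_config("kv_producer", rank=0))
decode_json = json.dumps(make_kv_transfer_config("kv_consumer", rank=1))

print(prefill_json)
print(decode_json)
```

Both instances share the same `kv_ip` and `kv_port` so they can find each other; only the role and rank differ.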
compute_hash()
WARNING: Whenever a new field is added to this config, ensure that it is included in the factors list if it affects the computation graph.
Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph, from input IDs/embeddings to the final hidden states. Anything before the input IDs/embeddings or after the final hidden states is excluded.
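The factors-list pattern described above can be sketched with a toy config. This is not the body of `KVTransferConfig.compute_hash`: the class `ToyConfig`, its two fields, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyConfig:
    graph_width: int = 8     # hypothetical field that changes the computation graph
    log_level: str = "info"  # hypothetical field that does not

    def compute_hash(self) -> str:
        # Only fields that affect the computation graph belong in `factors`.
        # A new graph-affecting field must be appended here, or two different
        # graphs would end up sharing the same hash.
        factors: list[Any] = [self.graph_width]
        return hashlib.sha256(str(factors).encode()).hexdigest()


base = ToyConfig()
same_graph = ToyConfig(log_level="debug")
other_graph = ToyConfig(graph_width=16)

print(base.compute_hash() == same_graph.compute_hash())
print(base.compute_hash() == other_graph.compute_hash())
```

Changing `log_level` leaves the hash untouched because it is not a factor, whereas changing `graph_width` yields a different hash.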