vllm.v1.attention.ops.merge_attn_states
Functions:
- merge_attn_states – Merge partial attention outputs from prefix (KV cache) and suffix (new tokens).
merge_attn_states(output, prefix_output, prefix_lse, suffix_output, suffix_lse, output_lse=None, prefill_tokens_with_context=None, output_scale=None)
Merge partial attention outputs from prefix (KV cache) and suffix (new tokens) into a single output tensor using the log-sum-exp (LSE) rescaling method described in section 2.2 of https://www.arxiv.org/pdf/2501.01005.
For tokens that have prefix context (token index < prefill_tokens_with_context), the prefix and suffix partial outputs are combined as a weighted sum. For tokens without prefix context, the suffix output is copied directly.
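The weighted sum described above can be illustrated for a single query and a single head. The following is a minimal pure-Python sketch, not the library's kernel: the helper names `partial_attention` and `merge_pair` are hypothetical, the usual 1/sqrt(d) score scaling is omitted for brevity, and the LSE is assumed to be the natural log of the sum of exponentiated attention scores. It checks that merging two partial results reproduces attention computed over the concatenated keys and values.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of one query over one block of keys/values.

    Returns (output, lse), where lse = log(sum(exp(score))) over the block.
    Hypothetical helper for illustration only.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    head_size = len(values[0])
    out = [
        sum(w * v[d] for w, v in zip(weights, values)) / denom
        for d in range(head_size)
    ]
    return out, m + math.log(denom)


def merge_pair(prefix_out, prefix_lse, suffix_out, suffix_lse):
    """Combine two partial attention results using their LSE values."""
    m = max(prefix_lse, suffix_lse)
    p_w = math.exp(prefix_lse - m)  # relative weight of the prefix block
    s_w = math.exp(suffix_lse - m)  # relative weight of the suffix block
    total = p_w + s_w
    merged = [
        (p_w * p + s_w * s) / total for p, s in zip(prefix_out, suffix_out)
    ]
    return merged, m + math.log(total)


# Toy data: one query, two prefix (cached) keys, one suffix (new) key.
q = [0.5, -1.0]
prefix_k, prefix_v = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
suffix_k, suffix_v = [[1.0, 1.0]], [[5.0, 6.0]]

p_out, p_lse = partial_attention(q, prefix_k, prefix_v)
s_out, s_lse = partial_attention(q, suffix_k, suffix_v)
merged_out, merged_lse = merge_pair(p_out, p_lse, s_out, s_lse)

# Reference: attention over the concatenated prefix + suffix.
full_out, full_lse = partial_attention(
    q, prefix_k + suffix_k, prefix_v + suffix_v
)
```

Under these assumptions `merged_out` and `merged_lse` agree with `full_out` and `full_lse` up to floating-point rounding, which is why the two partial results can be computed separately and combined afterwards.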
Parameters:
- output (Tensor) – Output tensor of shape [NUM_TOKENS, NUM_HEADS, HEAD_SIZE].
- prefix_output (Tensor) – Partial attention output over the prefix (KV cache), shape [NUM_TOKENS, NUM_HEADS, HEAD_SIZE].
- prefix_lse (Tensor) – Log-sum-exp values for the prefix attention, shape [NUM_HEADS, NUM_TOKENS].
- suffix_output (Tensor) – Partial attention output over the suffix (new KV), shape [NUM_TOKENS, NUM_HEADS, HEAD_SIZE].
- suffix_lse (Tensor) – Log-sum-exp values for the suffix attention, shape [NUM_HEADS, NUM_TOKENS].
- output_lse (Tensor | None, default: None) – Optional tensor to store the merged LSE values, shape [NUM_HEADS, NUM_TOKENS]. If None, LSE is not written out.
- prefill_tokens_with_context (int | None, default: None) – Number of prefill tokens that have prefix context and therefore require merging. Tokens at indices >= this value are decode or context-free prefill tokens whose output is taken directly from suffix_output. If None, all tokens are treated as having context.
- output_scale (Tensor | None, default: None) – Optional scalar tensor for FP8 static quantization. When provided, output must be FP8 dtype.
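The tensor layouts and the prefill_tokens_with_context cutoff can be illustrated with a small reference sketch. This is a hypothetical pure-Python stand-in, not the actual implementation: nested lists replace tensors, `merge_attn_states_reference` is an invented name, FP8 handling via output_scale is left out, and copying suffix_lse into output_lse for tokens without context is an assumption rather than documented behavior.

```python
import math


def merge_attn_states_reference(
    output,
    prefix_output,
    prefix_lse,
    suffix_output,
    suffix_lse,
    output_lse=None,
    prefill_tokens_with_context=None,
):
    """Hypothetical list-based sketch of the merge; writes into `output`.

    Outputs are indexed [token][head][dim]; LSE values are indexed
    [head][token], mirroring the shapes listed above.
    """
    num_tokens = len(suffix_output)
    # None means every token is treated as having prefix context.
    cutoff = (
        num_tokens
        if prefill_tokens_with_context is None
        else prefill_tokens_with_context
    )
    for t in range(num_tokens):
        for h in range(len(suffix_output[t])):
            if t < cutoff:
                p_lse, s_lse = prefix_lse[h][t], suffix_lse[h][t]
                m = max(p_lse, s_lse)
                p_w, s_w = math.exp(p_lse - m), math.exp(s_lse - m)
                total = p_w + s_w
                output[t][h] = [
                    (p_w * p + s_w * s) / total
                    for p, s in zip(prefix_output[t][h], suffix_output[t][h])
                ]
                merged_lse = m + math.log(total)
            else:
                # No prefix context: take the suffix result as-is.
                output[t][h] = list(suffix_output[t][h])
                merged_lse = s_lse_copy = suffix_lse[h][t]  # assumption
            if output_lse is not None:
                output_lse[h][t] = merged_lse


# Two tokens, one head, head size 2. Only token 0 has prefix context.
prefix_output = [[[1.0, 0.0]], [[9.0, 9.0]]]
suffix_output = [[[3.0, 4.0]], [[5.0, 6.0]]]
prefix_lse = [[0.0, 0.0]]  # [NUM_HEADS][NUM_TOKENS]
suffix_lse = [[0.0, 1.0]]
output = [[[0.0, 0.0]], [[0.0, 0.0]]]
output_lse = [[0.0, 0.0]]

merge_attn_states_reference(
    output,
    prefix_output,
    prefix_lse,
    suffix_output,
    suffix_lse,
    output_lse=output_lse,
    prefill_tokens_with_context=1,
)
```

In this toy input, token 0 has equal prefix and suffix LSE values, so its output is the plain average of the two partial outputs, while token 1 falls at or beyond the cutoff and simply receives its suffix output.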