Hacker News · Feb 15, 2026 · Collected from RSS
Article URL: https://huggingface.co/blog/continuous_batching Comments URL: https://news.ycombinator.com/item?id=47028545 Points: 8 # Comments: 2
TL;DR: in this blog post, starting from attention mechanisms and KV caching, we derive continuous batching by optimizing for throughput.

If you've ever used Qwen, Claude, or any other AI chatbot, you've probably noticed something: it takes a while for the first word of the response to appear, and then words appear one by one on your screen at (hopefully) a regular and fast-paced frequency. That's because, at heart, all LLMs are just fancy next-token predictors. An LLM first processes your entire prompt to produce one new token. Then it keeps adding tokens one by one, each time reading everything that came before, until it decides generation is over.

This generation process is computationally expensive: it requires passing the input through billions of parameters for each token generated. To make these models practical for real-world applications, particularly when serving many users simultaneously, researchers and engineers have developed a range of efficient inference techniques. One of the most impactful is continuous batching, which maximizes throughput by processing multiple conversations in parallel and swapping them out as they finish. To understand how continuous batching works and why it's so effective in high-load serving scenarios, we'll build up from the fundamentals of how LLMs process tokens.

Attention

The attention mechanism is the central piece of how LLMs work. A language model processes text by breaking it down into pieces that we call tokens. We can conceptually think of "tokens" as "words", but sometimes a word is composed of several tokens. For each token sequence, the network computes a prediction of what the next token should be.

Many operations in the network are token-wise: each token is processed independently, and the output for a given token depends only on that token's content, not on any other tokens in the sequence. Operations like this include layer normalization and matrix multiplication. However, to create connections between words in a sentence, we need operations where tokens can influence each other. This is where attention comes in. Attention layers are the only place where different tokens interact with each other, so understanding how a network connects tokens together means understanding attention.

Let's see how this works in practice, in the case where there is only one input prompt. Consider the initial prompt "I am sure this project", tokenized as 7 tokens: [<bos>, I, am, sure, this, pro, ject]. The <bos>, or "Beginning of Sequence", is a special token we add at the start of the prompt to tell the language model that a new conversation starts here. Each token is represented inside the network by a vector of length $d$ (the hidden dimension). Therefore, the seven incoming tokens form a tensor $x$ of shape $[1, 7, d]$: 1 is the number of sequences, or batch size, which is just one in our case; 7 is the sequence length; and $d$ is the hidden dimension, the size of each token representation. Going forward, we'll use $n$ instead of 7 for the sequence length.

The input tensor $x$ is then projected by three matrices: the query projection $W_q$, the key projection $W_k$ and the value projection $W_v$. This produces three tensors $Q$, $K$ and $V$, all of shape $[1, n, A]$, where $A$ is the dimension of the attention head. We call them the query, key and value states, respectively.
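To make these shapes concrete, here is a minimal PyTorch sketch of the three projections. The dimension values and tensor names are illustrative assumptions rather than values from the post, and a real model applies these projections inside every attention layer rather than once like this.

```python
import torch

batch, n, d, A = 1, 7, 512, 64      # batch size, sequence length, hidden dim, head dim (example values)

x = torch.randn(batch, n, d)        # token representations, shape [1, n, d]

W_q = torch.randn(d, A)             # query projection (random stand-in for learned weights)
W_k = torch.randn(d, A)             # key projection
W_v = torch.randn(d, A)             # value projection

Q = x @ W_q                         # query states, shape [1, n, A]
K = x @ W_k                         # key states,   shape [1, n, A]
V = x @ W_v                         # value states, shape [1, n, A]

print(Q.shape, K.shape, V.shape)    # torch.Size([1, 7, 64]) three times
```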
This is represented on the left in the figure below. Next, tensors $Q$ and $K$ are multiplied together to measure similarity between tokens, producing a tensor of shape $[1, n, n]$. This is why we say that attention has quadratic complexity in sequence length: computing $QK^T$ requires $\mathcal{O}(n^2 d)$ operations, so the cost grows with the square of the sequence length $n$. It is represented on the right in the figure above.

We then apply a boolean attention mask to $QK^T$ to control which tokens can interact, as represented in the figure below. In this figure, the attention mask is a causal mask, meaning each token only interacts with tokens that came before it. This follows the intuition that a cause must come before its consequence, hence the name causal mask. The attention mask is crucial because it dictates all token interactions in the network: set all attention mask values to False and no token will ever interact with another in the whole network. We'll examine attention masks more closely in a few paragraphs.

Finally, after applying the attention mask, we take a token-wise softmax (which is the same as saying a row-wise softmax) and multiply the result by the value states $V$ to get the output of one attention head, of shape $[1, n, A]$. We offer a visual summary of the whole process in the following figure.

We are going to use a lot of attention visualizations in this post, so to simplify things, we are going to condense the figure above just a bit.

Why this matters: in continuous batching, $Q$, $K$, and $V$ can have different numbers of tokens because, as we'll see, we'll be processing different stages (prefill and decode) at the same time. To make things more general, let's say $Q$ has shape $[1, n_Q, A]$, $K$ has shape $[1, n_K, A]$, and $V$ has shape $[1, n_V, A]$. The attention scores $QK^T$ then have shape $[1, n_Q, n_K]$, and the attention mask has the same shape since it's applied point-wise to the scores. After applying the attention mask and the row-wise softmax, we multiply by $V$. Since we're multiplying a matrix of shape $[1, n_Q, n_K]$ by one of shape $[1, n_V, A]$, the inner dimensions must match: $n_K = n_V$. This means $V$ and $K$ always have the same length, so we can simplify our visualizations by only showing $K$. Don't worry if this seems abstract: the figures will make it concrete.

Furthermore, since the attention mask is applied to $QK^T$, we know they have the same shape, so instead of representing the attention scores, we will represent the attention mask in their place. Finally, since $Q$, $K$ and $V$ are direct projections of $x$, there is no need to represent $x$. This gives the simplified figure where we only represent $Q$, $K$ and the attention mask.

This representation also underlines how we can read an attention mask. We read the mask row by row, which is the same as reading it token by token: each row corresponds to one token's attention computation. A green square at position (row i, column j) means True: token j can influence token i. A white square means False: no interaction allowed. For example, look at the third row, for the token "am". The "I" column is green, so "I" influences the computation of "am". The "pro" column is white, so "pro" doesn't influence "am". This is causal masking at work: future tokens can't affect past ones.
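As a rough sketch of how these generalized shapes fit together, here is a single attention head in PyTorch that allows $n_Q \neq n_K$ by treating the queries as the last $n_Q$ tokens of the sequence. The function name and the $1/\sqrt{A}$ scaling (standard practice, not discussed above) are my additions, not code from the post.

```python
import torch

def attention_head(Q, K, V):
    """One attention head with a causal mask, allowing n_Q != n_K.

    Q: [1, n_Q, A], K: [1, n_K, A], V: [1, n_K, A] (recall n_V == n_K).
    The n_Q queries are assumed to be the last n_Q tokens of the sequence,
    so a query may only attend to keys at positions up to its own.
    """
    n_Q, n_K, A = Q.shape[1], K.shape[1], Q.shape[2]

    scores = Q @ K.transpose(1, 2) / A**0.5            # attention scores, shape [1, n_Q, n_K]

    # Boolean causal mask, shape [n_Q, n_K]: True (green) = allowed, False (white) = blocked.
    q_pos = torch.arange(n_K - n_Q, n_K).unsqueeze(1)  # absolute positions of the queries
    k_pos = torch.arange(n_K).unsqueeze(0)             # absolute positions of the keys
    mask = k_pos <= q_pos

    scores = scores.masked_fill(~mask, float("-inf"))  # blocked entries get zero weight after softmax
    weights = scores.softmax(dim=-1)                   # row-wise softmax
    return weights @ V                                 # output, shape [1, n_Q, A]

# Prefill: all 7 prompt tokens at once, n_Q = n_K = 7.
A = 64
Q, K, V = (torch.randn(1, 7, A) for _ in range(3))
print(attention_head(Q, K, V).shape)                   # torch.Size([1, 7, 64])
```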
The last layer of the model outputs a token prediction for each input token. In our context, generating the continuation of a single prompt, we only care about the next-token prediction from the last token. The last token is "ject" in the figure above, and the associated prediction is "will".

The process we just described, where we take an entire input sequence, pass it through multiple attention layers and compute a score for the next token, is called prefill. This is because, as we'll see in a moment, much of the computation we performed can be cached and reused: hence, we are prefilling the cache. Thanks to this cache, sequence generation can proceed with much less compute in a phase called decoding. In the decoding phase, generating one new token is much faster than the initial full-sequence computation. Let's see why.

To continue generation, we begin a new forward pass, which would naively look like this: to compute the attention scores of the new token, we still need the key and value projections of the previous tokens, so we would repeat the matrix multiplication of the old tokens (in grey in the figure above) with $W_k$ and $W_v$ to retrieve a result that was already computed once before. In other words, we are wasting compute. Let's see how we can avoid that.

KV-cache

Right off the bat, we notice that the last token does not impact the attention calculation of the other tokens. This follows from the causal mask: since "will" comes after all previous tokens, it does not change their attention calculation. For text generation, causal attention is by far the most common, so we will focus on that case from now on. Keep in mind that non-causal attention schemes can also be used, especially when dealing with images.

Considering we only need the next-token prediction for the "will" token, we can simplify the attention mechanism by computing the output for this token only. Moreover, we already computed the $K$ and $V$ states for the tokens "<bos>", …, "ject" during the previous forward pass: if they have been stored, we do not need to recompute them. This is the KV cache: the list of key and value states created during generation. It reduces the compute cost of generating token $n+1$ from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$ by avoiding recomputation of the key and value projections, while paying a memory cost of $\mathcal{O}(n)$. In the figure above, only the tokens in white are computed: instead of computing the keys and values for 8 tokens, we compute them for 1. You can see that through KV caching, a lot of compute is saved. You can check this post for more visualizations of KV caching, or this one for a practical implementation example.

Let's be a bit more specific about the cache size, because it's a good opportunity to examine the shapes present in our model.
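To make the savings concrete, here is a minimal, hypothetical sketch of prefill followed by cached decoding in PyTorch, using `torch.nn.functional.scaled_dot_product_attention` for the attention itself. The random tensors stand in for real token embeddings and learned projection weights, so this illustrates the shapes and the caching pattern rather than a real model.

```python
import torch
import torch.nn.functional as F

d, A = 512, 64                                        # hidden dim and head dim (example values)
W_q, W_k, W_v = (torch.randn(d, A) for _ in range(3)) # stand-ins for learned projections

# --- Prefill: process the whole prompt once and fill the cache ---
x_prompt = torch.randn(1, 7, d)                       # 7 prompt tokens, shape [1, 7, d]
K_cache = x_prompt @ W_k                              # [1, 7, A], stored for reuse
V_cache = x_prompt @ W_v                              # [1, 7, A], stored for reuse
out = F.scaled_dot_product_attention(
    x_prompt @ W_q, K_cache, V_cache, is_causal=True  # causal attention over the full prompt
)                                                     # [1, 7, A]; only the last row feeds the next-token prediction

# --- Decode: one new token per step, reusing cached keys and values ---
for _ in range(3):
    x_new = torch.randn(1, 1, d)                      # embedding of the newly generated token (random stand-in)
    K_cache = torch.cat([K_cache, x_new @ W_k], dim=1)   # append the new key instead of recomputing all keys
    V_cache = torch.cat([V_cache, x_new @ W_v], dim=1)   # append the new value
    out = F.scaled_dot_product_attention(
        x_new @ W_q, K_cache, V_cache                 # one query attends to all cached keys/values
    )                                                 # [1, 1, A]: O(n) work per new token instead of O(n^2)
```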