LLM Inference Introduction: the end-to-end flow and vector math


Large Language Models (LLMs) are (mostly) autoregressive models: they predict the next token given the tokens so far, and repeat this loop until stopping. This post focuses on inference: the overall flow, what vectors get computed, and how the attention mechanism combines information across tokens. We also look at what data is produced in each phase of inference and how it is reused in later phases.


2. Inference: the end-to-end loop

At a high level, inference is:

  1. Tokenize the prompt into token IDs.
  2. Embed token IDs into vectors (plus positional information).
  3. Run $L$ Transformer blocks (attention + MLP).
  4. Produce logits for the next token; convert to probabilities.
  5. Select the next token (argmax or sampling), append it, and repeat.

The loop runs until an end condition: max tokens, an EOS token, or a stopping rule.
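As a sketch, the five steps above can be written as a short Python loop. The `forward` function here is a hypothetical stub that returns random logits; it stands in for the real embed → Transformer blocks → unembed computation:

```python
import numpy as np

VOCAB, EOS, MAX_NEW = 8, 7, 5

def forward(token_ids):
    # Stand-in for embed -> L Transformer blocks -> unembed.
    # Returns logits over the vocabulary for the *last* position.
    rng_local = np.random.default_rng(sum(token_ids))  # deterministic stub
    return rng_local.normal(size=VOCAB)

def generate(prompt_ids):
    ids = list(prompt_ids)
    for _ in range(MAX_NEW):              # repeat until a stop condition
        logits = forward(ids)             # steps 2-4: vectors -> logits
        next_id = int(np.argmax(logits))  # step 5: greedy selection
        ids.append(next_id)
        if next_id == EOS:                # EOS stopping rule
            break
    return ids

out = generate([1, 2, 3])
```

Swapping `argmax` for sampling (temperature, top-$k$, top-$p$) changes the selection step only; the loop structure stays the same.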

2.1 ASCII overview

Prompt text
  |
  v
Tokenize  --->  token IDs:  [t1, t2, ..., tn]
  |
  v
Embed + Position  --->  X0 in R^{n x d}
  |
  v
Transformer blocks (repeat L times)
  |
  v
Final hidden state H in R^{n x d}
  |
  v
Logits for next token (use last position hn)
  |
  v
Softmax / sampling  --->  next token id
  |
  v
Detokenize (for display)

3. Tokenization and position embeddings

Here, there are $n$ tokens, each embedded in a $d$-dimensional space. Positional information is also injected at this stage.

3.1 Tokenization (text → IDs)

Modern LLMs typically use a subword tokenizer (BPE, Unigram, WordPiece, etc.). The prompt becomes a sequence of token IDs:

\[(t_1, t_2, \dots, t_n), \quad t_i \in \{1, \dots, |V|\}\]

where $|V|$ is vocabulary size.

Practical notes:

  • Tokenization defines the model’s “alphabet.” It impacts context length, efficiency, and what counts as a single step of generation.
  • Special tokens (BOS/EOS, system delimiters, etc.) are often inserted by the serving stack.
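To make the idea concrete, here is a toy greedy longest-match tokenizer over a hand-written vocabulary — purely illustrative (real BPE/Unigram/WordPiece vocabularies are learned from data, and production tokenizers handle merges, byte fallback, and normalization):

```python
# Toy greedy longest-match subword tokenizer (illustrative only).
VOCAB = {"<bos>": 0, "I": 1, " like": 2, " te": 3, "a": 4, " tea": 5}

def tokenize(text):
    ids = [VOCAB["<bos>"]]          # special token inserted by the serving stack
    while text:
        # take the longest vocabulary entry matching the current prefix
        match = max((s for s in VOCAB if text.startswith(s)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers {text!r}")
        ids.append(VOCAB[match])
        text = text[len(match):]
    return ids

ids = tokenize("I like tea")       # -> [0, 1, 2, 5]
```

Note how " tea" is matched as one token even though " te" + "a" would also cover it — the vocabulary determines what counts as a single generation step.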

3.2 Turning token IDs into vectors

Two ingredients turn token IDs into the model’s working representation: an embedding lookup (§3.3) and positional information (§3.4).

3.3 Embedding lookup

The model stores an embedding table $E \in \mathbb{R}^{|V| \times d}$.

For each token ID $t_i$, we take a row of this table:

\[x_i = E[t_i] \in \mathbb{R}^d\]

Stacking all tokens gives:

\[X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} \in \mathbb{R}^{n \times d}\]
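In code, the lookup is just row indexing into the table (toy sizes here; real models use $|V| \sim 10^5$ and $d$ in the thousands):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 8                   # toy vocabulary size and embedding width
E = rng.normal(size=(V, d))      # embedding table, one row per token ID

token_ids = np.array([101, 205, 309])
X = E[token_ids]                 # row lookup: one d-vector per token, stacked
```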

3.4 Positional information

Transformers need position information because attention alone is permutation-invariant. Common approaches:

  • Absolute positional embeddings: $X \leftarrow X + P$ with $P \in \mathbb{R}^{n \times d}$.
  • Rotary / relative position methods (e.g., RoPE): position is injected into the attention computation (often into $Q$ and $K$) rather than added to $X$.

Either way, the result is an initial matrix $X_0 \in \mathbb{R}^{n \times d}$.

Toy example ($n=3, d=4$):

  • Tokens: "I like tea" $\rightarrow [101, 205, 309]$
  • Token embeddings:
\[e_1=[0.2, 0.1, 0.0, 0.4],\quad e_2=[0.0, 0.3, 0.5, 0.1],\quad e_3=[0.6, 0.2, 0.1, 0.0]\]
  • Positional vectors:
\[p_1=[0.01, 0.00, 0.02, 0.00],\quad p_2=[0.00, 0.01, 0.00, 0.02],\quad p_3=[0.02, 0.00, 0.01, 0.00]\]
  • Input to first block (absolute position style):
\[x_i = e_i + p_i\]

So,

\[x_1=[0.21, 0.10, 0.02, 0.40],\quad x_2=[0.00, 0.31, 0.50, 0.12],\quad x_3=[0.62, 0.20, 0.11, 0.00]\]

These rows stack to form $X_0 \in \mathbb{R}^{3 \times 4}$.

Quick clarification:

  • If there are only 3 positions, you still have 3 positional vectors ($p_1,p_2,p_3$), but each vector is in $\mathbb{R}^d$ (here $d=4$). So “3” controls how many rows, and “4” controls vector width.
  • In training, token embeddings are updated by backpropagation.
  • Positional information depends on method: learned absolute positional embeddings are also updated; fixed sinusoidal encodings are not; RoPE (Rotary Positional Embedding) uses a fixed rotation rule (typically no learned per-position table).
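As an illustration of the rotation rule, here is a minimal RoPE sketch for a single query/key vector. This is a simplification: real implementations differ in the pairing convention (interleaved vs. split-half, as used here) and apply the rotation inside attention for every head and position:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of one query/key vector
    by a position-dependent angle (minimal RoPE sketch)."""
    d = x.shape[-1]
    half = d // 2
    theta = base ** (-np.arange(half) / half)   # per-pair frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    # split-half pairing: dimension i pairs with dimension i + d/2
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.array([1.0, 0.0, 0.0, 1.0])
q0 = rope(q, pos=0)   # position 0: zero angle, vector unchanged
q5 = rope(q, pos=5)   # rotation preserves the vector's norm
```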

4. Inside one Transformer block (MHA + FFN)

The Transformer block is the key building block behind LLMs. At a high level, it mixes information across tokens (via attention) and then applies a nonlinear transformation to each token vector (via the MLP/FFN). The mixing step lets the model build up contextual representations, with positional information telling it where each token sits; the MLP step then creates new features that are not just linear combinations of the input.

Input: X in R^{n x d}

  X
  |  (linear projections)
  +--> Q = X W_Q   in R^{n x d_k}
  +--> K = X W_K   in R^{n x d_k}
  +--> V = X W_V   in R^{n x d_v}
  |
  +--> A = softmax(Q K^T / sqrt(d_k) + causal mask)  in R^{n x n}
  |
  +--> H_attn = A V   in R^{n x d_v}  --(proj back to d)-->  R^{n x d}
  |
  +--> residual add  (X + attn_out)  -> X'
  |
  +--> FFN/MLP applied per token: X' -> FFN(X')  in R^{n x d}
  |
  +--> residual add  (X' + ffn_out)  -> Output X_next in R^{n x d}

4.1 What a Transformer block computes (vectors everywhere)

Each block maps $X_\ell \mapsto X_{\ell+1}$ using (i) multi-head self-attention and (ii) an MLP, typically with residual connections and layer normalization. Here, $\ell$ indexes the block number, and $X_\ell \in \mathbb{R}^{n \times d}$ is the hidden state matrix at that layer.

The exact ordering varies (pre-norm vs post-norm), but the core computations are consistent.

4.2 Self-attention: from $X$ to context-mixed vectors

Given input $X \in \mathbb{R}^{n \times d}$, one attention head computes:

\(Q = X W_Q, \quad K = X W_K, \quad V = X W_V\) Here, $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$ are learned projection matrices. The input $X$ is linearly projected into three different spaces to produce the query ($Q$), key ($K$), and value ($V$) matrices, each with $n$ rows (one per token) and $d_k$ or $d_v$ columns.

This projection matters because it maps each token representation into separate query, key, and value spaces, letting the model score and combine tokens using relationships it learned during training.

The attention score matrix is:

\[S = \frac{Q K^T}{\sqrt{d_k}} + M\]

where $M$ is a mask. For causal (autoregressive) inference, the mask prevents attending to future tokens:

\[M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}\]

Turn scores into weights row-wise:

\[A = \operatorname{softmax}(S) \in \mathbb{R}^{n \times n}\]

and mix value vectors:

\[H = A V \in \mathbb{R}^{n \times d_v}\]

Multi-head attention repeats this in parallel and concatenates heads, followed by an output projection $W_O$ back to $d$.
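The single-head computation above maps directly to NumPy. This sketch uses the toy $3 \times 2$ input from §6 with identity projections, so the numbers can be checked against that section:

```python
import numpy as np

def causal_attention(X, W_Q, W_K, W_V):
    """One attention head with a causal mask (single-head sketch)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                          # scaled scores
    n = S.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # j > i positions
    S = np.where(future, -np.inf, S)                    # causal mask M
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                  # row-wise softmax
    return A @ V, A                                     # mixed values, weights

X = np.array([[1., 0.], [0., 1.], [1., 1.]])  # toy embeddings from §6
I2 = np.eye(2)                                # identity projections: Q=K=V=X
H, A = causal_attention(X, I2, I2, I2)
```

Row 3 of `A` comes out near $[0.25, 0.25, 0.50]$ and `H[2]` near $[0.75, 0.75]$, matching the hand computation in §6.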

4.3 More on Multi-Head Attention (MHA)

Why multiple heads? A single attention head can only learn one “type” of relationship between positions (e.g., syntactic adjacency). With $h$ heads, the model can attend to different aspects simultaneously — one head might track subject–verb agreement, another might focus on nearby tokens, yet another on coreference.

Concretely, if the model dimension is $d$ and there are $h$ heads, each head works with $d_k = d_v = d / h$. Each head $i$ has its own learned projections $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$ and produces its own output $\text{head}_i \in \mathbb{R}^{n \times d_v}$. These are concatenated and linearly projected:

\[\text{MHA}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W_O, \quad W_O \in \mathbb{R}^{d \times d}\]
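A minimal sketch of the split-attend-concatenate-project pattern. The causal mask is omitted here to keep the head loop short (§4.4 adds it), and the projections are random stand-ins:

```python
import numpy as np

def mha(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head attention sketch: h heads of width d_k = d / h,
    attended independently, then concatenated and projected by W_O."""
    n, d = X.shape
    d_k = d // h
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)           # this head's subspace
        Q, K, V = X @ W_Q[:, sl], X @ W_K[:, sl], X @ W_V[:, sl]
        S = Q @ K.T / np.sqrt(d_k)
        A = np.exp(S - S.max(-1, keepdims=True))
        A /= A.sum(-1, keepdims=True)                # row-wise softmax
        heads.append(A @ V)                          # n x d_k per head
    return np.concatenate(heads, axis=1) @ W_O       # back to n x d

n, d, h = 3, 4, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W = [np.eye(d) for _ in range(4)]   # identity projections, as in the toy example
out = mha(X, *W, h=h)               # shape (3, 4)
```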

Running toy example ($n=3$, $d=4$, $h=2$ heads, "I like tea"):

Let the input after embedding + position be:

\[X = \begin{bmatrix} 1 & 0 & 0.5 & 0.5 \\ 0 & 1 & 0.5 & -0.5 \\ 1 & 1 & 0 & 0 \end{bmatrix} \in \mathbb{R}^{3 \times 4}\]

With $h=2$ heads, we split $d=4$ into two $d_k=2$ subspaces. Using identity projections within each head for simplicity:

  • Head 1 uses columns 1–2: $X^{(1)} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}$
  • Head 2 uses columns 3–4: $X^{(2)} = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & -0.5 \\ 0 & 0 \end{bmatrix}$

Each head computes $Q^{(i)} K^{(i)T}$ independently (with $Q=K=V=X^{(i)}$):

Head 1 — raw scores:

\[Q^{(1)} K^{(1)T} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}\]

Head 2 — raw scores:

\[Q^{(2)} K^{(2)T} = \begin{bmatrix} 0.50 & 0.00 & 0 \\ 0.00 & 0.50 & 0 \\ 0 & 0 & 0 \end{bmatrix}\]

Notice Head 1 gives high scores when the first two dimensions align (largest for “tea” ↔ “tea”), while Head 2 gives scores based on the secondary features — each head captures a different relationship.

The causal mask and softmax are applied to each head independently (see §4.4).

Toy example continued — applying MHA to our $d=4$, $h=2$ example:

After causal masking and softmax (computed in §4.4), the two heads produce:

\[\text{head}_1 = \begin{bmatrix} 1.00 & 0.00 \\ 0.33 & 0.67 \\ 0.75 & 0.75 \end{bmatrix}, \quad \text{head}_2 = \begin{bmatrix} 0.50 & 0.50 \\ 0.50 & -0.09 \\ 0.33 & 0.00 \end{bmatrix}\]

Concatenate along the column axis:

\[\text{Concat}(\text{head}_1, \text{head}_2) = \begin{bmatrix} 1.00 & 0.00 & 0.50 & 0.50 \\ 0.33 & 0.67 & 0.50 & -0.09 \\ 0.75 & 0.75 & 0.33 & 0.00 \end{bmatrix} \in \mathbb{R}^{3 \times 4}\]

Multiply by $W_O \in \mathbb{R}^{4 \times 4}$ (using identity for simplicity) to get the final attention output and then add the residual:

\[\text{MHA}(X) = \text{Concat} \cdot W_O \in \mathbb{R}^{3 \times 4}\]

Key takeaway: Head 1 captured positional/syntactic similarity (columns 1–2), while Head 2 captured a different feature subspace (columns 3–4). The concat + projection merges both views back into a single $d$-dimensional representation per token.

4.4 The causal mask: why it matters

In autoregressive generation the model must not peek at future tokens — token $i$ can only attend to positions $1, \dots, i$. The causal mask enforces this by adding $-\infty$ to illegal positions before softmax, which drives those attention weights to zero.

Toy mask example ($n=4$ tokens: The cat sat down):

\[M = \begin{bmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{bmatrix}\]

Reading row by row:

  • Row 1 (“The”): can only see itself → attention is $[1, 0, 0, 0]$.
  • Row 2 (“cat”): can see “The” and itself → weights spread over positions 1–2.
  • Row 3 (“sat”): can see “The”, “cat”, itself → weights over 1–3.
  • Row 4 (“down”): can see all four tokens.

After adding $M$ to the raw score matrix $QK^T/\sqrt{d_k}$, softmax turns every $-\infty$ entry into $0$, ensuring no information leaks from the future.
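Building the mask takes one `np.triu` call. With all raw scores set equal, the masked softmax reproduces exactly the row-by-row behavior described above:

```python
import numpy as np

n = 4
# M: 0 where j <= i, -inf where j > i (strictly upper triangle)
M = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, 0.0)

scores = np.zeros((n, n))           # pretend all raw scores are equal
A = np.exp(scores + M)              # exp(-inf) = 0: future positions vanish
A /= A.sum(axis=-1, keepdims=True)  # row-wise softmax
# Row 1 attends only to itself; row 4 spreads evenly over all four tokens.
```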

Why is this critical?

  • Without the mask, the model could “cheat” during training by reading the token it is supposed to predict, making the learned weights useless.
  • At inference the future tokens don’t exist yet, so the mask matches reality — but it must be present during training so the model learns correct causal distributions.

Toy example continued — applying the $3 \times 3$ causal mask to both heads of our running example:

\[M = \begin{bmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{bmatrix}\]

After masking, row 1 keeps only position 1, row 2 keeps positions 1–2, row 3 keeps all three.

Head 1 (raw scores from §4.3, scaled by $1/\sqrt{2}$):

Softmax (row-wise):

\[A^{(1)} \approx \begin{bmatrix} 1.00 & 0 & 0 \\ 0.33 & 0.67 & 0 \\ 0.25 & 0.25 & 0.50 \end{bmatrix}\]

(Row 1: one valid entry → weight 1. Row 2: $\text{softmax}([0.00, 0.71]) \approx [0.33, 0.67]$. Row 3: $\text{softmax}([0.71, 0.71, 1.41]) \approx [0.25, 0.25, 0.50]$.)

Output $\text{head}_1 = A^{(1)} V^{(1)}$:

\[h_1^{(1)} = 1.00 \cdot [1,0] = [1.00,\; 0.00]\] \[h_2^{(1)} = 0.33 \cdot [1,0] + 0.67 \cdot [0,1] = [0.33,\; 0.67]\] \[h_3^{(1)} = 0.25 \cdot [1,0] + 0.25 \cdot [0,1] + 0.50 \cdot [1,1] = [0.75,\; 0.75]\]

Head 2 (raw scores from §4.3, scaled by $1/\sqrt{2}$):

Softmax (row-wise):

\[A^{(2)} \approx \begin{bmatrix} 1.00 & 0 & 0 \\ 0.41 & 0.59 & 0 \\ 0.33 & 0.33 & 0.33 \end{bmatrix}\]

(Row 2: $\text{softmax}([0.00, 0.35]) \approx [0.41, 0.59]$. Head 2’s scores are smaller, so the softmax weights are more uniform.)

Output $\text{head}_2 = A^{(2)} V^{(2)}$:

\[h_1^{(2)} = 1.00 \cdot [0.5, 0.5] = [0.50,\; 0.50]\] \[h_2^{(2)} = 0.41 \cdot [0.5, 0.5] + 0.59 \cdot [0.5, -0.5] = [0.50,\; -0.09]\] \[h_3^{(2)} = 0.33 \cdot [0.5, 0.5] + 0.33 \cdot [0.5, -0.5] + 0.33 \cdot [0, 0] = [0.33,\; 0.00]\]

Notice the difference: Head 1 gives “tea” ($h_3$) a balanced blend $[0.75, 0.75]$ driven by syntactic features, while Head 2 gives it $[0.33, 0.00]$ — a different mix from the secondary feature subspace. This is exactly why multiple heads help: each captures a distinct relationship.

The residual connection then adds the original input back: $X' = X + \text{MHA}(X)$, so the block refines the input rather than replacing it. The FFN runs next (§4.5).

4.5 MLP: per-token nonlinearity

The feed-forward network (MLP) applies independently to each position:

\[\operatorname{MLP}(x) = W_2\,\sigma(W_1 x + b_1) + b_2\]

where $\sigma$ is often GELU or a gated variant (e.g., SwiGLU). This increases representational capacity beyond linear mixing.

4.6 FFN (feed-forward network): the “feature factory”

In many write-ups, the per-token MLP is called the FFN layer. It is conceptually simple but does a lot of the model’s nonlinear work.

Typical shape choices use an expansion factor $m \gg d$ (often $m \approx 4d$):

\[\operatorname{FFN}(X) = W_2\,\sigma(X W_1 + b_1) + b_2\]

with $W_1 \in \mathbb{R}^{d \times m}$ and $W_2 \in \mathbb{R}^{m \times d}$.

Key points for inference:

  • The FFN is position-wise: it does not mix tokens (no $n\times n$ interaction). It transforms each token vector independently.
  • The compute is dominated by dense GEMMs ($n\times d$ by $d\times m$, then $n\times m$ by $m\times d$), which GPUs handle efficiently.
  • Gated variants (e.g., SwiGLU) change the exact formula but keep the same idea: expand $d \to m$, apply a nonlinearity/gate, then project back to $d$.
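A sketch of the plain (ungated) FFN with a tanh-approximate GELU, using random weights just to show the shapes and the position-wise property:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(X, W1, b1, W2, b2):
    """Position-wise FFN: expand d -> m, nonlinearity, project back m -> d."""
    return gelu(X @ W1 + b1) @ W2 + b2

n, d, m = 3, 4, 16                  # m = 4d expansion factor
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, m)), np.zeros(m)
W2, b2 = rng.normal(size=(m, d)), np.zeros(d)
X = rng.normal(size=(n, d))

Y = ffn(X, W1, b1, W2, b2)
# Position-wise: each output row depends only on the same row of X
Y0 = ffn(X[:1], W1, b1, W2, b2)
```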

4.7 Why this is “vector math” end-to-end

Every token position carries a hidden vector in $\mathbb{R}^d$.

  • Attention forms weighted sums of other tokens’ value vectors (contextual mixing).
  • MLP transforms each token vector nonlinearly (feature creation).
  • Residual paths preserve information flow and stabilize deep networks.

5. From final vectors to the next token

After $L$ blocks, we have $H \in \mathbb{R}^{n \times d}$. To predict the next token, we typically use the last position’s vector $h_n \in \mathbb{R}^d$.

Compute logits over the vocabulary:

\[z = h_n W_U + b, \quad W_U \in \mathbb{R}^{d \times |V|}\]

Many models tie weights so that $W_U \approx E^T$.

Convert logits to probabilities with temperature $T$:

\[p(y = v \mid \text{context}) = \operatorname{softmax}\left(\frac{z}{T}\right)_v\]

Then choose the next token:

  • Greedy: $\arg\max_v p(v)$
  • Sampling: sample from $p$, often with top-$k$, nucleus (top-$p$), or other constraints

This yields a new token ID $t_{n+1}$, appended to the context, and the loop repeats.
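The logits → probabilities → token step can be sketched in a few lines with temperature and top-$k$. This simple version keeps more than $k$ entries if logits tie exactly at the cutoff; production samplers handle that and add top-$p$, penalties, etc.:

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """Temperature + optional top-k sampling over next-token logits (sketch)."""
    z = logits / temperature
    if top_k is not None:
        cutoff = np.sort(z)[-top_k]               # k-th largest logit
        z = np.where(z < cutoff, -np.inf, z)      # drop everything below it
    p = np.exp(z - z.max())
    p /= p.sum()                                  # softmax over kept logits
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(p), p=p)), p

logits = np.array([2.0, 1.0, 0.5, -1.0])
tok, p = sample_next(logits, temperature=0.7, top_k=2,
                     rng=np.random.default_rng(0))
# only the top-2 tokens keep nonzero probability
```

Greedy decoding is the `temperature → 0` limit: the distribution collapses onto the argmax.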


6. Toy example: single-head causal attention

This is not a “real model,” just a numerically small example to make the vector flow concrete.

Assume three tokens: I, like, tea with embedding dimension $d=2$. Let their embeddings be:

\[x_1 = [1, 0],\quad x_2 = [0, 1],\quad x_3 = [1, 1]\]

Let’s make the attention projections the identity (so $Q=K=V=X$). The input matrix is:

\[X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}\]

6.1 Scores for position 3

For token 3, scores are dot products with all keys:

\[\text{raw scores} = [x_3\cdot x_1,\; x_3\cdot x_2,\; x_3\cdot x_3] = [1,\; 1,\; 2]\]

Scale by $\sqrt{d_k}=\sqrt{2}$ (with identity projections, $d_k = d = 2$):

\[s = [0.707,\; 0.707,\; 1.414]\]

Apply softmax to get attention weights (approx.):

\[a \approx [0.248,\; 0.248,\; 0.503]\]

6.2 Output vector for position 3

Now mix value vectors (here $V=X$):

\[h_3 = 0.248\,x_1 + 0.248\,x_2 + 0.503\,x_3 \approx [0.751,\; 0.751]\]

Interpretation: token 3 (“tea”) is represented by a vector that blends information from I, like, and itself, with the strongest weight on itself.

6.3 An ASCII “attention bar chart” for token 3

Token 3 attends to:

I     | ##########           (0.25)
like  | ##########           (0.25)
tea   | #################### (0.50)

6.4 From vector to logits (tiny vocabulary)

Suppose the model is deciding between next tokens: tea, coffee, and “.” (a period). Let their output embeddings (or columns of $W_U$) be:

\[e_{tea}=[1,1],\quad e_{coffee}=[0.8,0.7],\quad e_{.}=[0,0]\]

Use dot products as logits (a simplified tied-weights intuition):

\[z_{tea}=h_3\cdot e_{tea}=1.502\quad z_{coffee}=h_3\cdot e_{coffee}=1.127\quad z_{.}=0\]

After softmax, tea gets the highest probability, but coffee is also plausible.
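Checking §6.4’s arithmetic in NumPy (treating the output embeddings as rows, so the logits are dot products):

```python
import numpy as np

h3 = np.array([0.751, 0.751])        # mixed vector for "tea" from §6.2
E_out = np.array([[1.0, 1.0],        # output embedding for "tea"
                  [0.8, 0.7],        # "coffee"
                  [0.0, 0.0]])       # "."
z = E_out @ h3                       # dot-product logits: [1.502, 1.1265, 0]
p = np.exp(z - z.max())
p /= p.sum()                         # softmax: tea wins, coffee close behind
```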


7. What changes during real inference

Real deployments add a few important engineering layers:

  • KV cache: during autoregressive decoding, you don’t recompute $K$ and $V$ for all past tokens each step. You cache them per layer and only compute the new token’s projections.
  • Batching: serving stacks batch multiple requests/tokens to use GPU efficiently.
  • Precision: inference often uses FP16/BF16, sometimes INT8/FP8 quantization.
  • Decoding policy: sampling settings (temperature, top-$p$, repetition penalties) strongly affect outputs without changing model weights.

8. KV caching: decoding as append-only attention

The key observation is that in autoregressive decoding, at step $t$ you only need one new query (for the new position) but you need keys/values for all positions so far.

For each layer $\ell$, maintain cached matrices:

\[K^{(\ell)}_{\text{cache}} \in \mathbb{R}^{t \times d_k},\quad V^{(\ell)}_{\text{cache}} \in \mathbb{R}^{t \times d_v}\]

When a new token arrives, you compute only its projections and append:

\[k_t = x_t W_K,\; v_t = x_t W_V\quad\Rightarrow\quad K_{\text{cache}} \leftarrow [K_{\text{cache}};\, k_t],\; V_{\text{cache}} \leftarrow [V_{\text{cache}};\, v_t]\]

Then compute the new query $q_t = x_t W_Q$ and attend over the cached keys/values:

\[\alpha_t = \operatorname{softmax}\left(\frac{q_t (K_{\text{cache}})^T}{\sqrt{d_k}} + M_t\right) \in \mathbb{R}^{1\times t},\quad h_t = \alpha_t V_{\text{cache}} \in \mathbb{R}^{1\times d_v}\]

Here $M_t$ is the (trivial) causal mask for this single query row; in practice you just ensure the new token cannot attend to “future” positions (which do not exist yet).
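A single-layer, single-head sketch of the append-and-attend step (random weights, shapes only; a real stack maintains one K/V cache per layer and per head):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 4, 4
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

K_cache = np.empty((0, d_k))        # grows by one row per decoded token
V_cache = np.empty((0, d_k))

def decode_step(x_t):
    """Process one new token: append its k,v, attend its q over the cache."""
    global K_cache, V_cache
    k_t, v_t = x_t @ W_K, x_t @ W_V
    K_cache = np.vstack([K_cache, k_t])        # append, never recompute
    V_cache = np.vstack([V_cache, v_t])
    q_t = x_t @ W_Q                            # 1 x d_k, the only new query
    s = q_t @ K_cache.T / np.sqrt(d_k)         # 1 x t; no mask needed --
    a = np.exp(s - s.max())                    # future rows don't exist yet
    a /= a.sum()
    return a @ V_cache                         # 1 x d_v mixed output

for t in range(3):                             # three decode steps
    h = decode_step(rng.normal(size=(1, d)))
```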

8.1 ASCII timeline

Prefill (prompt length = n): compute all layers on all tokens

tokens:   [ 1   2   3  ...  n ]
cache K:  [k1  k2  k3  ...  kn]
cache V:  [v1  v2  v3  ...  vn]

Decode step n+1 (generate one token): compute only the new token’s q,k,v

new token:                  [ n+1 ]
append K:   [k1  k2  ...  kn | k(n+1)]
append V:   [v1  v2  ...  vn | v(n+1)]

attention for the new token:
q(n+1)  x  [k1..k(n+1)]^T  -> weights over positions 1..n+1  -> weighted sum of [v1..v(n+1)]

8.2 Why it matters

  • Compute: without caching, each new token would re-run attention over all previous tokens and recompute their $K,V$ repeatedly. With KV caching, you reuse $K,V$ and only do the “new query against all cached keys” work.
  • Asymptotics: prompt processing (“prefill”) is still quadratic in prompt length due to full self-attention, but token-by-token decoding avoids recomputing past projections and is much closer to linear-in-context per generated token.
  • Memory: caching costs memory proportional to context length, number of layers, and number of heads (this is often the main throughput limiter at long context).

9. What’s next

Now that you know what each inference step computes, see Intel TPP for LLM Inference for a deep dive into how these operations are accelerated on Intel CPUs — fused GEMM kernels, blocked weight layouts, flash attention with BRGEMM, and OpenMP parallelism.