Agentic LLMs and Mixture-of-Experts (MoE)
Why this post
The companion post on inference focuses on the token-by-token forward pass: embeddings, attention, FFN/MLP, logits, decoding, and KV caching.
This post extends that perspective in two directions:
- Agentic LLM systems: how we wrap the base model in a control loop (plan/act/observe) and why that changes “inference” at the system level.
- Mixture-of-Experts (MoE): how some models modify the FFN into a sparse set of expert FFNs routed per token, and what that implies for vector flow and serving.
Part A — Agentic LLMs (system-level inference)
What “agentic” means
An agentic LLM system is not just “one forward pass.” It is a controller that repeatedly:
- reads/updates a state (conversation + intermediate results),
- chooses the next action (think/tool-call/respond),
- executes actions in the outside world (tools, APIs, code, retrieval),
- and loops until it decides to stop.
The core LLM is still autoregressive, but the application defines additional transitions between model calls.
The basic agent loop
Inputs:
- user request
- system/developer instructions
- tool schemas + tool results
- optional memory / retrieved context
Loop:
[State St] -> (LLM call) -> action at
     |                          |
     |                          +--> a tool call (function name + JSON args)
     |                          +--> a message to the user
     |                          +--> a plan update / self-check / stop
     v
execute action -> observe result -> update state -> [State St+1]
A practical way to view this is that agentic inference is a policy over actions, where the “environment” includes tools and persistent state.
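This loop can be sketched in a few lines of Python. Everything here is illustrative: `llm` is a stubbed policy (a real system would call a model API), and `TOOLS` is a toy registry; only the loop structure carries over.

```python
# Minimal agent loop: the LLM acts as a policy over actions,
# tools act as the environment. All names here are illustrative stubs.

def llm(state):
    # Stub policy: call the tool once, then answer using its result.
    if "observation" not in state:
        return {"type": "tool_call", "name": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "text": f"The sum is {state['observation']}"}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(user_request, max_steps=5):
    state = {"request": user_request}
    for _ in range(max_steps):          # explicit stop condition
        action = llm(state)             # one LLM call = one action
        if action["type"] == "final":
            return action["text"]
        result = TOOLS[action["name"]](**action["args"])  # execute in the "environment"
        state["observation"] = result   # update state -> next step sees the result
    return "stopped: max_steps reached"
```

Note that `max_steps` is the simplest possible guardrail against over-iteration; later sections discuss richer ones.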
What changes compared to a single LLM completion
1) The unit of work becomes a step, not a token.
A single step may involve:
- one LLM completion producing a tool call,
- a tool execution (search, database query, code run),
- and another LLM completion that uses the tool output.
So latency and cost become dominated by the number of loop iterations and tool round-trips, not only tokens.
2) The context grows with structured artifacts.
Besides user text, the prompt may contain:
- tool call JSON,
- tool outputs (tables, logs),
- intermediate plans,
- citations/notes,
- memory snippets.
Tokenization and attention/KV cache still apply, but the prompt content distribution changes.
3) Failure modes shift from “wrong next token” to “wrong action”.
Typical errors:
- choosing the wrong tool,
- calling a tool with incorrect arguments,
- missing an iteration (stopping too early),
- over-iterating (getting stuck),
- misinterpreting tool output.
This is why agent systems often include:
- explicit stop conditions,
- guardrails (schema validation, tool allowlists),
- deterministic tool invocation constraints,
- self-check or verification steps.
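The schema-validation guardrail can be as simple as checking a proposed tool call against an allowlist and an expected argument shape before executing it. This is a hand-rolled sketch with a made-up schema format; production systems typically use a proper JSON Schema validator.

```python
# Guardrail sketch: validate a proposed tool call before execution.
# Tool names and the schema format are illustrative, not a real standard.

SCHEMAS = {
    "median": {"numbers": list},   # tool name -> required arg names and types
}

def validate_tool_call(call):
    schema = SCHEMAS.get(call.get("name"))
    if schema is None:
        return False, "tool not in allowlist"
    args = call.get("args", {})
    if set(args) != set(schema):
        return False, "argument names do not match schema"
    for key, typ in schema.items():
        if not isinstance(args[key], typ):
            return False, f"argument {key!r} has wrong type"
    return True, "ok"
```

Rejected calls can be fed back to the model as an observation ("invalid arguments, try again") rather than executed, which keeps the loop deterministic at the tool boundary.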
A minimal “ReAct-style” trace (toy)
User: Find the median of these numbers: 3, 10, 4, 9
Step 1 (LLM): decides to use a calculator tool
Tool call: median([3,10,4,9])
Tool result: 6.5
Step 2 (LLM): responds with explanation
Answer: Sort -> [3,4,9,10], median = (4+9)/2 = 6.5
Nothing about attention/FFN changes, but system-level inference is now a sequence of LLM calls coupled with deterministic computation.
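The deterministic half of that trace is ordinary code; Python's standard library already provides the tool from the example:

```python
from statistics import median

# The tool call from the trace above: median([3, 10, 4, 9]).
# Sorted: [3, 4, 9, 10]; even count, so the median is (4 + 9) / 2.
result = median([3, 10, 4, 9])
print(result)  # -> 6.5
```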
Mental model: agentic = “LLM + state machine”
If you like formalizations, treat the LLM as a stochastic policy $\pi_\theta(a\mid s)$ over actions $a$ conditioned on state $s$ (the prompt). The agent loop implements the transition function $s' = f(s, a, \text{obs})$.
Part B — Mixture-of-Experts (MoE) as a sparse FFN
Where MoE fits in the Transformer
In a dense Transformer, each block has:
- Multi-Head Attention (mix tokens)
- FFN/MLP (mix features within each token)
MoE typically replaces the dense FFN with a sparse collection of FFNs (experts), selected per token.
Dense FFN vs MoE FFN (ASCII)
Dense FFN (per token x):
x in R^d -> W1 (d->m) -> nonlinearity -> W2 (m->d) -> y in R^d
MoE FFN (per token x):
x in R^d -> router -> choose top-k experts
|-> Expert e1 FFN(x)
|-> Expert e2 FFN(x)
...
weighted sum -> y in R^d
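In NumPy terms, the dense path is just two matrix multiplies per token. The sizes and random weights below are placeholders; only the shapes matter, and they follow the diagram ($d$ is the model width, $m$ the FFN hidden width).

```python
import numpy as np

d, m = 8, 32                      # model width, FFN hidden width (illustrative)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d, m))      # d -> m
W2 = rng.normal(size=(m, d))      # m -> d

def dense_ffn(x):
    h = np.maximum(x @ W1, 0.0)   # nonlinearity (ReLU here)
    return h @ W2                 # back down to R^d

x = rng.normal(size=(d,))
y = dense_ffn(x)
print(y.shape)  # (8,)
```

Each MoE expert is exactly this shape of computation; what MoE adds is the router that decides which experts run.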
The router: how experts are chosen
Let $x \in \mathbb{R}^d$ be a token’s hidden vector entering the FFN position.
A router (a small linear layer) computes logits over experts:
\[r = x W_R + b_R, \quad W_R \in \mathbb{R}^{d \times E}\]
Convert to routing weights (often via softmax), then keep only the top-$k$ experts:
\[p = \operatorname{softmax}(r), \quad \text{select } \mathcal{T}=\text{TopK}(p, k)\]
The MoE FFN output is:
\[y = \sum_{e\in \mathcal{T}} g_e \cdot \operatorname{FFN}_e(x)\]
where $g_e$ are the (possibly renormalized) gates for the selected experts.
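These three equations translate directly to NumPy. The router parameters and the expert FFNs are placeholders here (any callables from $\mathbb{R}^d$ to $\mathbb{R}^d$ work), and gates are renormalized over the top-$k$ as in the toy example below.

```python
import numpy as np

def moe_ffn(x, W_R, b_R, experts, k=2):
    """MoE FFN for a single token x in R^d.

    W_R: (d, E) router weights; b_R: (E,) router bias;
    experts: list of E callables, each mapping R^d -> R^d.
    """
    r = x @ W_R + b_R                       # router logits, shape (E,)
    p = np.exp(r - r.max()); p /= p.sum()   # softmax over experts
    top = np.argsort(p)[-k:]                # indices of the top-k experts
    g = p[top] / p[top].sum()               # renormalized gates, sum to 1
    return sum(g_e * experts[e](x) for g_e, e in zip(g, top))
```

A quick sanity check: if every expert is the identity, the output must equal the input, because the gates sum to 1.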
Toy routing example (numbers)
Suppose we have $E=4$ experts and choose $k=2$.
Router produces probabilities:
Experts: e1 e2 e3 e4
p(e|x): 0.05 0.60 0.10 0.25
Top-2: e2 e4
If we renormalize gates over the top-2, we get:
- $g_{e2} = 0.60 / (0.60+0.25) \approx 0.706$
- $g_{e4} = 0.25 / (0.60+0.25) \approx 0.294$
Then:
\[y \approx 0.706\,\operatorname{FFN}_{e2}(x) + 0.294\,\operatorname{FFN}_{e4}(x)\]
Why MoE is useful
Intuition: MoE increases parameter count (many experts) without paying the full dense compute every token.
- Compute: each token only runs $k$ experts rather than all $E$.
- Capacity: total parameters increase, which can improve quality at similar FLOPs.
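A back-of-the-envelope calculation makes the compute/capacity split concrete. The hidden sizes below are illustrative, not those of any particular model, and the FLOP count ignores the (small) router cost.

```python
# Rough per-token FFN cost: 2*d*m multiply-adds per weight matrix pass.
d, m = 4096, 14336          # illustrative hidden sizes
E, k = 8, 2                 # 8 experts, each token routed to 2

dense_flops = 2 * (d * m + m * d)   # one dense FFN (W1 then W2)
moe_flops   = k * dense_flops       # each token runs only k expert FFNs
moe_params  = E * 2 * d * m         # but all E experts must be stored

print(moe_flops / dense_flops)      # 2.0: compute grows ~k x
print(moe_params / (2 * d * m))     # 8.0: parameters grow E x
```

The gap between the two ratios ($k$ vs $E$) is the whole point: parameter capacity scales with $E$ while per-token compute scales only with $k$.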
Serving implications (the “systems” side of MoE)
MoE changes inference engineering in ways that look different from dense models:
- Token-to-expert dispatch: at each layer, tokens are partitioned by chosen experts. That introduces data movement and synchronization overhead.
- Load balancing: if many tokens choose the same expert, that expert becomes a hotspot. Training often uses auxiliary losses to encourage balanced routing; inference still sees skew.
- Batching interaction: batching helps dense GEMMs; MoE can fragment batches because tokens route to different experts.
- Latency variance: routing decisions can produce variable per-step work depending on the distribution of experts and capacity limits.
A simplified view of the per-layer MoE FFN step:
Given a batch of token vectors X in R^{n x d}:
router(X) -> per-token expert IDs
group tokens by expert
run each expert FFN on its token group
scatter results back to original token order
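The group/run/scatter steps above can be sketched in NumPy. To isolate the gather/scatter pattern, routing is faked with precomputed top-1 expert IDs, and the experts are trivial callables rather than real FFNs.

```python
import numpy as np

def moe_dispatch(X, expert_ids, experts):
    """Run each token's assigned expert (top-1 routing for simplicity).

    X: (n, d) token vectors; expert_ids: (n,) chosen expert per token;
    experts: list of callables mapping a (g, d) group to a (g, d) result.
    """
    Y = np.empty_like(X)
    for e, expert in enumerate(experts):
        idx = np.where(expert_ids == e)[0]   # group tokens by expert
        if idx.size:
            Y[idx] = expert(X[idx])          # batched expert FFN on the group
    return Y                                 # results scattered back in token order

X = np.arange(6, dtype=float).reshape(3, 2)
ids = np.array([1, 0, 1])                    # token -> expert assignment
experts = [lambda x: x * 10, lambda x: x + 1]
print(moe_dispatch(X, ids, experts))
```

Note how the per-expert group sizes (`idx.size`) depend entirely on routing: this is exactly where load imbalance and batch fragmentation show up in real serving.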
MoE and KV cache
KV caching is primarily about attention (keys/values). MoE is about the FFN.
- KV cache does not eliminate the need to run the MoE FFN for each generated token.
- However, the decode stage still benefits from KV cache in attention, while MoE affects the per-token FFN compute and dispatch overhead.
Putting it together: two orthogonal ideas
It’s helpful to separate concerns:
- Agentic: changes how many model calls you make and what goes into the prompt (system-level loop).
- MoE: changes what happens inside a model call (sparse FFN per layer).
They can be combined: an agentic application may call an MoE model many times, and then your performance/safety concerns include both the agent loop behavior and MoE routing/serving characteristics.
Suggested next reading / extensions
If you want to extend this post further (still keeping it professional and concise), good additions are:
- A short section on tool-call schemas and why constrained decoding matters for reliable actions.
- A short section on MoE load balancing (auxiliary loss, capacity factor) and its inference consequences.