Middleware in Microsoft Agent Framework
By the time we have function tools, streaming, and a Foundry-backed agent in place, most of the interesting logic still lives in the agent itself: the system prompt, the tools, and the conversation. That is fine for a sample, but production agents quickly grow a second category of concerns that do not really belong in the prompt or the tools: logging every run, redacting PII before it leaves the process, blocking obvious abuse, timing function calls, retrying transient model errors, or rewriting the final response. Putting any of that into the agent definition makes the agent harder to read and harder to reuse.
Microsoft Agent Framework (MAF) handles this through middleware: small async functions or classes that wrap an agent’s execution across three layers. If you have written ASP.NET Core or Express middleware, the shape will feel familiar: a context object, a call_next callback, and the ability to do work before, after, or instead of the wrapped operation.
Three layers, three contexts
MAF exposes middleware at three points in the call stack. From outermost to innermost, they are:
- Agent middleware wraps a whole agent.run(...) call. It sees the input messages, the agent options, and the final result. It fires once per run.
- Function middleware wraps each tool invocation that happens during a run. If the agent calls three tools across two model turns, the function middleware fires three times.
- Chat middleware wraps each request to the underlying chat client. In a multi-turn tool-calling sequence, the model is called once for the initial response and once for each batch of tool results, so the chat middleware fires multiple times per run.
Each layer has its own context object. The shape is the same in each case (a mutable bag of state plus a call_next callback), but the fields differ:
| Middleware | Context type | Useful fields |
|---|---|---|
| Agent run | AgentContext | agent, messages, options, stream, result |
| Function call | FunctionInvocationContext | function, arguments, result, kwargs |
| Chat request | ChatContext | chat_client, messages, options, stream, result |
All three contexts also carry a metadata dictionary you can use to pass values between middleware layers, for example, a request ID set in agent middleware and read in function middleware.
A first agent middleware
The simplest useful middleware is a logging wrapper. It prints before and after the agent runs and otherwise leaves everything alone:
A few things worth pointing out. The middleware signature is fixed: an async function that takes the context and a call_next callable that takes no arguments. You mutate the context, and you await call_next() to continue down the chain. If you do not call call_next, the rest of the chain (including the agent itself) does not run, which is exactly how you implement blocking behavior later.
If you prefer a class, inherit from AgentMiddleware and implement process with the same signature:
Class-based middleware is the right choice when the middleware requires configuration or maintains its own state, such as a counter, a connection to an audit store, or a cache.
Function middleware: timing, kwargs, and retries
Function middleware is the layer that sees individual tool invocations. The context exposes the function being called, the validated arguments, and a mutable kwargs bag whose contents are forwarded to the tool at invocation time. That last part is what makes function middleware useful for injection: the model never sees these values, but the tool does.
The tool then receives those values via its own FunctionInvocationContext parameter:
This pattern keeps tenant IDs, user IDs, and request correlation IDs out of the prompt. The model has no business deciding the tenant ID, and you do not want it to make one up. Set it in middleware, read it in the tool.
The same layer is the natural place for timing and structured logging:
Because function middleware fires once per tool call, an agent that makes three tool calls in one run will print three lines. If you want a single summary per run, aggregate in the agent middleware and use metadata to share state.
You can also retry a failing tool by catching the exception around call_next:
Note the placement: this retries the tool, not the model. If the model itself is rate-limited, that is a chat-middleware problem.
Chat middleware: every model call
Chat middleware sits closest to the wire. It wraps each individual request to the chat client, meaning it runs not only for the initial user turn but also for the follow-up calls the model makes after each round of tool results.
This is where you put model-level concerns:
- Token counting and budget enforcement.
- Adding or rewriting system messages just before they go out.
- Retrying on 429 or transient 5xx responses.
- Logging the exact request/response payloads for audit.
The important mental model: in a run that involves two tool calls, you will typically see chat middleware fire three times (initial call, after-tool-1, after-tool-2), function middleware fire twice (the two tools), and agent middleware fire exactly once. If your numbers do not match that pattern, you are usually looking at a streaming or early-termination case.
Blocking and overriding results
The most powerful thing middleware can do is skip call_next entirely, or replace context.result after the fact. Both are first-class scenarios.
A blocking pattern, for example a guard against obviously sensitive prompts:
MiddlewareTermination short-circuits the rest of the chain and the agent. The caller still gets a normal AgentResponse, the one you set on context.result, so calling code does not need to know that anything was blocked.
A result-rewriting pattern, applied after the agent has produced its answer, is just as common. PII redaction is the classic example:
For streaming responses the shape is different: context.result is an async generator of AgentResponseUpdate chunks rather than a finished AgentResponse. You can detect which case you are in with context.stream and wrap the generator if needed:
The point of this whole pattern is that the agent definition stays clean. The agent does not know about redaction, and redaction does not know about the agent. You can add or remove the middleware without touching either.
Registration and ordering
You can register middleware at two scopes. Agent-level middleware is set when you construct the Agent and applies to every run. Run-level middleware is passed to a single agent.run(...) call and only applies to that call.
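The wrapping behavior can be demonstrated with a small self-contained composition sketch in plain Python (this is not MAF's internal code, just the same onion shape):

```python
import asyncio

# Each middleware records when it runs before and after the chain.
def make_middleware(name, trace):
    async def mw(context, call_next):
        trace.append(f"{name}:before")
        await call_next()
        trace.append(f"{name}:after")
    return mw

# middlewares[0] is outermost; the agent sits at the center.
async def run_chain(middlewares, context, agent):
    async def call(i):
        if i == len(middlewares):
            await agent(context)
        else:
            await middlewares[i](context, lambda: call(i + 1))
    await call(0)

trace = []
agent_level = [make_middleware(n, trace) for n in ("A1", "A2")]
run_level = [make_middleware(n, trace) for n in ("R1", "R2")]

async def fake_agent(context):
    trace.append("Agent")

# Agent-level middleware wraps run-level middleware, which wraps the agent.
asyncio.run(run_chain(agent_level + run_level, {}, fake_agent))
print(" -> ".join(trace))
```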
Ordering follows the wrapping rule. For agent-level [A1, A2] and run-level [R1, R2], the execution order is A1 → A2 → R1 → R2 → Agent → R2 → R1 → A2 → A1. Function and chat middleware follow the same wrapping principle at their respective layers. In practice, the order matters mostly for two cases: blocking middleware should come first so it runs before logging or timing; result-rewriting middleware should come last so it sees the finished result before anyone else.
A small but important detail: a single middleware function can be used at any of the three layers as long as its context type matches. If you do not annotate the context parameter, MAF cannot infer the layer. The decorator form makes this explicit:
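A sketch of the decorator form. The real decorators live in the agent_framework package; the trivial stand-in defined here only tags the layer so the sketch stays self-contained:

```python
# Stand-in for the MAF function-middleware decorator: the real one
# additionally registers the function with the framework.
def function_middleware(fn):
    fn._middleware_layer = "function"
    return fn

@function_middleware
async def log_tool_calls(context, call_next):
    # The decorator, not a type annotation, declares the layer here.
    await call_next()
```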
Use the decorators when you do not want type annotations, or when you want to be explicit about intent, regardless of annotations.
Summary
Middleware is the cleanest way to add cross-cutting behavior to a MAF agent: agent middleware for whole-run concerns, function middleware for tool-level concerns, and chat middleware for model-level concerns. The same three primitives, context, call_next, and an optional MiddlewareTermination, give you logging, redaction, blocking, retries, and result rewriting without changing the agent or the tools.
Note: the SDK has since renamed chat_client.create_agent(...) to chat_client.as_agent(...). The Multiple tools example has been updated to match. Other articles in this series also include changes to imports and constructors; see the client comparison article for the current set of clients and how to use them.