The reliability stack for LLM agents: tools and methods

A request can fail at three moments: before you send it, while it runs, or after it returns. Different tools and habits cover different moments. This is a directory grouped by what each one does.
Methods you apply yourself
You apply these for free, and they rule out several common failures before you reach for a tool.
- Pick the model that fits the request. A small fast model handles simple calls, and a larger one handles reasoning. One model for everything wastes budget on the easy calls and hits rate limits faster on the hard ones.
- Check compatibility before you switch models. Two models are rarely interchangeable, even under the same API. They differ on accepted parameters, tool handling, and context size, so a quick check before a swap saves a broken deploy.
- Pin explicit versions instead of moving aliases. An alias that repoints to the current model changes under you without warning, and a fixed version keeps your behavior stable.
Model references
You need model specs in one place to choose fast: context window, parameters, cost, capabilities.
- modelparams.dev is a community catalog of model parameters. We maintain it so you can compare models at a glance instead of opening ten documentation tabs.
Structured outputs and validation
Constraining the shape of a request or a response rules out most format errors before they reach the provider.
- Instructor returns validated, typed objects from an LLM using your schema, with automatic retries.
- Outlines guarantees schema-compliant output during generation rather than parsing it afterward.
- Pydantic defines and validates the data models the two tools above build on.
Repair and routing at runtime
A request that gets past prevention still breaks in production: a provider rate-limits you, a model got retired, a schema one provider accepts another rejects. Routing and repair keep the app up when that happens.
- Manifest is an open-source router that sends each request to the right model through custom routing rules. When a request fails, its self-healing layer catches the error, patches the request, and retries it automatically.
Guardrails
Content checks catch safety or policy issues in a response.
- Guardrails AI validates inputs and outputs against configurable rules like toxicity, PII, and format compliance.
- NeMo Guardrails adds programmable rails for topics, safety, and dialogue flow.
Observability and traces
A trace records what happened on every request. You see what broke, and you fix it with the runtime tools above.
- Langfuse traces every LLM call, tool invocation, and latency in a timeline, open source.
- Arize Phoenix gives open-source tracing and evaluation with strong support for RAG and multi-step agents.
- Datadog LLM Observability brings LLM traces, errors, and cost into the same platform as the rest of your infrastructure.
Evaluation and regression testing
Model and prompt changes drift in quality. A test suite surfaces the drop before it reaches production.
- Promptfoo replays a set of test cases against your prompts and models from a config file, and wires into CI.
- Braintrust scores prompt and model changes and can block a deploy when quality degrades.
How the pieces fit
Each category covers a different moment. Methods and catalogs help you choose before you send. Structured outputs constrain the shape. Routing and repair catch what still breaks in flight. Observability and evals tell you what to fix at the source. Coverage at each moment rarely comes from one product.