16 Comments
Louis-François Bouchard

We spent weeks going back and forth on these patterns while writing this and mistake #1 still surprised me with how often it comes up in our community. Context window mismanagement is probably responsible for 80% of the "my agent stopped working" messages we get. Glad we finally put this into a structured framework people can reference!

Paul Iusztin

yes! Also, let's not forget about the elephant in the room: AI Evals

ToxSec

"These six mistakes are not exotic edge cases. They are the exact patterns that repeatedly break real agentic systems. Individually, they look small, but in production, they compound into disasters."

Excellent conclusion. Context mistakes are among the most common, and this article was really well written.

Paul Iusztin

yes! Plus, missing AI evals

Ayaz
Mar 23 (edited)

This is a banger. So helpful. Thank you.

Paul Iusztin

Here to help!

Myles Bryning

Good piece, but I still think most of the market is over-focused on orchestration mechanics. Context, planning, and evals matter, but for serious use the harder question is whether the system can show its evidence, replay its decision path, and surface drift when the underlying knowledge changes. Without that, you can end up with a very well engineered black box.

Denis Craciun

Good job, I agree with literally everything you said

Renan Liguori

Dude, this is spot on! I've been trying to reason with the board of directors about whether we should use agents or a simple workflow for specific processes. They simply don't have a clue about how these things work and assume AI will solve everything. Thank you for the post.

Markus

We're still building on LangGraph, but I've completely dropped feeding the "messages" list to the LLM. That list is now only a record of events, or something used for the UI, but it is never sent to the LLM. Each LLM call gets a deliberate choice of inputs, perhaps extracted from this messages sequence.

It doesn't need to know every JSON Schema detail from a tool call 4 messages ago.
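Roughly the pattern I mean, as a simplified sketch (not our actual code, and the helper name is made up):

```python
# Keep the full message list as an event log / UI record, but build each LLM
# call's input deliberately instead of replaying the whole history.

def build_llm_input(messages: list[dict], task: str, max_tool_results: int = 1) -> list[dict]:
    """Select only what this specific call needs from the full event log."""
    selected: list[dict] = []
    tool_results_kept = 0
    for msg in reversed(messages):  # walk backwards so the most recent items win
        if msg["role"] == "tool":
            if tool_results_kept < max_tool_results:
                selected.append({"role": "tool", "content": msg["content"]})
                tool_results_kept += 1
            # older tool payloads (full JSON schemas, raw dumps) are simply dropped
        elif msg["role"] == "user":
            selected.append(msg)
    selected.reverse()
    # the task framing goes in fresh, rather than relying on stale history
    return [{"role": "system", "content": task}] + selected
```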

For building planners I also found structured outputs nice at first, but they quickly became extremely token-intensive, noisy, and brittle, since different LLM providers understand different versions of JSON Schema and so on. I now generate a python-like DSL instead and use the ast module to parse it, so I don't need a sandbox to run anything. It proved to need less than a third of the tokens and to be much more robust.
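A stripped-down sketch of that parsing step (illustrative tool names, not our real planner):

```python
import ast

# The model emits python-like calls as plain text; we parse them with the ast
# module but never execute them, so no sandbox is needed.
PLAN_FROM_LLM = """
search(query="refinery outages europe")
summarize(source="search", max_words=120)
notify(channel="risk-desk")
"""

def parse_plan(plan_text: str) -> list[dict]:
    """Turn the python-like plan into (tool, kwargs) steps without running anything."""
    steps = []
    for node in ast.parse(plan_text).body:
        if not (isinstance(node, ast.Expr) and isinstance(node.value, ast.Call)):
            raise ValueError(f"unexpected statement: {ast.dump(node)}")
        call = node.value
        if not isinstance(call.func, ast.Name):
            raise ValueError("only simple tool names are allowed")
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
        steps.append({"tool": call.func.id, "args": kwargs})
    return steps

print(parse_plan(PLAN_FROM_LLM))
# [{'tool': 'search', 'args': {'query': 'refinery outages europe'}}, ...]
```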

I'm gradually making the individual POMDP steps more explicit rather than having a single LLM call handle all the components at once.

At this point I feel I am so far off from the typical agents tutorials and how the frameworks want you to do things that I sometimes doubt myself so I'm glad to read this :)

Pekka Pihlajasaari

There is a valuable insight in "Agents handle these open-ended scenarios. Most teams treat predictable problems as if they need agents. When you use an agent for a structured task, you pay for autonomy you do not need."

This raises the question of whether the team still considers a closed-form analytical framing of the problem at all. When you use an agent where an analytical solution is available, you pay for a language model you do not need (and for the non-determinism it brings).

This has happened before, when ML meant segmentation and fitting and when numerical integration was substituted for algebraic manipulation. It created a map from inputs to outputs that a more careful reflection on the structure of the data could have identified explicitly and solved in a direct fashion.

Knowband

Excellent breakdown of real production pitfalls that most teams only discover the hard way, especially around context management and overengineering. I like how each mistake ties directly to a practical fix, making it immediately actionable for anyone building agentic systems.

The Crude Reality

This diagnostic framework should be pinned at the top of every AI engineering team’s workspace. After a decade in energy trading and risk management — where I’ve spent the last two years building AI-assisted commodity analysis and risk pricing systems — every single one of these mistakes maps directly to failures I’ve either made or watched unfold in production.

Mistake #1 hit me hardest because I learned it the expensive way. When I first built an agentic workflow for energy market research, I did exactly what you describe — loaded the context window with everything: geopolitical data, refinery specs, tanker movements, historical price patterns, regulatory frameworks. The output quality degraded visibly as the window filled. The moment I restructured into a curated memory layer with selective retrieval — only surfacing what the agent needed for the specific decision at hand — token consumption dropped roughly 60% and output quality improved dramatically. Context engineering is curation, not accumulation. That single insight saved more money than any model swap.
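Roughly the shape of that restructuring, as an illustrative sketch (toy scoring and made-up memory items, not our production system):

```python
# Score memory items against the decision at hand and surface only the top-k,
# instead of loading everything into the context window.

MEMORY = [
    "Refinery X cut throughput 15% this week after a compressor failure.",
    "Tanker congestion at Port Y has eased over the last three days.",
    "2019 sulfur cap regulatory framework, full text.",
    "Historical crack spreads for NW Europe, 2015-2023.",
]

def overlap_score(query: str, doc: str) -> float:
    """Crude relevance proxy; a real system would use embeddings."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def retrieve(decision: str, k: int = 2) -> list[str]:
    """Surface only the memory items relevant to this specific decision."""
    ranked = sorted(MEMORY, key=lambda item: overlap_score(decision, item), reverse=True)
    return ranked[:k]

context = retrieve("How does the refinery throughput cut affect diesel pricing this week?")
print(context)  # only the curated slice goes into the prompt
```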

Your autonomy slider concept in Mistake #3 is the framework most teams are missing. In energy risk management, we face this exact spectrum daily. Price alert routing is a workflow — deterministic, predictable, no autonomy needed. But analysing how a novel supply chain disruption cascades through refinery feedstock availability into downstream product pricing — that requires genuine reasoning under uncertainty. The teams that try to solve both with the same architecture waste enormous resources on the predictable tasks and get unreliable results on the complex ones. Workflow-first, agent-only-when-necessary is exactly right.

The planning point in Mistake #5 connects to something I’ve observed in both trading systems and AI engineering: the difference between reactive and intentional systems. A trading algorithm that just responds to the last price tick without reference to an overall strategy will churn through commissions while going nowhere. An agent that just responds to the last tool output without reference to a goal does exactly the same thing with tokens. The parallel is precise — and the fix is identical. Define the objective, decompose the steps, evaluate progress, and only then select the next action.
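A tiny sketch of that fix (illustrative step names, not a real trading or agent system):

```python
# The intentional version chooses the next action with reference to the objective
# and the remaining plan, not just the last tool output.

def next_action(objective: str, plan: list[str], done: list[str], last_output: str) -> str:
    """Evaluate progress against the plan before selecting the next step."""
    remaining = [step for step in plan if step not in done]
    if not remaining:
        return f"finish: {objective}"
    if "error" in last_output.lower() and done:
        return f"retry: {done[-1]}"  # re-do the step that produced a bad output
    return f"next: {remaining[0]}"

plan = ["pull supply data", "estimate feedstock impact", "reprice downstream products"]
print(next_action("assess disruption impact", plan, ["pull supply data"], "ok"))
# next: estimate feedstock impact
```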

One dimension I’d add from building domain-specific agentic systems: Mistake #0 might be skipping domain knowledge architecture entirely. Before any of your six mistakes become relevant, most teams fail because they haven’t structured their domain knowledge into a retrievable, version-controlled layer that the agent can reference. In energy markets, that means indexed refinery complexity data, counterparty risk profiles, shipping route databases, and regulatory constraint libraries — all structured for semantic retrieval rather than dumped into context. Without that foundation, even a perfectly architected agent produces generic output because it lacks the specialised knowledge to reason about domain-specific problems. The knowledge layer is the prerequisite for everything else.

Exceptional framework. The gap between AI demos and AI in production is almost entirely explained by these patterns, and having them codified as a diagnostic checklist is genuinely valuable for anyone shipping real systems.

Devesh

The 'not listening' mistake is the one that kills you. We built a custom AI interaction — it kept breaking a core rule, writing the user's lines instead of letting them respond. After being told 5+ times to stop, the user opened a consumer competitor app. Said 'I thought no one could understand me better than you.'

Custom-built AI lost to a free generic app. Not on capability — on listening. The consumer app just followed instructions. Ours didn't. Same user, same day, said a different interaction was great. She doesn't hate AI. She hates AI that doesn't listen.

Comment removed (Mar 23)
Paul Iusztin

That sounds like a solid plan! I also strongly suggest thinking about AI evals