
AI agents 7 min read

AI that acts

AI is moving from answering to doing: it books, writes, changes code. It's the most exciting step yet, and it works when the AI keeps the thread.

For a year, AI was good company. You asked, it answered, you decided. Now it takes over whole workflows: it plans, calls tools, runs step by step. The assistant becomes an actor.

That changes the stakes. While AI only suggests, a mistake shows up at once. Once it acts on its own, a mistake simply runs on, into the next step and the one after.

01

From answering to doing

An agent is an AI that acts on its own. It breaks a task into steps, picks tools, checks intermediate results and keeps going until the goal is reached. It sounds like the next logical step, and it is.
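As a rough illustration, that loop can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the `TOOLS` table, the fixed `plan`, and the trivial `check` replace what a real agent would delegate to a model and to real integrations.

```python
from typing import Callable

# Hypothetical tools: plain functions standing in for real integrations.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"notes on {query}",
    "write": lambda notes: f"draft based on {notes}",
}


def plan(task: str) -> list[str]:
    """Stand-in planner: a real agent would ask a model for the steps."""
    return ["search", "write"]


def check(result: str) -> bool:
    """Stand-in check: a real agent would validate the intermediate result."""
    return bool(result.strip())


def run_agent(task: str) -> str:
    current = task
    for tool_name in plan(task):      # break the task into steps
        tool = TOOLS[tool_name]       # pick a tool
        current = tool(current)       # act, feeding in the previous result
        if not check(current):        # check the intermediate result
            raise RuntimeError(f"step {tool_name!r} produced an unusable result")
    return current                    # all steps done: goal reached


print(run_agent("agent reliability"))
```

Note that each step consumes the output of the one before it. That hand-off is what makes it a chain.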

The capabilities are impressive. An agent can research, write code, test it, book an appointment, send an email. In the demo it looks like magic.

An agent turns one answer into a chain of actions.

And the challenge lives in that chain.

02

The math of the chain

Each single step an AI takes is quite good, yet rarely perfect. And small errors multiply once steps build on each other. It's plain probability, with an unpleasant effect.

77% success rate is left from five chained steps when each one succeeds 95% of the time.
Source: Multi-agent analysis, 2026

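At twenty steps and 95% each, you're down to about 36%, roughly a third. That assumes the steps fail independently of each other. More steps can lead to worse results even with better per-step accuracy: twenty steps at 97% each land near 54%, well below five steps at 95%. An early error poisons everything that comes after.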
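The arithmetic is easy to check yourself. The sketch below assumes every step succeeds or fails independently with the same probability, which is a simplification. The function name `chain_success` and the 97% comparison case are ours, added for illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps are independent and equally reliable.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must not be negative")
    return per_step ** steps


for p, n in [(0.95, 5), (0.95, 20), (0.97, 20)]:
    print(f"{n:>2} steps at {p:.0%} each: {chain_success(p, n):.1%}")

# Prints:
#  5 steps at 95% each: 77.4%
# 20 steps at 95% each: 35.8%
# 20 steps at 97% each: 54.4%
```

The last line is the uncomfortable one: a longer chain with more accurate steps still finishes below the short chain.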

An agent's reliability is decided between the steps.

03

What actually breaks

So why? The models themselves are strong. The problem sits in between, across the run of steps.

Containers restart and wipe the history. The state from step three is gone by step twelve. People call it state amnesia: the agent forgets what it was just doing. A hallucination from an early step is passed on unchecked and tips the result at the end.

78% of enterprise agent pilots failed in 2026, by industry analysis.
Source: Gartner, Q1 2026

Research by METR shows the limit from the other side: on tasks that take a person a few minutes, top models are near perfect. On tasks spanning several hours, the success rate drops steeply. The longer the path, the more it matters that the agent holds the thread.

04

Holding the thread

If reliability emerges between the steps, that's where the lever is: the agent needs the right state across every step. What matters has to persist, and each step needs exactly the right thing in front of it.
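One simple way to make state persist is to write a checkpoint after every step and reload it on start. The sketch below is a minimal illustration of that idea, not a description of any particular product. The JSON file, the `next_step` and `results` fields, and the function names are all assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write the checkpoint to a temp file, then swap it in with one rename,
    so a crash mid-write cannot leave a half-written checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_state(path: Path) -> dict:
    """Resume from the last checkpoint, or start fresh if none exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"next_step": 0, "results": []}


def run(steps, path: Path) -> dict:
    """Run each step with the results so far, checkpointing after each one."""
    state = load_state(path)
    while state["next_step"] < len(steps):
        i = state["next_step"]
        output = steps[i](state["results"])  # the step sees all earlier results
        state["results"].append(output)
        state["next_step"] = i + 1
        save_state(path, state)              # survives a restart from here on
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        checkpoint = Path(folder) / "agent_state.json"
        steps = [lambda prev: "outline", lambda prev: f"draft from {prev[-1]}"]
        print(run(steps, checkpoint)["results"])
```

If the process dies after step one, the next call to `run` picks up at step two with the first result still in hand. Results must be JSON-serializable in this sketch. A checkpoint only preserves what was written, so it does not catch a wrong result from an early step; that still needs a check of its own.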

This is where our work at Thinkery comes in. We're convinced an agent's reliability isn't born in the model but in the layer between the steps, the place where the thread is kept across the whole run.

An agent is only as good as what it still remembers between two steps.

05

Where this leads

Acting AI is here to stay. The question is whether we can trust it with workflows that last longer than a few minutes. That's decided by the reliability between the steps.

The sturdier that foundation, the more real work a person can truly hand off, without re-checking every step at the end. That's when an impressive demo becomes a tool you rely on.

Trust in acting AI grows with reliability across many steps.

Facing exactly this question? Let's talk.

Send us two sentences about your situation. You'll talk directly to the people who build and advise.