AI that acts
AI is moving from answering to doing: it books, writes, changes code. It's the most exciting step yet, and it works when the AI keeps the thread.
For a year, AI was good company. You asked, it answered, you decided. Now it takes over whole workflows: it plans, calls tools, runs step by step. The assistant becomes an actor.
That changes the stakes. While AI only suggests, a person reads every answer, so a mistake shows up at once. Once it acts on its own, a mistake simply runs on, into the next step and the one after.
01
From answering to doing
An agent is an AI that acts on its own. It breaks a task into steps, picks tools, checks intermediate results and keeps going until the goal is reached. It sounds like the next logical step, and it is.
The capabilities are impressive. An agent can research, write code, test it, book an appointment, send an email. In the demo it looks like magic.
An agent turns one answer into a chain of actions.
And the challenge lives in that chain.
02
The math of the chain
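Each single step an AI takes is quite good, yet rarely perfect. And once steps build on each other, the small error rates compound: the chance that the whole chain succeeds is the product of the per-step chances. It's plain probability, with an unpleasant effect.
At twenty steps and 95% each, and assuming the steps fail independently, the chain succeeds only about 36% of the time, roughly a third. A longer chain can end up worse even when each step is more accurate: sixty steps at 98% each come out near 30%. An early error poisons everything that comes after.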
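To make the arithmetic concrete, here is a minimal sketch in Python. It assumes every step succeeds independently and with the same probability, which real agent runs only approximate. The function name `chain_success` is made up for this illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumption: each step succeeds independently with the same
    probability, so the chances simply multiply.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # Compare a short chain with a longer chain of more accurate steps.
    for per_step, steps in [(0.95, 20), (0.98, 60), (0.99, 20)]:
        rate = chain_success(per_step, steps)
        print(f"{steps} steps at {per_step:.0%} each: {rate:.1%}")
```

Twenty steps at 95% land near 36%, and sixty steps at 98% land near 30%. The longer chain loses even though each of its steps is more accurate.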
An agent's reliability is decided between the steps.
03
What actually breaks
So why? The models themselves are strong. The problem sits in between, across the run of steps.
Containers restart and wipe the history. The state from step three is gone by step twelve. People call it state amnesia: the agent forgets what it was just doing. The second failure is quieter: a hallucination from an early step is passed on unchecked and tips the result at the end.
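One way to picture "passed on unchecked" is to look at its opposite. The sketch below is an illustration, not a description of any particular agent framework: every step comes with a check, and the chain stops at the first output that fails it instead of handing that output to the next step. All names here (`run_chain`, `StepCheckFailed`, the toy steps) are hypothetical.

```python
from typing import Any, Callable, List, Tuple

Step = Callable[[Any], Any]
Check = Callable[[Any], bool]


class StepCheckFailed(Exception):
    """Raised when a step's output fails its check."""


def run_chain(start: Any, steps: List[Tuple[str, Step, Check]]) -> Any:
    """Run steps in order, checking each output before passing it on."""
    value = start
    for name, step, check in steps:
        value = step(value)
        if not check(value):
            # Stop here so a bad intermediate result cannot reach later steps.
            raise StepCheckFailed(f"step '{name}' produced {value!r}")
    return value


if __name__ == "__main__":
    # Toy steps, purely for illustration.
    steps = [
        ("parse quantity", lambda text: int(text), lambda n: n > 0),
        ("compute total", lambda n: n * 25, lambda total: total < 10_000),
    ]
    print(run_chain("3", steps))
```

A check like this only catches what it was written to catch. The point of the sketch is where it sits: between the steps.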
Research by METR shows the limit from the other side: on tasks that take a person a few minutes, top models are near perfect. On tasks spanning several hours, the success rate drops steeply. The longer the path, the more it matters that the agent holds the thread.
04
Holding the thread
If reliability emerges between the steps, that's where the lever is: the agent needs the right state across every step. What matters has to persist, and each step needs exactly the right thing in front of it.
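As a rough illustration of state that persists across steps, the sketch below saves a checkpoint after every completed step and resumes from it after a restart. It is a simplified example under stated assumptions, not any product's implementation: the state is a small JSON-serializable dictionary, the checkpoint is a local file, and the names `load_state`, `save_state` and `run` are invented for this example.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def load_state(path: str) -> State:
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"next_step": 0, "results": []}


def save_state(path: str, state: State) -> None:
    """Write to a temporary file, then swap it in, so a crash mid-write
    does not leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(steps: List[Callable[[State], Any]], path: str) -> State:
    """Run the steps in order, resuming after the last completed one."""
    state = load_state(path)
    for index in range(state["next_step"], len(steps)):
        # Each step sees the results of all earlier steps.
        state["results"].append(steps[index](state))
        state["next_step"] = index + 1
        save_state(path, state)
    return state
```

If the process dies at step twelve, the next call to `run` with the same path picks up at step twelve, with the result from step three still in `state["results"]`. Deciding which parts of that state each step should actually see is a separate problem that this sketch does not address.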
This is where our work at Thinkery comes in. We're convinced an agent's reliability isn't born in the model but in the layer between the steps, the place where the thread is kept across the whole run.
An agent is only as good as what it still remembers between two steps.
05
Where this leads
Acting AI is here to stay. The question is whether we can trust it with workflows that last longer than a few minutes. That's decided by the reliability between the steps.
The sturdier that foundation, the more real work a person can truly hand off, without re-checking every step at the end. That's when an impressive demo becomes a tool you rely on.
Trust in acting AI grows with reliability across many steps.