LLMs: The Fifth Act
You can be deeply familiar with AI and still be skeptical of “agentic AI.” I was.
For a while, I dismissed most “agentic” startups as VC-funded cron jobs. An LLM integration wrapped in a thin scheduler, rebranded as autonomy.
But there’s something real underneath the hype, and understanding it requires zooming out. LLMs haven’t evolved in one continuous line. They’ve moved in distinct acts, each driven by hitting a ceiling and finding a way around it.
We’re now entering their fifth act. The first four largely occurred in parallel, so forgive the linear framing that follows.
Act 1: Pretraining (The “Wow” Moment)
The “wow” moment came at different times for different people. For me, it was unmistakably ChatGPT running GPT-3.5.
It wasn’t perfect. But it took something culturally significant — the Turing Test — and made it feel instantly irrelevant. It changed how I searched for information. Natural language became a usable interface for the first time.
The pretraining era was defined by bigger models, more data, better outputs. Progress scaled predictably with size.
And then it got complicated. Compute became prohibitively expensive as models grew. High-quality text data started running out. GPUs developed years-long waitlists. The ceiling was a combination of technical, economic, and data constraints hitting at once.
Act 2: Post-Training (Shaping Models, Not Just Scaling Them)
The next leap came from post-training.
Instead of scaling via parameter count alone, models gained another vector for improvement: reinforcement learning, fine-tuning, and alignment techniques. In domains like mathematics and physics, the gains were significant — models improved sharply without growing larger. They became more usable without becoming “smarter”.
But post-training introduced its own problems. Benchmarks, the standard way to gauge model quality, became gameable. Performance was routinely overstated as models overfit to evaluation datasets, making it harder to know how much real progress was being made.
Still, the durable gains were real. Fine-tuning made it practical to adapt generic models for specific tasks at a fraction of the cost of training from scratch, and the era established that raw scale wasn’t the only path forward.
Act 3: Tool Use (Connecting Models to the Real World)
LLMs had another fundamental constraint: they were frozen in time.
A model trained today knows nothing about tomorrow. Training is expensive and infrequent, and most LLMs don’t support incremental updates. Virtually any fact can be invalidated by a future event. Models, therefore, become stale quickly.
A major part of the solution was tools.
Instead of forcing LLMs to know everything, they could be used as orchestrators — calling APIs, searching the web, running calculations, querying live data. Companies could expose their own private APIs via function calling and protocols like MCP, letting the model decide when and how to use them to retrieve information.
This began to shift LLMs from pure response generators into higher-level planners. For many users, it’s what made them a viable replacement for Google search.
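The orchestration pattern above reduces to a simple loop: the model decides whether a question needs live data, picks a tool, and the host executes it. Here is a minimal sketch with a stubbed model — every name in it (`get_weather`, `model_decide`, and so on) is illustrative, not any specific vendor’s API:

```python
def get_weather(city: str) -> str:
    """Stand-in for a live API the model can't answer from training data."""
    return f"72°F and sunny in {city}"

def web_search(query: str) -> str:
    """Stand-in for a web search tool."""
    return f"Top result for: {query}"

TOOLS = {"get_weather": get_weather, "web_search": web_search}

def model_decide(prompt: str) -> dict:
    """Stub for the LLM's tool-selection step. A real model returns a
    structured function call (name + JSON arguments) when it decides
    a question needs fresh information."""
    if "weather" in prompt.lower():
        return {"tool": "get_weather", "args": {"city": "Lisbon"}}
    return {"tool": "web_search", "args": {"query": prompt}}

def answer(prompt: str) -> str:
    call = model_decide(prompt)                   # model picks the tool
    result = TOOLS[call["tool"]](**call["args"])  # host executes it
    return result                                 # fed back for a final reply

print(answer("What's the weather today?"))  # → 72°F and sunny in Lisbon
```

The key design point is the division of labor: the model only ever emits a structured request, and the surrounding code (or an MCP server) owns the actual execution.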
Act 4: Inference-Time Scaling (Letting Models “Think”)
For a long time, LLMs were understood as purely Type 1 thinkers: fast, reactive, shallow. A commonly cited limitation was that unlike the human brain — which can simply think longer to reach a better answer — compute couldn’t be scaled at query time.
Type 2 thinking is different: deliberate calculation, planning, decomposing problems, iterating toward better answers. The hallmarks of careful reasoning, until recently, felt distinctively human.
Chain-of-thought reasoning changed that. Models could now “think” for longer, breaking problems into steps and iterating before responding. Scaling shifted from training time to query time — exchanging more compute for better answers.
Today, this shows up everywhere: fast versus extended thinking modes, variable reasoning depth, higher cost for harder problems. It’s also where the scope of what we trust LLMs to do expanded significantly.
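The fast-versus-extended trade can be made concrete with a toy best-of-n sketch: sample more candidate “reasoning paths” at query time and keep the one a verifier scores highest. The “model” here is just a noisy guesser at √2 — the names and the task are illustrative; the compute-for-quality pattern is the point:

```python
import random

def propose(rng: random.Random) -> float:
    """Stub for one sampled reasoning path: a noisy guess at sqrt(2)."""
    return rng.uniform(1.0, 2.0)

def score(x: float) -> float:
    """Verifier: how close does the candidate come to satisfying x*x == 2?"""
    return -abs(x * x - 2.0)

def solve(n_samples: int, seed: int = 0) -> float:
    """Spend n_samples of query-time compute, keep the best candidate."""
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n_samples)]
    return max(candidates, key=score)  # best-of-n selection

fast = solve(n_samples=4)      # "fast mode": little compute
slow = solve(n_samples=4000)   # "extended thinking": more compute
print(abs(fast * fast - 2), abs(slow * slow - 2))
```

Running it shows the extended-thinking answer landing closer to the target: same model, same question, better result — purchased purely with inference-time compute.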
Act 5: Agentic AI (From Prompting to Delegating)
Across the first four acts, more powerful models expanded what was possible, but the interaction model stayed broadly the same.
PROMPT ───→ THINKING / TOOL CALL ───→ ANSWER
Act Five is where that changes.
Before: User → Prompt → LLM → Response → Repeat
Now: User → Goal → Agent → Execution
Instead of asking for answers, you delegate outcomes.
Agents can plan tasks, use tools, iterate toward goals, and operate over extended periods of time. They can be triggered by schedules, events, or even other agents. They can run in parallel, with zero ongoing human input. Their task may be as simple as writing a daily report — one the user can review, or ignore entirely.
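Stripped to its skeleton, an agent is a loop: plan the next step from the goal and the history so far, execute it, feed the observation back in, and stop at a budget or a finish signal. A minimal sketch, with stubs standing in for the LLM planner and the tools (all names here are hypothetical):

```python
def plan_next_step(goal: str, history: list) -> str:
    """Stub for the LLM planner: pick the next action given progress so far.
    A real agent would prompt a model with the goal and the history."""
    steps = ["gather_data", "draft_report", "finish"]
    return steps[min(len(history), len(steps) - 1)]

def execute(action: str) -> str:
    """Stub for tool execution (an API call, a file write, a search)."""
    return f"done: {action}"

def run_agent(goal: str, max_steps: int = 10) -> list:
    history = []
    for _ in range(max_steps):           # hard budget cap: a basic cost control
        action = plan_next_step(goal, history)
        if action == "finish":
            break
        history.append(execute(action))  # each observation informs the next plan
    return history

print(run_agent("write the daily report"))
```

Note what is absent: a human in the loop. Nothing here waits for a prompt between steps, which is exactly what lets a scheduler, an event, or another agent trigger the whole run.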
The criticism that many of today’s implementations are just glorified cron jobs isn’t entirely wrong. But it misses the more important point: the way we interact with these systems is fundamentally changing, and we’re early in that shift.
The Five Acts of LLMs (At a Glance)
| Act | What Changed | Limitation |
|---|---|---|
| 1. Pretraining | Performance scaled with size. Conversational ability emerged | Static knowledge. Expensive retraining |
| 2. Post-Training | Models improved without growing. Domain capabilities strengthened | Benchmarks became gameable. Still static |
| 3. Tools & Search | Models began orchestrating APIs and live data | Still limited by prompt-response interaction |
| 4. Thinking | Models could reason step-by-step with more compute | Interaction model unchanged |
| 5. Agents | Shift from prompts to goals. Systems execute over time | Reliability, cost, and control remain unsolved |
The Pattern Has Already Played Out
The dismissal of LLMs as “just next-token prediction” aged poorly, fast.
If you work in software, you’ve lived it: AI went from autocomplete to writing a significant share of production code faster than most people predicted. As an industry, we moved up the abstraction stack almost without noticing — from writing functions, to describing logic, to reviewing output. Each step felt incremental until you looked back.
The tools available today are capable. The tools frontier labs are reportedly testing internally are far more so — many hours of uninterrupted, goal-directed work toward a defined outcome.
Act Five is following the same pattern. It looks crude from the inside. It won’t from the outside, looking back.
Act Five, Scene One
The constraints holding agents back today — runaway costs, unpredictable execution, unclear scope, compliance concerns — feel like engineering problems, not fundamental ones. Entire categories are already emerging around cost controls, execution reliability, and tighter guardrails. Adoption will still be uneven, because different use cases have different tolerances for failure — but that unevenness leaves space for early adopters to shape the technology’s trajectory.
Common early use cases — SDR workflows, marketing automation, intelligence gathering — are where most of today’s activity is concentrated. Much of it is exploratory, but some of it will evolve into something more substantial.
When I visited ClawCon, I met small business owners demoing setups that were rough and imperfect, but genuinely useful. The instruction manual doesn’t exist yet. People are writing it in real time — and the industry is building tools to support them (we’re working on our own piece of that at BehaviorGraph).
We’ve moved from questions and answers to the era of autonomy.
Act Five is just beginning. We’ll figure out what it’s worth the same way we always do: by using it.