The LLM API Call Quietly Became an Agent Loop
One request now runs a server-side loop of model passes and tool calls. Here is what that buys you, and what it quietly costs.
- The major providers now let one API request run a loop on their side: search, run code, call a tool, read the result, and decide what to do next, all before you get a response.
- It is one request, not one model call. The orchestration harness that used to live in your code is moving inside the provider.
- You write less glue and give up some cost, latency, and observability control.
Since the first LLM APIs, the contract was simple. You sent a prompt; you got text back. If a tool was involved, you owned the harness. The model said "call search(q)," your code ran it, appended the result, and called the API again. Frameworks existed largely to manage that loop for you.
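For contrast, here is a minimal sketch of that client-side harness. The model is replaced by a hypothetical stub (`fake_model`) so the code runs offline; the message shapes, the `search` tool, and the stub's policy are all illustrative assumptions, not any vendor's API.

```python
from typing import Callable

# Hypothetical local tool: a real harness would hit a search backend here.
def search(q: str) -> str:
    return f"3 results for {q!r}"

TOOLS: dict[str, Callable[[str], str]] = {"search": search}

def fake_model(messages: list[dict]) -> dict:
    """Stub for one model call: ask for a search until a tool result is in
    context, then answer. A real model makes this decision itself."""
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "search", "arg": messages[0]["content"]}
    return {"type": "text", "text": "Answer based on: " + messages[-1]["content"]}

def run_harness(prompt: str, max_steps: int = 5) -> tuple[str, int]:
    """The loop you used to own: call the model, run the tool, append, repeat."""
    messages = [{"role": "user", "content": prompt}]
    for step in range(1, max_steps + 1):
        reply = fake_model(messages)            # one model call = one API request
        if reply["type"] == "text":
            return reply["text"], step          # done: answer plus number of calls made
        result = TOOLS[reply["name"]](reply["arg"])  # your code executes the tool
        messages.append({"role": "assistant", "content": f"call {reply['name']}({reply['arg']!r})"})
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted")

answer, calls = run_harness("Latest on the Mars rover?")
print(calls, answer)
```

Each pass through the loop is a separate API request made by your code; that is the loop hosted tools now run on the provider's side.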
That contract has changed. You now declare the tools and the provider runs the loop:
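```python
# Anthropic Messages API (Python SDK)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,  # required by the Messages API
    tools=[{"type": "web_search_20260209", "name": "web_search"}],
    messages=[{"role": "user", "content": "Latest on the Mars rover?"}],
)
```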
The model can search, read the results, search again, and only then answer. Every intermediate step runs on the provider's infrastructure, not yours.
Where the loop comes from
The pattern being absorbed here was formalized as ReAct (Reasoning + Acting; Yao et al., 2022). The model alternates three step types: a Thought (free-text reasoning about what to do next), an Action (a tool call with arguments), and an Observation (the tool result, fed back into context for the next Thought). Observations ground the model in retrieved facts, which is why the ReAct authors found fewer hallucinations than with chain-of-thought reasoning alone on their knowledge-intensive benchmarks. What is new in 2026 is not the loop but who runs it: the cycle has moved from your code into the provider's request handler.
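The Thought/Action/Observation cycle can be sketched in its original text form. The paper's question-answering setup wrote actions like `Search[entity]` and `Finish[answer]`; the scripted model turns, the tiny `ENCYCLOPEDIA` lookup table, and the exact line format below are simplifying assumptions, not the paper's code.

```python
import re

# Matches an action line in ReAct text style, e.g. "Action: Search[Mars rover]".
ACTION_RE = re.compile(r"^Action:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)

# Hypothetical knowledge source standing in for a real search tool.
ENCYCLOPEDIA = {"Perseverance": "Perseverance is a NASA Mars rover that landed in 2021."}

def act(name: str, arg: str) -> str:
    """Execute one action and return the Observation text."""
    if name == "Search":
        return ENCYCLOPEDIA.get(arg, f"No results for {arg}.")
    raise ValueError(f"unknown action {name}")

def react(model_turns: list[str], max_steps: int = 5) -> tuple[str, str]:
    """Drive a Thought -> Action -> Observation loop until Finish[...] appears.
    `model_turns` is a scripted stand-in for successive model outputs."""
    scratchpad = ""
    for turn in model_turns[:max_steps]:
        scratchpad += turn + "\n"             # Thought + Action from the "model"
        match = ACTION_RE.search(turn)
        if match is None:
            raise ValueError("model output had no Action line")
        name, arg = match.groups()
        if name == "Finish":
            return arg, scratchpad            # final answer plus the full trace
        # Append the Observation to the scratchpad a real model would read next.
        scratchpad += f"Observation: {act(name, arg)}\n"
    raise RuntimeError("no Finish action within budget")

turns = [
    "Thought: I should look up the rover.\nAction: Search[Perseverance]",
    "Thought: The observation says it landed in 2021.\nAction: Finish[2021]",
]
answer, trace = react(turns)
print(answer)
print(trace)
```

In a hosted loop, the parsing, tool execution, and scratchpad bookkeeping shown here happen inside the provider's request handler instead.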
This is real, and it is recent
All three major vendors ship some version of it. The claims below are drawn from each vendor's own documentation.
OpenAI built a new primitive for it. The Responses API is, in their own words, an agentic loop that can call web search, file search, code interpreter, computer use, image generation, and remote MCP servers within one request.
Anthropic added server-side tools to the Messages API: web search, code execution, web fetch, and tool search. It also added an MCP connector that handles remote-server connection, tool discovery, and error handling for you, work that previously meant writing your own client harness. Per the tool-use docs, these server tools run on Anthropic's infrastructure and return their results directly.
Google lets Gemini combine built-in tools (Google Search, Maps, code execution, URL context, file search) with your own functions in a single request, and circulate context across those calls so later steps can use earlier results.
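As a rough illustration of that combination, the sketch below builds a Gemini `generateContent` request body (REST JSON) that declares Google Search and code execution alongside one custom function. The field names follow the public REST examples as I understand them; the function `get_order_status`, its schema, and the prompt are hypothetical, and whether a given model accepts this mix should be checked against the current docs. Only the payload is built; nothing is sent.

```python
import json

def build_gemini_request(prompt: str) -> dict:
    """Body for a generateContent call mixing built-in and custom tools."""
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "tools": [
            {"google_search": {}},    # built-in: executed on Google's side
            {"code_execution": {}},   # built-in: sandboxed code execution
            {   # custom: the model can only request this; your code must run it
                "function_declarations": [{
                    "name": "get_order_status",  # hypothetical function
                    "description": "Look up an order in our own database.",
                    "parameters": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"],
                    },
                }]
            },
        ],
    }

body = build_gemini_request("Is order 42 delayed by the storm in the news today?")
print(json.dumps(body, indent=2))
```

Note the asymmetry: the two built-in entries can be executed by Google within the request, while `get_order_status` can only come back to you as a function call.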
The correction worth making
It is not "one LLM call." It is one API request wrapping many model passes and tool runs.
That distinction shows up on the invoice. On Gemini 3 models, you are billed per search query the model decides to run, so three searches inside one prompt count as three billable uses. Older Gemini models billed per prompt. The billing line is the hidden loop becoming visible.
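A back-of-the-envelope sketch of why the billing unit matters. Only the two counting rules (per search query vs per prompt) come from the text above; `PRICE_PER_UNIT` is a hypothetical placeholder, not Google's rate.

```python
def grounding_units(searches_per_prompt: list[int], per_query: bool) -> int:
    """Billable grounding units for a batch of prompts.

    per_query=True:  every search the model runs is billed (the Gemini 3 rule).
    per_query=False: each prompt is billed once, however many searches it ran.
    """
    if per_query:
        return sum(searches_per_prompt)
    return len(searches_per_prompt)

PRICE_PER_UNIT = 0.01  # hypothetical USD price, for illustration only

# Three prompts; the model chose to run 3, 1, and 2 searches respectively.
batch = [3, 1, 2]
new_units = grounding_units(batch, per_query=True)
old_units = grounding_units(batch, per_query=False)
print(new_units, old_units)  # → 6 3
print(round(new_units * PRICE_PER_UNIT, 2), round(old_units * PRICE_PER_UNIT, 2))
```

Under the per-query rule, the bill tracks how many steps the model chose to take, a number your code did not set.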
And the old endpoints did not disappear. OpenAI's Chat Completions API still exists and is still a reasonable default. Nothing forces your calls to become agentic; you opt in by declaring hosted tools in the request.
The line that separates signal from hype
There are two kinds of tools, and only one of them closes inside a single request.
Server, or hosted, tools such as web search, code execution, and the managed MCP connector run on the provider. The model calls them, the provider executes them, the loop completes, and you get the final answer. This is the part that is genuinely agentic in one request.
Client tools, your own function declarations, still bounce back to you. The model signals "call this function," your code runs it, you return the result. The provider cannot run your chargeCustomer(), so it cannot close that loop for you.
So "agentic in one call" is true for hosted tools and false for your own functions. Any summary that skips this is overselling.
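One way to see the split in practice: after a response comes back, only client tool calls need your code. The sketch assumes content blocks shaped like the Anthropic Messages API's, where hosted steps appear as `server_tool_use` plus a result block and your own functions appear as `tool_use`. The blocks are hand-written dicts, and `charge_customer` is a hypothetical stand-in for the chargeCustomer() example.

```python
def pending_client_calls(content: list[dict]) -> list[dict]:
    """Return the tool calls your code must execute before the next request.

    Hosted ("server_tool_use") calls were already executed by the provider and
    their results are already in `content`, so they are skipped.
    """
    return [block for block in content if block["type"] == "tool_use"]

# Hand-written example of one response's content blocks.
content = [
    {"type": "server_tool_use", "id": "srv_1", "name": "web_search",
     "input": {"query": "refund policy"}},
    {"type": "web_search_tool_result", "tool_use_id": "srv_1", "content": []},
    {"type": "text", "text": "Policy allows the fee; charging now."},
    {"type": "tool_use", "id": "call_1", "name": "charge_customer",
     "input": {"customer_id": "c_123", "amount_cents": 500}},
]

def charge_customer(customer_id: str, amount_cents: int) -> str:
    # Hypothetical local function: the provider cannot run this for you.
    return f"charged {customer_id} {amount_cents}"

LOCAL_TOOLS = {"charge_customer": charge_customer}

# Run each pending client call and build tool_result blocks for the next request.
tool_results = [
    {"type": "tool_result", "tool_use_id": call["id"],
     "content": LOCAL_TOOLS[call["name"]](**call["input"])}
    for call in pending_client_calls(content)
]
print(tool_results)
```

The hosted search closed inside the request; the charge did not, so one more round trip through your code is still required.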
What you trade for the convenience
From an operations standpoint, this is the part that matters more than the demo.
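The convenience is real; so is the loss of control over cost and latency (the model, not your code, decides how many steps to run) and over observability (the intermediate steps run on infrastructure you cannot instrument). Both belong on the same ledger.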
The portability cost is concrete. A request built around one vendor's hosted code interpreter or server-side web search does not move cleanly to another. Abstraction layers also lag the native APIs, which can force two-call workarounds for things the raw API does in one.
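A small sketch makes the portability cost concrete: even "turn on web search" is spelled differently per vendor. The shapes below reflect my reading of each vendor's docs (the Anthropic version string is the one used earlier in this piece) and may drift between API versions; treat them as assumptions to verify. The adapter, `hosted_web_search_tool`, is hypothetical.

```python
def hosted_web_search_tool(vendor: str) -> dict:
    """Return the tool entry that enables provider-hosted web search.

    The shapes are assumptions based on public docs and change with API versions.
    """
    shapes = {
        # Anthropic Messages API: versioned server tool type plus a name.
        "anthropic": {"type": "web_search_20260209", "name": "web_search"},
        # OpenAI Responses API: built-in tool selected by type.
        "openai": {"type": "web_search"},
        # Gemini generateContent (REST): built-in tool keyed by name.
        "gemini": {"google_search": {}},
    }
    try:
        return shapes[vendor]
    except KeyError:
        raise ValueError(f"no hosted web search mapping for {vendor!r}") from None

for vendor in ("anthropic", "openai", "gemini"):
    print(vendor, hosted_web_search_tool(vendor))
```

Mapping the request is the easy half; each vendor also returns search results in its own response structure, which is where abstraction layers tend to lag.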
How I would decide
Reach for the server-side loop when the work is provider-native: public web search, sandboxed code execution, reading URLs, connecting to a managed MCP server. The reduction in glue code is real.
Keep the loop in your own code when you need deterministic cost and latency, strong tracing, portability across vendors, or when the tools are your own functions hitting your own systems anyway. There, the hosted loop buys little and costs control.
The takeaway
Models could already use tools. What changed is that the provider is becoming the agent runtime, not just the model behind an endpoint, and it is absorbing part of the orchestration layer that frameworks were built to own.
That is a good trade when you want a managed agent and a poor one when you want a predictable system. Treat it as a boundary decision, billed by the step, on the provider's terms. Choose it on purpose, not by default.