
Most teams assume their agent is slow because the LLM is slow.
That's almost never where the time goes.
At TinyFish, we've seen this pattern across hundreds of production deployments. A team builds an agent, integrates it with web tools, and measures end-to-end latency at 15 to 25 seconds per turn. They look at the LLM call. One to three seconds for reasoning. They look at their application logic. Milliseconds.
Then they look at the web infrastructure layer and find 10 to 20 seconds they didn't account for.
The LLM is waiting on the web. Not the other way around.

This post breaks down where that time actually lives, how we categorize it, and what we've done in TinyFish's architecture to compress the parts that are compressible.
A typical agent turn that involves the web ("find information about X, then go verify it on the source page") requires multiple web operations in sequence. Each operation has its own latency profile. A realistic waterfall in a multi-tool stack looks like this:
Search call. 200ms to 1,000ms depending on the provider and search depth. Exa's standard search runs around 450ms, though their newer Instant mode brings this under 200ms. Tavily's basic depth is comparable, with fast and ultra-fast modes trading result quality for speed. TinyFish's Search API returns real-time structured JSON. This is the one layer where latency is usually acceptable across the board.
Page fetch. Two to six seconds for a typical dynamic page. This includes DNS resolution, TLS handshake, the initial HTML response, JavaScript execution, and waiting for dynamically loaded content to render. Pages with heavy client-side rendering (React apps, SPAs, infinite scroll) sit at the upper end. Tools like Firecrawl handle the rendering, but the physics of loading a modern webpage don't compress easily. TinyFish's Fetch API runs on the same browser infrastructure as our Web Agent, which means a fetch operation can share an already-warm session rather than spinning up a new one.
Browser session cold start. Five to ten seconds for a fresh session on many cloud browser setups. Independent testing of major providers consistently shows initialization times in this range, even when providers themselves claim "no cold starts." The overhead from session isolation, proxy configuration, stealth setup, and CDP connection establishment adds up. TinyFish maintains cold starts under 250 milliseconds through pre-warmed browser pools. We'll go deeper on why this number matters so much below.
Agent navigation. Three to ten seconds per page interaction once the browser session is running. Each step (clicking a button, filling a form field, waiting for content to load) involves an LLM reasoning call plus the browser execution time. A five-step workflow on a single site takes 15 to 50 seconds depending on page complexity. This is where TinyFish's codified learning architecture makes the biggest difference, and we'll explain how in the intelligence-bound latency section.
Total for a single "search then verify" turn. In a fragmented multi-tool stack, you're looking at roughly 10 to 20 seconds of infrastructure latency alone. No application logic has run yet. In TinyFish's unified infrastructure, the same workflow typically completes in 2 to 5 seconds because the transitions between search, fetch, browser, and agent happen internally, without network boundaries or re-authentication.
For a multi-site workflow (say, checking pricing across 10 competitor sites), multiply accordingly. In sequential execution, that's minutes. With TinyFish's parallel execution (up to 1,000 concurrent sessions), total wall-clock time equals the slowest single task.
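The "wall-clock time equals the slowest single task" claim is just concurrent fan-out. Here is a minimal sketch in Python's asyncio; `check_site` is a hypothetical stand-in for a real per-site agent task, with simulated durations:

```python
import asyncio
import random
import time

async def check_site(site: str) -> dict:
    # Stand-in for a real per-site agent task (search, navigate, extract).
    # Simulated duration; real tasks vary with page complexity.
    duration = random.uniform(0.01, 0.05)
    await asyncio.sleep(duration)
    return {"site": site, "seconds": duration}

async def check_all(sites: list[str]) -> list[dict]:
    # Launch every site check concurrently: wall-clock time is bounded
    # by the slowest single task, not the sum of all tasks.
    return await asyncio.gather(*(check_site(s) for s in sites))

sites = [f"competitor-{i}.example" for i in range(50)]
start = time.perf_counter()
results = asyncio.run(check_all(sites))
elapsed = time.perf_counter() - start
slowest = max(r["seconds"] for r in results)
print(f"50 sites in {elapsed:.2f}s; slowest task {slowest:.2f}s")
```

Run sequentially, the same 50 tasks would take the sum of all durations; run concurrently, the total tracks the single slowest task plus scheduling overhead.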
Not all latency is created equal. We've found it useful to categorize web infrastructure latency into three types, because each one responds to different interventions.

Some latency can't be compressed because it reflects real physical constraints. DNS resolution, TLS negotiation, and network round trips take time that no software optimization eliminates. A page that loads 3MB of JavaScript and makes 40 XHR requests will take time to render regardless of your infrastructure.
This category is small in absolute terms (typically under one second for network overhead) but creates a hard floor. Even with optimal infrastructure, you can't serve a fully rendered dynamic page faster than the page itself allows.
TinyFish's approach: we don't try to fight physics. Instead, TinyFish's architecture is designed to avoid unnecessary page loads. When the agent has already fetched a page's content via the Fetch API, it doesn't re-load it in the Browser API. Shared state across primitives means the system never does the same work twice.
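The "never do the same work twice" idea is essentially a content store shared across primitives. A toy illustration (the class name and placeholder content are ours, not TinyFish's API): a second request for the same URL is served from the store instead of triggering another page load.

```python
class SharedPageStore:
    """Illustrative cache shared across primitives: once a page's
    content has been fetched, later operations reuse it instead of
    re-loading the page."""

    def __init__(self):
        self._pages: dict[str, str] = {}
        self.fetch_count = 0

    def fetch(self, url: str) -> str:
        # Only pay the page-load cost (simulated here) on a cache miss.
        if url not in self._pages:
            self.fetch_count += 1
            self._pages[url] = f"<html>content of {url}</html>"  # placeholder
        return self._pages[url]

store = SharedPageStore()
html_1 = store.fetch("https://example.com/pricing")  # real fetch
html_2 = store.fetch("https://example.com/pricing")  # served from cache
print(store.fetch_count)  # → 1
```

Physics-bound latency is paid once per page; the design goal is simply to never pay it twice for the same content.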
This is latency created by how your stack is assembled, not by fundamental constraints. Most of the waste lives here.
Cold starts. A browser cold start of five to ten seconds is not a physics problem. It's an infrastructure provisioning problem. TinyFish solves this with pre-warmed browser pools that maintain sub-250ms cold starts, consistently, under production load. This is the measured P50 across millions of monthly sessions, including proxy initialization and anti-bot configuration. The 20x to 40x difference between our 250ms and a typical five-to-ten-second cold start compounds across every session your agent opens.
Cross-service serialization. When your agent calls a search API, parses the response, then makes a separate call to a different fetch API, then opens a separate browser session, each transition involves a network round trip, authentication handshake, and response parsing. In TinyFish, search, fetch, browser, and agent execution share the same underlying infrastructure. A search result flows directly into a browser session. A fetched page informs the agent's next action. No serialization boundaries, no re-authentication, no wasted round trips.
Session re-establishment. In multi-tool stacks, agents frequently lose context between operations. A browser session from one provider can't carry state into an operation on another. The agent re-authenticates, re-navigates, and effectively repeats work. TinyFish's unified session model preserves state across all four primitives (search, fetch, browser, agent), so the agent never loses its place.
This is latency from LLM reasoning during agent execution. Each time an agent observes a page, decides what to do, and plans its next action, there's an inference call. At current model speeds, that's 500ms to 2 seconds per decision point.
The general concept for compressing this is sometimes called "codified learning": converting navigation patterns that the agent has successfully executed into deterministic code paths that bypass LLM inference on future runs.
The idea is straightforward. The first time an agent navigates a particular site, it reasons through every step. It interprets the page layout, decides which element to click, figures out what changed after the click. Each of those decisions requires an LLM call. But if the system records what worked, the same navigation pattern doesn't need to be re-reasoned next time.
Our implementation works at the node level. TinyFish's agent architecture breaks each workflow into a graph of small steps ("nodes"), each with typed inputs and outputs. When a node's input-output mapping has been validated across multiple runs, it gets codified and stored as deterministic code. If the site changes and the codified path fails, the system falls back to LLM reasoning automatically.
What this means in practice: the first run against a new site uses full LLM reasoning at every step. By the tenth run, most steps are codified. The latency per step drops from 500ms-2s (LLM call) to single-digit milliseconds (code execution). Agent operations get faster and cheaper with use. And because structurally similar patterns across different sites can share codified logic, the entire fleet benefits as any individual agent learns.
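The node-level mechanism can be sketched in a few lines. This is our simplified illustration, not TinyFish's implementation: the validation threshold, node names, and the millisecond sleep standing in for inference are all invented for the example.

```python
import time

class Node:
    """One small workflow step. After its behavior is validated across
    several runs, the step is 'codified': executed as a recorded
    deterministic action instead of fresh LLM reasoning."""

    VALIDATION_RUNS = 3  # illustrative threshold, not TinyFish's actual value

    def __init__(self, name: str):
        self.name = name
        self.successes = 0
        self.codified_action = None

    def run(self, page_state: str) -> str:
        if self.codified_action is not None:
            try:
                return self.codified_action(page_state)  # millisecond path
            except Exception:
                self.codified_action = None  # site changed: fall back below
        result = self._reason_with_llm(page_state)       # slow path
        self.successes += 1
        if self.successes >= self.VALIDATION_RUNS:
            # Record the validated input-output mapping as code.
            self.codified_action = lambda state: f"{self.name}:{state}"
        return result

    def _reason_with_llm(self, page_state: str) -> str:
        time.sleep(0.001)  # stand-in for a 500ms-2s inference call
        return f"{self.name}:{page_state}"

node = Node("click_login")
for _ in range(5):
    result = node.run("home")
print(node.codified_action is not None)  # → True
```

The key property is the automatic fallback: when the codified path throws because the page changed, the node silently reverts to reasoning and can re-validate a new path.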

The three categories above explain where latency comes from. The question is how much of it is actually compressible when you address all three simultaneously.
Parallel execution handles the multi-site dimension. When you need to check 50 sites, sequential execution takes the sum of all task times. With TinyFish's parallel execution, total time equals the slowest single task. The infrastructure handles session coordination, result aggregation, and failure recovery. For competitive price monitoring, what used to be hours becomes minutes.
But parallelization alone doesn't solve everything. Within each individual task, the steps are inherently sequential. Your agent needs to search, then navigate to a page, then log in, then extract data. Each step depends on the one before it. Parallelizing across tasks doesn't help the latency of any single task.
That's where the architecture-bound and intelligence-bound optimizations compound. A single task in a fragmented stack might look like this:
Search (450ms) + fetch cold start (5s) + fetch (3s) + browser cold start (7s) + 5 navigation steps at 2s each (10s) = ~25 seconds
The same task through TinyFish, after the workflow has been run enough times for codification to take effect (typically around 10 runs):
Search (sub-500ms) + fetch on warm infrastructure (2s, no cold start) + browser on shared session (0ms additional cold start) + 5 navigation steps, 3 codified at 10ms + 2 with LLM at 1s each (2s) = ~4.5 seconds
On a first run against a new site, before codification, TinyFish's number is higher. The cold start and shared-session advantages still apply, but all five navigation steps use full LLM reasoning. That puts a first-run task closer to 8 to 10 seconds. Still well below 25 seconds, because the architecture-bound savings (cold start, serialization) are immediate. The intelligence-bound savings (codification) accumulate over repeated runs.
That's roughly a 2.5x to 5.5x compression depending on workflow maturity. Across a 50-site parallel run, you're comparing 25 seconds (limited by the slowest task in the fragmented stack) against 4.5 to 10 seconds (limited by the slowest task in ours, depending on how codified the workflow is). The gap widens further because cold starts in fragmented stacks degrade under burst parallelism, while pre-warmed pools don't.
If you take one metric away from this post, make it browser session cold start time.

Search latency is already sub-second across most providers. Fetch latency is dominated by page complexity, which varies but averages two to four seconds. LLM inference runs 500ms to 2 seconds. These numbers are relatively stable across the industry.
Browser cold start ranges from sub-250ms to 10+ seconds depending on the platform. That's a 40x range. And it applies to every single session your agent opens.
For a single-task workflow, the difference between 250ms and 7 seconds is the difference between a responsive tool and a sluggish one. For a 50-site parallel scan, it's the difference between results in 30 seconds and results in 90 seconds. For a real-time monitoring pipeline that opens hundreds of sessions per hour, it's the difference between feasible and not.
When evaluating any web agent platform, ask for P50 and P95 cold start numbers under production load, not in isolated benchmarks. Cold start performance often degrades under burst conditions. The number that matters is the one you'll see when 50 agents start simultaneously at 2am during a scheduled monitoring run.
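Producing those P50/P95 numbers yourself is straightforward. A hedged sketch: `open_session` is a placeholder you would replace with the real session-creation call for the platform under test; the threading and percentile math are standard library.

```python
import concurrent.futures
import statistics
import time

def open_session() -> float:
    """Open one browser session against the platform under test and
    return the cold start time in seconds. The sleep is a placeholder
    for the real session-creation call."""
    start = time.perf_counter()
    time.sleep(0.01)  # replace with the vendor's session-creation call
    return time.perf_counter() - start

def burst_cold_starts(n: int = 50) -> dict:
    # Launch n sessions at once -- the burst condition that degrades
    # cold starts on many platforms -- and report P50 and P95.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        times = list(pool.map(lambda _: open_session(), range(n)))
    percentiles = statistics.quantiles(times, n=100)
    return {"p50": statistics.median(times), "p95": percentiles[94]}

stats = burst_cold_starts()
print(stats)
```

Run it once sequentially and once with the full burst; the gap between the two P95s is the degradation the vendor's marketing benchmark won't show you.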
Measure your actual latency waterfall. Instrument each layer: search, fetch, browser cold start, per-step navigation, LLM inference. Most teams are surprised by how little of their total latency comes from the LLM.
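A minimal way to instrument that waterfall is a timing context manager wrapped around each layer. The layer names and sleep bodies below are placeholders for your actual calls:

```python
import time
from contextlib import contextmanager

waterfall: dict[str, float] = {}

@contextmanager
def timed(layer: str):
    # Accumulate wall-clock time spent in one layer of the agent turn.
    start = time.perf_counter()
    try:
        yield
    finally:
        waterfall[layer] = waterfall.get(layer, 0.0) + time.perf_counter() - start

# Wrap each layer of a turn; the sleeps stand in for real calls.
with timed("search"):
    time.sleep(0.005)   # your search API call
with timed("browser_cold_start"):
    time.sleep(0.02)    # session creation
with timed("navigation"):
    time.sleep(0.01)    # per-step page interactions
with timed("llm_inference"):
    time.sleep(0.005)   # model reasoning calls

total = sum(waterfall.values())
infra = total - waterfall["llm_inference"]
print(f"infrastructure share of total latency: {infra / total:.0%}")
```

Even this toy breakdown makes the point: with realistic numbers plugged in, the infrastructure share usually dominates, and the LLM line is the smallest bar in the chart.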
Identify which latency category dominates your workflow. If it's physics-bound (page load times on complex sites), focus on reducing the number of pages you need to hit, or caching where possible. If it's architecture-bound (cold starts, cross-service overhead), the infrastructure layer is your highest-leverage change. If it's intelligence-bound (too many LLM calls per task), look at codified learning or similar approaches that convert repeated patterns into deterministic execution.
Evaluate cold start times under realistic conditions. Ask vendors for cold start data under burst parallelism, not sequential single-session tests.
Consider the sequential bottleneck within tasks. Parallel execution is necessary for multi-site workflows but doesn't help per-task latency. Reducing per-step latency by 50% roughly halves total task time, and the savings scale with the number of steps in the workflow.
Account for latency trends. The web is getting more JavaScript-heavy, and bot-protection is getting more aggressive. Page load times and interaction times are trending up, not down. An architecture designed for today's 3-second page loads needs headroom for tomorrow's 5-second page loads.
Try it yourself. TinyFish offers 500 free steps with no credit card required. Run your existing workflow through our infrastructure and compare the latency waterfall against your current stack. The difference is measurable in the first run.
What is the biggest source of latency in AI agent pipelines?
Browser session cold starts. While LLM inference gets the most attention, it typically accounts for only 1 to 3 seconds per turn. Browser cold starts on many cloud providers add 5 to 10 seconds per session. In multi-step workflows, this overhead compounds. TinyFish maintains browser cold starts under 250ms through pre-warmed browser pools, which eliminates the single largest latency bottleneck in most agent pipelines.
Why is my AI agent so slow even though the LLM responds quickly?
Most agent latency lives in the web infrastructure layer, not the model. A single agent turn that involves searching, fetching a page, and navigating a site can accumulate 10 to 20 seconds of infrastructure overhead in a fragmented multi-tool stack. The latency comes from cross-service network round trips, browser cold starts, authentication handshakes between tools, and sequential page rendering. Switching to unified infrastructure where all operations share the same underlying system can compress this to 2 to 5 seconds.
How does TinyFish reduce AI agent latency?
Three mechanisms. First, pre-warmed browser pools eliminate cold start delays (sub-250ms vs the industry-typical 5 to 10 seconds). Second, unified infrastructure means search, fetch, browser, and agent execution share the same system, removing cross-service serialization overhead. Third, codified learning converts repeated navigation patterns into deterministic code paths, replacing 500ms-to-2-second LLM inference calls with single-digit-millisecond code execution on subsequent runs.
What is codified learning in web agents?
Codified learning is an approach where an agent records successful navigation patterns and converts them into deterministic code paths for future use. Instead of making an LLM call every time it encounters a familiar login flow or pagination sequence, the agent executes the codified path directly. TinyFish implements this at the node level, breaking workflows into small typed steps that get individually validated and codified. The result is that agents get faster and cheaper with repeated use.
How does parallel execution affect AI agent performance?
Parallel execution compresses multi-site workflows by running tasks simultaneously rather than sequentially. Checking 50 competitor sites in parallel means total time equals the slowest single task, not the sum of all 50. TinyFish supports up to 1,000 concurrent browser sessions. However, parallelization only helps across tasks. Within each individual task, steps remain sequential, which is why per-step latency (cold starts, serialization overhead, LLM calls) still matters.
What browser cold start time should I expect from a web agent platform?
Browser cold start times in the industry range from sub-250ms to 10+ seconds. When evaluating a platform, ask for P50 and P95 cold start numbers measured under production load with burst parallelism, not in isolated single-session benchmarks. Cold start performance often degrades when many sessions launch simultaneously. TinyFish's measured P50 is under 250ms across millions of monthly sessions, including proxy initialization and anti-bot configuration.
How do I measure my AI agent's latency waterfall?
Instrument each layer separately: search call duration, page fetch time, browser cold start, per-step navigation latency, and LLM inference time. Compare the total infrastructure time against the LLM reasoning time. In most agent pipelines, infrastructure accounts for 70 to 90 percent of total latency. TinyFish provides observability and run history (up to 180 days on Pro plans) that surfaces this breakdown automatically.
No credit card. No setup. Run your first operation in under a minute.
Try the Playground · Read API Docs · View Pricing