How We Evaluate Fetch Quality for AI Agents

Why Fetch Quality Matters
When an AI agent fetches a web page, the extracted content goes straight into an LLM’s context window. Everything that happens next (summarization, question answering, code generation, research synthesis) depends on what came back from that fetch.
A dirty fetch hurts you twice. You pay for tokens, and your agent pays for degraded context.
Every character that enters the context window costs money. If a fetch returns 170,000 characters for a page where the actual article is 4,700 characters, you’re paying for 36x more tokens than the information you need. Scale that across every fetch in a session, across every session in a day, and the cost difference between a clean fetch and a noisy one becomes significant.
But cost is the smaller problem. The bigger problem is what noise does to the LLM’s reasoning. When an agent receives 170,000 characters and only 19% is the article it asked for, the model has to locate the relevant content inside a sea of navigation menus, ad placeholders, cookie banners, related-article links, and page chrome. Its downstream outputs get worse. Summaries drift. Answers pull from the wrong section. The agent makes decisions based on noise it mistook for signal.
Two things determine whether a fetch is useful:
1. Did the provider return the page at all? A failed fetch, a 403, or a truncated stub means the agent has nothing to work with. It either hallucinates an answer or tells the user it can’t help.
2. How clean is what came back? A “successful” fetch that returns 171,000 characters at 19% signal is not meaningfully better than a failed fetch. The article is technically in there, but the agent is paying full token cost for an output roughly 36x the size of the article itself.
We built a repeatable evaluation to measure both dimensions (coverage and signal quality) across five providers. This post describes the methodology, the results, and every decision we made along the way.
The Test Set
We built a fixed set of 45 URLs across seven content categories. These represent the kinds of pages AI agents actually fetch in production:
- Documentation (10): API references and SDK guides from OpenAI, Anthropic, Playwright, Next.js, Stripe, Cloudflare, FastAPI, PostgreSQL, HTTPX
- Pricing pages (6): SaaS pricing with tables, tiers, and feature grids from OpenAI, Anthropic, Vercel, Stripe, Supabase, GitHub Copilot
- Code repositories (5): GitHub READMEs from LangGraph, Vercel AI SDK, Playwright, FastAPI, DuckDB
- Articles (5): Technical content and product announcements from Cloudflare, MDN, Wikipedia, OpenAI, Anthropic
- Simple pages (2): Minimal-content test pages from Quotes to Scrape and Books to Scrape
- Status pages (2): Service status pages, including the OpenAI status page
- News articles (15): 3 articles each from 5 publishers: Daily Mail, Hindustan Times, South China Morning Post, The Guardian, New York Times
News articles make up a third of the test set deliberately. News is one of the most common agent fetch targets, and it’s where providers diverge the most. Many news publishers serve content behind paywalls or via client-side rendering that requires a full browser environment to access. Index-based providers may not have the content at all. These are real-world conditions that agents encounter daily.
How we selected URLs:
- All URLs were chosen before any provider was tested. No cherry-picking based on results.
- Mix of static HTML and dynamically rendered pages.
- Mix of sites with and without anti-bot protections.
- All URLs were verified as live and accessible in a standard browser at test time.
The full URL list is at the end of this post.
Providers Tested: TinyFish, Exa, Firecrawl, Tavily, Parallel
We tested five providers. Each was called using its documented fetch or extract API with default settings. No provider-specific tuning, custom headers, or prompt optimization.
- TinyFish: Fetch API. Browser-backed rendering. Renders pages in a real browser, strips non-content elements, returns clean markdown.
- Exa: Contents API. Index-based. Returns content from Exa’s pre-built web index. Pages not in the index return an error.
- Firecrawl: Scrape API (v2). HTML-to-markdown conversion. Returns the full rendered page as markdown.
- Tavily: Extract API. Returns raw content from the target URL. Failed URLs are reported separately.
- Parallel: Extract API with full_content enabled. Falls back to short excerpts when full content is unavailable.
Beyond those defaults, we used no retries and no fallback modes. The only parameters we set were the ones needed to request full content as markdown, such as Parallel’s full_content option.
How We Score
Pass / Fail Criteria
A result counts as usable only if it meets both criteria:
1. At least 1,000 characters returned. Anything shorter is a snippet or stub, not a page extraction. This threshold exists because some providers report success even when they return a truncated excerpt instead of the full page. For example, Parallel’s Extract API returns exactly 600 characters for every news URL. The full_content field comes back null, the API falls back to a short excerpt, and the response reports success. Without a minimum character threshold, those 600-character stubs would count as successful fetches.
2. Signal ratio above 30%. If more than 70% of the output is navigation, ads, and page chrome, the content is effectively buried. This threshold exists because some providers return enormous outputs where the actual article or documentation is a small fraction of the total. For example, Firecrawl returned 171,095 characters for a Daily Mail article where the article body was approximately 4,300 characters (a 19% signal ratio). The fetch technically “succeeded,” but an LLM processing that output pays for 36x more tokens than the information it actually needs.
Why these specific thresholds? We chose 1,000 characters because the shortest meaningful page in our test set (a technical documentation page on HTTPX timeouts) is approximately 1,200 characters. Anything under 1,000 characters is either a snippet, an error page, or a site that returned its homepage instead of the requested content. We chose 30% signal because below that threshold, the noise-to-content ratio makes the output actively worse than no output at all. The LLM has to find the needle in a haystack, and its downstream responses degrade measurably.
Both thresholds are applied after the fact, based on the data. No threshold was chosen to include or exclude a specific provider’s results.
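The two criteria can be written as a small predicate. This is a minimal sketch rather than the evaluation script itself: the names `is_usable`, `MIN_CHARS`, and `MIN_SIGNAL` are hypothetical, and the boundary handling (inclusive on length, strict on signal) is one reading of “at least” and “above.”

```python
MIN_CHARS = 1_000   # "at least 1,000 characters returned"
MIN_SIGNAL = 0.30   # "signal ratio above 30%"


def is_usable(char_count: int, signal_ratio: float) -> bool:
    """Return True when a fetch result meets both pass criteria.

    signal_ratio is a fraction in [0, 1], not a percentage.
    """
    long_enough = char_count >= MIN_CHARS
    clean_enough = signal_ratio > MIN_SIGNAL
    return long_enough and clean_enough


# Figures quoted in this post for one Daily Mail article:
print(is_usable(600, 0.143))      # Parallel stub -> False
print(is_usable(171_095, 0.193))  # Firecrawl output -> False
print(is_usable(4_737, 0.867))    # TinyFish output -> True
```

A result has to clear both checks, so a very long output with low signal fails just as a short stub does.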
What Is Signal Ratio?
Signal ratio measures what percentage of the extracted content is the actual information the agent was looking for, versus noise.
Content = article body, documentation text, product descriptions, pricing details, code examples, README text. The information a reader or agent came for.
Noise = navigation bars, sidebars, footer links, cookie banners, ad placeholders, social share buttons, related-article link lists, newsletter signup blocks, breadcrumbs, site-wide menus, legal boilerplate, link lists to other pages, metadata dumps.
How we measure it: We use Claude Sonnet 4.6 (claude-sonnet-4-6) as an LLM judge. For each fetched page, the judge receives the raw extracted text and the original URL, and classifies each section as content or noise. It returns a character count for each category and a signal ratio percentage.
The judge prompt is identical for every provider on every URL. No provider-specific adjustments. Outputs longer than 30,000 characters were truncated before being sent to the judge (the judge was informed of the truncation). This truncation applies equally and primarily affects high-character, low-signal outputs.
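As a rough illustration of the two mechanical steps around the judge (truncating long outputs, then turning the judge’s character counts into a ratio), the sketch below may help. It assumes a JSON reply with `content_chars` and `noise_chars` fields; that schema and the function names are hypothetical and are not taken from the actual judge prompt or script.

```python
import json

JUDGE_CHAR_LIMIT = 30_000  # outputs longer than this are truncated before judging


def prepare_for_judge(text: str, limit: int = JUDGE_CHAR_LIMIT) -> tuple[str, bool]:
    """Truncate text to the judge limit and report whether truncation happened."""
    if len(text) <= limit:
        return text, False
    return text[:limit], True


def signal_ratio(judge_reply: str) -> float:
    """Compute content / (content + noise) from a JSON judge reply.

    The reply schema ({"content_chars": int, "noise_chars": int}) is a
    hypothetical format chosen for this sketch.
    """
    counts = json.loads(judge_reply)
    content = counts["content_chars"]
    noise = counts["noise_chars"]
    total = content + noise
    return content / total if total else 0.0


# A 171,095-character output is cut to the limit and flagged as truncated.
text, was_truncated = prepare_for_judge("x" * 171_095)
print(len(text), was_truncated)

# Illustrative counts only, not a real judge reply.
print(signal_ratio('{"content_chars": 4107, "noise_chars": 630}'))
```

The truncation flag matters because the judge is told when it is seeing only part of an output.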
Why an LLM judge instead of heuristics? Automated approaches (HTML tag analysis, boilerplate detection, character-ratio heuristics) break across the wide variety of page structures in the test set. A documentation page, a news article, and a GitHub README have completely different DOM structures. An LLM judge can distinguish “this is a navigation menu” from “this is article content” regardless of how the page is built. We chose Claude Sonnet 4.6 for its consistency on classification tasks, and because using a third-party model (Anthropic) to judge outputs from all five providers, including TinyFish, adds a layer of independence to the scoring.
Results: Coverage and Signal Ratio
Coverage
Coverage measures how many of the 45 URLs each provider returned a usable result for.
- TinyFish: 42 / 45 (93%)
- Tavily: 36 / 45 (80%)
- Exa: 33 / 45 (73%)
- Firecrawl: 28 / 45 (62%)
- Parallel: 26 / 45 (58%)
Where TinyFish missed: TinyFish failed on 3 URLs. Books to Scrape returned only 400 characters (below the 1,000-character threshold). The OpenAI status page returned 551 characters (also below threshold). And the OpenAI GPT-4o announcement returned 4,218 characters but with only 24.8% signal, because the page rendered with significant non-content elements that pushed the signal ratio below 30%.
We’re reporting these failures because they’re in the data. No provider achieved 100% coverage.
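The coverage percentages in this section are usable results divided by the 45 URLs, rounded to the nearest whole percent. A short sketch of that arithmetic, with a hypothetical helper name:

```python
TOTAL_URLS = 45

# Usable-result counts as reported in this section.
usable_counts = {
    "TinyFish": 42,
    "Tavily": 36,
    "Exa": 33,
    "Firecrawl": 28,
    "Parallel": 26,
}


def coverage_pct(usable: int, total: int = TOTAL_URLS) -> int:
    """Coverage as a whole-number percentage of the test set."""
    return round(100 * usable / total)


for provider, usable in usable_counts.items():
    missed = TOTAL_URLS - usable
    print(f"{provider}: {usable}/{TOTAL_URLS} ({coverage_pct(usable)}%), {missed} missed")
```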
Signal Ratio
Among usable results, signal ratio measures how clean the extracted content is.
| Provider | Median | Mean |
|---|---|---|
| TinyFish | 90.5% | 89.6% |
| Exa | 89.7% | 85.8% |
| Firecrawl | 79.5% | 72.5% |
| Tavily | 77.2% | 69.3% |
| Parallel | 76.3% | 72.3% |
Exa’s signal quality is comparable to TinyFish on pages it can reach. The difference is coverage: Exa returned no usable result for 12 of 45 URLs, most of them because the page is not in its index. When Exa succeeds, it delivers clean content. Its index-based approach strips noise before storing.
We report both median and mean because some providers have high variance. Firecrawl’s median (79.5%) is 7 points higher than its mean (72.5%), and Tavily shows a similar gap (77.2% vs 69.3%), indicating a long tail of low-signal outputs pulling the average down.
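To see why a gap between median and mean points to a low-signal tail, consider a small made-up sample. The numbers below are illustrative only and are not the evaluation’s raw data; every value is above 30% because only usable results enter these statistics.

```python
from statistics import mean, median

# Illustrative signal ratios in percent (NOT the evaluation's raw data).
# Most pages are clean; three sit just above the 30% usability floor.
ratios = [88.0, 85.0, 82.0, 80.0, 79.0, 76.0, 45.0, 35.0, 32.0]

med = median(ratios)
avg = mean(ratios)

print(f"median = {med:.1f}%")
print(f"mean   = {avg:.1f}%")
print(f"gap    = {med - avg:.1f} points")
```

The median ignores how far the three low values fall, while the mean is dragged down by them, which is the pattern described above.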
Coverage by Category
| Category | TinyFish | Exa | Firecrawl | Tavily | Parallel |
|---|---|---|---|---|---|
| Documentation (10) | 10/10 | 10/10 | 9/10 | 9/10 | 9/10 |
| Pricing (6) | 6/6 | 6/6 | 6/6 | 5/6 | 5/6 |
| Repositories (5) | 5/5 | 5/5 | 2/5 | 3/5 | 5/5 |
| Articles (5) | 4/5 | 5/5 | 4/5 | 4/5 | 5/5 |
| News (15) | 15/15 | 5/15 | 4/15 | 12/15 | 0/15 |
The gap is concentrated in news. On documentation, pricing, repos, and articles, most providers perform reasonably well. The differences are 1–3 URLs at most. News is where the results diverge sharply. (The table covers 41 of the 45 URLs; the simple pages and status pages are not broken out.)
Exa and Parallel both beat TinyFish on articles (5/5 vs 4/5). Exa and Firecrawl matched TinyFish on pricing (6/6). No single provider dominates every category, but news coverage is the category where a 15-URL gap separates the top from the bottom.
News Publisher Breakdown
| Publisher | TinyFish | Tavily | Exa | Firecrawl | Parallel |
|---|---|---|---|---|---|
| Daily Mail (3) | 3/3 | 2†/3 | 2†/3 | 1/3 | 0/3 |
| Hindustan Times (3) | 3/3 | 3/3 | 0/3 | 0/3 | 0/3 |
| South China Morning Post (3) | 3/3 | 3/3 | 3/3 | 0/3 | 0/3 |
| The Guardian (3) | 3/3 | 3/3 | 0/3 | 3/3 | 0/3 |
| New York Times (3) | 3/3 | 1/3 | 0/3 | 0/3 | 0/3 |
† At least one article returned content but scored below the 30% signal threshold. It was fetched, but it does not count as usable by our criteria.
Why each provider fails on news:
- Exa: The Guardian, NYT, and Hindustan Times are not in its web index. The API returns status: error. There’s no content to score.
- Firecrawl: NYT returns HTTP 403 with an explicit message: the site is not supported. Hindustan Times and SCMP return 6,000–29,000 characters, but at 8–18% signal. Massive page chrome with almost no article content.
- Tavily: NYT fails to fetch on 2 of 3 URLs. One Daily Mail article returns 46,000 characters at 17% signal. The article exists somewhere in the output, but it’s buried under 38,000 characters of navigation, ads, and related links.
- Parallel: Every news URL returns exactly 600 characters. The full_content field comes back null, and the API falls back to a short excerpt. The API reports success. There is no error code or flag indicating the content was truncated. An agent using Parallel would have no way to know it received a stub instead of the article.
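Because a stub can arrive with a success status, an agent may want its own length check before trusting a fetch. The sketch below shows one way to do that. The response shape (`status`, `full_content`, `excerpt`) is hypothetical and does not mirror any provider’s actual schema; only the 1,000-character floor comes from this post.

```python
from typing import Optional

MIN_CHARS = 1_000  # same floor as the pass/fail criteria


def usable_text(response: dict) -> Optional[str]:
    """Return extracted text, or None if the response is an error or a stub.

    The response shape ({"status", "full_content", "excerpt"}) is a
    hypothetical format used only for this sketch.
    """
    if response.get("status") != "success":
        return None
    text = response.get("full_content") or response.get("excerpt") or ""
    if len(text) < MIN_CHARS:
        return None  # reported success, but too short to be the page
    return text


stub = {"status": "success", "full_content": None, "excerpt": "x" * 600}
print(usable_text(stub) is None)  # a 600-character excerpt is rejected
```

A guard like this catches short stubs but not long, noisy outputs, which need a signal measurement as well.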
A Concrete Example: One News Article, Five Providers
To make the numbers tangible, here’s what happens when all five providers fetch the same Daily Mail article. The article body is approximately 4,300 characters of text.
- TinyFish: 4,737 characters returned, 86.7% signal. The agent gets clean article text with minimal surrounding content.
- Exa: 65,793 characters returned, 29.7% signal. The article is present but buried in 14x more page chrome (navigation, related articles, ad containers, and site-wide menus).
- Firecrawl: 171,095 characters returned, 19.3% signal. The article is buried in 36x more navigation, ads, and related links. The LLM pays for roughly 171K characters of input to process ~4,300 characters of useful information.
- Tavily: 46,433 characters returned, 17.3% signal. The article is present but surrounded by 10x more page furniture.
- Parallel: 600 characters returned, 14.3% signal. A single sentence from the article, truncated. Not the article.
When TinyFish returns 4,737 characters at 87% signal, the LLM processes approximately 4,100 characters of article content. When Firecrawl returns 171,095 characters at 19% signal, the LLM processes the same ~4,100 characters of article content plus ~167,000 characters of noise, at the same per-token cost.
For this one article, the token cost difference between TinyFish and Firecrawl is approximately 36x. Multiply that across every fetch an agent makes in a session, and the cost and quality implications compound.
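The multipliers in this example can be reproduced from the character counts listed above. The sketch below uses character count as a stand-in for token count, as this post does, and takes the TinyFish output as the baseline; the variable names are hypothetical.

```python
# (characters returned, signal ratio) for the Daily Mail example above
results = {
    "TinyFish": (4_737, 0.867),
    "Exa": (65_793, 0.297),
    "Firecrawl": (171_095, 0.193),
    "Tavily": (46_433, 0.173),
    "Parallel": (600, 0.143),
}

baseline_chars = results["TinyFish"][0]

# Output size of each provider relative to the baseline output.
size_vs_baseline = {
    name: chars / baseline_chars for name, (chars, _signal) in results.items()
}

for name, ratio in size_vs_baseline.items():
    print(f"{name}: {ratio:.1f}x the characters of the TinyFish output")
```

Rounded, this gives about 14x for Exa, 36x for Firecrawl, and 10x for Tavily.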
Run Details
- Date: June 2, 2026
- Environment: All providers tested sequentially from the same machine (US-based cloud instance). No parallelism. Each provider completed its full 45-URL pass before the next provider started.
- Delay between calls: 300ms, to avoid rate limiting on any provider.
- Timeout: 300 seconds per fetch, with no shorter cap imposed. If a provider needed 5 minutes to render a page, it got 5 minutes.
- No retries. If a provider failed on a URL, that failure was recorded as-is. No second attempts, no fallback configurations, no manual intervention.
- LLM judge rate: ~1 call per 1.2 seconds to stay within Anthropic API rate limits.
- LLM judge model: Claude Sonnet 4.6 (claude-sonnet-4-6), called via the Anthropic API. The same model and prompt were used for every evaluation across all providers.
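The run settings above describe a simple sequential loop. A minimal sketch of such a loop follows, assuming a provider client exposed as a `fetch(url, timeout_s)` callable; that interface, `run_pass`, and `fake_fetch` are hypothetical stand-ins, not the actual evaluation script or any provider’s SDK.

```python
import time
from typing import Callable

DELAY_S = 0.3    # 300 ms between calls
TIMEOUT_S = 300  # per-fetch timeout handed to the provider call


def run_pass(
    fetch: Callable[[str, float], str],
    urls: list[str],
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, dict]:
    """Fetch each URL once, in order, recording failures without retrying."""
    results: dict[str, dict] = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(DELAY_S)  # pause between consecutive calls
        try:
            text = fetch(url, TIMEOUT_S)
            results[url] = {"ok": True, "chars": len(text), "error": None}
        except Exception as exc:  # record the failure as-is; no second attempt
            results[url] = {"ok": False, "chars": 0, "error": str(exc)}
    return results


def fake_fetch(url: str, timeout_s: float) -> str:
    """Stand-in for a provider client; raises for one URL to show failure handling."""
    if "blocked" in url:
        raise RuntimeError("HTTP 403")
    return "x" * 1200


report = run_pass(
    fake_fetch,
    ["https://example.com/a", "https://example.com/blocked"],
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(report)
```

Running one provider’s full pass means calling `run_pass` once per provider, one after another, which matches the no-parallelism setup described above.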
Limitations
We’re publishing this methodology because we want it scrutinized. Here’s what we think the weaknesses are:
45 URLs is a small sample. It’s large enough to reveal consistent patterns (especially in news coverage, where the differences are stark and consistent across all 15 URLs) but individual URL results can be noisy. We chose breadth across content types over depth within any single type. A larger test set would increase statistical confidence, and we plan to expand it in future evaluations.
LLM judges aren’t perfect. Claude Sonnet 4.6 occasionally miscounts characters or misclassifies borderline content. Is a “Related articles” section noise or content? Reasonable people (and models) can disagree. The same prompt and model were used for every provider on every URL, so any systematic bias affects all providers equally. The judge is not biased toward any one provider’s output format.
Provider capabilities change. These results reflect each provider’s API as of June 2026. Providers ship improvements continuously. A provider that scores poorly today may improve tomorrow. If you believe a result is outdated, contact us and we’ll re-run the evaluation.
We are one of the providers tested. TinyFish ran this evaluation. We have an obvious interest in the outcome. That’s why we’ve published the full URL list, the scoring criteria, the exact pass/fail thresholds, every number including our own failures, and the complete methodology. Anyone can reproduce this evaluation. The evaluation script, raw per-URL results (CSV with scores and LLM judge reasoning for all 225 provider × URL combinations), and full LLM judge outputs (JSON) are available. Email support@tinyfish.ai to request the complete logs.
The 30% signal threshold is a judgment call. We believe content that’s less than 30% signal is practically unusable for LLM processing, but the threshold is subjective. The raw data includes exact signal ratios for every provider on every URL, so anyone can re-score with a different threshold.
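Re-scoring with a different threshold is a small filter over per-URL rows. The sketch below assumes a CSV with `provider`, `chars`, and `signal_pct` columns; that layout is hypothetical and may not match the actual raw-data file. The sample rows reuse the Daily Mail figures quoted earlier in this post.

```python
import csv
import io

# Hypothetical CSV layout; rows reuse the Daily Mail figures from this post.
RAW = """provider,chars,signal_pct
TinyFish,4737,86.7
Exa,65793,29.7
Firecrawl,171095,19.3
Tavily,46433,17.3
Parallel,600,14.3
"""


def rescore(csv_text: str, min_signal_pct: float, min_chars: int = 1_000) -> list[str]:
    """Return the providers whose row passes the given thresholds."""
    passed = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        long_enough = int(row["chars"]) >= min_chars
        clean_enough = float(row["signal_pct"]) > min_signal_pct
        if long_enough and clean_enough:
            passed.append(row["provider"])
    return passed


print(rescore(RAW, 30.0))  # the threshold used in this post
print(rescore(RAW, 25.0))  # a looser threshold
```

On this one article, loosening the threshold from 30% to 25% changes Exa’s result from fail to pass and leaves the other four unchanged.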
Have questions about the methodology, or want to run your own comparison? The full evaluation data (per-URL results, signal scores, and LLM judge reasoning for all 225 tests) is available on request. Email support@tinyfish.ai.
Full URL List
Documentation (10)
- Streaming Responses guide
- Tool Use overview
- Tools and Tool Calling
- Locators
- Route Handlers
- Durable Objects
- Webhook Signatures
- Background Tasks
- JSON Data Types
- Timeouts
Pricing (6)
Repositories (5)
Articles (5)
Simple Pages (2)
Status Pages (2)
News: Daily Mail (3)
- Obama Trump marriage Michelle issues
- Meghan Markle Hollywood power brokers
- Canada actress Claire Brosseau assisted suicide
News: Hindustan Times (3)
- Axar Patel name-drops Kuldeep Yadav after Delhi Capitals elimination
- BCCI controls the ICC, South Africa spinner makes big charge
- Ex-India cricketer claims TMC denied him ticket
News: South China Morning Post (3)
- Iran’s top diplomat Abbas Araghchi to visit China
- Panama minister blasts China’s ship crackdown
- China pushes 10,000-card computing clusters in AI race
News: The Guardian (3)
- Blake Lively, Justin Baldoni settlement
- Kitten rescued from glue bucket in Texas
- LIV Golf funding, Cam Smith
News: New York Times (3)



