80% of Your Web Fetch Returns Junk

Matthew Sparr, ML Engineer
May 11, 2026

Your agent fetches a news article for you. HTTP 200, markdown back, looks fine. Then you realize 80% of what your agent brought back and your LLM processed was nav bars, trending headlines, and weather widgets.

That’s the actual state of web fetching for agents right now. Most fetchers return a successful response and still give the agent a bad version of the page. But not TinyFish Fetch. Last week we pulled fifteen real articles from five publishers and ran each one through TinyFish Fetch and two well-known competitors to see how wrong they got it.

Don't take our word for it.

Take these URLs and run them through our Fetch Playground and through whatever else you've got.

Or just keep scrolling - we'll show you exactly what we found!

HTTP 200 can still be bad input

Fifteen articles, five publishers, same five-minute window, live retrieval, markdown output. The table below shows the median total text returned per publisher, not article length.

Bigger numbers here generally mean more irrelevant content.

Publisher	TinyFish Fetch	Service A	Service B
Daily Mail	8,736 chars	37,136 chars	85,054 chars
Hindustan Times	2,990 chars	543 chars (headline only)	30,571 chars
SCMP	1,863 chars	1,990 chars	14,710 chars
The Guardian	3,755 chars	empty (SOURCE_NOT_AVAILABLE)	8,006 chars
New York Times	2,136 chars (2 of 3 URLs fetched)	empty (SOURCE_NOT_AVAILABLE)	empty (HTTP 403)

Service B returned roughly 8–10x as many characters as TinyFish Fetch for three of the four publishers it could fetch (Daily Mail, Hindustan Times, and SCMP). Those extra characters were not deeper coverage. They were junk.

Service A returned less text than Service B on average, but it returned only a headline and a sign-in widget for one article and nothing at all for two publishers (The Guardian and the New York Times). Nothing relevant. Nothing useful.
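If you want to check those multiples yourself, here is a minimal Python sketch that recomputes them from the medians in the table above. The dictionary layout and names are ours, for illustration only. The New York Times is left out because Service B returned nothing for it.

```python
# Median characters returned per publisher, copied from the table above.
MEDIAN_CHARS = {
    "Daily Mail": {"tinyfish": 8_736, "service_b": 85_054},
    "Hindustan Times": {"tinyfish": 2_990, "service_b": 30_571},
    "SCMP": {"tinyfish": 1_863, "service_b": 14_710},
    "The Guardian": {"tinyfish": 3_755, "service_b": 8_006},
}


def size_multiple(row):
    """How many times more characters Service B returned than TinyFish Fetch."""
    return row["service_b"] / row["tinyfish"]


multiples = {name: round(size_multiple(row), 1) for name, row in MEDIAN_CHARS.items()}

for name, multiple in multiples.items():
    print(f"{name}: {multiple}x")
```

The Guardian is the outlier at about 2x; the other three land between roughly 8x and 10x.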

What the agent actually sees

Take one of the Daily Mail articles, for instance.

The article body is about 4,300 characters. Here’s what each service fetched and fed to the model:

Service	Total chars	Article content (% of total)	Non-article content (% of total)
TinyFish Fetch	4,673	~92%	~8% (a small DC Insider newsletter promo line)
Service A	63,400	~7%	~93% (200 lines of unrelated story headlines stacked at the top)
Service B	164,986	~3%	~97% (full site nav, weather widget, 60+ trending links, ad slots, runtime error text)
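The percentages in this table follow from dividing the article body length by each response length. A quick Python sketch of that calculation, assuming each response contains the full ~4,300-character body (the function name is ours, not part of any API):

```python
ARTICLE_BODY_CHARS = 4_300  # approximate body length quoted above

# Total characters each service returned for the same article.
TOTAL_CHARS = {
    "TinyFish Fetch": 4_673,
    "Service A": 63_400,
    "Service B": 164_986,
}


def article_share(total_chars, body_chars=ARTICLE_BODY_CHARS):
    """Fraction of a response that is article text, capped at 1.0."""
    return min(1.0, body_chars / total_chars)


for service, total in TOTAL_CHARS.items():
    share = article_share(total)
    print(f"{service}: {share:.0%} article, {1 - share:.0%} everything else")
```

The same ratio is a cheap health metric for any fetcher you are evaluating, as long as you have a trusted estimate of the article length.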

Rough math on that Daily Mail article: at ~4 characters per token, a 4,673-character TinyFish Fetch result is ~1,170 input tokens. Service B's 164,986-character version of the same article is ~41,000 tokens.

35× the cost for the same article, plus slower inference, plus irrelevant facts competing for attention in the context window. For one page this is a small waste. At fifty pages, it compounds into real degradation that affects response times, overall accuracy, and bottom lines.
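Here is that rough math as a small Python sketch. The 4-characters-per-token figure is the same rule of thumb used above, and real tokenizers will vary. The fifty-page projection assumes every page looks like this one, which is an illustration, not a measurement.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb from the text; real tokenizers vary


def estimate_tokens(chars, chars_per_token=CHARS_PER_TOKEN):
    """Approximate input tokens for a response of the given length."""
    return chars / chars_per_token


tinyfish_tokens = estimate_tokens(4_673)
service_b_tokens = estimate_tokens(164_986)
cost_multiple = service_b_tokens / tinyfish_tokens

# Assumption for illustration: every page in a 50-page run looks like this one.
pages = 50
extra_tokens = pages * (service_b_tokens - tinyfish_tokens)

print(f"TinyFish Fetch: ~{tinyfish_tokens:,.0f} tokens")
print(f"Service B: ~{service_b_tokens:,.0f} tokens")
print(f"Multiple: ~{cost_multiple:.0f}x")
print(f"Extra input tokens over {pages} pages: ~{extra_tokens:,.0f}")
```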

The same pattern persists across other articles.

Hindustan Times: Service B returned 28,470 characters where the article body itself was ~3,000. The rest was the full top-nav rendered as a markdown bullet list (every Indian city page included).

SCMP: Service B returned 14,710 characters for an article whose body is ~1,863 characters. Roughly 87% of the response was section nav, edition pickers, related rails, and footer chrome.

What TinyFish Fetch does differently

Fetch is a browser-backed extraction service. The exact heuristics are proprietary, but the shape of the work is straightforward. We do a lot of small, site-specific things so the caller does not have to.

It also benefits from the same proprietary browser infrastructure behind our Browser API. Fetch does not need the full control surface of Browser, but using a browser we control gives us a better place to handle anti-bot systems: browser fingerprints, request behavior, proxy routing, and challenge pages.

Those details matter even when the only thing you want back is clean article text.

Load the page the way the page expects to be loaded. Some pages work with a normal HTTP fetch. Many do not. Modern sites often need browser rendering before the article is actually present.
Use the right wait strategy for the domain. Some pages are ready early. Some need a short idle window. Some get worse if you wait too long, because infinite-scroll modules and recommendation rails start filling the page. Fetch uses domain policies for those cases.
Retry suspicious results. If an extraction is very short, looks like a challenge page, or otherwise looks wrong, Fetch can try a fuller browser path instead of returning a technically valid but useless response.
Separate article text from site chrome. The extraction step drops navigation, related-story rails, comment widgets, ad slots, and other non-article content before the result goes to the model.
Return errors when the page is not really content. Bot challenges, empty pages, proxy failures, and other degraded states should not be passed along as source material.
Normalize the output. Markdown is the default, with HTML and JSON available when callers need them. The goal is to make the result usable in a prompt without another cleanup step.
Watch domains over time. Sites change constantly. We monitor output length, success rates, and degraded responses so new domain behavior becomes our problem, not every customer’s integration problem.
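To make the "retry suspicious results" and "return errors" ideas concrete, here is a minimal caller-side sketch in Python. It is not how Fetch works internally (those heuristics are proprietary). The marker strings, thresholds, and names below are hypothetical values chosen for illustration, and you would tune them for your own sources.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only; tune them for your own sources.
CHALLENGE_MARKERS = ("verify you are human", "enable javascript", "access denied", "captcha")
CHALLENGE_MAX_CHARS = 2_000  # challenge pages tend to be short, so only scan short bodies
MIN_ARTICLE_CHARS = 500      # anything shorter is probably not a full article


@dataclass
class FetchCheck:
    ok: bool
    reason: str


def check_fetch_result(status_code: int, text: str) -> FetchCheck:
    """Flag responses that are technically valid but useless as source material."""
    if status_code != 200:
        return FetchCheck(False, f"http_{status_code}")
    body = text.strip()
    if not body:
        return FetchCheck(False, "empty")
    if len(body) < CHALLENGE_MAX_CHARS:
        lowered = body.lower()
        if any(marker in lowered for marker in CHALLENGE_MARKERS):
            return FetchCheck(False, "challenge_page")
    if len(body) < MIN_ARTICLE_CHARS:
        return FetchCheck(False, "too_short")
    return FetchCheck(True, "ok")
```

A caller could retry through a heavier path or skip the source whenever `ok` is false, instead of handing the text to the model.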

None of this is magic. It is just the unglamorous part of making web content usable for agents.

TinyFish Fetch isn't perfect either

It missed 1 of 3 NYT URLs, and we’re actively working to improve this. (Check back soon!)

However, our competitors miss whole publishers.

Where it fits

Use Fetch by itself when your agent already has URLs. Point it at a page and get clean content back.

Use Search first when your agent does not know where to look yet. Search finds candidate sources. Fetch turns those sources into usable evidence.

Any use case that reads a lot of public web pages, like news monitoring, financial research, brand intelligence, or regulatory tracking, lives or dies on fetch quality. Run one URL through TinyFish Fetch and you'll see it in the output: less noise, sharper answers, fewer tokens, fewer wasted dollars.

Try it

Search and Fetch are free. No credits, no credit card.

# Search

curl "https://api.search.tinyfish.ai?query=nvidia+earnings+2026" \
  -H "X-API-Key: $TINYFISH_API_KEY"

# Fetch

curl -X POST https://api.fetch.tinyfish.ai \
  -H "X-API-Key: $TINYFISH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.theguardian.com/any-article", "format": "markdown"}'
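If your agent is written in Python, the same two calls can be sketched with the standard library. The endpoints, header, and JSON fields mirror the curl commands above. The helper names are ours, and because the response schema is not covered here, `send` just returns the raw body as text.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.search.tinyfish.ai"
FETCH_URL = "https://api.fetch.tinyfish.ai"


def build_search_request(query: str, api_key: str) -> urllib.request.Request:
    """GET request equivalent to the Search curl command above."""
    url = f"{SEARCH_URL}?{urllib.parse.urlencode({'query': query})}"
    return urllib.request.Request(url, headers={"X-API-Key": api_key}, method="GET")


def build_fetch_request(page_url: str, api_key: str, fmt: str = "markdown") -> urllib.request.Request:
    """POST request equivalent to the Fetch curl command above."""
    payload = json.dumps({"url": page_url, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        FETCH_URL,
        data=payload,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

To run it, read your key from the `TINYFISH_API_KEY` environment variable and call `send(build_fetch_request(url, key))` with a real article URL.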

Grab your API Key: agent.tinyfish.ai/api-keys

Or try it out in the Playground first: agent.tinyfish.ai/playground/fetch