Give Agents Better Context, Not More Context

TL;DR
I built web-research, an open-source CLI and MCP server that helps AI agents search, fetch, clean, and return focused web content instead of sending raw HTML or noisy markdown into the model context.
In one benchmark across three real documentation pages, it reduced context from 23,851 estimated tokens to 1,608 estimated tokens — a 93.3% reduction versus raw markdown. Token counts are estimated as chars / 4, not a precise tokenizer.
“The cheapest token is the one you never send.”
The problem
A few months ago, I started seeing discussions about TinyFish making web search free for AI agents.
That caught my attention.
But what I kept thinking about was not the search cost.
It was what happens after the search.
If an agent searches the web and then sends everything it finds into the model context, you are not solving the problem. You are just moving it.
Search can be cheap.
Processing noise is not.
When an agent fetches a web page and passes the result directly into context, it pays for content it never uses. Raw HTML causes context overflows, higher inference cost, and worse answers because the model has to reason through scripts, layout markup, and navigation before it gets to the actual content.
Raw HTML often includes:
- JavaScript bundles
- inline styles
- navigation bars
- cookie consent banners
- duplicated text across sections
- boilerplate headers, footers, and sidebars
Cleaned markdown is better, but it can still be too large.
The three documentation pages I tested totaled 93.2 KB of markdown. That is not a large research session by agent standards, and it was already too much.
The model does not need everything.
It needs the useful parts.
Getting that distinction right is the actual problem.
What I built
web-research is an open-source CLI tool and MCP server written in Go. It handles the full research pipeline:
- Search the web
- Fetch the page
- Clean HTML into markdown
- Extract or rank useful content
- Return structured JSON output with sources
The CLI supports commands like:
wr search "MCP server authentication patterns"
wr fetch "https://docs.anthropic.com/en/docs/tool-use" "tool result format"
wr research "how does query decomposition work in RAG"
wr research --mode=chunks "Next.js App Router data fetching"
wr research --mode=lossless "Stripe webhook signature verification"
The MCP server registers three tools via modelcontextprotocol/go-sdk, so Claude Code and Cursor can call them natively:
- web_search
- web_fetch
- web_research
The output is structured JSON. A web_research response looks like this:
{
  "query": "Next.js App Router data fetching",
  "sub_queries": [
    "Next.js App Router fetch patterns",
    "server components data loading"
  ],
  "answer": "...",
  "sources": [
    {
      "url": "https://nextjs.org/docs/...",
      "title": "Data Fetching",
      "summary": "..."
    }
  ],
  "stats": {
    "searched_results": 10,
    "fetched_pages": 3,
    "cache_hits": 1
  }
}
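On the consuming side, an agent written in Go could decode that response into typed structs. The sketch below is a minimal example: the field names come from the JSON above, but the Go types and the ParseResearchResponse helper are assumptions for illustration, not code taken from the web-research repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Source mirrors one entry of the "sources" array in the example response.
type Source struct {
	URL     string `json:"url"`
	Title   string `json:"title"`
	Summary string `json:"summary"`
}

// Stats mirrors the "stats" object in the example response.
type Stats struct {
	SearchedResults int `json:"searched_results"`
	FetchedPages    int `json:"fetched_pages"`
	CacheHits       int `json:"cache_hits"`
}

// ResearchResponse is a hypothetical client-side mirror of the response.
// The field types are inferred from the example JSON only.
type ResearchResponse struct {
	Query      string   `json:"query"`
	SubQueries []string `json:"sub_queries"`
	Answer     string   `json:"answer"`
	Sources    []Source `json:"sources"`
	Stats      Stats    `json:"stats"`
}

// ParseResearchResponse decodes a web_research JSON payload.
func ParseResearchResponse(data []byte) (ResearchResponse, error) {
	var resp ResearchResponse
	err := json.Unmarshal(data, &resp)
	return resp, err
}

func main() {
	payload := []byte(`{
		"query": "Next.js App Router data fetching",
		"sub_queries": ["Next.js App Router fetch patterns"],
		"answer": "...",
		"sources": [{"url": "https://nextjs.org/docs/...", "title": "Data Fetching", "summary": "..."}],
		"stats": {"searched_results": 10, "fetched_pages": 3, "cache_hits": 1}
	}`)
	resp, err := ParseResearchResponse(payload)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	// The agent can log stats and cite sources separately from the answer text.
	fmt.Printf("fetched %d pages, %d cache hits\n", resp.Stats.FetchedPages, resp.Stats.CacheHits)
	for _, s := range resp.Sources {
		fmt.Println("source:", s.Title, s.URL)
	}
}
```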
Where TinyFish fit
TinyFish is the search and HTTP access layer.
web-research is the content-processing layer that runs after the HTTP response arrives.
They sit in sequence:
- TinyFish handles search.
- web-research cleans and processes the returned content.
- The agent receives only the focused context it needs, along with source references and stats.
TinyFish gets the web results back.
web-research decides what from those results should actually enter the model context.
The problem I focused on lives in that second step.
What worked
1. Separating retrieval from summarization
Not every research task needs an LLM summary. That was the clearest lesson early on.
I added three retrieval modes, set with --mode:
lossless returns full cleaned markdown with no LLM call. Use it when the agent needs the complete source text for comparison, code extraction, or downstream processing.
wr research --mode=lossless "Stripe webhook payload structure"
chunks splits content into paragraphs or sections, scores each against the query using TF-IDF cosine similarity, and returns the top five by default in reading order. Use it when the task needs relevant sections without paying for summarization.
wr research --mode=chunks --top-k=3 "Next.js App Router caching"
summarize runs LLM compression over the fetched content and returns a compact answer with cited sources. This is the default mode and is useful when the agent needs a concise answer it can act on immediately.
A summarization step adds latency and an LLM call that is sometimes unnecessary. When you skip it with chunks or lossless, you still get clean, processed content.
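To make the chunks mode more concrete, here is a minimal sketch of TF-IDF cosine ranking that keeps the top k chunks and returns them in reading order. It is an illustration under assumptions, not the tool's actual implementation: the tokenizer, the smoothed IDF formula, and the TopChunks name are all choices made for this example.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
	"unicode"
)

// termCounts lowercases text, splits it on non-alphanumeric runes, and
// returns raw term frequencies.
func termCounts(text string) map[string]float64 {
	counts := map[string]float64{}
	tokens := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	for _, tok := range tokens {
		counts[tok]++
	}
	return counts
}

// cosine computes the cosine similarity of two sparse vectors.
func cosine(a, b map[string]float64) float64 {
	var dot, normA, normB float64
	for term, wa := range a {
		dot += wa * b[term]
		normA += wa * wa
	}
	for _, wb := range b {
		normB += wb * wb
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// TopChunks scores each chunk against the query with TF-IDF cosine similarity
// and returns the k best chunks in their original (reading) order.
func TopChunks(chunks []string, query string, k int) []string {
	tfs := make([]map[string]float64, len(chunks))
	df := map[string]float64{} // number of chunks containing each term
	for i, c := range chunks {
		tfs[i] = termCounts(c)
		for term := range tfs[i] {
			df[term]++
		}
	}
	// Smoothed IDF (an assumption of this sketch) avoids division by zero.
	weight := func(tf map[string]float64) map[string]float64 {
		vec := map[string]float64{}
		for term, f := range tf {
			vec[term] = f * (math.Log((1+float64(len(chunks)))/(1+df[term])) + 1)
		}
		return vec
	}
	queryVec := weight(termCounts(query))
	scores := make([]float64, len(chunks))
	order := make([]int, len(chunks))
	for i := range chunks {
		scores[i] = cosine(weight(tfs[i]), queryVec)
		order[i] = i
	}
	// Rank by score, keep the best k, then restore reading order.
	sort.SliceStable(order, func(a, b int) bool { return scores[order[a]] > scores[order[b]] })
	if k < 0 {
		k = 0
	}
	if k > len(order) {
		k = len(order)
	}
	top := order[:k]
	sort.Ints(top)
	out := make([]string, 0, k)
	for _, i := range top {
		out = append(out, chunks[i])
	}
	return out
}

func main() {
	chunks := []string{
		"Revalidate cached data by setting a revalidate interval on fetch.",
		"Install the package with npm and restart the dev server.",
		"Caching in the App Router stores fetch results between requests.",
	}
	for _, c := range TopChunks(chunks, "App Router fetch caching", 2) {
		fmt.Println(c)
	}
}
```

Returning the selected chunks in reading order rather than score order keeps the excerpt readable as a document, which matters when the chunks refer to each other.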
2. Query decomposition and parallel fetching
A single query often misses important angles.
wr research sends the original query to a configured LLM, currently Groq or GitHub Copilot, with a prompt that asks for two to three distinct sub-queries covering different angles. The result is capped at three.
Then it:
- searches all sub-queries in parallel
- deduplicates URLs across results
- fetches the top unique pages concurrently
- processes each page by the selected retrieval mode
Running searches and fetches concurrently keeps latency from growing linearly with the number of sub-queries.
Deduplication also matters because the same URL frequently appears across multiple sub-queries. Fetching it twice is pure waste.
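The search-then-deduplicate step above can be sketched in Go with one goroutine per sub-query. This is a simplified illustration: SearchFunc is a hypothetical stand-in for the real search backend, and merging results in sub-query order is an assumption of the sketch rather than the tool's documented ranking.

```go
package main

import (
	"fmt"
	"sync"
)

// SearchFunc is a hypothetical stand-in for the search backend: it returns
// result URLs for one query.
type SearchFunc func(query string) []string

// SearchAll runs every sub-query concurrently, then merges the results into a
// deduplicated URL list containing at most limit URLs.
func SearchAll(subQueries []string, search SearchFunc, limit int) []string {
	results := make([][]string, len(subQueries))
	var wg sync.WaitGroup
	for i, q := range subQueries {
		wg.Add(1)
		go func(i int, q string) {
			defer wg.Done()
			// Each goroutine writes only its own slot, so no mutex is needed.
			results[i] = search(q)
		}(i, q)
	}
	wg.Wait()

	// Merge in sub-query order so the output is deterministic.
	seen := map[string]bool{}
	var unique []string
	for _, urls := range results {
		for _, u := range urls {
			if seen[u] || len(unique) >= limit {
				continue
			}
			seen[u] = true
			unique = append(unique, u)
		}
	}
	return unique
}

func main() {
	// A fake backend in which both sub-queries return the same Stripe page.
	fake := func(q string) []string {
		if q == "Stripe webhook signature verification" {
			return []string{"https://docs.stripe.com/webhooks", "https://example.com/a"}
		}
		return []string{"https://docs.stripe.com/webhooks", "https://example.com/b"}
	}
	subQueries := []string{
		"Stripe webhook signature verification",
		"how to validate webhook payloads",
	}
	for _, u := range SearchAll(subQueries, fake, 3) {
		fmt.Println(u)
	}
}
```

Fetching the unique pages can follow the same WaitGroup pattern, with one goroutine per URL.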
3. Jina fallback for sparse JavaScript-rendered pages
TinyFish handles the primary search and fetch path.
Some pages return sparse HTML when fetched directly. This often happens with JavaScript-rendered SPAs, including React or Vue documentation sites that build their content client-side.
A 200 status does not mean useful content arrived.
When web-research fetches a page and the cleaned body text is under 500 characters, which is too sparse to be useful, it retries the URL through r.jina.ai as a secondary fallback before processing the content. Jina renders the page server-side and returns clean markdown. If Jina returns more content than the direct fetch did, Jina wins.
The 500-character threshold is a heuristic. A real page with less than 500 characters of useful content would still trigger the fallback, but that is rare in practice.
Of the three benchmark pages, nextjs.org/docs and next-intl.dev/docs triggered the Jina fallback during the benchmark run. docs.stripe.com/webhooks returned sufficient content directly.
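The fallback rule can be expressed in a few lines. The sketch below assumes a hypothetical FetchFunc that returns cleaned body text, and it assumes the reader is addressed by prefixing the target URL with https://r.jina.ai/. How web-research itself calls Jina, and how it handles a failed direct fetch, may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// minBodyChars is the sparse-page threshold described above.
const minBodyChars = 500

// jinaPrefix is prepended to the target URL. This addressing scheme is an
// assumption of the sketch.
const jinaPrefix = "https://r.jina.ai/"

// FetchFunc is a hypothetical fetcher that returns cleaned body text for a URL.
type FetchFunc func(url string) (string, error)

// FetchWithFallback fetches url directly and retries through the reader when
// the body is under minBodyChars. It reports whether the fallback result won.
func FetchWithFallback(url string, fetch FetchFunc) (string, bool, error) {
	direct, err := fetch(url)
	if err != nil {
		return "", false, err
	}
	if utf8.RuneCountInString(direct) >= minBodyChars {
		return direct, false, nil
	}
	viaJina, err := fetch(jinaPrefix + url)
	if err != nil || utf8.RuneCountInString(viaJina) <= utf8.RuneCountInString(direct) {
		// Keep the sparse direct result when the fallback fails or is no better.
		return direct, false, nil
	}
	return viaJina, true, nil
}

func main() {
	// A fake fetcher: the direct fetch returns an SPA shell, the reader
	// returns the rendered page.
	fake := func(url string) (string, error) {
		if strings.HasPrefix(url, jinaPrefix) {
			return strings.Repeat("rendered content ", 50), nil
		}
		return "Loading...", nil
	}
	body, usedFallback, err := FetchWithFallback("https://example.com/docs", fake)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(usedFallback, utf8.RuneCountInString(body))
}
```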
4. JSON output with stats
The token_stats field in fetch responses includes the estimated raw and summary token counts and the reduction percentage, so agents can observe compression at runtime.
The agent does not just get text.
It gets:
- the answer
- the source URLs
- generated sub-queries
- fetch stats
- token reduction estimates
That makes the output more useful than a plain string.
Concrete output example
Below is a representative example of what the agent receives when running the benchmark query against docs.stripe.com/webhooks in summarize mode. The structure matches the real ResearchResponse JSON schema.
Query: webhook event verification best practices
Pages fetched: 3
Mode: summarize
Cache hits: 0
{
  "query": "webhook event verification best practices",
  "sub_queries": [
    "Stripe webhook signature verification",
    "how to validate webhook payloads",
    "webhook endpoint security headers"
  ],
  "answer": "Verify webhook signatures using the raw request body and the Stripe-Signature header. Use stripe.webhooks.constructEvent() with your endpoint secret. Never use the parsed body — HMAC-SHA256 validation requires the exact raw bytes Stripe sent.",
  "sources": [
    {
      "url": "https://docs.stripe.com/webhooks",
      "title": "Use incoming webhooks to get real-time updates",
      "summary": "Stripe signs payloads using HMAC-SHA256 with the endpoint secret. Verify using the raw body and Stripe-Signature header before processing any event."
    }
  ],
  "stats": {
    "searched_results": 9,
    "fetched_pages": 3,
    "cache_hits": 0
  }
}
The agent receives a direct answer, a cited source URL, and structured metadata.
It does not receive Stripe’s navigation, cookie banner, footer, changelog, or unrelated SDK reference sections because those were not relevant to the query.
Measurable outcome
I measured how much content would be passed into a model during one research session.
The test fetched three documentation pages:
- nextjs.org/docs/app/building-your-application/data-fetching
- docs.stripe.com/webhooks
- next-intl.dev/docs/getting-started/app-router
Token counts are estimated as chars / 4, the same method used internally by the tool. This is an approximation, not a precise tokenizer count. Real numbers will vary by tokenizer.
| Source | Size | Estimated tokens |
|---|---|---|
| Raw HTML, all 3 pages | 2.9 MB | 752,671 |
| Raw markdown, all 3 pages | 93.2 KB | 23,851 |
| wr output, summarize mode | — | 1,608 |
The result:
- 93.3% reduction vs. raw markdown
- 99.8% reduction vs. raw HTML
Most of the removed content was scripts, navigation, cookie banners, repeated footers, and page boilerplate — content that was not useful for this research task.
Results vary. These three pages are content-heavy documentation sites. A page that is mostly content to begin with will show a smaller reduction. The retrieval mode also affects output size: chunks mode returns more than summarize, and lossless returns the full cleaned page.
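The arithmetic behind those percentages is easy to reproduce. The sketch below applies the chars / 4 estimate and recomputes both reductions from the table's token counts. The function names are illustrative, and counting runes rather than bytes is an assumption of this example.

```go
package main

import (
	"fmt"
	"math"
	"unicode/utf8"
)

// EstimateTokens applies the chars / 4 heuristic. Counting runes rather than
// bytes is an assumption of this sketch.
func EstimateTokens(text string) int {
	return utf8.RuneCountInString(text) / 4
}

// ReductionPercent returns how much smaller after is than before, as a
// percentage rounded to one decimal place.
func ReductionPercent(before, after int) float64 {
	if before <= 0 {
		return 0
	}
	pct := (1 - float64(after)/float64(before)) * 100
	return math.Round(pct*10) / 10
}

func main() {
	fmt.Println(EstimateTokens("The cheapest token is the one you never send."))
	fmt.Println(ReductionPercent(23851, 1608))  // versus raw markdown: 93.3
	fmt.Println(ReductionPercent(752671, 1608)) // versus raw HTML: 99.8
}
```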
What I’d do differently / Lessons learned
1. Do not build summarization first
I added summarize mode before lossless and chunks.
It should have been last, after proving the extraction pipeline worked well on its own.
Summarization hides extraction problems. Noisy input produces a noisy summary, and the summary can still look fine at a glance.
2. Clean extraction matters more than summarization quality
The HTML-to-markdown conversion and content cleaning steps have more impact on output quality than the summarization prompt.
Get those right first.
3. Token reduction is only useful if source quality is preserved
A small output that drops relevant content is worse than a large output that keeps it.
Measure both compression and recall.
For this benchmark, the output preserved cited source references. A future evaluation should also measure whether expected key facts and sections are retained across retrieval modes.
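One simple way to start measuring recall is to check how many expected key phrases survive in the output. The sketch below is a rough illustration of that idea, not an evaluation the benchmark actually ran. KeyFactRecall and the case-insensitive substring match are assumptions, and a substring check will miss paraphrased facts.

```go
package main

import (
	"fmt"
	"strings"
)

// KeyFactRecall returns the fraction of expected key phrases that appear in
// the output, compared case-insensitively. Substring matching is a crude
// proxy for retention and is an assumption of this sketch.
func KeyFactRecall(output string, expected []string) float64 {
	if len(expected) == 0 {
		return 1
	}
	lower := strings.ToLower(output)
	found := 0
	for _, fact := range expected {
		if strings.Contains(lower, strings.ToLower(fact)) {
			found++
		}
	}
	return float64(found) / float64(len(expected))
}

func main() {
	output := "Verify webhook signatures using the raw request body and the Stripe-Signature header."
	expected := []string{"raw request body", "Stripe-Signature", "endpoint secret", "HMAC-SHA256"}
	// Two of the four expected phrases appear in the output.
	fmt.Println(KeyFactRecall(output, expected))
}
```

Running the same expected-phrase list against lossless, chunks, and summarize output for one query would show how much each mode trades recall for size.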
4. Agents need structured output, not text blobs
The JSON response with sources, sub_queries, and stats is more useful than a plain string.
The agent can log stats, surface sources to the user, and decide how to handle the answer based on that metadata.
5. Be precise about what “tokens” means in benchmarks
Saying “tokens” without specifying the estimation method makes the number unverifiable.
chars / 4 is an approximation. Say so upfront.
Recommendation for other builders
If you are building AI agents that browse the web, do not start by optimizing prompts.
First look at what you are putting into the context window.
Most raw web content is not signal. Before you tune a system prompt or switch models, check whether your agent is receiving hundreds of thousands of estimated tokens of navigation menus, cookie banners, and JavaScript.
Fixing that upstream — by running fetched content through a cleaning and extraction step before it reaches the model — can have more impact than downstream prompt tuning.
The simplest version of this is three lines of MCP config pointing at wr-mcp, or one wr research --mode=chunks call in your agent loop instead of a raw fetch.
The cheapest token is the one you never send.
Try it / Links
web-research is open source:
GitHub repo → https://github.com/mrvarmazyar/web-research
If you are working on agents, MCP tools, or research workflows, Mohammad is actively looking for feedback on the design, retrieval modes, and edge cases.
Want to build agents with better web context?
📌 Sign up free → agent.tinyfish.ai
Docs → docs.tinyfish.ai
Open source Cookbook → github.com/tinyfish-io/tinyfish-cookbook



