Build Logs

Give Agents Better Context, Not More Context

Mohammad Varmazyar · Jun 19, 2026

TL;DR

I built web-research, an open-source CLI and MCP server that helps AI agents search, fetch, clean, and return focused web content instead of sending raw HTML or noisy markdown into the model context.

In one benchmark across three real documentation pages, it reduced context from 23,851 estimated tokens to 1,608 estimated tokens — a 93.3% reduction versus raw markdown. Token counts are estimated as chars / 4, not a precise tokenizer.

“The cheapest token is the one you never send.”

The problem

A few months ago, I started seeing discussions about TinyFish making web search free for AI agents.

That caught my attention.

But what I kept thinking about was not the search cost.

It was what happens after the search.

If an agent searches the web and then sends everything it finds into the model context, you are not solving the problem. You are just moving it.

Search can be cheap.

Processing noise is not.

When an agent fetches a web page and passes the result directly into context, it pays for content it never uses. Raw HTML causes context overflows, higher inference cost, and worse answers because the model has to reason through scripts, layout markup, and navigation before it gets to the actual content.

Raw HTML often includes:

JavaScript bundles
inline styles
navigation bars
cookie consent banners
duplicated text across sections
boilerplate headers, footers, and sidebars

Cleaned markdown is better, but it can still be too large.

The three documentation pages I tested totaled 93.2 KB of markdown. That is not a large research session by agent standards, and it was already too much.

The model does not need everything.

It needs the useful parts.

Getting that distinction right is the actual problem.

What I built

web-research is an open-source CLI tool and MCP server written in Go. It handles the full research pipeline:

Search the web
Fetch the page
Clean HTML into markdown
Extract or rank useful content
Return structured JSON output with sources

The CLI supports commands like:

wr search "MCP server authentication patterns"
wr fetch "https://docs.anthropic.com/en/docs/tool-use" "tool result format"
wr research "how does query decomposition work in RAG"
wr research --mode=chunks "Next.js App Router data fetching"
wr research --mode=lossless "Stripe webhook signature verification"

The MCP server registers three tools via modelcontextprotocol/go-sdk, so Claude Code and Cursor can call them natively:

web_search
web_fetch
web_research

The output is structured JSON. A web_research response looks like this:

{
  "query": "Next.js App Router data fetching",
  "sub_queries": [
    "Next.js App Router fetch patterns",
    "server components data loading"
  ],
  "answer": "...",
  "sources": [
    {
      "url": "https://nextjs.org/docs/...",
      "title": "Data Fetching",
      "summary": "..."
    }
  ],
  "stats": {
    "searched_results": 10,
    "fetched_pages": 3,
    "cache_hits": 1
  }
}
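On the consuming side, an agent written in Go could decode this response into typed structs. The sketch below is illustrative: the JSON field names come from the example above, but the Go type names (Source, Stats, ResearchResponse) and the parseResponse helper are assumptions, not the project's actual source code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Source is one cited page. Field names follow the JSON example;
// the Go type itself is an assumption made for illustration.
type Source struct {
	URL     string `json:"url"`
	Title   string `json:"title"`
	Summary string `json:"summary"`
}

// Stats mirrors the "stats" object in the response.
type Stats struct {
	SearchedResults int `json:"searched_results"`
	FetchedPages    int `json:"fetched_pages"`
	CacheHits       int `json:"cache_hits"`
}

// ResearchResponse mirrors the top-level web_research response.
type ResearchResponse struct {
	Query      string   `json:"query"`
	SubQueries []string `json:"sub_queries"`
	Answer     string   `json:"answer"`
	Sources    []Source `json:"sources"`
	Stats      Stats    `json:"stats"`
}

// parseResponse decodes raw tool output into a ResearchResponse.
func parseResponse(raw []byte) (ResearchResponse, error) {
	var r ResearchResponse
	err := json.Unmarshal(raw, &r)
	return r, err
}

func main() {
	raw := []byte(`{
	  "query": "Next.js App Router data fetching",
	  "sub_queries": ["Next.js App Router fetch patterns"],
	  "answer": "...",
	  "sources": [{"url": "https://nextjs.org/docs/...", "title": "Data Fetching", "summary": "..."}],
	  "stats": {"searched_results": 10, "fetched_pages": 3, "cache_hits": 1}
	}`)

	r, err := parseResponse(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	// The agent can log stats and surface sources separately from the answer.
	fmt.Printf("fetched %d pages, %d cache hits\n", r.Stats.FetchedPages, r.Stats.CacheHits)
	for _, s := range r.Sources {
		fmt.Printf("source: %s (%s)\n", s.Title, s.URL)
	}
}
```

Typed decoding lets the agent treat the answer, the sources, and the stats as separate signals instead of parsing them out of one string.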

Figure 1. Workflow diagram. Query → TinyFish Search → Fetch Pages → Clean HTML to Markdown → Retrieval Mode → Structured JSON to Agent.

Where TinyFish fits

TinyFish is the search and HTTP access layer.

web-research is the content-processing layer that runs after the HTTP response arrives.

They sit in sequence:

TinyFish handles search.
web-research cleans and processes the returned content.
The agent receives only the focused context it needs, along with source references and stats.

TinyFish gets the web results back.

web-research decides what from those results should actually enter the model context.

The problem I focused on lives in that second step.

What worked

1. Separating retrieval from summarization

Not every research task needs an LLM summary. That was the clearest lesson early on.

I added three retrieval modes, set with --mode:

lossless returns full cleaned markdown with no LLM call. Use it when the agent needs the complete source text for comparison, code extraction, or downstream processing.

wr research --mode=lossless "Stripe webhook payload structure"

chunks splits content into paragraphs or sections, scores each against the query using TF-IDF cosine similarity, and returns the top five by default, in their original reading order. Use it when the task needs relevant sections without paying for summarization.

wr research --mode=chunks --top-k=3 "Next.js App Router caching"
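To make the chunks idea concrete, here is a minimal sketch of TF-IDF cosine ranking that returns the top k chunks in reading order. It is not the tool's implementation: the tokenizer, the smoothed IDF formula, and the function names (tokenize, tfidf, cosine, topChunks) are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
	"unicode"
)

// tokenize lowercases the text and splits on anything that is not a letter or digit.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// tfidf builds a sparse TF-IDF vector for one text, given document
// frequencies df computed over n chunks.
func tfidf(text string, df map[string]int, n int) map[string]float64 {
	vec := map[string]float64{}
	for _, t := range tokenize(text) {
		vec[t]++ // raw term frequency
	}
	for t, tf := range vec {
		idf := math.Log(float64(1+n)/float64(1+df[t])) + 1 // smoothed IDF (an assumption)
		vec[t] = tf * idf
	}
	return vec
}

// cosine returns the cosine similarity of two sparse vectors.
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for t, w := range a {
		dot += w * b[t]
		na += w * w
	}
	for _, w := range b {
		nb += w * w
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topChunks returns the indices of the k best-scoring chunks, sorted back
// into reading order.
func topChunks(chunks []string, query string, k int) []int {
	df := map[string]int{}
	for _, c := range chunks {
		seen := map[string]bool{}
		for _, t := range tokenize(c) {
			if !seen[t] {
				seen[t] = true
				df[t]++
			}
		}
	}
	q := tfidf(query, df, len(chunks))
	scores := make([]float64, len(chunks))
	idx := make([]int, len(chunks))
	for i, c := range chunks {
		scores[i] = cosine(q, tfidf(c, df, len(chunks)))
		idx[i] = i
	}
	// Highest score first; stable so ties keep document order.
	sort.SliceStable(idx, func(a, b int) bool { return scores[idx[a]] > scores[idx[b]] })
	if k < 0 {
		k = 0
	}
	if k > len(idx) {
		k = len(idx)
	}
	top := idx[:k]
	sort.Ints(top) // restore reading order
	return top
}

func main() {
	chunks := []string{
		"Install the package with npm and restart the dev server.",
		"The App Router caches fetch responses on the server by default.",
		"Our team is hiring; see the careers page for open roles.",
		"Use revalidate to control how long cached data stays fresh.",
	}
	fmt.Println(topChunks(chunks, "App Router cached data", 2)) // prints [1 3]
}
```

Note that this sketch does no stemming, so "caching" in a query would not match "cached" in a chunk. A production ranker would likely normalize terms further.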

summarize runs LLM compression over the fetched content and returns a compact answer with cited sources. This is the default mode and is useful when the agent needs a concise answer it can act on immediately.

A summarization step adds latency and an LLM call that is sometimes unnecessary. When you skip it with chunks or lossless, you still get clean, processed content.

2. Query decomposition and parallel fetching

A single query often misses important angles.

wr research sends the original query to a configured LLM, currently Groq or GitHub Copilot, with a prompt that asks for two to three distinct sub-queries covering different angles. The result is capped at three.

Then it:

searches all sub-queries in parallel
deduplicates URLs across results
fetches the top unique pages concurrently
processes each page by the selected retrieval mode

Running searches and fetches concurrently keeps each additional sub-query from adding latency linearly.

Deduplication also matters because the same URL frequently appears across multiple sub-queries. Fetching it twice is pure waste.
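The fan-out, deduplicate, then fetch steps above can be sketched with goroutines and a sync.WaitGroup. This is a simplified illustration rather than the project's code: SearchFunc and FetchFunc are hypothetical stand-ins for the real search and fetch clients, and error handling is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// SearchFunc and FetchFunc are hypothetical stand-ins for the real
// search backend and page fetcher.
type SearchFunc func(query string) []string
type FetchFunc func(url string) string

// searchAll runs every sub-query concurrently and returns unique URLs,
// ordered by sub-query and then by rank within each result list.
func searchAll(subQueries []string, search SearchFunc) []string {
	results := make([][]string, len(subQueries))
	var wg sync.WaitGroup
	for i, q := range subQueries {
		wg.Add(1)
		go func(i int, q string) {
			defer wg.Done()
			results[i] = search(q) // each goroutine writes only its own slot
		}(i, q)
	}
	wg.Wait()

	seen := map[string]bool{}
	var unique []string
	for _, urls := range results {
		for _, u := range urls {
			if !seen[u] {
				seen[u] = true
				unique = append(unique, u)
			}
		}
	}
	return unique
}

// fetchTop fetches the first limit URLs concurrently, preserving order.
func fetchTop(urls []string, limit int, fetch FetchFunc) []string {
	if limit > len(urls) {
		limit = len(urls)
	}
	pages := make([]string, limit)
	var wg sync.WaitGroup
	for i, u := range urls[:limit] {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			pages[i] = fetch(u)
		}(i, u)
	}
	wg.Wait()
	return pages
}

func main() {
	// Fake search results: the same URL shows up under two sub-queries.
	fakeSearch := func(q string) []string {
		switch q {
		case "webhook signature verification":
			return []string{"https://example.com/a", "https://example.com/b"}
		case "validate webhook payloads":
			return []string{"https://example.com/b", "https://example.com/c"}
		}
		return nil
	}
	fakeFetch := func(u string) string { return "content of " + u }

	urls := searchAll([]string{"webhook signature verification", "validate webhook payloads"}, fakeSearch)
	fmt.Println(len(urls), "unique URLs") // prints 3 unique URLs
	for _, p := range fetchTop(urls, 2, fakeFetch) {
		fmt.Println(p)
	}
}
```

Writing results into a pre-sized slice by index avoids a mutex and keeps the output order deterministic even though the calls finish in arbitrary order.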

3. Jina fallback for sparse JavaScript-rendered pages

TinyFish handles the primary search and fetch path.

Some pages return sparse HTML when fetched directly. This often happens with JavaScript-rendered SPAs, including React or Vue documentation sites that build their content client-side.

A 200 status does not mean useful content arrived.

When web-research fetches a page and the cleaned body text is under 500 characters, too sparse to be useful, it retries the URL through r.jina.ai as a secondary fallback before processing the content. Jina renders the page server-side and returns clean markdown. If Jina returns more content than the direct fetch did, Jina wins.

The 500-character threshold is a heuristic. A real page with less than 500 characters of useful content would still trigger the fallback, but that is rare in practice.
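A minimal sketch of that fallback rule, under a few assumptions: the Fetcher type and fetchWithFallback name are hypothetical, the fetchers are injected so the logic can be tested without network access, and prefixing the target URL with https://r.jina.ai/ is how I would expect the Jina Reader endpoint to be called. How the real tool handles direct-fetch errors is not covered in this post; here a failed direct fetch is simply returned as an error.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// minBodyChars is the sparse-content threshold described in the text.
const minBodyChars = 500

// Fetcher is a hypothetical function type that returns cleaned body text for a URL.
type Fetcher func(url string) (string, error)

// fetchWithFallback tries the direct fetch first. If the cleaned body is
// under the threshold, it retries through Jina and keeps whichever result
// has more content.
func fetchWithFallback(url string, direct, viaJina Fetcher) (string, error) {
	body, err := direct(url)
	if err != nil {
		return "", err // assumption: errors are not retried through Jina
	}
	if utf8.RuneCountInString(strings.TrimSpace(body)) >= minBodyChars {
		return body, nil // enough content, no fallback needed
	}
	alt, altErr := viaJina("https://r.jina.ai/" + url)
	if altErr == nil && utf8.RuneCountInString(alt) > utf8.RuneCountInString(body) {
		return alt, nil // Jina returned more content, so Jina wins
	}
	return body, nil // keep the sparse direct result
}

func main() {
	// Fake fetchers standing in for real HTTP calls.
	sparse := func(string) (string, error) { return "<div id=\"root\"></div>", nil }
	rendered := func(string) (string, error) { return strings.Repeat("rendered docs ", 100), nil }

	body, err := fetchWithFallback("https://example.com/spa-docs", sparse, rendered)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println("characters returned:", utf8.RuneCountInString(body))
}
```

Comparing lengths before accepting the fallback guards against the case where Jina also returns little or nothing.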

Of the three benchmark pages, nextjs.org/docs and next-intl.dev/docs triggered the Jina fallback during the benchmark run. docs.stripe.com/webhooks returned sufficient content directly.

4. JSON output with stats

The token_stats field in fetch responses includes the estimated raw and summary token counts and the reduction percentage, so agents can observe compression at runtime.

The agent does not just get text.

It gets:

the answer
the source URLs
generated sub-queries
fetch stats
token reduction estimates

That makes the output more useful than a plain string.

Concrete output example

Below is a representative example of what the agent receives when running the benchmark query against docs.stripe.com/webhooks in summarize mode. The structure matches the real ResearchResponse JSON schema.

Query: webhook event verification best practices

Pages fetched: 3

Mode: summarize

Cache hits: 0

{
  "query": "webhook event verification best practices",
  "sub_queries": [
    "Stripe webhook signature verification",
    "how to validate webhook payloads",
    "webhook endpoint security headers"
  ],
  "answer": "Verify webhook signatures using the raw request body and the Stripe-Signature header. Use stripe.webhooks.constructEvent() with your endpoint secret. Never use the parsed body — HMAC-SHA256 validation requires the exact raw bytes Stripe sent.",
  "sources": [
    {
      "url": "https://docs.stripe.com/webhooks",
      "title": "Use incoming webhooks to get real-time updates",
      "summary": "Stripe signs payloads using HMAC-SHA256 with the endpoint secret. Verify using the raw body and Stripe-Signature header before processing any event."
    }
  ],
  "stats": {
    "searched_results": 9,
    "fetched_pages": 3,
    "cache_hits": 0
  }
}

The agent receives a direct answer, a cited source URL, and structured metadata.

It does not receive Stripe’s navigation, cookie banner, footer, changelog, or unrelated SDK reference sections because those were not relevant to the query.

Measurable outcome

I measured how much content would be passed into a model during one research session.

The test fetched three documentation pages:

nextjs.org/docs/app/building-your-application/data-fetching
docs.stripe.com/webhooks
next-intl.dev/docs/getting-started/app-router

Token counts are estimated as chars / 4, the same method used internally by the tool. This is an approximation, not a precise tokenizer count. Real numbers will vary by tokenizer.

Source	Size	Estimated tokens
Raw HTML, all 3 pages	2.9 MB	752,671
Raw markdown, all 3 pages	93.2 KB	23,851
wr output, summarize mode	—	1,608

The result:

93.3% reduction vs. raw markdown
99.8% reduction vs. raw HTML
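The percentages follow directly from the table. As a quick check, this short sketch recomputes them; the function names estimateTokens and reductionPct are illustrative, not taken from the tool.

```go
package main

import "fmt"

// estimateTokens applies the chars / 4 approximation used in this post.
func estimateTokens(chars int) int {
	return chars / 4
}

// reductionPct returns how much smaller after is than before, in percent.
func reductionPct(before, after int) float64 {
	if before == 0 {
		return 0
	}
	return 100 * (1 - float64(after)/float64(before))
}

func main() {
	fmt.Println(estimateTokens(400)) // prints 100

	// Estimated token counts from the benchmark table.
	const rawHTML, rawMarkdown, wrOutput = 752671, 23851, 1608

	fmt.Printf("vs. raw markdown: %.1f%%\n", reductionPct(rawMarkdown, wrOutput)) // prints 93.3%
	fmt.Printf("vs. raw HTML: %.1f%%\n", reductionPct(rawHTML, wrOutput))         // prints 99.8%
}
```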

Most of the removed content was scripts, navigation, cookie banners, repeated footers, and page boilerplate — content that was not useful for this research task.

Results vary. These three pages are content-heavy documentation sites. A page that is mostly content to begin with will show a smaller reduction. The retrieval mode also affects output size: chunks mode returns more than summarize, and lossless returns the full cleaned page.

What I’d do differently / Lessons learned

1. Do not build summarization first

I added summarize mode before lossless and chunks.

It should have been last, after proving the extraction pipeline worked well on its own.

Summarization hides extraction problems. Noisy input produces a noisy summary, and the summary can still look fine at a glance.

2. Clean extraction matters more than summarization quality

The HTML-to-markdown conversion and content cleaning steps have more impact on output quality than the summarization prompt.

Get those right first.

3. Token reduction is only useful if source quality is preserved

A small output that drops relevant content is worse than a large output that keeps it.

Measure both compression and recall.

For this benchmark, the output preserved cited source references. A future evaluation should also measure whether expected key facts and sections are retained across retrieval modes.
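One simple way to start measuring recall is a key-fact checklist: list the facts an ideal answer should contain and count how many survive in the output. The sketch below is an assumption about how such an evaluation could look, not something the tool does today. The factRecall function and the checklist are hypothetical, and case-insensitive substring matching is a deliberately crude proxy.

```go
package main

import (
	"fmt"
	"strings"
)

// factRecall returns the fraction of expected key facts that appear in the
// output, using case-insensitive substring matching.
func factRecall(output string, expected []string) float64 {
	if len(expected) == 0 {
		return 1
	}
	lower := strings.ToLower(output)
	hits := 0
	for _, fact := range expected {
		if strings.Contains(lower, strings.ToLower(fact)) {
			hits++
		}
	}
	return float64(hits) / float64(len(expected))
}

func main() {
	// Shortened version of the summarize-mode answer shown earlier.
	answer := "Verify webhook signatures using the raw request body and the " +
		"Stripe-Signature header. HMAC-SHA256 validation requires the exact raw bytes."

	// Hypothetical checklist; the last item is intentionally absent from the answer.
	expected := []string{"Stripe-Signature", "raw request body", "HMAC-SHA256", "idempotency"}

	fmt.Printf("recall: %.2f\n", factRecall(answer, expected)) // prints recall: 0.75
}
```

Running the same checklist against lossless, chunks, and summarize output would show where each mode starts dropping facts as compression increases.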

4. Agents need structured output, not text blobs

The JSON response with sources, sub_queries, and stats is more useful than a plain string.

The agent can log stats, surface sources to the user, and decide how to process the answer.

5. Be precise about what “tokens” means in benchmarks

Saying “tokens” without specifying the estimation method makes the number unverifiable.

chars / 4 is an approximation. Say so upfront.

Recommendation for other builders

If you are building AI agents that browse the web, do not start by optimizing prompts.

First look at what you are putting into the context window.

Most raw web content is not signal. Before you tune a system prompt or switch models, check whether your agent is receiving hundreds of thousands of estimated tokens of navigation menus, cookie banners, and JavaScript.

Fixing that upstream — by running fetched content through a cleaning and extraction step before it reaches the model — can have more impact than downstream prompt tuning.

The simplest version of this is three lines of MCP config pointing at wr-mcp, or one wr research --mode=chunks call in your agent loop instead of a raw fetch.
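For an agent loop written in Go, the CLI route could look roughly like this. It is a sketch under assumptions: that wr is installed and on PATH, and that it writes its JSON response to stdout. The researchCmd helper is hypothetical; the subcommand and flag are the ones shown earlier in this post.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// researchCmd builds the CLI invocation for one query in chunks mode.
func researchCmd(query string) *exec.Cmd {
	return exec.Command("wr", "research", "--mode=chunks", query)
}

func main() {
	// Assumption: wr prints its JSON response to stdout.
	out, err := researchCmd("Next.js App Router caching").Output()
	if err != nil {
		fmt.Println("wr failed (is it installed and on PATH?):", err)
		return
	}

	// Decode loosely so the sketch does not depend on the exact schema.
	var resp map[string]interface{}
	if err := json.Unmarshal(out, &resp); err != nil {
		fmt.Println("could not decode wr output:", err)
		return
	}
	fmt.Println("sources:", resp["sources"])
}
```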

The cheapest token is the one you never send.

Try it / Links

web-research is open source:

GitHub repo → https://github.com/mrvarmazyar/web-research

If you are working on agents, MCP tools, or research workflows, I am actively looking for feedback on the design, retrieval modes, and edge cases.

Want to build agents with better web context?

📌 Sign up free → agent.tinyfish.ai

Docs → docs.tinyfish.ai

Open source Cookbook → github.com/tinyfish-io/tinyfish-cookbook