#1 on WebVoyager: TinyFish Scores 91.1% on Live Web Browser Agent Benchmark

We commissioned an independent eval lab to run the full WebVoyager benchmark across four browser agents, all tested under identical conditions with cross-family LLM grading. This post covers the results, the methodology, and what we learned from the failures.
TLDR: TinyFish achieved 91.1% accuracy on WebVoyager (641 tasks across 15 live websites), ranking #1 against BrowserUse (88.3%), Smooth (86.6%), and Notte (84.2%). WebVoyager is the standard independent benchmark for browser agents. It runs on real websites with live bot detection, not cached snapshots, which makes this the most rigorous comparison available.
What is the WebVoyager benchmark?
WebVoyager (He et al., 2024) is a benchmark that tests browser agents on 641 real tasks across 15 live websites. Unlike sandboxed or cached evaluations, WebVoyager runs against the actual web: real Amazon with real bot detection, real Google Flights with real A/B tests, real ESPN with content that changes between page loads.
That’s the point. If your agent can only navigate a frozen snapshot of a website, you haven’t built a web agent. You’ve built a demo.
The 15 sites span three difficulty tiers: static sites like ArXiv and Wolfram Alpha, dynamic sites like Amazon, Apple, and GitHub where content shifts constantly, and adversarial sites like Google Flights and Booking.com that actively fight automation with CAPTCHAs and fingerprinting.
Which AI web agent is the most accurate?
TinyFish ranked #1 of 4 agents, independently evaluated by Mersault.
| Agent | Pass Rate | Gap to #1 |
|---|---|---|
| TinyFish | 91.1% | — |
| BrowserUse | 88.3% | −2.8 pts |
| Smooth | 86.6% | −4.5 pts |
| Notte | 84.2% | −6.9 pts |

At this accuracy level, every point is hard-won. The 2.8-point lead over BrowserUse translates to roughly 18 additional tasks completed correctly out of 641.
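The conversion from percentage points to task counts is plain arithmetic. Here is a minimal Python sketch that uses only the pass rates from the table above and assumes all four agents share the same 641-task denominator. The names `PASS_RATES` and `gap_in_tasks` are ours, chosen for illustration.

```python
# Pass rates (%) copied from the results table; 641 is the task count reported in this post.
TOTAL_TASKS = 641
PASS_RATES = {"TinyFish": 91.1, "BrowserUse": 88.3, "Smooth": 86.6, "Notte": 84.2}


def gap_in_tasks(leader: str, other: str, total: int = TOTAL_TASKS) -> float:
    """Convert the percentage-point gap between two agents into a task count."""
    gap_points = PASS_RATES[leader] - PASS_RATES[other]
    return gap_points / 100 * total


if __name__ == "__main__":
    for agent in ("BrowserUse", "Smooth", "Notte"):
        tasks = round(gap_in_tasks("TinyFish", agent))
        print(f"TinyFish vs {agent}: about {tasks} tasks")
```

Because the published pass rates are rounded to one decimal place, these figures are estimates rather than exact counts.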
TinyFish held state-of-the-art on 8 of 15 websites, including perfect scores on Apple (100%, 42/42) and ESPN (100%, 44/44). On dynamic sites, the closest proxy for what production workflows actually look like, TinyFish scored 95%, five points above the field average.
What WebVoyager does (and doesn’t) measure
WebVoyager measures whether an agent can complete a specific task on a live website and return the correct answer. It’s a strong test of real-world accuracy. It does not measure latency, cost efficiency, or how well the agent handles tasks outside the 15-site set.
That scope limitation matters. A benchmark result doesn’t tell you everything about an agent. But when the benchmark is run properly (on the live web, with cross-family grading, under identical conditions) it tells you something important about which agents actually work and which ones look good only under controlled conditions.
This is why methodology matters as much as scores. Some published benchmark results use the same model to execute and grade tasks, which is like grading your own exam. Others run against cached snapshots, which strips out the complexity that makes the web hard. We wanted results we could stand behind, so we set up an evaluation designed to be hard to game.
Why TinyFish ranks first: intelligence vs. infrastructure
The headline number tells you who won. The failure analysis tells you something more useful: which agent is closest to its ceiling and which one has room to run.
We broke down every failure across all four agents into two categories:
Infrastructure failures (blocks, CAPTCHAs, timeouts): The agent understood the task but got stopped by something external. These are engineering problems with engineering fixes.
Reasoning failures (wrong answer from the right page): The agent reached the correct information and got the answer wrong. This is a comprehension problem, and it’s much harder to fix incrementally.
| Agent | Infrastructure failures | Reasoning failures |
|---|---|---|
| TinyFish | 75% | 25% |
| BrowserUse | 4% (crash/block) | 88% |
| Smooth | 26% (crash/block) | 72% |
| Notte | 79% (63% timeouts) | 21% |

When TinyFish fails, it’s overwhelmingly because something blocked the agent, not because the agent didn’t understand the task. When BrowserUse and Smooth fail, it’s usually because the agent reached the right page and extracted the wrong answer.
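The percentages in the failure table can be turned into approximate task counts. The sketch below assumes each percentage is a share of that agent's own failed tasks, and that failed tasks equal 641 minus the passes implied by the published pass rate. Both are our reading of the tables, not something the harness output confirms. The published shares for BrowserUse and Smooth sum to less than 100%, so the remainder is left unclassified here.

```python
TOTAL_TASKS = 641

# (pass rate %, infrastructure share of failures %, reasoning share of failures %)
# All values are copied from the two tables in this post.
RESULTS = {
    "TinyFish": (91.1, 75, 25),
    "BrowserUse": (88.3, 4, 88),
    "Smooth": (86.6, 26, 72),
    "Notte": (84.2, 79, 21),
}


def estimate_failures(agent: str, total: int = TOTAL_TASKS) -> dict:
    """Estimate absolute failure counts per category for one agent."""
    pass_rate, infra_share, reasoning_share = RESULTS[agent]
    failed = (100 - pass_rate) / 100 * total
    return {
        "failed": failed,
        "infrastructure": failed * infra_share / 100,
        "reasoning": failed * reasoning_share / 100,
    }


if __name__ == "__main__":
    for agent in RESULTS:
        est = estimate_failures(agent)
        print(agent, {name: round(value) for name, value in est.items()})
```

Under these assumptions, TinyFish's reasoning failures come to roughly 14 tasks, against roughly 66 for BrowserUse. Treat both as estimates derived from rounded percentages.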
For anyone evaluating web agents for production: infrastructure problems have a clear engineering roadmap. Each blocked site, each timeout, each CAPTCHA maps to a concrete fix.
Reasoning gaps are a harder ceiling. You can’t patch your way to better comprehension.
Where TinyFish wins, and where it doesn’t
SOTA on 8 of 15 websites. Perfect on Apple and ESPN. 98% on ArXiv, Coursera, and GitHub. 95% on Booking. 93.3% reliability across the full task set.
But: TinyFish trails on Google Flights (74%) and Cambridge Dictionary (67%). We had the highest CAPTCHA-block and timeout counts of the four agents. On adversarial sites overall, we scored 84% against a field best of 91%.
These are real gaps. We’re not explaining them away. They’re also gaps with clear causes (anti-bot handling, session management on adversarial sites, step budgets for deep multi-step flows) and we’re already shipping fixes.
Publishing the weaknesses alongside the wins is the point. A benchmark that only shows the good numbers isn’t a benchmark. It’s marketing.
How the benchmark was run
641 tasks. Full WebVoyager set. No removals except 2 structurally untestable tasks. 101 tasks were patched where reference answers had gone stale.
One operator. All four agents ran Claude Sonnet as the underlying model (where selectable). This keeps the comparison clean: you’re measuring the agent’s infrastructure, not which LLM it happens to use.
Cross-family grading. GPT-4o graded every task. Using a different model family for grading avoids the self-preference bias that arises when a model grades its own output. Binary pass/fail. No partial credit. No manual reclassification.
One window. All agents tested within the same May 2026 window using the Mersault Benchmark Harness v0.1.0. Identical timeout policies, identical scoring rules, with manual audit for known judge biases.
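The grading rules above can be sketched in a few lines of Python. This is an illustration, not the Mersault harness: the model name strings, the `MODEL_FAMILY` mapping, and the `Verdict` type are hypothetical stand-ins. It shows two things: a guard that the judge comes from a different model family than the agent, and a pass rate computed from strictly binary verdicts.

```python
from dataclasses import dataclass

# Hypothetical name-to-vendor mapping, for illustration only.
MODEL_FAMILY = {"claude-sonnet": "anthropic", "gpt-4o": "openai"}


def is_cross_family(agent_model: str, judge_model: str) -> bool:
    """True when the judge belongs to a different model family than the agent."""
    return MODEL_FAMILY[agent_model] != MODEL_FAMILY[judge_model]


@dataclass(frozen=True)
class Verdict:
    task_id: str
    passed: bool  # binary outcome: no partial credit


def pass_rate(verdicts: list) -> float:
    """Percentage of tasks the judge marked as passed."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return 100 * sum(v.passed for v in verdicts) / len(verdicts)


if __name__ == "__main__":
    assert is_cross_family("claude-sonnet", "gpt-4o")
    toy = [Verdict("t1", True), Verdict("t2", True), Verdict("t3", False), Verdict("t4", True)]
    print(f"pass rate: {pass_rate(toy):.1f}%")
```

Keeping the verdict a plain boolean means the score cannot drift through partial credit or later reclassification.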
Choosing a web agent: what this means for you
If you’re evaluating AI web agents, the WebVoyager results give you one clean comparison point: four agents, identical conditions, independently graded. TinyFish leads at 91.1%, with a failure profile that skews toward fixable infrastructure problems rather than fundamental reasoning limitations.
But a benchmark is one input, not the full picture. Try TinyFish on your own workflows.
Start with our Agent Playground, and check out our docs to see how it performs on the sites and tasks that matter to you and your team.
Join our Discord to get in touch with our engineering team.
FAQ
What is the WebVoyager benchmark? WebVoyager is a browser agent benchmark consisting of 641 tasks across 15 live websites. Designed by He et al. (2024), it tests whether agents can complete real tasks on the actual web, not cached snapshots, covering sites from Amazon to Google Flights.
How was TinyFish evaluated, and by whom? TinyFish was evaluated independently by Mersault, an eval lab that tested four browser agents under identical conditions in May 2026. All agents used the same model (Claude Sonnet), were graded by a separate model (GPT-4o), and scored binary pass/fail with no partial credit.
What’s the difference between accuracy and reliability for a web agent? Accuracy measures whether the agent returns the correct answer. Reliability measures whether the agent completes the task without crashing, timing out, or hitting an unrecoverable error. TinyFish scored 91.1% accuracy and 93.3% reliability. The roughly 2.2-point gap between the two is the share of tasks where the agent finished but returned a wrong answer, which means the agent rarely fails silently.
How is TinyFish different from other AI web agents? TinyFish’s failure profile is predominantly infrastructure-based (75% of failures), meaning the agent understands tasks correctly but occasionally gets blocked by CAPTCHAs or timeouts. Competing agents showed 72 to 88% reasoning failures, where the agent reached the right page but extracted the wrong answer. Infrastructure problems have clearer engineering fixes than reasoning gaps.
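The accuracy and reliability definitions in the FAQ above can be made concrete with a small Python sketch. The outcome labels (`pass`, `wrong_answer`, `timeout`, and so on) are hypothetical, since this post does not publish the harness's real labels, and the sample data is a toy list, not benchmark output.

```python
from collections import Counter

# Hypothetical labels: a task counts as "completed" if the agent returned an answer,
# whether or not the answer was correct.
COMPLETED = {"pass", "wrong_answer"}


def accuracy_and_reliability(outcomes: list) -> tuple:
    """Return (accuracy %, reliability %) for a list of per-task outcome labels."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    counts = Counter(outcomes)
    total = len(outcomes)
    accuracy = 100 * counts["pass"] / total
    reliability = 100 * sum(counts[label] for label in COMPLETED) / total
    return accuracy, reliability


if __name__ == "__main__":
    # Toy data: 8 correct answers, 1 wrong answer, 1 timeout.
    toy = ["pass"] * 8 + ["wrong_answer"] + ["timeout"]
    accuracy, reliability = accuracy_and_reliability(toy)
    print(f"accuracy={accuracy:.1f}% reliability={reliability:.1f}%")
```

In this framing, the difference between the two numbers is the share of tasks that finished with a wrong answer.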


