TinyFish set out to build web agents that solve real-world problems for DoorDash, Google Hotels, ClassPass, and all the smaller businesses trying to keep up with the giants. That's what we do every day, at production scale.
But we also wanted to test ourselves against the public benchmarks. Not because benchmarks are the goal. They rarely translate to real-world performance. The constraints aren't realistic, and it doesn't matter if your agent can play interactive chess puzzles on the web. What matters is whether it can solve your problem faster, cheaper, and at scale.
Still, Mind2Web is the most rigorous public evaluation for web agents right now, with 300 tasks across 136 live websites, three difficulty levels, and human evaluation. It's where OpenAI Operator, Claude Computer Use, and Browser Use all have published scores. So we ran it.
We ran TinyFish through the full benchmark in parallel. Here are the results alongside the current leaderboard:
| Agent | Easy | Medium | Hard | Total |
|---|---|---|---|---|
| TinyFish | 97.5% | 89.9% | 81.9% | 89.9% |
| Operator (OpenAI) | 83.1% | 58.0% | 43.2% | 61.3% |
| Claude Computer Use 3.7 | 90.4% | 49.0% | 32.4% | 56.3% |
| Browser Use | 55.4% | 26.6% | 8.1% | 30.0% |
We're submitting to the official leaderboard. In the meantime, we've published every run so you don't have to take our word for it.
All 300 tasks, including every failure → [Public] TinyFish-Mind2Web Agent Runs
The rest of this post covers what these tasks actually involve, how we failed, and why we think the system works.
Mind2Web tasks run on live websites. The actual site, with all its pop-ups, dynamic pricing, and form validation.
An easy task: "Browse Marriott Bonvoy credit cards on Marriott." Navigate, find the section, view the listings. A few clicks.
A hard task: "Book 4 tickets in the upper section for any Kevin Hart show in New York in the next three months and view ticket prices with estimated fees." That's StubHub. Search events, filter by date range and location, select a show, choose a seating section, set ticket quantity, navigate a pricing page where fees calculate in real time. Ten-plus steps where things change between page loads.
Another hard task: "Find the highest critic-scored red or white wine from Oregon, priced under $40, that pairs well with fish or dessert." Multiple filters in sequence on wineaccess.com, constraint checking against each result, paginated inventory that shifts underneath you.
The benchmark evaluates each intermediate step, not just the final answer. And this is what makes the easy-to-hard drop the most interesting number in the results:
| Agent | Easy | Hard | Drop |
|---|---|---|---|
| TinyFish | 97.5% | 81.9% | 15.6 pts |
| Operator | 83.1% | 43.2% | 39.9 pts |
| Claude Computer Use 3.7 | 90.4% | 32.4% | 58.0 pts |
| Browser Use | 55.4% | 8.1% | 47.3 pts |
Hard tasks compound errors. Every step is a chance to fail, and failures cascade. At 95% per-step accuracy, a 3-step task succeeds 86% of the time, but a 10-step task succeeds 60%. At 90% per-step, the 10-step task drops to 35%.
A system that drops 16 points from easy to hard handles compounding well. A system that drops 58 points was being flattered by easy tasks.
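The arithmetic behind those numbers is just per-step accuracy raised to the number of steps; a quick check in Python:

```python
# Task success = per-step accuracy ** number of steps.
for per_step, steps in [(0.95, 3), (0.95, 10), (0.90, 10)]:
    print(f"{per_step:.0%} per step over {steps} steps -> {per_step ** steps:.0%}")
# 95% per step over 3 steps -> 86%
# 95% per step over 10 steps -> 60%
# 90% per step over 10 steps -> 35%
```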
We failed 40 out of 300 tasks. Here's every one, with the reason.
Anti-bot blocks — 12 failures. Sites that blocked execution at the infrastructure level before the agent could attempt the task.
| Site | Failures | Task IDs |
|---|---|---|
| apartments.com | 8 | #53, #60, #74, #107, #118, #174, #186, #200 |
| booking.com | 1 | #208 |
| compass.com | 1 | #124 |
| americanexpress.com | 1 | #65 |
| kaggle.com | 1 | #197 |
apartments.com accounts for 8 of our 40 failures. If you've tried automating anything on that site, you already know. We ran every task through our own platform with the same proxy rotation, fingerprinting, and anti-bot tooling our customers use in production. Some sites are just that aggressive.
UI interaction limitations — 4 failures. Widget types our execution layer doesn't handle yet.
| Task | What happened |
|---|---|
| #181, #226 — chase.com | Slider widgets in retirement calculators |
| #27 — chess.com | Drag-and-drop without individual DOM IDs |
| #248 — imgur.com | Couldn't locate specific image in meme creator |
Edge cases — 24 failures. Task-specific misses: filters not applied, wrong pages or tools, incomplete final steps.
| Task | What happened |
|---|---|
| #172 — traderjoes.com | Couldn't set location preference |
| #224 — thumbtack.com | Filter application issues |
| #285 — dblp.org | SQL-style interactive search we don't support |
| #3 — imdb.com | Last 2 steps incomplete, AMC+ not found |
| #4 — weather.com | Clicked wrong forecast section |
| #7 — coursera.org | Failed to unwrap filter and select Advertising skill |
| #54 — us.trip.com | Missed clicking attractions in nearby section |
| #55 — samsung.com | Failed to retrieve review content |
| #61 — stanford.edu | Missing Monday filter for class schedule |
| #85 — flightaware.com | Filter did not return all matching flights |
| #89 — akc.org | State filter failed |
| #95 — cars.com | Agent navigated to wrong page instead of loan calculator |
| #120 — drugs.com | Used generic search instead of drug-specific search |
| #163 — ohiomeansjobs.ohio.gov | Advanced filters not applied |
| #167 — student.com | Price upper bound filter failed |
| #168 — healthline.com | Used wrong function for diet comparison |
| #192 — ohiomeansjobs.ohio.gov | Search keywords too broad |
| #196 — statista.com | Location filter for China failed |
| #235 — eventbrite.com | Missed final navigation step |
| #245 — doctor.webmd.com | Distance filter repeatedly selected wrong value |
| #246 — ohiomeansjobs.ohio.gov | Advanced filters not applied |
| #275 — tourradar.com | Duration filter failed |
| #287 — bestbuy.com | Failed to select open-box option |
| #288 — healthline.com | Used wrong tool for recipe search |
Every one of these 300 tasks has a clickable link to the full execution trace. Pick a failure, or pick a pass. Watch what happened.
The standard web agent architecture: screenshot the page, send it to a frontier model, ask what to click, repeat. This is how Operator, Claude Computer Use, and Browser Use all work.
It has a scaling problem. A round-trip to a frontier model takes 1-5 seconds per step. Large models are stochastic — same screenshot, different actions — so consistency degrades across long workflows. And the cost per session at production volume doesn't work.
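In pseudocode, the pattern looks roughly like this (an illustrative sketch, not any vendor's actual implementation; the browser and model interfaces are stand-ins):

```python
# Illustrative sketch of the screenshot-loop pattern. Every step,
# mechanical or not, pays a full frontier-model round trip.
def run_task(browser, model, goal, max_steps=30):
    for _ in range(max_steps):
        screenshot = browser.capture()                 # current page state
        action = model.next_action(goal, screenshot)   # 1-5 s per call
        if action["type"] == "done":
            return action["result"]
        browser.perform(action)                        # click / type / scroll / fill
    raise RuntimeError("step budget exhausted")
```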
We split the problem based on an observation: about 20-30% of steps in a typical web workflow need actual reasoning. Understanding what a page is asking, interpreting an unusual layout, choosing between valid paths. The rest — clicking date pickers, selecting dropdowns, submitting forms, paginating — is mechanical.
The reasoning layer uses large models for the 20-30% that's ambiguous. The execution layer uses small, task-specific models trained on web interaction patterns for the rest. These run in milliseconds, not seconds. Same input, same output. No hallucinated click targets.
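Here's a simplified sketch of the routing idea; the step taxonomy and the stand-in classes are illustrative for this post, not our actual models:

```python
from dataclasses import dataclass

# Steps we treat as mechanical in this sketch (the real taxonomy is learned, not a list).
MECHANICAL = {"click", "select_option", "fill_form", "paginate", "pick_date"}

@dataclass
class Step:
    kind: str    # e.g. "pick_date", "interpret_layout"
    target: str  # element or question the step concerns

class SmallActionModel:
    """Stand-in for a small, task-specific model: deterministic, millisecond latency."""
    def predict(self, step: Step) -> str:
        return f"execute {step.kind} on {step.target}"

class FrontierModel:
    """Stand-in for a large-model call: seconds of latency, used sparingly."""
    def reason(self, step: Step) -> str:
        return f"reason about '{step.target}' and pick a path"

def route(step: Step, small: SmallActionModel, frontier: FrontierModel) -> str:
    # Mechanical steps (the other 70-80%) never leave the fast execution layer.
    if step.kind in MECHANICAL:
        return small.predict(step)
    # Ambiguous steps (the 20-30%) pay for a frontier-model round trip.
    return frontier.reason(step)

if __name__ == "__main__":
    small, frontier = SmallActionModel(), FrontierModel()
    for s in (Step("pick_date", "check-in calendar"),
              Step("interpret_layout", "unusual seating map")):
        print(route(s, small, frontier))
```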
The infrastructure layer handles proxy rotation, browser fingerprinting, geographic routing, and bot detection evasion. All of it was running during the benchmark, the same setup our customers use. An agent that reasons perfectly but can't get past the front door is useless, and this is the layer we're investing the most in right now.
A good example: the results we published are one-shot success rates with no retries and no manual intervention. But we did re-run some failed tasks afterward. Take Task #197 on kaggle.com ("Identify the ongoing competition that offers the highest prize and find the code that received the most votes in that competition"). In our benchmark submission, it failed on an anti-bot block. On a subsequent run, TinyFish automatically reconfigured, switching to a different proxy and passing Cloudflare on its own. You can watch the full execution trace here. That auto-reconfiguration is the differentiator: not just having anti-bot tooling, but having a system that detects blocks and adapts in real time without human input.
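In simplified form, that adapt-on-block loop looks something like this; the block signals, proxy pool, and helpers are illustrative stand-ins, not our production logic:

```python
import itertools

PROXY_POOL = itertools.cycle(["us-east-1", "us-west-2", "eu-central-1"])  # illustrative pool

def looks_blocked(status: int, page_text: str) -> bool:
    # Crude stand-in for block detection (challenge pages, 403/429 responses).
    markers = ("access denied", "verify you are human", "challenge")
    return status in (403, 429) or any(m in page_text.lower() for m in markers)

def run_with_adaptation(run_task, session, max_attempts=3):
    """Re-run a task, reconfiguring the session each time a block is detected."""
    result = None
    for _ in range(max_attempts):
        status, page_text, result = run_task(session)
        if not looks_blocked(status, page_text):
            return result
        session["proxy"] = next(PROXY_POOL)        # rotate to a new egress region
        session["fingerprint"] = "fresh-profile"   # new browser fingerprint (stub)
    return result  # still blocked after retries: surfaced as a failure
```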
One API. Natural language in, structured data out.
curl -N -X POST https://agent.tinyfish.ai/v1/automation/run-sse \
-H "X-API-Key: $TINYFISH_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://agentql.com",
"goal": "Find all AgentQL subscription plans and their prices. Return result in json format"
}'
TinyFish Cookbook — starter templates.
All 300 execution traces — judge for yourself.
Online-Mind2Web Paper — benchmark methodology.
No credit card. No setup. Run your first operation in under a minute.
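The same call from Python, as a minimal sketch using the requests library; the endpoint and payload mirror the curl example above, and the response streams Server-Sent Events until the final structured result arrives:

```python
import os
import requests

resp = requests.post(
    "https://agent.tinyfish.ai/v1/automation/run-sse",
    headers={
        "X-API-Key": os.environ["TINYFISH_API_KEY"],
        "Content-Type": "application/json",
    },
    json={
        "url": "https://agentql.com",
        "goal": "Find all AgentQL subscription plans and their prices. "
                "Return result in json format",
    },
    stream=True,  # keep the connection open and read Server-Sent Events
)
for line in resp.iter_lines(decode_unicode=True):
    if line:
        print(line)  # progress events, then the final structured result
```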
