Company

OpenAI Operator scores 43% on hard web tasks. We scored 81%. Here are all 300 runs.

Sky Zhang·Feb 12, 2026

TinyFish set out to build web agents that solve real-world problems for DoorDash, Google Hotels, ClassPass, and all the smaller businesses trying to keep up with the giants. That's what we do every day at production scale.

But we also wanted to test ourselves against the public benchmarks. Not because benchmarks are the goal. They rarely translate to real-world performance. The constraints aren't realistic, and it doesn't matter if your agent can play interactive chess puzzles on the web. What matters is whether it can solve your problem faster, cheaper, and at scale.

Still, Mind2Web is the most rigorous public evaluation for web agents right now, with 300 tasks across 136 live websites, three difficulty levels, and human evaluation. It's where OpenAI Operator, Claude Computer Use, and Browser Use all have published scores. So we ran it.

We ran TinyFish through the full benchmark in parallel. Here are the results alongside the current leaderboard:

                         Easy    Medium  Hard    Total
TinyFish                 97.5%   89.9%   81.9%   89.9%
Operator (OpenAI)        83.1%   58.0%   43.2%   61.3%
Claude Computer Use 3.7  90.4%   49.0%   32.4%   56.3%
Browser Use              55.4%   26.6%   8.1%    30.0%

We're submitting to the official leaderboard. In the meantime, we published every run so you don't have to take our word for it.

All 300 tasks, including every failure → [Public] TinyFish-Mind2Web Agent Runs

The rest of this post covers what these tasks actually involve, how we failed, and why we think the system works.

What these tasks involve

Mind2Web tasks run on live websites. The actual site, with all its pop-ups, dynamic pricing, and form validation.

An easy task: "Browse Marriott Bonvoy credit cards on Marriott." Navigate, find the section, view the listings. A few clicks.

A hard task: "Book 4 tickets in the upper section for any Kevin Hart show in New York in the next three months and view ticket prices with estimated fees." That's StubHub. Search events, filter by date range and location, select a show, choose a seating section, set ticket quantity, navigate a pricing page where fees calculate in real time. Ten-plus steps where things change between page loads.

Another hard task: "Find the highest critic-scored red or white wine from Oregon, priced under $40, that pairs well with fish or dessert." Multiple filters in sequence on wineaccess.com, constraint checking against each result, paginated inventory that shifts underneath you.

The benchmark evaluates each intermediate step, not just the final answer. And this is what makes the easy-to-hard drop the most interesting number in the results:

Agent                    Easy    Hard    Drop
TinyFish                 97.5%   81.9%   15.6 pts
Operator                 83.1%   43.2%   39.9 pts
Claude Computer Use 3.7  90.4%   32.4%   58.0 pts
Browser Use              55.4%   8.1%    47.3 pts

Hard tasks compound errors. Every step is a chance to fail, and failures cascade. At 95% per-step accuracy, a 3-step task succeeds 86% of the time, but a 10-step task succeeds 60%. At 90% per-step, the 10-step task drops to 35%.
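That arithmetic is just repeated multiplication of the per-step success probability. A two-line check:

```python
def task_success(step_acc: float, steps: int) -> float:
    """Probability a workflow succeeds when every step must succeed independently."""
    return step_acc ** steps

# At 95% per-step accuracy:
print(round(task_success(0.95, 3), 2))   # 3-step task:  0.86
print(round(task_success(0.95, 10), 2))  # 10-step task: 0.6
# At 90% per-step accuracy:
print(round(task_success(0.90, 10), 2))  # 10-step task: 0.35
```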

A system that drops 16 points from easy to hard handles compounding well. A system that drops 58 points was being flattered by easy tasks.

40 failures, all documented

We failed 40 out of 300 tasks. Here's every one, with the reason.

Anti-bot blocks — 12 failures. Sites that blocked execution at the infrastructure level before the agent could attempt the task.

Site                 Failures  Task IDs
apartments.com       8         #53, #60, #74, #107, #118, #174, #186, #200
booking.com          1         #208
compass.com          1         #124
americanexpress.com  1         #65
kaggle.com           1         #197

apartments.com accounts for 8 of our 40 failures. If you've tried automating anything on that site, you already know. We ran every task through our own platform with the same proxy rotation, fingerprinting, and anti-bot tooling our customers use in production. Some sites are just that aggressive.

UI interaction limitations — 4 failures. Widget types our execution layer doesn't handle yet.

Task                     What happened
#181, #226 — chase.com   Slider widgets in retirement calculators
#27 — chess.com          Drag-and-drop without individual DOM IDs
#248 — imgur.com         Couldn't locate specific image in meme creator

Edge cases — 24 failures.

Task                           What happened
#172 — traderjoes.com          Couldn't set location preference
#224 — thumbtack.com           Filter application issues
#285 — dblp.org                SQL-style interactive search we don't support
#3 — imdb.com                  Last 2 steps incomplete, AMC+ not found
#4 — weather.com               Clicked wrong forecast section
#7 — coursera.org              Failed to unwrap filter and select Advertising skill
#54 — us.trip.com              Missed clicking attractions in nearby section
#55 — samsung.com              Failed to retrieve review content
#61 — stanford.edu             Missing Monday filter for class schedule
#85 — flightaware.com          Filter did not return all matching flights
#89 — akc.org                  State filter failed
#95 — cars.com                 Agent navigated to wrong page instead of loan calculator
#120 — drugs.com               Used generic search instead of drug-specific search
#163 — ohiomeansjobs.ohio.gov  Advanced filters not applied
#167 — student.com             Price upper bound filter failed
#168 — healthline.com          Used wrong function for diet comparison
#192 — ohiomeansjobs.ohio.gov  Search keywords too broad
#196 — statista.com            Location filter for China failed
#235 — eventbrite.com          Missed final navigation step
#245 — doctor.webmd.com        Distance filter repeatedly selected wrong value
#246 — ohiomeansjobs.ohio.gov  Advanced filters not applied
#275 — tourradar.com           Duration filter failed
#287 — bestbuy.com             Failed to select open-box option
#288 — healthline.com          Used wrong tool for recipe search

Every one of these 300 tasks has a clickable link to the full execution trace. Pick a failure, or pick a pass. Watch what happened.

How the system works

The standard web agent architecture: screenshot the page, send it to a frontier model, ask what to click, repeat. This is how Operator, Claude Computer Use, and Browser Use all work.

It has a scaling problem. A round-trip to a frontier model takes 1-5 seconds per step. Large models are stochastic — same screenshot, different actions — so consistency degrades across long workflows. And the cost per session at production volume doesn't work.

We split the problem based on an observation: about 20-30% of steps in a typical web workflow need actual reasoning. Understanding what a page is asking, interpreting an unusual layout, choosing between valid paths. The rest — clicking date pickers, selecting dropdowns, submitting forms, paginating — is mechanical.

The reasoning layer uses large models for the 20-30% that's ambiguous. The execution layer uses small, task-specific models trained on web interaction patterns for the rest. These run in milliseconds, not seconds. Same input, same output. No hallucinated click targets.
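A minimal sketch of that routing split. This is illustrative only, not TinyFish's actual code; the action names and layer labels are assumptions:

```python
from dataclasses import dataclass

# Mechanical actions go to small deterministic models (hypothetical set).
MECHANICAL = {"click", "select_dropdown", "fill_form", "paginate", "pick_date"}

@dataclass
class Step:
    action: str
    target: str

def route(step: Step) -> str:
    """Route a workflow step: deterministic executor for mechanical actions,
    frontier model only when genuine reasoning is needed."""
    if step.action in MECHANICAL:
        return "execution-layer"  # milliseconds, same input -> same output
    return "reasoning-layer"      # seconds, large model, ambiguous cases

print(route(Step("click", "submit-button")))        # execution-layer
print(route(Step("interpret_layout", "checkout")))  # reasoning-layer
```

Under the 20-30% split described above, most steps in a long workflow never touch the slow, stochastic path.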

The infrastructure layer handles proxy rotation, browser fingerprinting, geographic routing, and bot detection evasion. All of it was running during the benchmark, the same setup our customers use. An agent that reasons perfectly but can't get past the front door is useless, and this is the layer we're investing the most in right now.

A good example: the results we published are one-shot success rates with no retries and no manual intervention. But we did re-run some failed tasks afterward. Take Task #197 on kaggle.com ("Identify the ongoing competition that offers the highest prize and find the code that received the most votes in that competition"). In our benchmark submission, it failed on an anti-bot block. On a subsequent run, TinyFish automatically reconfigured, switching to a different proxy and passing Cloudflare on its own. You can watch the full execution trace here. That auto-reconfiguration is the differentiator: not just having anti-bot tooling, but having a system that detects blocks and adapts in real time without human input.
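The detect-and-retry behavior described above can be sketched as a loop over proxy configurations. Everything here is hypothetical: the block signals, function names, and fake fetcher are illustrative, not our production logic:

```python
# Strings whose presence suggests an infrastructure-level block (illustrative).
BLOCK_SIGNALS = ("cloudflare challenge", "access denied", "captcha")

def run_with_reconfig(task: str, proxies: list[str], fetch) -> str:
    """Attempt a task, switching proxy configurations whenever a block is detected."""
    for proxy in proxies:
        page = fetch(task, proxy)
        if not any(signal in page.lower() for signal in BLOCK_SIGNALS):
            return page  # no block signal: treat this attempt as successful
    raise RuntimeError("all proxy configurations blocked")

# Toy demonstration: the first proxy hits a challenge, the second gets through.
def fake_fetch(task: str, proxy: str) -> str:
    return "Cloudflare challenge" if proxy == "proxy-a" else "<html>listing data</html>"

print(run_with_reconfig("find prize", ["proxy-a", "proxy-b"], fake_fetch))
```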

One API. Natural language in, structured data out.

Try it Yourself

curl -N -X POST https://agent.tinyfish.ai/v1/automation/run-sse \
  -H "X-API-Key: $TINYFISH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://agentql.com",
    "goal": "Find all AgentQL subscription plans and their prices. Return result in json format"
  }'
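For agents written in Python, the same call can be streamed without curl. This is a sketch using only the standard library; it assumes the endpoint emits standard SSE `data:` lines (the URL and payload mirror the curl example, but the event format is an assumption):

```python
import json
import os
import urllib.request

def parse_sse_line(line: str):
    """Return the payload of a 'data:' SSE line, or None for any other line."""
    line = line.strip()
    if line.startswith("data:"):
        return line[len("data:"):].strip()
    return None

def stream_run(goal: str, target_url: str):
    """POST an automation goal and yield each SSE data payload as it arrives."""
    req = urllib.request.Request(
        "https://agent.tinyfish.ai/v1/automation/run-sse",
        data=json.dumps({"url": target_url, "goal": goal}).encode(),
        headers={
            "X-API-Key": os.environ["TINYFISH_API_KEY"],
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            payload = parse_sse_line(raw.decode())
            if payload is not None:
                yield payload
```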

TinyFish Cookbook — starter templates.

All 300 execution traces — judge for yourself.

Online-Mind2Web Paper — benchmark methodology.

Get started

Start building.

No credit card. No setup. Run your first operation in under a minute.

Get 500 free credits · Read the docs
More Articles

80% of your Web Fetch returns Junk (Engineering) · Matthew Sparr · May 11, 2026
Search and Fetch are now FREE for every agent, everywhere! (Company) · Keith Zhai · May 4, 2026
Production-Grade Web Fetching for AI Agents (Engineering) · Chenlu Ji · Apr 14, 2026