TinyFish set out to build web agents that solve real-world problems for DoorDash, Google Hotels, ClassPass, and all the smaller businesses trying to keep up with the giants. That's what we do every day, at production scale.
But we also wanted to test ourselves against the public benchmarks. Not because benchmarks are the goal. They rarely translate to real-world performance. The constraints aren't realistic, and it doesn't matter if your agent can play interactive chess puzzles on the web. What matters is whether it can solve your problem faster, cheaper, and at scale.
Still, Mind2Web is the most rigorous public evaluation for web agents right now, with 300 tasks across 136 live websites, three difficulty levels, and human evaluation. It's where OpenAI Operator, Claude Computer Use, and Browser Use all have published scores. So we ran it.
We ran TinyFish through the full benchmark in parallel. Here are the results alongside the current leaderboard:
| Agent | Easy | Medium | Hard | Total |
|---|---|---|---|---|
| TinyFish | 97.5% | 89.9% | 81.9% | 89.9% |
| Operator (OpenAI) | 83.1% | 58.0% | 43.2% | 61.3% |
| Claude Computer Use 3.7 | 90.4% | 49.0% | 32.4% | 56.3% |
| Browser Use | 55.4% | 26.6% | 8.1% | 30.0% |
We're submitting to the official leaderboard. In the meantime, we've published every run so you don't have to take our word for it.
All 300 tasks, including every failure → [Public] TinyFish-Mind2Web Agent Runs
The rest of this post covers what these tasks actually involve, how we failed, and why we think the system works.
Mind2Web tasks run on live websites. The actual site, with all its pop-ups, dynamic pricing, and form validation.
An easy task: "Browse Marriott Bonvoy credit cards on Marriott." Navigate, find the section, view the listings. A few clicks.
A hard task: "Book 4 tickets in the upper section for any Kevin Hart show in New York in the next three months and view ticket prices with estimated fees." That's StubHub. Search events, filter by date range and location, select a show, choose a seating section, set ticket quantity, navigate a pricing page where fees calculate in real time. Ten-plus steps where things change between page loads.
Another hard task: "Find the highest critic-scored red or white wine from Oregon, priced under $40, that pairs well with fish or dessert." Multiple filters in sequence on wineaccess.com, constraint checking against each result, paginated inventory that shifts underneath you.
The benchmark evaluates each intermediate step, not just the final answer. And this is what makes the easy-to-hard drop the most interesting number in the results:
| Agent | Easy | Hard | Drop |
|---|---|---|---|
| TinyFish | 97.5% | 81.9% | 15.6 pts |
| Operator | 83.1% | 43.2% | 39.9 pts |
| Claude Computer Use 3.7 | 90.4% | 32.4% | 58.0 pts |
| Browser Use | 55.4% | 8.1% | 47.3 pts |
Hard tasks compound errors. Every step is a chance to fail, and failures cascade. At 95% per-step accuracy, a 3-step task succeeds 86% of the time, but a 10-step task succeeds 60%. At 90% per-step, the 10-step task drops to 35%.
A system that drops 16 points from easy to hard handles compounding well. A system that drops 58 points was being flattered by easy tasks.
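The arithmetic behind those numbers is just per-step accuracy raised to the number of steps; a quick check in Python:

```python
# Task success = per-step accuracy ** number of steps.
for per_step, steps in [(0.95, 3), (0.95, 10), (0.90, 10)]:
    print(f"{per_step:.0%} per step over {steps} steps -> {per_step ** steps:.0%}")
# 95% per step over 3 steps -> 86%
# 95% per step over 10 steps -> 60%
# 90% per step over 10 steps -> 35%
```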
We failed 40 out of 300 tasks. Here's every one, with the reason.
Anti-bot blocks — 12 failures. Sites that blocked execution at the infrastructure level before the agent could attempt the task.
| Site | Failures | Task IDs |
|---|---|---|
| apartments.com | 8 | #53, #60, #74, #107, #118, #174, #186, #200 |
| booking.com | 1 | #208 |
| compass.com | 1 | #124 |
| americanexpress.com | 1 | #65 |
| kaggle.com | 1 | #197 |
apartments.com accounts for 8 of our 40 failures. If you've tried automating anything on that site, you already know. We ran every task through our own platform with the same proxy rotation, fingerprinting, and anti-bot tooling our customers use in production. Some sites are just that aggressive.
UI interaction limitations — 4 failures. Widget types our execution layer doesn't handle yet.
| Task | What happened |
|---|---|
| #181, #226 — chase.com | Slider widgets in retirement calculators |
| #27 — chess.com | Drag-and-drop without individual DOM IDs |
| #248 — imgur.com | Couldn't locate specific image in meme creator |
Edge cases — 24 failures. Task-specific misses: filters not applied, wrong pages or tools, incomplete final steps.
| Task | What happened |
|---|---|
| #172 — traderjoes.com | Couldn't set location preference |
| #224 — thumbtack.com | Filter application issues |
| #285 — dblp.org | SQL-style interactive search we don't support |
| #3 — imdb.com | Last 2 steps incomplete, AMC+ not found |
| #4 — weather.com | Clicked wrong forecast section |
| #7 — coursera.org | Failed to unwrap filter and select Advertising skill |
| #54 — us.trip.com | Missed clicking attractions in nearby section |
| #55 — samsung.com | Failed to retrieve review content |
| #61 — stanford.edu | Missing Monday filter for class schedule |
| #85 — flightaware.com | Filter did not return all matching flights |
| #89 — akc.org | State filter failed |
| #95 — cars.com | Agent navigated to wrong page instead of loan calculator |
| #120 — drugs.com | Used generic search instead of drug-specific search |
| #163 — ohiomeansjobs.ohio.gov | Advanced filters not applied |
| #167 — student.com | Price upper bound filter failed |
| #168 — healthline.com | Used wrong function for diet comparison |
| #192 — ohiomeansjobs.ohio.gov | Search keywords too broad |
| #196 — statista.com | Location filter for China failed |
| #235 — eventbrite.com | Missed final navigation step |
| #245 — doctor.webmd.com | Distance filter repeatedly selected wrong value |
| #246 — ohiomeansjobs.ohio.gov | Advanced filters not applied |
| #275 — tourradar.com | Duration filter failed |
| #287 — bestbuy.com | Failed to select open-box option |
| #288 — healthline.com | Used wrong tool for recipe search |
Every one of these 300 tasks has a clickable link to the full execution trace. Pick a failure, or pick a pass. Watch what happened.
The standard web agent architecture: screenshot the page, send it to a frontier model, ask what to click, repeat. This is how Operator, Claude Computer Use, and Browser Use all work.
It has a scaling problem. A round-trip to a frontier model takes 1-5 seconds per step. Large models are stochastic — same screenshot, different actions — so consistency degrades across long workflows. And the cost per session at production volume doesn't work.
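In pseudocode, the pattern looks roughly like this (an illustrative sketch, not any vendor's actual implementation; the browser and model interfaces are stand-ins):

```python
# Illustrative sketch of the screenshot-loop pattern. Every step,
# mechanical or not, pays a full frontier-model round trip.
def run_task(browser, model, goal, max_steps=30):
    for _ in range(max_steps):
        screenshot = browser.capture()                 # current page state
        action = model.next_action(goal, screenshot)   # 1-5 s per call
        if action["type"] == "done":
            return action["result"]
        browser.perform(action)                        # click / type / scroll / fill
    raise RuntimeError("step budget exhausted")
```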
We split the problem based on an observation: about 20-30% of steps in a typical web workflow need actual reasoning. Understanding what a page is asking, interpreting an unusual layout, choosing between valid paths. The rest — clicking date pickers, selecting dropdowns, submitting forms, paginating — is mechanical.
The reasoning layer uses large models for the 20-30% that's ambiguous. The execution layer uses small, task-specific models trained on web interaction patterns for the rest. These run in milliseconds, not seconds. Same input, same output. No hallucinated click targets.
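Here's a simplified sketch of the routing idea; the step taxonomy and the stand-in classes are illustrative for this post, not our actual models:

```python
from dataclasses import dataclass

# Steps we treat as mechanical in this sketch (the real taxonomy is learned, not a list).
MECHANICAL = {"click", "select_option", "fill_form", "paginate", "pick_date"}

@dataclass
class Step:
    kind: str    # e.g. "pick_date", "interpret_layout"
    target: str  # element or question the step concerns

class SmallActionModel:
    """Stand-in for a small, task-specific model: deterministic, millisecond latency."""
    def predict(self, step: Step) -> str:
        return f"execute {step.kind} on {step.target}"

class FrontierModel:
    """Stand-in for a large-model call: seconds of latency, used sparingly."""
    def reason(self, step: Step) -> str:
        return f"reason about '{step.target}' and pick a path"

def route(step: Step, small: SmallActionModel, frontier: FrontierModel) -> str:
    # Mechanical steps (the other 70-80%) never leave the fast execution layer.
    if step.kind in MECHANICAL:
        return small.predict(step)
    # Ambiguous steps (the 20-30%) pay for a frontier-model round trip.
    return frontier.reason(step)

if __name__ == "__main__":
    small, frontier = SmallActionModel(), FrontierModel()
    for s in (Step("pick_date", "check-in calendar"),
              Step("interpret_layout", "unusual seating map")):
        print(route(s, small, frontier))
```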
The infrastructure layer handles proxy rotation, browser fingerprinting, geographic routing, and bot detection evasion. All of it was running during the benchmark, the same setup our customers use. An agent that reasons perfectly but can't get past the front door is useless, and this is the layer we're investing the most in right now.
A good example: the results we published are one-shot success rates with no retries and no manual intervention. But we did re-run some failed tasks afterward. Take Task #197 on kaggle.com ("Identify the ongoing competition that offers the highest prize and find the code that received the most votes in that competition"). In our benchmark submission, it failed on an anti-bot block. On a subsequent run, TinyFish automatically reconfigured, switching to a different proxy and passing Cloudflare on its own. You can watch the full execution trace here. That auto-reconfiguration is the differentiator: not just having anti-bot tooling, but having a system that detects blocks and adapts in real time without human input.
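In simplified form, that adapt-on-block loop looks something like this; the block signals, proxy pool, and helpers are illustrative stand-ins, not our production logic:

```python
import itertools

PROXY_POOL = itertools.cycle(["us-east-1", "us-west-2", "eu-central-1"])  # illustrative pool

def looks_blocked(status: int, page_text: str) -> bool:
    # Crude stand-in for block detection (challenge pages, 403/429 responses).
    markers = ("access denied", "verify you are human", "challenge")
    return status in (403, 429) or any(m in page_text.lower() for m in markers)

def run_with_adaptation(run_task, session, max_attempts=3):
    """Re-run a task, reconfiguring the session each time a block is detected."""
    result = None
    for _ in range(max_attempts):
        status, page_text, result = run_task(session)
        if not looks_blocked(status, page_text):
            return result
        session["proxy"] = next(PROXY_POOL)        # rotate to a new egress region
        session["fingerprint"] = "fresh-profile"   # new browser fingerprint (stub)
    return result  # still blocked after retries: surfaced as a failure
```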
One API. Natural language in, structured data out.
curl -N -X POST https://agent.tinyfish.ai/v1/automation/run-sse \
-H "X-API-Key: $TINYFISH_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://agentql.com",
"goal": "Find all AgentQL subscription plans and their prices. Return result in json format"
}'
TinyFish Cookbook — starter templates.
All 300 execution traces — judge for yourself.
Online-Mind2Web Paper — benchmark methodology.
No credit card. No setup. Run your first operation in under a minute.
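The same call from Python, as a minimal sketch using the requests library; the endpoint and payload mirror the curl example above, and the response streams Server-Sent Events until the final structured result arrives:

```python
import os
import requests

resp = requests.post(
    "https://agent.tinyfish.ai/v1/automation/run-sse",
    headers={
        "X-API-Key": os.environ["TINYFISH_API_KEY"],
        "Content-Type": "application/json",
    },
    json={
        "url": "https://agentql.com",
        "goal": "Find all AgentQL subscription plans and their prices. "
                "Return result in json format",
    },
    stream=True,  # keep the connection open and read Server-Sent Events
)
for line in resp.iter_lines(decode_unicode=True):
    if line:
        print(line)  # progress events, then the final structured result
```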
