
You have a list of 500 URLs — competitor product pages, supplier portals, job listings, or real estate listings. You need the data from each one.
The answer to "which tool fetches this data reliably" depends on what's in that list — not on how many URLs there are.
What's in your list → which tool:
If your URLs are documentation pages, blog posts, static product catalogs, or any content that loads fully in the initial HTML response, an async HTTP client like Python's httpx is the fastest and cheapest option—often by a large margin.
```python
import asyncio
import httpx

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    try:
        r = await client.get(url, timeout=15)
        return {"url": url, "status": r.status_code, "html": r.text}
    except Exception as e:
        return {"url": url, "error": str(e)}

async def crawl_list(urls: list[str], concurrency: int = 20) -> list:
    results = []
    async with httpx.AsyncClient(follow_redirects=True) as client:
        for i in range(0, len(urls), concurrency):
            batch = urls[i:i + concurrency]
            batch_results = await asyncio.gather(*[fetch(client, url) for url in batch])
            results.extend(batch_results)
            print(f"Processed {min(i + concurrency, len(urls))}/{len(urls)}")
    return results

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

results = asyncio.run(crawl_list(urls))
```

In our testing, this handles 1,000 static URLs in under a minute on a standard laptop. For 100K+ URLs, Scrapy's built-in scheduler, downloader middleware, and item pipeline make more sense—it handles deduplication, retry logic, and output formatting at the framework level.
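If you do cross into Scrapy territory, the same list crawl is a short spider. A minimal sketch, with illustrative spider name, settings values, and output command:

```python
# list_spider.py (sketch) - run with: scrapy runspider list_spider.py -o results.json
import scrapy


class ListSpider(scrapy.Spider):
    name = "list_spider"
    custom_settings = {
        "CONCURRENT_REQUESTS": 32,   # Scrapy's scheduler manages the request queue
        "RETRY_TIMES": 2,            # built-in retry middleware
        "DOWNLOAD_TIMEOUT": 15,
    }

    def start_requests(self):
        # Reads the same urls.txt; Scrapy's dupefilter skips repeated URLs by default
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Yielded items flow through Scrapy's item pipeline and feed exporter
        yield {"url": response.url, "status": response.status, "html": response.text}
```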
Where this breaks down: any URL that requires JavaScript execution. If the page shows a loading spinner and populates content after load, a plain HTTP fetch returns the spinner HTML, not the content.
For lists where content loads via JavaScript—React SPAs, infinite scroll, dynamic filtering, price tables that render after an API call—you need a real browser.
```python
import asyncio
from playwright.async_api import async_playwright

async def fetch_js(page, url: str) -> dict:
    try:
        await page.goto(url, wait_until="networkidle", timeout=30000)
        content = await page.content()
        return {"url": url, "html": content}
    except Exception as e:
        return {"url": url, "error": str(e)}

async def crawl_js_list(urls: list[str], concurrency: int = 5) -> list:
    results = []
    async with async_playwright() as p:
        browser = None
        try:
            browser = await p.chromium.launch(headless=True)
            for i in range(0, len(urls), concurrency):
                batch = urls[i:i + concurrency]
                pages = [await browser.new_page() for _ in batch]
                batch_results = await asyncio.gather(*[
                    fetch_js(page, url) for page, url in zip(pages, batch)
                ])
                for page in pages:
                    await page.close()
                results.extend(batch_results)
        finally:
            if browser:
                await browser.close()
    return results
```

Keep concurrency low (3–8 pages) when running locally—each headless Chromium instance consumes 100–300MB. For larger lists, cloud browser infrastructure (Browserless, Browserbase) handles the browser pool so you're not resource-limited on your machine.
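If you offload the browser pool to a cloud provider, the only change to the code above is how the browser is started. A minimal sketch, assuming a provider that exposes a CDP-compatible WebSocket endpoint; the endpoint URL and token variable are placeholders, not any specific provider's API:

```python
import os

# Inside crawl_js_list, replace:
#     browser = await p.chromium.launch(headless=True)
# with:
#     browser = await get_remote_browser(p)

async def get_remote_browser(p):
    # Placeholder endpoint: substitute the WebSocket URL your provider documents
    ws_endpoint = f"wss://browser-provider.example/?token={os.environ['BROWSER_TOKEN']}"
    # connect_over_cdp attaches to a Chromium already running on the provider's
    # machines, so page memory is no longer a constraint on your laptop
    return await p.chromium.connect_over_cdp(ws_endpoint)
```

Everything else in `crawl_js_list` stays the same; with the pages running remotely, you can raise `concurrency` as far as your plan allows.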
Where this breaks down: sites that detect automation at the network and behavioral level. JavaScript-level automation plugins help at low volume; at scale, crawls against sites with enterprise-grade detection infrastructure become unreliable.

This is where simple HTTP requests stop being sufficient: your list includes pages behind login walls, portals that require authenticated sessions, or sites with enterprise-grade detection at the network and behavioral level.
For these URLs, maintaining a Playwright-based crawler means running your own browser and proxy infrastructure, managing sessions and re-authentication, and spending engineering hours updating scripts every time a site changes.
AI web agents handle this at the infrastructure level. You pass a URL and a goal; the agent handles rendering, request handling, and authentication for sites where you have authorized access.
```python
import asyncio
import aiohttp
import os

async def crawl_url(session, url: str, goal: str) -> dict:
    async with session.post(
        "https://agent.tinyfish.ai/v1/automation/run",
        headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
        json={"url": url, "goal": goal},
        timeout=aiohttp.ClientTimeout(total=120)
    ) as resp:
        if resp.status != 200:
            return {"url": url, "result": None, "status": "HTTP_ERROR",
                    "error": await resp.text()}
        data = await resp.json()
        # "COMPLETED" means the run finished — not that the goal succeeded.
        # Check for TASK_FAILED / SITE_BLOCKED / TIMEOUT before using result.
        status = data.get("status")
        result = data.get("result")
        if status != "COMPLETED" or result is None:
            return {"url": url, "result": None, "status": status,
                    "error": data.get("error")}
        return {"url": url, "result": result, "status": status}

async def crawl_protected_list(urls: list[str], goal: str, concurrency: int = 10) -> list:
    results = []
    async with aiohttp.ClientSession() as session:
        for i in range(0, len(urls), concurrency):
            batch = urls[i:i + concurrency]
            batch_results = await asyncio.gather(*[
                crawl_url(session, url, goal) for url in batch
            ])
            results.extend(batch_results)
            print(f"Processed {min(i+concurrency, len(urls))}/{len(urls)}")
    return results

urls = ["https://protected-site.com/product/1", "https://protected-site.com/product/2"]
goal = "Extract the product name, current price, and availability status. Return as JSON."
results = asyncio.run(crawl_protected_list(urls, goal))
```

The concurrency limit is determined by your plan—10 concurrent agents on Starter, 50 on Pro. For a 1,000-URL list on Pro, that's 20 sequential batches of 50.
When the math shifts: plain HTTP clients and Playwright are cheaper per URL on cooperative, stable sites. TinyFish makes sense when you factor in what Playwright-at-scale actually costs: server infrastructure, proxy subscriptions, and the engineering hours spent maintaining scrapers as sites change. For mixed or complex URL lists, that total cost typically exceeds TinyFish's per-step pricing before you hit production scale.
Real URL lists are rarely uniform. A supplier monitoring list might include static catalog pages, JavaScript-rendered pricing tables, and login-protected portals in the same file.
The practical approach: categorize your list before you crawl it. A quick HEAD request or a sample run reveals which URLs respond to simple HTTP requests vs. which require rendering vs. which block automation. Route each category to the appropriate tool. The 10% that requires agents is where reliability actually matters — authentication failures and automation blocks are what stall production workflows, not the cooperative pages.
To classify URLs before routing them, a quick probe is faster than a full crawl:
```python
import httpx
import random

def classify_url(url: str, timeout: int = 10) -> str:
    """Returns 'static', 'js', or 'blocked' based on a quick probe."""
    try:
        r = httpx.get(url, timeout=timeout, follow_redirects=True,
                      headers={"User-Agent": "Mozilla/5.0"})
        if r.status_code in (401, 403, 429):
            return "blocked"
        html = r.text
        js_signals = [
            len(html) < 500,                            # near-empty response
            '<div id="root">' in html,                  # React
            '<div id="app">' in html,                   # Vue
            "ng-version=" in html,                      # Angular
            "window.__NUXT__" in html,                  # Nuxt
            html.count("<p") < 2 and len(html) < 2000,  # minimal real content
        ]
        return "js" if any(js_signals) else "static"
    except Exception:
        return "blocked"

# Sample 10% before committing to the full crawl
sample = random.sample(urls, min(50, len(urls)))
categories: dict[str, list] = {"static": [], "js": [], "blocked": []}
for url in sample:
    categories[classify_url(url)].append(url)

print(f"Static: {len(categories['static'])}, JS: {len(categories['js'])}, Blocked: {len(categories['blocked'])}")
# Route the full list to the matching tool based on these proportions
```

A 429 response means rate-limited — retry with backoff before escalating. A 403 indicates access is blocked or restricted; retrying with the same tool won't help. A near-empty response or JS framework marker means JS rendering is needed. Clean HTML with visible `<p>` tags is static.
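For the 429 case, a small backoff wrapper around the probe is usually enough before escalating to a heavier tool. A sketch, separate from the classifier above; the retry count and base delay are illustrative:

```python
import time
import httpx

def get_with_backoff(url: str, retries: int = 3, base_delay: float = 2.0) -> httpx.Response:
    """Retry on 429 with exponential backoff; return the last response if all attempts fail."""
    for attempt in range(retries):
        r = httpx.get(url, timeout=10, follow_redirects=True,
                      headers={"User-Agent": "Mozilla/5.0"})
        if r.status_code != 429:
            return r
        # Honor Retry-After when the server sends it, otherwise back off exponentially
        delay = float(r.headers.get("Retry-After", base_delay * (2 ** attempt)))
        time.sleep(delay)
    return r
```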
| List size | Tool | Rough time (10 concurrent) |
|---|---|---|
| 100–1,000 static | requests/httpx | 1–5 min |
| 100–1,000 JS | Playwright | 5–20 min |
| 100–1,000 protected | TinyFish agents | 10–30 min |
| 10,000+ static | Scrapy | Hours, distributed |
| 10,000+ JS or protected | Infrastructure + agents | Plan accordingly |
For very large lists (100K+), distributed architecture matters regardless of tool—whether that's Scrapy's built-in scheduler, a task queue like Celery, or submitting batches to an async agent API and polling for results.
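As a sketch of the task-queue route, each worker pulls one chunk of the list and runs the same fetch logic shown earlier. The broker URL and 500-URL chunk size here are assumptions, not requirements:

```python
from celery import Celery
import httpx

# Sketch only: assumes a local Redis broker and result backend; swap in whatever you already run
app = Celery("crawler", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")

@app.task
def crawl_batch(urls: list[str]) -> list[dict]:
    """Fetch one chunk of the list; workers across machines each drain their own chunks."""
    results = []
    with httpx.Client(follow_redirects=True, timeout=15) as client:
        for url in urls:
            try:
                r = client.get(url)
                results.append({"url": url, "status": r.status_code, "html": r.text})
            except httpx.HTTPError as e:
                results.append({"url": url, "error": str(e)})
    return results

# Driver script: submit the list in 500-URL chunks and let the worker pool process the queue
# for i in range(0, len(all_urls), 500):
#     crawl_batch.delay(all_urls[i:i + 500])
```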
Test TinyFish against the protected or authenticated URLs in your list — 500 free steps, no credit card.
For static HTML content, httpx with asyncio is the fastest approach—you can process 20–50 URLs simultaneously with a single machine and finish 1,000 URLs in under a minute. The key is async execution: sequential requests would take 10–15x longer for the same list. For JavaScript-rendered content, Playwright in async mode with 5–10 concurrent browser pages is the practical ceiling before memory constraints become a factor on standard hardware.
Rate limiting is the first line of defense: 1–2 requests per second per domain for most sites, slower for aggressive protection. Rotate user agents across requests. For moderate protection, plain HTTP requests with a realistic user agent and reasonable delays work. For sites with enterprise-grade automation detection, JavaScript-level automation plugins help at low volume but degrade at scale — TinyFish provides infrastructure-level browser sessions that are more reliable for protected sites at production scale.
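A sketch of both ideas together, assuming a small user-agent pool and a fixed per-domain gap; the 0.5-second figure and the abbreviated UA strings are illustrative. This replaces the bare `client.get` call in the earlier `fetch` function:

```python
import asyncio
import random
import time
from urllib.parse import urlparse

import httpx

USER_AGENTS = [
    # Abbreviated examples: use full, current UA strings in practice
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]

_last_hit: dict[str, float] = {}

async def polite_get(client: httpx.AsyncClient, url: str, min_gap: float = 0.5) -> httpx.Response:
    """Space requests to the same domain at least min_gap seconds apart, with a rotated UA."""
    domain = urlparse(url).netloc
    now = time.monotonic()
    # Reserve the next slot for this domain before sleeping, so concurrent
    # coroutines targeting the same domain queue up instead of firing together
    slot = max(now, _last_hit.get(domain, 0) + min_gap)
    _last_hit[domain] = slot
    await asyncio.sleep(slot - now)
    return await client.get(url, headers={"User-Agent": random.choice(USER_AGENTS)})
```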
Scrapy if your URLs return static HTML and you need high volume (10K+) with built-in scheduling, retry logic, and output pipelines. Playwright if URLs require JavaScript execution. The two aren't mutually exclusive—Scrapy has a Playwright integration (scrapy-playwright) that handles JS rendering within Scrapy's architecture. For lists with mixed content types, start with Scrapy for the static subset and use a separate Playwright job for the JS-heavy URLs.
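If you go the scrapy-playwright route, the integration is enabled in settings and requested per URL via request meta. A minimal sketch based on the package's documented setup; verify the exact keys against the current scrapy-playwright docs:

```python
# settings.py (sketch): routes downloads through Playwright when a request opts in
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, only JS-heavy URLs opt in; everything else uses Scrapy's plain downloader:
# yield scrapy.Request(url, meta={"playwright": True}, callback=self.parse)
```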
Normalize URLs first: lowercase the scheme and domain, sort query parameters alphabetically, strip tracking parameters (utm_*, ref=, fbclid=), and resolve relative URLs to absolute. Python's urllib.parse.urlparse plus a set for deduplication handles most cases. For large lists with near-duplicate URLs (same page, different session IDs), a canonicalization helper like w3lib.url.canonicalize_url (from the w3lib library) gives more aggressive deduplication.
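A sketch of that normalization pass using only the standard library; the tracking-parameter list is illustrative and worth extending for your sources:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING_PARAMS = {"ref", "fbclid"}  # plus anything starting with utm_

def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop tracking params, sort the rest, strip the fragment."""
    parts = urlparse(url)
    query = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS and not k.startswith("utm_")
    ]
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,
        parts.params,
        urlencode(sorted(query)),
        "",  # drop the fragment; it never reaches the server
    ))

# Deduplicate the list on the normalized form
with open("urls.txt") as f:
    unique_urls = sorted({normalize_url(line.strip()) for line in f if line.strip()})
```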
When the target pages are behind login walls that your team has authorized access to—supplier pricing portals, internal tools, subscription content, or any page that redirects to a login page for unauthenticated requests. Signs your list needs auth: all results return the same HTML (the login page), response sizes are suspiciously uniform, or you see redirect chains ending at /login. For authenticated list crawling at scale, session management becomes the primary complexity—handling login flows, session expiry, and re-authentication across many concurrent workers. TinyFish handles session management and multi-step login flows for sites where you have authorized account access — you provide credentials, the agent handles the rest.
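A quick check for those signs, as a sketch; the "/login"-style path fragments are common patterns, not universal ones:

```python
import httpx

def looks_login_walled(url: str) -> bool:
    """Heuristic: were we redirected to a login page, or does the final page ask us to sign in?"""
    r = httpx.get(url, timeout=10, follow_redirects=True,
                  headers={"User-Agent": "Mozilla/5.0"})
    final_path = str(r.url).lower()
    redirected_to_login = any(hint in final_path for hint in ("/login", "/signin", "/account/login"))
    asks_for_password = 'type="password"' in r.text.lower()
    return redirected_to_login or asks_for_password
```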
No credit card. No setup. Run your first operation in under a minute.