Engineering

Fetching Data from a Large URL List: The Complete Decision Guide

TinyFishie·TinyFish Observer·May 7, 2026·9 min read

You have a list of 500 URLs — competitor product pages, supplier portals, job listings, or real estate listings. You need the data from each one.

The answer to "which tool fetches this data reliably" depends on what's in that list — not on how many URLs there are.

What's in your list → which tool:

  1. All static HTML, no bot protection → requests + httpx (fastest, cheapest)
  2. JavaScript-rendered content, no bot protection → Playwright or Crawlee
  3. Mixed list with some protected sites → Playwright + proxy rotation
  4. Protected or authenticated URLs at scale → TinyFish Web Agent
  5. Massive volume (100K+) of public pages → Scrapy

The Tool That Fits the List

Static HTML at Volume: httpx + asyncio

If your URLs are documentation pages, blog posts, static product catalogs, or any content that loads fully in the initial HTML response, an async HTTP client is the fastest and cheapest option, often by a large margin. The snippet below uses httpx with asyncio, since requests has no async support.

import asyncio
import httpx

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    try:
        r = await client.get(url, timeout=15)
        return {"url": url, "status": r.status_code, "html": r.text}
    except Exception as e:
        return {"url": url, "error": str(e)}

async def crawl_list(urls: list[str], concurrency: int = 20) -> list:
    results = []
    async with httpx.AsyncClient(follow_redirects=True) as client:
        for i in range(0, len(urls), concurrency):
            batch = urls[i:i + concurrency]
            batch_results = await asyncio.gather(*[fetch(client, url) for url in batch])
            results.extend(batch_results)
            print(f"Processed {min(i + concurrency, len(urls))}/{len(urls)}")
    return results

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

results = asyncio.run(crawl_list(urls))

In our testing, this handles 1,000 static URLs in under a minute on a standard laptop. For 100K+ URLs, Scrapy makes more sense: its built-in scheduler, downloader middleware, and item pipelines handle deduplication, retry logic, and output formatting at the framework level instead of in your script.

Where this breaks down: Any URL that requires JavaScript execution. If the page shows a loading spinner and populates content after load, requests returns the spinner HTML, not the content.

JavaScript Content: Playwright with Batching

For lists where content loads via JavaScript—React SPAs, infinite scroll, dynamic filtering, price tables that render after an API call—you need a real browser.

import asyncio
from playwright.async_api import async_playwright

async def fetch_js(page, url: str) -> dict:
    try:
        await page.goto(url, wait_until="networkidle", timeout=30000)
        content = await page.content()
        return {"url": url, "html": content}
    except Exception as e:
        return {"url": url, "error": str(e)}

async def crawl_js_list(urls: list[str], concurrency: int = 5) -> list:
    results = []
    async with async_playwright() as p:
        browser = None
        try:
            browser = await p.chromium.launch(headless=True)
            for i in range(0, len(urls), concurrency):
                batch = urls[i:i + concurrency]
                pages = [await browser.new_page() for _ in batch]
                batch_results = await asyncio.gather(*[
                    fetch_js(page, url) for page, url in zip(pages, batch)
                ])
                for page in pages:
                    await page.close()
                results.extend(batch_results)
        finally:
            if browser:
                await browser.close()
    return results

Keep concurrency low (3–8 pages) when running locally—each headless Chromium instance consumes 100–300MB. For larger lists, cloud browser infrastructure (Browserless, Browserbase) handles the browser pool so you're not resource-limited on your machine.

Where this breaks down: Sites that detect automation at the network and behavioral level. Browser-level evasion (stealth plugins, patched fingerprints) helps at low volume; against sites with enterprise-grade detection infrastructure, reliability degrades as volume grows.

[Figure: decision diagram for choosing the right tool based on URL list content type]

Sites with Strict Requirements or Authenticated Access: TinyFish

This is where simple HTTP requests stop being sufficient. Your list includes:

  • Product pages that return different content to automation than to browsers
  • Pricing pages that require login using your own authorized account
  • Sites with strict automation requirements that affect reliability at scale
  • Authenticated portals where each URL requires an authorized session

For these, maintaining a Playwright-based crawler means:

  • Managing automation configuration that needs ongoing updates as site requirements evolve
  • Building session management for authenticated URLs
  • Handling multi-step login flows and session state
  • Debugging failures that change based on detection logic you don't control

AI web agents handle this at the infrastructure level. You pass a URL and a goal; the agent handles rendering, infrastructure-level request handling, and authentication for sites where you have authorized access.

import asyncio
import aiohttp
import os

async def crawl_url(session, url: str, goal: str) -> dict:
    async with session.post(
        "https://agent.tinyfish.ai/v1/automation/run",
        headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
        json={"url": url, "goal": goal},
        timeout=aiohttp.ClientTimeout(total=120)
    ) as resp:
        if resp.status != 200:
            return {"url": url, "result": None, "status": "HTTP_ERROR",
                    "error": await resp.text()}
        data = await resp.json()
        # "COMPLETED" means the run finished — not that the goal succeeded.
        # Check for TASK_FAILED / SITE_BLOCKED / TIMEOUT before using result.
        status = data.get("status")
        result = data.get("result")
        if status != "COMPLETED" or result is None:
            return {"url": url, "result": None, "status": status,
                    "error": data.get("error")}
        return {"url": url, "result": result, "status": status}

async def crawl_protected_list(urls: list[str], goal: str, concurrency: int = 10) -> list:
    results = []
    async with aiohttp.ClientSession() as session:
        for i in range(0, len(urls), concurrency):
            batch = urls[i:i + concurrency]
            batch_results = await asyncio.gather(*[
                crawl_url(session, url, goal) for url in batch
            ])
            results.extend(batch_results)
            print(f"Processed {min(i+concurrency, len(urls))}/{len(urls)}")
    return results

urls = ["https://protected-site.com/product/1", "https://protected-site.com/product/2"]
goal = "Extract the product name, current price, and availability status. Return as JSON."
results = asyncio.run(crawl_protected_list(urls, goal))

The concurrency limit is determined by your plan—10 concurrent agents on Starter, 50 on Pro. For a 1,000-URL list on Pro, that's 20 sequential batches of 50.

When the math shifts: requests and Playwright are cheaper per-URL on cooperative, stable sites. TinyFish makes sense when you factor in what Playwright-at-scale actually costs: server infrastructure, proxy subscriptions, and the engineering hours spent maintaining scrapers as sites change. For mixed or complex URL lists, that total cost typically exceeds TinyFish's per-step pricing before you hit production scale.

Handling the Mixed List

Real URL lists are rarely uniform. A supplier monitoring list might include:

  • 60% static pricing pages (requests would work)
  • 30% JavaScript-rendered product tables (Playwright needed)
  • 10% authenticated portals with bot protection (agents needed)

The practical approach: categorize your list before you crawl it. A quick HEAD request or a sample run reveals which URLs respond to simple HTTP requests vs. which require rendering vs. which block automation. Route each category to the appropriate tool. The 10% that requires agents is where reliability actually matters — authentication failures and automation blocks are what stall production workflows, not the cooperative pages.

To classify URLs before routing them, a quick probe is faster than a full crawl:

import httpx
import random

def classify_url(url: str, timeout: int = 10) -> str:
    """Returns 'static', 'js', or 'blocked' based on a quick probe."""
    try:
        r = httpx.get(url, timeout=timeout, follow_redirects=True,
                      headers={"User-Agent": "Mozilla/5.0"})
        if r.status_code in (401, 403, 429):
            return "blocked"
        html = r.text
        js_signals = [
            len(html) < 500,                          # near-empty response
            '<div id="root">' in html,              # React
            '<div id="app">' in html,               # Vue
            "ng-version=" in html,                    # Angular
            "window.__NUXT__" in html,                # Nuxt
            html.count("<p") < 2 and len(html) < 2000, # minimal real content
        ]
        return "js" if any(js_signals) else "static"
    except Exception:
        return "blocked"

# Sample 10% before committing to the full crawl
sample = random.sample(urls, min(50, len(urls)))
categories: dict[str, list] = {"static": [], "js": [], "blocked": []}
for url in sample:
    categories[classify_url(url)].append(url)

print(f"Static: {len(categories['static'])}, JS: {len(categories['js'])}, Blocked: {len(categories['blocked'])}")
# Route the full list to the matching tool based on these proportions
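
The routing step that the final comment points to can be sketched as a dispatch table. The three handlers below are stand-in lambdas for the crawl_list, crawl_js_list, and crawl_protected_list functions shown earlier; this is an illustrative sketch, not a prescribed API, and the real handlers have differing signatures (the agent path also takes a goal).

```python
# Hypothetical router: map each classified category to its fetcher.
# The lambdas are placeholders for crawl_list / crawl_js_list / crawl_protected_list.
def route_urls(categorized: dict[str, list[str]], handlers: dict) -> dict:
    """Dispatch each category's URLs to its handler and collect the results."""
    results = {}
    for category, urls in categorized.items():
        handler = handlers.get(category)
        results[category] = handler(urls) if handler else []
    return results

handlers = {
    "static": lambda urls: [f"fetched:{u}" for u in urls],   # stub for the httpx path
    "js": lambda urls: [f"rendered:{u}" for u in urls],      # stub for the Playwright path
    "blocked": lambda urls: [f"agent:{u}" for u in urls],    # stub for the TinyFish path
}
out = route_urls({"static": ["a"], "js": ["b"], "blocked": ["c"]}, handlers)
```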

A 429 response means rate-limited — retry with backoff before escalating. A 403 indicates access is blocked or restricted; retrying with the same tool won't help. A near-empty response or JS framework marker means JS rendering is needed. Clean HTML with visible <p> tags is static.
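
The 429 branch implies a retry policy. A minimal exponential-backoff-with-jitter sketch follows; the base delay, cap, and retryable status set are illustrative assumptions, not TinyFish guidance.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2^attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def should_retry(status: int, attempt: int, max_attempts: int = 4) -> bool:
    """Retry rate limits and transient 5xx errors; a 403 means blocked, so don't retry."""
    return status in (429, 500, 502, 503) and attempt < max_attempts
```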

Scale Considerations

List size | Tool | Rough time (10 concurrent)
100–1,000 static | requests/httpx | 1–5 min
100–1,000 JS | Playwright | 5–20 min
100–1,000 protected | TinyFish agents | 10–30 min
10,000+ static | Scrapy | Hours, distributed
10,000+ JS or protected | Infrastructure + agents | Plan accordingly

For very large lists (100K+), distributed architecture matters regardless of tool—whether that's Scrapy's built-in scheduler, a task queue like Celery, or submitting batches to an async agent API and polling for results.
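
The submit-batches-and-poll pattern mentioned above is the same loop regardless of backend. In this sketch, submit and check_status are injected stand-ins, not real TinyFish endpoints; wire in your own HTTP calls.

```python
import time

def poll_batches(batches, submit, check_status, interval: float = 5.0, max_polls: int = 100):
    """Submit each batch, then poll its job ID until the backend reports completion.
    `submit(batch)` returns a job ID; `check_status(job_id)` returns
    ('done', result) or ('pending', None). Gives up after max_polls attempts."""
    results = []
    for batch in batches:
        job_id = submit(batch)
        for _ in range(max_polls):
            state, result = check_status(job_id)
            if state == "done":
                results.append(result)
                break
            time.sleep(interval)
        else:
            results.append(None)  # batch never completed within max_polls
    return results
```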

Test TinyFish against the protected or authenticated URLs in your list — 500 free steps, no credit card.

Get API Key

FAQ

What's the fastest way to fetch data from a large URL list in Python?

For static HTML content, httpx with asyncio is the fastest approach—you can process 20–50 URLs simultaneously with a single machine and finish 1,000 URLs in under a minute. The key is async execution: sequential requests would take 10–15x longer for the same list. For JavaScript-rendered content, Playwright in async mode with 5–10 concurrent browser pages is the practical ceiling before memory constraints become a factor on standard hardware.

How do I improve reliability when fetching data from many URLs?

Rate limiting is the first line of defense: 1–2 requests per second per domain for most sites, slower for aggressively protected ones. Rotate user agents across requests. For moderate protection, requests with a realistic user agent and reasonable delays works. For sites with enterprise-grade automation detection, stealth plugins help at low volume but degrade at scale; TinyFish provides infrastructure-level browser sessions that are more reliable for protected sites at production scale.
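
The per-domain pacing described here can be sketched as a last-request tracker; the 0.5-second default interval is an illustrative choice, not a recommendation for any specific site.

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum interval between requests to the same domain."""
    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self.last_hit: dict[str, float] = {}  # domain -> last request timestamp

    def wait(self, url: str) -> float:
        """Sleep if the last request to this domain was too recent; return the delay applied."""
        domain = urlparse(url).netloc
        now = time.monotonic()
        elapsed = now - self.last_hit.get(domain, 0.0)
        delay = max(0.0, self.min_interval - elapsed)
        if delay:
            time.sleep(delay)
        self.last_hit[domain] = time.monotonic()
        return delay
```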

Should I use Scrapy or Playwright for a large URL list?

Scrapy if your URLs return static HTML and you need high volume (10K+) with built-in scheduling, retry logic, and output pipelines. Playwright if URLs require JavaScript execution. The two aren't mutually exclusive—Scrapy has a Playwright middleware (scrapy-playwright) that handles JS rendering within Scrapy's architecture. For lists with mixed content types, start with Scrapy for the static subset and use a separate Playwright job for the JS-heavy URLs.

How do I deduplicate URLs before crawling?

Normalize URLs first: lowercase the scheme and domain, sort query parameters alphabetically, strip tracking parameters (utm_*, ref=, fbclid=), and resolve relative URLs to absolute. Python's urllib.parse.urlparse plus a set for deduplication handles most cases. For large lists with near-duplicate URLs (same page, different session IDs), a URL fingerprinting library like w3lib.url.canonicalize_url gives more aggressive deduplication.
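
The normalization steps above can be sketched with urllib.parse alone; the tracking-parameter list here is a common-convention assumption, and fragments are dropped as well.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING_PREFIXES = ("utm_",)                 # assumed tracking-param conventions
TRACKING_KEYS = {"ref", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    """Lowercase scheme and host, strip tracking params and fragments,
    and sort the remaining query parameters alphabetically."""
    p = urlparse(url)
    params = [
        (k, v) for k, v in parse_qsl(p.query, keep_blank_values=True)
        if not k.startswith(TRACKING_PREFIXES) and k not in TRACKING_KEYS
    ]
    query = urlencode(sorted(params))
    return urlunparse((p.scheme.lower(), p.netloc.lower(), p.path, "", query, ""))

def dedupe(urls: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized URL."""
    seen, out = set(), []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            out.append(url)
    return out
```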

When does crawling a URL list require authentication?

When the target pages are behind login walls that your team has authorized access to—supplier pricing portals, internal tools, subscription content, or any page that redirects to a login page for unauthenticated requests. Signs your list needs auth: all results return the same HTML (the login page), response sizes are suspiciously uniform, or you see redirect chains ending at /login. For authenticated list crawling at scale, session management becomes the primary complexity—handling login flows, session expiry, and re-authentication across many concurrent workers. TinyFish handles session management and multi-step login flows for sites where you have authorized account access — you provide credentials, the agent handles the rest.
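
The signs listed, identical HTML across results, uniform response sizes, and redirect chains ending at a login path, can be checked programmatically. This heuristic and its thresholds are assumptions, as is the result-dict shape ('final_url' and 'html' keys from a redirect-following fetcher).

```python
from urllib.parse import urlparse

def looks_like_login_wall(results: list[dict]) -> bool:
    """Heuristic: flag a batch whose responses all landed on a login page
    or came back suspiciously uniform. Each dict is assumed to carry
    'final_url' (post-redirect) and 'html' keys."""
    if not results:
        return False
    login_redirects = sum(
        1 for r in results
        if urlparse(r["final_url"]).path.rstrip("/").endswith(("/login", "/signin"))
    )
    all_identical = len(results) > 1 and len({r["html"] for r in results}) == 1
    uniform_sizes = len(results) > 3 and len({len(r["html"]) for r in results}) == 1
    return login_redirects == len(results) or all_identical or uniform_sizes
```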

Related Reading

  • Pillar: The Best Web Scraping Tools in 2026
  • How to Monitor 1,000 Websites in Parallel with the TinyFish API
  • Scraping Dynamic Websites: When Playwright Is the Right Tool