Engineering

Web Data Extraction: From Static Pages to AI Agents

TinyFishie·TinyFish Observer·May 5, 2026·10 min read

You need data from a website in a usable format: JSON, CSV, a clean table. The site doesn't expose an API. It isn't static HTML. It might render content via JavaScript, require authentication, or return different results to automation than to a browser.

Which tool you reach for depends entirely on which of those is true.

This guide covers all four tiers of website complexity with working code and honest tool recommendations. Tiers 1 and 2 don't need TinyFish — the right answer for many sites is still requests and BeautifulSoup. The guide says so.

Note: data extraction should respect each site's terms of service and robots.txt. The tools and techniques here are for legitimate use cases: price monitoring, market research, internal data pipelines, and publicly available information.

What Counts as "Structured Data Extraction"?

Dumping raw HTML isn't extraction. Structured data extraction means getting machine-readable output — JSON, CSV, a clean table — where the fields map to the information you actually need: price, title, availability, review count.

The right approach depends on one question: what kind of site is this?

The answer falls into four tiers, each requiring a different tool.

Four Tiers of Website Complexity

Tier | Site type | What makes it hard | Right tool | Code complexity
1 | Has an API or RSS feed | Nothing — use the API | `requests` + JSON | Minimal
2 | JS-rendered, no strict requirements | Content loads after JS runs | Playwright or TinyFish Fetch | Medium
3 | Strict automation requirements at scale | Infrastructure complexity at volume | TinyFish Fetch API | Low (API call)
4 | Authenticated or multi-step workflow | Session state + conditional decisions | TinyFish Web Agent | Low (goal string)

Tiers 1 and 2 don't require a managed platform. If you're on Tier 1 or 2 at low volume, use the simpler tool.

[Figure: Four tiers of website complexity, from static HTML pages with APIs to authenticated multi-step workflows requiring AI agents]

Tier 1 — Sites with APIs or RSS Feeds

Before writing a scraper, check for an API.

Many sites that appear to require scraping have JSON endpoints. Browser devtools → Network tab → filter for application/json requests. Product listing pages often load data from a /api/products or /v1/listings endpoint your scraper can call directly — faster, cleaner, and more stable than HTML parsing.
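
A minimal sketch of that pattern, assuming you found a hypothetical /api/products endpoint in the Network tab (the URL, query parameter, and response fields below are placeholders, not a real site's API):

import requests

# Hypothetical endpoint discovered in the browser's Network tab
API_URL = "https://example-shop.com/api/products"

def fetch_products_via_api(page: int = 1) -> list[dict]:
    """Call the site's own JSON endpoint instead of parsing its HTML."""
    r = requests.get(
        API_URL,
        params={"page": page},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    r.raise_for_status()
    # Response shape is an assumption; inspect a real response before relying on field names
    return r.json().get("items", [])

products = fetch_products_via_api()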

For sites with RSS or Atom feeds (blogs, news, job boards), the feed is already structured data.
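
A minimal sketch using feedparser, assuming the target exposes a standard RSS or Atom feed (the feed URL is a placeholder):

import feedparser

def extract_feed(url: str) -> list[dict]:
    """Parse an RSS/Atom feed into plain dicts."""
    feed = feedparser.parse(url)
    return [
        {"title": e.get("title"), "link": e.get("link"), "published": e.get("published")}
        for e in feed.entries
    ]

# Placeholder feed URL
posts = extract_feed("https://example-blog.com/feed.xml")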

For sites where the HTML itself is static and well-structured, requests + BeautifulSoup is the right tool. It's fast, it's free, and there's nothing to manage.

import requests
from bs4 import BeautifulSoup

def extract_products(url: str) -> list[dict]:
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")

    products = []
    for item in soup.select(".product-card"):
        products.append({
            "name": item.select_one(".product-name").text.strip(),
            "price": item.select_one(".product-price").text.strip(),
            "url": item.select_one("a")["href"],
        })
    return products

results = extract_products("https://example-shop.com/products")

This handles most documentation sites, static product catalogs, blog feeds, and public data sources. If it works, stop here.

Where it breaks down: Any site that loads content via JavaScript after the initial HTML response. You'll get the page shell with empty containers, not the data.
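
A quick way to confirm this before reaching for a browser is to check whether the raw HTML contains any of the elements you expect; a minimal sketch, with a placeholder URL and selector:

import requests
from bs4 import BeautifulSoup

def looks_js_rendered(url: str, selector: str) -> bool:
    """Heuristic: if the server-sent HTML has none of the expected elements,
    the content is probably injected by JavaScript after load."""
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    return len(soup.select(selector)) == 0

if looks_js_rendered("https://example-spa.com/listings", ".product-card"):
    print("Empty shell from requests; this page needs a real browser (Tier 2).")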

Tier 2 — JavaScript-Rendered Pages

Single-page apps, infinite scroll, dynamic tables, price widgets that load after the initial render — these require a real browser. The HTML you get from requests is what the server sends before JavaScript runs; the content you need loads after.

Two options, both valid:

Option A: Playwright (local, controlled environments)

import asyncio
from playwright.async_api import async_playwright
from typing import Optional

async def extract_js_page(url: str) -> Optional[str]:
    async with async_playwright() as p:
        browser = None
        try:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_selector(".product-list", timeout=10000)
            content = await page.inner_text(".product-list")
            return content
        except Exception as e:
            print(f"Error loading {url}: {e}")
            return None
        finally:
            if browser:
                await browser.close()

result = asyncio.run(extract_js_page("https://example-spa.com/listings"))

Playwright is the right choice for local development, CI pipelines, and controlled environments. You manage the browser process.

Option B: TinyFish Browser API (managed infrastructure, same Playwright code)

For the same JS-rendered page, connect to TinyFish's managed browser instead of launching one locally. Your selectors and extraction logic stay identical:

import asyncio
import os
from tinyfish import TinyFish
from playwright.async_api import async_playwright
from typing import Optional

async def extract_js_page_managed(url: str) -> Optional[str]:
    async with async_playwright() as p:
        browser = None
        try:
            session = TinyFish().browser.sessions.create()
            browser = await p.chromium.connect_over_cdp(session.cdp_url)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_selector(".product-list", timeout=10000)
            content = await page.inner_text(".product-list")
            return content
        except Exception as e:
            print(f"Error loading {url}: {e}")
            return None
        finally:
            if browser:
                await browser.close()

The automation logic is unchanged. The infrastructure — browser servers, proxy routing, reliability handling — is managed by TinyFish, with browser cold starts under 250ms (source: tinyfish.ai).

For local development and direct browser control, Playwright. For production extraction without running browser infrastructure, TinyFish Browser API.

Tier 3 — Sites with Strict Automation Requirements

The challenge at this tier isn't writing the code — it's owning the infrastructure that makes the code reliable at scale.

Browser servers, proxy routing, session handling, failure recovery: these aren't one-time setup tasks. They're an ongoing operational surface that grows with your target list. TinyFish's Browser API takes over that layer so your Python code stays focused on extraction logic, not ops, and the same API call works whether you're running 10 requests or 10,000.

You keep the same Playwright selectors; TinyFish manages browser servers, proxy routing, and reliability underneath.

import asyncio
import os
from tinyfish import TinyFish
from playwright.async_api import async_playwright
from typing import Optional

async def extract_with_managed_browser(url: str) -> Optional[str]:
    """Extract page content using TinyFish managed browser infrastructure."""
    async with async_playwright() as p:
        browser = None
        try:
            session = TinyFish().browser.sessions.create()
            browser = await p.chromium.connect_over_cdp(session.cdp_url)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle", timeout=30000)
            content = await page.content()
            return content
        except Exception as e:
            print(f"Error loading {url}: {e}")
            return None
        finally:
            if browser:
                await browser.close()

TinyFish achieves up to an 85% anti-bot pass rate (standard commercial sites; internal testing; source: tinyfish.ai) and browser cold starts under 250ms. For sites where these infrastructure concerns are the reliability bottleneck, the Fetch API removes that layer from your codebase.

For running this type of extraction at scale across many URLs simultaneously, see How to Monitor 1,000 Websites in Parallel with the TinyFish API.
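
As a rough sketch of what that pattern looks like with the extract_with_managed_browser function above, using bounded concurrency from the standard library (the limit of 10 is an arbitrary starting point, not a TinyFish requirement):

import asyncio
from typing import Optional

async def extract_many(urls: list[str], max_concurrent: int = 10) -> dict[str, Optional[str]]:
    """Run extract_with_managed_browser over many URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> tuple[str, Optional[str]]:
        async with semaphore:
            return url, await extract_with_managed_browser(url)

    results = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(results)

# pages = asyncio.run(extract_many(["https://example.com/a", "https://example.com/b"]))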

Tier 4 — Authenticated or Multi-Step Workflows

Some extraction targets involve authenticated workflows on your own accounts — they require:

  • Logging in before data is accessible
  • Navigation through multiple pages with conditional logic
  • Filling a search form and extracting results from the response
  • Handling state that changes between steps

Writing and maintaining the exact navigation sequence for these workflows in Playwright is high-maintenance code. Every change to the login flow or form structure breaks the script.

For goal-directed workflows, TinyFish's Web Agent takes a plain-language description of what you want and handles the navigation:

import requests
import os
from typing import Optional

def run_extraction_agent(url: str, goal: str) -> Optional[dict]:
    """Run a goal-directed extraction agent."""
    response = requests.post(
        "https://agent.tinyfish.ai/v1/automation/run",
        headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
        json={"url": url, "goal": goal},
        timeout=120
    )
    if not response.ok:
        return None
    data = response.json()
    # "COMPLETED" means the run finished — not necessarily that the goal succeeded
    if data.get("status") != "COMPLETED" or data.get("result") is None:
        return None
    return data["result"]

# Describe what you want — the agent handles the navigation
result = run_extraction_agent(
    url="https://supplier-portal.com/pricing",
    goal="Navigate to the pricing page, extract all product SKUs and their current prices. Return as JSON with fields: sku, name, price, currency."
)

The agent handles authentication, conditional navigation, and session state. You handle the goal definition.

For complex workflows and authenticated access to systems your team is authorized to use, use TinyFish's credential vault (use_vault: true) rather than passing credentials in the goal string.
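
A sketch of how that might look, assuming use_vault is passed alongside url and goal in the same run payload shown above (confirm the exact field name and vault setup in the TinyFish docs):

import os
import requests

response = requests.post(
    "https://agent.tinyfish.ai/v1/automation/run",
    headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
    json={
        "url": "https://supplier-portal.com/orders",
        "goal": "Log in, open the orders dashboard, and return all open orders as JSON with fields: order_id, status, total.",
        "use_vault": True,  # credentials come from the vault, never from the goal string
    },
    timeout=120,
)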

For more on goal-directed automation for complex extraction workflows, see From Selenium to AI Agents: A Migration Guide.

Choosing the Right Approach

Decision flowchart:

Does the site have a documented API or RSS feed?
  → Yes: Use the API directly. Don't scrape what you can query.
  → No: Continue

Does the page content load via JavaScript?
  → No: requests + BeautifulSoup
  → Yes: Continue

Do you need managed infrastructure (scale, reliability, no server management)?
  → No: Playwright (local browser)
  → Yes: TinyFish Fetch API

Does the workflow require authentication, multi-step navigation, or conditional decisions?
  → No: TinyFish Fetch API
  → Yes: TinyFish Web Agent

Start with the simplest tool that works. When that tool's limits appear — JavaScript rendering, infrastructure overhead at scale, multi-step workflows — the next tier is ready. Tiers 3 and 4 aren't edge cases; they're where production data pipelines typically land.
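
One way to make that decision explicit in code, as a plain routing function over the four tiers (the boolean inputs still come from inspecting the site by hand):

def choose_tool(has_api: bool, js_rendered: bool, needs_managed_infra: bool, multi_step: bool) -> str:
    """Mirror the decision flowchart above: the simplest tool that fits wins."""
    if has_api:
        return "Use the API or feed directly"
    if not js_rendered:
        return "requests + BeautifulSoup"
    if multi_step:
        return "TinyFish Web Agent"
    if needs_managed_infra:
        return "TinyFish Fetch API"
    return "Playwright (local browser)"

print(choose_tool(has_api=False, js_rendered=True, needs_managed_infra=True, multi_step=False))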

FAQ

What's the best Python library for extracting structured data from websites?

The right Python library depends on whether your target uses static HTML or JavaScript rendering. For static HTML, requests + BeautifulSoup or lxml is the fastest and simplest. For JavaScript-rendered pages, Playwright gives you full browser control. For managed infrastructure at scale, TinyFish's Fetch API removes the server maintenance. There's no single best library — the right choice is determined by the site type, volume, and whether you want to manage browser infrastructure yourself.

How do I extract JSON data that a website loads dynamically?

Check the Network tab in browser devtools first. Filter for application/json or fetch requests. Many sites that appear to require scraping actually load their data from an API endpoint you can call directly with requests. If the data isn't available via a direct API call, you'll need a real browser (Playwright or TinyFish Fetch with browser: true) to wait for the JavaScript to execute and the data to load.

What's the difference between web scraping and structured data extraction?

Web scraping typically refers to the process of downloading web pages. Structured data extraction is what you do with those pages: parsing them to get machine-readable output in a defined schema. You can scrape without extracting (just downloading HTML) and extract without scraping (calling an API). For practical purposes, the distinction matters because it determines your tool choice: if you need raw page content, a fetcher is enough; if you need specific fields in a specific format, you also need a parsing step.

How do I handle pagination when extracting data?

For static pagination (page 1, page 2, etc.), loop over page URLs and aggregate results. For infinite scroll or dynamic pagination, you need a real browser — Playwright can wait for the "Load More" button and click it programmatically, or TinyFish's Web Agent can handle pagination as part of a goal-directed workflow. For deep pagination at scale (thousands of pages), consider whether the site has an API endpoint for its paginated data — it's almost always faster than page-by-page extraction.
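
For the static case, a minimal sketch of the page-URL loop, reusing the extract_products function from Tier 1 (the ?page= query parameter and the empty-page stop condition are assumptions about the target site):

def extract_all_pages(base_url: str, max_pages: int = 50) -> list[dict]:
    """Loop over numbered pages and aggregate results until a page comes back empty."""
    all_items: list[dict] = []
    for page in range(1, max_pages + 1):
        items = extract_products(f"{base_url}?page={page}")  # assumed pagination scheme
        if not items:
            break  # an empty page usually means we've run past the last one
        all_items.extend(items)
    return all_items

# results = extract_all_pages("https://example-shop.com/products")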

When does extracting structured data become a legal concern?

The legal landscape varies by jurisdiction, site terms of service, and the type of data. In most cases, extracting publicly available data that doesn't contain personal information is legally defensible (see hiQ v. LinkedIn in the US context). Key boundaries: always respect robots.txt, don't extract and republish data in ways that compete with the source site's business, and be careful with personal data (GDPR applies even to publicly visible information about EU individuals). When in doubt, check the site's terms of service and consult legal counsel for commercial use cases.

Related Reading

  • The Best Web Scraping Tools in 2026
  • How to Scrape Dynamic Websites Without Playwright
  • How to Crawl a List of URLs at Scale
  • From Selenium to AI Agents: A Migration Guide