Engineering

Web Data Extraction: From Static Pages to AI Agents

TinyFishie·TinyFish Observer·May 5, 2026·10 min read

You need data from a website in a usable format: JSON, CSV, a clean table. The site doesn't expose an API. It isn't static HTML. It might render content via JavaScript, require authentication, or return different results to automation than to a browser.

Which tool you reach for depends entirely on which of those is true.

This guide covers all four tiers of website complexity with working code and honest tool recommendations. Tiers 1 and 2 don't need TinyFish — the right answer for many sites is still requests and BeautifulSoup. The guide says so.

Note: data extraction should respect each site's terms of service and robots.txt. The tools and techniques here are for legitimate use cases: price monitoring, market research, internal data pipelines, and publicly available information.

What Counts as "Structured Data Extraction"?

Dumping raw HTML isn't extraction. Structured data extraction means getting machine-readable output — JSON, CSV, a clean table — where the fields map to the information you actually need: price, title, availability, review count.

The right approach depends on one question: what kind of site is this?

The answer falls into four tiers, each requiring a different tool.

Four Tiers of Website Complexity

Tier | Site type | What makes it hard | Right tool | Code complexity
1 | Has an API or RSS feed | Nothing — use the API | `requests` + JSON | Minimal
2 | JS-rendered, no strict requirements | Content loads after JS runs | Playwright or TinyFish Fetch | Medium
3 | Strict automation requirements at scale | Infrastructure complexity at volume | TinyFish Fetch API | Low (API call)
4 | Authenticated or multi-step workflow | Session state + conditional decisions | TinyFish Web Agent | Low (goal string)

Tiers 1 and 2 don't require a managed platform. If you're on Tier 1 or 2 at low volume, use the simpler tool.

[Figure: Four tiers of website complexity, from static HTML pages with APIs to authenticated multi-step workflows requiring AI agents]

Tier 1 — Sites with APIs or RSS Feeds

Before writing a scraper, check for an API.

Many sites that appear to require scraping have JSON endpoints. Browser devtools → Network tab → filter for application/json requests. Product listing pages often load data from a /api/products or /v1/listings endpoint your scraper can call directly — faster, cleaner, and more stable than HTML parsing.
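
A minimal sketch of that pattern, assuming you found a hypothetical /api/products endpoint in the Network tab (the URL, query parameter, and response fields below are placeholders, not a real site's API):

import requests

# Hypothetical endpoint discovered in the browser's Network tab
API_URL = "https://example-shop.com/api/products"

def fetch_products_via_api(page: int = 1) -> list[dict]:
    """Call the site's own JSON endpoint instead of parsing its HTML."""
    r = requests.get(
        API_URL,
        params={"page": page},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    r.raise_for_status()
    # Response shape is an assumption; inspect a real response before relying on field names
    return r.json().get("items", [])

products = fetch_products_via_api()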

For sites with RSS or Atom feeds (blogs, news, job boards), the feed is already structured data.
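
A minimal sketch using feedparser, assuming the target exposes a standard RSS or Atom feed (the feed URL is a placeholder):

import feedparser

def extract_feed(url: str) -> list[dict]:
    """Parse an RSS/Atom feed into plain dicts."""
    feed = feedparser.parse(url)
    return [
        {"title": e.get("title"), "link": e.get("link"), "published": e.get("published")}
        for e in feed.entries
    ]

# Placeholder feed URL
posts = extract_feed("https://example-blog.com/feed.xml")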

For sites where the HTML itself is static and well-structured, requests + BeautifulSoup is the right tool. It's fast, it's free, and there's nothing to manage.

import requests
from bs4 import BeautifulSoup

def extract_products(url: str) -> list[dict]:
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")

    products = []
    for item in soup.select(".product-card"):
        products.append({
            "name": item.select_one(".product-name").text.strip(),
            "price": item.select_one(".product-price").text.strip(),
            "url": item.select_one("a")["href"],
        })
    return products

results = extract_products("https://example-shop.com/products")

This handles most documentation sites, static product catalogs, blog feeds, and public data sources. If it works, stop here.

Where it breaks down: Any site that loads content via JavaScript after the initial HTML response. You'll get the page shell with empty containers, not the data.
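
A quick way to confirm this before reaching for a browser is to check whether the raw HTML contains any of the elements you expect; a minimal sketch, with a placeholder URL and selector:

import requests
from bs4 import BeautifulSoup

def looks_js_rendered(url: str, selector: str) -> bool:
    """Heuristic: if the server-sent HTML has none of the expected elements,
    the content is probably injected by JavaScript after load."""
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    return len(soup.select(selector)) == 0

if looks_js_rendered("https://example-spa.com/listings", ".product-card"):
    print("Empty shell from requests; this page needs a real browser (Tier 2).")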

Tier 2 — JavaScript-Rendered Pages

Single-page apps, infinite scroll, dynamic tables, price widgets that load after the initial render — these require a real browser. The HTML you get from requests is what the server sends before JavaScript runs; the content you need loads after.

Two options, both valid:

Option A: Playwright (local, controlled environments)

import asyncio
from playwright.async_api import async_playwright
from typing import Optional

async def extract_js_page(url: str) -> Optional[str]:
    async with async_playwright() as p:
        browser = None
        try:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_selector(".product-list", timeout=10000)
            content = await page.inner_text(".product-list")
            return content
        except Exception as e:
            print(f"Error loading {url}: {e}")
            return None
        finally:
            if browser:
                await browser.close()

result = asyncio.run(extract_js_page("https://example-spa.com/listings"))

Playwright is the right choice for local development, CI pipelines, and controlled environments. You manage the browser process.

Option B: TinyFish Browser API (managed infrastructure, same Playwright code)

For the same JS-rendered page, connect to TinyFish's managed browser instead of launching one locally. Your selectors and extraction logic stay identical:

import asyncio
import os
from tinyfish import TinyFish
from playwright.async_api import async_playwright
from typing import Optional

async def extract_js_page_managed(url: str) -> Optional[str]:
    async with async_playwright() as p:
        browser = None
        try:
            session = TinyFish().browser.sessions.create()
            browser = await p.chromium.connect_over_cdp(session.cdp_url)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_selector(".product-list", timeout=10000)
            content = await page.inner_text(".product-list")
            return content
        except Exception as e:
            print(f"Error loading {url}: {e}")
            return None
        finally:
            if browser:
                await browser.close()

The automation logic is unchanged. The infrastructure — browser servers, proxy routing, reliability handling — is managed by TinyFish, with browser cold starts under 250ms (source: tinyfish.ai).

For local development and direct browser control, Playwright. For production extraction without running browser infrastructure, TinyFish Browser API.

Tier 3 — Sites with Strict Automation Requirements

The challenge at this tier isn't writing the code — it's owning the infrastructure that makes the code reliable at scale.

Browser servers, proxy routing, session handling, failure recovery: these aren't one-time setup tasks. They're an ongoing operational surface that grows with your target list. TinyFish's Browser API takes over that layer so your Python code stays focused on extraction logic, not ops, and the same API call works whether you're running 10 requests or 10,000.

You keep the same Playwright selectors; TinyFish manages browser servers, proxy routing, and reliability underneath.

import asyncio
import os
from tinyfish import TinyFish
from playwright.async_api import async_playwright
from typing import Optional

async def extract_with_managed_browser(url: str) -> Optional[str]:
    """Extract page content using TinyFish managed browser infrastructure."""
    async with async_playwright() as p:
        browser = None
        try:
            session = TinyFish().browser.sessions.create()
            browser = await p.chromium.connect_over_cdp(session.cdp_url)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle", timeout=30000)
            content = await page.content()
            return content
        except Exception as e:
            print(f"Error loading {url}: {e}")
            return None
        finally:
            if browser:
                await browser.close()

TinyFish achieves up to an 85% anti-bot pass rate (standard commercial sites; internal testing; source: tinyfish.ai) and browser cold starts under 250ms. For sites where these infrastructure concerns are the reliability bottleneck, the Fetch API removes that layer from your codebase.

For running this type of extraction at scale across many URLs simultaneously, see How to Monitor 1,000 Websites in Parallel with the TinyFish API.
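
As a rough sketch of what that pattern looks like with the extract_with_managed_browser function above, using bounded concurrency from the standard library (the limit of 10 is an arbitrary starting point, not a TinyFish requirement):

import asyncio
from typing import Optional

async def extract_many(urls: list[str], max_concurrent: int = 10) -> dict[str, Optional[str]]:
    """Run extract_with_managed_browser over many URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> tuple[str, Optional[str]]:
        async with semaphore:
            return url, await extract_with_managed_browser(url)

    results = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(results)

# pages = asyncio.run(extract_many(["https://example.com/a", "https://example.com/b"]))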

Tier 4 — Authenticated or Multi-Step Workflows

Some extraction targets involve authenticated workflows on your own accounts — they require:

  • Logging in before data is accessible
  • Navigation through multiple pages with conditional logic
  • Filling a search form and extracting results from the response
  • Handling state that changes between steps

Writing and maintaining the exact navigation sequence for these workflows in Playwright is high-maintenance code. Every change to the login flow or form structure breaks the script.

For goal-directed workflows, TinyFish's Web Agent takes a plain-language description of what you want and handles the navigation:

import requests
import os
from typing import Optional

def run_extraction_agent(url: str, goal: str) -> Optional[dict]:
    """Run a goal-directed extraction agent."""
    response = requests.post(
        "https://agent.tinyfish.ai/v1/automation/run",
        headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
        json={"url": url, "goal": goal},
        timeout=120
    )
    if not response.ok:
        return None
    data = response.json()
    # "COMPLETED" means the run finished — not necessarily that the goal succeeded
    if data.get("status") != "COMPLETED" or data.get("result") is None:
        return None
    return data["result"]

# Describe what you want — the agent handles the navigation
result = run_extraction_agent(
    url="https://supplier-portal.com/pricing",
    goal="Navigate to the pricing page, extract all product SKUs and their current prices. Return as JSON with fields: sku, name, price, currency."
)

The agent handles authentication, conditional navigation, and session state. You handle the goal definition.

For complex workflows and authenticated access to systems your team is authorized to use, use TinyFish's credential vault (use_vault: true) rather than passing credentials in the goal string.
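
A sketch of how that might look, assuming use_vault is passed alongside url and goal in the same run payload shown above (confirm the exact field name and vault setup in the TinyFish docs):

import os
import requests

response = requests.post(
    "https://agent.tinyfish.ai/v1/automation/run",
    headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
    json={
        "url": "https://supplier-portal.com/orders",
        "goal": "Log in, open the orders dashboard, and return all open orders as JSON with fields: order_id, status, total.",
        "use_vault": True,  # credentials come from the vault, never from the goal string
    },
    timeout=120,
)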

For more on goal-directed automation for complex extraction workflows, see From Selenium to AI Agents: A Migration Guide.

Choosing the Right Approach

Decision flowchart:

Does the site have a documented API or RSS feed?
  → Yes: Use the API directly. Don't scrape what you can query.
  → No: Continue

Does the page content load via JavaScript?
  → No: requests + BeautifulSoup
  → Yes: Continue

Do you need managed infrastructure (scale, reliability, no server management)?
  → No: Playwright (local browser)
  → Yes: TinyFish Fetch API

Does the workflow require authentication, multi-step navigation, or conditional decisions?
  → No: TinyFish Fetch API
  → Yes: TinyFish Web Agent

Start with the simplest tool that works. When that tool's limits appear — JavaScript rendering, infrastructure overhead at scale, multi-step workflows — the next tier is ready. Tiers 3 and 4 aren't edge cases; they're where production data pipelines typically land.
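
One way to make that decision explicit in code, as a plain routing function over the four tiers (the boolean inputs still come from inspecting the site by hand):

def choose_tool(has_api: bool, js_rendered: bool, needs_managed_infra: bool, multi_step: bool) -> str:
    """Mirror the decision flowchart above: the simplest tool that fits wins."""
    if has_api:
        return "Use the API or feed directly"
    if not js_rendered:
        return "requests + BeautifulSoup"
    if multi_step:
        return "TinyFish Web Agent"
    if needs_managed_infra:
        return "TinyFish Fetch API"
    return "Playwright (local browser)"

print(choose_tool(has_api=False, js_rendered=True, needs_managed_infra=True, multi_step=False))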

FAQ

What's the best Python library for extracting structured data from websites?

The right Python library depends on whether your target uses static HTML or JavaScript rendering. For static HTML, requests + BeautifulSoup or lxml is the fastest and simplest. For JavaScript-rendered pages, Playwright gives you full browser control. For managed infrastructure at scale, TinyFish's Fetch API removes the server maintenance. There's no single best library — the right choice is determined by the site type, volume, and whether you want to manage browser infrastructure yourself.

How do I extract JSON data that a website loads dynamically?

Check the Network tab in browser devtools first. Filter for application/json or fetch requests. Many sites that appear to require scraping actually load their data from an API endpoint you can call directly with requests. If the data isn't available via a direct API call, you'll need a real browser (Playwright or TinyFish Fetch with browser: true) to wait for the JavaScript to execute and the data to load.

What's the difference between web scraping and structured data extraction?

Web scraping typically refers to the process of downloading web pages. Structured data extraction is what you do with those pages: parsing them to get machine-readable output in a defined schema. You can scrape without extracting (just downloading HTML) and extract without scraping (calling an API). For practical purposes, the distinction matters because it determines your tool choice: if you need raw page content, a fetcher is enough; if you need specific fields in a specific format, you also need a parsing step.

How do I handle pagination when extracting data?

For static pagination (page 1, page 2, etc.), loop over page URLs and aggregate results. For infinite scroll or dynamic pagination, you need a real browser — Playwright can wait for the "Load More" button and click it programmatically, or TinyFish's Web Agent can handle pagination as part of a goal-directed workflow. For deep pagination at scale (thousands of pages), consider whether the site has an API endpoint for its paginated data — it's almost always faster than page-by-page extraction.
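
For the static case, a minimal sketch of the page-URL loop, reusing the extract_products function from Tier 1 (the ?page= query parameter and the empty-page stop condition are assumptions about the target site):

def extract_all_pages(base_url: str, max_pages: int = 50) -> list[dict]:
    """Loop over numbered pages and aggregate results until a page comes back empty."""
    all_items: list[dict] = []
    for page in range(1, max_pages + 1):
        items = extract_products(f"{base_url}?page={page}")  # assumed pagination scheme
        if not items:
            break  # an empty page usually means we've run past the last one
        all_items.extend(items)
    return all_items

# results = extract_all_pages("https://example-shop.com/products")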

When does extracting structured data become a legal concern?

The legal landscape varies by jurisdiction, site terms of service, and the type of data. In most cases, extracting publicly available data that doesn't contain personal information is legally defensible (see hiQ v. LinkedIn in the US context). Key boundaries: always respect robots.txt, don't extract and republish data in ways that compete with the source site's business, and be careful with personal data (GDPR applies even to publicly visible information about EU individuals). When in doubt, check the site's terms of service and consult legal counsel for commercial use cases.

Related Reading

  • The Best Web Scraping Tools in 2026
  • How to Scrape Dynamic Websites Without Playwright
  • How to Crawl a List of URLs at Scale
  • From Selenium to AI Agents: A Migration Guide