
You need data from a website in a usable format: JSON, CSV, a clean table. The site doesn't offer an API. It isn't static HTML. It might render content via JavaScript, require authentication, or return different results to automation than to a browser.
Which tool you reach for depends entirely on which of those is true.
This guide covers all four tiers of website complexity with working code and honest tool recommendations. Tiers 1 and 2 don't need TinyFish: for many sites, the right answer is still requests and BeautifulSoup, and this guide says so.
Note: data extraction should respect each site's terms of service and robots.txt. The tools and techniques here are for legitimate use cases: price monitoring, market research, internal data pipelines, and publicly available information.
Dumping raw HTML isn't extraction. Structured data extraction means getting machine-readable output — JSON, CSV, a clean table — where the fields map to the information you actually need: price, title, availability, review count.
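As a concrete illustration, a single extracted record in a defined schema might look like the sketch below (field names and values are illustrative, not from any particular site):

```python
# One extracted record in a defined schema; fields and values are illustrative.
record = {
    "title": "Wireless Mouse",
    "price": 24.99,
    "availability": "in_stock",
    "review_count": 312,
}
```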
The right approach depends on one question: what kind of site is this?
The answer falls into four tiers, each requiring a different tool.
| Tier | Site type | What makes it hard | Right tool | Code complexity |
|---|---|---|---|---|
| 1 | Has an API or RSS feed | Nothing — use the API | `requests` + JSON | Minimal |
| 2 | JS-rendered, no strict requirements | Content loads after JS runs | Playwright or TinyFish Fetch | Medium |
| 3 | Strict automation requirements at scale | Infrastructure complexity at volume | TinyFish Fetch API | Low (API call) |
| 4 | Authenticated or multi-step workflow | Session state + conditional decisions | TinyFish Web Agent | Low (goal string) |
Tiers 1 and 2 don't require a managed platform. If you're on Tier 1 or 2 at low volume, use the simpler tool.

Before writing a scraper, check for an API.
Many sites that appear to require scraping have JSON endpoints. Browser devtools → Network tab → filter for application/json requests. Product listing pages often load data from a /api/products or /v1/listings endpoint your scraper can call directly — faster, cleaner, and more stable than HTML parsing.
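If the Network tab does reveal a JSON endpoint, a plain requests call is all you need. The endpoint path and query parameter below are placeholders for whatever the site actually exposes:

```python
import requests

# Hypothetical endpoint discovered in the Network tab; the path, parameters,
# and response shape will differ per site.
API_URL = "https://example-shop.com/api/products"

resp = requests.get(API_URL, params={"page": 1}, timeout=30)
resp.raise_for_status()

# The endpoint already returns structured data, so no HTML parsing is needed.
products = resp.json()
for product in products:
    print(product.get("name"), product.get("price"))
```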
For sites with RSS or Atom feeds (blogs, news, job boards), the feed is already structured data.
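A feed can be parsed in a few lines with the feedparser library; the feed URL below is a placeholder, and any RSS or Atom feed works the same way:

```python
import feedparser  # pip install feedparser

# Placeholder feed URL; substitute the site's real RSS or Atom feed.
feed = feedparser.parse("https://example-blog.com/feed.xml")

rows = [
    {"title": entry.title, "link": entry.link, "published": entry.get("published", "")}
    for entry in feed.entries
]
```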
For sites where the HTML itself is static and well-structured, requests + BeautifulSoup is the right tool. It's fast, it's free, and there's nothing to manage.
```python
import requests
from bs4 import BeautifulSoup


def extract_products(url: str) -> list[dict]:
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")

    products = []
    for item in soup.select(".product-card"):
        products.append({
            "name": item.select_one(".product-name").text.strip(),
            "price": item.select_one(".product-price").text.strip(),
            "url": item.select_one("a")["href"],
        })
    return products


results = extract_products("https://example-shop.com/products")
```

This handles most documentation sites, static product catalogs, blog feeds, and public data sources. If it works, stop here.
Where it breaks down: Any site that loads content via JavaScript after the initial HTML response. You'll get the page shell with empty containers, not the data.
Single-page apps, infinite scroll, dynamic tables, price widgets that load after the initial render — these require a real browser. The HTML you get from requests is what the server sends before JavaScript runs; the content you need loads after.
Two options, both valid:
Option A: Playwright (local, controlled environments)
```python
import asyncio
from playwright.async_api import async_playwright
from typing import Optional


async def extract_js_page(url: str) -> Optional[str]:
    async with async_playwright() as p:
        browser = None
        try:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_selector(".product-list", timeout=10000)
            content = await page.inner_text(".product-list")
            return content
        except Exception as e:
            print(f"Error loading {url}: {e}")
            return None
        finally:
            if browser:
                await browser.close()


result = asyncio.run(extract_js_page("https://example-spa.com/listings"))
```

Playwright is the right choice for local development, CI pipelines, and controlled environments. You manage the browser process.
Option B: TinyFish Browser API (managed infrastructure, same Playwright code)
For the same JS-rendered page, connect to TinyFish's managed browser instead of launching one locally. Your selectors and extraction logic stay identical:
```python
import asyncio
import os
from tinyfish import TinyFish
from playwright.async_api import async_playwright
from typing import Optional


async def extract_js_page_managed(url: str) -> Optional[str]:
    async with async_playwright() as p:
        browser = None
        try:
            session = TinyFish().browser.sessions.create()
            browser = await p.chromium.connect_over_cdp(session.cdp_url)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_selector(".product-list", timeout=10000)
            content = await page.inner_text(".product-list")
            return content
        except Exception as e:
            print(f"Error loading {url}: {e}")
            return None
        finally:
            if browser:
                await browser.close()
```

The automation logic is unchanged. The infrastructure (browser servers, proxy routing, reliability handling) is managed by TinyFish. Browser cold starts are under 250ms (source: tinyfish.ai).
For local development and direct browser control, Playwright. For production extraction without running browser infrastructure, TinyFish Browser API.
The challenge at this tier isn't writing the code — it's owning the infrastructure that makes the code reliable at scale.
Browser servers, proxy routing, session handling, failure recovery: these aren't one-time setup tasks. They're an ongoing operational surface that grows with your target list. The same API call works whether you're running 10 requests or 10,000.
TinyFish's Browser API handles that infrastructure layer so your Python code stays focused on extraction logic, not ops. You use the same Playwright selectors; TinyFish manages browser servers, proxy routing, and reliability underneath.
```python
import asyncio
import os
from tinyfish import TinyFish
from playwright.async_api import async_playwright
from typing import Optional


async def extract_with_managed_browser(url: str) -> Optional[str]:
    """Extract page content using TinyFish managed browser infrastructure."""
    async with async_playwright() as p:
        browser = None
        try:
            session = TinyFish().browser.sessions.create()
            browser = await p.chromium.connect_over_cdp(session.cdp_url)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle", timeout=30000)
            content = await page.content()
            return content
        except Exception as e:
            print(f"Error loading {url}: {e}")
            return None
        finally:
            if browser:
                await browser.close()
```

TinyFish achieves up to an 85% anti-bot pass rate (standard commercial sites, internal testing; source: tinyfish.ai) and browser cold starts under 250ms. For sites where these infrastructure concerns are the reliability bottleneck, the Fetch API removes that layer from your codebase.
For running this type of extraction at scale across many URLs simultaneously, see How to Monitor 1,000 Websites in Parallel with the TinyFish API.
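As a rough sketch of what that fan-out looks like, the snippet below runs the extract_with_managed_browser() function from above concurrently over a placeholder URL list. A production run would add batching, retries, and rate limiting:

```python
import asyncio

# Hypothetical URL list; each call reuses extract_with_managed_browser() defined above.
urls = [
    "https://example-shop.com/products?page=1",
    "https://example-shop.com/products?page=2",
    "https://example-shop.com/products?page=3",
]


async def extract_many(urls: list[str]) -> list:
    # gather() runs the extractions concurrently; failed URLs come back as None.
    return await asyncio.gather(*(extract_with_managed_browser(u) for u in urls))


pages = asyncio.run(extract_many(urls))
```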
Some extraction targets involve authenticated workflows on your own accounts: logging in, keeping session state across multiple pages, and making conditional decisions based on what each page shows.
Writing and maintaining the exact navigation sequence for these workflows in Playwright is high-maintenance code. Every change to the login flow or form structure breaks the script.
For goal-directed workflows, TinyFish's Web Agent takes a plain-language description of what you want and handles the navigation:
```python
import requests
import os
from typing import Optional


def run_extraction_agent(url: str, goal: str) -> Optional[dict]:
    """Run a goal-directed extraction agent."""
    response = requests.post(
        "https://agent.tinyfish.ai/v1/automation/run",
        headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
        json={"url": url, "goal": goal},
        timeout=120,
    )
    if not response.ok:
        return None

    data = response.json()
    # "COMPLETED" means the run finished — not necessarily that the goal succeeded
    if data.get("status") != "COMPLETED" or data.get("result") is None:
        return None
    return data["result"]


# Describe what you want — the agent handles the navigation
result = run_extraction_agent(
    url="https://supplier-portal.com/pricing",
    goal="Navigate to the pricing page, extract all product SKUs and their current prices. Return as JSON with fields: sku, name, price, currency.",
)
```

The agent handles authentication, conditional navigation, and session state. You handle the goal definition.
For complex workflows and authenticated access to systems your team is authorized to use, use TinyFish's credential vault (use_vault: true) rather than passing credentials in the goal string.
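A minimal sketch of what that request might look like, assuming use_vault is passed as a top-level field in the same /v1/automation/run body used above; the exact parameter placement is an assumption, so check the TinyFish documentation for the real schema:

```python
import os
import requests

# Assumption: use_vault sits at the top level of the run request body.
response = requests.post(
    "https://agent.tinyfish.ai/v1/automation/run",
    headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
    json={
        "url": "https://supplier-portal.com/orders",
        "goal": "Log in, open the orders dashboard, and return open orders as JSON.",
        "use_vault": True,  # credentials come from the vault, not the goal string
    },
    timeout=120,
)
```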
For more on goal-directed automation for complex extraction workflows, see From Selenium to AI Agents: A Migration Guide.
Decision flowchart:
Does the site have a documented API or RSS feed?
→ Yes: Use the API directly. Don't scrape what you can query.
→ No: Continue
Does the page content load via JavaScript?
→ No: requests + BeautifulSoup
→ Yes: Continue
Do you need managed infrastructure (scale, reliability, no server management)?
→ No: Playwright (local browser)
→ Yes: TinyFish Fetch API
Does the workflow require authentication, multi-step navigation, or conditional decisions?
→ No: TinyFish Fetch API
→ Yes: TinyFish Web Agent

Start with the simplest tool that works. When that tool's limits appear (JavaScript rendering, infrastructure overhead at scale, multi-step workflows), the next tier is ready. Tiers 3 and 4 aren't edge cases; they're where production data pipelines typically land.
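The same flowchart can be expressed as a small Python helper; the boolean inputs and return labels are illustrative, and the flowchart above remains the authoritative version:

```python
def choose_tool(has_api_or_feed: bool, js_rendered: bool,
                needs_managed_infra: bool, needs_auth_or_multistep: bool) -> str:
    """Rough encoding of the decision flowchart above."""
    if has_api_or_feed:
        return "Use the API or feed directly (requests + JSON)"
    if not js_rendered:
        return "requests + BeautifulSoup"
    if needs_auth_or_multistep:
        return "TinyFish Web Agent"
    if needs_managed_infra:
        return "TinyFish Fetch API"
    return "Playwright (local browser)"
```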
The right Python library depends on whether your target uses static HTML or JavaScript rendering. For static HTML, requests + BeautifulSoup or lxml is the fastest and simplest. For JavaScript-rendered pages, Playwright gives you full browser control. For managed infrastructure at scale, TinyFish's Fetch API removes the server maintenance. There's no single best library — the right choice is determined by the site type, volume, and whether you want to manage browser infrastructure yourself.
Check the Network tab in browser devtools first. Filter for application/json or fetch requests. Many sites that appear to require scraping actually load their data from an API endpoint you can call directly with requests. If the data isn't available via a direct API call, you'll need a real browser (Playwright or TinyFish Fetch with browser: true) to wait for the JavaScript to execute and the data to load.
Web scraping typically refers to the process of downloading web pages. Structured data extraction is what you do with those pages: parsing them to get machine-readable output in a defined schema. You can scrape without extracting (just downloading HTML) and extract without scraping (calling an API). For practical purposes, the distinction matters because it determines your tool choice: if you need raw page content, a fetcher is enough; if you need specific fields in a specific format, you also need a parsing step.
For static pagination (page 1, page 2, etc.), loop over page URLs and aggregate results. For infinite scroll or dynamic pagination, you need a real browser — Playwright can wait for the "Load More" button and click it programmatically, or TinyFish's Web Agent can handle pagination as part of a goal-directed workflow. For deep pagination at scale (thousands of pages), consider whether the site has an API endpoint for its paginated data — it's almost always faster than page-by-page extraction.
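A minimal sketch of the static-pagination loop, reusing the extract_products() function from the Tier 1 example; the URL pattern and page count are placeholders:

```python
# Loop over page URLs and aggregate results; adjust the range to the site's page count.
all_products = []
for page in range(1, 6):
    page_url = f"https://example-shop.com/products?page={page}"
    all_products.extend(extract_products(page_url))
```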
The legal landscape varies by jurisdiction, site terms of service, and the type of data. In most cases, extracting publicly available data that doesn't contain personal information is legally defensible (see hiQ v. LinkedIn in the US context). Key boundaries: always respect robots.txt, don't extract and republish data in ways that compete with the source site's business, and be careful with personal data (GDPR applies even to publicly visible information about EU individuals). When in doubt, check the site's terms of service and consult legal counsel for commercial use cases.
Get started with TinyFish: no credit card, no setup. Run your first operation in under a minute.