Define your data model once. topscrape tries every selector in your chain, validates the result with Pydantic, and warns you the moment something drifts.
```shell
pip install topscrape
```
Built on the modern Python stack — Pydantic V2, httpx, parsel — and designed so you spend time on your data, not your selectors.
Subclass ScraperModel, annotate your fields — that's your scraper. No soup loops, no manual None checks.
List multiple CSS selectors, XPath expressions, or regex patterns per field. First match wins, automatically.
When a fallback fires, you get a SelectorDriftWarning with the exact field and selector. Catch breakage before it breaks.
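If you want drift to fail your CI rather than just log, Python's standard `warnings` filter can promote it to an exception. A minimal, self-contained sketch: the local `SelectorDriftWarning` class below is a stand-in so the snippet runs without topscrape installed; in real code you would import it from the library instead.

```python
import warnings

# Stand-in so this sketch runs anywhere; in real code, import
# SelectorDriftWarning from topscrape instead of defining it here.
class SelectorDriftWarning(UserWarning):
    pass

# Promote drift warnings to hard errors: ideal for CI, where a silent
# fallback today is unnoticed breakage tomorrow.
warnings.filterwarnings("error", category=SelectorDriftWarning)

caught = None
try:
    # topscrape would emit a warning like this during parsing
    warnings.warn("field 'price': primary selector failed", SelectorDriftWarning)
except SelectorDriftWarning as exc:
    caught = exc

print(f"drift detected: {caught}")
```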
Get a `float`, not `"$19.99"`. Types are enforced at parse time. Bad data raises a clear, typed exception.
Every constructor has an async twin. Scrape hundreds of pages concurrently with asyncio and httpx.
Pass a lambda to clean raw strings before validation. Strip symbols, convert units, normalise casing — inline.
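In effect, a transform composes a cleanup function with the field's type: conceptually, `float(transform(raw))`.

```python
def transform(v: str) -> str:
    # the same cleanup you would pass as Field(transform=...)
    return v.replace("$", "").replace(",", "")

raw = "$1,299.99"              # what the selector matched
price = float(transform(raw))  # what the typed model field receives
print(price)  # → 1299.99
```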
Real examples — from a simple CSS extraction to async scraping with transforms.
```python
from topscrape import ScraperModel, Field

class Product(ScraperModel):
    title: str = Field(selectors=["h1.title", "h1"])
    price: str = Field(selectors=[".product-price", "[data-price]"])
    rating: float = Field(selectors=["[data-rating]"], attr="data-rating")
    badge: str = Field(selectors=[".badge"], default="N/A")

# Sync — one-line fetch + parse
product = Product.from_url("https://example.com/item/1")
print(product.title)   # → "Best Laptop 2024"
print(product.rating)  # → 4.7 (float, not "4.7")
print(product.badge)   # → "N/A" (default used)
```
```python
from topscrape import ScraperModel, Field

class Product(ScraperModel):
    # transform runs BEFORE Pydantic validation
    price: float = Field(
        selectors=[".price", "[data-price]", "//span[@itemprop='price']"],
        transform=lambda v: v.replace("$", "").replace(",", ""),
    )
    # XPath selector — any string starting with //
    sku: str = Field(selectors=["//meta[@name='sku']"], attr="content")
    # Regex selector — prefix with r:
    version: str = Field(selectors=[r"r:v(\d+\.\d+\.\d+)"], default="")

product = Product.from_html(html_string)
print(product.price)  # → 1299.99 (a float, not "$1,299.99")
```
```python
import asyncio

from topscrape import ScraperModel, Field

class Article(ScraperModel):
    title: str = Field(selectors=["h1", ".article-title"])
    author: str = Field(selectors=[".byline"], default="Unknown")

async def scrape_many(urls: list[str]):
    tasks = [Article.from_url_async(url) for url in urls]
    articles = await asyncio.gather(*tasks)  # runs all requests concurrently
    return articles

articles = asyncio.run(scrape_many(urls))
for a in articles:
    print(a.title, "—", a.author)
```
```python
from topscrape import ScraperModel, Field

class ProductPage(ScraperModel):
    # multiple=True → returns a list of all matches
    features: list[str] = Field(
        selectors=[".feature-list li", ".specs td"],
        multiple=True,
    )
    images: list[str] = Field(
        selectors=["img.gallery-photo"],
        attr="src",
        multiple=True,
    )

page = ProductPage.from_html(html)
print(page.features)     # → ["16GB RAM", "1TB SSD", "4K Display"]
print(len(page.images))  # → 5
```
```shell
# Extract the page title
$ topscrape https://example.com "title"
Example Domain

# Try two selectors — fallback chain
$ topscrape https://example.com ".price" "[data-price]"
[DRIFT WARNING] Primary selector '.price' failed; used fallback[1] '[data-price]'.
19.99

# Extract an attribute
$ topscrape https://example.com "a.hero-link" --attr href

# All matches as JSON
$ topscrape https://example.com "li.feature" --all --json
```
topscrape tracks which selector in your chain matched. The moment a fallback fires, you get a precise warning — not a silent None.
| Parameter | Type | Description |
|---|---|---|
| `selectors` | `list[str]` | Ordered list to try. CSS by default; prefix `//` for XPath; prefix `r:` for regex. |
| `attr` | `str \| None` | HTML attribute to extract instead of text content, e.g. `"href"`, `"src"`. |
| `transform` | `callable \| None` | Applied to the raw string before Pydantic validation. Use it to clean, strip, or convert. |
| `default` | `Any` | Returned when all selectors fail. Default is `...` (field is required). |
| `multiple` | `bool` | Return all matches as a list instead of only the first. Default: `False`. |
| Method | Description |
|---|---|
| `.from_html(html, url="")` | Parse a raw HTML string. |
| `.from_url(url, **kwargs)` | Fetch and parse synchronously. Extra kwargs are forwarded to httpx. |
| `await .from_url_async(url, **kwargs)` | Async fetch and parse. Ideal for concurrent scraping with `asyncio.gather`. |
| `.from_selector(selector, url="")` | Parse from an existing `parsel.Selector` — reuse one fetch for multiple models. |