v0.1.0 · MIT · Python 3.9+

Scraping that survives when sites change.

Define your data model once. topscrape tries every selector in your chain, validates the result with Pydantic, and warns you the moment something drifts.

pip install topscrape

Everything a maintainable scraper needs.

Built on the modern Python stack — Pydantic V2, httpx, parsel — and designed so you spend time on your data, not your selectors.

🧩
Declarative Models

Subclass ScraperModel, annotate your fields — that's your scraper. No soup loops, no manual None checks.

⛓️
Selector Chains

List multiple CSS selectors, XPath expressions, or regex patterns per field. First match wins, automatically.

📡
Drift Detection

When a fallback fires, you get a SelectorDriftWarning with the exact field and selector. Catch breakage before it breaks.

✅
Pydantic Validation

Get float, not "$19.99". Types are enforced at parse time. Bad data raises a clear, typed exception.

⚡
Async Ready

Every constructor has an async twin. Scrape hundreds of pages concurrently with asyncio and httpx.

🔧
Transform Pipeline

Pass a lambda to clean raw strings before validation. Strip symbols, convert units, normalise casing — inline.
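Because a transform is an ordinary Python callable, it can be written and checked in isolation before being wired into a Field. A minimal sketch of a price-cleaning transform (the name clean_price is illustrative, not part of the library):

```python
# A transform is applied to the raw matched string before
# validation coerces it to the field's annotated type.
clean_price = lambda v: v.replace("$", "").replace(",", "").strip()

raw = " $1,299.99 "
print(clean_price(raw))         # → "1299.99"
print(float(clean_price(raw)))  # → 1299.99
```

The same callable can then be passed as `transform=clean_price`, keeping the cleaning logic testable on its own.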


From zero to typed data in minutes.

Real examples — from a simple CSS extraction to async scraping with transforms.

from topscrape import ScraperModel, Field

class Product(ScraperModel):
    title:  str   = Field(selectors=["h1.title", "h1"])
    price:  str   = Field(selectors=[".product-price", "[data-price]"])
    rating: float = Field(selectors=["[data-rating]"], attr="data-rating")
    badge:  str   = Field(selectors=[".badge"], default="N/A")

# Sync — one line fetch + parse
product = Product.from_url("https://example.com/item/1")

print(product.title)   # → "Best Laptop 2024"
print(product.rating)  # → 4.7  (float, not "4.7")
print(product.badge)   # → "N/A"  (default used)

from topscrape import ScraperModel, Field

class Product(ScraperModel):
    # transform runs BEFORE Pydantic validation
    price: float = Field(
        selectors=[".price", "[data-price]", "//span[@itemprop='price']"],
        transform=lambda v: v.replace("$", "").replace(",", ""),
    )
    # XPath selector — any string starting with //
    sku: str = Field(selectors=["//meta[@name='sku']"], attr="content")
    # Regex selector — prefix with r:
    version: str = Field(selectors=[r"r:v(\d+\.\d+\.\d+)"], default="")

product = Product.from_html(html_string)
print(product.price)  # → 1299.99  (float, not "$1,299.99")
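The pattern behind the r: selector above is a plain regular expression. A stdlib sketch of what it matches, assuming (as the example suggests) that the first capture group of the first match is returned:

```python
import re

# The expression after the "r:" prefix in the selector above.
pattern = re.compile(r"v(\d+\.\d+\.\d+)")

html = "<footer>build v2.14.3 (c) example.com</footer>"
match = pattern.search(html)
print(match.group(1))  # → "2.14.3"
```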

import asyncio
from topscrape import ScraperModel, Field

class Article(ScraperModel):
    title:   str = Field(selectors=["h1", ".article-title"])
    author:  str = Field(selectors=[".byline"], default="Unknown")

async def scrape_many(urls: list[str]):
    tasks = [Article.from_url_async(url) for url in urls]
    articles = await asyncio.gather(*tasks)
    return articles

# Runs all requests concurrently
articles = asyncio.run(scrape_many(urls))
for a in articles:
    print(a.title, "—", a.author)
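asyncio.gather launches every request at once, which can overwhelm a server when the URL list runs into the hundreds. A common refinement is to cap concurrency with asyncio.Semaphore; sketched here with a stub fetch coroutine standing in for Article.from_url_async:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for Article.from_url_async(url).
    await asyncio.sleep(0)
    return url.upper()

async def scrape_many(urls: list[str], limit: int = 10):
    sem = asyncio.Semaphore(limit)

    async def bounded(url: str):
        async with sem:  # at most `limit` requests in flight
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(scrape_many(["a", "b", "c"]))
print(results)  # → ['A', 'B', 'C']
```

Swapping the stub for the real constructor keeps the gather-based code shape while bounding open connections.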

from topscrape import ScraperModel, Field

class ProductPage(ScraperModel):
    # multiple=True → returns list of all matches
    features: list[str] = Field(
        selectors=[".feature-list li", ".specs td"],
        multiple=True,
    )
    images: list[str] = Field(
        selectors=["img.gallery-photo"],
        attr="src",
        multiple=True,
    )

page = ProductPage.from_html(html)
print(page.features)  # → ["16GB RAM", "1TB SSD", "4K Display"]
print(len(page.images))  # → 5
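What multiple=True amounts to, collecting every match instead of stopping at the first, can be sketched with the stdlib html.parser (the library itself is built on parsel):

```python
from html.parser import HTMLParser

class LiCollector(HTMLParser):
    """Collects the text of every <li>, roughly what multiple=True returns."""
    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False
    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False
    def handle_data(self, data):
        if self._in_li and data.strip():
            self.items.append(data.strip())

parser = LiCollector()
parser.feed("<ul><li>16GB RAM</li><li>1TB SSD</li></ul>")
print(parser.items)  # → ['16GB RAM', '1TB SSD']
```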

# Extract the page title
$ topscrape https://example.com "title"
Example Domain

# Try two selectors — fallback chain
$ topscrape https://example.com ".price" "[data-price]"
[DRIFT WARNING] Primary selector '.price' failed; used fallback[1] '[data-price]'.
19.99

# Extract an attribute
$ topscrape https://example.com "a.hero-link" --attr href

# All matches as JSON
$ topscrape https://example.com "li.feature" --all --json

Know when a site changes, before it breaks.

topscrape tracks which selector in your chain matched. The moment a fallback fires, you get a precise warning — not a silent None.

// Selector chain evaluation — field: price
.product-price
.price
[data-price]
⚠ UserWarning: [Selector Drift] Field 'price' on <https://shop.example.com/item/1>:
  primary selector '.product-price' failed;
  used fallback[2] '[data-price]'.

Everything you need, nothing you don't.

Field(selectors, ...)

Parameter   Type              Description
selectors   list[str]         Ordered list to try. CSS by default; prefix // for XPath; prefix r: for regex.
attr        str | None        HTML attribute to extract instead of text content, e.g. "href", "src".
transform   callable | None   Applied to the raw string before Pydantic validation. Use it to clean, strip, or convert.
default     Any               Returned when all selectors fail. Default is ... (field is required).
multiple    bool              Return all matches as a list instead of only the first. Default: False.
ScraperModel — constructors

Method                                 Description
.from_html(html, url="")               Parse a raw HTML string.
.from_url(url, **kwargs)               Fetch + parse synchronously. Extra kwargs are forwarded to httpx.
await .from_url_async(url, **kwargs)   Async fetch + parse. Ideal for concurrent scraping with asyncio.gather.
.from_selector(selector, url="")       Parse from an existing parsel.Selector — reuse one fetch for multiple models.