Define your data model once. topscrape tries every selector in your chain, validates the result with Pydantic, and warns you the moment something drifts.
```shell
pip install topscrape
```
Built on the modern Python stack — Pydantic V2, httpx, parsel — and designed so you spend time on your data, not your selectors.
Subclass ScraperModel, annotate your fields — that's your scraper. No soup loops, no manual None checks.
List multiple CSS selectors, XPath expressions, or regex patterns per field. First match wins, automatically.
When a fallback fires, you get a SelectorDriftWarning with the exact field and selector. Catch breakage before it breaks.
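If you want drift to fail your CI rather than just log, Python's standard `warnings` filter can promote it to an exception. A minimal, self-contained sketch: the local `SelectorDriftWarning` class below is a stand-in so the snippet runs without topscrape installed; in real code you would import it from the library instead.

```python
import warnings

# Stand-in so this sketch runs anywhere; in real code, import
# SelectorDriftWarning from topscrape instead of defining it here.
class SelectorDriftWarning(UserWarning):
    pass

# Promote drift warnings to hard errors: ideal for CI, where a silent
# fallback today is unnoticed breakage tomorrow.
warnings.filterwarnings("error", category=SelectorDriftWarning)

caught = None
try:
    # topscrape would emit a warning like this during parsing
    warnings.warn("field 'price': primary selector failed", SelectorDriftWarning)
except SelectorDriftWarning as exc:
    caught = exc

print(f"drift detected: {caught}")
```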
Get a `float`, not `"$19.99"`. Types are enforced at parse time. Bad data raises a clear, typed exception.
Every constructor has an async twin. Scrape hundreds of pages concurrently with asyncio and httpx.
Pass a lambda to clean raw strings before validation. Strip symbols, convert units, normalise casing — inline.
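In effect, a transform composes a cleanup function with the field's type: conceptually, `float(transform(raw))`.

```python
def transform(v: str) -> str:
    # the same cleanup you would pass as Field(transform=...)
    return v.replace("$", "").replace(",", "")

raw = "$1,299.99"              # what the selector matched
price = float(transform(raw))  # what the typed model field receives
print(price)  # → 1299.99
```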
Real examples — from a simple CSS extraction to async scraping with transforms.
```python
from topscrape import ScraperModel, Field

class Product(ScraperModel):
    title: str = Field(selectors=["h1.title", "h1"])
    price: str = Field(selectors=[".product-price", "[data-price]"])
    rating: float = Field(selectors=["[data-rating]"], attr="data-rating")
    badge: str = Field(selectors=[".badge"], default="N/A")

# Sync — one-line fetch + parse
product = Product.from_url("https://example.com/item/1")
print(product.title)   # → "Best Laptop 2024"
print(product.rating)  # → 4.7 (float, not "4.7")
print(product.badge)   # → "N/A" (default used)
```
```python
from topscrape import ScraperModel, Field

class Product(ScraperModel):
    # transform runs BEFORE Pydantic validation
    price: float = Field(
        selectors=[".price", "[data-price]", "//span[@itemprop='price']"],
        transform=lambda v: v.replace("$", "").replace(",", ""),
    )
    # XPath selector — any string starting with //
    sku: str = Field(selectors=["//meta[@name='sku']"], attr="content")
    # Regex selector — prefix with r:
    version: str = Field(selectors=[r"r:v(\d+\.\d+\.\d+)"], default="")

product = Product.from_html(html_string)
print(product.price)  # → 1299.99 (a float, not "$1,299.99")
```
```python
import asyncio

from topscrape import ScraperModel, Field

class Article(ScraperModel):
    title: str = Field(selectors=["h1", ".article-title"])
    author: str = Field(selectors=[".byline"], default="Unknown")

async def scrape_many(urls: list[str]):
    tasks = [Article.from_url_async(url) for url in urls]
    articles = await asyncio.gather(*tasks)  # runs all requests concurrently
    return articles

articles = asyncio.run(scrape_many(urls))
for a in articles:
    print(a.title, "—", a.author)
```
```python
from topscrape import ScraperModel, Field

class ProductPage(ScraperModel):
    # multiple=True → returns a list of all matches
    features: list[str] = Field(
        selectors=[".feature-list li", ".specs td"],
        multiple=True,
    )
    images: list[str] = Field(
        selectors=["img.gallery-photo"],
        attr="src",
        multiple=True,
    )

page = ProductPage.from_html(html)
print(page.features)     # → ["16GB RAM", "1TB SSD", "4K Display"]
print(len(page.images))  # → 5
```
```shell
# Extract the page title
$ topscrape https://example.com "title"
Example Domain

# Try two selectors — fallback chain
$ topscrape https://example.com ".price" "[data-price]"
[DRIFT WARNING] Primary selector '.price' failed; used fallback[1] '[data-price]'.
19.99

# Extract an attribute
$ topscrape https://example.com "a.hero-link" --attr href

# All matches as JSON
$ topscrape https://example.com "li.feature" --all --json
```
topscrape tracks which selector in your chain matched. The moment a fallback fires, you get a precise warning — not a silent None.
| Parameter | Type | Description |
|---|---|---|
| `selectors` | `list[str]` | Ordered list to try. CSS by default; prefix `//` for XPath; prefix `r:` for regex. |
| `attr` | `str \| None` | HTML attribute to extract instead of text content, e.g. `"href"`, `"src"`. |
| `transform` | `callable \| None` | Applied to the raw string before Pydantic validation. Use it to clean, strip, or convert. |
| `default` | `Any` | Returned when all selectors fail. Default is `...` (field is required). |
| `multiple` | `bool` | Return all matches as a list instead of only the first. Default: `False`. |
| Method | Description |
|---|---|
| `.from_html(html, url="")` | Parse a raw HTML string. |
| `.from_url(url, **kwargs)` | Fetch and parse synchronously. Extra kwargs are forwarded to httpx. |
| `await .from_url_async(url, **kwargs)` | Async fetch and parse. Ideal for concurrent scraping with `asyncio.gather`. |
| `.from_selector(selector, url="")` | Parse from an existing `parsel.Selector` — reuse one fetch for multiple models. |