TF-IDF counts words. semantic-keywords understands meaning. It extracts keywords
from text or files using sentence embeddings and Maximal Marginal Relevance:
fully offline, no API key required.
No Python setup needed; run semantic-keywords directly in a container.
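The difference between counting and understanding is easy to demonstrate: a purely lexical score (the kind TF-IDF builds on) gives related phrases zero credit when they share no words. A toy illustration, not the package's code:

```python
# Lexical overlap between two phrases, as count-based methods see them.
# "mobile money" and "fintech startups" are clearly related in meaning,
# yet share zero words, so any word-count score between them is zero.
def word_overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

print(word_overlap("mobile money", "fintech startups"))  # 0
print(word_overlap("mobile money", "money transfers"))   # 1
```

Embeddings close exactly this gap: related phrases score high even with no shared vocabulary.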
── pull and run ───────────────────────────────────────────
$ docker pull ronaldgosso/semantic-keywords
── inline text ────────────────────────────────────────────
$ docker run --rm ronaldgosso/semantic-keywords "Tanzania fintech mobile money"
mobile money
fintech startups
east africa
── extract from a file ────────────────────────────────────
$ docker run --rm -v ./documents:/data ronaldgosso/semantic-keywords --file /data/report.pdf
── interactive mode ───────────────────────────────────────
$ docker run --rm -it ronaldgosso/semantic-keywords
── docker compose (persistent model cache) ────────────────
$ mkdir -p data && cp report.pdf data/
$ docker compose run --rm semkw --file /data/report.pdf --scores
No Python, no venv — just pull the image and run.
Model downloads once and stays cached across runs.
Built for linux/amd64 and linux/arm64 automatically.
Full guide → README_DOCKER.md
Not word frequency. Not document position. Actual meaning — via 384-dimensional sentence embeddings.
Knows that "mobile money" and "fintech" are related, even if one never appears in the document.
Maximal Marginal Relevance returns 5 different keywords — not 5 paraphrases of the same idea.
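For intuition, the MMR selection loop can be sketched in a few lines. The similarity numbers and the diversity weight below are made up for illustration; the package computes real similarities from its sentence embeddings:

```python
def mmr(doc_sim, cand_sims, top_n=3, diversity=0.7):
    """Greedy Maximal Marginal Relevance.
    doc_sim[i]      : relevance of candidate i to the whole document
    cand_sims[i][j] : similarity between candidates i and j
    """
    # Start with the single most relevant candidate.
    selected = [max(range(len(doc_sim)), key=doc_sim.__getitem__)]
    while len(selected) < top_n:
        def mmr_score(i):
            # Penalise similarity to anything already chosen.
            redundancy = max(cand_sims[i][j] for j in selected)
            return (1 - diversity) * doc_sim[i] - diversity * redundancy
        remaining = [i for i in range(len(doc_sim)) if i not in selected]
        selected.append(max(remaining, key=mmr_score))
    return selected

# Candidates 0 and 1 are near-duplicates; with diversity=0.7 MMR
# prefers the novel candidate 2 over the redundant candidate 1.
doc_sim = [0.90, 0.88, 0.60]
cand_sims = [[1.00, 0.95, 0.10],
             [0.95, 1.00, 0.10],
             [0.10, 0.10, 1.00]]
print(mmr(doc_sim, cand_sims, top_n=2))  # [0, 2]
```

This is why the output is five different ideas rather than five paraphrases: redundancy is subtracted from relevance at every step.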
Pass a file path directly — extract keywords from PDFs, plain text, or markdown without pre-reading.
Download the model once (90 MB). Runs forever with no internet, no API key, no rate limits.
Pick fast, balanced, or accurate — the package detects what you have
downloaded.
Run semkw for a guided prompt — type text,
drop a file path, or pipe from stdin.
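A common pattern for supporting both piping and a guided prompt is an isatty() check on stdin; this sketch shows the pattern only, not semkw's actual implementation:

```python
import sys

def read_piped_text(stream=None):
    """Return piped text when stdin is not a terminal, else None
    (signalling that the guided interactive prompt should run)."""
    stream = stream or sys.stdin
    if not stream.isatty():
        return stream.read().strip()
    return None
```

Under this pattern, `echo "deep learning neural" | semkw -n 3` takes the piped branch, while a bare `semkw` in a terminal falls through to the interactive prompt.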
Two functions: extract() for text, extract_file() for files. extract() returns a
ranked list of keyword dicts; extract_file() returns a dict that wraps the same
ranked list together with file metadata.
Encrypted PDF? Scanned image? Missing model? Every failure surfaces a clear, actionable message.
Pass any supported file directly — no pre-reading, no manual text extraction.
from semantic_keywords import extract_file
# One-call file extraction
result = extract_file("annual_report.pdf", top_n=10)
print(result["file"]) # "annual_report.pdf"
print(result["size_kb"]) # 284.1
print(result["words"]) # 6203
for kw in result["keywords"]:
    print(kw["score"], kw["keyword"])
# Two-step: read file then extract separately
from semantic_keywords import read_file, extract
text = read_file("notes.txt")
results = extract(text, top_n=5)
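Since results are plain dicts, ordinary list operations are all you need for post-processing. A sketch over a hypothetical results list shaped like extract()'s output (the scores below are invented):

```python
# A hypothetical results list, shaped like what extract() returns.
results = [
    {"keyword": "mobile money", "score": 0.5134},
    {"keyword": "fintech startups", "score": 0.4901},
    {"keyword": "agricultural tools", "score": 0.4388},
]

# Keep only strong matches, then map down to bare strings.
strong = [r for r in results if r["score"] >= 0.45]
keywords = [r["keyword"] for r in strong]
print(keywords)  # ['mobile money', 'fintech startups']
```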
The extract_file() return value is a dict with the keys file, size_kb, words,
and keywords (the same ranked keyword list that extract() returns).
Reading PDFs requires pypdf; install it with pip install "semantic-keywords[files]".
Image-only / scanned PDFs contain no extractable text and must be run through OCR first.
Password-protected PDFs must be decrypted before use.
from semantic_keywords import extract
# From text — basic
results = extract("Tanzania is a hub for mobile money and fintech startups.")
for r in results:
    print(r["score"], r["keyword"])
# From text — full control
results = extract(
    text="your paragraph here",
    top_n=10,
    min_score=0.25,
    diversity=0.7,
    model="balanced",
)
# From file — same parameters
result = extract_file(
    file_path="report.pdf",
    top_n=10,
    min_score=0.25,
    diversity=0.7,
    model="accurate",
)
All parameters below apply to both extract() and extract_file():
Interactive guided session, inline text, or file path — all through semkw.
── inline text ────────────────────────────────────────────
$ semkw "Tanzania fintech mobile money startups" -n 5 --scores
Top 5 keywords:
# Keyword Score Relevance
── ───────────────────── ─────── ──────────────────────
1 mobile money 0.5134
2 fintech startups 0.4901
3 east africa 0.4710
4 financial access 0.4502
5 agricultural tools 0.4388
── file extraction ────────────────────────────────────────
$ semkw --file annual_report.pdf -n 10 --scores
Reading 'annual_report.pdf' (284.1 KB) ...
Words : 6203
Model : fast
Top 10 keywords:
1 revenue growth 0.5341
2 market expansion 0.5102
3 operating costs 0.4880
4 digital strategy 0.4601
5 customer retention 0.4419
── other modes ────────────────────────────────────────────
$ semkw # interactive guided mode
$ semkw --file notes.txt -n 3 # txt file, top 3
$ semkw --file report.md --model accurate
$ semkw --list-models # show downloaded models
$ echo "deep learning neural" | semkw -n 3
All flags:
The package auto-detects which models you have downloaded and presents a menu. No configuration file needed.
| Alias | HuggingFace model | Size | Speed | Note |
|---|---|---|---|---|
| fast (default) | all-MiniLM-L6-v2 | 90 MB | fastest | Great for most use cases |
| balanced | all-MiniLM-L12-v2 | 120 MB | medium | Slightly better accuracy |
| accurate | all-mpnet-base-v2 | 420 MB | slower | Best quality on CPU |
| custom | any HuggingFace name | varies | varies | Advanced users |
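One plausible way such auto-detection works is scanning the local model cache for the known directory names. A sketch under that assumption (the cache layout and detection logic are illustrative, not the package's actual code):

```python
from pathlib import Path

# Alias -> model-name table from this README; the cache layout assumed
# below is modelled on the HuggingFace hub cache, not the package's code.
ALIASES = {
    "fast": "all-MiniLM-L6-v2",
    "balanced": "all-MiniLM-L12-v2",
    "accurate": "all-mpnet-base-v2",
}

def detect_downloaded(cache_dir: str) -> list[str]:
    """Return the aliases whose model directories exist under cache_dir."""
    root = Path(cache_dir)
    if not root.is_dir():
        return []
    entries = [p.name for p in root.iterdir()]
    return [alias for alias, name in ALIASES.items()
            if any(name in entry for entry in entries)]
```

With only the fast model cached, a menu built from this list would offer just "fast", which matches the downloader output shown below.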
Download models interactively:
$ python download_model.py
semantic-keywords — model downloader
# Alias HuggingFace name Size Status
-- ---------- ---------------------------------- ------ --------------
✓ [1] fast all-MiniLM-L6-v2 90MB downloaded
[2] balanced all-MiniLM-L12-v2 120MB not downloaded
[3] accurate all-mpnet-base-v2 420MB not downloaded
Your choice: 2
Downloading [balanced]... Done.