Python · NLP · Offline AI · PDF Support

Keywords that actually understand your text.

TF-IDF counts words. semantic-keywords understands meaning. It extracts keywords from text or files using sentence embeddings and Maximal Marginal Relevance (MMR). It runs fully offline once the model is downloaded, with no API key required.

$ pip install semantic-keywords            # core package
$ pip install "semantic-keywords[files]"   # with PDF support (pypdf)
$ pip install -e ".[dev]"                  # editable dev install, run from a source checkout

Run without installing.

No Python setup required — run semantic-keywords directly in a container.

Quick start with Docker
── pull and run ───────────────────────────────────────────
$ docker pull ronaldgosso/semantic-keywords

── inline text ────────────────────────────────────────────
$ docker run --rm ronaldgosso/semantic-keywords "Tanzania fintech mobile money"
mobile money
fintech startups
east africa

── extract from a file ────────────────────────────────────
$ docker run --rm -v "$(pwd)/documents:/data" ronaldgosso/semantic-keywords --file /data/report.pdf

── interactive mode ───────────────────────────────────────
$ docker run --rm -it ronaldgosso/semantic-keywords

── docker compose (persistent model cache) ────────────────
$ mkdir -p data && cp report.pdf data/
$ docker compose run --rm semkw --file /data/report.pdf --scores

Zero install

No Python, no venv — just pull the image and run.

Persistent cache

Model downloads once and stays cached across runs.

Multi-platform

Built for linux/amd64 and linux/arm64 automatically.

Full guide → README_DOCKER.md

Built on real semantics.

Not word frequency. Not document position. Actual meaning, captured by sentence embeddings (384-dimensional for the MiniLM models).

Semantic understanding

Knows that "mobile money" and "fintech" are related, even if one never appears in the document.

MMR diversity

Maximal Marginal Relevance returns 5 different keywords — not 5 paraphrases of the same idea.

pdf · txt · md

File extraction

Pass a file path directly — extract keywords from PDFs, plain text, or markdown without pre-reading.

Fully offline

Download the model once (90 MB). Runs forever with no internet, no API key, no rate limits.

Three model tiers

Pick fast, balanced, or accurate — the package detects what you have downloaded.

Interactive CLI

Run semkw for a guided prompt — type text, drop a file path, or pipe from stdin.

Clean Python API

Two functions: extract() for text, extract_file() for files. extract() returns a ranked list of dicts; extract_file() returns a dict of file metadata with the same ranked list under "keywords".

Helpful error messages

Encrypted PDF? Scanned image? Missing model? Every failure surfaces a clear, actionable message.
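
To make the embedding and MMR cards above concrete, here is a minimal pure-Python sketch of relevance scoring by cosine similarity followed by MMR selection. It is an illustration under stated assumptions, not the package's source: the names `cosine`, `mmr_select`, `toy_doc`, and `toy_candidates` are hypothetical, the 2-dimensional vectors stand in for real sentence embeddings, and the scoring formula is one common MMR formulation chosen to match the documented meaning of `diversity`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(doc_vec, candidates, top_n=5, diversity=0.7):
    """Pick up to top_n phrases, trading relevance against redundancy.

    candidates maps phrase -> embedding vector.
    diversity=0.0 ranks purely by relevance; 1.0 favours variety.
    """
    relevance = {p: cosine(v, doc_vec) for p, v in candidates.items()}
    remaining = set(candidates)
    selected = []
    while remaining and len(selected) < top_n:

        def mmr_score(phrase):
            # Redundancy: similarity to the closest already-selected phrase.
            redundancy = max(
                (cosine(candidates[phrase], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance[phrase] - diversity * redundancy

        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [{"keyword": p, "score": round(relevance[p], 4)} for p in selected]


# Toy 2-D "embeddings" (assumption: real embeddings have hundreds of dimensions).
toy_doc = [1.0, 1.0]
toy_candidates = {
    "mobile money": [1.0, 0.9],
    "mobile payments": [1.0, 0.8],  # near-paraphrase of the phrase above
    "crop yields": [0.2, 1.0],
}

for row in mmr_select(toy_doc, toy_candidates, top_n=2, diversity=0.7):
    print(row["score"], row["keyword"])
```

With `diversity=0.7` the second pick skips the near-paraphrase "mobile payments" in favour of "crop yields"; with `diversity=0.0` the same call returns the two most relevant phrases regardless of overlap.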

From file to keywords in one call.

Pass any supported file directly — no pre-reading, no manual text extraction.

.pdf .txt .md
from semantic_keywords import extract_file

# One-call file extraction
result = extract_file("annual_report.pdf", top_n=10)

print(result["file"])      # "annual_report.pdf"
print(result["size_kb"])   # 284.1
print(result["words"])     # 6203

for kw in result["keywords"]:
    print(kw["score"], kw["keyword"])

# Two-step: read file then extract separately
from semantic_keywords import read_file, extract

text    = read_file("notes.txt")
results = extract(text, top_n=5)
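
Because extract_file() takes a path and returns a dict, looping over a folder is straightforward. The sketch below is one possible way to do it, not part of the package: `keywords_for_folder` and `SUPPORTED` are hypothetical names, and the extractor is passed in as an argument so the helper can be tried without the model installed.

```python
from pathlib import Path

# File types listed as supported on this page.
SUPPORTED = {".pdf", ".txt", ".md"}


def keywords_for_folder(folder, extract_fn, top_n=5):
    """Run extract_fn on every supported file in folder.

    extract_fn is expected to behave like extract_file(): it takes a path
    plus top_n and returns a dict with a "keywords" list.
    Returns {file name: [keyword, ...]}.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            info = extract_fn(str(path), top_n=top_n)
            results[path.name] = [kw["keyword"] for kw in info["keywords"]]
    return results
```

In real use you would pass the package function itself, for example `keywords_for_folder("documents", extract_file)` after `from semantic_keywords import extract_file`.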

The extract_file() return value:

Key          Type         Description
"file"       str          filename only, not full path
"size_kb"    float        file size in KB
"words"      int          word count of extracted text
"model"      str          model alias used
"keywords"   list[dict]   [{"keyword": str, "score": float}, ...]

PDF note: Requires pypdf; install with pip install "semantic-keywords[files]". Image-only / scanned PDFs contain no extractable text and must be run through OCR first. Password-protected PDFs must be decrypted before use.
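
If you use type hints, the return shape described above can be written down as a pair of TypedDicts. This is a sketch for your own code, assuming the keys listed above are complete; `Keyword`, `FileResult`, and `summarize` are hypothetical names that the package is not known to export.

```python
from typing import List, TypedDict


class Keyword(TypedDict):
    keyword: str
    score: float


class FileResult(TypedDict):
    file: str        # filename only, not full path
    size_kb: float   # file size in KB
    words: int       # word count of extracted text
    model: str       # model alias used
    keywords: List[Keyword]


def summarize(result: FileResult) -> str:
    """Build a one-line summary of an extract_file()-style result dict."""
    top = ", ".join(kw["keyword"] for kw in result["keywords"])
    return (
        f'{result["file"]} ({result["size_kb"]} KB, {result["words"]} words, '
        f'model={result["model"]}): {top}'
    )


# Hand-written sample using the illustrative values from this page.
sample: FileResult = {
    "file": "annual_report.pdf",
    "size_kb": 284.1,
    "words": 6203,
    "model": "fast",
    "keywords": [
        {"keyword": "revenue growth", "score": 0.5341},
        {"keyword": "market expansion", "score": 0.5102},
    ],
}
print(summarize(sample))
```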

Two functions. Same clean output.

from semantic_keywords import extract, extract_file

# From text — basic
results = extract("Tanzania is a hub for mobile money and fintech startups.")

for r in results:
    print(r["score"], r["keyword"])

# From text — full control
results = extract(
    text      = "your paragraph here",
    top_n     = 10,
    min_score = 0.25,
    diversity = 0.7,
    model     = "balanced",
)

# From file — same parameters
result = extract_file(
    file_path = "report.pdf",
    top_n     = 10,
    min_score = 0.25,
    diversity = 0.7,
    model     = "accurate",
)

All parameters (these apply to both extract() and extract_file()):

Parameter    Default   Description
top_n        5         Maximum keywords to return.
min_score    0.20      Minimum cosine similarity threshold (0.0–1.0). Higher = stricter.
diversity    0.7       MMR balance: 0.0 = most relevant, 1.0 = most varied.
model        "fast"    "fast" · "balanced" · "accurate" · any HuggingFace model name.
max_words    3         Maximum words per candidate phrase (1–3 recommended).
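
To show what max_words, min_score, and top_n control, here is a rough sketch of candidate-phrase generation and thresholding. It is an assumption about the general technique rather than the package's implementation: `candidate_phrases` and `apply_thresholds` are hypothetical helpers, and a real extractor would likely also drop stop words and respect sentence boundaries.

```python
import re


def candidate_phrases(text, max_words=3):
    """Return every 1..max_words-word phrase in text, in first-seen order."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    seen = {}  # dict keeps insertion order and removes duplicates
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            seen.setdefault(" ".join(words[i:i + n]), None)
    return list(seen)


def apply_thresholds(scored, top_n=5, min_score=0.20):
    """Drop phrases scoring below min_score, then keep the top_n best."""
    kept = [s for s in scored if s["score"] >= min_score]
    kept.sort(key=lambda s: s["score"], reverse=True)
    return kept[:top_n]


print(candidate_phrases("mobile money apps", max_words=2))
```

Raising max_words grows the candidate pool quickly, which is one reason a small value such as the recommended 1 to 3 keeps scoring fast.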

Three modes. One command.

Interactive guided session, inline text, or file path — all through semkw.

PowerShell / Terminal
── inline text ────────────────────────────────────────────
$ semkw "Tanzania fintech mobile money startups" -n 5 --scores

  Top 5 keywords:

  #     Keyword                Score    Relevance
  ──    ─────────────────────  ───────  ──────────────────────
  1     mobile money           0.5134   ████████████████░░░░░░░░
  2     fintech startups       0.4901   ██████████████░░░░░░░░░░
  3     east africa            0.4710   █████████████░░░░░░░░░░░
  4     financial access       0.4502   ████████████░░░░░░░░░░░░
  5     agricultural tools     0.4388   ████████████░░░░░░░░░░░░

── file extraction ────────────────────────────────────────
$ semkw --file annual_report.pdf -n 10 --scores

  Reading 'annual_report.pdf'  (284.1 KB) ...
  Words : 6203
  Model : fast

  Top 10 keywords:

  1     revenue growth         0.5341   ███████████████░░░░░░░░░
  2     market expansion       0.5102   ██████████████░░░░░░░░░░
  3     operating costs        0.4880   █████████████░░░░░░░░░░░
  4     digital strategy       0.4601   ████████████░░░░░░░░░░░░
  5     customer retention     0.4419   ████████████░░░░░░░░░░░░

── other modes ────────────────────────────────────────────
$ semkw                          # interactive guided mode
$ semkw --file notes.txt -n 3    # txt file, top 3
$ semkw --file report.md --model accurate
$ semkw --list-models            # show downloaded models
$ echo "deep learning neural" | semkw -n 3

All flags:

Flag            Default   Description
--file, -f      -         Path to a .pdf, .txt, or .md file.
--top, -n       5         Maximum keywords to return.
--model, -m     auto      fast · balanced · accurate
--min-score     0.20      Minimum cosine similarity (0.0–1.0).
--diversity     0.7       MMR balance factor (0.0–1.0).
--scores        off       Print ranked score table instead of plain keyword list.
--list-models   -         Show all models and download status, then exit.
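
For readers wrapping semkw in their own scripts, the flag table can be mirrored with argparse. This is a hypothetical reconstruction built only from the flags and defaults listed above; it is not the package's actual parser, and `build_parser` is an illustrative name.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented semkw flags."""
    parser = argparse.ArgumentParser(prog="semkw")
    # Optional inline text; when omitted, semkw is documented to fall back
    # to --file, stdin, or interactive mode.
    parser.add_argument("text", nargs="?")
    parser.add_argument("--file", "-f")
    parser.add_argument("--top", "-n", type=int, default=5)
    parser.add_argument("--model", "-m", default="auto")
    parser.add_argument("--min-score", type=float, default=0.20)
    parser.add_argument("--diversity", type=float, default=0.7)
    parser.add_argument("--scores", action="store_true")
    parser.add_argument("--list-models", action="store_true")
    return parser


args = build_parser().parse_args(["Tanzania fintech", "-n", "3", "--scores"])
print(args.text, args.top, args.scores)
```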

Pick your power level.

The package auto-detects which models you have downloaded and presents a menu. No configuration file needed.

Alias            HuggingFace model      Size     Speed     Note
fast (default)   all-MiniLM-L6-v2       90 MB    fastest   Great for most use cases
balanced         all-MiniLM-L12-v2      120 MB   medium    Slightly better accuracy
accurate         all-mpnet-base-v2      420 MB   slower    Best quality on CPU
custom           any HuggingFace name   varies   varies    Advanced users
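
One way auto-detection like this could work is to look for each model's folder in the local cache. The sketch below is a guess at the idea, not the package's code: it assumes the standard Hugging Face hub cache layout (`models--sentence-transformers--<name>`), and `ALIASES` and `downloaded_aliases` are hypothetical names. The package may store or locate models differently.

```python
from pathlib import Path

# Alias -> model name, copied from the table above.
ALIASES = {
    "fast": "all-MiniLM-L6-v2",
    "balanced": "all-MiniLM-L12-v2",
    "accurate": "all-mpnet-base-v2",
}


def downloaded_aliases(cache_dir):
    """Return the aliases whose model folder exists under cache_dir.

    Assumes the Hugging Face hub naming scheme for cached models.
    """
    cache = Path(cache_dir)
    return [
        alias
        for alias, name in ALIASES.items()
        if (cache / f"models--sentence-transformers--{name}").is_dir()
    ]
```

On Linux and macOS the hub cache usually lives at `~/.cache/huggingface/hub` unless an environment variable such as `HF_HOME` moves it, so a typical call would be `downloaded_aliases(Path.home() / ".cache" / "huggingface" / "hub")`.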

Download models interactively:

$ python download_model.py

  semantic-keywords — model downloader

     #    Alias       HuggingFace name                    Size    Status
     --   ----------  ----------------------------------  ------  --------------
   v [1]  fast        all-MiniLM-L6-v2                    90MB    downloaded
     [2]  balanced    all-MiniLM-L12-v2                   120MB   not downloaded
     [3]  accurate    all-mpnet-base-v2                   420MB   not downloaded

  Your choice: 2
  Downloading [balanced]...  Done.