2025-05-24 12:15:48 +02:00

Chunker

Extract text, chunk it, and save images from a PDF.

run() returns chunks, a List[str] in which each string holds roughly 800 tokens and consecutive strings overlap by 100 tokens. Outputs (text files and images) are written under extracted_content/<pdf_basename>/.
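
The 800/100 windowing can be pictured with a minimal sketch. This is not the Chunker implementation: the function name chunk_tokens is hypothetical, and the token list is a stand-in for whatever tokenizer the library actually uses.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 800, overlap: int = 100) -> List[str]:
    """Split tokens into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each new window starts this many tokens later
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Assumption: pre-split placeholder tokens instead of a real tokenizer.
tokens = [f"t{i}" for i in range(2000)]
chunks = chunk_tokens(tokens)
```

With these defaults each window starts 700 tokens after the previous one, so the last 100 tokens of one chunk reappear at the start of the next, and the final chunk may be shorter than 800 tokens.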

Usage

from chunker import Chunker

chunker = Chunker("path/to/file.pdf")
chunks = chunker.run()
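
After a run, the written files can be inspected with a short sketch like the one below. The helper output_dir_for is hypothetical, and it assumes that <pdf_basename> means the file name without its .pdf extension and that the folder is created relative to the current working directory; check the actual output if either assumption does not hold.

```python
from pathlib import Path


def output_dir_for(pdf_path: str, root: str = "extracted_content") -> Path:
    """Return the assumed output folder for a given PDF."""
    # Assumption: <pdf_basename> is the file name without its extension.
    return Path(root) / Path(pdf_path).stem


out_dir = output_dir_for("path/to/file.pdf")

# List whatever was written there, if the folder exists.
if out_dir.is_dir():
    for item in sorted(out_dir.iterdir()):
        print(item.name)
```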

Setup

    pip install -r requirements.txt
    python -m spacy download xx_ent_wiki_sm