Chunker
Extract text, chunk it, and save images from a PDF.
Chunker.run() returns chunks, a List[str] of roughly 800-token strings in which consecutive chunks overlap by 100 tokens. Outputs (text files and images) are written under extracted_content/<pdf_basename>/.
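To illustrate what an 800-token window with a 100-token overlap means, here is a minimal sketch. It is not the Chunker implementation: the helper name chunk_tokens is hypothetical, and it assumes the text has already been split into tokens (plain strings here, whereas the real tokenizer may differ).

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 800, overlap: int = 100) -> List[str]:
    """Split tokens into windows of `size` tokens that share `overlap` tokens.

    Hypothetical helper for illustration only; not part of chunker.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Stand-in tokens; a real run would use tokens extracted from the PDF text.
tokens = [f"t{i}" for i in range(2000)]
chunks = chunk_tokens(tokens)
```

With these defaults each window starts 700 tokens after the previous one, so the last 100 tokens of one chunk are repeated as the first 100 tokens of the next.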
Usage
from chunker import Chunker
chunker = Chunker("path/to/file.pdf")
chunks = chunker.run()
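After a run, the outputs can be located from the PDF path. The sketch below is an assumption, not documented behavior: it guesses that <pdf_basename> is the file name without its extension, and expected_output_dir is a hypothetical helper that is not part of chunker.

```python
from pathlib import Path


def expected_output_dir(pdf_path: str, root: str = "extracted_content") -> Path:
    """Guess the output directory for a PDF (hypothetical helper).

    Assumption: <pdf_basename> is the file name without the .pdf extension.
    """
    return Path(root) / Path(pdf_path).stem


out_dir = expected_output_dir("path/to/file.pdf")
# List whatever was written there, if the directory exists.
written = sorted(p.name for p in out_dir.iterdir()) if out_dir.is_dir() else []
```

If the directory turns out to be named differently (for example, with the extension kept), adjust the helper accordingly.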
Setup
pip install -r requirements.txt
python -m spacy download xx_ent_wiki_sm