2025-05-24 12:15:48 +02:00

# Chunker

Extract text, chunk it, and save images from a PDF.

`Chunker.run()` returns `chunks`, a `List[str]` of roughly 800-token strings; consecutive chunks overlap by 100 tokens.

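
As a rough illustration of what 800-token chunks with a 100-token overlap look like, here is a minimal sketch. It is not the `Chunker` implementation: `chunk_tokens` is a hypothetical helper, and a plain list of word-like strings stands in for whatever tokenizer the package actually uses.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 800, overlap: int = 100) -> List[str]:
    """Split tokens into windows of `size` tokens; neighbours share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # with the defaults, each window starts 700 tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Assumption: simple placeholder tokens instead of real tokenizer output.
tokens = [f"w{i}" for i in range(2000)]
chunks = chunk_tokens(tokens)
```

With these defaults, the last 100 tokens of one chunk are repeated as the first 100 tokens of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.
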
Outputs (text files and images) are written under `extracted_content/<pdf_basename>/`.

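
The output location can be pictured with a short sketch. This assumes that `<pdf_basename>` means the PDF file name without its extension and that the path is relative to the working directory; neither is confirmed here, and `output_dir_for` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def output_dir_for(pdf_path: str, root: str = "extracted_content") -> Path:
    """Return root/<pdf_basename>, assuming the basename drops the extension."""
    return Path(root) / Path(pdf_path).stem


out_dir = output_dir_for("path/to/file.pdf")
print(out_dir.as_posix())  # extracted_content/file
```
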
## Usage
```python
from chunker import Chunker

chunker = Chunker("path/to/file.pdf")
chunks = chunker.run()
```

Setup:

```bash
pip install -r requirements.txt
python -m spacy download xx_ent_wiki_sm
```