# Chunker

Extract text, chunk it, and save images from a PDF.

`chunker.run()` returns `chunks`, a `List[str]` of ~800-token strings with a 100-token overlap between consecutive chunks.
Outputs (text files and images) are written under `extracted_content/<pdf_basename>/`.

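To make the chunk sizes concrete, here is a minimal sketch of a sliding window with overlap. It is not the actual `Chunker` implementation: the helper name `chunk_tokens`, its parameters, and the use of a plain list of whitespace-separated tokens are assumptions for illustration, and the real tokenizer may count tokens differently.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 800, overlap: int = 100) -> List[str]:
    """Hypothetical helper: split tokens into windows of `size` tokens,
    where each window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Assumed input: 2000 dummy tokens standing in for extracted PDF text.
tokens = [f"t{i}" for i in range(2000)]
demo_chunks = chunk_tokens(tokens)
```

With these defaults the window advances 700 tokens at a time, so the final chunk may be shorter than 800 tokens.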
## Usage

```python
from chunker import Chunker

chunker = Chunker("path/to/file.pdf")
chunks = chunker.run()
```

Setup:

    pip install -r requirements.txt
    python -m spacy download xx_ent_wiki_sm