- build_trainset gains --radius (chars kept on each side of the cited name) and
--out; the merge gap scales with the radius. The default radius (600) is unchanged.
- trainset_3x + train/val/test_3x.jsonl: same 10,519 triples and same trust split,
but ~3x more surrounding prose per triple (~47 -> ~132 tokens/triple, median
~3.7k tokens/sample). Keeps the 100% name-in-text guarantee.
- DATASET.md documents both context sizes.
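
For illustration only, a minimal sketch of how a radius-based excerpt with a scaled merge gap could work. This is not the build_trainset code: the function name `excerpt_spans`, the `merge_gap` parameter, and the choice of a merge gap equal to the radius are all assumptions made for the example.

```python
def excerpt_spans(text, name, radius=600, merge_gap=None):
    """Return merged (start, end) character windows around each occurrence of name.

    Hypothetical helper: keeps `radius` chars on each side of every hit and
    merges windows whose gap is at most `merge_gap`.
    """
    if merge_gap is None:
        merge_gap = radius  # assumption: the gap scales linearly with the radius
    spans = []
    start = text.find(name)
    while start != -1:
        lo = max(0, start - radius)
        hi = min(len(text), start + len(name) + radius)
        if spans and lo - spans[-1][1] <= merge_gap:
            # Close enough to the previous window: extend it instead of adding a new one.
            spans[-1] = (spans[-1][0], max(spans[-1][1], hi))
        else:
            spans.append((lo, hi))
        start = text.find(name, start + 1)
    return spans
```

Because every window is built around an actual occurrence of the name, each excerpt contains the name by construction, which is the kind of property the name-in-text guarantee relies on.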
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Track trainset/train/val/test.jsonl (focused excerpts only, no multi-MB text)
so the dataset can be used directly without the ~5h rebuild.
- DATASET.md: record schema, both serializations, load snippets (plain Python +
HF datasets), a text->triples fine-tuning sketch, eval notes, provenance.
- .gitignore: ignore only the intermediates that run to hundreds of MB (samples_full).
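
For illustration only, a minimal plain-Python sketch of reading one of the tracked splits without a rebuild. The helper name `load_jsonl` and the path in the usage comment are assumptions; the record schema and the canonical load snippets are the ones in DATASET.md.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of records, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Usage (path is an assumption about the repository layout):
# train = load_jsonl("trainset/train.jsonl")
```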
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>