2 Commits

Author SHA1 Message Date
Florian Herzog
9dc870b8d0 Add 3x-context dataset variant (trainset --radius)
- build_trainset gains --radius (chars each side of the cited name) and --out;
  merge-gap scales with radius. Default 600 unchanged.
- trainset_3x + train/val/test_3x.jsonl: same 10,519 triples and same trust split,
  but ~3x more surrounding prose per triple (~47 -> ~132 tokens/triple, median
  ~3.7k tokens/sample). Keeps the 100% name-in-text guarantee.
- DATASET.md documents both context sizes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 16:37:30 +02:00
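
The --radius behaviour described in this commit can be pictured with a minimal sketch. This is not the actual build_trainset code: the function name `context_windows`, the `gap_ratio` parameter, and the rule `merge_gap = radius * gap_ratio` are all hypothetical, chosen only to show a merge gap that scales with the radius. It takes `radius` characters on each side of every occurrence of the cited name and merges windows that overlap or lie within the gap.

```python
import re


def context_windows(text, name, radius=600, gap_ratio=0.5):
    """Return excerpts of `text` around each occurrence of `name`.

    Hypothetical sketch: `gap_ratio` and the merge rule are assumptions,
    not taken from build_trainset.
    """
    merge_gap = int(radius * gap_ratio)  # assumed scaling rule
    spans = []
    for match in re.finditer(re.escape(name), text):
        start = max(0, match.start() - radius)
        end = min(len(text), match.end() + radius)
        # Merge with the previous window if it overlaps or is close enough.
        if spans and start - spans[-1][1] <= merge_gap:
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([start, end])
    return [text[s:e] for s, e in spans]
```

Because every window is built around a match, each returned excerpt contains the name, which is one way a name-in-text guarantee like the one mentioned above could hold by construction. A larger radius both widens each excerpt and makes neighbouring excerpts more likely to merge.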
Florian Herzog
e35d98c2cd Commit training-ready dataset (~6 MB) + DATASET.md usage guide
- Track trainset/train/val/test.jsonl (focused excerpts only, no multi-MB text)
  so the dataset can be used directly without the ~5h rebuild.
- DATASET.md: record schema, both serializations, load snippets (plain Python +
  HF datasets), a text->triples fine-tuning sketch, eval notes, provenance.
- .gitignore: keep only the 100s-of-MB intermediates (samples_full) ignored.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 16:22:39 +02:00
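
For orientation, reading one of the committed splits with plain Python might look like the sketch below. The helper name `load_jsonl` is hypothetical, and the path in the trailing comment assumes the splits live at trainset/train.jsonl, trainset/val.jsonl, and trainset/test.jsonl. The record fields are not assumed here; DATASET.md documents the schema.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of Python objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Assumed location of the training split:
# train = load_jsonl("trainset/train.jsonl")
```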