fund_rfid_data

herzogfloria/fund_rfid_data

Commit Graph

Author

SHA1

Message

Date

Florian Herzog

9dc870b8d0

Add 3x-context dataset variant (trainset --radius)

- build_trainset gains --radius (chars each side of the cited name) and --out;
  merge-gap scales with radius. Default 600 unchanged.
- trainset_3x + train/val/test_3x.jsonl: same 10,519 triples and same trust split,
  but ~3x more surrounding prose per triple (~47 -> ~132 tokens/triple, median
  ~3.7k tokens/sample). Keeps the 100% name-in-text guarantee.
- DATASET.md documents both context sizes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-10 16:37:30 +02:00
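
The radius-based excerpting described in this commit can be sketched roughly as follows. This is a minimal illustration, not the repository's `build_trainset` code: the function names `excerpt_spans` and `build_excerpt`, the `merge_gap = radius // 2` rule, and the `" [...] "` separator are all assumptions made for the example. Only the ideas of a per-side character radius (default 600) and a merge gap that scales with it come from the commit message.

```python
def excerpt_spans(text, name, radius=600, merge_gap=None):
    """Return merged (start, end) character spans around each occurrence of name.

    radius    -- characters kept on each side of the cited name
    merge_gap -- spans closer than this are merged (assumed rule: radius // 2)
    """
    if not name:
        raise ValueError("name must be a non-empty string")
    if merge_gap is None:
        merge_gap = radius // 2  # hypothetical scaling, not taken from the repo

    # Collect one window per occurrence, clipped to the text boundaries.
    spans = []
    pos = text.find(name)
    while pos != -1:
        spans.append((max(0, pos - radius),
                      min(len(text), pos + len(name) + radius)))
        pos = text.find(name, pos + 1)

    # Merge windows that overlap or sit within merge_gap of each other.
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= merge_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_excerpt(text, name, radius=600):
    """Join the merged windows into one focused excerpt."""
    return " [...] ".join(text[s:e] for s, e in excerpt_spans(text, name, radius))
```

Because every window is centred on an occurrence of the name, any non-empty excerpt built this way contains the name, which is one plausible way to keep a name-in-text guarantee while widening the context.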

Florian Herzog

e35d98c2cd

Commit training-ready dataset (~6 MB) + DATASET.md usage guide

- Track trainset/train/val/test.jsonl (focused excerpts only, no multi-MB text)
  so the dataset can be used directly without the ~5h rebuild.
- DATASET.md: record schema, both serializations, load snippets (plain Python +
  HF datasets), a text->triples fine-tuning sketch, eval notes, provenance.
- .gitignore: keep only the 100s-of-MB intermediates (samples_full) ignored.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-10 16:22:39 +02:00
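
For orientation, reading one of the committed JSONL splits in plain Python might look like the sketch below. The helper name `load_jsonl` is hypothetical, the exact file path is left to the caller, and no field names are assumed; the record schema is documented in DATASET.md.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Calling `load_jsonl` on a split file returns one Python object per line, which can then be inspected against the schema in DATASET.md before any fine-tuning code is written.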

2 Commits