1 Commit

Author SHA1 Message Date
Florian Herzog
991715ab76 Add LLM role-check grounding + labelled training-set pipeline
- llm_extract.py: match mode is now window-parallel, with a retrieval pre-filter,
  claim dedup, retries, and enable_thinking=false (vLLM) -> ~36x faster per call;
  n_failed_windows/ok flags ensure an interrupted run never records bogus 0s.
- build_rdf_dataset.py:
  - gold now includes the share-class level (hasShareClass/ticker/className)
  - grounding modes alias|llm|name|context|none (--grounding); llm reads the
    role-check verdicts from match_all.jsonl
  - label stage: per-triple extractable + per-sample FULL/PARTIAL/NONE
  - trainset stage: combines GROUNDED triples with focused TEXT EXCERPTS cut
    around the actual provider statement (evidence), not the multi-MB book
  - split stage: --src to split trainset.jsonl (trust-level split, no leakage)
- helper scripts: watch_match.sh, resume_match.sh (crash/sleep-safe resume),
  finalize_dataset.sh
- final dataset: 335/335 trusts, 85% text<->gold agreement, 334 samples,
  10,689 grounded triples, train/val/test 264/35/35
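
The label stage listed above (per-triple extractable, per-sample FULL/PARTIAL/NONE) could be aggregated roughly as follows. This is only a sketch: the function name `label_sample` and the rule "FULL = every triple extractable, PARTIAL = at least one, NONE = none or no triples" are assumptions, not taken from build_rdf_dataset.py.

```python
from enum import Enum
from typing import Iterable


class SampleLabel(str, Enum):
    FULL = "FULL"
    PARTIAL = "PARTIAL"
    NONE = "NONE"


def label_sample(extractable_flags: Iterable[bool]) -> SampleLabel:
    """Collapse per-triple 'extractable' flags into one sample-level label.

    Assumed rule (hypothetical, not confirmed by the commit):
    all triples extractable -> FULL, some -> PARTIAL, none or empty -> NONE.
    """
    flags = list(extractable_flags)
    if flags and all(flags):
        return SampleLabel.FULL
    if any(flags):
        return SampleLabel.PARTIAL
    return SampleLabel.NONE
```

Treating an empty sample as NONE rather than FULL is a deliberate choice here, so a sample with no gold triples is never counted as fully covered.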

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 13:52:50 +02:00
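
For readers unfamiliar with the "trust-level, no leakage" split mentioned in the commit message, a minimal sketch of the idea is shown here. The function `split_by_trust`, the `trust_id` field name, and the 80/10/10 ratios are illustrative assumptions; the actual script's options and proportions may differ.

```python
import random
from collections import defaultdict


def split_by_trust(samples, ratios=(0.8, 0.1, 0.1), seed=0, key="trust_id"):
    """Split samples into train/val/test so that no trust spans two splits.

    `key` names the (assumed) field that identifies the trust of a sample.
    """
    # Group samples by trust first, so the split assigns whole trusts.
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)

    # Shuffle trust ids deterministically for reproducible splits.
    trust_ids = sorted(groups)
    random.Random(seed).shuffle(trust_ids)

    n_train = int(len(trust_ids) * ratios[0])
    n_val = int(len(trust_ids) * ratios[1])
    buckets = {
        "train": trust_ids[:n_train],
        "val": trust_ids[n_train:n_train + n_val],
        "test": trust_ids[n_train + n_val:],  # remainder goes to test
    }
    return {
        name: [s for tid in ids for s in groups[tid]]
        for name, ids in buckets.items()
    }
```

Because assignment happens per trust rather than per sample, text excerpts from the same trust cannot appear in both training and evaluation data, which is what "no leakage" refers to.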