fund_rfid_data

herzogfloria/fund_rfid_data

Commit Graph

Author: Florian Herzog

SHA1: 991715ab76

Add LLM role-check grounding + labelled training-set pipeline

- llm_extract.py: match mode now window-parallel with retrieval pre-filter,
  claim dedup, retry, and enable_thinking=false (vLLM) -> ~36x faster per call;
  n_failed_windows/ok flags so an interrupted run never records bogus 0s.
- build_rdf_dataset.py:
  - gold now includes the share-class level (hasShareClass/ticker/className)
  - grounding modes alias|llm|name|context|none (--grounding); llm reads the
    role-check verdicts from match_all.jsonl
  - label stage: per-triple extractable + per-sample FULL/PARTIAL/NONE
  - trainset stage: combines GROUNDED triples with focused TEXT EXCERPTS cut
    around the actual provider statement (evidence), not the multi-MB book
  - split --src to split trainset.jsonl (trust-level, no leakage)
- helper scripts: watch_match.sh, resume_match.sh (crash/sleep-safe resume),
  finalize_dataset.sh
- final dataset: 335/335 trusts, 85% text<->gold agreement, 334 samples,
  10,689 grounded triples, train/val/test 264/35/35

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-10 13:52:50 +02:00
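
The "trust-level, no leakage" split mentioned in the commit message can be illustrated with a minimal sketch. This is not the code from build_rdf_dataset.py; the function name `split_by_group`, the `trust_id` field, the 80/10/10 ratios, and the seed are all assumptions made for illustration. The idea is only that whole trusts, not individual samples, are assigned to a split, so no trust appears in more than one of train/val/test.

```python
import random
from collections import defaultdict


def split_by_group(samples, key="trust_id", ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole groups to train/val/test so no group spans two splits.

    `key` is a hypothetical field name; the real trainset.jsonl schema may differ.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)

    # Sort before shuffling so the result depends only on the seed,
    # not on the order in which samples were read.
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)

    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    parts = {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder goes to test
    }
    return {
        name: [s for gid in gids for s in groups[gid]]
        for name, gids in parts.items()
    }
```

Splitting on the group identifier rather than on rows is what prevents leakage: two excerpts from the same trust can never end up on opposite sides of the train/test boundary.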

1 Commit
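
The commit message also describes `n_failed_windows`/`ok` flags that keep an interrupted run from recording bogus 0s. A hedged sketch of that bookkeeping follows. It is not taken from llm_extract.py: the function `run_windows`, the `extract` callback, the `max_retries` parameter, and the choice to report `n_claims` as `None` on failure are illustrative assumptions. Only the two flag names come from the commit message.

```python
def run_windows(windows, extract, max_retries=2):
    """Run `extract` on each window with retries and track failures.

    Returns a record whose count is None (not 0) when any window failed,
    so a partial run cannot be mistaken for "nothing found".
    """
    claims = []
    n_failed_windows = 0
    for window in windows:
        for attempt in range(max_retries + 1):
            try:
                claims.extend(extract(window))
                break  # this window succeeded, move on
            except Exception:
                if attempt == max_retries:
                    n_failed_windows += 1  # retries exhausted

    ok = n_failed_windows == 0
    return {
        "ok": ok,
        "n_failed_windows": n_failed_windows,
        # Assumption: an incomplete run reports no count at all.
        "n_claims": len(claims) if ok else None,
        "claims": claims,
    }
```

A downstream consumer can then skip or re-queue any record with `ok` set to false instead of treating a missing result as a genuine zero.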