fund_rfid_data/finalize_dataset.sh
Florian Herzog 991715ab76 Add LLM role-check grounding + labelled training-set pipeline
- llm_extract.py: match mode now window-parallel with retrieval pre-filter,
  claim dedup, retry, and enable_thinking=false (vLLM) -> ~36x faster per call;
  n_failed_windows/ok flags so an interrupted run never records bogus 0s.
- build_rdf_dataset.py:
  - gold now includes the share-class level (hasShareClass/ticker/className)
  - grounding modes alias|llm|name|context|none (--grounding); llm reads the
    role-check verdicts from match_all.jsonl
  - label stage: per-triple extractable + per-sample FULL/PARTIAL/NONE
  - trainset stage: combines GROUNDED triples with focused TEXT EXCERPTS cut
    around the actual provider statement (evidence), not the multi-MB book
  - split --src to split trainset.jsonl (trust-level, no leakage)
- helper scripts: watch_match.sh, resume_match.sh (crash/sleep-safe resume),
  finalize_dataset.sh
- final dataset: 335/335 trusts, 85% text<->gold agreement, 334 samples,
  10,689 grounded triples, train/val/test 264/35/35

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 13:52:50 +02:00

#!/usr/bin/env bash
# ------------------------------------------------------------------
# finalize_dataset.sh: consolidates all match part files and builds the
# training-ready dataset: label -> trainset -> split (the split step is
# run manually afterwards).
#
# bash finalize_dataset.sh
#
# Idempotent: consolidates only CLEAN trusts (n_failed_windows==0) from
# the match_*.jsonl part files into match_all.jsonl, then runs the pipeline.
# ------------------------------------------------------------------
set -euo pipefail
cd "$(dirname "$0")"
python3 - <<'PY'
import glob
import json
import os

# Later files override earlier ones for the same CIK.
# match_remaining_final.jsonl is listed again last, so its records win.
parts = (["data/rdf_poc/match_all_clean79.jsonl",
          "data/rdf_poc/match_remaining.jsonl"]
         + glob.glob("data/rdf_poc/match_remaining_*.jsonl")
         + glob.glob("data/rdf_poc/match_remaining_final.jsonl"))

good = {}  # cik -> clean record
for p in parts:
    if not os.path.exists(p):
        continue
    with open(p) as fh:
        for line in fh:
            try:
                r = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed or truncated lines
            # Keep only trusts with no failed windows and a triples list.
            if r.get("n_failed_windows", 0) == 0 and r.get("triples") is not None:
                good[r["cik"]] = r

with open("data/rdf_poc/match_all.jsonl", "w") as f:
    for r in good.values():
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

with open("data/rdf_poc/match_input.jsonl") as fh:
    allin = [json.loads(line) for line in fh]
remaining = [r for r in allin if r["cik"] not in good]

# Share of consolidated triples flagged llm_grounded.
tg = tt = 0
for r in good.values():
    for t in r["triples"]:
        tt += 1
        tg += 1 if t.get("llm_grounded") else 0
print(f"consolidated: {len(good)}/335 trusts, {100*tg/max(1,tt):.0f}% agreement, "
      f"{len(remaining)} still open")
PY
echo ""
echo "=== label (FULL/PARTIAL/NONE) ==="
python3 build_rdf_dataset.py label
echo ""
echo "=== trainset (excerpts + grounded triples) ==="
python3 build_rdf_dataset.py trainset
echo ""
echo "Done. trainset.jsonl is the training-ready dataset."
echo "Split: python3 build_rdf_dataset.py split (on a per-sample basis)"
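
The consolidation rule in the heredoc above keeps a record only if `n_failed_windows` is 0 and `triples` is not null, and the last clean record per `cik` wins. A minimal sketch of that rule on in-memory JSONL lines follows. The function name `consolidate` and the sample records are hypothetical and serve only as illustration; they are not part of the repository.

```python
import json


def consolidate(lines):
    """Return {cik: record} with the last clean record per cik.

    Hypothetical helper mirroring the filter used in finalize_dataset.sh.
    """
    good = {}
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed or truncated line: skip it
        # Clean means: no failed windows and a triples list is present.
        if rec.get("n_failed_windows", 0) == 0 and rec.get("triples") is not None:
            good[rec["cik"]] = rec
    return good


# Illustrative records only (made-up CIKs).
sample = [
    '{"cik": "1", "n_failed_windows": 0, "triples": [{"llm_grounded": true}]}',
    '{"cik": "2", "n_failed_windows": 3, "triples": []}',  # failed windows: dropped
    '{"cik": "1", "n_failed_windows": 0, "triples": []}',  # later record replaces the first
    '{"cik": "3", "n_failed_wind',                         # truncated line: skipped
]
result = consolidate(sample)
print(sorted(result))
```

Because later records replace earlier ones, the order in which part files are read decides which run's result ends up in `match_all.jsonl` when a trust appears more than once.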