fund_rfid_data/resume_match.sh
Florian Herzog 991715ab76 Add LLM role-check grounding + labelled training-set pipeline
- llm_extract.py: match mode now window-parallel with retrieval pre-filter,
  claim dedup, retry, and enable_thinking=false (vLLM) -> ~36x faster per call;
  n_failed_windows/ok flags so an interrupted run never records bogus 0s.
- build_rdf_dataset.py:
  - gold now includes the share-class level (hasShareClass/ticker/className)
  - grounding modes alias|llm|name|context|none (--grounding); llm reads the
    role-check verdicts from match_all.jsonl
  - label stage: per-triple extractable + per-sample FULL/PARTIAL/NONE
  - trainset stage: combines GROUNDED triples with focused TEXT EXCERPTS cut
    around the actual provider statement (evidence), not the multi-MB book
  - split --src to split trainset.jsonl (trust-level, no leakage)
- helper scripts: watch_match.sh, resume_match.sh (crash/sleep-safe resume),
  finalize_dataset.sh
- final dataset: 335/335 trusts, 85% text<->gold agreement, 334 samples,
  10,689 grounded triples, train/val/test 264/35/35

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 13:52:50 +02:00


#!/usr/bin/env bash
# ------------------------------------------------------------------
# resume_match.sh: cleanly resume the LLM match after an interruption
# (lid closed / sleep / crash).
#
# bash resume_match.sh
#
# Idea: all trusts that were already matched cleanly (across ALL part
# files, WITHOUT failed windows) are consolidated into
# data/rdf_poc/match_all.jsonl; all missing OR crash-contaminated trusts
# are matched again. Can be repeated as often as needed; each run only
# processes what is left.
#
# Sleep-safe: runs under caffeinate. BUT: with the LID CLOSED on a
# MacBook without an external monitor, the system still sleeps
# (clamshell mode). In that case, just run the script again after
# opening the lid.
# ------------------------------------------------------------------
set -euo pipefail
cd "$(dirname "$0")"
OUT="data/rdf_poc/match_all.jsonl"
INPUT="data/rdf_poc/match_input.jsonl"
WIN=120000; OVERLAP=10000; WORKERS=2; WINWORKERS=6
# 1) Consolidate all clean results from the part files (match_all.jsonl,
#    match_all_clean*.jsonl, match_remaining*.jsonl)
python3 - "$OUT" "$INPUT" <<'PY'
import glob, json, os, sys

out_path, input_path = sys.argv[1], sys.argv[2]
parts = sorted(glob.glob("data/rdf_poc/match_all.jsonl")
               + glob.glob("data/rdf_poc/match_all_clean*.jsonl")
               + glob.glob("data/rdf_poc/match_remaining*.jsonl"))
good = {}  # cik -> record, only if no failed windows
for p in parts:
    if not os.path.exists(p):
        continue
    with open(p) as fh:
        for line in fh:
            try:
                r = json.loads(line)
            except ValueError:
                continue  # skip blank or truncated lines left by a crashed run
            if r.get("n_failed_windows", 0) == 0 and r.get("triples") is not None:
                good[r["cik"]] = r  # last clean wins
with open(out_path, "w") as f:
    for r in good.values():
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
# Remaining (missing or contaminated) trusts -> remaining input
with open(input_path) as fh:
    allin = [json.loads(line) for line in fh if line.strip()]
remaining = [r for r in allin if r["cik"] not in good]
with open("data/rdf_poc/match_input_remaining.jsonl", "w") as f:
    for r in remaining:
        f.write(json.dumps(r) + "\n")
print(f"cleanly consolidated: {len(good)} trusts -> {out_path}")
print(f"still to match: {len(remaining)} trusts")
PY
REMAIN=$(wc -l < data/rdf_poc/match_input_remaining.jsonl | tr -d ' ')
if [ "$REMAIN" = "0" ]; then
  echo "ALL done: match_all.jsonl is complete."
  exit 0
fi
echo "Starting follow-up run for $REMAIN trusts (under caffeinate)..."
# under caffeinate, so that system sleep (but not a closed lid) does not kill the run
caffeinate -i -m -s python3 -u -c "
import llm_extract as L, sys
L.WINDOW_WORKERS=$WINWORKERS
sys.argv=['x','--mode','match','--backend','vllm',
'--in','data/rdf_poc/match_input_remaining.jsonl',
'--out','data/rdf_poc/match_remaining_$(date +%H%M%S).jsonl',
'--window','$WIN','--overlap','$OVERLAP','--num-ctx','131072','--workers','$WORKERS']
L.main()
"
echo ""
echo "Follow-up part finished. Run 'bash resume_match.sh' again"
echo "to consolidate and (if anything is left) continue."
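
The consolidation rule from step 1 can be tried in isolation. The following is a minimal sketch and not part of the script: the function name `consolidate` and the sample records are hypothetical, and only the field names `cik`, `n_failed_windows` and `triples` are taken from the script above. It shows which records count as clean and which trusts would end up in the remaining input.

```python
import json


def consolidate(lines):
    """Return {cik: record}, keeping the last clean record per cik."""
    good = {}
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # truncated line, e.g. from a crash in mid-write
        # Clean means: no failed windows and a triples field that is not null.
        if rec.get("n_failed_windows", 0) == 0 and rec.get("triples") is not None:
            good[rec["cik"]] = rec
    return good


# Made-up sample lines with hypothetical CIKs.
sample = [
    '{"cik": "1", "n_failed_windows": 0, "triples": []}',
    '{"cik": "2", "n_failed_windows": 3, "triples": []}',    # failed windows
    '{"cik": "3", "n_failed_windows": 0, "triples": null}',  # no result
    '{"cik": "4", "n_failed_windows": 0, "tri',              # truncated line
    '{"cik": "1", "n_failed_windows": 0, "triples": [["s", "p", "o"]]}',
]

good = consolidate(sample)
all_ciks = ["1", "2", "3", "4"]
remaining = [c for c in all_ciks if c not in good]
print(sorted(good), remaining)
```

Note that an empty `triples` list still counts as clean under this rule; only a missing or null `triples` field, or a non-zero `n_failed_windows`, sends a trust back into the remaining input.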