- llm_extract.py: match mode is now window-parallel with a retrieval pre-filter,
  claim dedup, retry, and enable_thinking=false (vLLM) -> ~36x faster per call;
  n_failed_windows/ok flags ensure an interrupted run never records bogus 0s.
- build_rdf_dataset.py:
  - gold now includes the share-class level (hasShareClass/ticker/className)
  - grounding modes alias|llm|name|context|none (--grounding); llm reads the
    role-check verdicts from match_all.jsonl
  - label stage: per-triple extractable + per-sample FULL/PARTIAL/NONE
  - trainset stage: combines GROUNDED triples with focused TEXT EXCERPTS cut
    around the actual provider statement (evidence), not the multi-MB book
  - split --src to split trainset.jsonl (trust-level, no leakage)
- helper scripts: watch_match.sh, resume_match.sh (crash/sleep-safe resume),
  finalize_dataset.sh
- final dataset: 335/335 trusts, 85% text<->gold agreement, 334 samples,
  10,689 grounded triples, train/val/test 264/35/35

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
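
The "trust-level, no leakage" split mentioned above can be illustrated with a minimal sketch. It is not the code in build_rdf_dataset.py: the function name `split_by_trust`, the 80/10/10 fractions, the seed, and the assumption that each sample carries a `cik` key identifying its trust are all hypothetical. The point is only that whole trusts, not individual samples, are assigned to a split, so no trust appears in more than one of train/val/test.

```python
import random
from collections import defaultdict


def split_by_trust(samples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole trusts to train/val/test (hypothetical helper).

    Assumes every sample is a dict with a 'cik' key naming its trust.
    """
    # Group samples by trust so a trust can never straddle two splits.
    by_trust = defaultdict(list)
    for sample in samples:
        by_trust[sample["cik"]].append(sample)

    # Shuffle the trust ids deterministically, then cut the id list.
    ciks = sorted(by_trust)
    random.Random(seed).shuffle(ciks)
    n_train = round(len(ciks) * fractions[0])
    n_val = round(len(ciks) * fractions[1])
    groups = {
        "train": ciks[:n_train],
        "val": ciks[n_train:n_train + n_val],
        "test": ciks[n_train + n_val:],
    }

    # Expand each group of trust ids back into its samples.
    return {
        name: [s for cik in group for s in by_trust[cik]]
        for name, group in groups.items()
    }
```

Splitting on the trust id rather than on sample rows is what rules out leakage: even if one trust contributes several samples, they all land in the same split.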
73 lines
3.0 KiB
Bash
#!/usr/bin/env bash
# ------------------------------------------------------------------
# resume_match.sh: cleanly resume the LLM match after an interruption
# (lid closed / sleep / crash).
#
#   bash resume_match.sh
#
# Idea: all trusts that were already matched cleanly (across ALL partial
# files, WITHOUT failed windows) are consolidated into
# data/rdf_poc/match_all.jsonl; all missing OR crash-contaminated trusts
# are matched again. Can be repeated as often as needed; each run only
# handles what is left.
#
# Sleep-safe: runs under caffeinate. BUT: with the LID CLOSED on a
# MacBook without an external monitor, the system still sleeps
# (clamshell). In that case, simply restart after opening the lid.
# ------------------------------------------------------------------
set -euo pipefail
cd "$(dirname "$0")"

OUT="data/rdf_poc/match_all.jsonl"
INPUT="data/rdf_poc/match_input.jsonl"
WIN=120000; OVERLAP=10000; WORKERS=2; WINWORKERS=6

# 1) Consolidate all clean results from all match_*.jsonl partial files
python3 - "$OUT" "$INPUT" <<'PY'
import json, glob, sys, os
out_path, input_path = sys.argv[1], sys.argv[2]
parts = sorted(glob.glob("data/rdf_poc/match_all.jsonl")
               + glob.glob("data/rdf_poc/match_all_clean*.jsonl")
               + glob.glob("data/rdf_poc/match_remaining*.jsonl"))
good = {}  # cik -> record, only if no failed windows
for p in parts:
    if not os.path.exists(p): continue
    with open(p) as fh:
        for l in fh:
            try: r = json.loads(l)
            except json.JSONDecodeError: continue  # skip blank or truncated lines left by a crash
            if r.get("n_failed_windows", 0) == 0 and r.get("triples") is not None:
                good[r["cik"]] = r  # last clean wins
with open(out_path, "w") as f:
    for r in good.values():
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
# Remaining (missing or contaminated) trusts -> remaining input
with open(input_path) as fh:
    allin = [json.loads(l) for l in fh if l.strip()]
remaining = [r for r in allin if r["cik"] not in good]
with open("data/rdf_poc/match_input_remaining.jsonl", "w") as f:
    for r in remaining:
        f.write(json.dumps(r) + "\n")
print(f"cleanly consolidated: {len(good)} trusts -> {out_path}")
print(f"still to match: {len(remaining)} trusts")
PY

REMAIN=$(wc -l < data/rdf_poc/match_input_remaining.jsonl | tr -d ' ')
if [ "$REMAIN" = "0" ]; then
  echo "ALL done: match_all.jsonl is complete."
  exit 0
fi

echo "Starting follow-up run for $REMAIN trusts (under caffeinate)..."
# under caffeinate so that system sleep (not closing the lid) does not kill the run
caffeinate -i -m -s python3 -u -c "
import llm_extract as L, sys
L.WINDOW_WORKERS=$WINWORKERS
sys.argv=['x','--mode','match','--backend','vllm',
          '--in','data/rdf_poc/match_input_remaining.jsonl',
          '--out','data/rdf_poc/match_remaining_$(date +%H%M%S).jsonl',
          '--window','$WIN','--overlap','$OVERLAP','--num-ctx','131072','--workers','$WORKERS']
L.main()
"

echo ""
echo "Follow-up part finished. Run 'bash resume_match.sh' again"
echo "to consolidate and (if anything is left) continue."
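
The `--window 120000 --overlap 10000` settings passed above imply that consecutive windows share a 10,000-unit margin. The sketch below shows one plausible way such spans could be laid out; it is not taken from llm_extract.py. The function name `window_spans` is hypothetical, and it assumes the window is measured in characters and that each window starts `window - overlap` after the previous one, neither of which the script states.

```python
def window_spans(n_chars, window=120_000, overlap=10_000):
    """Return (start, end) spans covering a text of n_chars characters.

    Hypothetical helper: consecutive spans share `overlap` characters,
    so each new span starts `window - overlap` after the previous one.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, n_chars)
        spans.append((start, end))
        if end >= n_chars:  # the last span reaches the end of the text
            break
        start += step
    return spans


print(window_spans(300_000))
# -> [(0, 120000), (110000, 230000), (220000, 300000)]
```

Under these assumptions, a statement that straddles a window boundary still appears whole in at least one window as long as it is shorter than the overlap. The same fact may then be extracted from two adjacent windows, which may be why the commit lists claim dedup alongside window parallelism.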