- llm_extract.py: match mode now window-parallel with retrieval pre-filter,
  claim dedup, retry, and enable_thinking=false (vLLM) -> ~36x faster per call;
  n_failed_windows/ok flags so an interrupted run never records bogus 0s.
- build_rdf_dataset.py:
  - gold now includes the share-class level (hasShareClass/ticker/className)
  - grounding modes alias|llm|name|context|none (--grounding); llm reads the
    role-check verdicts from match_all.jsonl
  - label stage: per-triple extractable + per-sample FULL/PARTIAL/NONE
  - trainset stage: combines GROUNDED triples with focused TEXT EXCERPTS cut
    around the actual provider statement (evidence), not the multi-MB book
  - split --src to split trainset.jsonl (trust-level, no leakage)
- helper scripts: watch_match.sh, resume_match.sh (crash/sleep-safe resume),
  finalize_dataset.sh
- final dataset: 335/335 trusts, 85% text<->gold agreement, 334 samples,
  10,689 grounded triples, train/val/test 264/35/35
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
49 lines
1.8 KiB
Bash
#!/usr/bin/env bash
# ------------------------------------------------------------------
# finalize_dataset.sh: consolidates all match parts and builds the
# training-ready dataset: label -> trainset -> split
# (split is not run here; see the hint printed at the end).
#
#   bash finalize_dataset.sh
#
# Idempotent: consolidates only CLEAN trusts (n_failed_windows==0) from
# all match_*.jsonl part files into match_all.jsonl, then runs the pipeline.
# ------------------------------------------------------------------
set -euo pipefail
cd "$(dirname "$0")"

python3 - <<'PY'
import glob
import json
import os

# Part files are read in order; for a given cik the last clean record wins.
# match_remaining_final.jsonl is already matched by the first glob; it is
# listed again so that it is also read last.
parts = (["data/rdf_poc/match_all_clean79.jsonl",
          "data/rdf_poc/match_remaining.jsonl"]
         + glob.glob("data/rdf_poc/match_remaining_*.jsonl")
         + glob.glob("data/rdf_poc/match_remaining_final.jsonl"))

good = {}
for p in parts:
    if not os.path.exists(p):
        continue
    with open(p, encoding="utf-8") as fh:
        for line in fh:
            try:
                r = json.loads(line)
            except ValueError:  # skip empty or truncated lines
                continue
            # Keep only trusts with no failed windows and an actual triples list.
            if r.get("n_failed_windows", 0) == 0 and r.get("triples") is not None:
                good[r["cik"]] = r

with open("data/rdf_poc/match_all.jsonl", "w", encoding="utf-8") as f:
    for r in good.values():
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

with open("data/rdf_poc/match_input.jsonl", encoding="utf-8") as fh:
    allin = [json.loads(line) for line in fh]
remaining = [r for r in allin if r["cik"] not in good]

# tt = total triples, tg = triples flagged llm_grounded
tg = tt = 0
for r in good.values():
    for t in r["triples"]:
        tt += 1
        if t.get("llm_grounded"):
            tg += 1
print(f"consolidated: {len(good)}/335 trusts, {100*tg/max(1,tt):.0f}% agreement, "
      f"{len(remaining)} still pending")
PY

echo ""
echo "=== label (FULL/PARTIAL/NONE) ==="
python3 build_rdf_dataset.py label
echo ""
echo "=== trainset (excerpts + grounded triples) ==="
python3 build_rdf_dataset.py trainset
echo ""
echo "Done. trainset.jsonl is the training-ready dataset."
echo "Split:  python3 build_rdf_dataset.py split   (on a per-sample basis)"
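
The consolidation rule in the heredoc above can be tried out in isolation. The following is a minimal sketch, not the script's actual code: the function names `consolidate_clean` and `grounded_share` and the sample records are hypothetical, and only the fields the script itself reads (`cik`, `n_failed_windows`, `triples`, `llm_grounded`) are assumed to exist.

```python
import json


def consolidate_clean(lines):
    """Return {cik: record} for clean records; later lines win per cik."""
    good = {}
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:  # empty or truncated line, e.g. from an interrupted run
            continue
        if not isinstance(rec, dict):
            continue
        # Clean means: no failed windows and a triples list that is not null.
        if rec.get("n_failed_windows", 0) == 0 and rec.get("triples") is not None:
            good[rec["cik"]] = rec
    return good


def grounded_share(records):
    """Fraction of triples whose llm_grounded flag is truthy."""
    total = grounded = 0
    for rec in records:
        for triple in rec["triples"]:
            total += 1
            grounded += bool(triple.get("llm_grounded"))
    return grounded / max(1, total)


# Hypothetical sample: A appears twice (second record wins), B has failed
# windows, C has null triples, and the last line is cut off mid-record.
sample_lines = [
    '{"cik": "A", "n_failed_windows": 0, "triples": [{"llm_grounded": true}]}',
    '{"cik": "B", "n_failed_windows": 2, "triples": []}',
    '{"cik": "A", "n_failed_windows": 0, "triples": [{"llm_grounded": true}, {"llm_grounded": false}]}',
    '{"cik": "C", "n_failed_windows": 0, "triples": null}',
    '{"cik": "D", "n_fail',
]

good = consolidate_clean(sample_lines)
print(sorted(good), grounded_share(good.values()))
```

Because a record only overwrites an earlier one when it is itself clean, a later failed attempt for the same `cik` never displaces an earlier clean result; this is what makes re-running the consolidation after a resumed match run safe.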