Florian Herzog 09798eb27a Add LLM grounding pipeline: current-source fetch, alias + LLM role-check matching
Ensures text<->gold agreement for the text->triple dataset:
- fetch: newest full prospectus book + later 497 supplements (time-aligned with
  the current N-CEN gold; fixes stale 'largest book' picking 2-5yr-old filings)
- grounding: fast alias matcher (name present, variant-tolerant) AND an LLM
  role-check (llm_extract.py match mode, via local Ollama or remote vLLM server)
  that verifies the entity plays that ROLE -- catches right-name/wrong-role cases
  a lexical matcher over-keeps. Validated with a strong model; ~93% of gold names
  are present once granularity+currency are fixed.
- llm_extract.py: extract (baseline) + match (grounding) modes, sliding-window
  with retrieval pre-filter, claim dedup, retry, ollama/vllm backends
- build_rdf_dataset.py: --grounding alias|llm|name|context|none, whole-trust
  samples now filtered, MIN_PROSE stub guard

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 13:45:32 +02:00

fund_rdf_data

A relationship-rich finance dataset for text-to-RDF-triple extraction, built from mandatory U.S. SEC fund disclosures. Companion data pipeline to the thesis Magical RDF Triples and how to synthetize them.

Each sample pairs a long natural-language prospectus (incl. SAI) (input) with a compact graph of entity-to-entity RDF triples (target) — a fund advised by a manager, distributed by an underwriter, seriesOf a trust, and so on. Unlike Wikidata-derived corpora where text ≈ triples, here the input is far larger than the output, and the target is a genuine knowledge graph rather than flat attributes. Ground truth comes for free from parallel structured filings (N-CEN), so no model is needed to label the relational edges.

Text↔triple agreement — the central requirement. For a text→triple dataset a gold triple is only useful if the fact is actually stated in the input. Three things were needed to achieve high agreement:

  1. Granularity: one sample per trust from the full prospectus+SAI book, not the tiny per-fund summary segment (where most provider names are absent).
  2. Currency: the N-CEN gold is current, so the source must be too. Fetch the newest full book plus all later supplements (497), not an old or partial filing. (A "largest book" can be 2-5 years stale and disagree with the gold.)
  3. Role-correct grounding: a name appearing in the text is not enough; it must appear in that role. A lexical/alias matcher over-keeps (e.g. a bank named as securities-lending agent is not the custodian; a parent company is not the named sub-adviser). An LLM verifier (see below) checks the role and is the accurate filter; the fast alias matcher is a cheap pre-screen.

With (1)+(2), the entity name is present for ~93% of gold triples; the LLM role-check then keeps only those actually asserted in that role.
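As an illustration of the alias pre-screen described above, here is a minimal sketch of a variant-tolerant name matcher. It is not the project's actual implementation: the suffix list, the function names (`clean`, `core_name`, `name_present`) and the normalization rules are assumptions chosen for the example.

```python
import re

# Hypothetical suffix list; the project's real alias rules are not shown in this README.
LEGAL_SUFFIXES = {"llc", "lp", "llp", "inc", "corp", "corporation",
                  "co", "company", "ltd", "na"}


def clean(s: str) -> str:
    """Lowercase, expand '&', drop periods, and collapse everything else to single spaces."""
    s = s.lower().replace("&", " and ").replace(".", "")
    return re.sub(r"[^a-z0-9]+", " ", s).strip()


def core_name(name: str) -> str:
    """Strip trailing legal-form tokens so 'X Co' and 'X Company' share one core."""
    tokens = clean(name).split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def name_present(gold_name: str, text: str) -> bool:
    """True if the core of the gold name occurs as a whole-token run in the text."""
    core = core_name(gold_name)
    return bool(core) and f" {core} " in f" {clean(text)} "


text = ("Custody services are provided by State Street Bank and Trust Company. "
        "The Fund's distributor is Foreside Fund Services, LLC.")
print(name_present("STATE STREET BANK & TRUST CO", text))
print(name_present("Northern Trust Company", text))
```

A matcher like this only answers "is the name there?". It would return the same result if State Street were mentioned solely as securities-lending agent, which is the over-keeping that the LLM role-check is meant to catch.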

See dataset_description.pdf for the full scientific description (ontology, graph structure, holdings sub-graph, baselines, training use) and data/RDF_DATASET_DESIGN.md for design notes.

Pipeline

The dataset is built by build_rdf_dataset.py in four stages:

# 1. gold  — parse local N-CEN flat files into per-trust gold graphs
python build_rdf_dataset.py gold --custodian-scope primary

# 2. fetch — newest full prospectus book + all later supplements (497) per trust,
#    aligned in time with the current N-CEN gold. (--no-sai to skip SAI docs.)
python build_rdf_dataset.py fetch --limit 435 --max-filings 8   # --ciks ... to target

# 3. samples — one sample per TRUST (whole book -> all the trust's triples),
#    filtered by the grounding mode (default alias; use llm for role-correct).
python build_rdf_dataset.py samples --whole-trust --grounding alias

# (optional) LLM role-check grounding: run the match, then rebuild with --grounding llm
python llm_extract.py --mode match --backend vllm \
    --in data/rdf_poc/match_input.jsonl --out data/rdf_poc/match_all.jsonl --workers 3
python build_rdf_dataset.py samples --whole-trust --grounding llm

# 4. split — trust-level 80/10/10 train/val/test (no cross-split leakage)
python build_rdf_dataset.py split

# or run all four:
python build_rdf_dataset.py all --limit 435

The four stages above produce the complete dataset (samples.jsonl and the train/val/test splits). Nothing below is required to build or use it.
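To make the "no cross-split leakage" property of the split stage concrete, the sketch below assigns every sample to a split based only on its trust identifier (`cik`), so all samples of one trust land in the same split. This is an assumed approach for illustration; build_rdf_dataset.py may shuffle and slice instead, and a hash-based assignment yields only approximately 80/10/10.

```python
import hashlib


def split_of(cik: str) -> str:
    """Deterministically map a trust identifier to train/val/test (roughly 80/10/10)."""
    bucket = int(hashlib.sha256(cik.encode("utf-8")).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"
    return "val" if bucket == 8 else "test"


def split_samples(samples):
    """Group samples by split; the split depends on the cik alone, never on the sample."""
    parts = {"train": [], "val": [], "test": []}
    for sample in samples:
        parts[split_of(str(sample["cik"]))].append(sample)
    return parts


# Toy data: 200 samples spread over 50 hypothetical trusts.
samples = [{"cik": str(1000 + i % 50), "series_id": f"S{i}"} for i in range(200)]
parts = split_samples(samples)
print({name: len(rows) for name, rows in parts.items()})
```

Because the assignment is a pure function of `cik`, re-running the split after adding new trusts does not move existing trusts between splits.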

Scoring (optional)

score_baseline.py scores any model's predictions against the N-CEN gold (triple-level P/R/F1, per relation), and also provides a no-model lexical lower bound:

python score_baseline.py stringmatch          # no-model lexical lower bound
python score_baseline.py model --pred preds.jsonl
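For readers unfamiliar with triple-level scoring, the following sketch shows one way to compute precision, recall and F1 overall and per relation. It assumes exact matching of (subject, predicate, object) after trimming and lowercasing the entity names; the matching rules actually used by score_baseline.py are not documented here, and the entity names in the example are made up.

```python
from collections import defaultdict


def prf(tp: int, n_pred: int, n_gold: int):
    """Precision, recall and F1 from raw counts, with 0.0 for empty denominators."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


def norm(triple):
    """Assumed normalization: case-insensitive entity names, predicate kept as-is."""
    s, p, o = triple
    return (s.strip().lower(), p, o.strip().lower())


def score(gold, pred):
    """Return (overall P/R/F1, {relation: P/R/F1}) for two lists of (s, p, o) triples."""
    g = {norm(t) for t in gold}
    p = {norm(t) for t in pred}
    counts = defaultdict(lambda: [0, 0, 0])  # relation -> [tp, n_pred, n_gold]
    for t in p:
        counts[t[1]][1] += 1
        if t in g:
            counts[t[1]][0] += 1
    for t in g:
        counts[t[1]][2] += 1
    per_relation = {rel: prf(*c) for rel, c in counts.items()}
    return prf(len(g & p), len(p), len(g)), per_relation


gold = [("Fund A", "advisedBy", "Acme Advisors"), ("Fund A", "seriesOf", "Acme Trust")]
pred = [("fund a", "advisedBy", "ACME ADVISORS"), ("Fund A", "administrator", "Foo Services")]
overall, per_relation = score(gold, pred)
print(overall, per_relation["advisedBy"])
```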

Optional: LLM extractability check

This is a diagnostic / quality-assurance tool only — it is not part of the dataset build and is not needed to train on or use the dataset.

A target triple is only useful if the fact is actually present in the input text. Whether that holds is a semantic question that lexical substring/keyword matching cannot answer reliably (a brand name like "John Hancock" matches the fund heading, not the administrator role; "administration services are provided by X" uses no fixed keyword). llm_extract.py therefore lets a strong open-source instruct model (local, via Ollama; default qwen3.6:35b) read each segment + ontology and extract the triples it can find. What it extracts is, by construction, present in the text — so its output doubles as (a) a strong extraction baseline and (b) a semantic extractability check on the gold.

# requires a running Ollama server with the model pulled
python llm_extract.py --in data/rdf_poc/test.jsonl --out data/rdf_poc/preds_qwen.jsonl
python score_baseline.py model --pred data/rdf_poc/preds_qwen.jsonl

Long inputs → sliding window. A single full-book input (0.5-1 MB) is slow in one huge-context call and suffers from "lost in the middle". llm_extract.py therefore slides an overlapping window over any text longer than --window (default 40 KB, --overlap 8 KB), extracts per window, and unions the de-duplicated triples, so service-provider facts that live only in the SAI section near the end are reliably seen by some window. This windowing is an inference strategy for the check, not a transformation of the dataset (the stored input_text always remains the full text).
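The windowing idea can be sketched as follows. This is an illustrative outline, not the code in llm_extract.py: sizes are treated as character counts, and `extract_fn` is a hypothetical stand-in for the per-window LLM call.

```python
def windows(text: str, window: int = 40_000, overlap: int = 8_000):
    """Yield overlapping slices of text; short texts are yielded whole."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    if len(text) <= window:
        yield text
        return
    step = window - overlap
    for start in range(0, len(text), step):
        yield text[start:start + window]
        if start + window >= len(text):
            break  # the last slice already reaches the end of the text


def extract_windowed(text, extract_fn, window=40_000, overlap=8_000):
    """Run extract_fn on every window and union the triples, keeping first-seen order."""
    seen, merged = set(), []
    for chunk in windows(text, window, overlap):
        for triple in extract_fn(chunk):
            key = tuple(triple)
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


def fake_extract(chunk):
    # Hypothetical extractor: "finds" one triple whenever the marker word is visible.
    return [("Fund A", "advisedBy", "Acme Advisors")] if "ADVISER" in chunk else []


# The marker sits in the overlap region, so two windows see it; the union keeps it once.
doc = "x" * 33 + "ADVISER" + "x" * 60
print(extract_windowed(doc, fake_extract, window=40, overlap=8))
```

The overlap matters for facts that straddle a window boundary: without it, a sentence cut in half would be invisible to both neighbouring windows.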

Granularity is what determines agreement. Scored against the small per-fund summary segment, only advisedBy/subAdvisedBy are reliably present (~0.5-0.6 recall) and custodian collapses to ~0.05. This is the motivating Cambria case, where the foreign sub-custodians filed in N-CEN never appear in the summary. Scored against the full single-book trust prospectus+SAI, the gold names are present ~90% overall (custodian 0.81, administrator/adviser 0.94, seriesOf 0.98): the information was never missing, the per-fund segment was simply the wrong unit. The genuinely non-text facts (foreign sub-custodians) stay correctly absent, so they should not be training targets. This is why the recommended build is one sample per trust from one full book.

Sample format

Each line of samples.jsonl / train|val|test.jsonl is a JSON record:

| field | meaning |
| --- | --- |
| input_text | prospectus prose for the fund (model input) |
| ontology | inferred meta-schema (subject type → predicate → object type) |
| target_triples | structured {s,p,o,grounded} list (grounded = object name appears in input_text) |
| target_serialized | marker form (<triple_start> …) for Models 2/4 |
| target_serialized_plain | Turtle-like form, no special tokens, for Models 1/3 |
| cik, series_id, fund, trust_name | identifiers |
| stats | input/target sizes, triple count, n_grounded, text:json ratio |

Relations

Entity-to-entity edges (gold from N-CEN / Series-Class): seriesOf, advisedBy, subAdvisedBy, transferAgent, administrator, underwrittenBy.

custodian is dropped by default (--custodian-scope none): custodian names — especially foreign sub-custodians — appear only in the structured N-CEN table and in no prose document (the summary prospectus says only "the custodian"), so they are not extractable from text. The primary custodian is named only in the separately-filed SAI (N-1A Part B), which is not part of the input. Use --custodian-scope primary or all to re-include it if you add the SAI as input.

Prose-grounding: every triple carries a grounded flag (object name present in the sample's input). Across the full build ~80 % of triples are grounded (per relation: advisedBy 93 %, seriesOf/subAdvisedBy/administrator 80-84 %, transferAgent 72 %, underwrittenBy 62 %). Filter on grounded to train/evaluate only on text-extractable targets.
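Filtering on the grounded flag could look like the sketch below. It relies only on the `target_triples` field and its `grounded` key as described under Sample format; the helper names are hypothetical, and the serialized target fields are left untouched, so they would need to be regenerated separately if used for training.

```python
import json


def grounded_only(record: dict) -> dict:
    """Return a copy of the record that keeps only triples flagged as grounded."""
    kept = [t for t in record["target_triples"] if t.get("grounded")]
    return {**record, "target_triples": kept}


def load_grounded(path: str):
    """Stream a samples JSONL file, skipping records with no grounded triple left."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                record = grounded_only(json.loads(line))
                if record["target_triples"]:
                    yield record


# Toy record with made-up entity names, shaped like one line of samples.jsonl.
record = {
    "cik": "0000000000",
    "target_triples": [
        {"s": "Fund A", "p": "advisedBy", "o": "Acme Advisors", "grounded": True},
        {"s": "Fund A", "p": "underwrittenBy", "o": "Beta Distributors", "grounded": False},
    ],
}
print(grounded_only(record)["target_triples"])
```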

Holdings edges (holds/issuedBy/domiciledIn, gold from N-PORT) are a planned second track from annual-report (N-CSR) commentary — see the description PDF.

Data sources

All inputs are public SEC filings (EDGAR, DERA data sets). The large raw data (prospectus prose, bulk N-CEN/N-PORT/XBRL downloads) and the generated sample files are git-ignored because they are reproducible from the commands above; only the lightweight structured gold graph (data/rdf_poc/gold_graphs.jsonl) is committed.

Requirements

pip install -r requirements.txt