fund_rdf_data
A relationship-rich finance dataset for text-to-RDF-triple extraction, built from mandatory U.S. SEC fund disclosures. Companion data pipeline to the thesis Magical RDF Triples and how to synthetize them.
Each sample pairs a long natural-language prospectus, including the SAI, as input with a compact graph of entity-to-entity RDF triples as target: a fund advised by a manager, distributed by an underwriter, seriesOf a trust, and so on. Unlike Wikidata-derived corpora where text ≈ triples, here the input is far larger than the output, and the target is a genuine knowledge graph rather than flat attributes. Ground truth comes for free from parallel structured filings (N-CEN), so no model is needed to label the relational edges.
Text↔triple agreement — the central requirement. For a text→triple dataset a gold triple is only useful if the fact is actually stated in the input. Three things were needed to achieve high agreement:
- Granularity — one sample per trust from the full prospectus+SAI book, not the tiny per-fund summary segment (where most provider names are absent).
- Currency — the N-CEN gold is current, so the source must be too: fetch the newest full book plus all later supplements (497), not an old or partial filing. (A "largest book" can be 2–5 years stale and disagree with the gold.)
- Role-correct grounding — a name appearing in the text is not enough; it must appear in that role. A lexical/alias matcher over-keeps (e.g. a bank named as securities-lending agent is not the custodian; a parent company is not the named sub-adviser). An LLM verifier (see below) checks the role and is the accurate filter; the fast alias matcher is a cheap pre-screen.
With the first two points (granularity and currency) fixed, the entity name is present for ~93% of gold triples; the LLM role-check then keeps only those actually asserted in that role.
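The cheap pre-screen described above can be pictured as a variant-tolerant name-presence test. The following is a minimal sketch, not the matcher shipped in this repository: the suffix list, the function names `normalize` and `name_present`, and the example company names are all hypothetical. Like any lexical matcher it only tells you the name is present, not that it appears in the right role.

```python
import re

# Hypothetical list of legal-form suffixes to ignore; the real matcher's
# variant handling is not documented here.
_SUFFIXES = {"inc", "llc", "lp", "ltd", "co", "corp", "corporation", "company"}


def _tokens(s: str) -> list:
    """Lowercase, map '&' to 'and', and split on anything non-alphanumeric."""
    return re.sub(r"[^a-z0-9]+", " ", s.lower().replace("&", " and ")).split()


def normalize(name: str) -> str:
    """Canonical form of an entity name with trailing legal suffixes removed."""
    tokens = _tokens(name)
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def name_present(gold_name: str, text: str) -> bool:
    """True if the normalized gold name occurs as a whole-token run in the text."""
    needle = normalize(gold_name)
    if not needle:
        return False
    haystack = " " + " ".join(_tokens(text)) + " "
    return f" {needle} " in haystack
```

A name such as "Acme Bank and Trust Company" would match prose that says "Acme Bank & Trust Co.", which is exactly why this filter over-keeps: it would match equally well in a sentence about securities lending.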
See dataset_description.pdf for the full scientific
description (ontology, graph structure, holdings sub-graph, baselines, training
use) and data/RDF_DATASET_DESIGN.md for design
notes.
Pipeline
The dataset is built by build_rdf_dataset.py in four stages:
# 1. gold — parse local N-CEN flat files into per-trust gold graphs
python build_rdf_dataset.py gold --custodian-scope primary
# 2. fetch — newest full prospectus book + all later supplements (497) per trust,
# aligned in time with the current N-CEN gold. (--no-sai to skip SAI docs.)
python build_rdf_dataset.py fetch --limit 435 --max-filings 8 # --ciks ... to target
# 3. samples — one sample per TRUST (whole book -> all the trust's triples),
# filtered by the grounding mode (default alias; use llm for role-correct).
python build_rdf_dataset.py samples --whole-trust --grounding alias
# (optional) LLM role-check grounding: run the match, then rebuild with --grounding llm
python llm_extract.py --mode match --backend vllm \
--in data/rdf_poc/match_input.jsonl --out data/rdf_poc/match_all.jsonl --workers 3
python build_rdf_dataset.py samples --whole-trust --grounding llm
# 4. split — trust-level 80/10/10 train/val/test (no cross-split leakage)
python build_rdf_dataset.py split
# or run all four:
python build_rdf_dataset.py all --limit 435
The four stages above produce the complete dataset (samples.jsonl and the
train/val/test splits). Nothing below is required to build or use it.
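To illustrate what "trust-level" splitting means in stage 4, here is a minimal sketch that assigns every sample of a trust to the same split by hashing its `cik`. This is an assumption about one way to get a leak-free 80/10/10 split; the actual `split` command may shuffle and slice instead, and the helper names `split_for_trust` and `split_samples` are hypothetical.

```python
import hashlib


def split_for_trust(cik: str) -> str:
    """Deterministically map a trust identifier to train/val/test (~80/10/10)."""
    digest = hashlib.sha256(cik.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10
    if bucket < 8:
        return "train"
    return "val" if bucket == 8 else "test"


def split_samples(samples):
    """Group samples by split; all samples sharing a cik land in the same split."""
    out = {"train": [], "val": [], "test": []}
    for sample in samples:
        out[split_for_trust(str(sample["cik"]))].append(sample)
    return out
```

Because the split depends only on the trust identifier, no trust can contribute text to both training and evaluation, and re-running the split on a larger build keeps earlier assignments stable. The proportions are only approximately 80/10/10 for a small number of trusts.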
Scoring (optional)
score_baseline.py scores any model's predictions against
the N-CEN gold (triple-level P/R/F1, per relation), and also provides a no-model
lexical lower bound:
python score_baseline.py stringmatch # no-model lexical lower bound
python score_baseline.py model --pred preds.jsonl
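For readers unfamiliar with triple-level scoring, the following sketch shows how precision, recall and F1 can be computed over sets of (subject, predicate, object) tuples, overall and per relation. It assumes exact string equality between predicted and gold triples; score_baseline.py may normalize names or match more leniently, and the function names here are hypothetical.

```python
from collections import defaultdict


def prf(gold: set, pred: set) -> tuple:
    """Precision, recall and F1 over two sets of (s, p, o) tuples."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def per_relation(gold: set, pred: set) -> dict:
    """Score each predicate separately; a triple's predicate is its second element."""
    gold_by_rel, pred_by_rel = defaultdict(set), defaultdict(set)
    for triple in gold:
        gold_by_rel[triple[1]].add(triple)
    for triple in pred:
        pred_by_rel[triple[1]].add(triple)
    relations = set(gold_by_rel) | set(pred_by_rel)
    return {rel: prf(gold_by_rel[rel], pred_by_rel[rel]) for rel in relations}
```

Per-relation scores matter here because agreement differs widely between relations, so a single aggregate F1 can hide a relation that is rarely recoverable from the text.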
Optional: LLM extractability check
This is a diagnostic / quality-assurance tool only — it is not part of the dataset build and is not needed to train on or use the dataset.
A target triple is only useful if the fact is actually present in the input text.
Whether that holds is a semantic question that lexical substring/keyword matching
cannot answer reliably (a brand name like "John Hancock" matches the fund heading,
not the administrator role; "administration services are provided by X" uses no
fixed keyword). llm_extract.py therefore lets a strong
open-source instruct model (local, via Ollama; default qwen3.6:35b) read each
segment + ontology and extract the triples it can find. What it extracts is, by
construction, present in the text — so its output doubles as (a) a strong
extraction baseline and (b) a semantic extractability check on the gold.
# requires a running Ollama server with the model pulled
python llm_extract.py --in data/rdf_poc/test.jsonl --out data/rdf_poc/preds_qwen.jsonl
python score_baseline.py model --pred data/rdf_poc/preds_qwen.jsonl
Long inputs → sliding window. A single full-book input (0.5–1 MB) is slow in
one huge-context call and suffers from "lost in the middle". llm_extract.py
therefore slides an overlapping window over any text longer than --window
(default 40 KB, --overlap 8 KB), extracts per window, and unions the
de-duplicated triples — so service-provider facts that live only in the SAI
section near the end are reliably seen by some window. This windowing is an
inference strategy for the check, not a transformation of the dataset (the
stored input_text always remains the full text).
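The windowing strategy can be sketched as follows. This is an illustration rather than the code in llm_extract.py: it assumes window and overlap are measured in characters, and `windows`, `extract_union` and the `extract_fn` callback (standing in for one LLM call per window) are hypothetical names.

```python
def windows(text: str, window: int = 40_000, overlap: int = 8_000):
    """Yield overlapping slices of text; short texts are yielded unchanged."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    if len(text) <= window:
        yield text
        return
    step = window - overlap
    start = 0
    while True:
        yield text[start:start + window]
        if start + window >= len(text):
            break  # the last window reached the end of the text
        start += step


def extract_union(text: str, extract_fn, window: int = 40_000, overlap: int = 8_000):
    """Run extract_fn on every window and union the triples, keeping first-seen order."""
    seen, merged = set(), []
    for chunk in windows(text, window, overlap):
        for triple in extract_fn(chunk):
            key = tuple(triple)
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged
```

The overlap is what keeps a sentence that straddles a window boundary intact in at least one window, as long as the sentence is shorter than the overlap.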
Granularity is what determines agreement. Scored against the small per-fund
summary segment, only advisedBy/subAdvisedBy are reliably present (~0.5–0.6
recall) and custodian collapses to ~0.05 — the motivating Cambria case, where
the foreign sub-custodians filed in N-CEN never appear in the summary. Scored
against the full single-book trust prospectus+SAI, the gold names are present
~90% overall (custodian 0.81, administrator/adviser 0.94, seriesOf 0.98): the
information was never missing, the per-fund segment was simply the wrong unit. The
genuinely non-text facts (foreign sub-custodians) stay correctly absent, so they
should not be training targets. This is why the recommended build is one sample
per trust from one full book.
Sample format
Each line of samples.jsonl / train|val|test.jsonl is a JSON record:
| field | meaning |
|---|---|
| input_text | prospectus prose for the fund (model input) |
| ontology | inferred meta-schema (subject type → predicate → object type) |
| target_triples | structured {s,p,o,grounded} list (grounded = object name appears in input_text) |
| target_serialized | marker form (<triple_start> …) for Models 2/4 |
| target_serialized_plain | Turtle-like form, no special tokens, for Models 1/3 |
| cik, series_id, fund, trust_name | identifiers |
| stats | input/target sizes, triple count, n_grounded, text:json ratio |
Relations
Entity-to-entity edges (gold from N-CEN / Series-Class):
seriesOf, advisedBy, subAdvisedBy, transferAgent, administrator,
underwrittenBy.
custodian is dropped by default (--custodian-scope none). Foreign sub-custodians
appear only in the structured N-CEN table and in no prose document, and the
summary prospectus says only "the custodian", so these names are not extractable
from the summary text. The primary custodian is named only in the SAI (N-1A
Part B), so it is extractable only when the SAI is part of the input. Use
--custodian-scope primary or all to re-include it when the SAI is included, as
in the gold command under Pipeline.
Prose-grounding: every triple carries a grounded flag (object name present
in the sample's input). Across the full build ~80 % of triples are grounded
(per relation: advisedBy 93 %, seriesOf/subAdvisedBy/administrator 80–84 %,
transferAgent 72 %, underwrittenBy 62 %). Filter on grounded to train/evaluate
only on text-extractable targets.
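Filtering on the grounded flag can be done in a few lines. The sketch below assumes that each entry of target_triples is a dict with the keys s, p, o and a boolean grounded, as the sample-format table suggests; the helper names are hypothetical.

```python
from collections import Counter


def keep_grounded(sample: dict) -> dict:
    """Return a copy of the sample whose target_triples are all grounded."""
    kept = [t for t in sample["target_triples"] if t.get("grounded")]
    return {**sample, "target_triples": kept}


def grounded_rate_by_relation(samples) -> dict:
    """Fraction of grounded triples per predicate across an iterable of samples."""
    total, grounded = Counter(), Counter()
    for sample in samples:
        for triple in sample["target_triples"]:
            total[triple["p"]] += 1
            grounded[triple["p"]] += bool(triple.get("grounded"))
    return {rel: grounded[rel] / total[rel] for rel in total}
```

Note that `keep_grounded` only rewrites target_triples. The serialized targets (target_serialized, target_serialized_plain) would have to be regenerated from the filtered list before training on them.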
Holdings edges (holds/issuedBy/domiciledIn, gold from N-PORT) are a planned
second track from annual-report (N-CSR) commentary — see the description PDF.
Data sources
All inputs are public SEC filings (EDGAR, DERA data sets). The large raw data
(prospectus prose, bulk N-CEN/N-PORT/XBRL downloads) and the generated sample
files are git-ignored because they are reproducible from the commands above;
only the lightweight structured gold graph (data/rdf_poc/gold_graphs.jsonl) is
committed.
Requirements
pip install -r requirements.txt