Florian Herzog 00f51859e0 Drop non-extractable custodian relation; add per-triple grounded flag

Custodian names (esp. foreign sub-custodians) appear only in structured N-CEN,
never in the prospectus prose, so they are not a valid text->triple target.
Per-fund the custodian object name occurs in only 28% of segments, the weakest
of all relations. Default is now --custodian-scope none.

Every triple now carries a 'grounded' boolean (object name present in the
sample's input text); 80% of triples are grounded across the full build. This
lets training/eval restrict to text-extractable targets.

- build_rdf_dataset.py: annotate_grounding() + grounded flag in samples/stats
- gold rebuilt without custodian (15,739 -> 12,694 edges)
- dataset_description + README updated (custodian dropped, grounding documented)

Reported by thesis author: Citibank custodians in triples for 0001529390 never
appear in that prospectus text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-05 10:34:14 +02:00

data

Drop non-extractable custodian relation; add per-triple grounded flag

2026-06-05 10:34:14 +02:00

.gitignore

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

build_rdf_dataset.py

Drop non-extractable custodian relation; add per-triple grounded flag

2026-06-05 10:34:14 +02:00

dataset_description.pdf

Drop non-extractable custodian relation; add per-triple grounded flag

2026-06-05 10:34:14 +02:00

dataset_description.tex

Drop non-extractable custodian relation; add per-triple grounded flag

2026-06-05 10:34:14 +02:00

documentation.pdf

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

ESMA_FUND_DATA_RESEARCH.md

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

fetch_filings.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

fetch_universe.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

fund_db.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

LICENSE

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

load_ncen.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

load_nport.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

load_xbrl_rr.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

OPENFUNDS_PUBLIC_DATA_SOURCES.md

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

pipeline.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

README.md

Drop non-extractable custodian relation; add per-triple grounded flag

2026-06-05 10:34:14 +02:00

requirements.txt

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

score_baseline.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

SEC_FUND_DATA_RESEARCH.md

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

sec_fund_fetcher.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

SEC_REFERENCE_DATA_vs_OPENFUNDS.md

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

README.md

fund_rdf_data

A relationship-rich finance dataset for text-to-RDF-triple extraction, built from mandatory U.S. SEC fund disclosures. Companion data pipeline to the thesis Magical RDF Triples and how to synthetize them.

Each sample pairs a long natural-language prospectus section (input) with a compact graph of entity-to-entity RDF triples (target) — a fund advised by a manager, distributed by an underwriter, seriesOf a trust, and so on. Unlike Wikidata-derived corpora where text ≈ triples, here the input is ~20–50× larger than the output, and the target is a genuine knowledge graph rather than flat attributes. Ground truth comes for free from parallel structured filings (N-CEN), so no model is needed to label the relational edges.

See dataset_description.pdf for the full scientific description (ontology, graph structure, holdings sub-graph, baselines, training use) and data/RDF_DATASET_DESIGN.md for design notes.

Pipeline

The dataset is built by build_rdf_dataset.py in four stages:

# 1. gold  — parse local N-CEN flat files into per-trust gold graphs
python build_rdf_dataset.py gold --custodian-scope none

# 2. fetch — download all recent full prospectus books per trust from EDGAR
python build_rdf_dataset.py fetch --limit 435 --max-filings 8

# 3. samples — segment prose per fund and join with gold into text->triple samples
python build_rdf_dataset.py samples

# 4. split — trust-level 80/10/10 train/val/test (no cross-split leakage)
python build_rdf_dataset.py split

# or run all four:
python build_rdf_dataset.py all --limit 435
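The trust-level split in stage 4 can be illustrated with a minimal sketch. It is not the logic of build_rdf_dataset.py: it assumes the `cik` field identifies the trust and uses a deterministic hash bucket, so the proportions are only approximately 80/10/10. The helper names `assign_split` and `split_samples` are hypothetical.

```python
import hashlib


def assign_split(cik: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a trust identifier to a split via a deterministic hash bucket."""
    digest = hashlib.sha256(cik.encode("utf-8")).hexdigest()
    # First 8 hex digits give an integer in [0, 2**32); scale it to [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


def split_samples(samples):
    """Group samples by split; every sample of one trust lands in the same split."""
    splits = {"train": [], "val": [], "test": []}
    for sample in samples:
        # Assumption: sample["cik"] is the trust-level key.
        splits[assign_split(sample["cik"])].append(sample)
    return splits
```

Because the split is decided per trust rather than per sample, funds that share a trust (and therefore much of their prospectus text) cannot end up on both sides of a train/test boundary.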

score_baseline.py computes a no-model string-match baseline and scores strong-model predictions against the gold:

python score_baseline.py stringmatch          # no-model lower bound
python score_baseline.py model --pred preds.jsonl
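For orientation, scoring predicted triples against gold can be sketched as set-based precision, recall and F1. This is an assumption about a reasonable metric, not a description of what score_baseline.py computes; the normalisation in `triple_key` and the example entity names are hypothetical.

```python
def triple_key(triple):
    """Normalise a {s, p, o} triple to a comparable tuple (case-insensitive names)."""
    return (triple["s"].strip().lower(), triple["p"], triple["o"].strip().lower())


def prf1(gold, pred):
    """Return (precision, recall, F1) over the sets of gold and predicted triples."""
    gold_set = {triple_key(t) for t in gold}
    pred_set = {triple_key(t) for t in pred}
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: one of two predictions matches one of two gold triples.
gold = [
    {"s": "Example Fund", "p": "advisedBy", "o": "Example Advisors LLC"},
    {"s": "Example Fund", "p": "seriesOf", "o": "Example Trust"},
]
pred = [
    {"s": "Example Fund", "p": "advisedBy", "o": "example advisors llc"},
    {"s": "Example Fund", "p": "underwrittenBy", "o": "Example Distributors"},
]
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```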

Sample format

Each line of samples.jsonl / train|val|test.jsonl is a JSON record:

| field | meaning |
|---|---|
| `input_text` | prospectus prose for the fund (model input) |
| `ontology` | inferred meta-schema (subject type → predicate → object type) |
| `target_triples` | structured `{s,p,o,grounded}` list (`grounded` = object name appears in `input_text`) |
| `target_serialized` | marker form (`<triple_start>` …) for Models 2/4 |
| `target_serialized_plain` | Turtle-like form, no special tokens, for Models 1/3 |
| `cik`, `series_id`, `fund`, `trust_name` | identifiers |
| `stats` | input/target sizes, triple count, `n_grounded`, text:json ratio |
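A minimal sketch of reading these records and keeping only the grounded triples, assuming the field names listed above. The helpers `parse_jsonl` and `grounded_only` are hypothetical and not part of the repository.

```python
import json


def parse_jsonl(lines):
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def grounded_only(sample):
    """Return a copy of the sample whose target_triples are all grounded."""
    kept = [t for t in sample["target_triples"] if t.get("grounded")]
    return {**sample, "target_triples": kept}


# Usage (the path is an assumption; adjust it to where the split files were written):
# with open("train.jsonl", encoding="utf-8") as fh:
#     train = [grounded_only(s) for s in parse_jsonl(fh)]
```

Note that this sketch filters only `target_triples`; the serialized target fields would have to be regenerated separately to match.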

Relations

Entity-to-entity edges (gold from N-CEN / Series-Class): seriesOf, advisedBy, subAdvisedBy, transferAgent, administrator, underwrittenBy.

custodian is dropped by default (--custodian-scope none). Custodian names, especially those of foreign sub-custodians, appear only in the structured N-CEN table and not in the prospectus prose used as input (the summary prospectus says only "the custodian"), so they cannot be extracted from the text. The primary custodian is named only in the separately filed SAI (N-1A Part B), which is not part of the input. Use --custodian-scope primary or all to re-include the relation if you add the SAI as input.

Prose-grounding: every triple carries a grounded flag (object name present in the sample's input). Across the full build ~80 % of triples are grounded (per relation: advisedBy 93 %, seriesOf/subAdvisedBy/administrator 80–84 %, transferAgent 72 %, underwrittenBy 62 %). Filter on grounded to train/evaluate only on text-extractable targets.
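The grounding check and the per-relation rates can be sketched as follows. This is an illustration only: it assumes grounding means a case-insensitive, whitespace-normalised substring match of the object name in the input text, which may differ from the matching rule used by build_rdf_dataset.py. The helper names and example entities are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lower-case and collapse whitespace so line breaks do not block a match."""
    return " ".join(text.lower().split())


def mark_grounded(triples, input_text):
    """Return copies of the triples with a boolean 'grounded' flag set."""
    haystack = normalize(input_text)
    return [{**t, "grounded": normalize(t["o"]) in haystack} for t in triples]


def grounded_rate_by_relation(triples):
    """Share of grounded triples per predicate."""
    total, hit = Counter(), Counter()
    for t in triples:
        total[t["p"]] += 1
        hit[t["p"]] += bool(t["grounded"])
    return {p: hit[p] / total[p] for p in total}


text = "The Fund is advised by Example Advisors\nLLC. The custodian holds the assets."
triples = [
    {"s": "Example Fund", "p": "advisedBy", "o": "Example Advisors LLC"},
    {"s": "Example Fund", "p": "transferAgent", "o": "Example Transfer Co"},
]
flagged = mark_grounded(triples, text)
print(grounded_rate_by_relation(flagged))  # {'advisedBy': 1.0, 'transferAgent': 0.0}
```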

Holdings edges (holds/issuedBy/domiciledIn, gold from N-PORT) are a planned second track, with input text drawn from annual-report (N-CSR) commentary; see the description PDF.

Data sources

All inputs are public SEC filings (EDGAR, DERA data sets). The large raw data (prospectus prose, bulk N-CEN/N-PORT/XBRL downloads) and the generated sample files are git-ignored because they are reproducible from the commands above; only the lightweight structured gold graph (data/rdf_poc/gold_graphs.jsonl) is committed.

Requirements

pip install -r requirements.txt
