Florian Herzog 1993658fb2 Add SEC fund prospectus -> RDF triple dataset pipeline
Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures, the dataset for the thesis 'Magical RDF Triples and
how to synthetize them'.

- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
  all books per trust), samples (per-fund segmentation, marker + plain
  serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:31:35 +02:00

fund_rdf_data

A relationship-rich finance dataset for text-to-RDF-triple extraction, built from mandatory U.S. SEC fund disclosures. Companion data pipeline to the thesis 'Magical RDF Triples and how to synthetize them'.

Each sample pairs a long natural-language prospectus section (input) with a compact graph of entity-to-entity RDF triples (target): for example, a fund advised by a manager, distributed by an underwriter, or seriesOf a trust. Unlike Wikidata-derived corpora where text ≈ triples, here the input is ~2050× larger than the output, and the target is a genuine knowledge graph rather than a list of flat attributes. Ground truth comes from parallel structured filings (N-CEN), so no model is needed to label the relational edges.

See dataset_description.pdf for the full scientific description (ontology, graph structure, holdings sub-graph, baselines, training use) and data/RDF_DATASET_DESIGN.md for design notes.

Pipeline

The dataset is built by build_rdf_dataset.py in four stages:

# 1. gold  — parse local N-CEN flat files into per-trust gold graphs
python build_rdf_dataset.py gold --custodian-scope primary

# 2. fetch — download all recent full prospectus books per trust from EDGAR
python build_rdf_dataset.py fetch --limit 435 --max-filings 8

# 3. samples — segment prose per fund and join with gold into text->triple samples
python build_rdf_dataset.py samples

# 4. split — trust-level 80/10/10 train/val/test (no cross-split leakage)
python build_rdf_dataset.py split

# or run all four:
python build_rdf_dataset.py all --limit 435
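
The trust-level split in stage 4 can be sketched as follows. This is a minimal illustration, not the code in build_rdf_dataset.py: it assumes the cik field identifies the trust, and the function names assign_split and split_samples are hypothetical. Hashing the identifier sends every fund of a trust to the same split, which is what rules out cross-split leakage; the resulting proportions are only approximately 80/10/10.

```python
import hashlib


def assign_split(cik: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a trust identifier to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(cik.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def split_samples(samples):
    """Group sample dicts by split, keyed on the trust-level 'cik' field."""
    splits = {"train": [], "val": [], "test": []}
    for sample in samples:
        # Assumption: all funds of one trust share the same cik.
        splits[assign_split(str(sample["cik"]))].append(sample)
    return splits
```

Because the assignment depends only on the identifier, re-running the split after adding new trusts does not move existing trusts between splits.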

score_baseline.py computes a no-model string-match baseline and scores strong-model predictions against the gold graphs:

python score_baseline.py stringmatch          # no-model lower bound
python score_baseline.py model --pred preds.jsonl
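
One plausible way to score predictions against gold is exact-match precision, recall, and F1 over sets of triples. The sketch below is an assumption about the general approach, not the metric score_baseline.py implements; the helper names normalize and score are hypothetical, and the only normalization applied is lowercasing and whitespace collapsing.

```python
def normalize(triple):
    """Lowercase and collapse whitespace in each of the s, p, o fields."""
    return tuple(" ".join(str(triple[k]).lower().split()) for k in ("s", "p", "o"))


def score(gold_triples, pred_triples):
    """Exact-match precision/recall/F1 between two lists of {s,p,o} dicts."""
    gold = {normalize(t) for t in gold_triples}
    pred = {normalize(t) for t in pred_triples}
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Exact matching is strict: a prediction that names the right adviser with a slightly different legal suffix counts as wrong, so a real scorer may need entity-name canonicalization on top of this.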

Sample format

Each line of samples.jsonl / train|val|test.jsonl is a JSON record:

| field | meaning |
| --- | --- |
| input_text | prospectus prose for the fund (model input) |
| ontology | inferred meta-schema (subject type → predicate → object type) |
| target_triples | structured {s,p,o} list |
| target_serialized | marker form (<triple_start> …) for Models 2/4 |
| target_serialized_plain | Turtle-like form, no special tokens, for Models 1/3 |
| cik, series_id, fund, trust_name | identifiers |
| stats | input/target sizes, triple count, text:json ratio |
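
A record can be read and sanity-checked with the standard library alone. The sketch below assumes only the input_text and target_triples fields listed above; the helpers iter_records and text_to_json_ratio are hypothetical, and the ratio shown (characters of prose per character of JSON-encoded triples) is one possible definition that may differ from the value stored in stats.

```python
import json


def iter_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def text_to_json_ratio(record):
    """Characters of input prose per character of JSON-encoded target triples."""
    target_json = json.dumps(record["target_triples"], ensure_ascii=False)
    return len(record["input_text"]) / max(len(target_json), 1)
```

Typical use is to pass an open file handle, for example `for record in iter_records(open("train.jsonl", encoding="utf-8"))`, and inspect the ratio per record.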

Relations

Entity-to-entity edges (gold from N-CEN / Series-Class): seriesOf, advisedBy, subAdvisedBy, transferAgent, custodian (primary only by default), administrator, underwrittenBy. Holdings edges (holds/issuedBy/domiciledIn, gold from N-PORT) are a planned second track based on annual-report (N-CSR) commentary; see the description PDF.
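
When consuming target_triples, it can help to check each predicate against the entity-to-entity relation list above before building a graph. The following is a small sketch under that assumption; ENTITY_PREDICATES simply restates the relations named in this section, and group_by_predicate is a hypothetical helper rather than part of the pipeline.

```python
from collections import defaultdict

# Entity-to-entity relations as listed in this README.
ENTITY_PREDICATES = {
    "seriesOf", "advisedBy", "subAdvisedBy", "transferAgent",
    "custodian", "administrator", "underwrittenBy",
}


def group_by_predicate(triples):
    """Return {predicate: [(subject, object), ...]}, rejecting unknown predicates."""
    grouped = defaultdict(list)
    for triple in triples:
        predicate = triple["p"]
        if predicate not in ENTITY_PREDICATES:
            raise ValueError("unexpected predicate: %r" % predicate)
        grouped[predicate].append((triple["s"], triple["o"]))
    return dict(grouped)
```

The planned holdings predicates (holds, issuedBy, domiciledIn) are deliberately left out of the set, so they would be flagged until that track exists.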

Data sources

All inputs are public SEC filings (EDGAR, DERA data sets). The large raw data (prospectus prose, bulk N-CEN/N-PORT/XBRL downloads) and the generated sample files are git-ignored because they are reproducible from the commands above; only the lightweight structured gold graph (data/rdf_poc/gold_graphs.jsonl) is committed.

Requirements

pip install -r requirements.txt