# fund_rdf_data

A relationship-rich **finance dataset for text-to-RDF-triple extraction**, built from mandatory U.S. SEC fund disclosures. Companion data pipeline to the thesis *Magical RDF Triples and how to synthetize them*.

Each sample pairs a long natural-language **prospectus section** (input) with a compact graph of **entity-to-entity RDF triples** (target): a fund *advised by* a manager, *distributed by* an underwriter, *seriesOf* a trust, and so on. Unlike Wikidata-derived corpora where text ≈ triples, here the input is ~20–50× larger than the output, and the target is a genuine knowledge graph rather than flat attributes. Ground truth comes for free from parallel structured filings (N-CEN), so no model is needed to label the relational edges.

See [`dataset_description.pdf`](dataset_description.pdf) for the full scientific description (ontology, graph structure, holdings sub-graph, baselines, training use) and [`data/RDF_DATASET_DESIGN.md`](data/RDF_DATASET_DESIGN.md) for design notes.

## Pipeline

The dataset is built by [`build_rdf_dataset.py`](build_rdf_dataset.py) in four stages:

```bash
# 1. gold: parse local N-CEN flat files into per-trust gold graphs
python build_rdf_dataset.py gold --custodian-scope primary

# 2. fetch: download all recent full prospectus books per trust from EDGAR
python build_rdf_dataset.py fetch --limit 435 --max-filings 8

# 3. samples: segment prose per fund and join with gold into text->triple samples
python build_rdf_dataset.py samples

# 4. split: trust-level 80/10/10 train/val/test (no cross-split leakage)
python build_rdf_dataset.py split

# or run all four:
python build_rdf_dataset.py all --limit 435
```

[`score_baseline.py`](score_baseline.py) computes a no-model string-match baseline and scores strong-model predictions against the gold:

```bash
python score_baseline.py stringmatch   # no-model lower bound
python score_baseline.py model --pred preds.jsonl
```

## Sample format

Each line of `samples.jsonl` / `train|val|test.jsonl` is a JSON record:

| field | meaning |
|---|---|
| `input_text` | prospectus prose for the fund (model input) |
| `ontology` | inferred meta-schema (subject type → predicate → object type) |
| `target_triples` | structured `{s,p,o}` list |
| `target_serialized` | marker form (`` …) for Models 2/4 |
| `target_serialized_plain` | Turtle-like form, no special tokens, for Models 1/3 |
| `cik`, `series_id`, `fund`, `trust_name` | identifiers |
| `stats` | input/target sizes, triple count, text:json ratio |

## Relations

Entity-to-entity edges (gold from N-CEN / Series-Class): `seriesOf`, `advisedBy`, `subAdvisedBy`, `transferAgent`, `custodian` (primary only by default), `administrator`, `underwrittenBy`.

Holdings edges (`holds`/`issuedBy`/`domiciledIn`, gold from N-PORT) are a planned second track from annual-report (N-CSR) commentary; see the description PDF.

## Data sources

All inputs are public SEC filings (EDGAR, DERA data sets). The large raw data (prospectus prose, bulk N-CEN/N-PORT/XBRL downloads) and the generated sample files are **git-ignored** because they are reproducible from the commands above; only the lightweight structured gold graph (`data/rdf_poc/gold_graphs.jsonl`) is committed.

## Requirements

```bash
pip install -r requirements.txt
```
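
## Example: inspecting samples

The sample format and the trust-level split described above can be spot-checked with a few lines of Python. The following is a minimal sketch, not part of the repository: the helper names (`read_jsonl`, `text_to_target_ratio`, `leaked_trusts`) are hypothetical, the demo records are invented for illustration, and it assumes that each triple is a dict with keys `s`, `p`, `o` and that `cik` identifies the trust.

```python
import json
from collections import defaultdict
from pathlib import Path


def read_jsonl(path):
    """Yield one JSON record per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def text_to_target_ratio(record):
    """Characters of input prose per character of JSON-encoded triples."""
    target = json.dumps(record["target_triples"], ensure_ascii=False)
    return len(record["input_text"]) / max(len(target), 1)


def leaked_trusts(splits):
    """Return CIKs that occur in more than one split (ideally an empty list).

    Assumption: `cik` identifies the trust, so a trust-level split
    should never place the same CIK in two splits.
    """
    seen = defaultdict(set)
    for split_name, records in splits.items():
        for record in records:
            seen[record["cik"]].add(split_name)
    return sorted(cik for cik, names in seen.items() if len(names) > 1)


# Hypothetical records, invented for illustration only (not real filings).
demo = {
    "train": [
        {
            "cik": "0000000001",
            "input_text": "The Example Fund is a series of Example Trust. " * 20,
            "target_triples": [
                {"s": "Example Fund", "p": "seriesOf", "o": "Example Trust"},
                {"s": "Example Fund", "p": "advisedBy", "o": "Example Advisers LLC"},
            ],
        }
    ],
    "test": [
        {"cik": "0000000002", "input_text": "...", "target_triples": []},
    ],
}

# Print each triple of the first training record as "subject predicate object".
for triple in demo["train"][0]["target_triples"]:
    print(triple["s"], triple["p"], triple["o"])

print("leaked CIKs:", leaked_trusts(demo))
```

To run the same check on generated files, something like `{name: list(read_jsonl(f"{name}.jsonl")) for name in ("train", "val", "test")}` could replace `demo`; the directory containing those files is not stated here, so adjust the path to wherever the `split` stage writes its output.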