# fund_rdf_data

A relationship-rich **finance dataset for text-to-RDF-triple extraction**, built
from mandatory U.S. SEC fund disclosures. Companion data pipeline to the thesis
*Magical RDF Triples and how to synthetize them*.

Each sample pairs a long natural-language **prospectus section** (input) with a
compact graph of **entity-to-entity RDF triples** (target) — a fund *advised by*
a manager, *distributed by* an underwriter, *seriesOf* a trust, and so on. Unlike
Wikidata-derived corpora, where the text is roughly the same size as its triples, here the input is ~20–50× larger
than the output, and the target is a genuine knowledge graph rather than flat
attributes. Ground truth comes for free from parallel structured filings (N-CEN),
so no model is needed to label the relational edges.

See [`dataset_description.pdf`](dataset_description.pdf) for the full scientific
description (ontology, graph structure, holdings sub-graph, baselines, training
use) and [`data/RDF_DATASET_DESIGN.md`](data/RDF_DATASET_DESIGN.md) for design
notes.

## Pipeline

The dataset is built by [`build_rdf_dataset.py`](build_rdf_dataset.py) in four stages:

```bash
# 1. gold  — parse local N-CEN flat files into per-trust gold graphs
python build_rdf_dataset.py gold --custodian-scope primary

# 2. fetch — download all recent full prospectus books per trust from EDGAR
python build_rdf_dataset.py fetch --limit 435 --max-filings 8

# 3. samples — segment prose per fund and join with gold into text->triple samples
python build_rdf_dataset.py samples

# 4. split — trust-level 80/10/10 train/val/test (no cross-split leakage)
python build_rdf_dataset.py split

# or run all four:
python build_rdf_dataset.py all --limit 435
```
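The trust-level split in stage 4 can be illustrated with a minimal sketch. This is not the code in `build_rdf_dataset.py`: the function name `split_by_trust`, the `seed` parameter, and the choice of `cik` as the trust key are assumptions made for illustration. The point is that whole trusts, not individual samples, are assigned to a split, so no trust appears in more than one of train/val/test.

```python
import random
from collections import defaultdict


def split_by_trust(samples, seed=0, fractions=(0.8, 0.1, 0.1)):
    """Assign whole trusts to train/val/test (hypothetical helper).

    Assumes each sample is a dict with a `cik` field identifying its trust.
    """
    # Group samples by trust so a trust can never straddle two splits.
    by_trust = defaultdict(list)
    for sample in samples:
        by_trust[sample["cik"]].append(sample)

    # Shuffle the trust identifiers deterministically.
    trusts = sorted(by_trust)
    random.Random(seed).shuffle(trusts)

    # Cut the shuffled trust list at the 80% and 90% marks.
    n_train = int(len(trusts) * fractions[0])
    n_val = int(len(trusts) * fractions[1])
    groups = {
        "train": trusts[:n_train],
        "val": trusts[n_train:n_train + n_val],
        "test": trusts[n_train + n_val:],
    }
    return {
        name: [s for cik in ciks for s in by_trust[cik]]
        for name, ciks in groups.items()
    }
```

Because the cut is made over trusts, the sample counts per split only approximate 80/10/10 when trusts contain different numbers of funds.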

[`score_baseline.py`](score_baseline.py) computes a no-model string-match baseline
and scores strong-model predictions against the gold graphs:

```bash
python score_baseline.py stringmatch          # no-model lower bound
python score_baseline.py model --pred preds.jsonl
```
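For orientation, scoring predicted triples against gold triples can be sketched as a micro-averaged exact-match precision/recall/F1. This is an assumption about what such a score might look like, not a description of `score_baseline.py`: the helper names `triple_set` and `micro_prf` and the lowercase normalisation of subjects and objects are hypothetical.

```python
def triple_set(triples):
    """Turn a list of {s, p, o} dicts into a set of comparable tuples.

    Subjects and objects are lowercased and stripped; this normalisation
    is an illustrative assumption, not the repository's rule.
    """
    return {
        (t["s"].strip().lower(), t["p"], t["o"].strip().lower())
        for t in triples
    }


def micro_prf(gold_samples, pred_samples):
    """Micro-averaged precision, recall and F1 over aligned samples."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_samples, pred_samples):
        g, p = triple_set(gold), triple_set(pred)
        tp += len(g & p)   # triples predicted and present in gold
        fp += len(p - g)   # triples predicted but not in gold
        fn += len(g - p)   # gold triples the prediction missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Exact matching is strict: a predicted adviser name that differs from the gold string by a suffix such as "LLC" counts as both a false positive and a false negative.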

## Sample format

Each line of `samples.jsonl` / `train|val|test.jsonl` is a JSON record:

| field | meaning |
|---|---|
| `input_text` | prospectus prose for the fund (model input) |
| `ontology` | inferred meta-schema (subject type → predicate → object type) |
| `target_triples` | structured `{s,p,o}` list |
| `target_serialized` | marker form (`<triple_start>` …) for Models 2/4 |
| `target_serialized_plain` | Turtle-like form, no special tokens, for Models 1/3 |
| `cik`, `series_id`, `fund`, `trust_name` | identifiers |
| `stats` | input/target sizes, triple count, text:json ratio |
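A minimal sketch of reading these records, assuming only the fields listed in the table and that each entry of `target_triples` is a dict with keys `s`, `p`, and `o`. The helper names `load_samples` and `predicate_counts` are hypothetical, and the file path is whatever location the pipeline wrote the split files to.

```python
import json
from collections import Counter


def load_samples(path):
    """Read one JSON record per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def predicate_counts(samples):
    """Count how often each predicate occurs across all target triples."""
    return Counter(
        triple["p"]
        for sample in samples
        for triple in sample["target_triples"]
    )
```

Calling `predicate_counts(load_samples(path))` on a split file gives a quick view of how the relations are distributed before training.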

## Relations

Entity-to-entity edges (gold from N-CEN / Series-Class):
`seriesOf`, `advisedBy`, `subAdvisedBy`, `transferAgent`, `custodian` (primary
only by default), `administrator`, `underwrittenBy`. Holdings edges
(`holds`/`issuedBy`/`domiciledIn`, gold from N-PORT) are a planned second track,
with annual-report (N-CSR) commentary as the input text; see the description PDF.

## Data sources

All inputs are public SEC filings (EDGAR, DERA data sets). The large raw data
(prospectus prose, bulk N-CEN/N-PORT/XBRL downloads) and the generated sample
files are **git-ignored** because they are reproducible from the commands above;
only the lightweight structured gold graph (`data/rdf_poc/gold_graphs.jsonl`) is
committed.

## Requirements

```bash
pip install -r requirements.txt
```