Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures, the dataset for the thesis 'Magical RDF Triples and
how to synthetize them'.
- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
all books per trust), samples (per-fund segmentation, marker + plain
serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# fund_rdf_data

A relationship-rich **finance dataset for text-to-RDF-triple extraction**, built
from mandatory U.S. SEC fund disclosures. Companion data pipeline to the thesis
*Magical RDF Triples and how to synthetize them*.

Each sample pairs a long natural-language **prospectus section** (input) with a
compact graph of **entity-to-entity RDF triples** (target): a fund *advised by*
a manager, *distributed by* an underwriter, *seriesOf* a trust, and so on. Unlike
Wikidata-derived corpora where text ≈ triples, here the input is ~20–50× larger
than the output, and the target is a genuine knowledge graph rather than flat
attributes. Ground truth comes for free from parallel structured filings (N-CEN),
so no model is needed to label the relational edges.

See [`dataset_description.pdf`](dataset_description.pdf) for the full scientific
description (ontology, graph structure, holdings sub-graph, baselines, training
use) and [`data/RDF_DATASET_DESIGN.md`](data/RDF_DATASET_DESIGN.md) for design
notes.

## Pipeline

The dataset is built by [`build_rdf_dataset.py`](build_rdf_dataset.py) in four stages:

```bash
# 1. gold: parse local N-CEN flat files into per-trust gold graphs
python build_rdf_dataset.py gold --custodian-scope primary

# 2. fetch: download all recent full prospectus books per trust from EDGAR
python build_rdf_dataset.py fetch --limit 435 --max-filings 8

# 3. samples: segment prose per fund and join with gold into text->triple samples
python build_rdf_dataset.py samples

# 4. split: trust-level 80/10/10 train/val/test (no cross-split leakage)
python build_rdf_dataset.py split

# or run all four:
python build_rdf_dataset.py all --limit 435
```
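Stage 4 splits at the trust level so that funds sharing a prospectus book never land on both sides of a split. The sketch below shows one way such a split could be done; it is not the logic in `build_rdf_dataset.py`. It assumes each record carries the `cik` field and that `cik` identifies the trust, and it buckets a hash of that value so the assignment is deterministic. `assign_split` and `split_records` are hypothetical helpers, and hash bucketing only approximates 80/10/10.

```python
import hashlib


def assign_split(cik, train=0.8, val=0.1):
    """Map a trust identifier to 'train', 'val' or 'test' deterministically."""
    digest = hashlib.sha256(str(cik).encode("utf-8")).hexdigest()
    # First 32 bits of the hash, scaled to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"


def split_records(records):
    """Group records by split; all records of one trust share a split."""
    splits = {"train": [], "val": [], "test": []}
    for rec in records:
        # Assumption: rec["cik"] is the trust-level key.
        splits[assign_split(rec["cik"])].append(rec)
    return splits
```

Because the split depends only on the trust key, adding new funds to an existing trust later cannot move that trust into a different split.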
[`score_baseline.py`](score_baseline.py) computes a no-model string-match baseline
and scores strong-model predictions against the gold:

```bash
python score_baseline.py stringmatch # no-model lower bound
python score_baseline.py model --pred preds.jsonl
```
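How `score_baseline.py` matches predicted triples to gold is not spelled out here. As a rough illustration of triple-level scoring, the sketch below computes precision, recall and F1 over sets of `{s,p,o}` triples. Exact matching after case and whitespace normalization is an assumption, and `normalize` and `score_triples` are hypothetical names.

```python
def normalize(triple):
    """Lowercase and collapse whitespace in subject, predicate and object."""
    return tuple(" ".join(str(triple[k]).lower().split()) for k in ("s", "p", "o"))


def score_triples(gold_triples, pred_triples):
    """Return (precision, recall, f1) for one sample, using exact set matching."""
    gold = {normalize(t) for t in gold_triples}
    pred = {normalize(t) for t in pred_triples}
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if true_pos == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Set-based matching ignores triple order and duplicate predictions, which suits a graph target where only the presence of an edge matters.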
## Sample format

Each line of `samples.jsonl` / `train|val|test.jsonl` is a JSON record:

| field | meaning |
|---|---|
| `input_text` | prospectus prose for the fund (model input) |
| `ontology` | inferred meta-schema (subject type → predicate → object type) |
| `target_triples` | structured `{s,p,o}` list |
| `target_serialized` | marker form (`<triple_start>` …) for Models 2/4 |
| `target_serialized_plain` | Turtle-like form, no special tokens, for Models 1/3 |
| `cik`, `series_id`, `fund`, `trust_name` | identifiers |
| `stats` | input/target sizes, triple count, text:json ratio |
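Reading a record needs nothing beyond the standard `json` module. The sketch below parses one JSONL line and derives a few summary numbers from it. The field names follow the table above, but every value in the record is invented, and the keys returned by the hypothetical `summarize` helper need not match the keys inside the dataset's own `stats` field.

```python
import json

# Hypothetical record: real field names, invented values.
line = json.dumps({
    "input_text": "Example Growth Fund is a series of Example Trust. "
                  "Example Advisers LLC serves as the Fund's investment adviser.",
    "target_triples": [
        {"s": "Example Growth Fund", "p": "seriesOf", "o": "Example Trust"},
        {"s": "Example Growth Fund", "p": "advisedBy", "o": "Example Advisers LLC"},
    ],
    "fund": "Example Growth Fund",
    "trust_name": "Example Trust",
})


def summarize(jsonl_line):
    """Parse one JSONL line and report its triple count and size ratio."""
    rec = json.loads(jsonl_line)
    triples = rec["target_triples"]
    target_json = json.dumps(triples, ensure_ascii=False)
    return {
        "fund": rec["fund"],
        "n_triples": len(triples),
        "predicates": sorted({t["p"] for t in triples}),
        # Characters of input prose per character of JSON-encoded target.
        "text_to_json_ratio": len(rec["input_text"]) / len(target_json),
    }


summary = summarize(line)
print(summary["fund"], summary["n_triples"], summary["predicates"])
```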
## Relations

Entity-to-entity edges (gold from N-CEN / Series-Class):
`seriesOf`, `advisedBy`, `subAdvisedBy`, `transferAgent`, `custodian` (primary
only by default), `administrator`, `underwrittenBy`. Holdings edges
(`holds`/`issuedBy`/`domiciledIn`, gold from N-PORT) are a planned second track
built from annual-report (N-CSR) commentary; see the description PDF.
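Because the entity-to-entity relation set is closed, one simple sanity check on model output is to flag predicates outside it. The sketch below is an illustration only: `RELATIONS` copies the seven predicate names listed above, and `unknown_predicates` is a hypothetical helper, not part of the repository.

```python
# Entity-to-entity predicates named in this README.
RELATIONS = {
    "seriesOf",
    "advisedBy",
    "subAdvisedBy",
    "transferAgent",
    "custodian",
    "administrator",
    "underwrittenBy",
}


def unknown_predicates(triples):
    """Return the sorted predicates in `triples` that are not in RELATIONS."""
    return sorted({t["p"] for t in triples} - RELATIONS)
```

For example, a prediction containing a `holds` edge would be flagged, since holdings edges belong to the planned second track rather than to this relation set.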
## Data sources

All inputs are public SEC filings (EDGAR, DERA data sets). The large raw data
(prospectus prose, bulk N-CEN/N-PORT/XBRL downloads) and the generated sample
files are **git-ignored** because they are reproducible from the commands above;
only the lightweight structured gold graph (`data/rdf_poc/gold_graphs.jsonl`) is
committed.

## Requirements

```bash
pip install -r requirements.txt
```