fund_rfid_data/README.md
Florian Herzog 1993658fb2 Add SEC fund prospectus -> RDF triple dataset pipeline
Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures, the dataset for the thesis 'Magical RDF Triples and
how to synthetize them'.

- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
  all books per trust), samples (per-fund segmentation, marker + plain
  serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:31:35 +02:00

# fund_rdf_data
A relationship-rich **finance dataset for text-to-RDF-triple extraction**, built
from mandatory U.S. SEC fund disclosures. Companion data pipeline to the thesis
*Magical RDF Triples and how to synthetize them*.

Each sample pairs a long natural-language **prospectus section** (input) with a
compact graph of **entity-to-entity RDF triples** (target) — a fund *advised by*
a manager, *distributed by* an underwriter, *seriesOf* a trust, and so on. Unlike
Wikidata-derived corpora where text ≈ triples, here the input is ~2050× larger
than the output, and the target is a genuine knowledge graph rather than flat
attributes. Ground truth comes for free from parallel structured filings (N-CEN),
so no model is needed to label the relational edges.

See [`dataset_description.pdf`](dataset_description.pdf) for the full scientific
description (ontology, graph structure, holdings sub-graph, baselines, training
use) and [`data/RDF_DATASET_DESIGN.md`](data/RDF_DATASET_DESIGN.md) for design
notes.
## Pipeline
The dataset is built by [`build_rdf_dataset.py`](build_rdf_dataset.py) in four stages:
```bash
# 1. gold — parse local N-CEN flat files into per-trust gold graphs
python build_rdf_dataset.py gold --custodian-scope primary
# 2. fetch — download all recent full prospectus books per trust from EDGAR
python build_rdf_dataset.py fetch --limit 435 --max-filings 8
# 3. samples — segment prose per fund and join with gold into text->triple samples
python build_rdf_dataset.py samples
# 4. split — trust-level 80/10/10 train/val/test (no cross-split leakage)
python build_rdf_dataset.py split
# or run all four:
python build_rdf_dataset.py all --limit 435
```
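The trust-level split in stage 4 can be sketched as follows. This is a minimal illustration, not the code in `build_rdf_dataset.py`: it assumes `cik` identifies the trust, and the helper names `assign_split` and `split_samples` are hypothetical. Hashing the key keeps every fund of a trust in the same split, so no trust leaks across train/val/test; the resulting proportions are only approximately 80/10/10.

```python
import hashlib
from collections import defaultdict


def assign_split(trust_key: str, ratios=(0.8, 0.1, 0.1)) -> str:
    """Map a trust identifier to 'train', 'val' or 'test' deterministically."""
    # Hash the key so the assignment is stable across runs and machines.
    digest = hashlib.sha256(trust_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < ratios[0]:
        return "train"
    if bucket < ratios[0] + ratios[1]:
        return "val"
    return "test"


def split_samples(samples):
    """Group samples by split; all samples of one trust share a split."""
    splits = defaultdict(list)
    for sample in samples:
        # Assumption: the `cik` field identifies the trust.
        splits[assign_split(str(sample["cik"]))].append(sample)
    return dict(splits)
```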
[`score_baseline.py`](score_baseline.py) computes a no-model string-match baseline
and scores strong-model predictions against the gold graphs:
```bash
python score_baseline.py stringmatch # no-model lower bound
python score_baseline.py model --pred preds.jsonl
```
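One common way to score predictions against gold triples is exact-match precision, recall, and F1 over `(s, p, o)` sets. The sketch below assumes that metric and a hypothetical helper name `triple_prf`; `score_baseline.py` may normalize entity names or score differently. The entity names in the example are made up.

```python
def triple_prf(pred, gold):
    """Exact-match precision, recall and F1 over sets of (s, p, o) tuples."""
    pred_set = {(t["s"], t["p"], t["o"]) for t in pred}
    gold_set = {(t["s"], t["p"], t["o"]) for t in gold}
    tp = len(pred_set & gold_set)  # triples present in both sets
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Illustrative data only: one of two predicted triples matches the gold.
gold = [
    {"s": "Example Growth Fund", "p": "advisedBy", "o": "Example Advisers LLC"},
    {"s": "Example Growth Fund", "p": "seriesOf", "o": "Example Trust"},
]
pred = [
    {"s": "Example Growth Fund", "p": "advisedBy", "o": "Example Advisers LLC"},
    {"s": "Example Growth Fund", "p": "custodian", "o": "Example Bank"},
]
print(triple_prf(pred, gold))  # (0.5, 0.5, 0.5)
```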
## Sample format
Each line of `samples.jsonl` / `train|val|test.jsonl` is a JSON record:

| field | meaning |
|---|---|
| `input_text` | prospectus prose for the fund (model input) |
| `ontology` | inferred meta-schema (subject type → predicate → object type) |
| `target_triples` | structured `{s,p,o}` list |
| `target_serialized` | marker form (`<triple_start>` …) for Models 2/4 |
| `target_serialized_plain` | Turtle-like form, no special tokens, for Models 1/3 |
| `cik`, `series_id`, `fund`, `trust_name` | identifiers |
| `stats` | input/target sizes, triple count, text:json ratio |
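
Records in this format can be read with the standard library alone. The sketch below uses only the `input_text` and `target_triples` fields from the table; the helper name `iter_samples` and the example record are hypothetical.

```python
import json


def iter_samples(lines):
    """Yield (input_text, triples) pairs from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        triples = [(t["s"], t["p"], t["o"]) for t in record["target_triples"]]
        yield record["input_text"], triples


# Illustrative record only; real records carry the other fields as well.
example_line = json.dumps({
    "input_text": "Example Advisers LLC serves as investment adviser to the Fund.",
    "target_triples": [
        {"s": "Example Growth Fund", "p": "advisedBy", "o": "Example Advisers LLC"},
    ],
})
for text, triples in iter_samples([example_line]):
    print(len(text), triples)
```

To read a split file, pass an open file handle (for example one opened on `train.jsonl`) instead of the in-memory list.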
## Relations
Entity-to-entity edges (gold from N-CEN / Series-Class):
`seriesOf`, `advisedBy`, `subAdvisedBy`, `transferAgent`, `custodian` (primary
only by default), `administrator`, `underwrittenBy`. Holdings edges
(`holds`/`issuedBy`/`domiciledIn`, gold from N-PORT) are a planned second track
based on annual-report (N-CSR) commentary; see the description PDF.
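
When working with model output, it can help to check predicates against the entity-relation list above before scoring. The following is a small sketch under that assumption; `ENTITY_RELATIONS` and `predicate_counts` are hypothetical names, not part of the pipeline.

```python
from collections import Counter

# The entity-to-entity relations listed in this section.
ENTITY_RELATIONS = frozenset({
    "seriesOf", "advisedBy", "subAdvisedBy", "transferAgent",
    "custodian", "administrator", "underwrittenBy",
})


def predicate_counts(triples):
    """Count triples per predicate and list predicates outside the known set."""
    counts = Counter(t["p"] for t in triples)
    unknown = sorted(set(counts) - ENTITY_RELATIONS)
    return counts, unknown
```

A non-empty `unknown` list flags predictions that use a relation outside the current track, such as the planned holdings edges.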
## Data sources
All inputs are public SEC filings (EDGAR, DERA data sets). The large raw data
(prospectus prose, bulk N-CEN/N-PORT/XBRL downloads) and the generated sample
files are **git-ignored** because they are reproducible from the commands above;
only the lightweight structured gold graph (`data/rdf_poc/gold_graphs.jsonl`) is
committed.
## Requirements
```bash
pip install -r requirements.txt
```