Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures, the dataset for the thesis 'Magical RDF Triples and
how to synthetize them'.
- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
all books per trust), samples (per-fund segmentation, marker + plain
serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
76 lines
4.1 KiB
Markdown
# SEC Fund Prospectus → RDF Triple Dataset — Design
## Goal
A training/eval dataset for text→RDF-triple extraction where:
- the INPUT text is much larger than the OUTPUT JSON (realistic "needle in long doc"),
- the OUTPUT is a genuine **graph of entity→entity relationships**, not flat key→literal attributes,
- it is finance **reference data** (openfunds-aligned),
- there is a **non-model gold baseline** AND a path to a **strong-model baseline**.

## Why not flat XBRL fee data
XBRL Risk/Return (management fee, TER, returns) yields only `Fund --pred--> literal`
attributes: a star of literals with no entity-to-entity edges. It is rejected as the *primary*
graph source because it produces no relational structure. (It may be added as optional
literal-valued triples to enrich the schema, but it is not the point.)

## Evidence (sample: Fidelity Oxford Street Trust, 485BPOS 0000028540-25-000048)
Full prospectus prose = 1,061,532 chars. Relational mentions found in prose:
adviser 381, sub-adviser 25, distributor 31, custodian 59, transfer agent 24,
"Trust" 805, "series of" 13, "managed by" 38, named benchmark index.
=> The prose contains a real multi-entity-type graph.

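Counts like the ones above can be approximated with a case-insensitive phrase count over the prose. The following is a minimal sketch only: `RELATION_PATTERNS` and `count_mentions` are illustrative names, and the exact patterns behind the reported figures are not recorded in this note, so the numbers it produces may differ.

```python
import re
from collections import Counter

# ASSUMPTION: illustrative patterns, not the ones used for the figures above.
RELATION_PATTERNS = {
    # The lookbehind keeps "sub-adviser" from also counting as "adviser".
    "adviser": r"\b(?<!sub-)advis[eo]rs?\b",
    "sub-adviser": r"\bsub-?advis[eo]rs?\b",
    "distributor": r"\bdistributors?\b",
    "custodian": r"\bcustodians?\b",
    "transfer agent": r"\btransfer agents?\b",
    "series of": r"\bseries of\b",
    "managed by": r"\bmanaged by\b",
}


def count_mentions(text: str) -> Counter:
    """Count case-insensitive occurrences of each relational phrase."""
    counts = Counter()
    for label, pattern in RELATION_PATTERNS.items():
        counts[label] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


if __name__ == "__main__":
    # Made-up sentence, used only to exercise the function.
    demo = "The Fund's adviser is Example Management; the sub-adviser is Example Partners."
    print(count_mentions(demo))
```

Raw phrase counts only show that the relational vocabulary is present; they say nothing about which entity each mention refers to.
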
## Target ontology (entity→entity edges)
```
Fund  seriesOf       Trust
Fund  advisedBy      InvestmentAdviser
Fund  subAdvisedBy   SubAdviser
Fund  distributedBy  Distributor
Fund  custodian      Custodian
Fund  transferAgent  TransferAgent
Fund  managedBy      PortfolioManager
Fund  tracksIndex    Index             (passive/index funds)
```
Optional literal-valued triples (attributes) to round out the openfunds record:
managementFee, netExpenseRatio, return1yr/5yr/10yr, portfolioTurnover, objectiveText.

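One way to make the edge list above machine-checkable is a predicate-to-type table plus a small validity check. This is a sketch under assumptions: the `Entity` and `Triple` classes and `is_valid` are hypothetical helpers, not code from build_rdf_dataset.py; only the predicate and type names come from the listing.

```python
from dataclasses import dataclass

# Predicate -> (subject type, object type), copied from the ontology listing.
ONTOLOGY = {
    "seriesOf": ("Fund", "Trust"),
    "advisedBy": ("Fund", "InvestmentAdviser"),
    "subAdvisedBy": ("Fund", "SubAdviser"),
    "distributedBy": ("Fund", "Distributor"),
    "custodian": ("Fund", "Custodian"),
    "transferAgent": ("Fund", "TransferAgent"),
    "managedBy": ("Fund", "PortfolioManager"),
    "tracksIndex": ("Fund", "Index"),
}


@dataclass(frozen=True)
class Entity:  # hypothetical helper type
    name: str
    type: str


@dataclass(frozen=True)
class Triple:  # hypothetical helper type
    subject: Entity
    predicate: str
    obj: Entity


def is_valid(triple: Triple) -> bool:
    """True if the predicate is known and both endpoint types match the table."""
    expected = ONTOLOGY.get(triple.predicate)
    return expected == (triple.subject.type, triple.obj.type)


if __name__ == "__main__":
    fund = Entity("Example Fund", "Fund")      # placeholder names
    trust = Entity("Example Trust", "Trust")
    print(is_valid(Triple(fund, "seriesOf", trust)))
```

Keeping endpoint types in one table means an extracted triple with a correct predicate but a wrong object type can be rejected before scoring.
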
## Baselines (the key selling point)
1. GOLD, no model:
   - N-CEN service-provider table -> advisedBy, custodian, transferAgent, distributor, auditor (with LEIs)
   - Series/Class CSV -> seriesOf, hasShareClass (Trust->Series->Class skeleton)
2. SILVER, strong model (GPT-4 / Opus), measured against the N-CEN gold:
   - managedBy (portfolio managers), subAdvisedBy, tracksIndex (named only in prose)

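Measuring a baseline against gold comes down to comparing two sets of triples. The sketch below shows one plausible way to do that with set-based precision, recall, and F1; it is not the logic of score_baseline.py, and the `normalize` rule (lowercasing and whitespace collapsing) is an assumption chosen for illustration.

```python
def normalize(value: str) -> str:
    """ASSUMPTION: lowercase and collapse whitespace before comparing."""
    return " ".join(value.lower().split())


def score(predicted, gold):
    """Set-based precision/recall/F1 over (subject, predicate, object) triples."""
    pred_set = {tuple(normalize(x) for x in t) for t in predicted}
    gold_set = {tuple(normalize(x) for x in t) for t in gold}
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Placeholder entity names, for illustration only.
    gold = [
        ("Example Fund", "advisedBy", "Example Management"),
        ("Example Fund", "custodian", "Example Bank"),
    ]
    predicted = [
        ("example fund", "advisedBy", "Example  Management"),
        ("Example Fund", "custodian", "Other Bank"),
    ]
    print(score(predicted, gold))
```

Exact matching after normalization is strict: legal-name variants of the same entity count as misses unless entities are first resolved to an identifier such as the LEI.
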
## Unit of extraction
One FUND (series) = one sample:
- input = the prospectus section(s) for that fund (kept within a long context)
- output = the fund's subgraph of triples

N-CEN/Series-Class gold is keyed at trust/series level, so per-series is the natural join.

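The per-series join can be sketched as grouping gold edges by series ID and pairing each group with that fund's text. Everything here is illustrative: `build_samples`, the record layout, and the `S000000001`-style IDs are placeholders, not the actual structures in build_rdf_dataset.py.

```python
from collections import defaultdict


def build_samples(fund_texts, gold_edges):
    """Join per-fund prose with per-series gold triples.

    fund_texts: dict mapping series ID -> prospectus text for that fund.
    gold_edges: iterable of (series_id, subject, predicate, object) rows.
    """
    by_series = defaultdict(list)
    for series_id, subj, pred, obj in gold_edges:
        by_series[series_id].append((subj, pred, obj))

    samples = []
    for series_id, text in fund_texts.items():
        triples = by_series.get(series_id)
        if not triples:
            continue  # no gold for this series, so it cannot be scored
        samples.append({"series_id": series_id, "input": text, "output": triples})
    return samples


if __name__ == "__main__":
    # Placeholder IDs and names, for illustration only.
    texts = {"S000000001": "Prospectus text for Example Fund ...",
             "S000000002": "Prospectus text for Another Fund ..."}
    edges = [("S000000001", "Example Fund", "seriesOf", "Example Trust"),
             ("S000000001", "Example Fund", "custodian", "Example Bank")]
    for sample in build_samples(texts, edges):
        print(sample["series_id"], len(sample["output"]))
```

Dropping series without gold keeps every emitted sample scorable, at the cost of a smaller dataset.
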
## Data gaps to fix (local dump)
- fund_data.db was accidentally deleted (only stale -shm/-wal remain) -> rebuild from data/.
- load_xbrl_rr.py tag constants are partly stale vs the actual num.tsv vintage
  (e.g. ExpensesOverAssets, DistributionAndService12b1FeesOverAssets,
  AvgAnnlRtrPct/AnnlRtrPct, PortfolioTurnoverRate, BarChart*QuarterlyReturn).
- Local N-PORT dump is MISSING FUND_REPORTED_HOLDING.tsv and IDENTIFIERS.tsv
  (the holds/issuedBy edges) -> would need re-download if holdings edges are wanted.
- XBRL flat files contain NO text blocks -> narrative input must be fetched from EDGAR
  filings (filing_text), not from the flat files.
- N-CEN flat files are NOT yet downloaded locally -> needed for the gold edges.

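Stale tag constants can be detected by comparing the expected tag names with the tags that actually occur in the data. The sketch below assumes num.tsv is tab-separated with a header row containing a `tag` column; `missing_tags` and `EXPECTED_TAGS` are hypothetical names, and the tag list is only a subset of the examples mentioned above.

```python
import csv
import io

# Example tag names taken from the list above; not the full set of constants.
EXPECTED_TAGS = {
    "ExpensesOverAssets",
    "DistributionAndService12b1FeesOverAssets",
    "PortfolioTurnoverRate",
}


def missing_tags(tsv_file, expected):
    """Return the expected tags that never occur in the file's `tag` column.

    ASSUMPTION: tab-separated input whose header row includes `tag`.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    present = {row["tag"] for row in reader}
    return set(expected) - present


if __name__ == "__main__":
    # Tiny in-memory stand-in for num.tsv, for illustration only.
    demo = io.StringIO("adsh\ttag\tvalue\n0000000000-00-000000\tExpensesOverAssets\t0.5\n")
    print(sorted(missing_tags(demo, EXPECTED_TAGS)))
```

For a real file, pass an open handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points at the local num.tsv.
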
## Holdings edges (deferred to a 2nd track)
Holdings (`fund holds security`, `security issuedBy issuer`) are NOT in the prospectus.
Text-bearing sources for holdings:
- Annual/Semi-Annual Report (N-CSR): "Management Discussion" (MDFP) names TOP holdings
  in prose -> real `holds` edges; gold = N-PORT holdings table. The full Schedule of
  Investments is a table, not prose.
- Fund fact sheets / PM commentary (fund-company marketing): "Top 10 Holdings" + prose,
  but off-EDGAR and no standardized machine-readable gold.

Plan: build the prospectus->service-provider graph FIRST (clean N-CEN gold). Add an
MDFP(N-CSR)->holdings dataset as a SECOND finance sub-domain later; this also
strengthens the thesis's cross-domain generalization claim.

## This session: PROOF OF CONCEPT (~20-50 funds)
End-to-end on a small slice: fetch prospectus text, get N-CEN gold service-provider
edges + Series/Class structure, emit text->triples samples in the `<triple_start>`
RDF marker format, and build a gold-baseline scorer. Validate before scaling.

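To illustrate what a marker serialization and a plain serialization of the same triples could look like, here is a hedged sketch. Only the `<triple_start>` token is named in this note; the closing `<triple_end>` token, the ` | ` field separator, and the JSON layout of the plain form are assumptions made for the example and may not match the format the dataset actually uses.

```python
import json

TRIPLE_START = "<triple_start>"  # named in the design note
TRIPLE_END = "<triple_end>"      # ASSUMPTION: closing marker is not specified
FIELD_SEP = " | "                # ASSUMPTION: field separator is not specified


def to_marker(triples):
    """Serialize (subject, predicate, object) triples as one marker string."""
    return "".join(
        f"{TRIPLE_START}{s}{FIELD_SEP}{p}{FIELD_SEP}{o}{TRIPLE_END}"
        for s, p, o in triples
    )


def to_plain(triples):
    """Serialize the same triples as a JSON list of objects (assumed layout)."""
    return json.dumps(
        [{"subject": s, "predicate": p, "object": o} for s, p, o in triples]
    )


if __name__ == "__main__":
    # Placeholder entity names, for illustration only.
    demo = [("Example Fund", "seriesOf", "Example Trust")]
    print(to_marker(demo))
    print(to_plain(demo))
```
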