# SEC Fund Prospectus → RDF Triple Dataset: Design

## Goal
A training/eval dataset for text→RDF-triple extraction where:
- the INPUT text is much larger than the OUTPUT JSON (realistic "needle in long doc"),
- the OUTPUT is a genuine **graph of entity→entity relationships**, not flat key→literal attributes,
- it is finance **reference data** (openfunds-aligned),
- there is a **non-model gold baseline** AND a path to a **strong-model baseline**.

## Why not flat XBRL fee data
XBRL Risk/Return (management fee, TER, returns) yields only `Fund --pred--> literal` attributes: a star of literals, no entity-to-entity edges. Rejected as the *primary* graph source because it produces no relational structure. (May be added as optional literal-valued triples to enrich the schema, but it is not the point.)

## Evidence (sample: Fidelity Oxford Street Trust, 485BPOS 0000028540-25-000048)
Full prospectus prose = 1,061,532 chars. Relational mentions found in prose:
adviser 381, sub-adviser 25, distributor 31, custodian 59, transfer agent 24, "Trust" 805, "series of" 13, "managed by" 38, named benchmark index.

=> The prose contains a real multi-entity-type graph.

## Target ontology (entity→entity edges)
```
Fund  seriesOf       Trust
Fund  advisedBy      InvestmentAdviser
Fund  subAdvisedBy   SubAdviser
Fund  distributedBy  Distributor
Fund  custodian      Custodian
Fund  transferAgent  TransferAgent
Fund  managedBy      PortfolioManager
Fund  tracksIndex    Index   (passive/index funds)
```
Optional literal-valued triples (attributes) to round out the openfunds record: managementFee, netExpenseRatio, return1yr/5yr/10yr, portfolioTurnover, objectiveText.

## Baselines (the key selling point)
1. GOLD, no model:
   - N-CEN service-provider table -> advisedBy, custodian, transferAgent, distributor, auditor (with LEIs)
   - Series/Class CSV -> seriesOf, hasShareClass (Trust->Series->Class skeleton)
2. SILVER, strong model (GPT-4 / Opus), measured against the N-CEN gold:
   - managedBy (portfolio managers), subAdvisedBy, tracksIndex (named only in prose)

## Unit of extraction
One FUND (series) = one sample:
- input = the prospectus section(s) for that fund (kept within a long context)
- output = the fund's subgraph of triples

N-CEN/Series-Class gold is keyed at trust/series level, so per-series is the natural join.

## Data gaps to fix (local dump)
- fund_data.db was accidentally deleted (only stale -shm/-wal remain) -> rebuild from data/.
- load_xbrl_rr.py tag constants are partly stale vs the actual num.tsv vintage (e.g. ExpensesOverAssets, DistributionAndService12b1FeesOverAssets, AvgAnnlRtrPct/AnnlRtrPct, PortfolioTurnoverRate, BarChart*QuarterlyReturn).
- Local N-PORT dump is MISSING FUND_REPORTED_HOLDING.tsv and IDENTIFIERS.tsv (the holds/issuedBy edges) -> would need re-download if holdings edges are wanted.
- XBRL flat files contain NO text blocks -> narrative input must be fetched from EDGAR filings (filing_text), not from the flat files.
- N-CEN flat files are NOT yet downloaded locally -> needed for the gold edges.

## Holdings edges (deferred to a 2nd track)
Holdings (`fund holds security`, `security issuedBy issuer`) are NOT in the prospectus.
Text-bearing sources for holdings:
- Annual/Semi-Annual Report (N-CSR): "Management Discussion" (MDFP) names TOP holdings in prose -> real `holds` edges; gold = N-PORT holdings table. The full Schedule of Investments is a table, not prose.
- Fund fact sheets / PM commentary (fund-company marketing): "Top 10 Holdings" + prose, but off-EDGAR and no standardized machine-readable gold.

Plan: build prospectus->service-provider graph FIRST (clean N-CEN gold). Add an MDFP(N-CSR)->holdings dataset as a SECOND finance sub-domain later; this also strengthens the thesis's cross-domain generalization claim.

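As a rough illustration of the first track in the plan above (service-provider edges from N-CEN gold, grouped per series so that one fund = one sample), here is a minimal Python sketch. The row layout (`series_id`, `fund_name`, `role`, `provider_name`), the role labels, and the example entities are all assumptions made for illustration; the real N-CEN flat-file columns have not been inspected yet and will likely need a join step before they look like this.

```python
from dataclasses import dataclass

# ASSUMPTION: hypothetical role labels in a pre-joined service-provider table.
# Only roles listed under the GOLD baseline and present in the target ontology
# are mapped; anything else (e.g. auditor) is skipped.
ROLE_TO_PREDICATE = {
    "adviser": "advisedBy",
    "distributor": "distributedBy",
    "custodian": "custodian",
    "transfer agent": "transferAgent",
}


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


def gold_triples_by_series(rows):
    """Group gold service-provider triples by series id (one fund = one sample)."""
    by_series = {}
    for row in rows:
        predicate = ROLE_TO_PREDICATE.get(row["role"].strip().lower())
        if predicate is None:
            continue  # role is outside the target ontology
        triple = Triple(row["fund_name"], predicate, row["provider_name"])
        by_series.setdefault(row["series_id"], set()).add(triple)
    return by_series


# Made-up rows, purely for illustration (not real filing data).
rows = [
    {"series_id": "S1", "fund_name": "Example Fund A",
     "role": "Adviser", "provider_name": "Example Adviser LLC"},
    {"series_id": "S1", "fund_name": "Example Fund A",
     "role": "Custodian", "provider_name": "Example Bank"},
    {"series_id": "S1", "fund_name": "Example Fund A",
     "role": "Auditor", "provider_name": "Example Audit LLP"},
    {"series_id": "S2", "fund_name": "Example Fund B",
     "role": "Transfer Agent", "provider_name": "Example Services Co"},
]

gold = gold_triples_by_series(rows)
for series_id in sorted(gold):
    for t in sorted(gold[series_id], key=lambda t: (t.predicate, t.obj)):
        print(series_id, t.subject, t.predicate, t.obj)
```

Keying the output by series id keeps the gold aligned with the per-fund unit of extraction; `seriesOf` edges from the Series/Class CSV could be merged into the same per-series sets.
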
## This session: PROOF OF CONCEPT (~20-50 funds)
End-to-end on a small slice: fetch prospectus text, get N-CEN gold service-provider edges + Series/Class structure, emit text->triples samples in the RDF marker format, and build a gold-baseline scorer. Validate before scaling.
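
The gold-baseline scorer could start out as simple set overlap between predicted and gold triples for one fund. The sketch below is one possible shape, not a settled design: it assumes triples are plain `(subject, predicate, object)` tuples already parsed out of the RDF marker format, and the entity-name normalization (whitespace collapse plus case folding) is an assumption that will probably need to be replaced by LEI- or alias-based matching. The example entities are made up.

```python
def normalize(triple):
    """Normalize entity names for matching; predicates are compared exactly.

    ASSUMPTION: whitespace/case normalization is a placeholder for real
    entity resolution (e.g. via LEIs from N-CEN).
    """
    subject, predicate, obj = triple
    clean = lambda s: " ".join(s.split()).casefold()
    return (clean(subject), predicate, clean(obj))


def score_triples(gold, predicted):
    """Exact-match precision, recall and F1 over two collections of triples."""
    gold_set = {normalize(t) for t in gold}
    pred_set = {normalize(t) for t in predicted}
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up triples for one fund, purely for illustration.
gold_example = [
    ("Example Fund A", "advisedBy", "Example Adviser LLC"),
    ("Example Fund A", "custodian", "Example Bank"),
]
predicted_example = [
    ("example fund a", "advisedBy", "  Example Adviser  LLC "),  # matches after normalization
    ("Example Fund A", "transferAgent", "Example Services Co"),  # not in gold
]

result = score_triples(gold_example, predicted_example)
print(result)
```

Scoring per fund and then aggregating across the 20-50 PoC funds would keep a few provider-heavy trusts from dominating the metric, though micro-averaging over all triples is an equally reasonable first cut.
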