fund_rfid_data/OPENFUNDS_PUBLIC_DATA_SOURCES.md
Florian Herzog 1993658fb2 Add SEC fund prospectus -> RDF triple dataset pipeline
Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures, the dataset for the thesis 'Magical RDF Triples and
how to synthetize them'.

- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
  all books per trust), samples (per-fund segmentation, marker + plain
  serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:31:35 +02:00


Openfunds Fields: Public Structured Data Availability

Executive Summary

This document maps each openfunds field category to publicly available structured data sources — data that is machine-readable, downloadable, and free (or freely accessible via API). The focus is on fields describing the fund itself (asset class, settlement, risk, currencies, hedging, ESG, fees, etc.) rather than EU-specific regulatory fields.

Key Public Structured Data Sources

Source Format Access Coverage Cost
SEC Series/Class CSV CSV Direct download ~100K+ US share classes Free
SEC XBRL Risk/Return XBRL → flat files Quarterly download All US mutual fund prospectuses Free
SEC N-PORT Data Sets XML → flat TSV Quarterly download Monthly holdings for all US funds Free
SEC N-CEN Data Sets XML → flat TSV Annual filing, quarterly sets Service providers, classification Free
SEC Submissions API JSON REST API All SEC filers Free
SEC XBRL Company Facts JSON REST API XBRL-tagged financial data Free
GLEIF LEI Database JSON/CSV API + bulk download 3.19M+ global entities Free (CC0)
OpenFIGI JSON REST API Hundreds of millions of instruments Free

1. Key Fact: Company (OFST001000-004999) - 40 fields

These fields identify the management company, custodian, transfer agent, auditor, and other service providers.

Fields with Structured Public Data

OF-ID Field Name Public Source Source Field / Method Structured?
OFST001000 Fund Group Name SEC Submissions API subs.name (entity name) Yes — JSON
OFST001020 ManCo SEC N-CEN ADVISOR table (investment adviser) Yes — TSV
OFST001030 LEI Of ManCo GLEIF LEI Database LEI lookup by entity name Yes — JSON/CSV
OFST001035 Domicile Of ManCo GLEIF LEI Database entity.legalAddress.country Yes — JSON
OFST001050 Fund Guarantor Not in public structured data No
OFST001055 Address of ManCo GLEIF LEI Database entity.legalAddress Yes — JSON
OFST001060 City of ManCo GLEIF LEI Database entity.legalAddress.city Yes — JSON
OFST001065 Fund Website of ManCo SEC Submissions API subs.website Yes — JSON
OFST001100 Fund Promoter Name Not publicly structured No
OFST001105 LEI of Fund Promoter GLEIF LEI Database If name known → LEI lookup Partial
OFST001300 Fund Administrator Name SEC N-CEN SERVICE_PROVIDER table Yes — TSV
OFST001400 Custodian Bank Name SEC N-CEN CUSTODIAN table Yes — TSV
OFST001410 LEI Of Custodian Bank SEC N-CEN + GLEIF N-CEN has LEI fields (since 2025) Yes — TSV
OFST001415 Domicile Of Custodian Bank GLEIF LEI Database Via custodian LEI Yes — JSON
OFST001430 Trustee Name SEC EDGAR HTML filings Unstructured (prospectus text) No
OFST001450 Portfolio Managing Company Name SEC N-CEN ADVISOR table + sub-advisors Yes — TSV
OFST001500 Fund Advisor Name SEC N-CEN ADVISOR table Yes — TSV
OFST001510 Sub-Investment Advisor Name SEC N-CEN Sub-advisor entries Yes — TSV
OFST001600 Auditor Name SEC N-CEN AUDITOR table Yes — TSV
OFST002000 Market Maker Name Not publicly structured for funds No
OFST002700 Transfer Agent Name SEC N-CEN TRANSFER_AGENT table Yes — TSV
OFST002900 GIIN of Fund IRS FATCA list (not easily matched) No

Summary: ~15 of 40 company fields are available as structured public data, primarily from SEC N-CEN (service providers) and GLEIF (entity LEI/address data).
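The "LEI lookup by entity name" method in the table could look roughly like the sketch below. It assumes the public GLEIF REST API at `api.gleif.org` with its `filter[entity.legalName]` query parameter and the JSON:API response layout; the sample record, including its LEI, is invented for illustration and should be replaced by a real response.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public GLEIF API; verify against the GLEIF documentation.
GLEIF_API = "https://api.gleif.org/api/v1/lei-records"

def lei_search_url(legal_name, page_size=5):
    """Build a GLEIF query that filters LEI records by legal name."""
    params = {"filter[entity.legalName]": legal_name, "page[size]": page_size}
    return f"{GLEIF_API}?{urlencode(params)}"

def summarize_record(record):
    """Pull the LEI, name, country and city out of one response record."""
    entity = record["attributes"]["entity"]
    address = entity["legalAddress"]
    return {
        "lei": record["attributes"]["lei"],
        "name": entity["legalName"]["name"],
        "country": address["country"],
        "city": address["city"],
    }

# Invented record that mimics the assumed response shape (not real data).
SAMPLE_RECORD = {
    "attributes": {
        "lei": "00000000000000000000",
        "entity": {
            "legalName": {"name": "Example Advisers LLC"},
            "legalAddress": {"country": "US", "city": "Boston"},
        },
    }
}

print(lei_search_url("Example Advisers LLC"))
print(summarize_record(SAMPLE_RECORD))
```

Name-based matching returns several candidates for common names, so a production pipeline would likely need a disambiguation step before accepting an LEI.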


2. Key Fact: Umbrella (OFST005000-009999) - 10 fields

OF-ID Field Name Public Source Source Field / Method Structured?
OFST005000 Has Umbrella SEC Series/Class CSV Inferred: multiple Series under same CIK Derivable
OFST005010 Umbrella SEC Series/Class CSV Entity Name (trust name) Yes — CSV
OFST005015 Domicile Of Umbrella SEC Submissions API subs.stateOfIncorporation Yes — JSON
OFST005025 CBI Code of Umbrella Ireland-specific, not in SEC No
OFST005030 CSSF Code of Umbrella Luxembourg-specific, not in SEC No
OFST005040 GIIN of Umbrella Not publicly structured No
OFST010035 LEI Of Umbrella GLEIF LEI Database LEI lookup by trust name Yes — JSON

Summary: 4 of 10 fields available. Umbrella concept maps to SEC "Trust/Registrant" level.
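A minimal sketch of the "Has Umbrella" inference (multiple Series under the same CIK) follows. The column names `CIK Number` and `Series ID` are assumptions about the Series/Class CSV header, and the sample rows are invented; adjust both to the actual download.

```python
import csv
import io
from collections import defaultdict

# Invented extract of the Series/Class CSV; the column names are assumptions.
SAMPLE_CSV = """CIK Number,Entity Name,Series ID,Series Name,Class ID
1111111,Example Trust,S000000001,Example Growth Fund,C000000001
1111111,Example Trust,S000000001,Example Growth Fund,C000000002
1111111,Example Trust,S000000002,Example Bond Fund,C000000003
2222222,Standalone Fund Inc,S000000003,Standalone Fund,C000000004
"""

def has_umbrella_by_cik(rows):
    """Map each CIK to True when more than one distinct series sits under it."""
    series_by_cik = defaultdict(set)
    for row in rows:
        series_by_cik[row["CIK Number"]].add(row["Series ID"])
    return {cik: len(series) > 1 for cik, series in series_by_cik.items()}

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
umbrella_flags = has_umbrella_by_cik(rows)
print(umbrella_flags)
```

Counting distinct series rather than rows matters here, because a single series with several share classes appears on several lines.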


3. Key Fact: Fund (OFST010000-019999) - 73 fields

This is the richest category, covering fund identity, investment strategy, structure, currencies, hedging, and product type flags.

3A. Fund Identity & Dates

OF-ID Field Name Public Source Source Field / Method Structured?
OFST010010 Fund Domicile Alpha-2 SEC Submissions API subs.stateOfIncorporation → derive Partial (US state, not ISO)
OFST010020 Legal Fund Name Including Umbrella SEC Series/Class CSV Concatenate Entity Name + Series Name Derivable
OFST010030 LEI Of Fund GLEIF LEI Database LEI search by fund name Yes — JSON
OFST010110 Legal Fund Name Only SEC Series/Class CSV Series Name Yes — CSV
OFST010240 Fund Launch Date SEC XBRL Risk/Return InceptionDate element Yes — XBRL
OFST010250 Fund Valuation Point Prospectus text only No
OFST010300 Investment Objective SEC XBRL Risk/Return ObjectivePrimaryTextBlock Yes — XBRL (text)
OFST010410 Fund Currency SEC N-PORT FUND_REPORTED_INFO.total_assets currency context Partial (all USD for US funds)
OFST010440 Fiscal Year End SEC Submissions API subs.fiscalYearEnd (MMDD format) Yes — JSON
OFST013000 Prospectus Date SEC Submissions API Filing date of latest 485BPOS/N-1A Yes — JSON
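Two of the rows above need a small transformation before they fit openfunds: the state of incorporation has to be mapped to an ISO alpha-2 country code, and the `MMDD` fiscal year end has to become a date. The sketch below assumes a Submissions API record that has already been loaded into a dict; the helper names are hypothetical, and codes outside the US state list are left unmapped on purpose.

```python
from datetime import date

# Two-letter US state codes plus DC. Any other code is left for manual
# mapping, because it may be a non-ISO code for a foreign jurisdiction.
US_STATE_CODES = frozenset(
    "AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN "
    "MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA "
    "WV WI WY".split()
)

def fund_domicile_alpha2(state_of_incorporation):
    """Return "US" for a US state code, otherwise None."""
    code = (state_of_incorporation or "").strip().upper()
    return "US" if code in US_STATE_CODES else None

def fiscal_year_end(mmdd, year):
    """Turn the MMDD string into a concrete date in the given year."""
    return date(year, int(mmdd[:2]), int(mmdd[2:]))

# Invented record with the two fields named in the table.
subs = {"stateOfIncorporation": "MA", "fiscalYearEnd": "1031"}
print(fund_domicile_alpha2(subs["stateOfIncorporation"]))
print(fiscal_year_end(subs["fiscalYearEnd"], 2025))
```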

3B. Fund Structure & Product Type Flags

OF-ID Field Name Public Source Source Field / Method Structured?
OFST010420 Open-ended Or Closed-ended SEC N-CEN Fund type reported Yes — TSV
OFST010500 Is Fund Of Funds SEC N-CEN Fund-of-funds flag Yes — TSV
OFST010580 Is ETF SEC N-CEN ETF table presence Yes — TSV
OFST010620 Is Tokenized Fund Not in SEC data No
OFST010630 Is Leveraged SEC N-PORT Borrowing data (Item B.2) Derivable
OFST010635 Maximum Leverage In Fund Prospectus text only No
OFST010640 Has 130/30 Strategy Prospectus text only No
OFST010650 Is REIT SEC N-CEN + XBRL Classification data Partial
OFST010660 Is ETC US concept is different No
OFST010665 Is ETN SEC N-CEN Product type Partial
OFST010670 Is Short Derivable from fund name/strategy Derivable (heuristic)
OFST010690 Is Life Fund Not a US concept No
OFST010695 Is Pension Fund Not in SEC fund data No
OFST010720 Is Passive Fund SEC N-CEN INDEX table (tracked index) Derivable
OFST010730 Management Approach Type Prospectus text only No

3C. Currencies & Hedging

OF-ID Field Name Public Source Source Field / Method Structured?
OFST010205 Has Duration Hedge Prospectus text only No
OFST010211 Currency Hedge Portfolio Prospectus text only No
OFST010220 Has Embedded Derivatives SEC N-PORT Derivatives tables (non-empty) Derivable
OFST020261 Currency Hedge Share Class Prospectus text only No
OFST020530 Is Multicurrency Share Class Prospectus text only No
OFST020540 Share Class Currency SEC XBRL Risk/Return Currency context in fee/performance tables Partial (USD implied)

Currency/hedging fields are almost entirely prospectus-derived and NOT available as structured public data. This is a key gap: US funds are almost all USD-denominated, and hedging is described in prospectus narrative text. For LLM training, these fields represent extraction targets.

3D. Replication & Securities Lending

OF-ID Field Name Public Source Source Field / Method Structured?
OFST010900 Replication Methodology First Level Prospectus text only (ETFs) No
OFST010901 Replication Methodology Second Level Prospectus text only (ETFs) No
OFST011000 Has Securities Lending SEC N-PORT SECURITIES_LENDING + BORROWER tables Yes — TSV
OFST011100 Has Swap SEC N-PORT Swap derivative tables Derivable
OFST011110 Swap Counterparty Name SEC N-PORT Counterparty fields in swap tables Yes — TSV

Summary for Fund section: ~25 of 73 fields available as structured data. The major gaps are: currency hedging, replication methodology, valuation timing, management approach, and leverage limits — all prospectus-narrative fields.


4. Key Fact: Share Class (OFST020000-049999) - 75 fields

4A. Identifiers

OF-ID Field Name Public Source Source Field / Method Structured?
OFST020000 ISIN OpenFIGI FIGI → ISIN mapping Yes — JSON
OFST020005 CUSIP SEC Series/Class CSV Not directly, but derivable from ISIN Partial
OFST020020 Bloomberg Code Proprietary (not free) No
OFST020025 FIGI Code OpenFIGI Direct lookup by ticker/ISIN Yes — JSON
OFST020040 SEDOL Proprietary (London Stock Exchange) No
OFST020045 NFN Identifier Nasdaq proprietary No
OFST020050 Share Class Extension SEC Series/Class CSV Class Name (parse letter/suffix) Derivable
OFST020060 Full Share Class Name SEC Series/Class CSV Series Name + Class Name Yes — CSV
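The FIGI lookup by ticker can be sketched as a batch of mapping jobs. This assumes the OpenFIGI v3 mapping endpoint, which accepts a JSON array of jobs via POST and returns one result object per job; the ticker and FIGI in the sample are invented, and the request itself is left out so the snippet runs offline.

```python
import json

# Assumed OpenFIGI v3 endpoint; jobs are sent as a JSON array in a POST body.
OPENFIGI_MAPPING_URL = "https://api.openfigi.com/v3/mapping"

def build_mapping_jobs(tickers, exch_code="US"):
    """Create one mapping job per ticker."""
    return [
        {"idType": "TICKER", "idValue": ticker, "exchCode": exch_code}
        for ticker in tickers
    ]

def first_figi(job_result):
    """Return the first FIGI of one job result, or None when nothing matched."""
    matches = job_result.get("data") or []
    return matches[0]["figi"] if matches else None

jobs = build_mapping_jobs(["EXMPX", "EXMPY"])  # hypothetical tickers
print(json.dumps(jobs))

# Invented response in the assumed shape: one entry per job, in job order.
sample_response = [
    {"data": [{"figi": "BBG000000000", "ticker": "EXMPX"}]},
    {"warning": "No identifier found."},
]
print([first_figi(result) for result in sample_response])
```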

4B. Share Class Characteristics

OF-ID Field Name Public Source Source Field / Method Structured?
OFST020300 Valuation Frequency Prospectus text only No
OFST020400 Share Class Distribution Policy SEC XBRL Risk/Return Derivable from dividend narrative Partial
OFST020540 Share Class Currency Implied USD for US funds Partial
OFST020545 Share Class Lifecycle SEC Submissions API Filing history + Series/Class CSV status Derivable
OFST020560 Share Class Launch Date SEC XBRL Risk/Return InceptionDate per share class Yes — XBRL
OFST020566 Termination Date SEC Series/Class CSV Class status (active/inactive) Partial
OFST020580 Is Share Class Eligible For UCITS Not applicable to US funds No
OFST023100 Investment Status Prospectus text only No
OFST023200 Benchmark SEC XBRL Risk/Return IndexNoDeductionForFeesExpensesTaxes Yes — XBRL
OFST023800 Index Name (ETF) SEC N-CEN INDEX table Yes — TSV
OFST024000 SRRI EU-specific risk indicator No

Summary: ~10 of 75 fields available. Share class operational details (valuation frequency, dealing days, settlement cycles) are entirely prospectus-derived.


5. Key Fact: Listing (OFST060000-064999) - 14 fields

OF-ID Field Name Public Source Source Field / Method Structured?
OFST060000 Bloomberg Code Of Listing Proprietary No
OFST062000 Listing Date Exchange data (not SEC) No
OFST062010 Listing Currency Implied USD for US-listed Partial
OFST062025 Launch Price SEC XBRL Risk/Return Inception price context Partial
OFST062030 Market Identifier Code Not in SEC data directly No
OFST062040 Exchange Place SEC N-CEN (ETFs) Exchange information for ETFs Partial

Summary: 0-2 fields fully structured. Listing data is primarily from exchanges, not SEC filings.


6. Legal Structure - 7 fields

OF-ID Field Name Public Source Source Field / Method Structured?
OFST160039 Is EU Directive Relevant EU-specific No
OFST160040 Type Of EU Directive EU-specific (UCITS/AIF) No
OFST160100 Legal Form SEC Series/Class CSV Entity Org Type Yes — CSV
OFST160150 Home Country Legal Type Of Fund SEC N-CEN Fund type classification Yes — TSV

Summary: 2 of 7 fields available. Most are EU-specific.


7. Classification (OFST350000-399999) - 12 fields

OF-ID Field Name Public Source Source Field / Method Structured?
OFST350009 Is Sharia Compliant Not in SEC data No
OFST350015 CFI Code OpenFIGI FIGI metadata includes CFI Partial
OFST350050 Clearstream Asset Category Proprietary classification No
OFST350100 EFAMA Main EFC Category EU classification system No
OFST351295 Is Money Market Fund SEC N-CEN + N-MFP Money market fund flag Yes — TSV
OFST351300 Money Market Type Of Fund SEC N-MFP Fund type in N-MFP data Yes — TSV

Major gap: There is no free, structured, universal fund asset class classification in SEC data. The SEC does not tag funds as "equity", "fixed income", "mixed", etc. in a single structured field. Asset class must be derived from:

  • Fund name heuristics ("Growth Fund" → equity, "Bond Fund" → fixed income)
  • N-PORT holdings data (aggregate asset types held)
  • XBRL strategy narrative text

This is a critical finding for LLM training: asset class classification is an extraction target, not ground truth.
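The first two derivation routes can be sketched as follows. The keyword rules, the N-PORT asset category codes (`EC`, `EP`, `DBT`, `STIV`) and the 80% dominance threshold are all illustrative assumptions, not an official taxonomy, and the holdings are invented.

```python
from collections import defaultdict

# Illustrative keyword rules, checked in order; first match wins.
NAME_RULES = [
    ("money market", "money market"),
    ("bond", "fixed income"),
    ("growth", "equity"),
    ("equity", "equity"),
]

# Assumed mapping from N-PORT asset category codes to a coarse asset class.
CATEGORY_TO_CLASS = {"EC": "equity", "EP": "equity", "DBT": "fixed income"}

def asset_class_from_name(fund_name):
    """Guess the asset class from keywords in the fund name, or return None."""
    name = fund_name.lower()
    for keyword, asset_class in NAME_RULES:
        if keyword in name:
            return asset_class
    return None

def asset_class_from_holdings(holdings, threshold=80.0):
    """Return the dominant class if it holds at least threshold percent, else 'mixed'."""
    weights = defaultdict(float)
    for holding in holdings:
        asset_class = CATEGORY_TO_CLASS.get(holding["asset_cat"], "other")
        weights[asset_class] += holding["pctVal"]
    total = sum(weights.values())
    if total <= 0:
        return None
    top_class, top_weight = max(weights.items(), key=lambda kv: kv[1])
    return top_class if 100.0 * top_weight / total >= threshold else "mixed"

holdings = [
    {"asset_cat": "EC", "pctVal": 60.0},
    {"asset_cat": "EC", "pctVal": 30.0},
    {"asset_cat": "DBT", "pctVal": 5.0},
    {"asset_cat": "STIV", "pctVal": 5.0},
]
print(asset_class_from_name("Example Growth Fund"))
print(asset_class_from_holdings(holdings))
```

Both functions produce labels of uncertain quality, which is consistent with treating asset class as an extraction target rather than as ground truth.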


8. Purchase Information / Settlement (OFST400000-449999) - 95 fields

This is the largest gap between openfunds and public data. Settlement and dealing information is almost entirely found only in prospectus text.

OF-ID Field Name Public Source Structured?
OFST400200 Minimal Initial Subscription Category No
OFST400230 Minimal Initial Subscription In Amount SEC XBRL Risk/Return Partial (MinimumInvestment element exists but inconsistently tagged)
OFST401002 Pricing Methodology No
OFST402500 Maximal Number Of Possible Decimals Shares No
OFST405521-405532 Subscription Trade Cycle / Dealing Days No
OFST410060 Cut-off Date Offset for Subscription No
OFST410100 Cut-off Time For Subscription No
OFST410700 Settlement Period For Subscription No
OFST410950 Has Lock-up For Redemption No
OFST420200-420265 Redemption Minimums/Maximums No
OFST420630 Bank Details (SSI for Payments) No
OFST425561-425572 Redemption Trade Cycle / Dealing Days No
OFST430100 Cut-off Time For Redemption No
OFST430150 Settlement Period For Redemption No

Summary: 0-1 of 95 fields available as structured data. Settlement cycles, cut-off times, dealing days, minimum investments, and payment details are exclusively in prospectus text. This is arguably the highest-value category for LLM extraction — these fields are critical for fund operations but exist only in legal documents.


9. Fees, Costs and Expenses (OFST450100-499999) - 62 fields

This is the strongest area for SEC structured data, thanks to the XBRL Risk/Return fee tables.

OF-ID Field Name Public Source Source Field Structured?
OFST451027 Has Performance Fee SEC XBRL Risk/Return Fee narrative/table Partial
OFST451030 Performance Fee in Prospectus SEC XBRL Risk/Return Not separately tagged Partial
OFST451305 Applied Subscription Fee SEC XBRL Risk/Return MaximumSalesChargeImposedOnPurchasesOverOfferingPrice Yes — XBRL
OFST451320 Maximum Subscription Fee SEC XBRL Risk/Return MaximumSalesChargeImposedOnPurchasesOverOfferingPrice Yes — XBRL
OFST451385 Has Early Redemption Fee SEC XBRL Risk/Return RedemptionFeeOverRedemption Yes — XBRL
OFST451390 Has CDSC Fee SEC XBRL Risk/Return MaximumDeferredSalesChargeOverOther Yes — XBRL
OFST451405 Redemption Fee SEC XBRL Risk/Return RedemptionFeeOverRedemption Yes — XBRL
OFST452000 Management Fee Applied SEC XBRL Risk/Return ManagementFeesOverAssets Yes — XBRL
OFST452100 TER Excluding Performance Fee SEC XBRL Risk/Return NetExpensesOverAssets (OER equivalent) Yes — XBRL
OFST452200 Ongoing Charges SEC XBRL Risk/Return TotalAnnualFundOperatingExpensesOverAssets Yes — XBRL
OFST453151 Is Trailer Fee Clean Not in SEC data No
OFST454150 Has Separate Distribution Fee SEC XBRL Risk/Return Distribution12b1FeesOverAssets Yes — XBRL
OFST454160 Distribution Fee SEC XBRL Risk/Return Distribution12b1FeesOverAssets Yes — XBRL
Fee Waiver / Reimbursement SEC XBRL Risk/Return FeeWaiverOrReimbursementOverAssets Yes — XBRL
Other Expenses SEC XBRL Risk/Return OtherExpensesOverAssets Yes — XBRL
Expense Example (1yr/3yr/5yr/10yr) SEC XBRL Risk/Return ExpenseExampleYear01 through Year10 Yes — XBRL

Key XBRL Fee Elements (complete shareholder fee table)

MaximumSalesChargeImposedOnPurchasesOverOfferingPrice
MaximumDeferredSalesChargeOverOther
MaximumSalesChargeOnReinvestedDividendsAndDistributionsOverOther
RedemptionFeeOverRedemption
MaximumAccountFee
ManagementFeesOverAssets
Distribution12b1FeesOverAssets
OtherExpensesOverAssets
AcquiredFundFeesAndExpensesOverAssets
TotalAnnualFundOperatingExpensesOverAssets
FeeWaiverOrReimbursementOverAssets
TotalAnnualFundOperatingExpensesAfterFeeWaiverOverAssets
ExpenseExampleYear01 / Year03 / Year05 / Year10
ExpenseExampleNoRedemptionYear01 / Year03 / Year05 / Year10

Summary: ~15 of 62 fee fields are available as structured XBRL data. The SEC fee taxonomy is detailed for US-style fees (sales charges, 12b-1, management fee, expense ratio) but does not cover European concepts like custodian fee breakdown, trailer fee clean status, or performance fee details (hurdle rate, high water mark).
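Because the fee table is tagged line by line, its internal arithmetic can be checked before the values are used as reference data. The sketch below uses the element names as listed above and invented values; it assumes ratios are stored as decimal fractions of assets and that the waiver is reported as a negative number, both of which should be verified against the actual data set.

```python
from decimal import Decimal

# Expense components that are expected to add up to the total expense ratio.
COMPONENTS = (
    "ManagementFeesOverAssets",
    "Distribution12b1FeesOverAssets",
    "OtherExpensesOverAssets",
    "AcquiredFundFeesAndExpensesOverAssets",
)

def check_fee_table(facts, tolerance=Decimal("0.0001")):
    """Compare the reported total with the component sum and derive the net ratio."""
    zero = Decimal("0")
    component_sum = sum((facts.get(name, zero) for name in COMPONENTS), zero)
    total = facts["TotalAnnualFundOperatingExpensesOverAssets"]
    # Assumption: the waiver is a negative fraction, so adding it lowers the total.
    net = total + facts.get("FeeWaiverOrReimbursementOverAssets", zero)
    return {
        "component_sum": component_sum,
        "net": net,
        "consistent": abs(component_sum - total) <= tolerance,
    }

# Invented share class: 0.50% management fee, 0.25% 12b-1 fee, and so on.
facts = {
    "ManagementFeesOverAssets": Decimal("0.0050"),
    "Distribution12b1FeesOverAssets": Decimal("0.0025"),
    "OtherExpensesOverAssets": Decimal("0.0015"),
    "AcquiredFundFeesAndExpensesOverAssets": Decimal("0.0001"),
    "TotalAnnualFundOperatingExpensesOverAssets": Decimal("0.0091"),
    "FeeWaiverOrReimbursementOverAssets": Decimal("-0.0006"),
}
result = check_fee_table(facts)
print(result)
```

`Decimal` is used instead of `float` so that small percentages compare exactly; the tolerance only absorbs rounding in the filed figures.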


10. Solvency II (OFST500000-519999) - 13 fields

OF-ID Field Name Public Source Structured?
All 13 fields SCR Market Risk, Tripartite Reports No — entirely EU insurance regulation

Summary: 0 of 13 fields available. Solvency II is a European directive not applicable to US SEC data.


11. Taxes (OFST800000-819999) - 27 fields

OF-ID Field Name Public Source Structured?
OFST809200 Is US Tax Forms W8/W9 Needed Prospectus text only
OFST809210 Is US K1 Reporting Required SEC N-CEN Partnership/LP fund flags
OFST809250 Is Flow-Through Entity By US Tax Law Prospectus text only
OFST809511 FATCA Status IRS data, not SEC structured
OFST809520 Subject To FATCA Withholding Prospectus text only
OFST801011 Is Austrian Tax Reporting Fund Austria-specific
OFST802001-802045 German Tax fields (8 fields) Germany-specific
OFST802500 Luxembourg Taxe d'Abonnement Luxembourg-specific
OFST808008-808100 Swiss Tax fields (3 fields) Switzerland-specific
OFST809015 Has UK Reporting Status UK-specific

Summary: 0-1 of 27 fields available. Tax fields are overwhelmingly jurisdiction-specific (DE, AT, CH, LU, UK, FR, ES) and not in SEC data. The few US-relevant fields (FATCA, K-1) are in prospectus text.


12. ESG Data (OFST820000-849999) - 65 fields

OF-ID Field Name Public Source Structured?
OFST820110-820280 Carbon Intensity / Footprint / Absolute GHG (18 fields) No — not yet required in SEC filings for funds
OFST820290-820360 Fossil Fuel Exposure (8 fields) No
OFST820370-820380 Net Zero Commitments (2 fields) No
OFST820390 Implied Temperature Rise No
OFST820440-820460 GHG Reduction Goals (3 fields) No
OFST820470-820540 Climate Stewardship (8 fields) No
OFST820600-820675 AMAS / ACT Signatory fields (8 fields) No — Swiss specific
OFST830000-830210 UK SDR fields (12 fields) No — UK specific
OFST001025 Is UN PRI Signatory UN PRI website Partial — searchable but not API

Current state of SEC ESG data: The SEC adopted climate disclosure rules in March 2024 but stayed them in April 2024 pending litigation, and in any case they target operating companies, not investment funds. The Investment Company Names Rule (addressing ESG fund naming) has compliance dates of June 2026 / December 2026. As of February 2026, there is no SEC-mandated structured ESG data for funds comparable to EU SFDR.

Summary: 0 of 65 ESG fields are available as structured public data from SEC. ESG fund data is available from commercial providers (Morningstar, MSCI, Sustainalytics) but not from any free structured public source.


13. Dynamic Data: Prices & AuM (OFDY000001-000999) - 20 fields

OF-ID Field Name Public Source Source Field Structured?
OFDY000010 Price Currency Implied USD for US funds Partial
OFDY000035 Valuation NAV SEC N-PORT Not directly; XBRL Company Facts for some Partial
OFDY000060 AuM Fund SEC N-PORT FUND_REPORTED_INFO.total_assets Yes — TSV
OFDY000070 AuM Share Class SEC N-PORT Per-class AuM when reported Partial
OFDY000075 NoS Share Class Not in SEC structured data No

Summary: 1-2 of 20 fields available. Daily NAV prices are not in SEC structured data (available from commercial sources). Fund-level AuM is in N-PORT.


14. Dynamic Data: Performance & Risk (OFDY025000-049999) - 4 fields

These 4 fields are Germany-specific (equity participation ratio, total fund asset share, etc.) and not available from SEC.

Additional performance data in SEC: While openfunds has few performance OFDY fields, SEC XBRL Risk/Return provides:

SEC Element Description Structured?
AnnualReturn20XX Calendar year annual returns (1yr-10yr) Yes (XBRL)
HighestQuarterlyReturnLabel/Value Best quarter return Yes — XBRL
LowestQuarterlyReturnLabel/Value Worst quarter return Yes — XBRL
AverageAnnualReturnYear01/05/10/SinceInception Average annual returns Yes — XBRL
BarChartClosingTextBlock Performance chart narrative Yes — XBRL (text)

And N-PORT provides:

N-PORT Field Description Structured?
MONTHLY_TOTAL_RETURN Monthly returns by class Yes — TSV
MONTHLY_RETURN_CAT_INSTRUMENT Returns by asset category Yes — TSV
FUND_VAR_INFO Value-at-Risk Yes — TSV
INTEREST_RATE_RISK DV01/DV100 by maturity bucket Yes — TSV

15. Portfolio Holdings (OFPH000001-999999) - 92 fields

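N-PORT is the primary source. Funds report portfolio holdings to the SEC monthly on Form N-PORT, although historically only the report for the third month of each fiscal quarter has been made public.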

OF-ID Field Name N-PORT Source Structured?
OFPH000010 Holding as at Date Reporting date Yes
OFPH000020 Portfolio Currency Fund currency context Yes (USD)
OFPH000100 Holding ISIN IDENTIFIERS table Yes
OFPH000130 Holding Ticker IDENTIFIERS table Yes
OFPH000145 Holding CUSIP IDENTIFIERS table Yes
OFPH000170 Holding FIGI No (use OpenFIGI to map)
OFPH000200 Holding Name FUND_REPORTED_HOLDING.name Yes
OFPH000210 Holding Instrument Type FUND_REPORTED_HOLDING.asset_cat Yes
OFPH000250 Holding Market Value FUND_REPORTED_HOLDING.balance + val_usd Yes
OFPH000300 Holding Net Weight as % FUND_REPORTED_HOLDING.pctVal Yes
OFPH000400 Holding Currency FUND_REPORTED_HOLDING.curCd Yes
OFPH000420 Holding Risk Country FUND_REPORTED_HOLDING.invCountry Yes
OFPH000430 Holding Asset Class FUND_REPORTED_HOLDING.asset_cat Yes
OFPH000440 Holding Credit Rating DEBT_SECURITY fields Yes
OFPH000450 Holding Number of Shares FUND_REPORTED_HOLDING.balance Yes
OFPH000460 Holding Coupon Rate DEBT_SECURITY fields Yes
OFPH000465 Holding Modified Duration No
OFPH000480 Holding Maturity Date DEBT_SECURITY fields Yes
OFPH000600-650 Interest Rate Type / Index / Margin DEBT_SECURITY fields Yes
OFPH000700 Holding Issuer Name FUND_REPORTED_HOLDING.issuerConditionalName Yes
OFPH000710 Holding Issuer LEI FUND_REPORTED_HOLDING.lei Yes
OFPH000712 Holding Issuer Domicile FUND_REPORTED_HOLDING.invCountry Yes
OFPH000730 Holding Strike Price Derivative tables Yes
OFPH000800-870 Underlying Asset fields Derivative tables Yes

Summary: ~35-40 of 92 fields available from N-PORT. The main gaps are: modified/effective duration (calculated, not reported), GICS sector codes (not in N-PORT directly), and European-specific fields (CIC, NACE, EUSIPA, WKN, Valor).


16. Fund Ratios and Exposures (OFRE000001-999999) - 42 fields

OF-ID Field Name Public Source Structured?
OFRE000010 Number Of Positions SEC N-PORT Derivable — count holdings
OFRE000200 Exposure To Cash SEC N-PORT Derivable — sum cash-type holdings
OFRE000300-320 Credit Quality fields SEC N-PORT Derivable — aggregate from holdings
OFRE000330 Average Effective Maturity Not directly in N-PORT
OFRE000335 Average Effective Duration Not directly in N-PORT
OFRE000350 Yield To Maturity Not in N-PORT
OFRE000500 Top Ten Positions SEC N-PORT Derivable — sort by weight
OFRE000520 Country Breakdown SEC N-PORT Derivable — aggregate by country
OFRE000540 Currency Breakdown SEC N-PORT Derivable — aggregate by currency
OFRE000560 GICS Equity Sector Breakdown GICS not in N-PORT
OFRE000570 Market Cap Breakdown Not in N-PORT
OFRE000580 Credit Rating Breakdown SEC N-PORT Derivable — aggregate by rating
OFRE000590 Maturity Breakdown SEC N-PORT Derivable — aggregate by maturity
OFRE000600 Asset Class Breakdown SEC N-PORT Derivable — aggregate by asset_cat

Summary: ~10-15 of 42 fields are derivable from N-PORT holdings data. Pre-computed ratios (YTM, duration, OAS) are not available.
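The "Derivable" rows above all reduce to counting, sorting, or grouping holdings. A minimal sketch, assuming each holding has been loaded into a dict keyed by the N-PORT field names used in the holdings section (`pctVal`, `curCd`, `invCountry`); the positions are invented:

```python
from collections import defaultdict

def breakdown(holdings, key):
    """Sum portfolio weights (pctVal) per distinct value of key, largest first."""
    totals = defaultdict(float)
    for holding in holdings:
        totals[holding[key]] += holding["pctVal"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

def top_positions(holdings, n=10):
    """Return the n holdings with the largest portfolio weight."""
    return sorted(holdings, key=lambda h: h["pctVal"], reverse=True)[:n]

# Invented portfolio of four positions.
HOLDINGS = [
    {"name": "Alpha Corp", "pctVal": 40.0, "curCd": "USD", "invCountry": "US"},
    {"name": "Beta AG", "pctVal": 25.0, "curCd": "EUR", "invCountry": "DE"},
    {"name": "Gamma Inc", "pctVal": 20.0, "curCd": "USD", "invCountry": "US"},
    {"name": "Delta SA", "pctVal": 15.0, "curCd": "EUR", "invCountry": "FR"},
]

number_of_positions = len(HOLDINGS)                      # OFRE000010
country_breakdown = breakdown(HOLDINGS, "invCountry")    # OFRE000520
currency_breakdown = breakdown(HOLDINGS, "curCd")        # OFRE000540
top_two = [h["name"] for h in top_positions(HOLDINGS, n=2)]  # OFRE000500 with n=10

print(number_of_positions, country_breakdown, currency_breakdown, top_two)
```

The same `breakdown` helper would cover the rating, maturity, and asset class breakdowns once the corresponding fields are joined onto each holding.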


17. Portfolio Manager Data (OFPM000001-999999) - 8 fields

OF-ID Field Name Public Source Structured?
OFPM000010 Portfolio Manager Name SEC XBRL Risk/Return PortfolioManager text block
OFPM000060 Portfolio Manager Brief Biography SEC XBRL Risk/Return SAI supplement text
Others Year of birth, experience, role Not structured

Summary: 0-1 of 8 fields. Portfolio manager data is in prospectus SAI text, not structured.


Grand Summary: Structured Data Availability by Category

Category Total Fields Structured Public Derivable Not Available
Company (service providers) 40 15 0 25
Umbrella 10 4 1 5
Fund (identity, structure) 73 15 10 48
Share Class 75 8 2 65
Listing 14 0 2 12
Legal Structure 7 2 0 5
Classification 12 2 0 10
Purchase / Settlement 95 0 1 94
Fees 62 15 0 47
Solvency II 13 0 0 13
Taxes 27 0 1 26
ESG 65 0 0 65
Prices / AuM 20 2 0 18
Performance / Risk 4 0 0 4
Portfolio Holdings 92 38 0 54
Ratios / Exposures 42 0 14 28
Portfolio Manager 8 0 1 7
TOTAL 659 ~101 (15%) ~32 (5%) ~526 (80%)
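The TOTAL row can be re-derived from the category rows. The snippet below copies the figures from the table above and recomputes the column sums and rounded percentages; it adds no new data.

```python
# (category, total, structured, derivable, not available), copied from the table.
ROWS = [
    ("Company (service providers)", 40, 15, 0, 25),
    ("Umbrella", 10, 4, 1, 5),
    ("Fund (identity, structure)", 73, 15, 10, 48),
    ("Share Class", 75, 8, 2, 65),
    ("Listing", 14, 0, 2, 12),
    ("Legal Structure", 7, 2, 0, 5),
    ("Classification", 12, 2, 0, 10),
    ("Purchase / Settlement", 95, 0, 1, 94),
    ("Fees", 62, 15, 0, 47),
    ("Solvency II", 13, 0, 0, 13),
    ("Taxes", 27, 0, 1, 26),
    ("ESG", 65, 0, 0, 65),
    ("Prices / AuM", 20, 2, 0, 18),
    ("Performance / Risk", 4, 0, 0, 4),
    ("Portfolio Holdings", 92, 38, 0, 54),
    ("Ratios / Exposures", 42, 0, 14, 28),
    ("Portfolio Manager", 8, 0, 1, 7),
]

# Column sums over all categories.
total, structured, derivable, not_available = (
    sum(row[i] for row in ROWS) for i in range(1, 5)
)
# Share of each availability bucket, rounded to whole percent.
shares = {
    "structured": round(100 * structured / total),
    "derivable": round(100 * derivable / total),
    "not_available": round(100 * not_available / total),
}
print(total, structured, derivable, not_available, shares)
```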

Implications for LLM Training Dataset

What this means:

  1. ~15% of openfunds fields have directly available structured public data (primarily from SEC EDGAR: XBRL fees, N-PORT holdings, N-CEN service providers, Series/Class CSV identifiers).

  2. ~5% are derivable from structured data (e.g., aggregating N-PORT holdings into country/currency/rating breakdowns, counting positions, inferring passive status from N-CEN index tracking).

  3. ~80% are NOT available as structured public data and exist only in prospectus narrative text.

The 80% gap = the LLM opportunity

The fields that are not available as structured data but are specified in prospectus text represent the core value proposition for LLM extraction:

Category Key Extraction Targets
Settlement / Dealing Cut-off times, settlement periods, dealing days, minimum subscriptions, pricing methodology
Currencies / Hedging Share class currency hedging, portfolio hedging, multicurrency options
Risk Limits Maximum leverage, redemption gates, lock-up periods, side pockets
Asset Class Fund classification (equity/bond/mixed/alternative), investment strategy
Fee Details Performance fee mechanics (hurdle rate, high water mark, crystallization), custodian fees
ESG Sustainability approach, climate targets, exclusion criteria
Tax FATCA status, K-1 requirements, flow-through entity status
  • Ground truth (structured data): Use SEC XBRL fees, N-PORT holdings, N-CEN service providers, and Series/Class CSV as verifiable reference data.
  • Extraction targets (unstructured → structured): Use the 80% of openfunds fields that exist only in prospectus text as the fields the LLM should learn to extract.
  • Validation: For the ~15% structured fields, compare LLM extraction from prospectus text against SEC structured data to measure extraction accuracy.
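The validation step can be sketched as a per-field exact-match score. The function names, the normalization rule, and the sample values are hypothetical; real comparisons for names and numeric fields would likely need fuzzier matching than this.

```python
def normalize(value):
    """Lower-case and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(value).lower().split())

def field_accuracy(extracted, reference):
    """Per-field exact-match accuracy over every value present in the reference data."""
    hits, counts = {}, {}
    for fund_id, ref_fields in reference.items():
        for field, ref_value in ref_fields.items():
            counts[field] = counts.get(field, 0) + 1
            got = extracted.get(fund_id, {}).get(field)
            if got is not None and normalize(got) == normalize(ref_value):
                hits[field] = hits.get(field, 0) + 1
    return {field: hits.get(field, 0) / n for field, n in counts.items()}

# Invented reference values (structured SEC data) and LLM extractions.
reference = {
    "S000000001": {"Custodian Bank Name": "Example Bank N.A.", "Management Fee Applied": "0.0050"},
    "S000000002": {"Custodian Bank Name": "Sample Trust Co"},
}
extracted = {
    "S000000001": {"Custodian Bank Name": "example bank  n.a.", "Management Fee Applied": "0.0045"},
    "S000000002": {"Custodian Bank Name": "Sample Trust Co"},
}
accuracy = field_accuracy(extracted, reference)
print(accuracy)
```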

Appendix: Data Source URLs

Source URL
SEC Series/Class CSV https://www.sec.gov/data-research/sec-markets-data/investment-company-series-class-information
SEC XBRL Risk/Return Data Sets https://www.sec.gov/data-research/sec-markets-data/mutual-fund-prospectus-riskreturn-summary-data-sets
SEC N-PORT Data Sets https://www.sec.gov/data-research/sec-markets-data/form-n-port-data-sets
SEC N-CEN Data Sets https://www.sec.gov/data-research/sec-markets-data/form-n-cen-data-sets
SEC Submissions API https://data.sec.gov/submissions/CIK{cik}.json
SEC XBRL Company Facts API https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json
GLEIF LEI Database https://search.gleif.org/ / https://www.gleif.org/en/lei-data/gleif-api
OpenFIGI API https://www.openfigi.com/api
SEC N-MFP (Money Market) https://www.sec.gov/data-research/sec-markets-data
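The `{cik}` placeholder in the two `data.sec.gov` URLs expects the CIK zero-padded to ten digits. A minimal sketch using only the standard library; the helper names are hypothetical, and the User-Agent string is a placeholder that should identify the actual caller, which SEC asks automated clients to do.

```python
import json
import urllib.request

def submissions_url(cik):
    """Fill the Submissions API template with a ten-digit, zero-padded CIK."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"

def company_facts_url(cik):
    """Fill the XBRL Company Facts template with a ten-digit, zero-padded CIK."""
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{int(cik):010d}.json"

def fetch_json(url, user_agent):
    """Download and decode one JSON document, identifying the caller via User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)

print(submissions_url(1234))
print(company_facts_url("0000001234"))
# Example call (needs network access, so it is left commented out):
# subs = fetch_json(submissions_url(1234), "Example Research contact@example.org")
```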