fund_rfid_data/OPENFUNDS_PUBLIC_DATA_SOURCES.md
Florian Herzog 1993658fb2 Add SEC fund prospectus -> RDF triple dataset pipeline
Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures, the dataset for the thesis 'Magical RDF Triples and
how to synthetize them'.

- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
  all books per trust), samples (per-fund segmentation, marker + plain
  serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:31:35 +02:00

# Openfunds Fields: Public Structured Data Availability
## Executive Summary
This document maps each openfunds field category to publicly available **structured data sources** — data that is machine-readable, downloadable, and free (or freely accessible via API). The focus is on fields describing the fund itself (asset class, settlement, risk, currencies, hedging, ESG, fees, etc.) rather than EU-specific regulatory fields.
### Key Public Structured Data Sources
| Source | Format | Access | Coverage | Cost |
|--------|--------|--------|----------|------|
| **SEC Series/Class CSV** | CSV | Direct download | ~100K+ US share classes | Free |
| **SEC XBRL Risk/Return** | XBRL → flat files | Quarterly download | All US mutual fund prospectuses | Free |
| **SEC N-PORT Data Sets** | XML → flat TSV | Quarterly download | Monthly holdings for all US funds | Free |
| **SEC N-CEN Data Sets** | XML → flat TSV | Annual filing, quarterly sets | Service providers, classification | Free |
| **SEC Submissions API** | JSON | REST API | All SEC filers | Free |
| **SEC XBRL Company Facts** | JSON | REST API | XBRL-tagged financial data | Free |
| **GLEIF LEI Database** | JSON/CSV | API + bulk download | 3.19M+ global entities | Free (CC0) |
| **OpenFIGI** | JSON | REST API | Hundreds of millions of instruments | Free |
---
## 1. Key Fact: Company (OFST001000-004999) — 40 fields
These fields identify the management company, custodian, transfer agent, auditor, and other service providers.
### Fields with Structured Public Data
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST001000 | Fund Group Name | SEC Submissions API | `subs.name` (entity name) | **Yes** — JSON |
| OFST001020 | ManCo | SEC N-CEN | ADVISOR table (investment adviser) | **Yes** — TSV |
| OFST001030 | LEI Of ManCo | GLEIF LEI Database | LEI lookup by entity name | **Yes** — JSON/CSV |
| OFST001035 | Domicile Of ManCo | GLEIF LEI Database | `entity.legalAddress.country` | **Yes** — JSON |
| OFST001050 | Fund Guarantor | — | Not in public structured data | No |
| OFST001055 | Address of ManCo | GLEIF LEI Database | `entity.legalAddress` | **Yes** — JSON |
| OFST001060 | City of ManCo | GLEIF LEI Database | `entity.legalAddress.city` | **Yes** — JSON |
| OFST001065 | Fund Website of ManCo | SEC Submissions API | `subs.website` | **Yes** — JSON |
| OFST001100 | Fund Promoter Name | — | Not publicly structured | No |
| OFST001105 | LEI of Fund Promoter | GLEIF LEI Database | If name known → LEI lookup | **Partial** |
| OFST001300 | Fund Administrator Name | SEC N-CEN | SERVICE_PROVIDER table | **Yes** — TSV |
| OFST001400 | Custodian Bank Name | SEC N-CEN | CUSTODIAN table | **Yes** — TSV |
| OFST001410 | LEI Of Custodian Bank | SEC N-CEN + GLEIF | N-CEN has LEI fields (since 2025) | **Yes** — TSV |
| OFST001415 | Domicile Of Custodian Bank | GLEIF LEI Database | Via custodian LEI | **Yes** — JSON |
| OFST001430 | Trustee Name | SEC EDGAR HTML filings | Unstructured (prospectus text) | No |
| OFST001450 | Portfolio Managing Company Name | SEC N-CEN | ADVISOR table + sub-advisors | **Yes** — TSV |
| OFST001500 | Fund Advisor Name | SEC N-CEN | ADVISOR table | **Yes** — TSV |
| OFST001510 | Sub-Investment Advisor Name | SEC N-CEN | Sub-advisor entries | **Yes** — TSV |
| OFST001600 | Auditor Name | SEC N-CEN | AUDITOR table | **Yes** — TSV |
| OFST002000 | Market Maker Name | — | Not publicly structured for funds | No |
| OFST002700 | Transfer Agent Name | SEC N-CEN | TRANSFER_AGENT table | **Yes** — TSV |
| OFST002900 | GIIN of Fund | — | IRS FATCA list (not easily matched) | No |
**Summary**: ~15 of 40 company fields are available as structured public data, primarily from SEC N-CEN (service providers) and GLEIF (entity LEI/address data).
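
The GLEIF-backed rows above (LEI, domicile, address, city) can all be read from a single LEI record. Below is a minimal sketch of that mapping, assuming the response follows the JSON:API layout of the GLEIF API's `lei-records` endpoint (`data[].attributes.entity.legalAddress`); this should be checked against the GLEIF documentation. The helper name `manco_fields_from_gleif` and the sample record are hypothetical, and a name-based lookup would still need manual review for ambiguous matches.

```python
from typing import Any, Optional


def manco_fields_from_gleif(response: dict[str, Any]) -> Optional[dict[str, str]]:
    """Map the first LEI record of a GLEIF-style response to openfunds ManCo fields."""
    records = response.get("data", [])
    if not records:
        return None  # no LEI record found for this entity name
    attrs = records[0]["attributes"]
    address = attrs["entity"]["legalAddress"]
    return {
        "OFST001030": attrs["lei"],                                 # LEI Of ManCo
        "OFST001035": address["country"],                           # Domicile Of ManCo
        "OFST001055": ", ".join(address.get("addressLines", [])),   # Address of ManCo
        "OFST001060": address["city"],                              # City of ManCo
    }


# Hypothetical record (assumed response shape, made-up values).
sample = {
    "data": [
        {
            "attributes": {
                "lei": "00000000000000000000",
                "entity": {
                    "legalAddress": {
                        "addressLines": ["1 Example Street", "Suite 100"],
                        "city": "Boston",
                        "country": "US",
                    }
                },
            }
        }
    ]
}

print(manco_fields_from_gleif(sample))
```
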
---
## 2. Key Fact: Umbrella (OFST005000-009999) — 10 fields
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST005000 | Has Umbrella | SEC Series/Class CSV | Inferred: multiple Series under same CIK | **Derivable** |
| OFST005010 | Umbrella | SEC Series/Class CSV | `Entity Name` (trust name) | **Yes** — CSV |
| OFST005015 | Domicile Of Umbrella | SEC Submissions API | `subs.stateOfIncorporation` | **Yes** — JSON |
| OFST005025 | CBI Code of Umbrella | — | Ireland-specific, not in SEC | No |
| OFST005030 | CSSF Code of Umbrella | — | Luxembourg-specific, not in SEC | No |
| OFST005040 | GIIN of Umbrella | — | Not publicly structured | No |
| OFST010035 | LEI Of Umbrella | GLEIF LEI Database | LEI lookup by trust name | **Yes** — JSON |
**Summary**: 4 of 10 fields available. Umbrella concept maps to SEC "Trust/Registrant" level.
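
The "multiple Series under same CIK" inference for `Has Umbrella` can be sketched as a simple group-by. The column headers `CIK` and `Series ID` and the sample rows are assumptions for illustration; the real Series/Class CSV has one row per share class and may label its columns differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical extract: one row per share class, so a series can repeat.
SAMPLE_CSV = """CIK,Entity Name,Series ID,Series Name
0000000001,Example Trust,S000000001,Example Growth Fund
0000000001,Example Trust,S000000001,Example Growth Fund
0000000001,Example Trust,S000000002,Example Bond Fund
0000000002,Standalone Fund Inc,S000000003,Standalone Fund
"""


def has_umbrella_by_cik(csv_text: str) -> dict[str, bool]:
    """Return OFST005000 per registrant: True if the CIK has more than one series."""
    series_by_cik: dict[str, set[str]] = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        series_by_cik[row["CIK"]].add(row["Series ID"])
    return {cik: len(series) > 1 for cik, series in series_by_cik.items()}


print(has_umbrella_by_cik(SAMPLE_CSV))
```

Counting distinct series IDs rather than rows keeps multi-class funds from being mistaken for umbrellas.
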
---
## 3. Key Fact: Fund (OFST010000-019999) — 73 fields
This is the richest category, covering fund identity, investment strategy, structure, currencies, hedging, and product type flags.
### 3A. Fund Identity & Dates
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST010010 | Fund Domicile Alpha-2 | SEC Submissions API | `subs.stateOfIncorporation` → derive | **Partial** (US state, not ISO) |
| OFST010020 | Legal Fund Name Including Umbrella | SEC Series/Class CSV | Concatenate Entity Name + Series Name | **Derivable** |
| OFST010030 | LEI Of Fund | GLEIF LEI Database | LEI search by fund name | **Yes** — JSON |
| OFST010110 | Legal Fund Name Only | SEC Series/Class CSV | `Series Name` | **Yes** — CSV |
| OFST010240 | Fund Launch Date | SEC XBRL Risk/Return | `InceptionDate` element | **Yes** — XBRL |
| OFST010250 | Fund Valuation Point | — | Prospectus text only | No |
| OFST010300 | Investment Objective | SEC XBRL Risk/Return | `ObjectivePrimaryTextBlock` | **Yes** — XBRL (text) |
| OFST010410 | Fund Currency | SEC N-PORT | `FUND_REPORTED_INFO.total_assets` currency context | **Partial** (all USD for US funds) |
| OFST010440 | Fiscal Year End | SEC Submissions API | `subs.fiscalYearEnd` (MMDD format) | **Yes** — JSON |
| OFST013000 | Prospectus Date | SEC Submissions API | Filing date of latest 485BPOS/N-1A | **Yes** — JSON |
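
Two of the rows above need a small transformation rather than a direct copy: the `MMDD` fiscal year end and the concatenated legal name. A minimal sketch follows, assuming the fiscal year end arrives as a four-digit string; the function names and the `" - "` separator are illustrative choices, not something openfunds or the SEC prescribes.

```python
from datetime import date


def parse_fiscal_year_end(mmdd: str, year: int) -> date:
    """Turn an 'MMDD' string such as '1231' into a date in the given year."""
    if len(mmdd) != 4 or not mmdd.isdigit():
        raise ValueError(f"expected MMDD, got {mmdd!r}")
    return date(year, int(mmdd[:2]), int(mmdd[2:]))


def legal_name_with_umbrella(entity_name: str, series_name: str) -> str:
    """OFST010020: trust (entity) name followed by the series name.

    The separator is an assumption made for this sketch.
    """
    return f"{entity_name.strip()} - {series_name.strip()}"


print(parse_fiscal_year_end("1231", 2025).isoformat())
print(legal_name_with_umbrella("Example Trust", "Example Growth Fund"))
```
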
### 3B. Fund Structure & Product Type Flags
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST010420 | Open-ended Or Closed-ended | SEC N-CEN | Fund type reported | **Yes** — TSV |
| OFST010500 | Is Fund Of Funds | SEC N-CEN | Fund-of-funds flag | **Yes** — TSV |
| OFST010580 | Is ETF | SEC N-CEN | ETF table presence | **Yes** — TSV |
| OFST010620 | Is Tokenized Fund | — | Not in SEC data | No |
| OFST010630 | Is Leveraged | SEC N-PORT | Borrowing data (Item B.2) | **Derivable** |
| OFST010635 | Maximum Leverage In Fund | — | Prospectus text only | No |
| OFST010640 | Has 130/30 Strategy | — | Prospectus text only | No |
| OFST010650 | Is REIT | SEC N-CEN + XBRL | Classification data | **Partial** |
| OFST010660 | Is ETC | — | US concept is different | No |
| OFST010665 | Is ETN | SEC N-CEN | Product type | **Partial** |
| OFST010670 | Is Short | — | Derivable from fund name/strategy | **Derivable** (heuristic) |
| OFST010690 | Is Life Fund | — | Not a US concept | No |
| OFST010695 | Is Pension Fund | — | Not in SEC fund data | No |
| OFST010720 | Is Passive Fund | SEC N-CEN | INDEX table (tracked index) | **Derivable** |
| OFST010730 | Management Approach Type | — | Prospectus text only | No |
### 3C. Currencies & Hedging
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST010205 | Has Duration Hedge | — | Prospectus text only | No |
| OFST010211 | Currency Hedge Portfolio | — | Prospectus text only | No |
| OFST010220 | Has Embedded Derivatives | SEC N-PORT | Derivatives tables (non-empty) | **Derivable** |
| OFST020261 | Currency Hedge Share Class | — | Prospectus text only | No |
| OFST020530 | Is Multicurrency Share Class | — | Prospectus text only | No |
| OFST020540 | Share Class Currency | SEC XBRL Risk/Return | Currency context in fee/performance tables | **Partial** (USD implied) |
**Currency/hedging fields are almost entirely prospectus-derived and NOT available as structured public data.** This is a key gap: US funds are almost all USD-denominated, and hedging is described in prospectus narrative text. For LLM training, these fields represent extraction targets.
### 3D. Replication & Securities Lending
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST010900 | Replication Methodology First Level | — | Prospectus text only (ETFs) | No |
| OFST010901 | Replication Methodology Second Level | — | Prospectus text only (ETFs) | No |
| OFST011000 | Has Securities Lending | SEC N-PORT | SECURITIES_LENDING + BORROWER tables | **Yes** — TSV |
| OFST011100 | Has Swap | SEC N-PORT | Swap derivative tables | **Derivable** |
| OFST011110 | Swap Counterparty Name | SEC N-PORT | Counterparty fields in swap tables | **Yes** — TSV |
**Summary for Fund section**: ~25 of 73 fields available as structured data. The major gaps are: currency hedging, replication methodology, valuation timing, management approach, and leverage limits — all prospectus-narrative fields.
---
## 4. Key Fact: Share Class (OFST020000-049999) — 75 fields
### 4A. Identifiers
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST020000 | ISIN | OpenFIGI | FIGI → ISIN mapping | **Yes** — JSON |
| OFST020005 | CUSIP | SEC Series/Class CSV | Not in the CSV directly; derivable from a US ISIN (characters 3-11) | **Partial** |
| OFST020020 | Bloomberg Code | — | Proprietary (not free) | No |
| OFST020025 | FIGI Code | OpenFIGI | Direct lookup by ticker/ISIN | **Yes** — JSON |
| OFST020040 | SEDOL | — | Proprietary (London Stock Exchange) | No |
| OFST020045 | NFN Identifier | — | Nasdaq proprietary | No |
| OFST020050 | Share Class Extension | SEC Series/Class CSV | `Class Name` (parse letter/suffix) | **Derivable** |
| OFST020060 | Full Share Class Name | SEC Series/Class CSV | `Series Name` + `Class Name` | **Yes** — CSV |
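
Deriving a CUSIP from an ISIN works because a US ISIN is the country prefix `US`, the nine-character CUSIP, and one check digit. The sketch below validates the check digit (Luhn over the letter-expanded digits) before slicing. The function names are illustrative, and the sample is Apple's common-stock ISIN, used only because it is widely published, not because it is a fund.

```python
from typing import Optional


def isin_is_valid(isin: str) -> bool:
    """Check the ISIN check digit: letters expand to 10..35, then Luhn mod 10."""
    isin = isin.upper()
    if len(isin) != 12 or not isin.isascii() or not isin.isalnum():
        return False
    if not isin[-1].isdigit():
        return False
    digits = "".join(str(int(ch, 36)) for ch in isin)  # 'U' -> 30, 'S' -> 28
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def cusip_from_isin(isin: str) -> Optional[str]:
    """Return the embedded CUSIP for a valid US ISIN, otherwise None."""
    isin = isin.upper()
    if not isin.startswith("US") or not isin_is_valid(isin):
        return None
    return isin[2:11]


print(cusip_from_isin("US0378331005"))
```
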
### 4B. Share Class Characteristics
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST020300 | Valuation Frequency | — | Prospectus text only | No |
| OFST020400 | Share Class Distribution Policy | SEC XBRL Risk/Return | Derivable from dividend narrative | **Partial** |
| OFST020540 | Share Class Currency | — | Implied USD for US funds | **Partial** |
| OFST020545 | Share Class Lifecycle | SEC Submissions API | Filing history + Series/Class CSV status | **Derivable** |
| OFST020560 | Share Class Launch Date | SEC XBRL Risk/Return | `InceptionDate` per share class | **Yes** — XBRL |
| OFST020566 | Termination Date | SEC Series/Class CSV | Class status (active/inactive) | **Partial** |
| OFST020580 | Is Share Class Eligible For UCITS | — | Not applicable to US funds | No |
| OFST023100 | Investment Status | — | Prospectus text only | No |
| OFST023200 | Benchmark | SEC XBRL Risk/Return | `IndexNoDeductionForFeesExpensesTaxes` | **Yes** — XBRL |
| OFST023800 | Index Name (ETF) | SEC N-CEN | INDEX table | **Yes** — TSV |
| OFST024000 | SRRI | — | EU-specific risk indicator | No |
**Summary**: ~10 of 75 fields available. Share class operational details (valuation frequency, dealing days, settlement cycles) are entirely prospectus-derived.
---
## 5. Key Fact: Listing (OFST060000-064999) — 14 fields
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST060000 | Bloomberg Code Of Listing | — | Proprietary | No |
| OFST062000 | Listing Date | — | Exchange data (not SEC) | No |
| OFST062010 | Listing Currency | — | Implied USD for US-listed | **Partial** |
| OFST062025 | Launch Price | SEC XBRL Risk/Return | Inception price context | **Partial** |
| OFST062030 | Market Identifier Code | — | Not in SEC data directly | No |
| OFST062040 | Exchange Place | SEC N-CEN (ETFs) | Exchange information for ETFs | **Partial** |
**Summary**: 0-2 fields fully structured. Listing data is primarily from exchanges, not SEC filings.
---
## 6. Legal Structure (OFST160000-164999) — 7 fields
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST160039 | Is EU Directive Relevant | — | EU-specific | No |
| OFST160040 | Type Of EU Directive | — | EU-specific (UCITS/AIF) | No |
| OFST160100 | Legal Form | SEC Series/Class CSV | `Entity Org Type` | **Yes** — CSV |
| OFST160150 | Home Country Legal Type Of Fund | SEC N-CEN | Fund type classification | **Yes** — TSV |
**Summary**: 2 of 7 fields available. Most are EU-specific.
---
## 7. Classification (OFST350000-399999) — 12 fields
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST350009 | Is Sharia Compliant | — | Not in SEC data | No |
| OFST350015 | CFI Code | OpenFIGI | FIGI metadata includes CFI | **Partial** |
| OFST350050 | Clearstream Asset Category | — | Proprietary classification | No |
| OFST350100 | EFAMA Main EFC Category | — | EU classification system | No |
| OFST351295 | Is Money Market Fund | SEC N-CEN + N-MFP | Money market fund flag | **Yes** — TSV |
| OFST351300 | Money Market Type Of Fund | SEC N-MFP | Fund type in N-MFP data | **Yes** — TSV |
**Major gap**: There is **no free, structured, universal fund asset class classification** in SEC data. The SEC does not tag funds as "equity", "fixed income", "mixed", etc. in a single structured field. Asset class must be derived from:
- Fund name heuristics ("Growth Fund" → equity, "Bond Fund" → fixed income)
- N-PORT holdings data (aggregate asset types held)
- XBRL strategy narrative text
This is a critical finding for LLM training: **asset class classification is an extraction target, not ground truth.**
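
The fund-name heuristic from the list above can be sketched as an ordered keyword lookup. The keyword lists and class labels are illustrative assumptions, not an official taxonomy, and the output should be treated as a weak label to be checked against holdings or prospectus text.

```python
import re
from typing import Optional

# Ordered rules: earlier entries win, so "Growth Allocation Fund" maps to mixed.
# Keywords are illustrative only.
RULES = [
    ("money market", ("money market",)),
    ("fixed income", ("bond", "fixed income", "treasury", "municipal")),
    ("mixed", ("balanced", "allocation")),
    ("equity", ("equity", "growth", "stock", "value")),
]


def asset_class_from_name(fund_name: str) -> Optional[str]:
    """Guess an asset class from whole-word keyword matches in the fund name."""
    name = fund_name.lower()
    for asset_class, keywords in RULES:
        if any(re.search(rf"\b{re.escape(kw)}\b", name) for kw in keywords):
            return asset_class
    return None  # no keyword matched; leave the field unlabeled


for example in ("Example Growth Fund", "Example Bond Fund", "Example Opportunities Fund"):
    print(example, "->", asset_class_from_name(example))
```

Returning `None` instead of a default class keeps unmatched names out of any reference set built from this heuristic.
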
---
## 8. Purchase Information / Settlement (OFST400000-449999) — 95 fields
This is the **largest gap** between openfunds and public data. Settlement and dealing information is found almost exclusively in prospectus text.
| OF-ID | Field Name | Public Source | Structured? |
|-------|-----------|--------------|-------------|
| OFST400200 | Minimal Initial Subscription Category | — | No |
| OFST400230 | Minimal Initial Subscription In Amount | SEC XBRL Risk/Return | **Partial** (`MinimumInvestment` element exists but is inconsistently tagged) |
| OFST401002 | Pricing Methodology | — | No |
| OFST402500 | Maximal Number Of Possible Decimals Shares | — | No |
| OFST405521-405532 | Subscription Trade Cycle / Dealing Days | — | No |
| OFST410060 | Cut-off Date Offset for Subscription | — | No |
| OFST410100 | Cut-off Time For Subscription | — | No |
| OFST410700 | Settlement Period For Subscription | — | No |
| OFST410950 | Has Lock-up For Redemption | — | No |
| OFST420200-420265 | Redemption Minimums/Maximums | — | No |
| OFST420630 | Bank Details (SSI for Payments) | — | No |
| OFST425561-425572 | Redemption Trade Cycle / Dealing Days | — | No |
| OFST430100 | Cut-off Time For Redemption | — | No |
| OFST430150 | Settlement Period For Redemption | — | No |
**Summary**: **0-1 of 95 fields** available as structured data. Settlement cycles, cut-off times, dealing days, minimum investments, and payment details are exclusively in prospectus text. This is arguably the highest-value category for LLM extraction — these fields are critical for fund operations but exist only in legal documents.
---
## 9. Fees, Costs and Expenses (OFST450100-499999) — 62 fields
This is the **strongest area for SEC structured data**, thanks to the XBRL Risk/Return fee tables.
| OF-ID | Field Name | Public Source | Source Field | Structured? |
|-------|-----------|--------------|-------------|-------------|
| OFST451027 | Has Performance Fee | SEC XBRL Risk/Return | Fee narrative/table | **Partial** |
| OFST451030 | Performance Fee in Prospectus | SEC XBRL Risk/Return | Not separately tagged | **Partial** |
| OFST451305 | Applied Subscription Fee | SEC XBRL Risk/Return | `MaximumSalesChargeImposedOnPurchasesOverOfferingPrice` | **Yes** — XBRL |
| OFST451320 | Maximum Subscription Fee | SEC XBRL Risk/Return | `MaximumSalesChargeImposedOnPurchasesOverOfferingPrice` | **Yes** — XBRL |
| OFST451385 | Has Early Redemption Fee | SEC XBRL Risk/Return | `RedemptionFeeOverRedemption` | **Yes** — XBRL |
| OFST451390 | Has CDSC Fee | SEC XBRL Risk/Return | `MaximumDeferredSalesChargeOverOther` | **Yes** — XBRL |
| OFST451405 | Redemption Fee | SEC XBRL Risk/Return | `RedemptionFeeOverRedemption` | **Yes** — XBRL |
| OFST452000 | Management Fee Applied | SEC XBRL Risk/Return | `ManagementFeesOverAssets` | **Yes** — XBRL |
| OFST452100 | TER Excluding Performance Fee | SEC XBRL Risk/Return | `NetExpensesOverAssets` (OER equivalent) | **Yes** — XBRL |
| OFST452200 | Ongoing Charges | SEC XBRL Risk/Return | `TotalAnnualFundOperatingExpensesOverAssets` | **Yes** — XBRL |
| OFST453151 | Is Trailer Fee Clean | — | Not in SEC data | No |
| OFST454150 | Has Separate Distribution Fee | SEC XBRL Risk/Return | `Distribution12b1FeesOverAssets` | **Yes** — XBRL |
| OFST454160 | Distribution Fee | SEC XBRL Risk/Return | `Distribution12b1FeesOverAssets` | **Yes** — XBRL |
| — | Fee Waiver / Reimbursement | SEC XBRL Risk/Return | `FeeWaiverOrReimbursementOverAssets` | **Yes** — XBRL |
| — | Other Expenses | SEC XBRL Risk/Return | `OtherExpensesOverAssets` | **Yes** — XBRL |
| — | Expense Example (1yr/3yr/5yr/10yr) | SEC XBRL Risk/Return | `ExpenseExampleYear01` through `Year10` | **Yes** — XBRL |
### Key XBRL Fee Elements (complete shareholder fee table)
```
MaximumSalesChargeImposedOnPurchasesOverOfferingPrice
MaximumDeferredSalesChargeOverOther
MaximumSalesChargeOnReinvestedDividendsAndDistributionsOverOther
RedemptionFeeOverRedemption
MaximumAccountFee
ManagementFeesOverAssets
Distribution12b1FeesOverAssets
OtherExpensesOverAssets
AcquiredFundFeesAndExpensesOverAssets
TotalAnnualFundOperatingExpensesOverAssets
FeeWaiverOrReimbursementOverAssets
TotalAnnualFundOperatingExpensesAfterFeeWaiverOverAssets
ExpenseExampleYear01 / Year03 / Year05 / Year10
ExpenseExampleNoRedemptionYear01 / Year03 / Year05 / Year10
```
**Summary**: ~15 of 62 fee fields are available as structured XBRL data. The SEC fee taxonomy is detailed for US-style fees (sales charges, 12b-1, management fee, expense ratio) but does not cover European concepts like custodian fee breakdown, trailer fee clean status, or performance fee details (hurdle rate, high water mark).
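
The fee mapping above separates rate fields (copied as-is) from boolean flags (derived from whether a fee element is non-zero). A minimal sketch follows, assuming the XBRL facts for one share class have already been loaded into a dict keyed by element name. The element names are copied from the table above and have not been checked against the current Risk/Return taxonomy. Treating a missing element as "no fee" is an assumption that may not hold for untagged filings.

```python
# openfunds rate field -> XBRL element (names as listed in this document).
RATE_FIELDS = {
    "OFST451320": "MaximumSalesChargeImposedOnPurchasesOverOfferingPrice",
    "OFST451405": "RedemptionFeeOverRedemption",
    "OFST452000": "ManagementFeesOverAssets",
    "OFST452100": "NetExpensesOverAssets",
    "OFST452200": "TotalAnnualFundOperatingExpensesOverAssets",
    "OFST454160": "Distribution12b1FeesOverAssets",
}

# openfunds boolean flag -> XBRL element whose non-zero value sets the flag.
FLAG_FIELDS = {
    "OFST451385": "RedemptionFeeOverRedemption",
    "OFST451390": "MaximumDeferredSalesChargeOverOther",
    "OFST454150": "Distribution12b1FeesOverAssets",
}


def map_fees(facts: dict[str, float]) -> dict[str, object]:
    """Build openfunds fee fields from a dict of XBRL element values."""
    out: dict[str, object] = {}
    for of_id, element in RATE_FIELDS.items():
        if element in facts:  # leave the field out when the element is not tagged
            out[of_id] = facts[element]
    for of_id, element in FLAG_FIELDS.items():
        # Assumption: an absent element is read as "fee not charged".
        out[of_id] = facts.get(element, 0.0) != 0.0
    return out


# Hypothetical share class: 0.75% management fee, 0.25% 12b-1 fee, no loads.
sample_facts = {
    "ManagementFeesOverAssets": 0.0075,
    "Distribution12b1FeesOverAssets": 0.0025,
    "TotalAnnualFundOperatingExpensesOverAssets": 0.0110,
}
print(map_fees(sample_facts))
```
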
---
## 10. Solvency II (OFST500000-519999) — 13 fields
| OF-ID | Field Name | Public Source | Structured? |
|-------|-----------|--------------|-------------|
| All 13 fields | SCR Market Risk, Tripartite Reports | — | **No** — entirely EU insurance regulation |
**Summary**: 0 of 13 fields available. Solvency II is a European directive not applicable to US SEC data.
---
## 11. Taxes (OFST800000-819999) — 27 fields
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST809200 | Is US Tax Forms W8/W9 Needed | — | Prospectus text only | No |
| OFST809210 | Is US K1 Reporting Required | SEC N-CEN | Partnership/LP fund flags | **Partial** |
| OFST809250 | Is Flow-Through Entity By US Tax Law | — | Prospectus text only | No |
| OFST809511 | FATCA Status | — | IRS data, not SEC structured | No |
| OFST809520 | Subject To FATCA Withholding | — | Prospectus text only | No |
| OFST801011 | Is Austrian Tax Reporting Fund | — | Austria-specific | No |
| OFST802001-802045 | German Tax fields (8 fields) | — | Germany-specific | No |
| OFST802500 | Luxembourg Taxe d'Abonnement | — | Luxembourg-specific | No |
| OFST808008-808100 | Swiss Tax fields (3 fields) | — | Switzerland-specific | No |
| OFST809015 | Has UK Reporting Status | — | UK-specific | No |
**Summary**: 0-1 of 27 fields available. Tax fields are overwhelmingly jurisdiction-specific (DE, AT, CH, LU, UK, FR, ES) and not in SEC data. The few US-relevant fields (FATCA, K-1) are in prospectus text.
---
## 12. ESG Data (OFST820000-849999) — 65 fields
| OF-ID | Field Name | Public Source | Structured? |
|-------|-----------|--------------|-------------|
| OFST820110-820280 | Carbon Intensity / Footprint / Absolute GHG (18 fields) | — | **No** — not yet required in SEC filings for funds |
| OFST820290-820360 | Fossil Fuel Exposure (8 fields) | — | **No** |
| OFST820370-820380 | Net Zero Commitments (2 fields) | — | **No** |
| OFST820390 | Implied Temperature Rise | — | **No** |
| OFST820440-820460 | GHG Reduction Goals (3 fields) | — | **No** |
| OFST820470-820540 | Climate Stewardship (8 fields) | — | **No** |
| OFST820600-820675 | AMAS / ACT Signatory fields (8 fields) | — | **No** — Swiss specific |
| OFST830000-830210 | UK SDR fields (12 fields) | — | **No** — UK specific |
| OFST001025 | Is UN PRI Signatory | UN PRI website | **Partial** — searchable but not API |
**Current state of SEC ESG data**: The SEC adopted climate disclosure rules in March 2024 but stayed them in April 2024 pending litigation; in any case they target operating companies, **not investment funds**. The Investment Company Names Rule (addressing ESG fund naming) has compliance dates of June 2026 / December 2026. As of February 2026, there is no SEC-mandated structured ESG data for funds comparable to EU SFDR.
**Summary**: **0 of 65 ESG fields** are available as structured public data from SEC. ESG fund data is available from commercial providers (Morningstar, MSCI, Sustainalytics) but not from any free structured public source.
---
## 13. Dynamic Data: Prices & AuM (OFDY000001-000999) — 20 fields
| OF-ID | Field Name | Public Source | Source Field | Structured? |
|-------|-----------|--------------|-------------|-------------|
| OFDY000010 | Price Currency | — | Implied USD for US funds | **Partial** |
| OFDY000035 | Valuation NAV | SEC N-PORT | Not directly; XBRL Company Facts for some | **Partial** |
| OFDY000060 | AuM Fund | SEC N-PORT | `FUND_REPORTED_INFO.total_assets` | **Yes** — TSV |
| OFDY000070 | AuM Share Class | SEC N-PORT | Per-class AuM when reported | **Partial** |
| OFDY000075 | NoS Share Class | — | Not in SEC structured data | No |
**Summary**: 1-2 of 20 fields available. Daily NAV prices are not in SEC structured data (available from commercial sources). Fund-level AuM is in N-PORT.
---
## 14. Dynamic Data: Performance & Risk (OFDY025000-049999) — 4 fields
These 4 fields are Germany-specific (equity participation ratio, total fund asset share, etc.) and not available from SEC.
**Additional performance data in SEC**: While openfunds has few performance OFDY fields, SEC XBRL Risk/Return provides:
| SEC Element | Description | Structured? |
|-------------|-------------|-------------|
| `AnnualReturn20XX` | Calendar year annual returns (1yr-10yr) | **Yes** — XBRL |
| `HighestQuarterlyReturnLabel/Value` | Best quarter return | **Yes** — XBRL |
| `LowestQuarterlyReturnLabel/Value` | Worst quarter return | **Yes** — XBRL |
| `AverageAnnualReturnYear01/05/10/SinceInception` | Average annual returns | **Yes** — XBRL |
| `BarChartClosingTextBlock` | Performance chart narrative | **Yes** — XBRL (text) |
And N-PORT provides:
| N-PORT Field | Description | Structured? |
|-------------|-------------|-------------|
| `MONTHLY_TOTAL_RETURN` | Monthly returns by class | **Yes** — TSV |
| `MONTHLY_RETURN_CAT_INSTRUMENT` | Returns by asset category | **Yes** — TSV |
| `FUND_VAR_INFO` | Value-at-Risk | **Yes** — TSV |
| `INTEREST_RATE_RISK` | DV01/DV100 by maturity bucket | **Yes** — TSV |
---
## 15. Portfolio Holdings (OFPH000001-999999) — 92 fields
N-PORT is the primary source. SEC requires monthly portfolio disclosure.
| OF-ID | Field Name | N-PORT Source | Structured? |
|-------|-----------|--------------|-------------|
| OFPH000010 | Holding as at Date | Reporting date | **Yes** |
| OFPH000020 | Portfolio Currency | Fund currency context | **Yes** (USD) |
| OFPH000100 | Holding ISIN | IDENTIFIERS table | **Yes** |
| OFPH000130 | Holding Ticker | IDENTIFIERS table | **Yes** |
| OFPH000145 | Holding CUSIP | IDENTIFIERS table | **Yes** |
| OFPH000170 | Holding FIGI | — | No (use OpenFIGI to map) |
| OFPH000200 | Holding Name | `FUND_REPORTED_HOLDING.name` | **Yes** |
| OFPH000210 | Holding Instrument Type | `FUND_REPORTED_HOLDING.asset_cat` | **Yes** |
| OFPH000250 | Holding Market Value | `FUND_REPORTED_HOLDING.balance` + `val_usd` | **Yes** |
| OFPH000300 | Holding Net Weight as % | `FUND_REPORTED_HOLDING.pctVal` | **Yes** |
| OFPH000400 | Holding Currency | `FUND_REPORTED_HOLDING.curCd` | **Yes** |
| OFPH000420 | Holding Risk Country | `FUND_REPORTED_HOLDING.invCountry` | **Yes** |
| OFPH000430 | Holding Asset Class | `FUND_REPORTED_HOLDING.asset_cat` | **Yes** |
| OFPH000440 | Holding Credit Rating | DEBT_SECURITY fields | **Yes** |
| OFPH000450 | Holding Number of Shares | `FUND_REPORTED_HOLDING.balance` | **Yes** |
| OFPH000460 | Holding Coupon Rate | DEBT_SECURITY fields | **Yes** |
| OFPH000465 | Holding Modified Duration | — | No |
| OFPH000480 | Holding Maturity Date | DEBT_SECURITY fields | **Yes** |
| OFPH000600-650 | Interest Rate Type / Index / Margin | DEBT_SECURITY fields | **Yes** |
| OFPH000700 | Holding Issuer Name | `FUND_REPORTED_HOLDING.issuerConditionalName` | **Yes** |
| OFPH000710 | Holding Issuer LEI | `FUND_REPORTED_HOLDING.lei` | **Yes** |
| OFPH000712 | Holding Issuer Domicile | `FUND_REPORTED_HOLDING.invCountry` | **Yes** |
| OFPH000730 | Holding Strike Price | Derivative tables | **Yes** |
| OFPH000800-870 | Underlying Asset fields | Derivative tables | **Yes** |
**Summary**: ~35-40 of 92 fields available from N-PORT. The main gaps are: modified/effective duration (calculated, not reported), GICS sector codes (not in N-PORT directly), and European-specific fields (CIC, NACE, EUSIPA, WKN, Valor).
---
## 16. Fund Ratios and Exposures (OFRE000001-999999) — 42 fields
| OF-ID | Field Name | Public Source | Structured? |
|-------|-----------|--------------|-------------|
| OFRE000010 | Number Of Positions | SEC N-PORT | **Derivable** — count holdings |
| OFRE000200 | Exposure To Cash | SEC N-PORT | **Derivable** — sum cash-type holdings |
| OFRE000300-320 | Credit Quality fields | SEC N-PORT | **Derivable** — aggregate from holdings |
| OFRE000330 | Average Effective Maturity | — | No (not directly in N-PORT) |
| OFRE000335 | Average Effective Duration | — | No (not directly in N-PORT) |
| OFRE000350 | Yield To Maturity | — | No (not in N-PORT) |
| OFRE000500 | Top Ten Positions | SEC N-PORT | **Derivable** — sort by weight |
| OFRE000520 | Country Breakdown | SEC N-PORT | **Derivable** — aggregate by country |
| OFRE000540 | Currency Breakdown | SEC N-PORT | **Derivable** — aggregate by currency |
| OFRE000560 | GICS Equity Sector Breakdown | — | No (GICS not in N-PORT) |
| OFRE000570 | Market Cap Breakdown | — | No (not in N-PORT) |
| OFRE000580 | Credit Rating Breakdown | SEC N-PORT | **Derivable** — aggregate by rating |
| OFRE000590 | Maturity Breakdown | SEC N-PORT | **Derivable** — aggregate by maturity |
| OFRE000600 | Asset Class Breakdown | SEC N-PORT | **Derivable** — aggregate by asset_cat |
**Summary**: ~10-15 of 42 fields are derivable from N-PORT holdings data. Pre-computed ratios (YTM, duration, OAS) are not available.
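
The "Derivable" rows above are all aggregations over per-holding rows. A minimal sketch follows, assuming each holding has already been loaded as a dict with the column names used in section 15 (`name`, `pctVal`, `invCountry`, `curCd`, `asset_cat`). The actual N-PORT flat-file headers may differ, and the sample holdings and category codes are made up.

```python
from collections import defaultdict


def fund_ratios(holdings: list[dict], top_n: int = 10) -> dict:
    """Derive a few OFRE-style aggregates from per-holding rows."""

    def breakdown(key: str) -> dict[str, float]:
        # Sum portfolio weights per distinct value of the given column.
        totals: dict[str, float] = defaultdict(float)
        for h in holdings:
            totals[h[key]] += h["pctVal"]
        return dict(totals)

    ranked = sorted(holdings, key=lambda h: h["pctVal"], reverse=True)
    return {
        "OFRE000010": len(holdings),                         # Number Of Positions
        "OFRE000500": [h["name"] for h in ranked[:top_n]],   # Top Ten Positions
        "OFRE000520": breakdown("invCountry"),               # Country Breakdown
        "OFRE000540": breakdown("curCd"),                    # Currency Breakdown
        "OFRE000600": breakdown("asset_cat"),                # Asset Class Breakdown
    }


# Hypothetical holdings; weights are percentages of net assets.
sample_holdings = [
    {"name": "Alpha Corp", "pctVal": 50.0, "invCountry": "US", "curCd": "USD", "asset_cat": "EC"},
    {"name": "Beta AG", "pctVal": 20.0, "invCountry": "DE", "curCd": "EUR", "asset_cat": "EC"},
    {"name": "Gamma Note", "pctVal": 30.0, "invCountry": "US", "curCd": "USD", "asset_cat": "DBT"},
]
print(fund_ratios(sample_holdings))
```
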
---
## 17. Portfolio Manager Data (OFPM000001-999999) — 8 fields
| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFPM000010 | Portfolio Manager Name | SEC XBRL Risk/Return | `PortfolioManager` text block | **Partial** (text, not structured) |
| OFPM000060 | Portfolio Manager Brief Biography | SEC XBRL Risk/Return | SAI supplement text | **Partial** (text) |
| Others | Year of birth, experience, role | — | Not structured | No |
**Summary**: 0-1 of 8 fields. Portfolio manager data is in prospectus SAI text, not structured.
---
## Grand Summary: Structured Data Availability by Category
| Category | Total Fields | Structured Public | Derivable | Not Available |
|----------|-------------|------------------|-----------|---------------|
| **Company** (service providers) | 40 | **15** | 0 | 25 |
| **Umbrella** | 10 | **4** | 1 | 5 |
| **Fund** (identity, structure) | 73 | **15** | 10 | 48 |
| **Share Class** | 75 | **8** | 2 | 65 |
| **Listing** | 14 | **0** | 2 | 12 |
| **Legal Structure** | 7 | **2** | 0 | 5 |
| **Classification** | 12 | **2** | 0 | 10 |
| **Purchase / Settlement** | 95 | **0** | 1 | 94 |
| **Fees** | 62 | **15** | 0 | 47 |
| **Solvency II** | 13 | **0** | 0 | 13 |
| **Taxes** | 27 | **0** | 1 | 26 |
| **ESG** | 65 | **0** | 0 | 65 |
| **Prices / AuM** | 20 | **2** | 0 | 18 |
| **Performance / Risk** | 4 | **0** | 0 | 4 |
| **Portfolio Holdings** | 92 | **38** | 0 | 54 |
| **Ratios / Exposures** | 42 | **0** | 14 | 28 |
| **Portfolio Manager** | 8 | **0** | 1 | 7 |
| **TOTAL** | **659** | **~101 (15%)** | **~32 (5%)** | **~526 (80%)** |
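
The TOTAL row can be re-derived from the per-category rows. The short check below copies the numbers from the table above, confirms that each row's three availability columns add up to its field count, and recomputes the rounded percentages.

```python
# (category, total, structured, derivable, not_available), copied from the table above.
ROWS = [
    ("Company", 40, 15, 0, 25),
    ("Umbrella", 10, 4, 1, 5),
    ("Fund", 73, 15, 10, 48),
    ("Share Class", 75, 8, 2, 65),
    ("Listing", 14, 0, 2, 12),
    ("Legal Structure", 7, 2, 0, 5),
    ("Classification", 12, 2, 0, 10),
    ("Purchase / Settlement", 95, 0, 1, 94),
    ("Fees", 62, 15, 0, 47),
    ("Solvency II", 13, 0, 0, 13),
    ("Taxes", 27, 0, 1, 26),
    ("ESG", 65, 0, 0, 65),
    ("Prices / AuM", 20, 2, 0, 18),
    ("Performance / Risk", 4, 0, 0, 4),
    ("Portfolio Holdings", 92, 38, 0, 54),
    ("Ratios / Exposures", 42, 0, 14, 28),
    ("Portfolio Manager", 8, 0, 1, 7),
]

# Every row must partition its field count into the three availability buckets.
for name, total, structured, derivable, missing in ROWS:
    assert total == structured + derivable + missing, name

total = sum(r[1] for r in ROWS)
structured = sum(r[2] for r in ROWS)
derivable = sum(r[3] for r in ROWS)
missing = sum(r[4] for r in ROWS)

print(total, structured, derivable, missing)
print(round(100 * structured / total), round(100 * derivable / total), round(100 * missing / total))
```
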
---
## Implications for LLM Training Dataset
### What this means:
1. **~15% of openfunds fields** have directly available structured public data (primarily from SEC EDGAR: XBRL fees, N-PORT holdings, N-CEN service providers, Series/Class CSV identifiers).
2. **~5% are derivable** from structured data (e.g., aggregating N-PORT holdings into country/currency/rating breakdowns, counting positions, inferring passive status from the N-CEN INDEX table).
3. **~80% are NOT available** as structured public data and exist only in prospectus narrative text.
### The 80% gap = the LLM opportunity
The fields that are **not** available as structured data but **are** specified in prospectus text represent the core value proposition for LLM extraction:
| Category | Key Extraction Targets |
|----------|----------------------|
| **Settlement / Dealing** | Cut-off times, settlement periods, dealing days, minimum subscriptions, pricing methodology |
| **Currencies / Hedging** | Share class currency hedging, portfolio hedging, multicurrency options |
| **Risk Limits** | Maximum leverage, redemption gates, lock-up periods, side pockets |
| **Asset Class** | Fund classification (equity/bond/mixed/alternative), investment strategy |
| **Fee Details** | Performance fee mechanics (hurdle rate, high water mark, crystallization), custodian fees |
| **ESG** | Sustainability approach, climate targets, exclusion criteria |
| **Tax** | FATCA status, K-1 requirements, flow-through entity status |
### Recommended approach for training data:
- **Ground truth (structured data)**: Use SEC XBRL fees, N-PORT holdings, N-CEN service providers, and Series/Class CSV as verifiable reference data.
- **Extraction targets (unstructured → structured)**: Use the 80% of openfunds fields that exist only in prospectus text as the fields the LLM should learn to extract.
- **Validation**: For the ~15% structured fields, compare LLM extraction from prospectus text against SEC structured data to measure extraction accuracy.
---
## Appendix: Data Source URLs
| Source | URL |
|--------|-----|
| SEC Series/Class CSV | https://www.sec.gov/data-research/sec-markets-data/investment-company-series-class-information |
| SEC XBRL Risk/Return Data Sets | https://www.sec.gov/data-research/sec-markets-data/mutual-fund-prospectus-riskreturn-summary-data-sets |
| SEC N-PORT Data Sets | https://www.sec.gov/data-research/sec-markets-data/form-n-port-data-sets |
| SEC N-CEN Data Sets | https://www.sec.gov/data-research/sec-markets-data/form-n-cen-data-sets |
| SEC Submissions API | https://data.sec.gov/submissions/CIK{cik}.json |
| SEC XBRL Company Facts API | https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json |
| GLEIF LEI Database | https://search.gleif.org/ / https://www.gleif.org/en/lei-data/gleif-api |
| OpenFIGI API | https://www.openfigi.com/api |
| SEC N-MFP (Money Market) | https://www.sec.gov/data-research/sec-markets-data |
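
The two `data.sec.gov` URL templates above take a CIK that is zero-padded to ten digits. A minimal sketch of building the URLs and fetching JSON with the standard library follows; the helper names and the CIK `1234` are hypothetical, and the `User-Agent` value must be replaced with real contact details, since the SEC asks automated clients to identify themselves. `fetch_json` is defined here but not called.

```python
import json
import urllib.request


def pad_cik(cik) -> str:
    """Normalize an int or string CIK to the 10-digit zero-padded form."""
    return str(int(cik)).zfill(10)


def submissions_url(cik) -> str:
    return f"https://data.sec.gov/submissions/CIK{pad_cik(cik)}.json"


def company_facts_url(cik) -> str:
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{pad_cik(cik)}.json"


def fetch_json(url: str, user_agent: str) -> dict:
    """GET a JSON document, declaring a User-Agent (e.g. 'Name contact@example.org')."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


print(submissions_url(1234))
print(company_facts_url("0000001234"))
```
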