Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures, the dataset for the thesis 'Magical RDF Triples and
how to synthetize them'.
- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
all books per trust), samples (per-fund segmentation, marker + plain
serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Openfunds Fields: Public Structured Data Availability

## Executive Summary

This document maps each openfunds field category to publicly available **structured data sources** — data that is machine-readable, downloadable, and free (or freely accessible via API). The focus is on fields describing the fund itself (asset class, settlement, risk, currencies, hedging, ESG, fees, etc.) rather than EU-specific regulatory fields.

### Key Public Structured Data Sources

| Source | Format | Access | Coverage | Cost |
|--------|--------|--------|----------|------|
| **SEC Series/Class CSV** | CSV | Direct download | ~100K+ US share classes | Free |
| **SEC XBRL Risk/Return** | XBRL → flat files | Quarterly download | All US mutual fund prospectuses | Free |
| **SEC N-PORT Data Sets** | XML → flat TSV | Quarterly download | Monthly holdings for all US funds | Free |
| **SEC N-CEN Data Sets** | XML → flat TSV | Annual filing, quarterly sets | Service providers, classification | Free |
| **SEC Submissions API** | JSON | REST API | All SEC filers | Free |
| **SEC XBRL Company Facts** | JSON | REST API | XBRL-tagged financial data | Free |
| **GLEIF LEI Database** | JSON/CSV | API + bulk download | 3.19M+ global entities | Free (CC0) |
| **OpenFIGI** | JSON | REST API | Hundreds of millions of instruments | Free |


---

## 1. Key Fact: Company (OFST001000–004999) — 40 fields

These fields identify the management company, custodian, transfer agent, auditor, and other service providers.

### Fields with Structured Public Data

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST001000 | Fund Group Name | SEC Submissions API | `subs.name` (entity name) | **Yes** — JSON |
| OFST001020 | ManCo | SEC N-CEN | ADVISOR table (investment adviser) | **Yes** — TSV |
| OFST001030 | LEI Of ManCo | GLEIF LEI Database | LEI lookup by entity name | **Yes** — JSON/CSV |
| OFST001035 | Domicile Of ManCo | GLEIF LEI Database | `entity.legalAddress.country` | **Yes** — JSON |
| OFST001050 | Fund Guarantor | — | Not in public structured data | No |
| OFST001055 | Address of ManCo | GLEIF LEI Database | `entity.legalAddress` | **Yes** — JSON |
| OFST001060 | City of ManCo | GLEIF LEI Database | `entity.legalAddress.city` | **Yes** — JSON |
| OFST001065 | Fund Website of ManCo | SEC Submissions API | `subs.website` | **Yes** — JSON |
| OFST001100 | Fund Promoter Name | — | Not publicly structured | No |
| OFST001105 | LEI of Fund Promoter | GLEIF LEI Database | If name known → LEI lookup | **Partial** |
| OFST001300 | Fund Administrator Name | SEC N-CEN | SERVICE_PROVIDER table | **Yes** — TSV |
| OFST001400 | Custodian Bank Name | SEC N-CEN | CUSTODIAN table | **Yes** — TSV |
| OFST001410 | LEI Of Custodian Bank | SEC N-CEN + GLEIF | N-CEN has LEI fields (since 2025) | **Yes** — TSV |
| OFST001415 | Domicile Of Custodian Bank | GLEIF LEI Database | Via custodian LEI | **Yes** — JSON |
| OFST001430 | Trustee Name | SEC EDGAR HTML filings | Unstructured (prospectus text) | No |
| OFST001450 | Portfolio Managing Company Name | SEC N-CEN | ADVISOR table + sub-advisors | **Yes** — TSV |
| OFST001500 | Fund Advisor Name | SEC N-CEN | ADVISOR table | **Yes** — TSV |
| OFST001510 | Sub-Investment Advisor Name | SEC N-CEN | Sub-advisor entries | **Yes** — TSV |
| OFST001600 | Auditor Name | SEC N-CEN | AUDITOR table | **Yes** — TSV |
| OFST002000 | Market Maker Name | — | Not publicly structured for funds | No |
| OFST002700 | Transfer Agent Name | SEC N-CEN | TRANSFER_AGENT table | **Yes** — TSV |
| OFST002900 | GIIN of Fund | — | IRS FATCA list (not easily matched) | No |

**Summary**: ~15 of 40 company fields are available as structured public data, primarily from SEC N-CEN (service providers) and GLEIF (entity LEI/address data).


---

## 2. Key Fact: Umbrella (OFST005000–009999) — 10 fields

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST005000 | Has Umbrella | SEC Series/Class CSV | Inferred: multiple Series under same CIK | **Derivable** |
| OFST005010 | Umbrella | SEC Series/Class CSV | `Entity Name` (trust name) | **Yes** — CSV |
| OFST005015 | Domicile Of Umbrella | SEC Submissions API | `subs.stateOfIncorporation` | **Yes** — JSON |
| OFST005025 | CBI Code of Umbrella | — | Ireland-specific, not in SEC | No |
| OFST005030 | CSSF Code of Umbrella | — | Luxembourg-specific, not in SEC | No |
| OFST005040 | GIIN of Umbrella | — | Not publicly structured | No |
| OFST010035 | LEI Of Umbrella | GLEIF LEI Database | LEI lookup by trust name | **Yes** — JSON |

**Summary**: 4 of 10 fields available. Umbrella concept maps to SEC "Trust/Registrant" level.

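The "multiple Series under same CIK" inference for OFST005000 can be sketched as a simple group-by over the Series/Class CSV. This is a minimal sketch, not a definitive implementation: the column names `CIK Number` and `Series ID` are assumptions about the CSV header (only `Entity Name`, `Series Name`, and `Class Name` are named in this document), and the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; the "CIK Number" and "Series ID" headers are assumptions.
SAMPLE_CSV = """CIK Number,Entity Name,Series ID,Series Name,Class ID,Class Name
0000000001,Example Trust,S000000001,Example Growth Fund,C000000001,Class A
0000000001,Example Trust,S000000001,Example Growth Fund,C000000002,Class I
0000000001,Example Trust,S000000002,Example Bond Fund,C000000003,Class A
0000000002,Solo Fund Inc,S000000003,Solo Fund,C000000004,Investor
"""


def umbrella_flags(rows):
    """Return {cik: True/False}, True when a registrant has more than one series."""
    series_by_cik = defaultdict(set)
    for row in rows:
        series_by_cik[row["CIK Number"]].add(row["Series ID"])
    return {cik: len(series) > 1 for cik, series in series_by_cik.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
flags = umbrella_flags(rows)
print(flags)
```

Counting distinct series rather than rows matters here, because a single-series fund with several share classes also appears on multiple rows.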

---

## 3. Key Fact: Fund (OFST010000–019999) — 73 fields

This is the richest category, covering fund identity, investment strategy, structure, currencies, hedging, and product type flags.

### 3A. Fund Identity & Dates

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST010010 | Fund Domicile Alpha-2 | SEC Submissions API | `subs.stateOfIncorporation` → derive | **Partial** (US state, not ISO) |
| OFST010020 | Legal Fund Name Including Umbrella | SEC Series/Class CSV | Concatenate Entity Name + Series Name | **Derivable** |
| OFST010030 | LEI Of Fund | GLEIF LEI Database | LEI search by fund name | **Yes** — JSON |
| OFST010110 | Legal Fund Name Only | SEC Series/Class CSV | `Series Name` | **Yes** — CSV |
| OFST010240 | Fund Launch Date | SEC XBRL Risk/Return | `InceptionDate` element | **Yes** — XBRL |
| OFST010250 | Fund Valuation Point | — | Prospectus text only | No |
| OFST010300 | Investment Objective | SEC XBRL Risk/Return | `ObjectivePrimaryTextBlock` | **Yes** — XBRL (text) |
| OFST010410 | Fund Currency | SEC N-PORT | `FUND_REPORTED_INFO.total_assets` currency context | **Partial** (all USD for US funds) |
| OFST010440 | Fiscal Year End | SEC Submissions API | `subs.fiscalYearEnd` (MMDD format) | **Yes** — JSON |
| OFST013000 | Prospectus Date | SEC Submissions API | Filing date of latest 485BPOS/N-1A | **Yes** — JSON |

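Two of the Submissions API rows above need a small transformation before they fit an openfunds field: the MMDD fiscal year end and the US state code that has to be turned into an ISO alpha-2 country. The sketch below works on an already-decoded JSON object and makes no network call. The helper names and the sample record are hypothetical; the `stateOfIncorporation` and `fiscalYearEnd` keys are the ones referenced in the table.

```python
from datetime import date

# Two-letter codes for the 50 US states plus DC.
US_STATE_CODES = frozenset(
    "AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS "
    "MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY".split()
)


def fiscal_year_end(mmdd, year):
    """Convert the MMDD string into a calendar date for the given year."""
    if len(mmdd) != 4 or not mmdd.isdigit():
        raise ValueError(f"expected MMDD, got {mmdd!r}")
    return date(year, int(mmdd[:2]), int(mmdd[2:]))


def identity_fields(submission, year):
    """Map a decoded submissions record to two openfunds fields (hypothetical helper)."""
    state = submission.get("stateOfIncorporation")
    return {
        # Only a US state code is mapped; anything else is left unresolved.
        "OFST010010": "US" if state in US_STATE_CODES else None,
        "OFST010440": fiscal_year_end(submission["fiscalYearEnd"], year),
    }


# Invented sample record for illustration.
sample = {"name": "Example Trust", "stateOfIncorporation": "DE", "fiscalYearEnd": "0930"}
fields = identity_fields(sample, 2025)
print(fields)
```

Leaving non-US codes as `None` keeps the "Partial" status from the table explicit instead of guessing a country.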
### 3B. Fund Structure & Product Type Flags

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST010420 | Open-ended Or Closed-ended | SEC N-CEN | Fund type reported | **Yes** — TSV |
| OFST010500 | Is Fund Of Funds | SEC N-CEN | Fund-of-funds flag | **Yes** — TSV |
| OFST010580 | Is ETF | SEC N-CEN | ETF table presence | **Yes** — TSV |
| OFST010620 | Is Tokenized Fund | — | Not in SEC data | No |
| OFST010630 | Is Leveraged | SEC N-PORT | Borrowing data (Item B.2) | **Derivable** |
| OFST010635 | Maximum Leverage In Fund | — | Prospectus text only | No |
| OFST010640 | Has 130/30 Strategy | — | Prospectus text only | No |
| OFST010650 | Is REIT | SEC N-CEN + XBRL | Classification data | **Partial** |
| OFST010660 | Is ETC | — | US concept is different | No |
| OFST010665 | Is ETN | SEC N-CEN | Product type | **Partial** |
| OFST010670 | Is Short | — | Derivable from fund name/strategy | **Derivable** (heuristic) |
| OFST010690 | Is Life Fund | — | Not a US concept | No |
| OFST010695 | Is Pension Fund | — | Not in SEC fund data | No |
| OFST010720 | Is Passive Fund | SEC N-CEN | INDEX table (tracked index) | **Derivable** |
| OFST010730 | Management Approach Type | — | Prospectus text only | No |

### 3C. Currencies & Hedging

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST010205 | Has Duration Hedge | — | Prospectus text only | No |
| OFST010211 | Currency Hedge Portfolio | — | Prospectus text only | No |
| OFST010220 | Has Embedded Derivatives | SEC N-PORT | Derivatives tables (non-empty) | **Derivable** |
| OFST020261 | Currency Hedge Share Class | — | Prospectus text only | No |
| OFST020530 | Is Multicurrency Share Class | — | Prospectus text only | No |
| OFST020540 | Share Class Currency | SEC XBRL Risk/Return | Currency context in fee/performance tables | **Partial** (USD implied) |

**Currency/hedging fields are almost entirely prospectus-derived and NOT available as structured public data.** This is a key gap: US funds are almost all USD-denominated, and hedging is described in prospectus narrative text. For LLM training, these fields represent extraction targets.

### 3D. Replication & Securities Lending

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST010900 | Replication Methodology First Level | — | Prospectus text only (ETFs) | No |
| OFST010901 | Replication Methodology Second Level | — | Prospectus text only (ETFs) | No |
| OFST011000 | Has Securities Lending | SEC N-PORT | SECURITIES_LENDING + BORROWER tables | **Yes** — TSV |
| OFST011100 | Has Swap | SEC N-PORT | Swap derivative tables | **Derivable** |
| OFST011110 | Swap Counterparty Name | SEC N-PORT | Counterparty fields in swap tables | **Yes** — TSV |

**Summary for Fund section**: ~25 of 73 fields available as structured data. The major gaps are: currency hedging, replication methodology, valuation timing, management approach, and leverage limits — all prospectus-narrative fields.


---

## 4. Key Fact: Share Class (OFST020000–049999) — 75 fields

### 4A. Identifiers

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST020000 | ISIN | OpenFIGI | FIGI → ISIN mapping | **Yes** — JSON |
| OFST020005 | CUSIP | SEC Series/Class CSV | Not directly, but derivable from ISIN | **Partial** |
| OFST020020 | Bloomberg Code | — | Proprietary (not free) | No |
| OFST020025 | FIGI Code | OpenFIGI | Direct lookup by ticker/ISIN | **Yes** — JSON |
| OFST020040 | SEDOL | — | Proprietary (London Stock Exchange) | No |
| OFST020045 | NFN Identifier | — | Nasdaq proprietary | No |
| OFST020050 | Share Class Extension | SEC Series/Class CSV | `Class Name` (parse letter/suffix) | **Derivable** |
| OFST020060 | Full Share Class Name | SEC Series/Class CSV | `Series Name` + `Class Name` | **Yes** — CSV |

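The "derivable from ISIN" note for CUSIP rests on the ISIN layout: a two-letter country prefix, a nine-character national identifier, and one check digit. For US and Canadian ISINs the national identifier is the CUSIP. A minimal sketch of that derivation follows; the function names are hypothetical, and the check-digit test is the standard Luhn check applied after expanding letters to numbers.

```python
def isin_is_valid(isin):
    """Check length, character set, and the ISIN check digit."""
    if len(isin) != 12 or not (isin.isascii() and isin.isalnum()):
        return False
    # Letters expand to two digits (A=10 ... Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in isin.upper())
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def cusip_from_isin(isin):
    """Return the embedded 9-character CUSIP for a valid US/CA ISIN, else None."""
    isin = isin.strip().upper()
    if isin[:2] not in ("US", "CA") or not isin_is_valid(isin):
        return None
    return isin[2:11]


print(cusip_from_isin("US0378331005"))
```

Rejecting ISINs with a bad check digit avoids emitting a plausible-looking CUSIP from a mistyped identifier.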
### 4B. Share Class Characteristics

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST020300 | Valuation Frequency | — | Prospectus text only | No |
| OFST020400 | Share Class Distribution Policy | SEC XBRL Risk/Return | Derivable from dividend narrative | **Partial** |
| OFST020540 | Share Class Currency | — | Implied USD for US funds | **Partial** |
| OFST020545 | Share Class Lifecycle | SEC Submissions API | Filing history + Series/Class CSV status | **Derivable** |
| OFST020560 | Share Class Launch Date | SEC XBRL Risk/Return | `InceptionDate` per share class | **Yes** — XBRL |
| OFST020566 | Termination Date | SEC Series/Class CSV | Class status (active/inactive) | **Partial** |
| OFST020580 | Is Share Class Eligible For UCITS | — | Not applicable to US funds | No |
| OFST023100 | Investment Status | — | Prospectus text only | No |
| OFST023200 | Benchmark | SEC XBRL Risk/Return | `IndexNoDeductionForFeesExpensesTaxes` | **Yes** — XBRL |
| OFST023800 | Index Name (ETF) | SEC N-CEN | INDEX table | **Yes** — TSV |
| OFST024000 | SRRI | — | EU-specific risk indicator | No |

**Summary**: ~10 of 75 fields available. Share class operational details (valuation frequency, dealing days, settlement cycles) are entirely prospectus-derived.


---

## 5. Key Fact: Listing (OFST060000–064999) — 14 fields

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST060000 | Bloomberg Code Of Listing | — | Proprietary | No |
| OFST062000 | Listing Date | — | Exchange data (not SEC) | No |
| OFST062010 | Listing Currency | — | Implied USD for US-listed | **Partial** |
| OFST062025 | Launch Price | SEC XBRL Risk/Return | Inception price context | **Partial** |
| OFST062030 | Market Identifier Code | — | Not in SEC data directly | No |
| OFST062040 | Exchange Place | SEC N-CEN (ETFs) | Exchange information for ETFs | **Partial** |

**Summary**: 0-2 fields fully structured. Listing data is primarily from exchanges, not SEC filings.


---

## 6. Legal Structure (OFST160000–164999) — 7 fields

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST160039 | Is EU Directive Relevant | — | EU-specific | No |
| OFST160040 | Type Of EU Directive | — | EU-specific (UCITS/AIF) | No |
| OFST160100 | Legal Form | SEC Series/Class CSV | `Entity Org Type` | **Yes** — CSV |
| OFST160150 | Home Country Legal Type Of Fund | SEC N-CEN | Fund type classification | **Yes** — TSV |

**Summary**: 2 of 7 fields available. Most are EU-specific.


---

## 7. Classification (OFST350000–399999) — 12 fields

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST350009 | Is Sharia Compliant | — | Not in SEC data | No |
| OFST350015 | CFI Code | OpenFIGI | FIGI metadata includes CFI | **Partial** |
| OFST350050 | Clearstream Asset Category | — | Proprietary classification | No |
| OFST350100 | EFAMA Main EFC Category | — | EU classification system | No |
| OFST351295 | Is Money Market Fund | SEC N-CEN + N-MFP | Money market fund flag | **Yes** — TSV |
| OFST351300 | Money Market Type Of Fund | SEC N-MFP | Fund type in N-MFP data | **Yes** — TSV |

**Major gap**: There is **no free, structured, universal fund asset class classification** in SEC data. The SEC does not tag funds as "equity", "fixed income", "mixed", etc. in a single structured field. Asset class must be derived from:

- Fund name heuristics ("Growth Fund" → equity, "Bond Fund" → fixed income)
- N-PORT holdings data (aggregate asset types held)
- XBRL strategy narrative text

This is a critical finding for LLM training: **asset class classification is an extraction target, not ground truth.**

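The first two derivation routes listed above (name heuristics and aggregated holdings) can be sketched in a few lines. Everything in this example is an assumption made for illustration: the keyword list, the mapping from N-PORT asset category codes to coarse classes, and the 70% dominance threshold are not taken from any SEC or openfunds specification.

```python
from collections import defaultdict

# Assumed keyword hints, following the two examples given in the text.
NAME_HINTS = [("bond", "fixed income"), ("growth", "equity"), ("equity", "equity")]

# Assumed mapping from N-PORT asset category codes to coarse asset classes.
ASSET_CAT_TO_CLASS = {
    "EC": "equity",
    "EP": "equity",
    "DBT": "fixed income",
    "STIV": "money market",
}


def classify_by_name(fund_name):
    """Return a coarse asset class from the fund name, or None if no hint matches."""
    lowered = fund_name.lower()
    for keyword, asset_class in NAME_HINTS:
        if keyword in lowered:
            return asset_class
    return None


def classify_by_holdings(holdings, threshold=0.7):
    """holdings: iterable of (asset_cat, weight) pairs; returns the dominant class or 'mixed'."""
    weights = defaultdict(float)
    for asset_cat, weight in holdings:
        weights[ASSET_CAT_TO_CLASS.get(asset_cat, "other")] += weight
    total = sum(weights.values())
    if total <= 0:
        return None
    top_class, top_weight = max(weights.items(), key=lambda kv: kv[1])
    return top_class if top_weight / total >= threshold else "mixed"


print(classify_by_name("Example Growth Fund"))
print(classify_by_holdings([("EC", 60.0), ("EP", 20.0), ("DBT", 20.0)]))
```

Both functions produce labels that are heuristic by construction, which is consistent with treating asset class as an extraction target rather than as reference data.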

---

## 8. Purchase Information / Settlement (OFST400000–449999) — 95 fields

This is the **largest gap** between openfunds and public data. Settlement and dealing information is almost entirely found only in prospectus text.

| OF-ID | Field Name | Public Source | Structured? |
|-------|-----------|--------------|-------------|
| OFST400200 | Minimal Initial Subscription Category | — | No |
| OFST400230 | Minimal Initial Subscription In Amount | SEC XBRL Risk/Return | **Partial** — `MinimumInvestment` element exists but inconsistently tagged |
| OFST401002 | Pricing Methodology | — | No |
| OFST402500 | Maximal Number Of Possible Decimals Shares | — | No |
| OFST405521-405532 | Subscription Trade Cycle / Dealing Days | — | No |
| OFST410060 | Cut-off Date Offset for Subscription | — | No |
| OFST410100 | Cut-off Time For Subscription | — | No |
| OFST410700 | Settlement Period For Subscription | — | No |
| OFST410950 | Has Lock-up For Redemption | — | No |
| OFST420200-420265 | Redemption Minimums/Maximums | — | No |
| OFST420630 | Bank Details (SSI for Payments) | — | No |
| OFST425561-425572 | Redemption Trade Cycle / Dealing Days | — | No |
| OFST430100 | Cut-off Time For Redemption | — | No |
| OFST430150 | Settlement Period For Redemption | — | No |

**Summary**: **0-1 of 95 fields** available as structured data. Settlement cycles, cut-off times, dealing days, and payment details appear only in prospectus text; minimum investments are the one partial exception, because the XBRL `MinimumInvestment` element is tagged inconsistently. This is arguably the highest-value category for LLM extraction: these fields are critical for fund operations but exist only in legal documents.


---

## 9. Fees, Costs and Expenses (OFST450100–499999) — 62 fields

This is the **strongest area for SEC structured data**, thanks to the XBRL Risk/Return fee tables.

| OF-ID | Field Name | Public Source | Source Field | Structured? |
|-------|-----------|--------------|-------------|-------------|
| OFST451027 | Has Performance Fee | SEC XBRL Risk/Return | Fee narrative/table | **Partial** |
| OFST451030 | Performance Fee in Prospectus | SEC XBRL Risk/Return | Not separately tagged | **Partial** |
| OFST451305 | Applied Subscription Fee | SEC XBRL Risk/Return | `MaximumSalesChargeImposedOnPurchasesOverOfferingPrice` | **Yes** — XBRL |
| OFST451320 | Maximum Subscription Fee | SEC XBRL Risk/Return | `MaximumSalesChargeImposedOnPurchasesOverOfferingPrice` | **Yes** — XBRL |
| OFST451385 | Has Early Redemption Fee | SEC XBRL Risk/Return | `RedemptionFeeOverRedemption` | **Yes** — XBRL |
| OFST451390 | Has CDSC Fee | SEC XBRL Risk/Return | `MaximumDeferredSalesChargeOverOther` | **Yes** — XBRL |
| OFST451405 | Redemption Fee | SEC XBRL Risk/Return | `RedemptionFeeOverRedemption` | **Yes** — XBRL |
| OFST452000 | Management Fee Applied | SEC XBRL Risk/Return | `ManagementFeesOverAssets` | **Yes** — XBRL |
| OFST452100 | TER Excluding Performance Fee | SEC XBRL Risk/Return | `NetExpensesOverAssets` (OER equivalent) | **Yes** — XBRL |
| OFST452200 | Ongoing Charges | SEC XBRL Risk/Return | `TotalAnnualFundOperatingExpensesOverAssets` | **Yes** — XBRL |
| OFST453151 | Is Trailer Fee Clean | — | Not in SEC data | No |
| OFST454150 | Has Separate Distribution Fee | SEC XBRL Risk/Return | `Distribution12b1FeesOverAssets` | **Yes** — XBRL |
| OFST454160 | Distribution Fee | SEC XBRL Risk/Return | `Distribution12b1FeesOverAssets` | **Yes** — XBRL |
| — | Fee Waiver / Reimbursement | SEC XBRL Risk/Return | `FeeWaiverOrReimbursementOverAssets` | **Yes** — XBRL |
| — | Other Expenses | SEC XBRL Risk/Return | `OtherExpensesOverAssets` | **Yes** — XBRL |
| — | Expense Example (1yr/3yr/5yr/10yr) | SEC XBRL Risk/Return | `ExpenseExampleYear01` through `Year10` | **Yes** — XBRL |

### Key XBRL Fee Elements (complete shareholder fee table)

```
MaximumSalesChargeImposedOnPurchasesOverOfferingPrice
MaximumDeferredSalesChargeOverOther
MaximumSalesChargeOnReinvestedDividendsAndDistributionsOverOther
RedemptionFeeOverRedemption
MaximumAccountFee
ManagementFeesOverAssets
Distribution12b1FeesOverAssets
OtherExpensesOverAssets
AcquiredFundFeesAndExpensesOverAssets
TotalAnnualFundOperatingExpensesOverAssets
FeeWaiverOrReimbursementOverAssets
TotalAnnualFundOperatingExpensesAfterFeeWaiverOverAssets
ExpenseExampleYear01 / Year03 / Year05 / Year10
ExpenseExampleNoRedemptionYear01 / Year03 / Year05 / Year10
```

**Summary**: ~15 of 62 fee fields are available as structured XBRL data. The SEC fee taxonomy is detailed for US-style fees (sales charges, 12b-1, management fee, expense ratio) but does not cover European concepts like custodian fee breakdown, trailer fee clean status, or performance fee details (hurdle rate, high water mark).

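The element-to-field correspondence in the fee table above can be expressed as a lookup table. The sketch below assumes the XBRL facts for one share class have already been parsed into a plain `{element_name: value}` dictionary with values as decimal fractions; that input shape, the helper name `map_fees`, and the rule that an absent element means "no such fee" are all assumptions, not part of the SEC data sets.

```python
# Element -> openfunds IDs, copied from the fee table in this section.
XBRL_TO_OPENFUNDS = {
    "MaximumSalesChargeImposedOnPurchasesOverOfferingPrice": ["OFST451305", "OFST451320"],
    "RedemptionFeeOverRedemption": ["OFST451405"],
    "ManagementFeesOverAssets": ["OFST452000"],
    "NetExpensesOverAssets": ["OFST452100"],
    "TotalAnnualFundOperatingExpensesOverAssets": ["OFST452200"],
    "Distribution12b1FeesOverAssets": ["OFST454160"],
}

# Yes/no openfunds fields derived from whether a fee element is present and non-zero.
FLAG_FIELDS = {
    "OFST451385": "RedemptionFeeOverRedemption",
    "OFST451390": "MaximumDeferredSalesChargeOverOther",
    "OFST454150": "Distribution12b1FeesOverAssets",
}


def map_fees(facts):
    """Translate parsed Risk/Return facts for one share class into openfunds fields."""
    out = {}
    for element, of_ids in XBRL_TO_OPENFUNDS.items():
        if element in facts:
            for of_id in of_ids:
                out[of_id] = facts[element]
    for of_id, element in FLAG_FIELDS.items():
        # Assumption: an untagged element is treated as "fee not charged".
        out[of_id] = facts.get(element, 0.0) != 0.0
    return out


# Invented sample facts for illustration.
sample_facts = {
    "ManagementFeesOverAssets": 0.0075,
    "Distribution12b1FeesOverAssets": 0.0025,
    "TotalAnnualFundOperatingExpensesOverAssets": 0.0112,
}
fees = map_fees(sample_facts)
print(fees)
```

Numeric fields are only emitted when the source element is present, so a missing tag is never silently reported as a zero fee.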

---

## 10. Solvency II (OFST500000–519999) — 13 fields

| OF-ID | Field Name | Public Source | Structured? |
|-------|-----------|--------------|-------------|
| All 13 fields | SCR Market Risk, Tripartite Reports | — | **No** — entirely EU insurance regulation |

**Summary**: 0 of 13 fields available. Solvency II is a European directive not applicable to US SEC data.


---

## 11. Taxes (OFST800000–819999) — 27 fields

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFST809200 | Is US Tax Forms W8/W9 Needed | — | Prospectus text only | No |
| OFST809210 | Is US K1 Reporting Required | SEC N-CEN | Partnership/LP fund flags | **Partial** |
| OFST809250 | Is Flow-Through Entity By US Tax Law | — | Prospectus text only | No |
| OFST809511 | FATCA Status | — | IRS data, not SEC structured | No |
| OFST809520 | Subject To FATCA Withholding | — | Prospectus text only | No |
| OFST801011 | Is Austrian Tax Reporting Fund | — | Austria-specific | No |
| OFST802001–802045 | German Tax fields (8 fields) | — | Germany-specific | No |
| OFST802500 | Luxembourg Taxe d'Abonnement | — | Luxembourg-specific | No |
| OFST808008–808100 | Swiss Tax fields (3 fields) | — | Switzerland-specific | No |
| OFST809015 | Has UK Reporting Status | — | UK-specific | No |

**Summary**: 0-1 of 27 fields available. Tax fields are overwhelmingly jurisdiction-specific (DE, AT, CH, LU, UK, FR, ES) and not in SEC data. The few US-relevant fields (FATCA, K-1) are in prospectus text.


---

## 12. ESG Data (OFST820000–849999) — 65 fields

| OF-ID | Field Name | Public Source | Structured? |
|-------|-----------|--------------|-------------|
| OFST820110-820280 | Carbon Intensity / Footprint / Absolute GHG (18 fields) | — | **No** — not yet required in SEC filings for funds |
| OFST820290-820360 | Fossil Fuel Exposure (8 fields) | — | **No** |
| OFST820370-820380 | Net Zero Commitments (2 fields) | — | **No** |
| OFST820390 | Implied Temperature Rise | — | **No** |
| OFST820440-820460 | GHG Reduction Goals (3 fields) | — | **No** |
| OFST820470-820540 | Climate Stewardship (8 fields) | — | **No** |
| OFST820600-820675 | AMAS / ACT Signatory fields (8 fields) | — | **No** — Swiss specific |
| OFST830000-830210 | UK SDR fields (12 fields) | — | **No** — UK specific |
| OFST001025 | Is UN PRI Signatory | UN PRI website | **Partial** — searchable but not API |

**Current state of SEC ESG data**: The SEC adopted climate disclosure rules in March 2024 (nominally effective May 2024, but stayed by the SEC in April 2024 pending litigation). In any case, these rules apply to operating companies, **not investment funds**. The Investment Company Names Rule (addressing ESG fund naming) has compliance dates of June 2026 / December 2026. As of February 2026, there is no SEC-mandated structured ESG data for funds comparable to EU SFDR.

**Summary**: **0 of 65 ESG fields** are available as structured public data from SEC. ESG fund data is available from commercial providers (Morningstar, MSCI, Sustainalytics) but not from any free structured public source.


---

## 13. Dynamic Data: Prices & AuM (OFDY000001–000999) — 20 fields

| OF-ID | Field Name | Public Source | Source Field | Structured? |
|-------|-----------|--------------|-------------|-------------|
| OFDY000010 | Price Currency | — | Implied USD for US funds | **Partial** |
| OFDY000035 | Valuation NAV | SEC N-PORT | Not directly; XBRL Company Facts for some | **Partial** |
| OFDY000060 | AuM Fund | SEC N-PORT | `FUND_REPORTED_INFO.total_assets` | **Yes** — TSV |
| OFDY000070 | AuM Share Class | SEC N-PORT | Per-class AuM when reported | **Partial** |
| OFDY000075 | NoS Share Class | — | Not in SEC structured data | No |

**Summary**: 1-2 of 20 fields available. Daily NAV prices are not in SEC structured data (available from commercial sources). Fund-level AuM is in N-PORT.


---

## 14. Dynamic Data: Performance & Risk (OFDY025000–049999) — 4 fields

These 4 fields are Germany-specific (equity participation ratio, total fund asset share, etc.) and not available from SEC.

**Additional performance data in SEC**: While openfunds has few performance OFDY fields, SEC XBRL Risk/Return provides:

| SEC Element | Description | Structured? |
|-------------|-------------|-------------|
| `AnnualReturn20XX` | Calendar year annual returns (1yr–10yr) | **Yes** — XBRL |
| `HighestQuarterlyReturnLabel/Value` | Best quarter return | **Yes** — XBRL |
| `LowestQuarterlyReturnLabel/Value` | Worst quarter return | **Yes** — XBRL |
| `AverageAnnualReturnYear01/05/10/SinceInception` | Average annual returns | **Yes** — XBRL |
| `BarChartClosingTextBlock` | Performance chart narrative | **Yes** — XBRL (text) |

And N-PORT provides:

| N-PORT Field | Description | Structured? |
|-------------|-------------|-------------|
| `MONTHLY_TOTAL_RETURN` | Monthly returns by class | **Yes** — TSV |
| `MONTHLY_RETURN_CAT_INSTRUMENT` | Returns by asset category | **Yes** — TSV |
| `FUND_VAR_INFO` | Value-at-Risk | **Yes** — TSV |
| `INTEREST_RATE_RISK` | DV01/DV100 by maturity bucket | **Yes** — TSV |


---

## 15. Portfolio Holdings (OFPH000001–999999) — 92 fields

N-PORT is the primary source. SEC requires monthly portfolio disclosure.

| OF-ID | Field Name | N-PORT Source | Structured? |
|-------|-----------|--------------|-------------|
| OFPH000010 | Holding as at Date | Reporting date | **Yes** |
| OFPH000020 | Portfolio Currency | Fund currency context | **Yes** (USD) |
| OFPH000100 | Holding ISIN | IDENTIFIERS table | **Yes** |
| OFPH000130 | Holding Ticker | IDENTIFIERS table | **Yes** |
| OFPH000145 | Holding CUSIP | IDENTIFIERS table | **Yes** |
| OFPH000170 | Holding FIGI | — | No (use OpenFIGI to map) |
| OFPH000200 | Holding Name | `FUND_REPORTED_HOLDING.name` | **Yes** |
| OFPH000210 | Holding Instrument Type | `FUND_REPORTED_HOLDING.asset_cat` | **Yes** |
| OFPH000250 | Holding Market Value | `FUND_REPORTED_HOLDING.balance` + `val_usd` | **Yes** |
| OFPH000300 | Holding Net Weight as % | `FUND_REPORTED_HOLDING.pctVal` | **Yes** |
| OFPH000400 | Holding Currency | `FUND_REPORTED_HOLDING.curCd` | **Yes** |
| OFPH000420 | Holding Risk Country | `FUND_REPORTED_HOLDING.invCountry` | **Yes** |
| OFPH000430 | Holding Asset Class | `FUND_REPORTED_HOLDING.asset_cat` | **Yes** |
| OFPH000440 | Holding Credit Rating | DEBT_SECURITY fields | **Yes** |
| OFPH000450 | Holding Number of Shares | `FUND_REPORTED_HOLDING.balance` | **Yes** |
| OFPH000460 | Holding Coupon Rate | DEBT_SECURITY fields | **Yes** |
| OFPH000465 | Holding Modified Duration | — | No |
| OFPH000480 | Holding Maturity Date | DEBT_SECURITY fields | **Yes** |
| OFPH000600-650 | Interest Rate Type / Index / Margin | DEBT_SECURITY fields | **Yes** |
| OFPH000700 | Holding Issuer Name | `FUND_REPORTED_HOLDING.issuerConditionalName` | **Yes** |
| OFPH000710 | Holding Issuer LEI | `FUND_REPORTED_HOLDING.lei` | **Yes** |
| OFPH000712 | Holding Issuer Domicile | `FUND_REPORTED_HOLDING.invCountry` | **Yes** |
| OFPH000730 | Holding Strike Price | Derivative tables | **Yes** |
| OFPH000800-870 | Underlying Asset fields | Derivative tables | **Yes** |

**Summary**: ~35-40 of 92 fields available from N-PORT. The main gaps are: modified/effective duration (calculated, not reported), GICS sector codes (not in N-PORT directly), and European-specific fields (CIC, NACE, EUSIPA, WKN, Valor).


---

## 16. Fund Ratios and Exposures (OFRE000001–999999) — 42 fields

| OF-ID | Field Name | Public Source | Structured? |
|-------|-----------|--------------|-------------|
| OFRE000010 | Number Of Positions | SEC N-PORT | **Derivable** — count holdings |
| OFRE000200 | Exposure To Cash | SEC N-PORT | **Derivable** — sum cash-type holdings |
| OFRE000300-320 | Credit Quality fields | SEC N-PORT | **Derivable** — aggregate from holdings |
| OFRE000330 | Average Effective Maturity | — | No (not directly in N-PORT) |
| OFRE000335 | Average Effective Duration | — | No (not directly in N-PORT) |
| OFRE000350 | Yield To Maturity | — | No (not in N-PORT) |
| OFRE000500 | Top Ten Positions | SEC N-PORT | **Derivable** — sort by weight |
| OFRE000520 | Country Breakdown | SEC N-PORT | **Derivable** — aggregate by country |
| OFRE000540 | Currency Breakdown | SEC N-PORT | **Derivable** — aggregate by currency |
| OFRE000560 | GICS Equity Sector Breakdown | — | No (GICS not in N-PORT) |
| OFRE000570 | Market Cap Breakdown | — | No (not in N-PORT) |
| OFRE000580 | Credit Rating Breakdown | SEC N-PORT | **Derivable** — aggregate by rating |
| OFRE000590 | Maturity Breakdown | SEC N-PORT | **Derivable** — aggregate by maturity |
| OFRE000600 | Asset Class Breakdown | SEC N-PORT | **Derivable** — aggregate by asset_cat |

**Summary**: ~10-15 of 42 fields are derivable from N-PORT holdings data. Pre-computed ratios (YTM, duration, OAS) are not available.

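Several of the "Derivable" rows above are plain aggregations over one fund's holdings: counting positions, sorting by weight, and summing weights per country or currency. The sketch below assumes the holdings for one fund and reporting date have already been loaded into a list of dictionaries. The key names `name`, `pct_val`, `country`, and `currency` are hypothetical stand-ins for the N-PORT columns, and the sample holdings are invented.

```python
from collections import defaultdict


def derive_exposures(holdings):
    """Compute a few openfunds ratio fields from one fund's list of holdings."""
    by_country = defaultdict(float)
    by_currency = defaultdict(float)
    for h in holdings:
        by_country[h["country"]] += h["pct_val"]
        by_currency[h["currency"]] += h["pct_val"]
    # Largest weights first; slicing keeps at most ten names.
    top_ten = sorted(holdings, key=lambda h: h["pct_val"], reverse=True)[:10]
    return {
        "OFRE000010": len(holdings),                # Number Of Positions
        "OFRE000500": [h["name"] for h in top_ten],  # Top Ten Positions
        "OFRE000520": dict(by_country),             # Country Breakdown
        "OFRE000540": dict(by_currency),            # Currency Breakdown
    }


# Invented sample holdings for illustration.
sample_holdings = [
    {"name": "Holding B", "pct_val": 35.0, "country": "US", "currency": "USD"},
    {"name": "Holding A", "pct_val": 40.0, "country": "US", "currency": "USD"},
    {"name": "Holding C", "pct_val": 25.0, "country": "DE", "currency": "EUR"},
]
exposures = derive_exposures(sample_holdings)
print(exposures)
```

The same pattern extends to rating, maturity, and asset category breakdowns by swapping the grouping key.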

---

## 17. Portfolio Manager Data (OFPM000001–999999) — 8 fields

| OF-ID | Field Name | Public Source | Source Field / Method | Structured? |
|-------|-----------|--------------|----------------------|-------------|
| OFPM000010 | Portfolio Manager Name | SEC XBRL Risk/Return | `PortfolioManager` text block | **Partial** (text, not structured) |
| OFPM000060 | Portfolio Manager Brief Biography | SEC XBRL Risk/Return | SAI supplement text | **Partial** (text) |
| Others | Year of birth, experience, role | — | Not structured | No |

**Summary**: 0-1 of 8 fields. Portfolio manager data is in prospectus SAI text, not structured.


---

## Grand Summary: Structured Data Availability by Category

| Category | Total Fields | Structured Public | Derivable | Not Available |
|----------|-------------|------------------|-----------|---------------|
| **Company** (service providers) | 40 | **15** | 0 | 25 |
| **Umbrella** | 10 | **4** | 1 | 5 |
| **Fund** (identity, structure) | 73 | **15** | 10 | 48 |
| **Share Class** | 75 | **8** | 2 | 65 |
| **Listing** | 14 | **0** | 2 | 12 |
| **Legal Structure** | 7 | **2** | 0 | 5 |
| **Classification** | 12 | **2** | 0 | 10 |
| **Purchase / Settlement** | 95 | **0** | 1 | 94 |
| **Fees** | 62 | **15** | 0 | 47 |
| **Solvency II** | 13 | **0** | 0 | 13 |
| **Taxes** | 27 | **0** | 1 | 26 |
| **ESG** | 65 | **0** | 0 | 65 |
| **Prices / AuM** | 20 | **2** | 0 | 18 |
| **Performance / Risk** | 4 | **0** | 0 | 4 |
| **Portfolio Holdings** | 92 | **38** | 0 | 54 |
| **Ratios / Exposures** | 42 | **0** | 14 | 28 |
| **Portfolio Manager** | 8 | **0** | 1 | 7 |
| **TOTAL** | **659** | **~101 (15%)** | **~32 (5%)** | **~526 (80%)** |

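The totals and percentages in the summary table can be re-derived from its rows. The short check below copies the per-category counts from the table and recomputes the column sums and rounded shares; it is only an arithmetic cross-check, and the tuple layout is chosen for this example.

```python
# (category, total, structured, derivable, not_available), copied from the table above.
ROWS = [
    ("Company", 40, 15, 0, 25),
    ("Umbrella", 10, 4, 1, 5),
    ("Fund", 73, 15, 10, 48),
    ("Share Class", 75, 8, 2, 65),
    ("Listing", 14, 0, 2, 12),
    ("Legal Structure", 7, 2, 0, 5),
    ("Classification", 12, 2, 0, 10),
    ("Purchase / Settlement", 95, 0, 1, 94),
    ("Fees", 62, 15, 0, 47),
    ("Solvency II", 13, 0, 0, 13),
    ("Taxes", 27, 0, 1, 26),
    ("ESG", 65, 0, 0, 65),
    ("Prices / AuM", 20, 2, 0, 18),
    ("Performance / Risk", 4, 0, 0, 4),
    ("Portfolio Holdings", 92, 38, 0, 54),
    ("Ratios / Exposures", 42, 0, 14, 28),
    ("Portfolio Manager", 8, 0, 1, 7),
]

# Every row should split its total exactly into the three availability buckets.
inconsistent = [name for name, total, s, d, n in ROWS if s + d + n != total]

total = sum(r[1] for r in ROWS)
structured = sum(r[2] for r in ROWS)
derivable = sum(r[3] for r in ROWS)
not_available = sum(r[4] for r in ROWS)
shares = tuple(round(100 * x / total) for x in (structured, derivable, not_available))

print(total, structured, derivable, not_available, shares)
```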

---

## Implications for LLM Training Dataset

### What this means:

1. **~15% of openfunds fields** have directly available structured public data (primarily from SEC EDGAR: XBRL fees, N-PORT holdings, N-CEN service providers, Series/Class CSV identifiers).

2. **~5% are derivable** from structured data (e.g., aggregating N-PORT holdings into country/currency/rating breakdowns, counting positions, inferring passive status from N-CEN index tracking).

3. **~80% are NOT available** as structured public data. Where these fields apply to US funds at all, they exist only in prospectus narrative text; the remainder are jurisdiction-specific (EU, UK, Swiss) fields with no SEC counterpart.

### The 80% gap = the LLM opportunity

The fields that are **not** available as structured data but **are** specified in prospectus text represent the core value proposition for LLM extraction:

| Category | Key Extraction Targets |
|----------|----------------------|
| **Settlement / Dealing** | Cut-off times, settlement periods, dealing days, minimum subscriptions, pricing methodology |
| **Currencies / Hedging** | Share class currency hedging, portfolio hedging, multicurrency options |
| **Risk Limits** | Maximum leverage, redemption gates, lock-up periods, side pockets |
| **Asset Class** | Fund classification (equity/bond/mixed/alternative), investment strategy |
| **Fee Details** | Performance fee mechanics (hurdle rate, high water mark, crystallization), custodian fees |
| **ESG** | Sustainability approach, climate targets, exclusion criteria |
| **Tax** | FATCA status, K-1 requirements, flow-through entity status |

### Recommended approach for training data:

- **Ground truth (structured data)**: Use SEC XBRL fees, N-PORT holdings, N-CEN service providers, and Series/Class CSV as verifiable reference data.
- **Extraction targets (unstructured → structured)**: From the ~80% of openfunds fields without structured public data, use those that are described in prospectus text as the fields the LLM should learn to extract.
- **Validation**: For the ~15% structured fields, compare LLM extraction from prospectus text against SEC structured data to measure extraction accuracy.

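The validation step above amounts to a per-field comparison between extracted values and the structured reference. A minimal sketch follows, assuming both sides are keyed by a fund or share class identifier and then by openfunds ID. The function name, the exact-match criterion after lowercasing and trimming, and the sample values are assumptions for illustration; numeric fields would likely need a tolerance instead.

```python
def field_accuracy(extracted, reference, normalize=lambda v: str(v).strip().lower()):
    """Per-field exact-match accuracy; a missing prediction counts as a miss."""
    hits, counts = {}, {}
    for fund_id, ref_fields in reference.items():
        pred_fields = extracted.get(fund_id, {})
        for field, ref_value in ref_fields.items():
            counts[field] = counts.get(field, 0) + 1
            if field in pred_fields and normalize(pred_fields[field]) == normalize(ref_value):
                hits[field] = hits.get(field, 0) + 1
    return {field: hits.get(field, 0) / n for field, n in counts.items()}


# Invented sample data: reference from structured sources, extracted from prospectus text.
reference = {
    "S1": {"OFST452000": "0.0075", "OFST001600": "Example Audit LLP"},
    "S2": {"OFST452000": "0.0050"},
}
extracted = {
    "S1": {"OFST452000": "0.0075", "OFST001600": "example audit llp "},
    "S2": {"OFST452000": "0.0060"},
}
accuracy = field_accuracy(extracted, reference)
print(accuracy)
```

Reporting accuracy per field rather than as one overall number shows which openfunds fields the extraction handles reliably and which it does not.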

---

## Appendix: Data Source URLs

| Source | URL |
|--------|-----|
| SEC Series/Class CSV | https://www.sec.gov/data-research/sec-markets-data/investment-company-series-class-information |
| SEC XBRL Risk/Return Data Sets | https://www.sec.gov/data-research/sec-markets-data/mutual-fund-prospectus-riskreturn-summary-data-sets |
| SEC N-PORT Data Sets | https://www.sec.gov/data-research/sec-markets-data/form-n-port-data-sets |
| SEC N-CEN Data Sets | https://www.sec.gov/data-research/sec-markets-data/form-n-cen-data-sets |
| SEC Submissions API | https://data.sec.gov/submissions/CIK{cik}.json |
| SEC XBRL Company Facts API | https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json |
| GLEIF LEI Database | https://search.gleif.org/ / https://www.gleif.org/en/lei-data/gleif-api |
| OpenFIGI API | https://www.openfigi.com/api |
| SEC N-MFP (Money Market) | https://www.sec.gov/data-research/sec-markets-data |