fund_rfid_data/SEC_REFERENCE_DATA_vs_OPENFUNDS.md
Florian Herzog 1993658fb2 Add SEC fund prospectus -> RDF triple dataset pipeline
2026-06-03 10:31:35 +02:00

SEC Reference Data Fields vs. openfunds Data Model

1. Overview of SEC Structured Data Sources

The SEC provides four distinct structured data sources that contain reference data for US-registered funds. Each covers different aspects:

| Source | Form | Content | Format | Granularity |
| --- | --- | --- | --- | --- |
| Series/Class CSV | | Identity & identifiers | CSV/XML | Trust → Series → Class |
| XBRL Risk/Return | N-1A (485BPOS, 497K) | Prospectus-derived structured data | XBRL → flat files | Series & Class level |
| N-PORT Data Sets | NPORT-P | Portfolio holdings & fund financials | XML → flat files | Series & Holding level |
| Submissions API | | Filing history metadata | JSON | Entity (CIK) level |
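
The Submissions API is keyed by CIK. The following is a minimal sketch of building the request URL and reading a few metadata fields from an already-downloaded response. It assumes the public `https://data.sec.gov/submissions/CIK##########.json` endpoint with a zero-padded 10-digit CIK and the `fiscalYearEnd` and `filings.recent` keys of its JSON payload. The helper names and the sample payload are hypothetical, and a real request also needs a descriptive User-Agent header.

```python
import json

def submissions_url(cik):
    """Build the Submissions API URL for a CIK, zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"

def summarize(payload):
    """Extract entity name, fiscal year end and the latest prospectus filing date."""
    recent = payload.get("filings", {}).get("recent", {})
    forms = recent.get("form", [])
    dates = recent.get("filingDate", [])
    # Assumption: the parallel arrays are ordered newest first.
    prospectus_dates = [d for f, d in zip(forms, dates) if f in {"485BPOS", "497K"}]
    return {
        "name": payload.get("name"),
        "fiscal_year_end": payload.get("fiscalYearEnd"),  # "MMDD" string
        "latest_prospectus_filing": prospectus_dates[0] if prospectus_dates else None,
    }

# Illustrative payload only; not real filing data.
sample = json.loads("""
{
  "name": "Example Trust",
  "fiscalYearEnd": "1231",
  "filings": {"recent": {
    "form": ["NPORT-P", "485BPOS", "497K"],
    "filingDate": ["2025-05-30", "2025-04-28", "2025-04-28"]
  }}
}
""")

summary = summarize(sample)
```

The filing date of the latest 485BPOS or 497K is only a proxy for the prospectus date; the effective date stated in the document itself can differ.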

2. Complete Field Inventory by SEC Source

2.1 Series/Class Reference CSV

This is the identity backbone: it maps the hierarchy of trust → fund → share class.

| Field | Description | openfunds Equivalent |
| --- | --- | --- |
| Reporting File Number | 811-XXXXX Investment Co. Act number | none (no direct equivalent) |
| CIK Number | 10-digit SEC entity identifier | none (SEC-specific) |
| Entity Name | Trust/investment company name | OFST005010 Umbrella Name |
| Entity Org Type | Organization type code | OFST160100 Legal Form |
| Series ID | S###### fund series identifier | none (SEC-specific) |
| Series Name | Fund name | OFST010110 Legal Fund Name Only |
| Class ID | C###### share class identifier | none (SEC-specific) |
| Class Name | Share class name (e.g. "Admiral Shares") | OFST020060 Full Share Class Name |
| Class Ticker | Exchange ticker symbol | OFST020020 Bloomberg Code (partial) |
| Address_1, Address_2, City, State, Zip Code | Registrant address | |

Coverage (approximate): more than 15,000 investment company trusts, more than 50,000 series, and more than 100,000 classes.
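
Because the file is flat (one row per share class), the trust → series → class hierarchy has to be rebuilt by grouping. A minimal sketch, assuming the column names listed above; the sample rows and the `build_hierarchy` helper are hypothetical:

```python
import csv
import io

# Illustrative rows only; column names follow the field list above.
SAMPLE = """CIK Number,Entity Name,Series ID,Series Name,Class ID,Class Name,Class Ticker
0000000001,Example Trust,S000000001,Example Equity Fund,C000000001,Investor Shares,EXEQX
0000000001,Example Trust,S000000001,Example Equity Fund,C000000002,Admiral Shares,EXEAX
0000000001,Example Trust,S000000002,Example Bond Fund,C000000003,Investor Shares,EXBDX
"""

def build_hierarchy(csv_text):
    """Group flat class rows into trust -> series -> classes."""
    trusts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        trust = trusts.setdefault(
            row["CIK Number"], {"name": row["Entity Name"], "series": {}}
        )
        series = trust["series"].setdefault(
            row["Series ID"], {"name": row["Series Name"], "classes": []}
        )
        series["classes"].append(
            {"id": row["Class ID"], "name": row["Class Name"], "ticker": row["Class Ticker"]}
        )
    return trusts

hierarchy = build_hierarchy(SAMPLE)
```

Keeping the CIK as a string preserves its leading zeros, which matters when joining against the other sources.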

2.2 XBRL Risk/Return Summary (from Prospectus — the richest source)

This dataset is extracted from prospectus XBRL filings and is the closest to what openfunds covers. It contains the items that funds tag in the prospectus: objective, strategy, fees and expenses, performance, risks, and portfolio turnover.

A. Fund Identity & Structure

| XBRL Element | Description | Data Type | openfunds Equivalent |
| --- | --- | --- | --- |
| RiskReturnHeading | Prospectus section heading | Text | |
| ObjectiveHeading | Heading of objectives section | Text | |
| ObjectivePrimaryTextBlock | Investment objective narrative | Text Block | OFST010300 Investment Objective |
| ObjectiveSecondaryTextBlock | Additional objective detail | Text Block | OFST010300 Investment Objective |
| StrategyHeading | Heading of strategy section | Text | |
| StrategyNarrativeTextBlock | Principal investment strategies | Text Block | none (no single openfunds equivalent) |

B. Fee & Expense Data (Shareholder Fees — paid directly by investor)

| XBRL Element | Description | Data Type | openfunds Equivalent |
| --- | --- | --- | --- |
| MaximumSalesChargeImposedOnPurchasesOverOfferingPrice | Front-end load | Ratio | OFST451320 Max Subscription Fee In Favour Of Distributor |
| MaximumDeferredSalesChargeOverOfferingPrice | Back-end load (CDSC) | Ratio | OFST451391 Contingent Deferred Sales Charge Exit Fee |
| MaximumDeferredSalesChargeOverOther | CDSC on other basis | Ratio | OFST451392 Contingent Deferred Sales Charge Upfront Fee |
| MaximumSalesChargeOnReinvestedDividendsAndDistributions | Load on reinvested dividends | Ratio | |
| RedemptionFeeOverRedemption | Redemption fee (% of amount) | Ratio | OFST451440 Max Redemption Fee In Favour Of Fund |
| RedemptionFee | Redemption fee (flat $) | Monetary | OFST451439 Min Redemption Fee In Favour Of Fund |
| ExchangeFeeOverRedemption | Exchange fee (% of amount) | Ratio | |
| ExchangeFee | Exchange fee (flat $) | Monetary | |
| MaximumAccountFeeOverAssets | Account maintenance fee (%) | Ratio | |
| MaximumAccountFee | Account maintenance fee ($) | Monetary | |
| MaximumCumulativeSalesChargeOverOfferingPrice | Cumulative max sales charge | Ratio | |

C. Annual Fund Operating Expenses (ongoing costs deducted from fund assets)

| XBRL Element | Description | Data Type | openfunds Equivalent |
| --- | --- | --- | --- |
| ManagementFeesOverAssets | Management fee | Ratio | OFST452010 Management Fee Maximum |
| DistributionAndService12b1FeesOverAssets | 12b-1 distribution fee | Ratio | OFST454165 Distribution Fee Maximum |
| Component1OtherExpensesOverAssets | Other expense component 1 | Ratio | |
| Component2OtherExpensesOverAssets | Other expense component 2 | Ratio | |
| Component3OtherExpensesOverAssets | Other expense component 3 | Ratio | |
| OtherExpensesOverAssets | Total other expenses | Ratio | |
| AcquiredFundFeesAndExpensesOverAssets | Acquired fund fees (fund-of-funds) | Ratio | |
| ExpensesOverAssets | Total Annual Fund Operating Expenses | Ratio | OFST452100 TER Excluding Performance Fee |
| FeeWaiverOrReimbursementOverAssets | Fee waiver/reimbursement | Ratio | |
| NetExpensesOverAssets | Net expenses after waivers | Ratio | OFST452200 Ongoing Charges |
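
The rows of this table are arithmetically related: the total is the sum of the components, and net expenses are the total less any waiver. A minimal sketch of a consistency check that can flag mis-tagged filings before they are used as labels. It assumes ratios are stored as decimals, that a waiver may be tagged with either sign, and that a tolerance of 0.01 percentage points absorbs rounding; `check_fee_table` is a hypothetical helper, and funds with extra line items would need more terms.

```python
import math

def check_fee_table(facts, tol=1e-4):
    """Return human-readable inconsistencies among operating-expense ratios.

    `facts` maps XBRL element names to decimal ratios (0.0075 means 0.75%).
    """
    def get(name):
        return facts.get(name, 0.0)

    problems = []
    gross = (
        get("ManagementFeesOverAssets")
        + get("DistributionAndService12b1FeesOverAssets")
        + get("OtherExpensesOverAssets")
        + get("AcquiredFundFeesAndExpensesOverAssets")
    )
    if "ExpensesOverAssets" in facts and not math.isclose(
        gross, facts["ExpensesOverAssets"], abs_tol=tol
    ):
        problems.append(
            f"components sum to {gross:.4f}, total is {facts['ExpensesOverAssets']:.4f}"
        )

    # Assumption: waivers may be tagged as negative numbers, so abs() is used.
    net = get("ExpensesOverAssets") - abs(get("FeeWaiverOrReimbursementOverAssets"))
    if "NetExpensesOverAssets" in facts and not math.isclose(
        net, facts["NetExpensesOverAssets"], abs_tol=tol
    ):
        problems.append(
            f"total less waiver is {net:.4f}, net is {facts['NetExpensesOverAssets']:.4f}"
        )
    return problems

# Illustrative values only.
consistent = {
    "ManagementFeesOverAssets": 0.0050,
    "DistributionAndService12b1FeesOverAssets": 0.0025,
    "OtherExpensesOverAssets": 0.0015,
    "ExpensesOverAssets": 0.0090,
    "FeeWaiverOrReimbursementOverAssets": -0.0010,
    "NetExpensesOverAssets": 0.0080,
}
inconsistent = dict(consistent, ExpensesOverAssets=0.0100)
```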

D. Expense Example (hypothetical cost projections)

| XBRL Element | Description | Data Type | openfunds Equivalent |
| --- | --- | --- | --- |
| ExpenseExampleYear01 | Cost for $10K after 1 year | Monetary | |
| ExpenseExampleYear03 | Cost for $10K after 3 years | Monetary | |
| ExpenseExampleYear05 | Cost for $10K after 5 years | Monetary | |
| ExpenseExampleYear10 | Cost for $10K after 10 years | Monetary | |
| ExpenseExampleNoRedemptionYear01 | Cost if no redemption, 1 year | Monetary | |
| ExpenseExampleNoRedemptionYear03 | Cost if no redemption, 3 years | Monetary | |
| ExpenseExampleNoRedemptionYear05 | Cost if no redemption, 5 years | Monetary | |
| ExpenseExampleNoRedemptionYear10 | Cost if no redemption, 10 years | Monetary | |
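
These dollar figures follow from the expense ratio. The prospectus example assumes a $10,000 investment and a 5% annual return. The sketch below reproduces the calculation under one further assumption that is not stated in the source: expenses are charged on the average of each year's opening and closing balance. It ignores sales loads and waivers that expire, so it will not match every filing; `expense_example` is a hypothetical helper.

```python
def expense_example(expense_ratio, years, initial=10_000.0, annual_return=0.05):
    """Cumulative dollar cost over `years`, charging expenses on average yearly assets."""
    balance = initial
    total_cost = 0.0
    for _ in range(years):
        # Year-end balance solves: end = balance*(1+r) - e*(balance+end)/2
        end = balance * (1 + annual_return - expense_ratio / 2) / (1 + expense_ratio / 2)
        total_cost += expense_ratio * (balance + end) / 2
        balance = end
    return total_cost

# A 1.00% expense ratio with no loads or waivers (illustrative).
one_year = round(expense_example(0.01, 1))
three_year = round(expense_example(0.01, 3))
```

A check of this kind can tell whether a tagged ExpenseExampleYear01 value is plausible given the tagged NetExpensesOverAssets.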

E. Performance Data

| XBRL Element | Description | Data Type | openfunds Equivalent |
| --- | --- | --- | --- |
| AnnualReturn[YYYY] | Annual return for calendar year | Ratio | OFDY025000-range (Performance data) |
| BarChartHighestQuarterlyReturn | Best quarter return | Ratio | |
| BarChartLowestQuarterlyReturn | Worst quarter return | Ratio | |
| BarChartHighestQuarterlyReturnDate | Date of best quarter | Date | |
| BarChartLowestQuarterlyReturnDate | Date of worst quarter | Date | |
| AverageAnnualReturnYear01 | Average annual return, 1 year | Ratio | OFDY025000-range |
| AverageAnnualReturnYear05 | Average annual return, 5 years | Ratio | OFDY025000-range |
| AverageAnnualReturnYear10 | Average annual return, 10 years | Ratio | OFDY025000-range |
| AverageAnnualReturnSinceInception | Return since inception | Ratio | |
| AverageAnnualReturnInceptionDate | Inception date | Date | OFST020560 Share Class Launch Date |

Performance dimensions: Before Taxes, After Taxes on Distributions, After Taxes on Distributions and Sales

F. Risk Disclosures

| XBRL Element | Description | Data Type | openfunds Equivalent |
| --- | --- | --- | --- |
| RiskHeading | Risk section heading | Text | |
| RiskNarrativeTextBlock | Principal risks narrative | Text Block | |
| RiskLoseMoney | "You may lose money" statement | String | |
| RiskMoneyMarketFundMayImposeFeesOrSuspendSales | MMF gate/fee risk | Boolean | |
| RiskMoneyMarketFundPriceFluctuates | MMF NAV fluctuation risk | Boolean | |
| BarChartAndPerformanceTableHeading | Performance section heading | Text | |
| PerformanceNarrativeTextBlock | Performance context narrative | Text Block | |

G. Portfolio Turnover

| XBRL Element | Description | Data Type | openfunds Equivalent |
| --- | --- | --- | --- |
| PortfolioTurnoverHeading | Section heading | Text | |
| PortfolioTurnoverTextBlock | Turnover narrative | Text Block | |
| PortfolioTurnoverRate | Turnover rate (%) | Ratio | OFRE000025-range (Fund Ratios) |

2.3 N-PORT Data Sets (Portfolio Holdings — quarterly)

This provides dynamic portfolio data not typically in openfunds static fields.

A. Fund-Level Information (FUND_REPORTED_INFO)

| Field | Description | openfunds Equivalent |
| --- | --- | --- |
| SERIES_NAME | Fund name | OFST010110 Legal Fund Name Only |
| SERIES_ID | SEC series identifier | |
| SERIES_LEI | LEI of the fund series | OFST010030 LEI Of Fund |
| TOTAL_ASSETS | Total assets (USD) | OFDY000010-range (AuM/TNA) |
| TOTAL_LIABILITIES | Total liabilities | |
| NET_ASSETS | Net assets (TNA) | OFDY000010-range |
| SALES_FLOW_MON1/2/3 | Monthly inflows | |
| REDEMPTION_FLOW_MON1/2/3 | Monthly outflows | |
| Credit spread sensitivities (3m, 1y, 5y, 10y, 30y) | Risk measures | |

B. Interest Rate Risk (INTEREST_RATE_RISK)

| Field | Description | openfunds Equivalent |
| --- | --- | --- |
| CURRENCY_CODE | Currency of exposure | OFST010410 Fund Currency |
| INTRST_RATE_CHANGE_*_DV01 | DV01 by maturity bucket | |
| INTRST_RATE_CHANGE_*_DV100 | Impact of 100bp shift | |

C. Monthly Returns (MONTHLY_TOTAL_RETURN)

| Field | Description | openfunds Equivalent |
| --- | --- | --- |
| CLASS_ID | Share class identifier | |
| MONTHLY_TOTAL_RETURN1/2/3 | Monthly returns per class | OFDY025000-range |

D. Portfolio Holdings (FUND_REPORTED_HOLDING)

| Field | Description | openfunds Equivalent |
| --- | --- | --- |
| ISSUER_NAME | Holding issuer name | OFPH-range (Portfolio Holdings) |
| ISSUER_LEI | LEI of issuer | OFPH-range |
| ISSUER_TITLE | Security title/description | OFPH-range |
| ISSUER_CUSIP | CUSIP of holding | OFPH-range |
| BALANCE | Position size | OFPH-range |
| UNIT | Shares/principal/other | OFPH-range |
| CURRENCY_CODE | Currency of holding | OFPH-range |
| CURRENCY_VALUE | Value in reporting currency | OFPH-range |
| EXCHANGE_RATE | FX rate applied | |
| PERCENTAGE | % of net assets | OFPH-range |
| PAYOFF_PROFILE | Long/Short/N/A | OFPH-range |
| ASSET_CAT | Asset type classification | OFST350000 MiFID Securities Classification (concept) |
| ISSUER_TYPE | Corporate/Government/etc. | |
| INVESTMENT_COUNTRY | Country of issuer (ISO) | OFPH-range |
| IS_RESTRICTED_SECURITY | Restricted security flag | |
| FAIR_VALUE_LEVEL | Fair value hierarchy (1/2/3) | |
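
Fund-level asset class and currency exposure are not reported directly; they can be derived by summing PERCENTAGE over holdings. A minimal sketch, assuming rows have already been loaded as dictionaries keyed by the field names above. The holdings and the `exposure_by` helper are hypothetical, and the `ASSET_CAT` codes shown are only examples of the coded values.

```python
from collections import defaultdict

# Illustrative holdings only; field names follow FUND_REPORTED_HOLDING above.
holdings = [
    {"ISSUER_NAME": "Example Corp", "ASSET_CAT": "EC", "CURRENCY_CODE": "USD", "PERCENTAGE": "4.0"},
    {"ISSUER_NAME": "Beispiel AG", "ASSET_CAT": "EC", "CURRENCY_CODE": "EUR", "PERCENTAGE": "2.5"},
    {"ISSUER_NAME": "Example Treasury", "ASSET_CAT": "DBT", "CURRENCY_CODE": "USD", "PERCENTAGE": "3.0"},
]

def exposure_by(rows, key):
    """Sum the % of net assets per distinct value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        # Flat-file values arrive as strings, so convert before summing.
        totals[row[key]] += float(row["PERCENTAGE"])
    return dict(totals)

by_asset_class = exposure_by(holdings, "ASSET_CAT")
by_currency = exposure_by(holdings, "CURRENCY_CODE")
```

Short positions and derivatives can make the percentages sum to something other than 100, so the totals are best read as exposures, not as a partition of the fund.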

E. Holding Identifiers (IDENTIFIERS)

| Field | Description | openfunds Equivalent |
| --- | --- | --- |
| IDENTIFIER_ISIN | ISIN | OFST020000 ISIN |
| IDENTIFIER_TICKER | Ticker | |
| OTHER_IDENTIFIER | SEDOL, etc. | OFST020040 SEDOL |

3. Mapping: What openfunds Fields CAN Be Found in SEC Data?

Fully Available (structured, machine-readable)

| openfunds Category | openfunds OF-ID | openfunds Field | SEC Source | SEC Field |
| --- | --- | --- | --- | --- |
| Key Fact: Company | OFST001000 | Fund Group Name | Series/Class CSV | Entity Name |
| Key Fact: Umbrella | OFST005010 | Umbrella Name | Series/Class CSV | Entity Name |
| Key Fact: Fund | OFST010030 | LEI Of Fund | N-PORT | SERIES_LEI |
| | OFST010110 | Legal Fund Name Only | Series/Class CSV + N-PORT | Series Name |
| | OFST010300 | Investment Objective | XBRL R/R | ObjectivePrimaryTextBlock |
| | OFST010410 | Fund Currency | N-PORT | CURRENCY_CODE (inferred) |
| Key Fact: Share Class | OFST020000 | ISIN | N-PORT Holdings | IDENTIFIER_ISIN (identifies the held security, not the fund's own share class) |
| | OFST020005 | CUSIP | N-PORT Holdings | ISSUER_CUSIP (identifies the held security) |
| | OFST020040 | SEDOL | N-PORT Holdings | OTHER_IDENTIFIER (identifies the held security) |
| | OFST020060 | Full Share Class Name | Series/Class CSV | Class Name |
| Classification | OFST350000 | Securities Classification | N-PORT | ASSET_CAT |
| Fees | OFST451320 | Max Subscription Fee (Distributor) | XBRL R/R | MaximumSalesChargeImposedOnPurchasesOverOfferingPrice |
| | OFST451391 | CDSC Exit Fee | XBRL R/R | MaximumDeferredSalesChargeOverOfferingPrice |
| | OFST451440 | Max Redemption Fee | XBRL R/R | RedemptionFeeOverRedemption |
| | OFST452010 | Management Fee Maximum | XBRL R/R | ManagementFeesOverAssets |
| | OFST452100 | TER Excl. Performance Fee | XBRL R/R | ExpensesOverAssets |
| | OFST452200 | Ongoing Charges | XBRL R/R | NetExpensesOverAssets |
| | OFST454165 | Distribution Fee Maximum | XBRL R/R | DistributionAndService12b1FeesOverAssets |
| Performance | OFDY025xxx | Return periods | XBRL R/R | AverageAnnualReturnYear01/05/10 |
| Dynamic: AuM | OFDY000xxx | TNA / AuM | N-PORT | NET_ASSETS, TOTAL_ASSETS |
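
In code, the XBRL part of this mapping is a lookup table. A minimal sketch that re-keys a dictionary of XBRL facts by openfunds OF-ID, using only the pairs listed in this document; the `to_openfunds` helper and the sample facts are hypothetical, and no unit or value conversion is attempted.

```python
# XBRL Risk/Return element -> openfunds OF-ID, as listed in the mapping table.
XBRL_TO_OPENFUNDS = {
    "ObjectivePrimaryTextBlock": "OFST010300",
    "MaximumSalesChargeImposedOnPurchasesOverOfferingPrice": "OFST451320",
    "MaximumDeferredSalesChargeOverOfferingPrice": "OFST451391",
    "RedemptionFeeOverRedemption": "OFST451440",
    "ManagementFeesOverAssets": "OFST452010",
    "ExpensesOverAssets": "OFST452100",
    "NetExpensesOverAssets": "OFST452200",
    "DistributionAndService12b1FeesOverAssets": "OFST454165",
}

def to_openfunds(xbrl_facts):
    """Split facts into an OF-ID keyed record and a list of unmapped elements."""
    mapped, unmapped = {}, []
    for element, value in xbrl_facts.items():
        of_id = XBRL_TO_OPENFUNDS.get(element)
        if of_id is None:
            unmapped.append(element)
        else:
            mapped[of_id] = value
    return mapped, unmapped

# Illustrative facts only.
record, leftovers = to_openfunds({
    "ManagementFeesOverAssets": 0.0050,
    "NetExpensesOverAssets": 0.0080,
    "ExpenseExampleYear01": 82,
})
```

Returning the unmapped elements separately keeps SEC-only data, such as the expense example, visible instead of silently dropping it.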

Partially Available (derivable from prospectus text, not structured)

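These fields are NOT tagged directly in the SEC structured datasets. Most appear only in the text of the prospectus and would need to be extracted by an LLM, which is exactly the use case. A few can be inferred from structured fields or filing metadata, as noted in the table: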

| openfunds OF-ID | openfunds Field | Where in Prospectus |
| --- | --- | --- |
| OFST010420 | Open-ended Or Closed-ended Fund Structure | Registration form type implies this (N-1A = open-end) |
| OFST010440 | Fiscal Year End | Mentioned in prospectus text; also in Submissions JSON |
| OFST010500 | Is Fund Of Funds | Inferred from AcquiredFundFeesAndExpensesOverAssets > 0 |
| OFST010580 | Is ETF | Inferred from form type or share class structure |
| OFST010720 | Is Passive Fund | Strategy narrative mentions "index" tracking |
| OFST010730 | Management Approach Type | Strategy narrative (active/passive/enhanced) |
| OFST020300 | Valuation Frequency | Prospectus "Pricing of Fund Shares" section |
| OFST020400 | Distribution Policy | Prospectus "Dividends and Distributions" section |
| OFST020540 | Share Class Currency | Inferred; US funds typically USD |
| OFST020558 | Subscription Period Start Date | Only for closed-end or interval funds |
| OFST400xxx | Minimum Investment | Prospectus "Purchase and Sale of Fund Shares" |
| OFST451027 | Has Performance Fee | Prospectus fee table |
| OFST451100 | Hurdle Rate | Prospectus fee table |
| OFST013000 | Prospectus Date | Filing date in Submissions API |
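
Several rows in this table are simple rules, not free-text extraction. A minimal sketch of three of them as weak heuristics, assuming the form type, the XBRL facts and the strategy narrative are already at hand. The `infer_fields` helper is hypothetical, and the returned values are placeholders, not the value lists openfunds defines for these fields.

```python
OPEN_END_FORMS = {"N-1A", "485BPOS", "497K"}  # N-1A and filings made under it

def infer_fields(form_type, facts, strategy_text):
    """Derive a few openfunds fields from structured hints (weak heuristics)."""
    affe = facts.get("AcquiredFundFeesAndExpensesOverAssets", 0.0)
    text = strategy_text.lower()
    return {
        # N-1A is the open-end registration form; other forms are left undecided.
        "OFST010420": "open-end" if form_type in OPEN_END_FORMS else None,
        # Any acquired-fund fee means the fund holds other funds.
        "OFST010500": affe > 0,
        # Crude keyword test; "does not track an index" would be a false positive.
        "OFST010720": "index" in text,
    }

# Illustrative input only.
inferred = infer_fields(
    "485BPOS",
    {"AcquiredFundFeesAndExpensesOverAssets": 0.0012},
    "The Fund seeks to track the performance of a broad market index.",
)
```

Rule outputs like these are better treated as noisy labels for spot-checking than as ground truth, since each rule has obvious failure cases.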

NOT Available in SEC Data

These openfunds fields are European/international-specific or distribution-channel-specific and have no SEC equivalent:

| openfunds Category | Examples |
| --- | --- |
| UCITS/AIFMD fields | OFST160100 Legal Form (SICAV/FCP), OFST011200 Is UCITS With Leveraged Benchmark |
| European regulatory | OFST350100 EFAMA EFC Category, OFST010075 CSSF Code |
| Distribution-specific | OFST453151 Is Trailer Fee Clean, OFST451305 Applied Subscription Fee |
| MiFID/PRIIPs/KID | OFEM-range (MiFID Template), OFEP-range (PRIIPs Template) |
| ESG/Sustainability | OFST820xxx, OFEE-range (EU sustainability regulation specific) |
| Country registrations | OFST6000XX (country-specific registration fields) |
| Solvency II | OFST500xxx |
| Swiss/German/UK specific | OFST700xxx |

4. Summary: Are Asset Class, Currencies, Fees, and Risk Data in the SEC Datasets?

| Data Category | In SEC Structured Data? | Source | Notes |
| --- | --- | --- | --- |
| Asset Class | YES | N-PORT ASSET_CAT field | Values: equity, debt, derivative, etc. |
| Currencies | YES | N-PORT CURRENCY_CODE per holding; interest rate risk by currency | Per-holding currency + fund-level |
| Fees (sales loads) | YES | XBRL R/R | Front-end load, back-end load, redemption fee |
| Fees (operating expenses) | YES | XBRL R/R | Management fee, 12b-1, TER, net expense ratio |
| Risk data (narrative) | YES | XBRL R/R | Principal risks text block |
| Risk data (quantitative) | YES | N-PORT | DV01, credit spread sensitivity, VaR |
| Performance | YES | XBRL R/R + N-PORT | Annual returns, avg annual returns, monthly returns |
| Investment Objective | YES | XBRL R/R | Full text of objective |
| Strategy | YES | XBRL R/R | Full text of principal strategies |
| Portfolio Turnover | YES | XBRL R/R | Turnover rate |
| Portfolio Holdings | YES | N-PORT | Security-level: name, CUSIP, ISIN, country, asset type, value |
| Country of Issuer | YES | N-PORT INVESTMENT_COUNTRY | ISO country code per holding |
| Minimum Investment | PARTIAL | In prospectus text, not structured | LLM extraction target |
| Distribution Policy | PARTIAL | In prospectus text, not structured | LLM extraction target |
| ESG/Sustainability | NO | Not in SEC structured data | European regulation specific |
| UCITS Classification | NO | N/A for US funds | European regulation specific |

5. Implication for LLM Training Dataset

The SEC data provides a strong foundation for an LLM training dataset:

Ground Truth (structured) — available directly from SEC:

  • Fee tables (management fee, expense ratio, loads, 12b-1)
  • Performance data (1yr, 5yr, 10yr returns)
  • Investment objective text
  • Principal risks text
  • Portfolio turnover rate
  • Total net assets
  • Fund/class identifiers (CIK, Series ID, Class ID, Ticker, CUSIP)

Extraction Targets (in prospectus text, to be derived by LLM):

  • Minimum initial investment amounts
  • Distribution frequency and policy
  • Share class currency
  • Open/closed-end structure
  • Active vs. passive management
  • Benchmark index name
  • Tax status information
  • Purchase/redemption cut-off times
  • Settlement cycle

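This creates a natural supervised learning setup: the SEC structured data (XBRL Risk/Return, N-PORT, Series/Class CSV) serves as labels/ground truth, and the prospectus HTML/text serves as input, enabling the LLM to learn the mapping from legal language to structured reference data. The extraction targets listed above have no structured labels in the SEC data, so they need a separate annotation or evaluation step.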
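
One way to picture the setup is as a list of (input, target) pairs. A minimal sketch, assuming prospectus text and XBRL facts have already been joined per fund; the `make_example` helper, the JSON target format and the sample values are hypothetical choices, not a prescribed pipeline.

```python
import json

def make_example(prospectus_text, xbrl_facts, label_keys):
    """Pair prospectus text with the structured labels the model should emit."""
    # Keep only the requested labels that are present for this fund.
    labels = {k: xbrl_facts[k] for k in label_keys if k in xbrl_facts}
    return {
        "input": prospectus_text,
        # sort_keys gives a stable serialization across examples.
        "target": json.dumps(labels, sort_keys=True),
    }

# Illustrative values only.
example = make_example(
    "Management Fees 0.50% ... Total Annual Fund Operating Expenses 0.90%",
    {"ManagementFeesOverAssets": 0.0050, "ExpensesOverAssets": 0.0090, "ExpenseExampleYear01": 92},
    ["ManagementFeesOverAssets", "ExpensesOverAssets", "PortfolioTurnoverRate"],
)
```

Labels that are missing for a fund are omitted from the target instead of being filled with a default, so the model is not trained to emit values the filing never tagged.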