fund_rfid_data/SEC_FUND_DATA_RESEARCH.md
Florian Herzog 1993658fb2 Add SEC fund prospectus -> RDF triple dataset pipeline
Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures, the dataset for the thesis 'Magical RDF Triples and
how to synthetize them'.

- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
  all books per trust), samples (per-fund segmentation, marker + plain
  serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:31:35 +02:00


SEC Fund Prospectus & Reference Data Research

1. Overview: Where Are Fund Prospectuses at the SEC?

All US-registered investment companies (mutual funds, ETFs) must file their prospectuses and amendments with the SEC via the EDGAR (Electronic Data Gathering, Analysis, and Retrieval) system. These filings are publicly available at no cost.

Relevant SEC Filing Form Types

| Form Type | Description | Content |
| --- | --- | --- |
| N-1A | Initial registration statement for open-end management investment companies (mutual funds) | Full prospectus + SAI (Statement of Additional Information) |
| 485BPOS | Post-effective amendment that becomes effective upon filing | Updated prospectus (most common annual update) |
| 485APOS | Post-effective amendment filed in advance | Prospectus amendment requiring SEC review |
| 497 | Definitive materials filed under Securities Act Rule 497 | Prospectus supplements, SAI updates, sticker supplements |
| 497K | Summary prospectus for mutual funds | Condensed key-facts prospectus (3-4 pages per fund/class) |
| N-CSR | Certified shareholder report | Annual/semi-annual report with financial statements |
| NPORT-P | Monthly portfolio holdings report (made public quarterly) | Detailed portfolio positions |

Key Insight for Your Use Case

  • 485BPOS filings contain the full statutory prospectus (the main legal document)
  • 497K filings contain the summary prospectus (condensed, standardized format ideal for LLM training)
  • 497 filings contain supplements and amendments that modify the prospectus

2. Where Is the Reference Data?

Reference data for funds and their share classes comes from three complementary sources:

Source A: Investment Company Series and Class Information (CSV/XML)

  • URL: https://www.sec.gov/data-research/sec-markets-data/investment-company-series-class-information
  • Download: https://www.sec.gov/files/investment/data/other/investment-company-series-class-information/investment-company-series-class-2025.csv
  • Updated: Monthly (latest: June 2025)
  • Contains:
    • CIK (Central Index Key): identifier of the umbrella fund company (the registrant)
    • File Number (811-XXXXX): Investment Company Act registration number
    • Series ID (S plus nine digits, e.g., S000002839): unique identifier per fund series (e.g., "Vanguard 500 Index Fund")
    • Class ID (C plus nine digits, e.g., C000007773): unique identifier per share class
    • Ticker Symbol
    • Fund/Class name
    • Status (active/inactive)
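
As a minimal sketch of how these columns could be grouped into a trust → series → class lookup, the snippet below reads CSV text with the standard library. The column names in `SAMPLE_CSV` are assumptions made for illustration and should be checked against the header row of the file actually downloaded; the three rows reuse the Vanguard 500 identifiers listed later in this document.

```python
import csv
import io
from collections import defaultdict

# Column names are assumed for illustration; verify against the real header.
SAMPLE_CSV = """CIK Number,Series ID,Class ID,Class Name,Class Ticker
36405,S000002839,C000007773,Investor Shares,VFINX
36405,S000002839,C000007774,Admiral Shares,VFIAX
36405,S000002839,C000092055,ETF Shares,VOO
"""


def group_classes(csv_text):
    """Return {cik: {series_id: [(class_id, ticker), ...]}} from CSV text."""
    tree = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        cik = row["CIK Number"].strip().zfill(10)  # normalize to 10 digits
        tree[cik][row["Series ID"]].append((row["Class ID"], row["Class Ticker"]))
    return tree


tree = group_classes(SAMPLE_CSV)
print(len(tree["0000036405"]["S000002839"]))  # 3
```

Normalizing the CIK to ten digits up front keeps it directly usable in the submissions URL of Source C.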

Source B: Mutual Fund Prospectus Risk/Return Summary Data Sets (XBRL)

  • URL: https://www.sec.gov/data-research/sec-markets-data/mutual-fund-prospectus-riskreturn-summary-data-sets
  • Download: Quarterly ZIP files, e.g., https://www.sec.gov/files/dera/data/mutual-fund-prospectus-risk/return-summary-data-sets/2025q2_rr1.zip
  • Coverage: December 2010 to present (quarterly updates)
  • Contains structured XBRL data extracted from prospectuses:
    • Expense ratios (management fees, 12b-1 fees, total annual operating expenses)
    • Fee tables and expense examples
    • Performance bar charts and tables (1yr, 5yr, 10yr returns)
    • Risk/return narratives
    • Investment objectives
    • Principal strategies and risks
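
The quarterly download URLs can be generated rather than typed by hand. The sketch below follows the naming pattern of the example URL above; that every quarter since 2010 Q4 follows exactly this pattern is an assumption, and `rr_zip_url` and `quarters` are hypothetical helper names.

```python
# Base path copied from the example download URL above.
BASE = ("https://www.sec.gov/files/dera/data/"
        "mutual-fund-prospectus-risk/return-summary-data-sets")


def rr_zip_url(year, quarter):
    """Build the ZIP URL for one quarter, e.g. (2025, 2) -> .../2025q2_rr1.zip."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1..4")
    return f"{BASE}/{year}q{quarter}_rr1.zip"


def quarters(start, end):
    """Yield (year, quarter) pairs from start to end, both inclusive."""
    year, quarter = start
    while (year, quarter) <= end:
        yield year, quarter
        year, quarter = (year, quarter + 1) if quarter < 4 else (year + 1, 1)


urls = [rr_zip_url(y, q) for y, q in quarters((2010, 4), (2025, 2))]
print(urls[-1])
```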

Source C: EDGAR Submissions API (JSON)

  • URL: https://data.sec.gov/submissions/CIK{cik_padded}.json
  • Contains: Full filing history per entity, including metadata (dates, form types, accession numbers, primary document URLs)

3. Unique Identifiers for Each Share Class

| Identifier | Format | Source | Scope |
| --- | --- | --- | --- |
| CIK | 10-digit number (e.g., 0000036405) | SEC EDGAR | Umbrella fund entity (investment company trust) |
| File Number | 811-XXXXX | SEC | Investment Company Act registration |
| Series ID | S plus 9 digits (e.g., S000002839) | SEC EDGAR | Individual fund within the trust |
| Class ID | C plus 9 digits (e.g., C000007773) | SEC EDGAR | Individual share class |
| Ticker | 1-5 chars (e.g., VFIAX) | Exchange | Share class trading identifier |
| CUSIP | 9-char alphanumeric | CUSIP Global Services | Security-level (per share class) |
| ISIN | 12-char (US + CUSIP + check digit) | ANNA | International identifier (derived from CUSIP for US) |
| Accession Number | XXXXXXXXXX-YY-ZZZZZZ | SEC EDGAR | Unique per filing |
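
The formats in this table can be checked mechanically, and the "US + CUSIP + check" rule for ISINs can be worked through in code. The following is a sketch under stated assumptions: the regular expressions simply restate the formats above, `is_valid` and `isin_from_cusip` are hypothetical helper names, and the check digit uses the standard Luhn-style ISIN calculation. The sample CUSIP 037833100 is Apple Inc. common stock, used only because it is a widely published worked example, not because it is a fund.

```python
import re

# Patterns restate the identifier formats from the table above.
PATTERNS = {
    "cik": re.compile(r"\d{10}"),
    "series_id": re.compile(r"S\d{9}"),
    "class_id": re.compile(r"C\d{9}"),
    "accession": re.compile(r"\d{10}-\d{2}-\d{6}"),
}


def is_valid(kind, value):
    """True if value matches the expected format for the identifier kind."""
    return PATTERNS[kind].fullmatch(value) is not None


def isin_from_cusip(cusip, country="US"):
    """Derive a 12-character ISIN from a 9-character alphanumeric CUSIP."""
    body = (country + cusip).upper()
    # Letters become two-digit numbers (A=10 ... Z=35); digits stay as-is.
    digits = "".join(str(int(ch, 36)) for ch in body)
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:  # double every second digit, starting from the right
            d *= 2
        total += d // 10 + d % 10  # add the digits of the (doubled) value
    return body + str((10 - total % 10) % 10)


print(isin_from_cusip("037833100"))  # US0378331005
```

CUSIPs containing the rare special characters `*`, `@`, or `#` are not handled by this sketch.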

Hierarchy

Investment Company Trust (CIK) — e.g., "Vanguard Index Funds"
  └── Fund Series (Series ID) — e.g., "Vanguard 500 Index Fund"
        ├── Share Class (Class ID) — e.g., "Investor Shares" (VFINX)
        ├── Share Class (Class ID) — e.g., "Admiral Shares" (VFIAX)
        └── Share Class (Class ID) — e.g., "ETF Shares" (VOO)
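
One way to mirror this hierarchy in code is a set of nested dataclasses. The class and method names below (`Trust`, `FundSeries`, `ShareClass`, `find_ticker`) are illustrative assumptions, not part of any SEC schema; the sample values are the Vanguard identifiers listed in this document.

```python
from dataclasses import dataclass, field


@dataclass
class ShareClass:
    class_id: str
    name: str
    ticker: str


@dataclass
class FundSeries:
    series_id: str
    name: str
    classes: list[ShareClass] = field(default_factory=list)


@dataclass
class Trust:
    cik: str
    name: str
    series: list[FundSeries] = field(default_factory=list)

    def find_ticker(self, ticker):
        """Return (series, share_class) for a ticker, or None if absent."""
        for fund in self.series:
            for share_class in fund.classes:
                if share_class.ticker == ticker:
                    return fund, share_class
        return None


trust = Trust("0000036405", "Vanguard Index Funds", [
    FundSeries("S000002839", "Vanguard 500 Index Fund", [
        ShareClass("C000007773", "Investor Shares", "VFINX"),
        ShareClass("C000007774", "Admiral Shares", "VFIAX"),
        ShareClass("C000092055", "ETF Shares", "VOO"),
    ]),
])
fund, share_class = trust.find_ticker("VOO")
print(fund.series_id, share_class.class_id)  # S000002839 C000092055
```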

4. APIs and Web Endpoints for Downloading Data

4.1 EDGAR Submissions API (FREE, no authentication)

GET https://data.sec.gov/submissions/CIK{10-digit-cik}.json
Header: User-Agent: "YourApp your@email.com"

Returns: JSON with entity metadata + all filing history (form type, dates, accession numbers, primary document filenames).
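
A sketch of how the response could be filtered for prospectus filings: the recent filing history is stored as parallel arrays under `filings.recent` (`form`, `filingDate`, `accessionNumber`, `primaryDocument`). The `sample` dictionary is synthetic and only imitates that shape; `submissions_url` and `recent_filings` are hypothetical helper names. Older filings may sit in additional files referenced under `filings.files`, which this sketch does not follow.

```python
def submissions_url(cik):
    """Build the submissions URL, zero-padding the CIK to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def recent_filings(submissions, forms):
    """Return recent filings whose form type is in `forms`, as dicts."""
    recent = submissions["filings"]["recent"]
    rows = zip(recent["form"], recent["filingDate"],
               recent["accessionNumber"], recent["primaryDocument"])
    return [
        {"form": form, "filing_date": date, "accession": acc, "document": doc}
        for form, date, acc, doc in rows
        if form in forms
    ]


# Synthetic response fragment; all values are placeholders.
sample = {"filings": {"recent": {
    "form": ["497K", "NPORT-P", "485BPOS"],
    "filingDate": ["2025-04-29", "2025-03-31", "2025-04-28"],
    "accessionNumber": ["0000000000-25-000003", "0000000000-25-000002",
                        "0000000000-25-000001"],
    "primaryDocument": ["summary.htm", "primary_doc.xml", "prospectus.htm"],
}}}

print(submissions_url(36405))
print([f["form"] for f in recent_filings(sample, {"485BPOS", "497K"})])
```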

4.2 EDGAR Full-Text Search API (FREE, no authentication)

GET https://efts.sec.gov/LATEST/search-index?q={query}&forms={form_types}&dateRange=custom&startdt={YYYY-MM-DD}&enddt={YYYY-MM-DD}
Header: User-Agent: "YourApp your@email.com"

Returns: JSON with matching filings, metadata, and file references. Supports Boolean operators (AND, OR, NOT, NEAR) and wildcard (*).
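
The query string is easiest to build with `urllib.parse.urlencode`, which handles quoting of phrases and separators. This is a minimal sketch: passing several form types as one comma-separated `forms` value is an assumption, and `search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://efts.sec.gov/LATEST/search-index"


def search_url(query, forms, start, end):
    """Build a full-text search URL for a query, form types, and date range."""
    params = {
        "q": query,
        "forms": ",".join(forms),  # assumed: comma-separated form types
        "dateRange": "custom",
        "startdt": start,  # YYYY-MM-DD
        "enddt": end,      # YYYY-MM-DD
    }
    return SEARCH_BASE + "?" + urlencode(params)


print(search_url('"expense ratio"', ["485BPOS", "497K"],
                 "2025-01-01", "2025-06-30"))
```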

4.3 Filing Document Download (FREE, no authentication)

GET https://www.sec.gov/Archives/edgar/data/{cik}/{accession-number-no-dashes}/{document-filename}
Header: User-Agent: "YourApp your@email.com"

Returns: The actual filing document (HTML, XML, or PDF).
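
The path components map directly onto fields from the submissions JSON: the CIK without leading zeros, the accession number with dashes removed, and the primary document filename. A small sketch (`document_url` is a hypothetical helper name), checked against the 497K URL listed for the Vanguard 500 Index Fund in this document:

```python
def document_url(cik, accession, filename):
    """Build the Archives URL for one document of a filing."""
    folder = accession.replace("-", "")  # 0001683863-25-004160 -> 000168386325004160
    return (f"https://www.sec.gov/Archives/edgar/data/"
            f"{int(cik)}/{folder}/{filename}")


print(document_url("0000036405", "0001683863-25-004160", "f41649d1.htm"))
```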

4.4 XBRL APIs (FREE, no authentication)

GET https://data.sec.gov/api/xbrl/companyfacts/CIK{10-digit-cik}.json
GET https://data.sec.gov/api/xbrl/companyconcept/CIK{10-digit-cik}/{taxonomy}/{concept}.json

4.5 Rate Limits

  • 10 requests per second maximum
  • Must include a User-Agent header with company name and email
  • No API key required
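
A client can stay under the limit by spacing requests at least 0.1 s apart. The `Throttle` class below is a hypothetical helper, one simple way to do this rather than an SEC-provided mechanism; the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time

# Placeholder identity from the examples above; replace with real contact data.
HEADERS = {"User-Agent": "YourApp your@email.com"}


class Throttle:
    """Enforce a minimum interval between calls (default: 10 per second)."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        now = self.clock()
        delay = self.next_allowed - now
        if delay > 0:
            self.sleep(delay)
            now += delay
        self.next_allowed = now + self.min_interval
        return max(delay, 0.0)
```

Calling `throttle.wait()` immediately before each HTTP request, and sending `HEADERS` with it, covers both requirements listed above.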

5. Five Concrete Fund Examples

Example 1: Vanguard 500 Index Fund

| Field | Value |
| --- | --- |
| Trust Name | Vanguard Index Funds |
| CIK | 0000036405 |
| File Number | 811-02652 |
| Series ID | S000002839 |
| Fund Name | Vanguard 500 Index Fund |
| Share Classes | |
| Investor Shares | Class ID: C000007773, Ticker: VFINX |
| Admiral Shares | Class ID: C000007774, Ticker: VFIAX |
| ETF Shares | Class ID: C000092055, Ticker: VOO |
| Institutional Select | Class ID: C000170274, Ticker: VFFSX |
| Latest 497K (Summary Prospectus) | https://www.sec.gov/Archives/edgar/data/36405/000168386325004160/f41649d1.htm |
| Submissions JSON | https://data.sec.gov/submissions/CIK0000036405.json |

Example 2: Fidelity Contrafund

| Field | Value |
| --- | --- |
| Trust Name | Fidelity Contrafund |
| CIK | 0000024238 |
| File Number | 811-01400 |
| Series ID | S000006037 |
| Fund Name | Fidelity Contrafund |
| Share Classes | |
| Fidelity Contrafund | Class ID: C000016601, Ticker: FCNTX |
| Class K | Class ID: C000064233, Ticker: FCNKX |
| K6 Class | Class ID: C000182865, Ticker: FLCNX |
| Submissions JSON | https://data.sec.gov/submissions/CIK0000024238.json |

Example 3: iShares Core S&P 500 ETF (BlackRock)

| Field | Value |
| --- | --- |
| Trust Name | iShares Trust |
| CIK | 0001100663 |
| File Number | 811-09729 |
| Series ID | S000002838 |
| Fund Name | iShares Core S&P 500 ETF |
| Share Classes | |
| ETF Shares | Ticker: IVV |
| Submissions JSON | https://data.sec.gov/submissions/CIK0001100663.json |

Example 4: Columbia Funds Series Trust I

| Field | Value |
| --- | --- |
| Trust Name | Columbia Funds Series Trust I |
| CIK | 0000773757 |
| File Number | 811-04367 |
| Submissions JSON | https://data.sec.gov/submissions/CIK0000773757.json |

Example 5: T. Rowe Price Exchange-Traded Funds

| Field | Value |
| --- | --- |
| Trust Name | T. Rowe Price Exchange-Traded Funds, Inc. |
| CIK | 0001795351 |
| File Number | 811-23494 |
| Submissions JSON | https://data.sec.gov/submissions/CIK0001795351.json |

6. Data Architecture for LLM Training Dataset

Goal: Pair prospectus text with structured reference data

Dataset Record:
├── fund_identity
│   ├── cik, series_id, class_id, ticker, cusip
│   ├── fund_name, trust_name
│   └── file_number
├── prospectus_document
│   ├── accession_number, filing_date, form_type
│   ├── document_url
│   └── document_text (HTML/XML → plain text)
├── structured_reference_data (from XBRL risk/return)
│   ├── investment_objective
│   ├── expense_ratio, management_fee, 12b1_fee
│   ├── performance_1yr, performance_5yr, performance_10yr
│   ├── minimum_investment
│   └── risk_narratives
└── supplements_amendments[]
    ├── accession_number, filing_date, form_type
    └── document_text

This allows training an LLM to:

  1. Extract structured fields (expense ratio, objective, fees) from raw prospectus text
  2. Validate extracted data against the XBRL ground truth
  3. Handle amendments/supplements that modify the base prospectus
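
The validation step (item 2) could look roughly like the sketch below, which compares model-extracted fields with the structured reference values. Field names follow the record layout above; `validate_fields`, the tolerance, and all sample values are illustrative assumptions, not real fund data.

```python
def _norm(text):
    """Collapse whitespace and lowercase, for lenient text comparison."""
    return " ".join(str(text).split()).lower()


def validate_fields(extracted, reference, tolerance=1e-6):
    """Label each reference field as 'match', 'mismatch', or 'missing'."""
    report = {}
    for name, expected in reference.items():
        got = extracted.get(name)
        if got is None:
            report[name] = "missing"
        elif isinstance(expected, (int, float)):
            try:
                ok = abs(float(got) - expected) <= tolerance
            except (TypeError, ValueError):
                ok = False  # extracted value is not numeric
            report[name] = "match" if ok else "mismatch"
        else:
            report[name] = "match" if _norm(got) == _norm(expected) else "mismatch"
    return report


# Made-up values for illustration only.
reference = {
    "expense_ratio": 0.0014,
    "management_fee": 0.0013,
    "investment_objective": "Seeks to track the performance of a benchmark index.",
}
extracted = {
    "expense_ratio": "0.0014",
    "investment_objective": "seeks to track  the performance of a benchmark index.",
}
print(validate_fields(extracted, reference))
```

Exact matching after normalization is deliberately strict; narrative fields such as risk descriptions would likely need a softer similarity measure.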