# SEC Fund Prospectus & Reference Data Research

## 1. Overview: Where Are Fund Prospectuses at the SEC?

All US-registered investment companies (mutual funds, ETFs) must file their prospectuses and amendments with the SEC via the **EDGAR** (Electronic Data Gathering, Analysis, and Retrieval) system. These filings are publicly available at no cost.

### Relevant SEC Filing Form Types

| Form Type | Description | Content |
|-----------|-------------|---------|
| **N-1A** | Initial registration statement for open-end management investment companies (mutual funds) | Full prospectus + SAI (Statement of Additional Information) |
| **485BPOS** | Post-effective amendment filed under Rule 485(b); effective upon filing or on a designated later date | Updated prospectus (most common annual update) |
| **485APOS** | Post-effective amendment filed under Rule 485(a), in advance of its effective date | Prospectus amendment subject to SEC staff review |
| **497** | Definitive materials filed under Securities Act Rule 497 | Prospectus supplements, SAI updates, sticker supplements |
| **497K** | Summary prospectus for mutual funds | Condensed key-facts prospectus (3-4 pages per fund/class) |
| **N-CSR** | Certified shareholder report | Annual/semi-annual report with financial statements |
| **NPORT-P** | Monthly portfolio holdings (public quarterly) | Detailed portfolio positions |

### Key Insight for Your Use Case

- **485BPOS** filings contain the **full statutory prospectus** (the main legal document)
- **497K** filings contain the **summary prospectus** (condensed, standardized format ideal for LLM training)
- **497** filings contain **supplements and amendments** that modify the prospectus

---

## 2. Where Is the Reference Data?

Reference data for funds and their share classes comes from **three complementary sources**:

### Source A: Investment Company Series and Class Information (CSV/XML)

- **URL**: `https://www.sec.gov/data-research/sec-markets-data/investment-company-series-class-information`
- **Download**: `https://www.sec.gov/files/investment/data/other/investment-company-series-class-information/investment-company-series-class-2025.csv`
- **Updated**: Monthly (latest: June 2025)
- **Contains**:
  - CIK (Central Index Key): the umbrella fund company identifier
  - File Number (811-XXXXX): Investment Company Act registration number
  - Series ID ("S" plus nine digits): unique identifier per fund series (e.g., "Vanguard 500 Index Fund")
  - Class ID ("C" plus nine digits): unique identifier per share class
  - Ticker Symbol
  - Fund/Class name
  - Status (active/inactive)

### Source B: Mutual Fund Prospectus Risk/Return Summary Data Sets (XBRL)

- **URL**: `https://www.sec.gov/data-research/sec-markets-data/mutual-fund-prospectus-riskreturn-summary-data-sets`
- **Download**: Quarterly ZIP files, e.g., `https://www.sec.gov/files/dera/data/mutual-fund-prospectus-risk/return-summary-data-sets/2025q2_rr1.zip`
- **Coverage**: December 2010 to present (quarterly updates)
- **Contains structured XBRL data extracted from prospectuses**:
  - Expense ratios (management fees, 12b-1 fees, total annual operating expenses)
  - Fee tables and expense examples
  - Performance bar charts and tables (1yr, 5yr, 10yr returns)
  - Risk/return narratives
  - Investment objectives
  - Principal strategies and risks

### Source C: EDGAR Submissions API (JSON)

- **URL**: `https://data.sec.gov/submissions/CIK{cik_padded}.json`
- **Contains**: Full filing history per entity, including metadata (dates, form types, accession numbers, primary document URLs)

---

## 3. Unique Identifiers for Each Share Class

| Identifier | Format | Source | Scope |
|-----------|--------|--------|-------|
| **CIK** | Number zero-padded to 10 digits (e.g., 0000036405) | SEC EDGAR | Umbrella fund entity (investment company trust) |
| **File Number** | 811-XXXXX | SEC | Investment Company Act registration |
| **Series ID** | "S" + 9 digits (e.g., S000002839) | SEC EDGAR | Individual fund within the trust |
| **Class ID** | "C" + 9 digits (e.g., C000007773) | SEC EDGAR | Individual share class |
| **Ticker** | 1-5 chars (e.g., VFIAX) | Exchange | Share class trading identifier |
| **CUSIP** | 9-char alphanumeric | CUSIP Global Services | Security-level (per share class) |
| **ISIN** | 12-char (US + CUSIP + check) | ANNA | International identifier (derived from CUSIP for US) |
| **Accession Number** | XXXXXXXXXX-YY-ZZZZZZ | SEC EDGAR | Unique per filing |

### Hierarchy

```
Investment Company Trust (CIK), e.g., "Vanguard Index Funds"
└── Fund Series (Series ID), e.g., "Vanguard 500 Index Fund"
    ├── Share Class (Class ID), e.g., "Investor Shares" (VFINX)
    ├── Share Class (Class ID), e.g., "Admiral Shares" (VFIAX)
    └── Share Class (Class ID), e.g., "ETF Shares" (VOO)
```

---

## 4. APIs and Web Endpoints for Downloading Data

### 4.1 EDGAR Submissions API (FREE, no authentication)

```
GET https://data.sec.gov/submissions/CIK{10-digit-cik}.json
Header: User-Agent: "YourApp your@email.com"
```

Returns: JSON with entity metadata + all filing history (form type, dates, accession numbers, primary document filenames).

### 4.2 EDGAR Full-Text Search API (FREE, no authentication)

```
GET https://efts.sec.gov/LATEST/search-index?q={query}&forms={form_types}&dateRange=custom&startdt={YYYY-MM-DD}&enddt={YYYY-MM-DD}
Header: User-Agent: "YourApp your@email.com"
```

Returns: JSON with matching filings, metadata, and file references. Supports Boolean operators (AND, OR, NOT, NEAR) and wildcard (*).

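As a minimal sketch of how the Submissions API from 4.1 might be called, the Python below builds the URL, sends the `User-Agent` header, and filters the filing history by form type. The helper names (`submissions_url`, `fetch_json`, `recent_filings`) are hypothetical, and the code assumes the response exposes its history as parallel arrays under `filings.recent` (`form`, `filingDate`, `accessionNumber`, `primaryDocument`); check a live response before relying on that layout.

```python
import json
import urllib.request

# Placeholder from the text above; replace with your own app name and email.
USER_AGENT = "YourApp your@email.com"


def submissions_url(cik):
    """Build the Submissions API URL, zero-padding the CIK to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def fetch_json(url):
    """GET a URL with the required User-Agent header and decode the JSON body."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def recent_filings(submissions, form_types):
    """Convert the columnar filing history into one dict per matching filing.

    Assumption: `filings.recent` holds parallel arrays named `form`,
    `filingDate`, `accessionNumber`, and `primaryDocument`.
    """
    recent = submissions["filings"]["recent"]
    rows = zip(
        recent["form"],
        recent["filingDate"],
        recent["accessionNumber"],
        recent["primaryDocument"],
    )
    return [
        {
            "form": form,
            "filing_date": filing_date,
            "accession_number": accession,
            "primary_document": document,
        }
        for form, filing_date, accession, document in rows
        if form in form_types
    ]
```

A typical call would be `recent_filings(fetch_json(submissions_url(36405)), {"485BPOS", "497K", "497"})`, which needs network access. Keeping the parsing separate from the HTTP call makes it possible to test the filter against a saved response.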
### 4.3 Filing Document Download (FREE, no authentication)

```
GET https://www.sec.gov/Archives/edgar/data/{cik}/{accession-number-no-dashes}/{document-filename}
Header: User-Agent: "YourApp your@email.com"
```

Returns: The actual filing document (HTML, XML, or PDF).

### 4.4 XBRL APIs (FREE, no authentication)

```
GET https://data.sec.gov/api/xbrl/companyfacts/CIK{10-digit-cik}.json
GET https://data.sec.gov/api/xbrl/companyconcept/CIK{10-digit-cik}/{taxonomy}/{concept}.json
```

### 4.5 Rate Limits

- **10 requests per second** maximum
- Must include a `User-Agent` header with company name and email
- No API key required

---

## 5. Five Concrete Fund Examples

### Example 1: Vanguard 500 Index Fund

| Field | Value |
|-------|-------|
| **Trust Name** | Vanguard Index Funds |
| **CIK** | 0000036405 |
| **File Number** | 811-02652 |
| **Series ID** | S000002839 |
| **Fund Name** | Vanguard 500 Index Fund |
| **Share Classes** | |
| Investor Shares | Class ID: C000007773, Ticker: VFINX |
| Admiral Shares | Class ID: C000007774, Ticker: VFIAX |
| ETF Shares | Class ID: C000092055, Ticker: VOO |
| Institutional Select | Class ID: C000170274, Ticker: VFFSX |
| **Latest 497K (Summary Prospectus)** | `https://www.sec.gov/Archives/edgar/data/36405/000168386325004160/f41649d1.htm` |
| **Submissions JSON** | `https://data.sec.gov/submissions/CIK0000036405.json` |

### Example 2: Fidelity Contrafund

| Field | Value |
|-------|-------|
| **Trust Name** | Fidelity Contrafund |
| **CIK** | 0000024238 |
| **File Number** | 811-01400 |
| **Series ID** | S000006037 |
| **Fund Name** | Fidelity Contrafund |
| **Share Classes** | |
| Fidelity Contrafund | Class ID: C000016601, Ticker: FCNTX |
| Class K | Class ID: C000064233, Ticker: FCNKX |
| K6 Class | Class ID: C000182865, Ticker: FLCNX |
| **Submissions JSON** | `https://data.sec.gov/submissions/CIK0000024238.json` |

### Example 3: iShares Core S&P 500 ETF (BlackRock)

| Field | Value |
|-------|-------|
| **Trust Name** | iShares Trust |
| **CIK** | 0001100663 |
| **File Number** | 811-09729 |
| **Series ID** | S000002838 |
| **Fund Name** | iShares Core S&P 500 ETF |
| **Share Classes** | |
| ETF Shares | Ticker: IVV |
| **Submissions JSON** | `https://data.sec.gov/submissions/CIK0001100663.json` |

### Example 4: Columbia Funds Series Trust I

| Field | Value |
|-------|-------|
| **Trust Name** | Columbia Funds Series Trust I |
| **CIK** | 0000773757 |
| **File Number** | 811-04367 |
| **Submissions JSON** | `https://data.sec.gov/submissions/CIK0000773757.json` |

### Example 5: T. Rowe Price Exchange-Traded Funds

| Field | Value |
|-------|-------|
| **Trust Name** | T. Rowe Price Exchange-Traded Funds, Inc. |
| **CIK** | 0001795351 |
| **File Number** | 811-23494 |
| **Submissions JSON** | `https://data.sec.gov/submissions/CIK0001795351.json` |

---

## 6. Data Architecture for LLM Training Dataset

### Goal: Pair prospectus text with structured reference data

```
Dataset Record:
├── fund_identity
│   ├── cik, series_id, class_id, ticker, cusip
│   ├── fund_name, trust_name
│   └── file_number
├── prospectus_document
│   ├── accession_number, filing_date, form_type
│   ├── document_url
│   └── document_text (HTML/XML → plain text)
├── structured_reference_data (from XBRL risk/return)
│   ├── investment_objective
│   ├── expense_ratio, management_fee, 12b1_fee
│   ├── performance_1yr, performance_5yr, performance_10yr
│   ├── minimum_investment
│   └── risk_narratives
└── supplements_amendments[]
    ├── accession_number, filing_date, form_type
    └── document_text
```

This allows training an LLM to:

1. **Extract** structured fields (expense ratio, objective, fees) from raw prospectus text
2. **Validate** extracted data against the XBRL ground truth
3. **Handle** amendments/supplements that modify the base prospectus
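
The record layout above could be modeled roughly as follows. This is a sketch under stated assumptions, not a fixed schema: the class names (`FundIdentity`, `FilingDocument`, `DatasetRecord`) and the `matches_ground_truth` helper are hypothetical, and the tolerance used for the validation step is an arbitrary illustrative choice. The `document_url` method follows the URL pattern from section 4.3.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FundIdentity:
    cik: str
    series_id: str
    class_id: str
    ticker: Optional[str] = None
    cusip: Optional[str] = None
    fund_name: str = ""
    trust_name: str = ""
    file_number: str = ""


@dataclass
class FilingDocument:
    accession_number: str  # dashed form, XXXXXXXXXX-YY-ZZZZZZ
    filing_date: str
    form_type: str
    document_filename: str
    document_text: str = ""

    def document_url(self, cik):
        """Build the archive URL: CIK without padding, accession without dashes."""
        accession = self.accession_number.replace("-", "")
        return (
            "https://www.sec.gov/Archives/edgar/data/"
            f"{int(cik)}/{accession}/{self.document_filename}"
        )


@dataclass
class DatasetRecord:
    fund_identity: FundIdentity
    prospectus_document: FilingDocument
    # Keys such as "expense_ratio" or "investment_objective" from the XBRL data.
    structured_reference_data: dict = field(default_factory=dict)
    supplements_amendments: list = field(default_factory=list)


def matches_ground_truth(extracted, reference, tolerance=1e-4):
    """Compare an extracted numeric field with its XBRL reference value.

    The tolerance is an assumption chosen for illustration only.
    """
    return abs(extracted - reference) <= tolerance
```

Storing the dashed accession number and deriving the URL on demand keeps one canonical identifier per filing, so the same record can be joined against both the Submissions API output and the archive path.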