Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures, the dataset for the thesis 'Magical RDF Triples and
how to synthetize them'.
- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
all books per trust), samples (per-fund segmentation, marker + plain
serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
212 lines
9.2 KiB
Markdown
# SEC Fund Prospectus & Reference Data Research

## 1. Overview: Where Are Fund Prospectuses at the SEC?

All US-registered investment companies (mutual funds, ETFs) must file their prospectuses and amendments with the SEC via the **EDGAR** (Electronic Data Gathering, Analysis, and Retrieval) system. These filings are publicly available at no cost.

### Relevant SEC Filing Form Types

| Form Type | Description | Content |
|-----------|-------------|---------|
| **N-1A** | Initial registration statement for open-end management investment companies (mutual funds) | Full prospectus + SAI (Statement of Additional Information) |
| **485BPOS** | Post-effective amendment under Rule 485(b) that becomes effective upon filing | Updated prospectus (most common annual update) |
| **485APOS** | Post-effective amendment under Rule 485(a), filed ahead of its effective date | Prospectus amendment requiring SEC review |
| **497** | Definitive materials filed under Securities Act Rule 497 | Prospectus supplements, SAI updates, sticker supplements |
| **497K** | Summary prospectus for mutual funds | Condensed key-facts prospectus (3-4 pages per fund/class) |
| **N-CSR** | Certified shareholder report | Annual/semi-annual report with financial statements |
| **NPORT-P** | Monthly portfolio holdings (public quarterly) | Detailed portfolio positions |

### Key Insight for Your Use Case
- **485BPOS** filings contain the **full statutory prospectus** (the main legal document)
- **497K** filings contain the **summary prospectus** (condensed, standardized format ideal for LLM training)
- **497** filings contain **supplements and amendments** that modify the prospectus

---

## 2. Where Is the Reference Data?

Reference data for funds and their share classes comes from **three complementary sources**:

### Source A: Investment Company Series and Class Information (CSV/XML)
- **URL**: `https://www.sec.gov/data-research/sec-markets-data/investment-company-series-class-information`
- **Download**: `https://www.sec.gov/files/investment/data/other/investment-company-series-class-information/investment-company-series-class-2025.csv`
- **Updated**: Monthly (latest: June 2025)
- **Contains**:
  - CIK (Central Index Key): the umbrella fund company identifier
  - File Number (811-XXXXX): Investment Company Act registration number
  - Series ID (S followed by digits): unique identifier per fund series (e.g., "Vanguard 500 Index Fund")
  - Class ID (C followed by digits): unique identifier per share class
  - Ticker Symbol
  - Fund/Class name
  - Status (active/inactive)

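The series and class file is flat, with one row per share class, so a first step is usually to regroup it into the trust, series, class hierarchy. The sketch below does this on an inline sample. The column names (`CIK Number`, `Series ID`, `Class ID`, `Class Ticker`) are assumptions and should be checked against the header row of the downloaded CSV; the sample values are the ones listed for the Vanguard 500 Index Fund later in this document.

```python
import csv
import io
from collections import defaultdict

# Assumed header names; verify them against the real file before use.
SAMPLE_CSV = """CIK Number,Entity Name,Series ID,Series Name,Class ID,Class Name,Class Ticker
36405,Vanguard Index Funds,S000002839,Vanguard 500 Index Fund,C000007773,Investor Shares,VFINX
36405,Vanguard Index Funds,S000002839,Vanguard 500 Index Fund,C000007774,Admiral Shares,VFIAX
"""


def group_classes(csv_text):
    """Return {cik: {series_id: [(class_id, ticker), ...]}} from CSV text."""
    tree = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Zero-pad the CIK so it matches the 10-digit form used in API URLs.
        cik = row["CIK Number"].strip().zfill(10)
        tree[cik][row["Series ID"]].append((row["Class ID"], row["Class Ticker"]))
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {cik: dict(series) for cik, series in tree.items()}


hierarchy = group_classes(SAMPLE_CSV)
print(hierarchy["0000036405"]["S000002839"])
```

Keying on the zero-padded CIK keeps the lookup compatible with the Submissions API URL format described in this document.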
### Source B: Mutual Fund Prospectus Risk/Return Summary Data Sets (XBRL)
- **URL**: `https://www.sec.gov/data-research/sec-markets-data/mutual-fund-prospectus-riskreturn-summary-data-sets`
- **Download**: Quarterly ZIP files, e.g., `https://www.sec.gov/files/dera/data/mutual-fund-prospectus-risk/return-summary-data-sets/2025q2_rr1.zip`
- **Coverage**: December 2010 to present (quarterly updates)
- **Contains structured XBRL data extracted from prospectuses**:
  - Expense ratios (management fees, 12b-1 fees, total annual operating expenses)
  - Fee tables and expense examples
  - Performance bar charts and tables (1yr, 5yr, 10yr returns)
  - Risk/return narratives
  - Investment objectives
  - Principal strategies and risks

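Since the risk/return data arrives as quarterly ZIP archives, a loader only needs to open the archive and parse its delimited members. The following is a minimal sketch that assumes the members are tab-delimited text files with a header row. The member name `num.tsv`, the column names, and the values in the demo archive are made up for illustration and are not taken from a real download.

```python
import csv
import io
import zipfile


def read_tsv_member(zip_bytes, member):
    """Read one tab-delimited member of a ZIP archive into a list of dicts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text, delimiter="\t"))


def build_demo_zip():
    """Build a tiny in-memory archive so the sketch runs without a download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Hypothetical member name, columns, and values (illustration only).
        archive.writestr(
            "num.tsv",
            "series\ttag\tvalue\nS000002839\tExpensesOverAssets\t0.0004\n",
        )
    return buffer.getvalue()


rows = read_tsv_member(build_demo_zip(), "num.tsv")
print(rows[0]["tag"], rows[0]["value"])
```

For a real archive, `zipfile.ZipFile(...).namelist()` shows which members are actually present before any of them is parsed.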
### Source C: EDGAR Submissions API (JSON)
- **URL**: `https://data.sec.gov/submissions/CIK{cik_padded}.json`
- **Contains**: Full filing history per entity, including metadata (dates, form types, accession numbers, primary document URLs)

---

## 3. Unique Identifiers for Each Share Class

| Identifier | Format | Source | Scope |
|-----------|--------|--------|-------|
| **CIK** | 10-digit number, zero-padded (e.g., 0000036405) | SEC EDGAR | Umbrella fund entity (investment company trust) |
| **File Number** | 811-XXXXX | SEC | Investment Company Act registration |
| **Series ID** | S + 9 digits (e.g., S000002839) | SEC EDGAR | Individual fund within the trust |
| **Class ID** | C + 9 digits (e.g., C000007773) | SEC EDGAR | Individual share class |
| **Ticker** | 1-5 chars (e.g., VFIAX) | Exchange | Share class trading identifier |
| **CUSIP** | 9-char alphanumeric | CUSIP Global Services | Security-level (per share class) |
| **ISIN** | 12-char (US + CUSIP + check digit) | ANNA | International identifier (derived from CUSIP for US) |
| **Accession Number** | XXXXXXXXXX-YY-ZZZZZZ | SEC EDGAR | Unique per filing |

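The formats in this table lend themselves to cheap sanity checks before any record enters the dataset. The sketch below validates SEC identifiers by pattern and derives an ISIN from a CUSIP using the standard ISIN check digit (letters expanded to numbers, then the Luhn algorithm). The function names are illustrative. The CUSIP in the usage line, 037833100, belongs to Apple Inc. common stock and is used only because its ISIN is widely published.

```python
import re

# Patterns mirror the formats listed in the identifier table.
PATTERNS = {
    "series_id": re.compile(r"S\d{9}"),
    "class_id": re.compile(r"C\d{9}"),
    "accession": re.compile(r"\d{10}-\d{2}-\d{6}"),
}


def is_valid(kind, value):
    """Return True when value matches the whole pattern for its kind."""
    return PATTERNS[kind].fullmatch(value) is not None


def isin_from_cusip(cusip, country="US"):
    """Build an ISIN as country code + CUSIP + Luhn check digit.

    Assumes the CUSIP contains only digits and letters A-Z.
    """
    body = country + cusip.upper()
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in body)
    total = 0
    for position, ch in enumerate(reversed(digits)):
        value = int(ch)
        if position % 2 == 0:  # the rightmost digit is doubled first
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return body + str((10 - total % 10) % 10)


print(is_valid("series_id", "S000002839"))
print(isin_from_cusip("037833100"))
```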
### Hierarchy
```
Investment Company Trust (CIK), e.g., "Vanguard Index Funds"
  └── Fund Series (Series ID), e.g., "Vanguard 500 Index Fund"
        ├── Share Class (Class ID), e.g., "Investor Shares" (VFINX)
        ├── Share Class (Class ID), e.g., "Admiral Shares" (VFIAX)
        └── Share Class (Class ID), e.g., "ETF Shares" (VOO)
```

---

## 4. APIs and Web Endpoints for Downloading Data

### 4.1 EDGAR Submissions API (FREE, no authentication)
```
GET https://data.sec.gov/submissions/CIK{10-digit-cik}.json
Header: User-Agent: "YourApp your@email.com"
```
Returns: JSON with entity metadata + all filing history (form type, dates, accession numbers, primary document filenames).

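A small client for this endpoint could look like the sketch below. It assumes the response keeps recent filings under `filings.recent` as parallel arrays (`form`, `filingDate`, `accessionNumber`, `primaryDocument`); that layout should be confirmed against a live response. The demo payload is made up so the example runs offline, and `fetch_submissions` is defined but never called here.

```python
import json
import urllib.request


def submissions_url(cik):
    """Build the Submissions API URL with the CIK zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def fetch_submissions(cik, user_agent):
    """Download and decode the submissions JSON (performs a network call)."""
    request = urllib.request.Request(
        submissions_url(cik), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def recent_filings(submissions, form_types):
    """Turn the column-oriented filings.recent block into filtered row dicts."""
    recent = submissions["filings"]["recent"]
    keys = ("form", "filingDate", "accessionNumber", "primaryDocument")
    rows = []
    for values in zip(*(recent[key] for key in keys)):
        record = dict(zip(keys, values))
        if record["form"] in form_types:
            rows.append(record)
    return rows


# Made-up payload with placeholder accession numbers, for illustration only.
demo = {"filings": {"recent": {
    "form": ["497K", "N-CSR", "485BPOS"],
    "filingDate": ["2025-01-03", "2025-01-02", "2025-01-01"],
    "accessionNumber": ["0000000000-25-000003", "0000000000-25-000002",
                        "0000000000-25-000001"],
    "primaryDocument": ["a.htm", "b.htm", "c.htm"],
}}}

print(submissions_url(36405))
print([row["form"] for row in recent_filings(demo, {"497K", "485BPOS"})])
```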
### 4.2 EDGAR Full-Text Search API (FREE, no authentication)
```
GET https://efts.sec.gov/LATEST/search-index?q={query}&forms={form_types}&dateRange=custom&startdt={YYYY-MM-DD}&enddt={YYYY-MM-DD}
Header: User-Agent: "YourApp your@email.com"
```
Returns: JSON with matching filings, metadata, and file references. Supports Boolean operators (AND, OR, NOT, NEAR) and wildcard (*).

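Queries with quotes, spaces, or Boolean operators need to be percent-encoded before they go into this URL. The sketch below builds the request URL with the standard library. Joining several form types with commas is an assumption about how the `forms` parameter is interpreted, and the helper name `search_url` is illustrative.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_ENDPOINT = "https://efts.sec.gov/LATEST/search-index"


def search_url(query, forms, start, end):
    """Build a full-text search URL for a custom date range."""
    params = {
        "q": query,
        "forms": ",".join(forms),  # assumed separator for multiple form types
        "dateRange": "custom",
        "startdt": start,
        "enddt": end,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = search_url('"summary prospectus" AND index', ["485BPOS", "497K"],
                 "2025-01-01", "2025-06-30")
# Decode the query string again to confirm the parameters survived encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```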
### 4.3 Filing Document Download (FREE, no authentication)
```
GET https://www.sec.gov/Archives/edgar/data/{cik}/{accession-number-no-dashes}/{document-filename}
Header: User-Agent: "YourApp your@email.com"
```
Returns: The actual filing document (HTML, XML, or PDF).

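The document URL can be assembled from three fields that the Submissions API already returns. A minimal sketch, assuming the CIK is written without leading zeros in the path (as in the Vanguard example URL in this document) and that the accession number arrives in its dashed form:

```python
def document_url(cik, accession_number, filename):
    """Build the Archives URL for one document of a filing."""
    folder = accession_number.replace("-", "")  # path uses the undashed form
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{int(cik)}/{folder}/{filename}"
    )


# Accession number re-dashed from the 497K URL listed for the Vanguard example.
print(document_url("0000036405", "0001683863-25-004160", "f41649d1.htm"))
```

Passing the CIK through `int()` accepts both the padded and unpadded spellings, so the same value can feed this helper and the Submissions API URL.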
### 4.4 XBRL APIs (FREE, no authentication)
```
GET https://data.sec.gov/api/xbrl/companyfacts/CIK{10-digit-cik}.json
GET https://data.sec.gov/api/xbrl/companyconcept/CIK{10-digit-cik}/{taxonomy}/{concept}.json
```

### 4.5 Rate Limits
- **10 requests per second** maximum
- Must include a `User-Agent` header with company name and email
- No API key required

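One simple way to stay under the limit is to space requests at least 0.1 seconds apart. The sketch below is a minimal single-threaded throttle, not an official client. The class name `Throttle` is illustrative, and the clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time

# Every request should carry this header; replace the placeholder contact.
HEADERS = {"User-Agent": "YourApp your@email.com"}


class Throttle:
    """Enforce a minimum interval between calls (single-threaded use)."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now += delay
        self._next_allowed = now + self.min_interval


throttle = Throttle(max_per_second=10)
for _ in range(3):
    throttle.wait()  # call this immediately before each HTTP request
```

A multi-threaded downloader would need a lock around `wait`, or a shared token bucket, since this version keeps its state unprotected.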
---

## 5. Five Concrete Fund Examples

### Example 1: Vanguard 500 Index Fund
| Field | Value |
|-------|-------|
| **Trust Name** | Vanguard Index Funds |
| **CIK** | 0000036405 |
| **File Number** | 811-02652 |
| **Series ID** | S000002839 |
| **Fund Name** | Vanguard 500 Index Fund |
| **Share Classes** | |
| Investor Shares | Class ID: C000007773, Ticker: VFINX |
| Admiral Shares | Class ID: C000007774, Ticker: VFIAX |
| ETF Shares | Class ID: C000092055, Ticker: VOO |
| Institutional Select | Class ID: C000170274, Ticker: VFFSX |
| **Latest 497K (Summary Prospectus)** | `https://www.sec.gov/Archives/edgar/data/36405/000168386325004160/f41649d1.htm` |
| **Submissions JSON** | `https://data.sec.gov/submissions/CIK0000036405.json` |

### Example 2: Fidelity Contrafund
| Field | Value |
|-------|-------|
| **Trust Name** | Fidelity Contrafund |
| **CIK** | 0000024238 |
| **File Number** | 811-01400 |
| **Series ID** | S000006037 |
| **Fund Name** | Fidelity Contrafund |
| **Share Classes** | |
| Fidelity Contrafund | Class ID: C000016601, Ticker: FCNTX |
| Class K | Class ID: C000064233, Ticker: FCNKX |
| K6 Class | Class ID: C000182865, Ticker: FLCNX |
| **Submissions JSON** | `https://data.sec.gov/submissions/CIK0000024238.json` |

### Example 3: iShares Core S&P 500 ETF (BlackRock)
| Field | Value |
|-------|-------|
| **Trust Name** | iShares Trust |
| **CIK** | 0001100663 |
| **File Number** | 811-09729 |
| **Series ID** | S000002838 |
| **Fund Name** | iShares Core S&P 500 ETF |
| **Share Classes** | |
| ETF Shares | Ticker: IVV |
| **Submissions JSON** | `https://data.sec.gov/submissions/CIK0001100663.json` |

### Example 4: Columbia Funds Series Trust I
| Field | Value |
|-------|-------|
| **Trust Name** | Columbia Funds Series Trust I |
| **CIK** | 0000773757 |
| **File Number** | 811-04367 |
| **Submissions JSON** | `https://data.sec.gov/submissions/CIK0000773757.json` |

### Example 5: T. Rowe Price Exchange-Traded Funds
| Field | Value |
|-------|-------|
| **Trust Name** | T. Rowe Price Exchange-Traded Funds, Inc. |
| **CIK** | 0001795351 |
| **File Number** | 811-23494 |
| **Submissions JSON** | `https://data.sec.gov/submissions/CIK0001795351.json` |

---

## 6. Data Architecture for LLM Training Dataset

### Goal: Pair prospectus text with structured reference data

```
Dataset Record:
├── fund_identity
│   ├── cik, series_id, class_id, ticker, cusip
│   ├── fund_name, trust_name
│   └── file_number
├── prospectus_document
│   ├── accession_number, filing_date, form_type
│   ├── document_url
│   └── document_text (HTML/XML → plain text)
├── structured_reference_data (from XBRL risk/return)
│   ├── investment_objective
│   ├── expense_ratio, management_fee, 12b1_fee
│   ├── performance_1yr, performance_5yr, performance_10yr
│   ├── minimum_investment
│   └── risk_narratives
└── supplements_amendments[]
    ├── accession_number, filing_date, form_type
    └── document_text
```

This allows training an LLM to:
1. **Extract** structured fields (expense ratio, objective, fees) from raw prospectus text
2. **Validate** extracted data against the XBRL ground truth
3. **Handle** amendments/supplements that modify the base prospectus
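The record layout and the validation step can be prototyped in a few lines. The sketch below is one possible shape rather than a fixed schema: `FundRecord` carries a trimmed subset of the fields from the tree above, and `validate` compares model output with the XBRL-derived reference, using a numeric tolerance for fees and a case-insensitive match for text. All names and the sample values are illustrative.

```python
from dataclasses import dataclass, field
from math import isclose


@dataclass
class FundRecord:
    """Trimmed-down dataset record: identity, text, and reference fields."""
    class_id: str
    ticker: str
    document_text: str
    reference: dict = field(default_factory=dict)  # XBRL-derived ground truth


def validate(extracted, reference, rel_tol=1e-6):
    """Return {field_name: bool} for every field present in the reference."""
    report = {}
    for name, expected in reference.items():
        got = extracted.get(name)
        if isinstance(expected, (int, float)) and isinstance(got, (int, float)):
            report[name] = isclose(got, expected, rel_tol=rel_tol)
        else:
            # Missing values fail; text is compared ignoring case and padding.
            report[name] = (
                got is not None
                and str(got).strip().lower() == str(expected).strip().lower()
            )
    return report


# Made-up values, for illustration only.
record = FundRecord(
    class_id="C000007774",
    ticker="VFIAX",
    document_text="(prospectus text)",
    reference={"expense_ratio": 0.0004,
               "investment_objective": "track the index",
               "management_fee": 0.0003},
)
model_output = {"expense_ratio": 0.0004,
                "investment_objective": " Track the index "}
report = validate(model_output, record.reference)
print(report)
```

Exact string comparison is a strict criterion for narrative fields such as objectives or risk text; a token-overlap or embedding-based score may be a better fit there.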