Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures, the dataset for the thesis 'Magical RDF Triples and
how to synthetize them'.
- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
all books per trust), samples (per-fund segmentation, marker + plain
serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SEC Fund Prospectus & Reference Data Research
1. Overview: Where Are Fund Prospectuses at the SEC?
All US-registered investment companies (mutual funds, ETFs) must file their prospectuses and amendments with the SEC via the EDGAR (Electronic Data Gathering, Analysis, and Retrieval) system. These filings are publicly available at no cost.
Relevant SEC Filing Form Types
| Form Type | Description | Content |
|---|---|---|
| N-1A | Initial registration statement for open-end management investment companies (mutual funds) | Full prospectus + SAI (Statement of Additional Information) |
| 485BPOS | Post-effective amendment that becomes effective upon filing | Updated prospectus (most common annual update) |
| 485APOS | Post-effective amendment filed under Rule 485(a), which becomes effective only after a waiting period | Prospectus amendment subject to SEC staff review |
| 497 | Definitive materials filed under Rule 497 of the Securities Act of 1933 | Prospectus supplements, SAI updates, sticker supplements |
| 497K | Summary prospectus for mutual funds | Condensed key-facts prospectus (3-4 pages per fund/class) |
| N-CSR | Certified shareholder report | Annual/semi-annual report with financial statements |
| NPORT-P | Monthly portfolio holdings (public quarterly) | Detailed portfolio positions |
Key Insight for Your Use Case
- 485BPOS filings contain the full statutory prospectus (the main legal document)
- 497K filings contain the summary prospectus (condensed, standardized format ideal for LLM training)
- 497 filings contain supplements and amendments that modify the prospectus
2. Where Is the Reference Data?
Reference data for funds and their share classes comes from three complementary sources:
Source A: Investment Company Series and Class Information (CSV/XML)
- URL: https://www.sec.gov/data-research/sec-markets-data/investment-company-series-class-information
- Download: https://www.sec.gov/files/investment/data/other/investment-company-series-class-information/investment-company-series-class-2025.csv
- Updated: Monthly (latest: June 2025)
- Contains:
  - CIK (Central Index Key): the umbrella fund company identifier
  - File Number (811-XXXXX): Investment Company Act registration number
  - Series ID (S followed by 9 digits, e.g., S000002839): unique identifier per fund series (e.g., "Vanguard 500 Index Fund")
  - Class ID (C followed by 9 digits, e.g., C000007773): unique identifier per share class
  - Ticker Symbol
  - Fund/Class name
  - Status (active/inactive)
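A minimal sketch of how the CSV rows could be regrouped into the trust, series, and class levels they describe. The column names (`CIK Number`, `Series ID`, `Class ID`, `Class Ticker`) are assumptions for illustration and should be checked against the header row of the downloaded file; the sample rows reuse identifiers quoted elsewhere in this document.

```python
import csv
import io
from collections import defaultdict

# ASSUMPTION: these header names are illustrative; verify them against the
# header row of the actual SEC CSV before use.
COL_CIK = "CIK Number"
COL_SERIES = "Series ID"
COL_CLASS = "Class ID"
COL_TICKER = "Class Ticker"


def group_by_trust(csv_text: str) -> dict:
    """Return {cik: {series_id: [(class_id, ticker), ...]}}."""
    tree = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        cik = row[COL_CIK].strip().zfill(10)  # normalise to 10 digits
        series_id = row[COL_SERIES].strip()
        tree[cik][series_id].append(
            (row[COL_CLASS].strip(), row[COL_TICKER].strip())
        )
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {cik: dict(series) for cik, series in tree.items()}


sample = (
    "CIK Number,Series ID,Class ID,Class Ticker\n"
    "36405,S000002839,C000007773,VFINX\n"
    "36405,S000002839,C000007774,VFIAX\n"
)
print(group_by_trust(sample))
```

Keying the outer level on the zero-padded CIK keeps the result directly usable with the submissions endpoint, which expects the 10-digit form.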
Source B: Mutual Fund Prospectus Risk/Return Summary Data Sets (XBRL)
- URL: https://www.sec.gov/data-research/sec-markets-data/mutual-fund-prospectus-riskreturn-summary-data-sets
- Download: Quarterly ZIP files, e.g., https://www.sec.gov/files/dera/data/mutual-fund-prospectus-risk/return-summary-data-sets/2025q2_rr1.zip
- Coverage: December 2010 to present (quarterly updates)
- Contains structured XBRL data extracted from prospectuses:
  - Expense ratios (management fees, 12b-1 fees, total annual operating expenses)
  - Fee tables and expense examples
  - Performance bar charts and tables (1yr, 5yr, 10yr returns)
  - Risk/return narratives
  - Investment objectives
  - Principal strategies and risks
Source C: EDGAR Submissions API (JSON)
- URL: https://data.sec.gov/submissions/CIK{cik_padded}.json
- Contains: Full filing history per entity, including metadata (dates, form types, accession numbers, primary document URLs)
3. Unique Identifiers for Each Share Class
| Identifier | Format | Source | Scope |
|---|---|---|---|
| CIK | 10-digit number (e.g., 0000036405) | SEC EDGAR | Umbrella fund entity (investment company trust) |
| File Number | 811-XXXXX | SEC | Investment Company Act registration |
| Series ID | S + 9 digits (e.g., S000002839) | SEC EDGAR | Individual fund within the trust |
| Class ID | C + 9 digits (e.g., C000007773) | SEC EDGAR | Individual share class |
| Ticker | 1-5 chars (e.g., VFIAX) | Exchange | Share class trading identifier |
| CUSIP | 9-char alphanumeric | CUSIP Global Services | Security-level (per share class) |
| ISIN | 12-char (US + CUSIP + check) | ANNA | International identifier (derived from CUSIP for US) |
| Accession Number | XXXXXXXXXX-YY-ZZZZZZ | SEC EDGAR | Unique per filing |
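The "US + CUSIP + check" rule for ISINs in the table can be made concrete with a short sketch. It assumes the standard ISIN check digit scheme (letters expanded to two-digit numbers, then a Luhn checksum); the function names are hypothetical and the input check is deliberately simple.

```python
def isin_check_digit(body: str) -> int:
    """Luhn check digit for an ISIN body (country code + 9-char identifier)."""
    # Letters expand to two digits: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in body.upper())
    total = 0
    # Walk right to left, doubling every other digit starting with the
    # rightmost one (the check digit itself is appended afterwards).
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return (10 - total % 10) % 10


def cusip_to_isin(cusip: str, country: str = "US") -> str:
    """Derive the ISIN for a 9-character CUSIP (hypothetical helper)."""
    if len(cusip) != 9 or not cusip.isalnum():
        raise ValueError("expected a 9-character alphanumeric CUSIP")
    body = country + cusip.upper()
    return body + str(isin_check_digit(body))


print(cusip_to_isin("037833100"))  # US0378331005
```

The example uses the widely published Apple Inc. CUSIP/ISIN pair as a sanity check; it is not a fund identifier from this dataset.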
Hierarchy
Investment Company Trust (CIK), e.g., "Vanguard Index Funds"
└── Fund Series (Series ID), e.g., "Vanguard 500 Index Fund"
    ├── Share Class (Class ID), e.g., "Investor Shares" (VFINX)
    ├── Share Class (Class ID), e.g., "Admiral Shares" (VFIAX)
    └── Share Class (Class ID), e.g., "ETF Shares" (VOO)
4. APIs and Web Endpoints for Downloading Data
4.1 EDGAR Submissions API (FREE, no authentication)
GET https://data.sec.gov/submissions/CIK{10-digit-cik}.json
Header: User-Agent: "YourApp your@email.com"
Returns: JSON with entity metadata + all filing history (form type, dates, accession numbers, primary document filenames).
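As a rough illustration of working with this endpoint, the sketch below builds the URL from a CIK and filters an already-downloaded payload by form type. It assumes the filing history sits under `filings.recent` as parallel arrays (`accessionNumber`, `filingDate`, `form`, `primaryDocument`); the sample payload is made up and no network call is made.

```python
from __future__ import annotations


def submissions_url(cik) -> str:
    """Zero-pad the CIK to 10 digits, as the endpoint template requires."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def recent_filings(payload: dict, forms: set[str]) -> list[dict]:
    """Turn the parallel arrays under filings.recent into one dict per filing."""
    recent = payload["filings"]["recent"]  # ASSUMPTION: columnar layout
    rows = zip(
        recent["accessionNumber"],
        recent["filingDate"],
        recent["form"],
        recent["primaryDocument"],
    )
    return [
        {"accession": acc, "date": date, "form": form, "document": doc}
        for acc, date, form, doc in rows
        if form in forms
    ]


# Made-up payload for illustration only; these are not real filings.
sample_payload = {
    "filings": {
        "recent": {
            "accessionNumber": ["0000000000-25-000001", "0000000000-25-000002"],
            "filingDate": ["2025-01-02", "2025-01-03"],
            "form": ["497K", "N-CSR"],
            "primaryDocument": ["summary.htm", "report.htm"],
        }
    }
}

print(submissions_url(36405))
print(recent_filings(sample_payload, {"497K", "485BPOS"}))
```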
4.2 EDGAR Full-Text Search API (FREE, no authentication)
GET https://efts.sec.gov/LATEST/search-index?q={query}&forms={form_types}&dateRange=custom&startdt={YYYY-MM-DD}&enddt={YYYY-MM-DD}
Header: User-Agent: "YourApp your@email.com"
Returns: JSON with matching filings, metadata, and file references. Supports Boolean operators (AND, OR, NOT, NEAR) and wildcard (*).
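The query string above is easy to get wrong by hand, so a small helper that URL-encodes the parameters may be useful. This is only a sketch: the parameter names come from the template above, while joining several form types with commas is an assumption about how `forms` is interpreted.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://efts.sec.gov/LATEST/search-index"


def search_url(query: str, forms: list, start: str, end: str) -> str:
    """Build a full-text search URL; dates are YYYY-MM-DD strings."""
    params = {
        "q": query,
        "forms": ",".join(forms),  # ASSUMPTION: comma-separated form list
        "dateRange": "custom",
        "startdt": start,
        "enddt": end,
    }
    # urlencode escapes spaces, commas and quotes in the values.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


print(search_url("expense ratio", ["485BPOS", "497K"], "2025-01-01", "2025-06-30"))
```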
4.3 Filing Document Download (FREE, no authentication)
GET https://www.sec.gov/Archives/edgar/data/{cik}/{accession-number-no-dashes}/{document-filename}
Header: User-Agent: "YourApp your@email.com"
Returns: The actual filing document (HTML, XML, or PDF).
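A minimal sketch of assembling this URL from its parts. It assumes the CIK appears without leading zeros in the path, as in the Vanguard 497K link quoted in this document, and that the accession number arrives in its dashed XXXXXXXXXX-YY-ZZZZZZ form; the function name is hypothetical.

```python
import re

ARCHIVE_ROOT = "https://www.sec.gov/Archives/edgar/data"


def filing_document_url(cik, accession: str, filename: str) -> str:
    """Build the archive URL for one document inside a filing."""
    if not re.fullmatch(r"\d{10}-\d{2}-\d{6}", accession):
        raise ValueError(f"unexpected accession number format: {accession!r}")
    # int() drops the zero padding; the dashes are removed for the folder name.
    return f"{ARCHIVE_ROOT}/{int(cik)}/{accession.replace('-', '')}/{filename}"


print(filing_document_url("0000036405", "0001683863-25-004160", "f41649d1.htm"))
```

With the inputs shown, the result matches the Vanguard 500 Index Fund summary prospectus link listed among the fund examples.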
4.4 XBRL APIs (FREE, no authentication)
GET https://data.sec.gov/api/xbrl/companyfacts/CIK{10-digit-cik}.json
GET https://data.sec.gov/api/xbrl/companyconcept/CIK{cik}/{taxonomy}/{concept}.json
4.5 Rate Limits
- 10 requests per second maximum
- Must include a User-Agent header with company name and email
- No API key required
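One simple way to stay under the request limit is to space calls at a fixed minimum interval. The sketch below is one possible approach, not an SEC-provided client: the `Throttle` class is hypothetical, and the clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time

# Header value taken from the examples above; replace with real contact details.
HEADERS = {"User-Agent": "YourApp your@email.com"}


class Throttle:
    """Block so that successive wait() calls are at least 1/rate seconds apart."""

    def __init__(self, max_per_second=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> None:
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()  # re-read in case sleep overshot
        self._next_allowed = now + self.min_interval


throttle = Throttle(max_per_second=10)
for _ in range(3):
    throttle.wait()  # call this immediately before each HTTP request
```

Calling `wait()` before every request, and passing `HEADERS` with it, covers both requirements listed above.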
5. Five Concrete Fund Examples
Example 1: Vanguard 500 Index Fund
| Field | Value |
|---|---|
| Trust Name | Vanguard Index Funds |
| CIK | 0000036405 |
| File Number | 811-02652 |
| Series ID | S000002839 |
| Fund Name | Vanguard 500 Index Fund |
| Share Classes | |
| Investor Shares | Class ID: C000007773, Ticker: VFINX |
| Admiral Shares | Class ID: C000007774, Ticker: VFIAX |
| ETF Shares | Class ID: C000092055, Ticker: VOO |
| Institutional Select | Class ID: C000170274, Ticker: VFFSX |
| Latest 497K (Summary Prospectus) | https://www.sec.gov/Archives/edgar/data/36405/000168386325004160/f41649d1.htm |
| Submissions JSON | https://data.sec.gov/submissions/CIK0000036405.json |
Example 2: Fidelity Contrafund
| Field | Value |
|---|---|
| Trust Name | Fidelity Contrafund |
| CIK | 0000024238 |
| File Number | 811-01400 |
| Series ID | S000006037 |
| Fund Name | Fidelity Contrafund |
| Share Classes | |
| Fidelity Contrafund | Class ID: C000016601, Ticker: FCNTX |
| Class K | Class ID: C000064233, Ticker: FCNKX |
| K6 Class | Class ID: C000182865, Ticker: FLCNX |
| Submissions JSON | https://data.sec.gov/submissions/CIK0000024238.json |
Example 3: iShares Core S&P 500 ETF (BlackRock)
| Field | Value |
|---|---|
| Trust Name | iShares Trust |
| CIK | 0001100663 |
| File Number | 811-09729 |
| Series ID | S000002838 |
| Fund Name | iShares Core S&P 500 ETF |
| Share Classes | |
| ETF Shares | Ticker: IVV |
| Submissions JSON | https://data.sec.gov/submissions/CIK0001100663.json |
Example 4: Columbia Funds Series Trust I
| Field | Value |
|---|---|
| Trust Name | Columbia Funds Series Trust I |
| CIK | 0000773757 |
| File Number | 811-04367 |
| Submissions JSON | https://data.sec.gov/submissions/CIK0000773757.json |
Example 5: T. Rowe Price Exchange-Traded Funds
| Field | Value |
|---|---|
| Trust Name | T. Rowe Price Exchange-Traded Funds, Inc. |
| CIK | 0001795351 |
| File Number | 811-23494 |
| Submissions JSON | https://data.sec.gov/submissions/CIK0001795351.json |
6. Data Architecture for LLM Training Dataset
Goal: Pair prospectus text with structured reference data
Dataset Record:
├── fund_identity
│ ├── cik, series_id, class_id, ticker, cusip
│ ├── fund_name, trust_name
│ └── file_number
├── prospectus_document
│ ├── accession_number, filing_date, form_type
│ ├── document_url
│ └── document_text (HTML/XML → plain text)
├── structured_reference_data (from XBRL risk/return)
│ ├── investment_objective
│ ├── expense_ratio, management_fee, 12b1_fee
│ ├── performance_1yr, performance_5yr, performance_10yr
│ ├── minimum_investment
│ └── risk_narratives
└── supplements_amendments[]
    ├── accession_number, filing_date, form_type
    └── document_text
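The record layout above could be expressed as nested dataclasses, roughly as follows. The class names are hypothetical, `fee_12b1` stands in for `12b1_fee` because a Python identifier cannot start with a digit, and fields whose values are not given in this document are left unset rather than guessed.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class FundIdentity:
    cik: str
    series_id: str
    class_id: str
    ticker: str
    fund_name: str
    trust_name: str
    file_number: str
    cusip: Optional[str] = None  # may be unavailable for some records


@dataclass
class FilingDocument:
    accession_number: str
    form_type: str
    document_text: str  # plain text converted from HTML/XML
    filing_date: Optional[str] = None
    document_url: Optional[str] = None


@dataclass
class ReferenceData:
    investment_objective: Optional[str] = None
    expense_ratio: Optional[float] = None
    management_fee: Optional[float] = None
    fee_12b1: Optional[float] = None  # "12b1_fee" in the layout above
    performance_1yr: Optional[float] = None
    performance_5yr: Optional[float] = None
    performance_10yr: Optional[float] = None
    minimum_investment: Optional[float] = None
    risk_narratives: list[str] = field(default_factory=list)


@dataclass
class DatasetRecord:
    fund_identity: FundIdentity
    prospectus_document: FilingDocument
    structured_reference_data: ReferenceData = field(default_factory=ReferenceData)
    supplements_amendments: list[FilingDocument] = field(default_factory=list)


record = DatasetRecord(
    fund_identity=FundIdentity(
        cik="0000036405",
        series_id="S000002839",
        class_id="C000007774",
        ticker="VFIAX",
        fund_name="Vanguard 500 Index Fund",
        trust_name="Vanguard Index Funds",
        file_number="811-02652",
    ),
    prospectus_document=FilingDocument(
        accession_number="0001683863-25-004160",
        form_type="497K",
        document_text="",  # placeholder: text would be filled in after download
    ),
)
print(asdict(record)["fund_identity"]["ticker"])
```

`asdict` yields a plain nested dict, which serializes directly to one JSON line per record.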
This allows training an LLM to:
- Extract structured fields (expense ratio, objective, fees) from raw prospectus text
- Validate extracted data against the XBRL ground truth
- Handle amendments/supplements that modify the base prospectus
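The validation step in the list above might look like the following sketch, which compares model-extracted fields against the XBRL-derived values and reports mismatches. The function names, tolerances, and sample values are illustrative assumptions, not real fund data.

```python
import math


def _norm(value):
    """Collapse whitespace and case so trivial text differences are ignored."""
    if isinstance(value, str):
        return " ".join(value.split()).casefold()
    return value


def compare_fields(extracted: dict, gold: dict, rel_tol=1e-6, abs_tol=1e-9) -> dict:
    """Return {field: (extracted_value, gold_value)} for every mismatch."""
    mismatches = {}
    for name, gold_value in gold.items():
        value = extracted.get(name)  # a missing field counts as a mismatch
        both_numeric = isinstance(gold_value, (int, float)) and isinstance(
            value, (int, float)
        )
        if both_numeric:
            ok = math.isclose(value, gold_value, rel_tol=rel_tol, abs_tol=abs_tol)
        else:
            ok = _norm(value) == _norm(gold_value)
        if not ok:
            mismatches[name] = (value, gold_value)
    return mismatches


# Illustrative values only.
gold = {"expense_ratio": 0.01, "investment_objective": "Track the index."}
good = {"expense_ratio": 0.01, "investment_objective": "track  the index."}
bad = {"expense_ratio": 0.1}

print(compare_fields(good, gold))
print(compare_fields(bad, gold))
```

Numeric fields are compared with a tolerance because fee values parsed from prose and from XBRL may differ in the last floating-point digits even when they agree.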