ESMA Fund Data: Registers, APIs, and Reference Data Fields
1. Overview: ESMA's Fund Data Ecosystem
ESMA (European Securities and Markets Authority) maintains six distinct data systems relevant to fund data. Unlike the US SEC, which centralizes prospectus filing and structured extraction, ESMA spreads fund data across regulatory registers that focus on authorization, cross-border notification, and instrument identification rather than on prospectus content.
| System | Content | Access | Fund Relevance |
|---|---|---|---|
| Fund Register (AIF/EuSEF/EuVECA) | Authorized AIF funds & managers | Solr API (JSON/XML) | Direct - fund-level identity |
| Cross-border Marketing Register (UCITS + AIF) | Fund passporting notifications | Solr API (JSON/XML) | Direct - UCITS & AIF cross-border data |
| Entity Register (MiFID/UCITS/AIFMD) | Management companies, AIFMs | Solr API (JSON/XML) | Manager-level data |
| FIRDS (Financial Instruments Reference Data) | MiFIR instrument reference data | XML bulk files + API | Instrument-level (ISIN, CFI) |
| Prospectus Register (Prospectus Regulation III) | Approved EU prospectuses + supplements | Solr API (JSON/XML) | Securities prospectuses (not fund KIIDs) |
| Money Market Fund Register | MMF authorizations | Solr API (JSON/XML) | MMF-specific |
2. Fund Register: AIF/EuSEF/EuVECA Funds
API Endpoint
https://registers.esma.europa.eu/solr/esma_registers_funds/select?q=...&wt=json
Available Fields
| Field Name | Type | Description | openfunds Equivalent |
|---|---|---|---|
| funds_national_name | text | Fund name (national identifier name) | OFST010110 Legal Fund Name |
| funds_lei | text | LEI of the fund | OFST010030 LEI Of Fund |
| funds_legal_framework_name | text | Legal framework: AIF, EuSEF, EuVECA | OFST160100 Legal Form |
| funds_other_legal_framework_name | text | Additional legal framework info | — |
| funds_status_code_name | text | Fund authorization status | — |
| funds_mgmnt_structure_code_name | text | Management structure type | OFST010420 Open-ended/Closed-ended |
| funds_domicile_cou_code_name | text | Fund domicile country | OFST010010 Fund Domicile |
| funds_mgmnt_status_code_name | text | Management status | — |
| funds_manager_nat_name | text | Management company name | OFST001020 ManCo |
| funds_manager_lei | text | Manager LEI | — |
| funds_manager_cou_code_name | text | Manager country | — |
| funds_manager_legal_framework_name | text | Manager legal framework | — |
| funds_host_country_code_name | text | Host member states (marketing countries) | OFST6000XX Country registrations |
| funds_fund_mrkt_status_code_name | text | Marketing status per country | — |
| funds_notification_event1_date | date | First notification date | — |
| funds_notification_event2_date | date | Second notification date | — |
| funds_notif_legal_framework_name | text | Notification legal framework | — |
| funds_ca_cou_code_name | text | Competent authority country | OFST010060 Supervisory Authority |
Example API Call (all AIF funds, JSON format)
curl "https://registers.esma.europa.eu/solr/esma_registers_funds/select?q=type_s:*&fq=funds_legal_framework_name:%22AIF%22&fl=funds_national_name,funds_lei,funds_domicile_cou_code_name,funds_manager_nat_name,funds_manager_lei&rows=100&wt=json&indent=true"
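The same request can be assembled in code, which avoids hand-encoding the quotes in the filter. The following is a minimal sketch using only the Python standard library; the helper name `build_fund_register_url` is hypothetical, and the parameter values simply mirror the curl example above.

```python
from urllib.parse import urlencode

# Endpoint as given in this section.
FUND_REGISTER = "https://registers.esma.europa.eu/solr/esma_registers_funds/select"

def build_fund_register_url(framework, fields, rows=100):
    """Return a select URL filtered by legal framework (hypothetical helper)."""
    params = {
        "q": "type_s:*",
        "fq": f'funds_legal_framework_name:"{framework}"',
        "fl": ",".join(fields),  # comma-separated list of fields to return
        "rows": rows,
        "wt": "json",
    }
    # urlencode percent-encodes quotes, colons and commas.
    return f"{FUND_REGISTER}?{urlencode(params)}"

url = build_fund_register_url("AIF", ["funds_national_name", "funds_lei"])
print(url)
```

Passing the result to any HTTP client should be equivalent to the curl call, minus the `indent` parameter, which only affects readability of the response.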
3. Cross-border Marketing Register (UCITS + AIF)
API Endpoint
https://registers.esma.europa.eu/solr/esma_registers_funds_cbdif/select?q=...&wt=json
Same field structure as the Fund Register above, but it also covers UCITS funds. Apart from the Money Market Fund Register, whose entries can be UCITS money market funds, this is the only ESMA register with UCITS fund-level data.
Filter by fund type
# UCITS funds
...&fq=funds_legal_framework_name:"UCITS"
# AIF funds
...&fq=funds_legal_framework_name:"AIF"
# ELTIF funds
...&fq=funds_legal_framework_name:"ELTIF"
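A filtered query can match more records than one response returns, so the results usually have to be paged with Solr's standard `start` and `rows` parameters. The sketch below assumes the standard Solr JSON layout (`response.numFound`, `response.docs`); `fetch_page` is a hypothetical caller-supplied function that performs the HTTP request, and the stand-in used here returns made-up records so the loop can be run offline.

```python
def iter_docs(fetch_page, rows=100):
    """Yield every document by stepping Solr's `start` offset.

    `fetch_page(start, rows)` is a caller-supplied function (hypothetical)
    that performs the request and returns the decoded JSON body.
    """
    start = 0
    while True:
        body = fetch_page(start, rows)["response"]
        docs = body["docs"]
        yield from docs
        start += len(docs)
        # Stop on an empty page or once all reported matches were seen.
        if not docs or start >= body["numFound"]:
            break

# Offline stand-in for the register: five made-up records.
FAKE = [{"funds_national_name": f"Fund {i}"} for i in range(5)]

def fake_fetch(start, rows):
    return {"response": {"numFound": len(FAKE), "docs": FAKE[start:start + rows]}}

names = [d["funds_national_name"] for d in iter_docs(fake_fetch, rows=2)]
print(names)
```

Injecting the fetch function keeps the paging logic independent of the HTTP client and of the specific register endpoint.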
4. Entity Register (Management Companies)
API Endpoint
https://registers.esma.europa.eu/solr/esma_registers_upreg/select?q=...&wt=json
Fields for UCITS Management Companies (ae_entityTypeCode:UCI)
| Field Name | Type | Description | openfunds Equivalent |
|---|---|---|---|
| ae_entityName | text | Entity (ManCo) name | OFST001020 ManCo |
| ae_lei | text | Entity LEI | — |
| ae_entityTypeCode | text | Entity type (UCI, AIF, MIF, etc.) | — |
| ae_competentAuthority | text | NCA name | OFST010060 Supervisory Authority |
| ae_homeMemberState | text | Home member state | — |
| ae_hostMemberState | text | Host member state(s) | — |
| ae_status | text | Authorization status | — |
| ae_authorisationNotificationDate | date | Authorization date | — |
| ae_website | text | Entity website | — |
| ae_legalform | text | Legal form | OFST160100 Legal Form |
| ae_commercialName | text | Commercial/brand name | OFST001000 Fund Group Name |
| ac_serviceName | text | Licensed services | — |
| no_of_funds | string | Number of managed funds | — |
Example: List all UCITS Management Companies
curl "https://registers.esma.europa.eu/solr/esma_registers_upreg/select?q={!join+from=id+to=_root_}ae_entityTypeCode:UCI&fq=(type_s:parent)&rows=1000&wt=json&indent=true"
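Once the response is decoded, the returned documents can be summarised using the fields listed above. As a rough sketch, the function below tallies management companies per home member state; it assumes `ae_homeMemberState` arrives as a single string per record, and the sample records are made up for illustration.

```python
from collections import Counter

def count_by_home_state(docs):
    """Tally entities per `ae_homeMemberState`; records without it are skipped."""
    return Counter(
        d["ae_homeMemberState"] for d in docs if d.get("ae_homeMemberState")
    )

# Illustrative records only; names and values are made up for the sketch.
sample_docs = [
    {"ae_entityName": "ManCo A", "ae_homeMemberState": "LUXEMBOURG"},
    {"ae_entityName": "ManCo B", "ae_homeMemberState": "IRELAND"},
    {"ae_entityName": "ManCo C", "ae_homeMemberState": "LUXEMBOURG"},
    {"ae_entityName": "ManCo D"},  # field missing
]

counts = count_by_home_state(sample_docs)
for state, n in counts.most_common():
    print(state, n)
```

If the register returns the field as a list, the generator expression would need to flatten it first.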
5. FIRDS (Financial Instruments Reference Data System)
FIRDS contains MiFIR reference data for all financial instruments traded on EU venues, including fund shares/units. Funds are classified with CFI codes starting with "C" (Collective Investment Schemes).
Access Methods
- Full/Delta XML files: Downloaded from ESMA registers portal
- Python package: esma_data_py on GitHub
- API: Via the ESMA API store
Key Fields for Fund Instruments
| Field | Description | openfunds Equivalent |
|---|---|---|
| ISIN | International Securities Identification Number | OFST020000 ISIN |
| CFI Code | Classification of Financial Instruments (ISO 10962) | OFST350015 CFI Code |
| Instrument Full Name | Name of the instrument | OFST020060 Full Share Class Name |
| Issuer LEI | LEI of the issuer/ManCo | OFST010030 LEI Of Fund |
| Notional Currency | Currency of the instrument | OFST020540 Share Class Currency |
| Trading Venue MIC | Where the instrument is traded | OFST060000-range Listing data |
| Maturity Date | For dated instruments | — |
| Nominal Value | Face value per unit | — |
Python Access
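from esma_data_py import EsmaDataLoader

edl = EsmaDataLoader()
# Load FIRDS data for Collective Investment Schemes.
# Keyword argument names may differ between package versions; check the
# installed esma_data_py signature before relying on this call.
df = edl.load_latest_files(instrument_type="FULINS", cfi_codes=["C*"])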
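Independently of how the FIRDS file is loaded, selecting fund instruments comes down to a prefix test on the CFI code. The sketch below works on plain dictionaries; the keys `isin` and `cfi` are hypothetical stand-ins for whatever column names the loaded file uses, and the sample rows are made up.

```python
def is_fund_instrument(cfi_code):
    """True when the CFI code falls in category 'C', as described above."""
    return isinstance(cfi_code, str) and cfi_code.upper().startswith("C")

def select_funds(rows):
    """Keep only rows whose CFI code marks a collective investment scheme."""
    return [r for r in rows if is_fund_instrument(r.get("cfi"))]

# Made-up rows; `isin` and `cfi` are assumed key names.
rows = [
    {"isin": "XX0000000001", "cfi": "CIOGEU"},
    {"isin": "XX0000000002", "cfi": "ESVUFR"},
    {"isin": "XX0000000003", "cfi": None},
]

funds = select_funds(rows)
print([r["isin"] for r in funds])
```

Rows with a missing CFI code are dropped rather than raising, since gaps in reference data are plausible.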
6. Money Market Fund Register
API Endpoint
https://registers.esma.europa.eu/solr/esma_registers_mmf04/select?q=type_s:parent&wt=json
Fields
| Field Name | Description | openfunds Equivalent |
|---|---|---|
| mmf04_lei | Fund LEI | OFST010030 LEI Of Fund |
| mmf04_national_name | Fund national name | OFST010110 Legal Fund Name |
| mmf04_domicile_name | Fund domicile | OFST010010 Fund Domicile |
| mmf04_type_name | MMF type (CNAV/LVNAV/VNAV) | OFST351300 Money Market Type |
| mmf04_lgl_framework_name | Legal framework (UCITS/AIF) | OFST160100 Legal Form |
| mmf04_is_passported_name | Cross-border passported? | — |
| mmf04_auth_status_name | Authorization status | — |
| mmf04_manager_lei | Manager LEI | — |
| mmf04_manager_nat_name | Manager name | OFST001020 ManCo |
| mmf04_manager_domicile_name | Manager domicile | — |
| mmf04_auth_start_date | Authorization start date | OFST010240 Fund Launch Date |
| mmf04_auth_end_date | Authorization end date | — |
| mmf04_ca_cou_code_name | Competent authority country | OFST010060 Supervisory Authority |
| mmf04_auth_ca_code_name | Authorizing CA | — |
7. Prospectus Register (Prospectus Regulation III)
This register covers securities prospectuses under the EU Prospectus Regulation, not UCITS KIIDs/KIDs. However, some fund-related securities (listed fund shares, ETFs) may appear here.
API Endpoint
https://registers.esma.europa.eu/solr/esma_registers_priii_documents/select?q=...&wt=json
Document Types
| Code | Type |
|---|---|
| URGN | Universal Registration Document |
| REGN | Registration Document |
| SECN | Securities Note |
| SMRY | Summary |
| BPFT | Base Prospectus Final Terms |
| BPWO | Base Prospectus without Final Terms |
| STDA | Standalone Prospectus |
Searchable Fields
- issuer_lei: Issuer LEI
- issuer_name: Issuer name
- issuer_residency: Issuer country
- offeror_lei / offeror_name / offeror_residency
- guarantor_lei / guarantor_name / guarantor_residency
- approval_filing_date: Document approval date
- document_type: Type code (see above)
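The searchable fields and document type codes can be combined into a query expression for the `q` parameter. The sketch below uses standard Lucene `field:"value"` syntax joined with `AND`; the helper name `prospectus_query` is hypothetical, and the LEI shown is a placeholder rather than a real identifier.

```python
# Document type codes from the table above.
DOCUMENT_TYPES = {
    "URGN": "Universal Registration Document",
    "REGN": "Registration Document",
    "SECN": "Securities Note",
    "SMRY": "Summary",
    "BPFT": "Base Prospectus Final Terms",
    "BPWO": "Base Prospectus without Final Terms",
    "STDA": "Standalone Prospectus",
}

def prospectus_query(issuer_lei, document_type):
    """Build a `q` expression for one issuer and one document type."""
    if document_type not in DOCUMENT_TYPES:
        raise ValueError(f"unknown document type code: {document_type!r}")
    return f'issuer_lei:"{issuer_lei}" AND document_type:"{document_type}"'

# Placeholder LEI, for illustration only.
q = prospectus_query("00000000000000000000", "STDA")
print(q)
```

Validating the code up front catches typos that would otherwise silently produce an empty result set.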
8. AIFMD Reporting (Not Publicly Available)
ESMA collects detailed fund data through AIFMD Article 24 reporting, but this data is not publicly accessible. It is submitted by AIFMs to NCAs and forwarded to ESMA for supervisory purposes only.
Fields collected (not public):
| Category | Fields |
|---|---|
| Fund Identity | AIF name, national code, LEI, domicile, inception date |
| Fund Type | Predominant AIF type, investment strategy, sub-strategy |
| Assets | Gross Asset Value (GAV), Net Asset Value (NAV), base currency |
| Leverage | Gross method leverage, commitment method leverage |
| Investor Types | Breakdown by investor category (professional, retail, etc.) |
| Geographic Focus | Geographic breakdown of investments |
| Asset Allocation | Breakdown by asset type (equities, bonds, derivatives, etc.) |
| Liquidity | Portfolio liquidity profile, investor redemption frequency |
| Counterparty Risk | Top 5 counterparty exposures |
| Risk Measures | VaR, stress test results |
This is the richest structured dataset but is confidential and only available to regulators.
9. Comparison: ESMA vs SEC Fund Data
| Aspect | SEC (US) | ESMA (EU) |
|---|---|---|
| Prospectus text | Full prospectus filed as HTML/XML on EDGAR | Not centralized; filed with individual NCAs |
| Structured prospectus data | XBRL Risk/Return Summary (fees, performance, objective) | Not available — no EU-wide structured extraction |
| Fund identity register | Series/Class CSV (CIK, Series ID, Class ID, ticker) | Fund Register (LEI, name, domicile, manager) |
| Portfolio holdings | N-PORT (quarterly, position-level) | Not public — AIFMD reporting is confidential |
| Instrument reference data | Limited (CUSIP in Series/Class CSV) | FIRDS (ISIN, CFI, LEI, currency, trading venue) |
| Fee data (structured) | XBRL: management fee, TER, loads, 12b-1 | Not available in ESMA registers |
| Performance data (structured) | XBRL: 1yr/5yr/10yr returns, bar charts | Not available in ESMA registers |
| Risk data (structured) | N-PORT: DV01, credit spread, VaR | AIFMD reporting (confidential) |
| Cross-border/passporting | N/A (single market) | Full cross-border notification register |
| API quality | Excellent (REST JSON, free, no auth) | Good (Solr JSON/XML, free, no auth) |
| Bulk download | ZIP files (submissions, XBRL, N-PORT) | FIRDS XML bulk files; fund register via Solr pagination |
Key Difference
The SEC provides structured data extracted from prospectuses (fees, performance, objectives via XBRL), making it directly useful for LLM training with ground-truth labels. ESMA provides authorization/registration data (who is authorized, where, by whom) but does not centralize or structure the content of fund prospectuses/KIIDs/KIDs.
For EU fund prospectus content (KIID/KID), you would need to go to:
- Individual NCAs (AMF in France, BaFin in Germany, CSSF in Luxembourg, etc.)
- Commercial data providers (Morningstar, Refinitiv, FE fundinfo)
- Fund company websites directly
10. What openfunds Fields Can Be Found in ESMA Data?
Directly Available (from ESMA registers)
| openfunds OF-ID | Field Name | ESMA Source |
|---|---|---|
| OFST001000 | Fund Group Name | Entity Register: ae_commercialName |
| OFST001020 | ManCo | Fund Register: funds_manager_nat_name |
| OFST010010 | Fund Domicile | Fund Register: funds_domicile_cou_code_name |
| OFST010030 | LEI Of Fund | Fund Register: funds_lei |
| OFST010060 | Supervisory Authority | Fund Register: funds_ca_cou_code_name |
| OFST010110 | Legal Fund Name | Fund Register: funds_national_name |
| OFST020000 | ISIN | FIRDS: ISIN field |
| OFST020540 | Share Class Currency | FIRDS: Notional Currency |
| OFST160100 | Legal Form | Fund Register: funds_legal_framework_name |
| OFST350015 | CFI Code | FIRDS: CFI Code |
| OFST351295 | Is Money Market Fund | MMF Register: presence in register |
| OFST351300 | Money Market Type | MMF Register: mmf04_type_name |
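The register-based rows of this mapping can be applied mechanically to a decoded record. Below is a minimal sketch that re-keys a record by OF-ID; the dictionary covers only the rows above that name a concrete register field, the function name `to_openfunds` is hypothetical, and the sample record is made up.

```python
# ESMA register field -> openfunds OF-ID, taken from the table above.
ESMA_TO_OPENFUNDS = {
    "ae_commercialName": "OFST001000",
    "funds_manager_nat_name": "OFST001020",
    "funds_domicile_cou_code_name": "OFST010010",
    "funds_lei": "OFST010030",
    "funds_ca_cou_code_name": "OFST010060",
    "funds_national_name": "OFST010110",
    "funds_legal_framework_name": "OFST160100",
    "mmf04_type_name": "OFST351300",
}

def to_openfunds(record):
    """Re-key a register record by OF-ID, dropping unmapped or empty fields."""
    return {
        ESMA_TO_OPENFUNDS[key]: value
        for key, value in record.items()
        if key in ESMA_TO_OPENFUNDS and value not in (None, "")
    }

# Made-up Fund Register record for illustration.
record = {
    "funds_national_name": "Example Fund",
    "funds_lei": "",
    "funds_legal_framework_name": "AIF",
    "funds_status_code_name": "Active",  # no openfunds equivalent
}
mapped = to_openfunds(record)
print(mapped)
```

Fields without an openfunds equivalent are dropped here; a fuller pipeline might keep them under their ESMA names instead.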
NOT Available in ESMA Public Data
| Category | openfunds Fields | Notes |
|---|---|---|
| Fees | Management fee, TER, loads, subscription/redemption fees | Not in any ESMA register |
| Performance | Returns, Sharpe ratio, volatility | Not in any ESMA register |
| Investment Objective | Strategy text, objective text | Not in any ESMA register |
| Risk Data | SRRI, VaR, risk narrative | AIFMD reporting (confidential) |
| Asset Class | Detailed asset allocation | AIFMD reporting (confidential) |
| Distribution Policy | Distributing/accumulating | Not in any ESMA register |
| Minimum Investment | Min subscription amount | Not in any ESMA register |
| Benchmark | Benchmark index name | Not in any ESMA register |
| Portfolio Holdings | Position-level data | AIFMD reporting (confidential) |
11. Summary for Your LLM Use Case
ESMA data is useful for fund identity and cross-referencing (LEI, domicile, manager, legal framework, cross-border marketing status), but it does not provide the structured prospectus-derived data (fees, performance, objectives, risk) that the SEC's XBRL Risk/Return Summary provides.
Practical implications:
- For US funds: SEC EDGAR provides both the prospectus text AND structured ground-truth data — ideal for supervised LLM training
- For EU funds: ESMA provides identity/authorization data only. To get the prospectus text + structured reference data for EU funds, you would need to combine ESMA register data with prospectus documents sourced from national regulators or commercial providers
Recommended approach for EU data:
- Use ESMA Fund Register + FIRDS for fund identity (LEI, ISIN, domicile, ManCo, CFI)
- Source KIID/KID documents from NCAs or fund company websites
- Use openfunds-format data from commercial providers as ground truth
- Or focus on the SEC dataset first (much richer, more accessible) and extend to EU later
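The first step of this approach, combining Fund Register identity with FIRDS instruments, amounts to a join on LEI. The sketch below is one possible way to do it under stated assumptions: FIRDS rows are taken to expose the issuer LEI and ISIN under the hypothetical keys `issuer_lei` and `isin`, only the fund's own LEI is matched, and all identifiers are placeholders.

```python
from collections import defaultdict

def attach_isins(funds, instruments):
    """Return fund records with an added `isins` list, matched on fund LEI.

    Assumption: each instrument row has `issuer_lei` and `isin` keys
    (hypothetical names for the corresponding FIRDS fields).
    """
    by_lei = defaultdict(list)
    for inst in instruments:
        by_lei[inst["issuer_lei"]].append(inst["isin"])
    return [
        {**fund, "isins": sorted(by_lei.get(fund.get("funds_lei"), []))}
        for fund in funds
    ]

# Placeholder identifiers, for illustration only.
funds = [
    {"funds_national_name": "Fund A", "funds_lei": "LEI_FUND_A"},
    {"funds_national_name": "Fund B", "funds_lei": None},
]
instruments = [
    {"isin": "XX0000000002", "issuer_lei": "LEI_FUND_A"},
    {"isin": "XX0000000001", "issuer_lei": "LEI_FUND_A"},
    {"isin": "XX0000000003", "issuer_lei": "LEI_OTHER"},
]

joined = attach_isins(funds, instruments)
print(joined)
```

Because the FIRDS issuer LEI may be that of the ManCo rather than the fund, a second lookup on `funds_manager_lei` could be added where the fund LEI yields no match, at the cost of attaching instruments from sibling funds.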