fund_rfid_data/ESMA_FUND_DATA_RESEARCH.md
Florian Herzog 1993658fb2 Add SEC fund prospectus -> RDF triple dataset pipeline
2026-06-03 10:31:35 +02:00


ESMA Fund Data: Registers, APIs, and Reference Data Fields

1. Overview: ESMA's Fund Data Ecosystem

ESMA (European Securities and Markets Authority) maintains six distinct data systems relevant to fund data. Unlike the US SEC, which centralizes prospectus filing and structured data extraction, ESMA spreads fund data across several regulatory registers that focus on authorization, cross-border notification, and instrument identification rather than on prospectus content.

| System | Content | Access | Fund Relevance |
| --- | --- | --- | --- |
| Fund Register (AIF/EuSEF/EuVECA) | Authorized AIF funds & managers | Solr API (JSON/XML) | Direct - fund-level identity |
| Cross-border Marketing Register (UCITS + AIF) | Fund passporting notifications | Solr API (JSON/XML) | Direct - UCITS & AIF cross-border data |
| Entity Register (MiFID/UCITS/AIFMD) | Management companies, AIFMs | Solr API (JSON/XML) | Manager-level data |
| FIRDS (Financial Instruments Reference Data) | MiFIR instrument reference data | XML bulk files + API | Instrument-level (ISIN, CFI) |
| Prospectus Register (Prospectus Regulation III) | Approved EU prospectuses + supplements | Solr API (JSON/XML) | Securities prospectuses (not fund KIIDs) |
| Money Market Fund Register | MMF authorizations | Solr API (JSON/XML) | MMF-specific |

2. Fund Register: AIF/EuSEF/EuVECA Funds

API Endpoint

https://registers.esma.europa.eu/solr/esma_registers_funds/select?q=...&wt=json

Available Fields

| Field Name | Type | Description | openfunds Equivalent |
| --- | --- | --- | --- |
| funds_national_name | text | Fund name (national identifier name) | OFST010110 Legal Fund Name |
| funds_lei | text | LEI of the fund | OFST010030 LEI Of Fund |
| funds_legal_framework_name | text | Legal framework: AIF, EuSEF, EuVECA | OFST160100 Legal Form |
| funds_other_legal_framework_name | text | Additional legal framework info | |
| funds_status_code_name | text | Fund authorization status | |
| funds_mgmnt_structure_code_name | text | Management structure type | OFST010420 Open-ended/Closed-ended |
| funds_domicile_cou_code_name | text | Fund domicile country | OFST010010 Fund Domicile |
| funds_mgmnt_status_code_name | text | Management status | |
| funds_manager_nat_name | text | Management company name | OFST001020 ManCo |
| funds_manager_lei | text | Manager LEI | |
| funds_manager_cou_code_name | text | Manager country | |
| funds_manager_legal_framework_name | text | Manager legal framework | |
| funds_host_country_code_name | text | Host member states (marketing countries) | OFST6000XX Country registrations |
| funds_fund_mrkt_status_code_name | text | Marketing status per country | |
| funds_notification_event1_date | date | First notification date | |
| funds_notification_event2_date | date | Second notification date | |
| funds_notif_legal_framework_name | text | Notification legal framework | |
| funds_ca_cou_code_name | text | Competent authority country | OFST010060 Supervisory Authority |

Example API Call (all AIF funds, JSON format)

curl "https://registers.esma.europa.eu/solr/esma_registers_funds/select?q=type_s:*&fq=funds_legal_framework_name:%22AIF%22&fl=funds_national_name,funds_lei,funds_domicile_cou_code_name,funds_manager_nat_name,funds_manager_lei&rows=100&wt=json&indent=true"
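The call above returns a single page of 100 rows. As a minimal sketch of walking the whole register, the snippet below builds one URL per page using Solr's standard `start` and `rows` parameters. The function name `build_page_urls` and the `total` argument are illustrative assumptions; in practice the total would come from the `numFound` value of a first response.

```python
from urllib.parse import urlencode

BASE_URL = "https://registers.esma.europa.eu/solr/esma_registers_funds/select"

# Same field list as the curl example above.
FIELDS = [
    "funds_national_name",
    "funds_lei",
    "funds_domicile_cou_code_name",
    "funds_manager_nat_name",
    "funds_manager_lei",
]


def build_page_urls(total, rows=100, framework="AIF"):
    """Return one select URL per page (hypothetical helper, not an ESMA API)."""
    urls = []
    for start in range(0, total, rows):
        params = {
            "q": "type_s:*",
            "fq": f'funds_legal_framework_name:"{framework}"',
            "fl": ",".join(FIELDS),
            "start": start,  # offset of the first row on this page
            "rows": rows,    # page size
            "wt": "json",
        }
        # urlencode percent-encodes the colon, quotes, and commas.
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


# Assumed total of 250 matching funds, so three pages are needed.
for url in build_page_urls(total=250, rows=100):
    print(url)
```

Each URL can then be fetched with any HTTP client; the sketch deliberately stops before the network call.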

3. Cross-border Marketing Register (UCITS + AIF)

API Endpoint

https://registers.esma.europa.eu/solr/esma_registers_funds_cbdif/select?q=...&wt=json

It uses the same field structure as the Fund Register above but also covers UCITS funds. Apart from the Money Market Fund Register, which lists UCITS money market funds, this is the only ESMA register with UCITS fund-level data.

Filter by fund type

# UCITS funds
...&fq=funds_legal_framework_name:"UCITS"

# AIF funds
...&fq=funds_legal_framework_name:"AIF"

# ELTIF funds
...&fq=funds_legal_framework_name:"ELTIF"
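The filters above are written with literal double quotes, which need percent-encoding in a real request (the Fund Register example uses `%22AIF%22`). A small sketch of producing the encoded form follows; the helper names are hypothetical, and `q=*:*` is Solr's generic match-all query, used here as an assumption because the endpoint's `q` value is elided in the text.

```python
from urllib.parse import quote

CBDIF_URL = "https://registers.esma.europa.eu/solr/esma_registers_funds_cbdif/select"


def framework_filter(framework):
    """Return the percent-encoded fq fragment for one legal framework."""
    raw = f'funds_legal_framework_name:"{framework}"'
    # Keep the colon readable; the double quotes become %22.
    return "fq=" + quote(raw, safe=":")


def cbdif_url(framework, rows=10):
    """Build a full query URL (q=*:* is an assumed match-all query)."""
    return f"{CBDIF_URL}?q=*:*&{framework_filter(framework)}&rows={rows}&wt=json"


for name in ("UCITS", "AIF", "ELTIF"):
    print(cbdif_url(name))
```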

4. Entity Register (Management Companies)

API Endpoint

https://registers.esma.europa.eu/solr/esma_registers_upreg/select?q=...&wt=json

Fields for UCITS Management Companies (ae_entityTypeCode:UCI)

| Field Name | Type | Description | openfunds Equivalent |
| --- | --- | --- | --- |
| ae_entityName | text | Entity (ManCo) name | OFST001020 ManCo |
| ae_lei | text | Entity LEI | |
| ae_entityTypeCode | text | Entity type (UCI, AIF, MIF, etc.) | |
| ae_competentAuthority | text | NCA name | OFST010060 Supervisory Authority |
| ae_homeMemberState | text | Home member state | |
| ae_hostMemberState | text | Host member state(s) | |
| ae_status | text | Authorization status | |
| ae_authorisationNotificationDate | date | Authorization date | |
| ae_website | text | Entity website | |
| ae_legalform | text | Legal form | OFST160100 Legal Form |
| ae_commercialName | text | Commercial/brand name | OFST001000 Fund Group Name |
| ac_serviceName | text | Licensed services | |
| no_of_funds | string | Number of managed funds | |

Example: List all UCITS Management Companies

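# -g (--globoff) keeps curl from treating {...} as a URL glob; single quotes keep interactive shells from history-expanding the "!"
curl -g 'https://registers.esma.europa.eu/solr/esma_registers_upreg/select?q={!join+from=id+to=_root_}ae_entityTypeCode:UCI&fq=(type_s:parent)&rows=1000&wt=json&indent=true'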
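A query like this returns the usual Solr JSON envelope, with the matches under `response.docs` and the total hit count under `response.numFound`. The sketch below pulls a few of the fields listed above out of such a payload. The sample payload, its values, and the function name `extract_entities` are invented for illustration; the value formats of the real register (for example how member states are spelled) are not confirmed here.

```python
import json

# Invented sample in the standard Solr response shape; not real register data.
SAMPLE_PAYLOAD = """
{
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"ae_entityName": "Example ManCo A", "ae_lei": "LEI-PLACEHOLDER-A",
       "ae_homeMemberState": "LU"},
      {"ae_entityName": "Example ManCo B", "ae_homeMemberState": "IE"}
    ]
  }
}
"""


def extract_entities(payload):
    """Return (total hits, simplified rows) from a Solr JSON response string."""
    body = json.loads(payload)["response"]
    rows = [
        {
            "name": doc.get("ae_entityName"),
            "lei": doc.get("ae_lei"),  # None when the field is absent
            "home": doc.get("ae_homeMemberState"),
        }
        for doc in body["docs"]
    ]
    return body["numFound"], rows


total, entities = extract_entities(SAMPLE_PAYLOAD)
print(total, entities)
```

Using `dict.get` keeps the loop from failing on documents that lack a field, which is common when optional attributes such as the LEI are not populated.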

5. FIRDS (Financial Instruments Reference Data System)

FIRDS contains MiFIR reference data for all financial instruments traded on EU venues, including fund shares/units. Funds are classified with CFI codes starting with "C", the ISO 10962 category for collective investment vehicles.

Access Methods

  1. Full/Delta XML files: Downloaded from ESMA registers portal
  2. Python package: esma_data_py on GitHub
  3. API: Via the ESMA API store

Key Fields for Fund Instruments

| Field | Description | openfunds Equivalent |
| --- | --- | --- |
| ISIN | International Securities Identification Number | OFST020000 ISIN |
| CFI Code | Classification of Financial Instruments (ISO 10962) | OFST350015 CFI Code |
| Instrument Full Name | Name of the instrument | OFST020060 Full Share Class Name |
| Issuer LEI | LEI of the issuer/ManCo | OFST010030 LEI Of Fund |
| Notional Currency | Currency of the instrument | OFST020540 Share Class Currency |
| Trading Venue MIC | Where the instrument is traded | OFST060000-range Listing data |
| Maturity Date | For dated instruments | |
| Nominal Value | Face value per unit | |

Python Access

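from esma_data_py import EsmaDataLoader

edl = EsmaDataLoader()
# Load the latest FIRDS full files, restricted to CFI category "C" (funds).
# NOTE: the keyword arguments below are reproduced from the original note and
# have not been verified against the package; check
# help(EsmaDataLoader.load_latest_files) for the actual parameter names.
df = edl.load_latest_files(instrument_type="FULINS", cfi_codes=["C*"])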
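Independently of how the FIRDS files are loaded, the fund filter itself is just a check on the first letter of the six-character CFI code. The following is a minimal sketch on plain dictionaries; the record keys (`isin`, `cfi`) and the sample values are placeholders chosen for illustration, not real FIRDS rows or column names.

```python
def is_collective_investment(cfi):
    """True for a six-letter CFI code whose category letter is 'C'."""
    return isinstance(cfi, str) and len(cfi) == 6 and cfi[0].upper() == "C"


# Placeholder records; ISINs and CFI values are illustrative only.
records = [
    {"isin": "PLACEHOLDER-1", "cfi": "CIOGES"},  # category C: fund
    {"isin": "PLACEHOLDER-2", "cfi": "ESVUFR"},  # category E: equity
    {"isin": "PLACEHOLDER-3", "cfi": None},      # missing classification
]

funds = [r for r in records if is_collective_investment(r["cfi"])]
print([r["isin"] for r in funds])
```

Rejecting missing or malformed codes up front avoids silently counting unclassified instruments as funds.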

6. Money Market Fund Register

API Endpoint

https://registers.esma.europa.eu/solr/esma_registers_mmf04/select?q=type_s:parent&wt=json

Fields

| Field Name | Description | openfunds Equivalent |
| --- | --- | --- |
| mmf04_lei | Fund LEI | OFST010030 LEI Of Fund |
| mmf04_national_name | Fund national name | OFST010110 Legal Fund Name |
| mmf04_domicile_name | Fund domicile | OFST010010 Fund Domicile |
| mmf04_type_name | MMF type (CNAV/LVNAV/VNAV) | OFST351300 Money Market Type |
| mmf04_lgl_framework_name | Legal framework (UCITS/AIF) | OFST160100 Legal Form |
| mmf04_is_passported_name | Cross-border passported? | |
| mmf04_auth_status_name | Authorization status | |
| mmf04_manager_lei | Manager LEI | |
| mmf04_manager_nat_name | Manager name | OFST001020 ManCo |
| mmf04_manager_domicile_name | Manager domicile | |
| mmf04_auth_start_date | Authorization start date | OFST010240 Fund Launch Date |
| mmf04_auth_end_date | Authorization end date | |
| mmf04_ca_cou_code_name | Competent authority country | OFST010060 Supervisory Authority |
| mmf04_auth_ca_code_name | Authorizing CA | |

7. Prospectus Register (Prospectus Regulation III)

This register covers securities prospectuses under the EU Prospectus Regulation, not UCITS KIIDs/KIDs. Some fund-related securities (listed fund shares, ETFs) may nevertheless appear here.

API Endpoint

https://registers.esma.europa.eu/solr/esma_registers_priii_documents/select?q=...&wt=json

Document Types

| Code | Type |
| --- | --- |
| URGN | Universal Registration Document |
| REGN | Registration Document |
| SECN | Securities Note |
| SMRY | Summary |
| BPFT | Base Prospectus Final Terms |
| BPWO | Base Prospectus without Final Terms |
| STDA | Standalone Prospectus |

Searchable Fields

  • issuer_lei — Issuer LEI
  • issuer_name — Issuer name
  • issuer_residency — Issuer country
  • offeror_lei / offeror_name / offeror_residency
  • guarantor_lei / guarantor_name / guarantor_residency
  • approval_filing_date — Document approval date
  • document_type — Type code (see above)

8. AIFMD Reporting (Not Publicly Available)

ESMA collects detailed fund data through AIFMD Article 24 reporting, but this data is not publicly accessible. It is submitted by AIFMs to NCAs and forwarded to ESMA for supervisory purposes only.

Fields collected (not public):

| Category | Fields |
| --- | --- |
| Fund Identity | AIF name, national code, LEI, domicile, inception date |
| Fund Type | Predominant AIF type, investment strategy, sub-strategy |
| Assets | Gross Asset Value (GAV), Net Asset Value (NAV), base currency |
| Leverage | Gross method leverage, commitment method leverage |
| Investor Types | Breakdown by investor category (professional, retail, etc.) |
| Geographic Focus | Geographic breakdown of investments |
| Asset Allocation | Breakdown by asset type (equities, bonds, derivatives, etc.) |
| Liquidity | Portfolio liquidity profile, investor redemption frequency |
| Counterparty Risk | Top 5 counterparty exposures |
| Risk Measures | VaR, stress test results |

This is the richest structured dataset but is confidential and only available to regulators.


9. Comparison: ESMA vs SEC Fund Data

| Aspect | SEC (US) | ESMA (EU) |
| --- | --- | --- |
| Prospectus text | Full prospectus filed as HTML/XML on EDGAR | Not centralized; filed with NCAs |
| Structured prospectus data | XBRL Risk/Return Summary (fees, performance, objective) | Not available; no EU-wide structured extraction |
| Fund identity register | Series/Class CSV (CIK, Series ID, Class ID, ticker) | Fund Register (LEI, name, domicile, manager) |
| Portfolio holdings | N-PORT (quarterly, position-level) | Not public; AIFMD reporting is confidential |
| Instrument reference data | Limited (CUSIP in Series/Class CSV) | FIRDS (ISIN, CFI, LEI, currency, trading venue) |
| Fee data (structured) | XBRL: management fee, TER, loads, 12b-1 | Not available in ESMA registers |
| Performance data (structured) | XBRL: 1yr/5yr/10yr returns, bar charts | Not available in ESMA registers |
| Risk data (structured) | N-PORT: DV01, credit spread, VaR | AIFMD reporting (confidential) |
| Cross-border/passporting | N/A (single market) | Full cross-border notification register |
| API quality | Excellent (REST JSON, free, no auth) | Good (Solr JSON/XML, free, no auth) |
| Bulk download | ZIP files (submissions, XBRL, N-PORT) | FIRDS XML bulk files; fund register via Solr pagination |

Key Difference

The SEC provides structured data extracted from prospectuses (fees, performance, objectives via XBRL), making it directly useful for LLM training with ground-truth labels. ESMA provides authorization/registration data (who is authorized, where, by whom) but does not centralize or structure the content of fund prospectuses/KIIDs/KIDs.

For EU fund prospectus content (KIID/KID), you would need to go to:

  • Individual NCAs (AMF in France, BaFin in Germany, CSSF in Luxembourg, etc.)
  • Commercial data providers (Morningstar, Refinitiv, FE fundinfo)
  • Fund company websites directly

10. What openfunds Fields Can Be Found in ESMA Data?

Directly Available (from ESMA registers)

| openfunds OF-ID | Field Name | ESMA Source |
| --- | --- | --- |
| OFST001000 | Fund Group Name | Entity Register: ae_commercialName |
| OFST001020 | ManCo | Fund Register: funds_manager_nat_name |
| OFST010010 | Fund Domicile | Fund Register: funds_domicile_cou_code_name |
| OFST010030 | LEI Of Fund | Fund Register: funds_lei |
| OFST010060 | Supervisory Authority | Fund Register: funds_ca_cou_code_name |
| OFST010110 | Legal Fund Name | Fund Register: funds_national_name |
| OFST020000 | ISIN | FIRDS: ISIN field |
| OFST020540 | Share Class Currency | FIRDS: Notional Currency |
| OFST160100 | Legal Form | Fund Register: funds_legal_framework_name |
| OFST350015 | CFI Code | FIRDS: CFI Code |
| OFST351295 | Is Money Market Fund | MMF Register: presence in register |
| OFST351300 | Money Market Type | MMF Register: mmf04_type_name |
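The Fund Register rows of this mapping can be applied mechanically to a Solr document. The sketch below renames the six Fund Register fields to their openfunds OF-IDs and drops anything missing or empty. The function name `to_openfunds` and the sample document are illustrative assumptions, and the sample values are placeholders rather than real register content.

```python
# Fund Register field -> openfunds OF-ID, taken from the mapping above.
FUND_REGISTER_TO_OPENFUNDS = {
    "funds_manager_nat_name": "OFST001020",        # ManCo
    "funds_domicile_cou_code_name": "OFST010010",  # Fund Domicile
    "funds_lei": "OFST010030",                     # LEI Of Fund
    "funds_ca_cou_code_name": "OFST010060",        # Supervisory Authority
    "funds_national_name": "OFST010110",           # Legal Fund Name
    "funds_legal_framework_name": "OFST160100",    # Legal Form
}


def to_openfunds(doc):
    """Rename mapped fields to OF-IDs, skipping missing or empty values."""
    return {
        of_id: doc[field]
        for field, of_id in FUND_REGISTER_TO_OPENFUNDS.items()
        if doc.get(field) not in (None, "")
    }


# Placeholder document; values are invented for illustration.
sample_doc = {
    "funds_national_name": "Example Fund",
    "funds_lei": "LEI-PLACEHOLDER",
    "funds_legal_framework_name": "AIF",
    "funds_manager_lei": "LEI-PLACEHOLDER-MANCO",  # no OF-ID in the mapping
    "funds_manager_nat_name": "",                  # empty, so skipped
}
print(to_openfunds(sample_doc))
```

Fields without an openfunds counterpart, such as `funds_manager_lei`, are simply left out of the result.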

NOT Available in ESMA Public Data

| Category | openfunds Fields | Notes |
| --- | --- | --- |
| Fees | Management fee, TER, loads, subscription/redemption fees | Not in any ESMA register |
| Performance | Returns, Sharpe ratio, volatility | Not in any ESMA register |
| Investment Objective | Strategy text, objective text | Not in any ESMA register |
| Risk Data | SRRI, VaR, risk narrative | AIFMD reporting (confidential) |
| Asset Class | Detailed asset allocation | AIFMD reporting (confidential) |
| Distribution Policy | Distributing/accumulating | Not in any ESMA register |
| Minimum Investment | Min subscription amount | Not in any ESMA register |
| Benchmark | Benchmark index name | Not in any ESMA register |
| Portfolio Holdings | Position-level data | AIFMD reporting (confidential) |

11. Summary for Your LLM Use Case

ESMA data is useful for fund identity and cross-referencing (LEI, domicile, manager, legal framework, cross-border marketing status), but it does not provide the structured prospectus-derived data (fees, performance, objectives, risk) that the SEC's XBRL Risk/Return Summary provides.

Practical implications:

  • For US funds: SEC EDGAR provides both the prospectus text AND structured ground-truth data — ideal for supervised LLM training
  • For EU funds: ESMA provides identity/authorization data only. To get the prospectus text + structured reference data for EU funds, you would need to combine ESMA register data with prospectus documents sourced from national regulators or commercial providers

Recommended approach for EU data:

  1. Use ESMA Fund Register + FIRDS for fund identity (LEI, ISIN, domicile, ManCo, CFI)
  2. Source KIID/KID documents from NCAs or fund company websites
  3. Use openfunds-format data from commercial providers as ground truth
  4. Or focus on the SEC dataset first (much richer, more accessible) and extend to EU later
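Step 1 of this approach amounts to joining Fund Register records with FIRDS instruments on the LEI. A rough sketch follows, assuming the FIRDS rows have already been parsed into dictionaries with `issuer_lei` and `isin` keys (both key names are hypothetical). Because the FIRDS issuer LEI may belong to the ManCo rather than the fund, an empty match list does not prove that the fund has no listed share classes.

```python
from collections import defaultdict


def attach_isins(funds, instruments):
    """Add an 'isins' list to each fund record, matched on LEI."""
    isins_by_lei = defaultdict(list)
    for inst in instruments:
        lei = inst.get("issuer_lei")
        if lei:  # skip instruments without an issuer LEI
            isins_by_lei[lei].append(inst["isin"])
    return [
        {**fund, "isins": sorted(isins_by_lei.get(fund.get("funds_lei"), []))}
        for fund in funds
    ]


# Placeholder inputs; identifiers are invented for illustration.
funds = [
    {"funds_national_name": "Example Fund A", "funds_lei": "LEI-A"},
    {"funds_national_name": "Example Fund B", "funds_lei": "LEI-B"},
]
instruments = [
    {"isin": "ISIN-2", "issuer_lei": "LEI-A"},
    {"isin": "ISIN-1", "issuer_lei": "LEI-A"},
    {"isin": "ISIN-3", "issuer_lei": None},
]

for row in attach_isins(funds, instruments):
    print(row["funds_national_name"], row["isins"])
```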