Florian Herzog 9dc870b8d0 Add 3x-context dataset variant (trainset --radius)

- build_trainset gains --radius (chars each side of the cited name) and --out;
  merge-gap scales with radius. Default 600 unchanged.
- trainset_3x + train/val/test_3x.jsonl: same 10,519 triples and same trust split,
  but ~3x more surrounding prose per triple (~47 -> ~132 tokens/triple, median
  ~3.7k tokens/sample). Keeps the 100% name-in-text guarantee.
- DATASET.md documents both context sizes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-10 16:37:30 +02:00

data

Add 3x-context dataset variant (trainset --radius)

2026-06-10 16:37:30 +02:00

.gitignore

Commit training-ready dataset (~6 MB) + DATASET.md usage guide

2026-06-10 16:22:39 +02:00

build_rdf_dataset.py

Add 3x-context dataset variant (trainset --radius)

2026-06-10 16:37:30 +02:00

dataset_description.pdf

Drop non-extractable custodian relation; add per-triple grounded flag

2026-06-05 10:34:14 +02:00

dataset_description.tex

Drop non-extractable custodian relation; add per-triple grounded flag

2026-06-05 10:34:14 +02:00

DATASET.md

Add 3x-context dataset variant (trainset --radius)

2026-06-10 16:37:30 +02:00

documentation.pdf

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

ESMA_FUND_DATA_RESEARCH.md

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

fetch_filings.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

fetch_universe.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

finalize_dataset.sh

Add LLM role-check grounding + labelled training-set pipeline

2026-06-10 13:52:50 +02:00

fund_db.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

LICENSE

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

llm_extract.py

Add LLM role-check grounding + labelled training-set pipeline

2026-06-10 13:52:50 +02:00

load_ncen.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

load_nport.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

load_xbrl_rr.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

OPENFUNDS_PUBLIC_DATA_SOURCES.md

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

pipeline.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

README.md

Add LLM grounding pipeline: current-source fetch, alias + LLM role-check matching

2026-06-09 13:45:32 +02:00

requirements.txt

Add LLM grounding pipeline: current-source fetch, alias + LLM role-check matching

2026-06-09 13:45:32 +02:00

resume_match.sh

Add LLM role-check grounding + labelled training-set pipeline

2026-06-10 13:52:50 +02:00

score_baseline.py

Add LLM grounding pipeline: current-source fetch, alias + LLM role-check matching

2026-06-09 13:45:32 +02:00

SEC_FUND_DATA_RESEARCH.md

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

sec_fund_fetcher.py

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

SEC_REFERENCE_DATA_vs_OPENFUNDS.md

Add SEC fund prospectus -> RDF triple dataset pipeline

2026-06-03 10:31:35 +02:00

watch_fetch.sh

Add LLM grounding pipeline: current-source fetch, alias + LLM role-check matching

2026-06-09 13:45:32 +02:00

watch_match.sh

Add LLM role-check grounding + labelled training-set pipeline

2026-06-10 13:52:50 +02:00

README.md

fund_rdf_data

A relationship-rich finance dataset for text-to-RDF-triple extraction, built from mandatory U.S. SEC fund disclosures. Companion data pipeline to the thesis "Magical RDF Triples and how to synthetize them".

Each sample pairs a long natural-language prospectus (incl. SAI) (input) with a compact graph of entity-to-entity RDF triples (target) — a fund advised by a manager, distributed by an underwriter, seriesOf a trust, and so on. Unlike Wikidata-derived corpora where text ≈ triples, here the input is far larger than the output, and the target is a genuine knowledge graph rather than flat attributes. Ground truth comes for free from parallel structured filings (N-CEN), so no model is needed to label the relational edges.

Text↔triple agreement — the central requirement. For a text→triple dataset a gold triple is only useful if the fact is actually stated in the input. Three things were needed to achieve high agreement:

1. Granularity: one sample per trust from the full prospectus+SAI book, not the tiny per-fund summary segment (where most provider names are absent).

2. Currency: the N-CEN gold is current, so the source must be too. Fetch the newest full book plus all later supplements (497), not an old or partial filing. (A "largest book" can be 2–5 years stale and disagree with the gold.)

3. Role-correct grounding: a name appearing in the text is not enough; it must appear in that role. A lexical/alias matcher over-keeps (e.g. a bank named as securities-lending agent is not the custodian; a parent company is not the named sub-adviser). An LLM verifier (see below) checks the role and is the accurate filter; the fast alias matcher is a cheap pre-screen.

With (1)+(2), the entity name is present for ~93% of gold triples; the LLM role-check then keeps only those actually asserted in that role.
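
To make the difference between name presence and role-correct grounding concrete, here is a minimal sketch of a lexical alias pre-screen. It is not the matcher used in build_rdf_dataset.py; `normalize`, `alias_present` and the suffix list are hypothetical names chosen for illustration.

```python
import re

# Hypothetical list of legal-form suffixes to ignore when comparing names.
_SUFFIXES = {"inc", "llc", "lp", "ltd", "co", "corp", "corporation", "company"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal-form tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def alias_present(entity_name: str, text: str) -> bool:
    """Cheap pre-screen: does the normalized name occur as whole tokens in the text?"""
    needle = normalize(entity_name)
    if not needle:
        return False
    haystack = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    return f" {needle} " in f" {haystack} "


# The name is present, but in the securities-lending role, not as custodian.
# A lexical check cannot tell the difference; that is the role-check's job.
sentence = ("Securities lending services are provided by "
            "State Street Bank and Trust Company.")
print(alias_present("State Street Bank and Trust Co.", sentence))
```

A check like this answers only "is the name there?", which is why it can serve as a fast pre-screen but over-keeps when used as the final filter.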

See dataset_description.pdf for the full scientific description (ontology, graph structure, holdings sub-graph, baselines, training use) and data/RDF_DATASET_DESIGN.md for design notes.

Pipeline

The dataset is built by build_rdf_dataset.py in four stages:

# 1. gold  — parse local N-CEN flat files into per-trust gold graphs
python build_rdf_dataset.py gold --custodian-scope primary

# 2. fetch — newest full prospectus book + all later supplements (497) per trust,
#    aligned in time with the current N-CEN gold. (--no-sai to skip SAI docs.)
python build_rdf_dataset.py fetch --limit 435 --max-filings 8   # --ciks ... to target

# 3. samples — one sample per TRUST (whole book -> all the trust's triples),
#    filtered by the grounding mode (default alias; use llm for role-correct).
python build_rdf_dataset.py samples --whole-trust --grounding alias

# (optional) LLM role-check grounding: run the match, then rebuild with --grounding llm
python llm_extract.py --mode match --backend vllm \
    --in data/rdf_poc/match_input.jsonl --out data/rdf_poc/match_all.jsonl --workers 3
python build_rdf_dataset.py samples --whole-trust --grounding llm

# 4. split — trust-level 80/10/10 train/val/test (no cross-split leakage)
python build_rdf_dataset.py split

# or run all four:
python build_rdf_dataset.py all --limit 435

The four stages above produce the complete dataset (samples.jsonl and the train/val/test splits). Nothing below is required to build or use it.
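
The no-leakage property of the split stage comes from assigning whole trusts, not individual samples, to a split. The sketch below shows one way to do that deterministically by hashing the trust identifier. It is an assumption for illustration, not the logic of `build_rdf_dataset.py split`; a hash-based assignment only approximates 80/10/10, and `split_for_trust` and `split_samples` are hypothetical names.

```python
import hashlib


def split_for_trust(cik: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a trust identifier to train/val/test deterministically."""
    digest = hashlib.sha256(cik.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def split_samples(samples):
    """Group records by split; all records of one trust land in the same split."""
    out = {"train": [], "val": [], "test": []}
    for rec in samples:
        out[split_for_trust(str(rec["cik"]))].append(rec)
    return out
```

Because the assignment depends only on `cik`, re-running the split or adding new trusts never moves an existing trust to a different split.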

Scoring (optional)

score_baseline.py scores any model's predictions against the N-CEN gold (triple-level P/R/F1, per relation), and also provides a no-model lexical lower bound:

python score_baseline.py stringmatch          # no-model lexical lower bound
python score_baseline.py model --pred preds.jsonl
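
For readers who want to see what triple-level P/R/F1 per relation means, here is a minimal sketch. It assumes triples are exact-match `(s, p, o)` tuples; score_baseline.py may normalize names or match aliases, which is not shown here, and the function names are hypothetical.

```python
from collections import defaultdict


def prf1(gold, pred):
    """Precision, recall and F1 over two collections of (s, p, o) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def per_relation(gold, pred):
    """Compute P/R/F1 separately for each predicate found in gold or pred."""
    gold_by_rel, pred_by_rel = defaultdict(set), defaultdict(set)
    for t in gold:
        gold_by_rel[t[1]].add(t)
    for t in pred:
        pred_by_rel[t[1]].add(t)
    relations = set(gold_by_rel) | set(pred_by_rel)
    return {rel: prf1(gold_by_rel[rel], pred_by_rel[rel]) for rel in relations}
```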

Optional: LLM extractability check

This is a diagnostic / quality-assurance tool only — it is not part of the dataset build and is not needed to train on or use the dataset.

A target triple is only useful if the fact is actually present in the input text. Whether that holds is a semantic question that lexical substring/keyword matching cannot answer reliably (a brand name like "John Hancock" matches the fund heading, not the administrator role; "administration services are provided by X" uses no fixed keyword). llm_extract.py therefore lets a strong open-source instruct model (local, via Ollama; default qwen3.6:35b) read each segment + ontology and extract the triples it can find. What it extracts is, by construction, present in the text — so its output doubles as (a) a strong extraction baseline and (b) a semantic extractability check on the gold.

# requires a running Ollama server with the model pulled
python llm_extract.py --in data/rdf_poc/test.jsonl --out data/rdf_poc/preds_qwen.jsonl
python score_baseline.py model --pred data/rdf_poc/preds_qwen.jsonl

Long inputs → sliding window. A single full-book input (0.5–1 MB) is slow in one huge-context call and suffers from "lost in the middle". llm_extract.py therefore slides an overlapping window over any text longer than --window (default 40 KB, --overlap 8 KB), extracts per window, and unions the de-duplicated triples — so service-provider facts that live only in the SAI section near the end are reliably seen by some window. This windowing is an inference strategy for the check, not a transformation of the dataset (the stored input_text always remains the full text).
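
The windowing idea can be sketched as follows. This is an illustration, not the code in llm_extract.py: it assumes sizes are counted in characters (40 KB taken as 40,000), and `sliding_windows`, `extract_windowed` and `extract_fn` are hypothetical names, with `extract_fn` standing in for the per-window model call.

```python
def sliding_windows(text, window=40_000, overlap=8_000):
    """Yield overlapping slices that together cover the whole text."""
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    if len(text) <= window:
        yield text
        return
    step = window - overlap
    start = 0
    while True:
        yield text[start:start + window]
        if start + window >= len(text):
            break  # the last window reached the end of the text
        start += step


def extract_windowed(text, extract_fn, window=40_000, overlap=8_000):
    """Run extract_fn on each window and union the triples, de-duplicated."""
    seen, merged = set(), []
    for chunk in sliding_windows(text, window, overlap):
        for triple in extract_fn(chunk):
            key = tuple(triple)
            if key not in seen:  # the same fact may be found in two windows
                seen.add(key)
                merged.append(key)
    return merged
```

The overlap ensures that a sentence cut by one window boundary appears whole in the neighbouring window, at the cost of extracting some facts twice, which the de-duplication step absorbs.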

Granularity is what determines agreement. Scored against the small per-fund summary segment, only advisedBy/subAdvisedBy are reliably present (~0.5–0.6 recall) and custodian collapses to ~0.05 — the motivating Cambria case, where the foreign sub-custodians filed in N-CEN never appear in the summary. Scored against the full single-book trust prospectus+SAI, the gold names are present ~90% overall (custodian 0.81, administrator/adviser 0.94, seriesOf 0.98): the information was never missing, the per-fund segment was simply the wrong unit. The genuinely non-text facts (foreign sub-custodians) stay correctly absent, so they should not be training targets. This is why the recommended build is one sample per trust from one full book.

Sample format

Each line of samples.jsonl / train|val|test.jsonl is a JSON record:

| field | meaning |
| --- | --- |
| `input_text` | prospectus prose for the fund (model input) |
| `ontology` | inferred meta-schema (subject type → predicate → object type) |
| `target_triples` | structured `{s,p,o,grounded}` list (`grounded` = object name appears in `input_text`) |
| `target_serialized` | marker form (`<triple_start>` …) for Models 2/4 |
| `target_serialized_plain` | Turtle-like form, no special tokens, for Models 1/3 |
| `cik`, `series_id`, `fund`, `trust_name` | identifiers |
| `stats` | input/target sizes, triple count, `n_grounded`, text:json ratio |

Relations

Entity-to-entity edges (gold from N-CEN / Series-Class): seriesOf, advisedBy, subAdvisedBy, transferAgent, administrator, underwrittenBy.

custodian is dropped by default (--custodian-scope none). Foreign sub-custodians appear only in the structured N-CEN table and in no prose document, and the summary prospectus says only "the custodian", so these names are not extractable from summary text. The primary custodian is named in the separately filed SAI (N-1A Part B), not in the summary prospectus. Use --custodian-scope primary or all to re-include it when the SAI is part of the input, as in the whole-trust build above.

Prose-grounding: every triple carries a grounded flag (object name present in the sample's input). Across the full build ~80 % of triples are grounded (per relation: advisedBy 93 %, seriesOf/subAdvisedBy/administrator 80–84 %, transferAgent 72 %, underwrittenBy 62 %). Filter on grounded to train/evaluate only on text-extractable targets.
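
Filtering on `grounded` could look like the sketch below. It relies only on the `target_triples` field and its `grounded` key from the sample format; the helper names are hypothetical. Note that `target_serialized` and `target_serialized_plain` would no longer match the filtered triple list and would need to be regenerated, which is not shown here.

```python
import json


def keep_grounded(record):
    """Return a copy of the record that keeps only triples flagged as grounded."""
    triples = [t for t in record["target_triples"] if t.get("grounded")]
    return {**record, "target_triples": triples}


def iter_grounded(path):
    """Stream a JSONL file, yielding records that still have at least one triple."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            rec = keep_grounded(json.loads(line))
            if rec["target_triples"]:
                yield rec
```

Usage would be along the lines of `for rec in iter_grounded("data/rdf_poc/train.jsonl"): ...`, where the path is an assumption based on the directory used in the commands above.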

Holdings edges (holds/issuedBy/domiciledIn, gold from N-PORT) are a planned second track from annual-report (N-CSR) commentary — see the description PDF.

Data sources

All inputs are public SEC filings (EDGAR, DERA data sets). The large raw data (prospectus prose, bulk N-CEN/N-PORT/XBRL downloads) and the generated sample files are git-ignored because they are reproducible from the commands above; only the lightweight structured gold graph (data/rdf_poc/gold_graphs.jsonl) is committed.

Requirements

pip install -r requirements.txt
