Fund Prospectus → RDF Triples — Training Dataset
A text→graph extraction dataset: the input is natural-language prose from a U.S. SEC fund prospectus, the output is a graph of entity→entity RDF triples (service-provider relationships of a fund family). Unlike attribute-style datasets, the targets are relationships between named entities, and the input text is much larger than the serialized output (mean text:triples ratio ≈ 6–9×).
Key guarantee: every target triple's entity name is verified to appear in the
sample's input_text (100 % coverage). Triples the grounding model accepted but
whose name is absent from the prose were dropped, so the dataset contains no
target that cannot be derived from its text.
Files (this is all you need — no rebuild required)
Two context sizes are provided, identical triples and splits, differing only in
how much surrounding prose is kept around each cited entity (the _3x variant has
~3× more distractor text per triple — useful to test robustness to longer input).
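As a sanity check on the "identical triples" claim, the two variants can be compared sample by sample. This is a minimal sketch that assumes `sample_id` values are the same in both variants; `load_jsonl`, `triple_set` and `variants_match` are illustrative helper names, not part of the repository.

```python
import json

def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def triple_set(rec):
    """Order-independent view of a record's target triples."""
    return {(t["s"], t["p"], t["o"]) for t in rec["target_triples"]}

def variants_match(standard, wide):
    """True if both record lists cover the same sample_ids with the same triples."""
    a = {r["sample_id"]: triple_set(r) for r in standard}
    b = {r["sample_id"]: triple_set(r) for r in wide}
    return a == b

# Usage, with the file names listed in this section:
# variants_match(load_jsonl("data/rdf_poc/trainset.jsonl"),
#                load_jsonl("data/rdf_poc/trainset_3x.jsonl"))
```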
Standard context (~600 chars each side, ~47 tokens/triple):
| File | Samples | Size | Purpose |
|---|---|---|---|
| `data/rdf_poc/trainset.jsonl` | 332 | ~6 MB | full set (all samples) |
| `data/rdf_poc/train.jsonl` | 262 | ~5 MB | training split |
| `data/rdf_poc/val.jsonl` | 35 | ~0.8 MB | validation split |
| `data/rdf_poc/test.jsonl` | 35 | ~0.5 MB | test split |
3× context (~1800 chars each side, ~132 tokens/triple, median ~3.7k tokens/sample):
| File | Samples | Size |
|---|---|---|
| `data/rdf_poc/trainset_3x.jsonl` | 332 | ~9 MB |
| `data/rdf_poc/train_3x.jsonl` | 262 | ~7 MB |
| `data/rdf_poc/val_3x.jsonl` | 35 | ~1 MB |
| `data/rdf_poc/test_3x.jsonl` | 35 | ~0.8 MB |
Both variants keep the 100 % name-in-text guarantee. The _3x set is produced by
build_rdf_dataset.py trainset --radius 1800 --out ...; any other radius works too.
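To illustrate what a character radius means here, the sketch below keeps `radius` characters on each side of every occurrence of a cited name and merges windows that overlap. It is an assumption about the general technique, not the code in build_rdf_dataset.py; `context_windows` is a hypothetical name and the real merge rule may differ.

```python
def context_windows(text, names, radius=600):
    """Return merged excerpts covering `radius` chars around each name occurrence.

    Illustrative only: exact-substring matching, overlapping windows merged.
    """
    spans = []
    for name in names:
        if not name:
            continue
        start = text.find(name)
        while start != -1:
            lo = max(0, start - radius)
            hi = min(len(text), start + len(name) + radius)
            spans.append((lo, hi))
            start = text.find(name, start + 1)

    spans.sort()
    merged = []
    for lo, hi in spans:
        if merged and lo <= merged[-1][1]:
            # Overlaps (or touches) the previous window: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return [text[lo:hi] for lo, hi in merged]
```

A larger radius produces longer and fewer excerpts, because neighbouring windows start to merge.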
Splits are by trust (CIK): all funds of one trust stay in one split, so the model cannot memorise a trust's providers from another split (no leakage). The assignment is a deterministic hash of the CIK and is reproducible.
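A deterministic, trust-level split can be sketched as follows. The hash function (SHA-256) and the 10 % / 10 % fractions are assumptions chosen for illustration and will not necessarily reproduce the committed split; `assign_split` and `leaked_ciks` are hypothetical helpers.

```python
import hashlib

def assign_split(cik, val_frac=0.1, test_frac=0.1):
    """Map a CIK string to 'train' / 'val' / 'test' via a stable hash."""
    digest = hashlib.sha256(cik.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"

def leaked_ciks(*splits):
    """Return the CIKs that occur in more than one of the given record lists."""
    first_seen, leaked = {}, set()
    for idx, records in enumerate(splits):
        for rec in records:
            if first_seen.setdefault(rec["cik"], idx) != idx:
                leaked.add(rec["cik"])
    return leaked
```

Because the assignment depends only on the CIK, every sample of a trust lands in the same split, and `leaked_ciks(train, val, test)` should return an empty set on the shipped files.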
gold_graphs.jsonl (the raw N-CEN gold graph) is also committed for provenance.
The multi-MB intermediate files (samples_full.jsonl, raw prose/) are not
committed — they are only needed to rebuild from scratch (see README.md).
Record schema
Each line is one JSON object (one fund family / trust):
{
  // <CIK>:<scope>
  "sample_id": "0000899774:ALL",
  // SEC Central Index Key of the trust
  "cik": "0000899774",
  "trust_name": "AB MUNICIPAL INCOME FUND II",
  // focused prospectus excerpts (the model input)
  "input_text": "...INVESTMENT ADVISER: AllianceBernstein L.P. is the investment adviser ...",
  // schema of the relations used in this sample
  "ontology": { "Fund": { "advisedBy": ["InvestmentAdviser"], ... } },
  // the extraction TARGET (a graph)
  "target_triples": [
    { "s": "fund:AB_Massachusetts_Portfolio",
      "p": "advisedBy",
      "o": "org:AllianceBernstein_L_P" },
    ...
  ],
  // marker-token serialization (thesis Models 2 & 4)
  "target_serialized":
    "<triple_start> AB Massachusetts Portfolio <predicate_marker> advisedBy <object_marker> AllianceBernstein L.P. <triple_end> ...",
  // plain Turtle-like serialization (thesis Models 1 & 3)
  "target_serialized_plain":
    "AB Massachusetts Portfolio advisedBy AllianceBernstein L.P. . ...",
  "stats": { "input_chars": 3766, "n_triples": 21, "text_to_json_ratio": 2.1 }
}
Relations (predicates)
seriesOf (Fund→Trust), advisedBy, subAdvisedBy, transferAgent,
custodian, administrator (all Fund→Organization), underwrittenBy
(Trust→Distributor). Entity IRIs are prefixed: fund:, trust:, org:.
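The relation list above can be written down as a small lookup table and used to sanity-check triples. This is a sketch: the table transcribes the domains and ranges stated above, and it assumes distributors use the `org:` prefix, since only `fund:`, `trust:` and `org:` are named as IRI prefixes. `RELATIONS` and `triple_problems` are illustrative names.

```python
# predicate -> (expected subject prefix, expected object prefix)
# Assumption: Distributor entities carry the org: prefix.
RELATIONS = {
    "seriesOf": ("fund:", "trust:"),
    "advisedBy": ("fund:", "org:"),
    "subAdvisedBy": ("fund:", "org:"),
    "transferAgent": ("fund:", "org:"),
    "custodian": ("fund:", "org:"),
    "administrator": ("fund:", "org:"),
    "underwrittenBy": ("trust:", "org:"),
}

def triple_problems(triple):
    """Return a list of human-readable problems; an empty list means the triple looks fine."""
    expected = RELATIONS.get(triple["p"])
    if expected is None:
        return [f"unknown predicate {triple['p']!r}"]
    subj_prefix, obj_prefix = expected
    problems = []
    if not triple["s"].startswith(subj_prefix):
        problems.append(f"subject {triple['s']!r} should start with {subj_prefix!r}")
    if not triple["o"].startswith(obj_prefix):
        problems.append(f"object {triple['o']!r} should start with {obj_prefix!r}")
    return problems
```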
Which serialization to train on?
- Plain (`target_serialized_plain`): natural Turtle-like text; use for a standard seq2seq / causal-LM extraction target.
- Marker tokens (`target_serialized`): uses the special tokens `<triple_start>`, `<predicate_marker>`, `<object_marker>`, `<triple_end>` from the thesis (Section 5.2); add these 4 tokens to the tokenizer if you train on it.
Pick one and keep it consistent across train/val/test.
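If you train on the marker serialization, model output has to be parsed back into triples for scoring. A minimal regex-based sketch, assuming the token order shown in the record schema (`parse_marker_serialization` is a hypothetical helper, not part of the repository):

```python
import re

# One triple: <triple_start> S <predicate_marker> P <object_marker> O <triple_end>
TRIPLE_RE = re.compile(
    r"<triple_start>\s*(.*?)\s*<predicate_marker>\s*(.*?)\s*"
    r"<object_marker>\s*(.*?)\s*<triple_end>",
    re.DOTALL,
)

def parse_marker_serialization(text):
    """Return a list of (subject, predicate, object) name tuples.

    Incomplete trailing fragments (e.g. from truncated generation) are skipped.
    """
    return [m.groups() for m in TRIPLE_RE.finditer(text)]
```

The parser returns surface names, not prefixed IRIs, so compare its output against names parsed from the gold `target_serialized` string rather than against `target_triples` directly.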
Load it
Plain Python:
import json

def load(path):
    # Read as UTF-8 explicitly; prospectus text can contain non-ASCII characters.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

train = load("data/rdf_poc/train.jsonl")
val = load("data/rdf_poc/val.jsonl")
test = load("data/rdf_poc/test.jsonl")
print(len(train), "training samples; example relations:",
      [t["p"] for t in train[0]["target_triples"]][:5])
HuggingFace datasets:
from datasets import load_dataset

ds = load_dataset("json", data_files={
    "train": "data/rdf_poc/train.jsonl",
    "validation": "data/rdf_poc/val.jsonl",
    "test": "data/rdf_poc/test.jsonl",
})
Use it for training (text → triples)
Map each record to an (input, target) pair, then fine-tune any seq2seq or causal LM:
def to_pair(rec, serialization="plain"):
    prompt = (
        "Extract the fund service-provider relationships as RDF triples.\n\n"
        f"TEXT:\n{rec['input_text']}\n\nTRIPLES:\n"
    )
    target = rec["target_serialized_plain" if serialization == "plain"
                 else "target_serialized"]
    return {"input": prompt, "target": target}

pairs = [to_pair(r) for r in train]
A minimal causal-LM fine-tune (HF Transformers/TRL sketch):
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
# if training on the marker serialization, register the special tokens first:
# tok.add_special_tokens({"additional_special_tokens":
#     ["<triple_start>","<predicate_marker>","<object_marker>","<triple_end>"]})
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
model.resize_token_embeddings(len(tok))

def encode(rec):
    p = to_pair(rec)
    text = p["input"] + p["target"] + tok.eos_token
    enc = tok(text, truncation=True, max_length=8192)
    enc["labels"] = enc["input_ids"].copy()  # (mask the prompt for loss if desired)
    return enc

# feed `encode`-d train/val to your Trainer / SFTTrainer.
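The "mask the prompt for loss" option mentioned in the comment can be sketched on plain token-id lists. The helper below is hypothetical (`build_masked_example` is not part of the repository); it relies on the convention that label `-100` is ignored by PyTorch's cross-entropy loss, which Hugging Face causal-LM heads use by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_masked_example(prompt_ids, target_ids, max_length=8192):
    """Concatenate prompt and target ids; only target positions contribute to the loss."""
    input_ids = (list(prompt_ids) + list(target_ids))[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(target_ids))[:max_length]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }

# Possible use with the tokenizer and to_pair from this section:
# p = to_pair(rec)
# prompt_ids = tok(p["input"])["input_ids"]
# target_ids = tok(p["target"] + tok.eos_token, add_special_tokens=False)["input_ids"]
# example = build_masked_example(prompt_ids, target_ids)
```

Two caveats: tokenizing prompt and target separately can differ slightly at the boundary from tokenizing the joined string, and right-truncation can cut off part or all of the target when a prompt approaches `max_length`, so it is worth checking how many samples are affected.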
Evaluation
Compare predicted triples against target_triples (or parse the serialized
string back to triples). Report precision / recall / F1 per relation. The
companion score_baseline.py does fuzzy entity matching (name variants) for
this; treat test.jsonl as held-out.
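For a quick baseline before reaching for fuzzy matching, per-relation precision / recall / F1 can be computed with exact matching on (subject, predicate, object) tuples. This is a sketch with strict string equality, not a description of what score_baseline.py does; `prf_per_relation` is an illustrative name.

```python
from collections import defaultdict

def prf_per_relation(gold, pred):
    """Exact-match precision/recall/F1 per predicate.

    `gold` and `pred` are iterables of (subject, predicate, object) tuples.
    """
    gold_by_p, pred_by_p = defaultdict(set), defaultdict(set)
    for s, p, o in gold:
        gold_by_p[p].add((s, o))
    for s, p, o in pred:
        pred_by_p[p].add((s, o))

    scores = {}
    for p in sorted(set(gold_by_p) | set(pred_by_p)):
        g, h = gold_by_p[p], pred_by_p[p]
        tp = len(g & h)
        precision = tp / len(h) if h else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[p] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": len(g)}
    return scores
```

Exact matching penalizes harmless name variants (for example a missing "L.P."), so these numbers are a lower bound on what fuzzy entity matching would report.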
Provenance & caveats
- Gold = SEC N-CEN structured filings (service providers), 2025 Q3.
- Input text = the current prospectus (485BPOS) + supplements per trust, fetched from EDGAR.
- Grounding = a 122B open model role-checks each gold triple against the prose (does the entity actually play that role here?), then a lexical pass guarantees the name is in `input_text`.
- Because the prospectus is the current version while gold is 2025 Q3, ~2–3 % of triples name a provider the text describes as a former one (entity+role still present, so still valid extraction signal). See the note in `build_rdf_dataset.py` (`build_trainset`).
Rebuild from scratch (optional)
You do not need this to use the dataset. To regenerate everything (SEC download + ~5 h LLM grounding), see README.md. In short:
python build_rdf_dataset.py gold --ncen data/ncen/2025q3
python build_rdf_dataset.py fetch
python build_rdf_dataset.py samples
python llm_extract.py --mode match --backend vllm --in ... --out data/rdf_poc/match_all.jsonl
python build_rdf_dataset.py label
python build_rdf_dataset.py trainset
python build_rdf_dataset.py split --src data/rdf_poc/trainset.jsonl