# Fund Prospectus → RDF Triples: Training Dataset

A text→graph extraction dataset: the **input** is natural-language prose from a U.S. SEC fund prospectus, and the **output** is a graph of entity→entity RDF triples (the service-provider relationships of a fund family). Unlike attribute-style datasets, the targets are *relationships between named entities*, and the input text is much larger than the serialized output (mean text:triples ratio ≈ 6–9×).

**Key guarantee:** every target triple's entity name is verified to appear in the sample's `input_text` (100 % coverage). Triples that the grounding model accepted but whose name is absent from the prose were dropped, so the dataset contains no target that cannot be derived from its text.

## Files (this is all you need; no rebuild required)

Two context sizes are provided. They have identical triples and splits and differ only in how much surrounding prose is kept around each cited entity (the `_3x` variant has ~3× more distractor text per triple, which is useful for testing robustness to longer input).

**Standard context (~600 chars each side, ~47 tokens/triple):**

| File | Samples | Size | Purpose |
|------|--------:|-----:|---------|
| `data/rdf_poc/trainset.jsonl` | 332 | ~6 MB | full set (all samples) |
| `data/rdf_poc/train.jsonl` | 262 | ~5 MB | training split |
| `data/rdf_poc/val.jsonl` | 35 | ~0.8 MB | validation split |
| `data/rdf_poc/test.jsonl` | 35 | ~0.5 MB | test split |

**3× context (~1800 chars each side, ~132 tokens/triple, median ~3.7k tokens/sample):**

| File | Samples | Size |
|------|--------:|-----:|
| `data/rdf_poc/trainset_3x.jsonl` | 332 | ~9 MB |
| `data/rdf_poc/train_3x.jsonl` | 262 | ~7 MB |
| `data/rdf_poc/val_3x.jsonl` | 35 | ~1 MB |
| `data/rdf_poc/test_3x.jsonl` | 35 | ~0.8 MB |

Both variants keep the 100 % name-in-text guarantee. The `_3x` set is produced by `build_rdf_dataset.py trainset --radius 1800 --out ...`; any other radius works too.
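The claim that both context sizes carry identical triples can be spot-checked with a few lines of Python. The sketch below is only an illustration: the helper names `triple_set` and `variant_mismatches` are hypothetical (they are not part of the repository), and it assumes each record exposes `sample_id` and `target_triples` with `s`/`p`/`o` keys as described in the record schema.

```python
import json


def load(path):
    """Read a JSON Lines file into a list of dicts (blank lines are skipped)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def triple_set(rec):
    """Hypothetical helper: the record's target triples as a set of (s, p, o) tuples."""
    return {(t["s"], t["p"], t["o"]) for t in rec["target_triples"]}


def variant_mismatches(standard, wide):
    """Hypothetical helper: sample_ids that are missing from one variant
    or whose triple sets differ between the two variants."""
    a = {r["sample_id"]: triple_set(r) for r in standard}
    b = {r["sample_id"]: triple_set(r) for r in wide}
    return sorted(sid for sid in a.keys() | b.keys() if a.get(sid) != b.get(sid))
```

If the two variants really do share triples and splits, `variant_mismatches(load("data/rdf_poc/train.jsonl"), load("data/rdf_poc/train_3x.jsonl"))` should return an empty list, and likewise for the `val` and `test` pairs.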
Splits are **by trust (CIK)**: all funds of one trust stay in one split, so the model cannot memorise a trust's providers from another split (no leakage). The assignment is a deterministic hash of the CIK and is reproducible.

`gold_graphs.jsonl` (the raw N-CEN gold graph) is also committed for provenance. The multi-MB intermediate files (`samples_full.jsonl`, raw `prose/`) are **not** committed; they are only needed to rebuild from scratch (see README.md).

## Record schema

Each line is one JSON object (one fund family / trust):

```jsonc
{
  "sample_id": "0000899774:ALL",            // CIK, a colon, then a suffix ("ALL" here)
  "cik": "0000899774",                      // SEC Central Index Key of the trust
  "trust_name": "AB MUNICIPAL INCOME FUND II",
  // focused prospectus excerpts (the model input)
  "input_text": "...INVESTMENT ADVISER: AllianceBernstein L.P. is the investment adviser ...",
  // schema of the relations used in this sample
  "ontology": { "Fund": { "advisedBy": ["InvestmentAdviser"], ... } },
  "target_triples": [                       // the extraction TARGET (a graph)
    { "s": "fund:AB_Massachusetts_Portfolio", "p": "advisedBy", "o": "org:AllianceBernstein_L_P" },
    ...
  ],
  // marker-token serialization (thesis Models 2 & 4); marker tokens omitted here
  "target_serialized": " AB Massachusetts Portfolio advisedBy AllianceBernstein L.P. ...",
  // plain Turtle-like serialization (thesis Models 1 & 3)
  "target_serialized_plain": "AB Massachusetts Portfolio advisedBy AllianceBernstein L.P. . ...",
  "stats": { "input_chars": 3766, "n_triples": 21, "text_to_json_ratio": 2.1 }
}
```

### Relations (predicates)

`seriesOf` (Fund→Trust), `advisedBy`, `subAdvisedBy`, `transferAgent`, `custodian`, `administrator` (all Fund→Organization), `underwrittenBy` (Trust→Distributor). Entity IRIs are prefixed: `fund:`, `trust:`, `org:`.

### Which serialization to train on?

- **Plain** (`target_serialized_plain`): natural Turtle-like text; use for a standard seq2seq / causal-LM extraction target.
- **Marker tokens** (`target_serialized`): uses the four special marker tokens from the thesis (Section 5.2); add these 4 tokens to the tokenizer if you train on it.

Pick one and keep it consistent across train/val/test.

## Load it

Plain Python:

```python
import json

def load(path):
    # One JSON object per line (JSON Lines).
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

train = load("data/rdf_poc/train.jsonl")
val = load("data/rdf_poc/val.jsonl")
test = load("data/rdf_poc/test.jsonl")

print(len(train), "training samples; example relations:",
      [t["p"] for t in train[0]["target_triples"]][:5])
```

HuggingFace `datasets`:

```python
from datasets import load_dataset

ds = load_dataset("json", data_files={
    "train": "data/rdf_poc/train.jsonl",
    "validation": "data/rdf_poc/val.jsonl",
    "test": "data/rdf_poc/test.jsonl",
})
```

## Use it for training (text → triples)

Map each record to an (input, target) pair, then fine-tune any seq2seq or causal LM:

```python
def to_pair(rec, serialization="plain"):
    prompt = (
        "Extract the fund service-provider relationships as RDF triples.\n\n"
        f"TEXT:\n{rec['input_text']}\n\nTRIPLES:\n"
    )
    # "plain" selects the Turtle-like target; anything else selects the marker-token target.
    key = "target_serialized_plain" if serialization == "plain" else "target_serialized"
    return {"input": prompt, "target": rec[key]}

pairs = [to_pair(r) for r in train]
```

A minimal causal-LM fine-tune (HF Transformers/TRL sketch):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
# if training on the marker serialization, register the special tokens first
# (fill in the four marker tokens from thesis Section 5.2):
# tok.add_special_tokens({"additional_special_tokens":
#     ["","","",""]})
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
model.resize_token_embeddings(len(tok))

def encode(rec):
    p = to_pair(rec)
    text = p["input"] + p["target"] + tok.eos_token
    enc = tok(text, truncation=True, max_length=8192)
    enc["labels"] = enc["input_ids"].copy()  # (mask the prompt for loss if desired)
    return enc

# feed `encode`-d train/val to your Trainer / SFTTrainer.
```

### Evaluation

Compare predicted triples against `target_triples` (or parse the serialized string back to triples). Report precision / recall / F1 per relation. The companion `score_baseline.py` does fuzzy entity matching (name variants) for this; treat `test.jsonl` as held-out.

## Provenance & caveats

- **Gold** = SEC N-CEN structured filings (service providers), 2025 Q3.
- **Input text** = the *current* prospectus (485BPOS) + supplements per trust, fetched from EDGAR.
- **Grounding** = a 122B open model role-checks each gold triple against the prose (does the entity actually play that role here?), then a lexical pass guarantees the name is in `input_text`.
- Because the prospectus is the *current* version while gold is 2025 Q3, ~2–3 % of triples name a provider the text describes as a *former* one (the entity and role are still present, so the triple is still a valid extraction signal). See the note in `build_rdf_dataset.py` (`build_trainset`).

## Rebuild from scratch (optional)

You do **not** need this to use the dataset. To regenerate everything (SEC download + ~5 h LLM grounding), see README.md. In short:

```
python build_rdf_dataset.py gold --ncen data/ncen/2025q3
python build_rdf_dataset.py fetch
python build_rdf_dataset.py samples
python llm_extract.py --mode match --backend vllm --in ... --out data/rdf_poc/match_all.jsonl
python build_rdf_dataset.py label
python build_rdf_dataset.py trainset
python build_rdf_dataset.py split --src data/rdf_poc/trainset.jsonl
```
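
After a rebuild (or a fresh checkout), it can be worth confirming that the trust-level split still holds, i.e. that no CIK shows up in more than one split. The following is a minimal sketch, not part of the repository: `cik_leaks` is a hypothetical helper name, and it assumes only that every record carries the `cik` field shown in the record schema.

```python
from collections import defaultdict


def cik_leaks(splits):
    """Hypothetical helper: map each CIK found in more than one split
    to the sorted list of split names it appears in.

    `splits` is a dict such as {"train": [...], "val": [...], "test": [...]},
    where each value is a list of dataset records.
    """
    seen = defaultdict(set)
    for name, records in splits.items():
        for rec in records:
            seen[rec["cik"]].add(name)
    return {cik: sorted(names) for cik, names in seen.items() if len(names) > 1}
```

Called on the three loaded splits, for example `cik_leaks({"train": train, "val": val, "test": test})`, an empty dict means no trust crosses a split boundary; any entry names the offending CIK and the splits it leaked into.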