- build_trainset gains --radius (chars each side of the cited name) and --out; merge-gap scales with radius. Default 600 unchanged.
- trainset_3x + train/val/test_3x.jsonl: same 10,519 triples and same trust split, but ~3x more surrounding prose per triple (~47 -> ~132 tokens/triple, median ~3.7k tokens/sample). Keeps the 100% name-in-text guarantee.
- DATASET.md documents both context sizes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

# Fund Prospectus → RDF Triples: Training Dataset

A text→graph extraction dataset: the **input** is natural-language prose from a
U.S. SEC fund prospectus, the **output** is a graph of entity→entity RDF triples
(service-provider relationships of a fund family). Unlike attribute-style datasets,
the targets are *relationships between named entities*, and the input text is much
larger than the serialized output (mean text:triples ratio ≈ 6–9×).

**Key guarantee:** every target triple's entity name is verified to appear in the
sample's `input_text` (100 % coverage). Triples the grounding model accepted but
whose name is absent from the prose were dropped, so the dataset contains no
target that cannot be derived from its text.

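To spot-check the name-in-text guarantee on a loaded record, something like the sketch below may help. It is not the check used by the build script: the helper names `normalize` and `ungrounded_triples` are hypothetical, and it assumes the local part of an IRI (after `org:`, `fund:`, `trust:`) matches the surface name in the prose once case and punctuation are ignored.

```python
import re


def normalize(s: str) -> str:
    """Lowercase and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", s.lower())


def ungrounded_triples(rec: dict, field: str = "o") -> list:
    """Return the triples whose `field` entity name is not found in input_text.

    Assumption: the IRI local name (e.g. 'AllianceBernstein_L_P') equals the
    surface name ('AllianceBernstein L.P.') up to case and punctuation.
    """
    haystack = normalize(rec["input_text"])
    missing = []
    for t in rec["target_triples"]:
        local_name = t[field].split(":", 1)[-1]  # drop the 'org:' style prefix
        if normalize(local_name) not in haystack:
            missing.append(t)
    return missing
```

On a conforming record the function should return an empty list. Because the comparison ignores spaces, it can in principle match across word boundaries, so treat it as a quick sanity check only.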
## Files (this is all you need; no rebuild required)

Two context sizes are provided, with identical triples and splits, differing only in
how much surrounding prose is kept around each cited entity (the `_3x` variant has
~3× more distractor text per triple, which is useful for testing robustness to longer input).

**Standard context (~600 chars each side, ~47 tokens/triple):**

| File | Samples | Size | Purpose |
|------|--------:|-----:|---------|
| `data/rdf_poc/trainset.jsonl` | 332 | ~6 MB | full set (all samples) |
| `data/rdf_poc/train.jsonl` | 262 | ~5 MB | training split |
| `data/rdf_poc/val.jsonl` | 35 | ~0.8 MB | validation split |
| `data/rdf_poc/test.jsonl` | 35 | ~0.5 MB | test split |

**3× context (~1800 chars each side, ~132 tokens/triple, median ~3.7k tokens/sample):**

| File | Samples | Size |
|------|--------:|-----:|
| `data/rdf_poc/trainset_3x.jsonl` | 332 | ~9 MB |
| `data/rdf_poc/train_3x.jsonl` | 262 | ~7 MB |
| `data/rdf_poc/val_3x.jsonl` | 35 | ~1 MB |
| `data/rdf_poc/test_3x.jsonl` | 35 | ~0.8 MB |

Both variants keep the 100 % name-in-text guarantee. The `_3x` set is produced by
`build_rdf_dataset.py trainset --radius 1800 --out ...`; any other radius works too.

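To picture what the radius controls, here is an illustrative sketch of radius-based excerpting: keep `radius` characters on each side of every mention and merge windows that overlap or lie close together. This is not the code in `build_rdf_dataset.py`; the function name `excerpt_windows`, the `merge_gap` parameter, and its default of half the radius are assumptions made for the example.

```python
def excerpt_windows(text, names, radius=600, merge_gap=None):
    """Return text excerpts of `radius` chars around each occurrence of each name.

    Windows closer than `merge_gap` characters are merged into one excerpt.
    Assumption: merge_gap defaults to radius // 2 (the real scaling may differ).
    """
    if merge_gap is None:
        merge_gap = radius // 2
    spans = []
    for name in names:
        if not name:
            continue
        start = text.find(name)
        while start != -1:
            lo = max(0, start - radius)
            hi = min(len(text), start + len(name) + radius)
            spans.append((lo, hi))
            start = text.find(name, start + 1)
    spans.sort()
    merged = []
    for lo, hi in spans:
        if merged and lo - merged[-1][1] <= merge_gap:
            # Close enough to the previous window: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return [text[lo:hi] for lo, hi in merged]
```

A larger radius both widens each window and makes neighbouring windows more likely to merge, which is why the `_3x` files grow by less than 3× in bytes.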
Splits are **by trust (CIK)**: all funds of one trust stay in one split, so the
model cannot memorise a trust's providers from another split (no leakage). The
assignment is a deterministic hash of the CIK and is reproducible.

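A deterministic, hash-based split of this kind could look like the sketch below. The exact hash function and thresholds used for the published splits are not stated here, so SHA-256 and the roughly 80/10/10 buckets are assumptions, as are the helper names `split_for_cik` and `assert_no_trust_leakage`. The second helper only checks that no CIK occurs in more than one loaded split.

```python
import hashlib


def split_for_cik(cik: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a CIK to 'train', 'val' or 'test' via a stable hash bucket (0..99).

    Assumption: SHA-256 and these percentages; the dataset's own rule may differ.
    """
    bucket = int(hashlib.sha256(cik.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def assert_no_trust_leakage(splits: dict) -> None:
    """Raise ValueError if the same CIK appears in two different splits."""
    seen = {}
    for split_name, records in splits.items():
        for rec in records:
            first = seen.setdefault(rec["cik"], split_name)
            if first != split_name:
                raise ValueError(
                    f"CIK {rec['cik']} appears in both {first} and {split_name}"
                )
```

With the files loaded into lists, `assert_no_trust_leakage({"train": train, "val": val, "test": test})` should return without raising.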
`gold_graphs.jsonl` (the raw N-CEN gold graph) is also committed for provenance.
The multi-MB intermediate files (`samples_full.jsonl`, raw `prose/`) are **not**
committed; they are only needed to rebuild from scratch (see README.md).

## Record schema

Each line is one JSON object (one fund family / trust):

```jsonc
{
  "sample_id": "0000899774:ALL",   // <CIK>:<scope>
  "cik": "0000899774",             // SEC Central Index Key of the trust
  "trust_name": "AB MUNICIPAL INCOME FUND II",
  "input_text": "...INVESTMENT ADVISER: AllianceBernstein L.P. is the investment adviser ...",
      // focused prospectus excerpts (the model input)
  "ontology": { "Fund": { "advisedBy": ["InvestmentAdviser"], ... } },
      // schema of the relations used in this sample
  "target_triples": [              // the extraction TARGET (a graph)
    { "s": "fund:AB_Massachusetts_Portfolio",
      "p": "advisedBy",
      "o": "org:AllianceBernstein_L_P" },
    ...
  ],
  "target_serialized":
    "<triple_start> AB Massachusetts Portfolio <predicate_marker> advisedBy <object_marker> AllianceBernstein L.P. <triple_end> ...",
      // marker-token serialization (thesis Models 2 & 4)
  "target_serialized_plain":
    "AB Massachusetts Portfolio advisedBy AllianceBernstein L.P. . ...",
      // plain Turtle-like serialization (thesis Models 1 & 3)
  "stats": { "input_chars": 3766, "n_triples": 21, "text_to_json_ratio": 2.1 }
}
```

### Relations (predicates)

`seriesOf` (Fund→Trust), `advisedBy`, `subAdvisedBy`, `transferAgent`,
`custodian`, `administrator` (all Fund→Organization), `underwrittenBy`
(Trust→Distributor). Entity IRIs are prefixed: `fund:`, `trust:`, `org:`.

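The relation signatures above can be turned into a small consistency check on `target_triples` or on model output. The sketch below is one possible way to do it. The table `RELATION_PREFIXES` and the function `signature_errors` are hypothetical names, and the mapping of Distributor entities to the `org:` prefix is an assumption, since only three prefixes are listed.

```python
# Expected (subject prefix, object prefix) for each predicate.
RELATION_PREFIXES = {
    "seriesOf": ("fund", "trust"),
    "advisedBy": ("fund", "org"),
    "subAdvisedBy": ("fund", "org"),
    "transferAgent": ("fund", "org"),
    "custodian": ("fund", "org"),
    "administrator": ("fund", "org"),
    "underwrittenBy": ("trust", "org"),  # assumption: distributors use 'org:'
}


def signature_errors(triples):
    """Return human-readable problems for triples that break the signatures."""
    errors = []
    for t in triples:
        expected = RELATION_PREFIXES.get(t["p"])
        if expected is None:
            errors.append(f"unknown predicate: {t['p']}")
            continue
        actual = (t["s"].split(":", 1)[0], t["o"].split(":", 1)[0])
        if actual != expected:
            errors.append(
                f"{t['p']}: expected {expected[0]}->{expected[1]}, "
                f"got {actual[0]}->{actual[1]}"
            )
    return errors
```

An empty result means every triple uses a known predicate with the expected subject and object prefixes.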
### Which serialization to train on?

- **Plain** (`target_serialized_plain`): natural Turtle-like text; use for a
  standard seq2seq / causal-LM extraction target.
- **Marker tokens** (`target_serialized`): uses the special tokens
  `<triple_start> <predicate_marker> <object_marker> <triple_end>` from the
  thesis (Section 5.2); add these 4 tokens to the tokenizer if you train on it.

Pick one and keep it consistent across train/val/test.

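If you train on the marker serialization, model output has to be parsed back into triples before scoring. A minimal regex-based sketch follows; it assumes the four markers appear in the order shown in the record schema, and the name `parse_marker_serialization` is hypothetical.

```python
import re

# One triple: <triple_start> S <predicate_marker> P <object_marker> O <triple_end>
TRIPLE_RE = re.compile(
    r"<triple_start>\s*(.*?)\s*<predicate_marker>\s*(.*?)\s*"
    r"<object_marker>\s*(.*?)\s*<triple_end>",
    re.DOTALL,
)


def parse_marker_serialization(s: str):
    """Return a list of (subject, predicate, object) surface-name tuples.

    Malformed or truncated triples (e.g. a missing <triple_end>) are skipped.
    """
    return [tuple(groups) for groups in TRIPLE_RE.findall(s)]
```

The parsed subjects and objects are surface names (for example `AllianceBernstein L.P.`), not the prefixed IRIs stored in `target_triples`, so compare like with like when scoring.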
## Load it

Plain Python:

```python
import json

def load(path):
    # One JSON object per line; skip blank lines defensively.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

train = load("data/rdf_poc/train.jsonl")
val   = load("data/rdf_poc/val.jsonl")
test  = load("data/rdf_poc/test.jsonl")
print(len(train), "training samples; example relations:",
      [t["p"] for t in train[0]["target_triples"]][:5])
```

HuggingFace `datasets`:

```python
from datasets import load_dataset

ds = load_dataset("json", data_files={
    "train": "data/rdf_poc/train.jsonl",
    "validation": "data/rdf_poc/val.jsonl",
    "test": "data/rdf_poc/test.jsonl",
})
```

## Use it for training (text → triples)

Map each record to an (input, target) pair, then fine-tune any seq2seq or
causal LM:

```python
def to_pair(rec, serialization="plain"):
    prompt = (
        "Extract the fund service-provider relationships as RDF triples.\n\n"
        f"TEXT:\n{rec['input_text']}\n\nTRIPLES:\n"
    )
    target = rec["target_serialized_plain" if serialization == "plain"
                 else "target_serialized"]
    return {"input": prompt, "target": target}

pairs = [to_pair(r) for r in train]
```

A minimal causal-LM fine-tune (HF Transformers/TRL sketch):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
# if training on the marker serialization, register the special tokens first:
# tok.add_special_tokens({"additional_special_tokens":
#     ["<triple_start>","<predicate_marker>","<object_marker>","<triple_end>"]})
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
model.resize_token_embeddings(len(tok))

def encode(rec):
    p = to_pair(rec)
    text = p["input"] + p["target"] + tok.eos_token
    enc = tok(text, truncation=True, max_length=8192)
    enc["labels"] = enc["input_ids"].copy()   # (mask the prompt for loss if desired)
    return enc

# feed `encode`-d train/val to your Trainer / SFTTrainer.
```

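If you do want to mask the prompt so that loss is computed only on the triples, one common approach is to set the prompt positions in `labels` to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below works on plain lists of token ids so it stays independent of any particular tokenizer; `build_masked_example` is a hypothetical helper, not part of this repository.

```python
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default


def build_masked_example(prompt_ids, target_ids, max_length=8192):
    """Concatenate prompt and target ids; supervise only the target tokens.

    Truncation is from the right, so a very long prompt can leave no
    supervised tokens at all; check for that before training.
    """
    input_ids = (list(prompt_ids) + list(target_ids))[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(target_ids))[:max_length]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }
```

In practice the two id lists would come from tokenizing `p["input"]` and `p["target"] + tok.eos_token` separately, the latter with `add_special_tokens=False`. Tokenizing the two parts separately can differ slightly at the boundary from tokenizing the joined string.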
### Evaluation

Compare predicted triples against `target_triples` (or parse the serialized
string back to triples). Report precision / recall / F1 per relation. The
companion `score_baseline.py` does fuzzy entity matching (name variants) for
this; treat `test.jsonl` as held-out.

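For a quick first number, per-relation precision, recall and F1 can be computed with exact matching on `(s, o)` pairs, as in the sketch below. This is stricter than the fuzzy matching in `score_baseline.py` and is not a substitute for it; `prf_per_relation` is a hypothetical helper name.

```python
from collections import defaultdict


def prf_per_relation(gold, pred):
    """Exact-match precision / recall / F1 for each predicate.

    `gold` and `pred` are lists of {"s": ..., "p": ..., "o": ...} dicts.
    """
    gold_by_p, pred_by_p = defaultdict(set), defaultdict(set)
    for t in gold:
        gold_by_p[t["p"]].add((t["s"], t["o"]))
    for t in pred:
        pred_by_p[t["p"]].add((t["s"], t["o"]))

    scores = {}
    for p in sorted(set(gold_by_p) | set(pred_by_p)):
        g, h = gold_by_p[p], pred_by_p[p]
        tp = len(g & h)  # triples present in both gold and prediction
        precision = tp / len(h) if h else 0.0
        recall = tp / len(g) if g else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[p] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

Duplicate predictions count once because triples are compared as sets. To aggregate over a split, pool the gold and predicted triples of all samples (keeping the sample id in the key if entity names can collide across trusts) before calling the function.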
## Provenance & caveats

- **Gold** = SEC N-CEN structured filings (service providers), 2025 Q3.
- **Input text** = the *current* prospectus (485BPOS) + supplements per trust,
  fetched from EDGAR.
- **Grounding** = a 122B open model role-checks each gold triple against the
  prose (does the entity actually play that role here?), then a lexical pass
  guarantees the name is in `input_text`.
- Because the prospectus is the *current* version while gold is 2025 Q3, ~2–3 %
  of triples name a provider the text describes as a *former* one (entity+role
  still present, so still valid extraction signal). See the note in
  `build_rdf_dataset.py` (`build_trainset`).

## Rebuild from scratch (optional)

You do **not** need this to use the dataset. To regenerate everything (SEC
download + ~5 h LLM grounding), see README.md. In short:

```
python build_rdf_dataset.py gold --ncen data/ncen/2025q3
python build_rdf_dataset.py fetch
python build_rdf_dataset.py samples
python llm_extract.py --mode match --backend vllm --in ... --out data/rdf_poc/match_all.jsonl
python build_rdf_dataset.py label
python build_rdf_dataset.py trainset
python build_rdf_dataset.py split --src data/rdf_poc/trainset.jsonl
```