fund_rfid_data/DATASET.md
Florian Herzog 9dc870b8d0 Add 3x-context dataset variant (trainset --radius)
- build_trainset gains --radius (chars each side of the cited name) and --out;
  merge-gap scales with radius. Default 600 unchanged.
- trainset_3x + train/val/test_3x.jsonl: same 10,519 triples and same trust split,
  but ~3x more surrounding prose per triple (~47 -> ~132 tokens/triple, median
  ~3.7k tokens/sample). Keeps the 100% name-in-text guarantee.
- DATASET.md documents both context sizes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 16:37:30 +02:00

# Fund Prospectus → RDF Triples — Training Dataset
A text→graph extraction dataset: the **input** is natural-language prose from a
U.S. SEC fund prospectus, the **output** is a graph of entity→entity RDF triples
(service-provider relationships of a fund family). Unlike attribute-style datasets,
the targets are *relationships between named entities*, and the input text is much
larger than the serialized output (mean text:triples ratio ≈ 69×).
**Key guarantee:** every target triple's entity name is verified to appear in the
sample's `input_text` (100 % coverage). Triples the grounding model accepted but
whose name is absent from the prose were dropped, so the dataset contains no
target that cannot be derived from its text.
## Files (this is all you need — no rebuild required)
Two context sizes are provided. They contain identical triples and splits and
differ only in how much surrounding prose is kept around each cited entity (the
`_3x` variant has ~3× more distractor text per triple, which is useful for
testing robustness to longer input).
**Standard context (~600 chars each side, ~47 tokens/triple):**
| File | Samples | Size | Purpose |
|------|--------:|-----:|---------|
| `data/rdf_poc/trainset.jsonl` | 332 | ~6 MB | full set (all samples) |
| `data/rdf_poc/train.jsonl` | 262 | ~5 MB | training split |
| `data/rdf_poc/val.jsonl` | 35 | ~0.8 MB | validation split |
| `data/rdf_poc/test.jsonl` | 35 | ~0.5 MB | test split |
**3× context (~1800 chars each side, ~132 tokens/triple, median ~3.7k tokens/sample):**
| File | Samples | Size |
|------|--------:|-----:|
| `data/rdf_poc/trainset_3x.jsonl` | 332 | ~9 MB |
| `data/rdf_poc/train_3x.jsonl` | 262 | ~7 MB |
| `data/rdf_poc/val_3x.jsonl` | 35 | ~1 MB |
| `data/rdf_poc/test_3x.jsonl` | 35 | ~0.8 MB |
Both variants keep the 100 % name-in-text guarantee. The `_3x` set is produced by
`build_rdf_dataset.py trainset --radius 1800 --out ...`; any other radius works too.
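
To make the `--radius` idea concrete, the sketch below keeps a window of `radius` characters on each side of every occurrence of a cited name and merges windows that lie within `merge_gap` characters of each other. It is only an illustration: the function name `context_windows`, its parameters, and the merging rule are assumptions, not the code in `build_rdf_dataset.py`.

```python
def context_windows(text, names, radius=600, merge_gap=0):
    """Return text excerpts around each occurrence of the given names.

    Hypothetical helper: illustrates the radius idea only, it is not the
    implementation used to build the dataset.
    """
    spans = []
    for name in names:
        start = text.find(name)
        while start != -1:
            lo = max(0, start - radius)
            hi = min(len(text), start + len(name) + radius)
            spans.append((lo, hi))
            start = text.find(name, start + 1)

    # Merge windows that overlap or lie within merge_gap characters.
    spans.sort()
    merged = []
    for lo, hi in spans:
        if merged and lo - merged[-1][1] <= merge_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return [text[lo:hi] for lo, hi in merged]


# Toy text with two mentions of the same name.
demo_text = "a" * 10 + "ACME" + "b" * 10 + "ACME" + "c" * 10
narrow = context_windows(demo_text, ["ACME"], radius=3)  # two separate excerpts
wide = context_windows(demo_text, ["ACME"], radius=8)    # windows overlap, one excerpt
```

A larger radius yields fewer but longer excerpts, which is why the `_3x` files are larger even though the triples are the same.
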
Splits are **by trust (CIK)**: all funds of one trust stay in one split, so the
model cannot memorise a trust's providers from another split (no leakage). The
assignment is a deterministic hash of the CIK and is reproducible.
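
A deterministic, trust-level assignment can be sketched as follows. The hash function (SHA-256), the roughly 80/10/10 thresholds, and the name `assign_split` are assumptions chosen for illustration; they are not necessarily what `build_rdf_dataset.py split` does, so use the committed split files if you need the published splits.

```python
import hashlib


def assign_split(cik: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a CIK to 'train', 'val' or 'test' (hypothetical scheme)."""
    digest = hashlib.sha256(cik.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable across runs and machines
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


# Two records of the same trust (the second sample_id is made up).
records = [
    {"sample_id": "0000899774:ALL", "cik": "0000899774"},
    {"sample_id": "0000899774:OTHER", "cik": "0000899774"},
]
splits = {r["sample_id"]: assign_split(r["cik"]) for r in records}
```

Because the split depends only on the CIK, every record of a trust lands in the same split, and rerunning the assignment gives the same result.
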
`gold_graphs.jsonl` (the raw N-CEN gold graph) is also committed for provenance.
The multi-MB intermediate files (`samples_full.jsonl`, raw `prose/`) are **not**
committed — they are only needed to rebuild from scratch (see README.md).
## Record schema
Each line is one JSON object (one fund family / trust):
```jsonc
{
  "sample_id": "0000899774:ALL",   // <CIK>:<scope>
  "cik": "0000899774",             // SEC Central Index Key of the trust
  "trust_name": "AB MUNICIPAL INCOME FUND II",
  "input_text": "...INVESTMENT ADVISER: AllianceBernstein L.P. is the investment adviser ...",
                                   // focused prospectus excerpts (the model input)
  "ontology": { "Fund": { "advisedBy": ["InvestmentAdviser"], ... } },
                                   // schema of the relations used in this sample
  "target_triples": [              // the extraction TARGET (a graph)
    { "s": "fund:AB_Massachusetts_Portfolio",
      "p": "advisedBy",
      "o": "org:AllianceBernstein_L_P" },
    ...
  ],
  "target_serialized":
    "<triple_start> AB Massachusetts Portfolio <predicate_marker> advisedBy <object_marker> AllianceBernstein L.P. <triple_end> ...",
                                   // marker-token serialization (thesis Models 2 & 4)
  "target_serialized_plain":
    "AB Massachusetts Portfolio advisedBy AllianceBernstein L.P. . ...",
                                   // plain Turtle-like serialization (thesis Models 1 & 3)
  "stats": { "input_chars": 3766, "n_triples": 21, "text_to_json_ratio": 2.1 }
}
```
### Relations (predicates)
`seriesOf` (Fund→Trust), `advisedBy`, `subAdvisedBy`, `transferAgent`,
`custodian`, `administrator` (all Fund→Organization), `underwrittenBy`
(Trust→Distributor). Entity IRIs are prefixed: `fund:`, `trust:`, `org:`.
### Which serialization to train on?
- **Plain** (`target_serialized_plain`) — natural Turtle-like text; use for a
standard seq2seq / causal-LM extraction target.
- **Marker tokens** (`target_serialized`) — uses the special tokens
`<triple_start> <predicate_marker> <object_marker> <triple_end>` from the
thesis (Section 5.2); add these 4 tokens to the tokenizer if you train on it.
Pick one and keep it consistent across train/val/test.
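
If you train on the marker serialization, model output has to be parsed back into triples before scoring. A minimal sketch, assuming the output follows the layout of the example record (markers separated from names by whitespace); `parse_marker_serialization` is a hypothetical helper, not part of the repository:

```python
import re

# One triple: <triple_start> S <predicate_marker> P <object_marker> O <triple_end>
TRIPLE_RE = re.compile(
    r"<triple_start>\s*(.*?)\s*<predicate_marker>\s*(.*?)\s*"
    r"<object_marker>\s*(.*?)\s*<triple_end>",
    re.DOTALL,
)


def parse_marker_serialization(text):
    """Return a list of (subject, predicate, object) tuples found in text."""
    return [(s, p, o) for s, p, o in TRIPLE_RE.findall(text)]


example = (
    "<triple_start> AB Massachusetts Portfolio <predicate_marker> advisedBy "
    "<object_marker> AllianceBernstein L.P. <triple_end>"
)
parsed = parse_marker_serialization(example)
```

Incomplete triples (for example a missing `<triple_end>` after truncated generation) are silently skipped by this regex. The plain serialization is harder to parse unambiguously, because names can themselves contain spaces and periods.
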
## Load it
Plain Python:
```python
import json

def load(path):
    # One JSON object per line (JSONL); the text contains non-ASCII characters.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

train = load("data/rdf_poc/train.jsonl")
val = load("data/rdf_poc/val.jsonl")
test = load("data/rdf_poc/test.jsonl")
print(len(train), "training samples; example relations:",
      [t["p"] for t in train[0]["target_triples"]][:5])
```
HuggingFace `datasets`:
```python
from datasets import load_dataset
ds = load_dataset("json", data_files={
    "train": "data/rdf_poc/train.jsonl",
    "validation": "data/rdf_poc/val.jsonl",
    "test": "data/rdf_poc/test.jsonl",
})
```
## Use it for training (text → triples)
Map each record to an (input, target) pair, then fine-tune any seq2seq or
causal LM:
```python
def to_pair(rec, serialization="plain"):
    """Build one (input, target) pair from a dataset record."""
    prompt = (
        "Extract the fund service-provider relationships as RDF triples.\n\n"
        f"TEXT:\n{rec['input_text']}\n\nTRIPLES:\n"
    )
    target = rec["target_serialized_plain" if serialization == "plain"
                 else "target_serialized"]
    return {"input": prompt, "target": target}

pairs = [to_pair(r) for r in train]
```
A minimal causal-LM fine-tune (HF Transformers/TRL sketch):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
# if training on the marker serialization, register the special tokens first:
# tok.add_special_tokens({"additional_special_tokens":
#     ["<triple_start>", "<predicate_marker>", "<object_marker>", "<triple_end>"]})
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
model.resize_token_embeddings(len(tok))  # only needed after adding tokens

def encode(rec):
    p = to_pair(rec)  # pass serialization="marker" for the marker-token target
    text = p["input"] + p["target"] + tok.eos_token
    enc = tok(text, truncation=True, max_length=8192)
    enc["labels"] = enc["input_ids"].copy()  # (mask the prompt for loss if desired)
    return enc

# feed `encode`-d train/val to your Trainer / SFTTrainer.
```
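
If you want the loss computed only on the triples, the prompt positions can be masked in `labels`. The sketch below works on plain lists of token ids; `build_masked_example` is a hypothetical helper, and `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_masked_example(prompt_ids, target_ids, max_length=8192):
    """Concatenate prompt and target ids; mask the prompt in the labels."""
    input_ids = (list(prompt_ids) + list(target_ids))[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(target_ids))[:max_length]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


# With the tokenizer from above this could be called as (not executed here):
#   p = to_pair(rec)
#   prompt_ids = tok(p["input"])["input_ids"]
#   target_ids = tok(p["target"] + tok.eos_token,
#                    add_special_tokens=False)["input_ids"]
#   enc = build_masked_example(prompt_ids, target_ids)
demo = build_masked_example([1, 2, 3], [4, 5])
```

Tokenizing prompt and target separately can differ slightly at the boundary from tokenizing the concatenated string, which is usually acceptable for fine-tuning. Note that truncation cuts from the end, so an over-long prompt removes target tokens first.
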
### Evaluation
Compare predicted triples against `target_triples` (or parse the serialized
string back to triples). Report precision / recall / F1 per relation. The
companion `score_baseline.py` does fuzzy entity matching (name variants) for
this; treat `test.jsonl` as held-out.
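
For a quick check without the companion script, per-relation scores can be computed with exact matching on (subject, object) pairs. This is a simplified sketch and not what `score_baseline.py` does: it performs no fuzzy matching, so name variants count as errors. The function name `prf_per_relation` is hypothetical.

```python
from collections import defaultdict


def prf_per_relation(gold, pred):
    """Exact-match precision / recall / F1 per predicate.

    gold and pred are iterables of (subject, predicate, object) tuples.
    """
    gold_by_p, pred_by_p = defaultdict(set), defaultdict(set)
    for s, p, o in gold:
        gold_by_p[p].add((s, o))
    for s, p, o in pred:
        pred_by_p[p].add((s, o))

    scores = {}
    for p in sorted(set(gold_by_p) | set(pred_by_p)):
        tp = len(gold_by_p[p] & pred_by_p[p])
        n_pred, n_gold = len(pred_by_p[p]), len(gold_by_p[p])
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[p] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy example with made-up entity names.
gold = [("Fund A", "advisedBy", "Adviser X"), ("Fund A", "custodian", "Bank Y")]
pred = [("Fund A", "advisedBy", "Adviser X"), ("Fund A", "custodian", "Bank Z")]
scores = prf_per_relation(gold, pred)
```

Gold tuples can be built from `target_triples` (`(t["s"], t["p"], t["o"])`), or from the serialized strings if predictions are parsed from text; use the same representation on both sides.
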
## Provenance & caveats
- **Gold** = SEC N-CEN structured filings (service providers), 2025 Q3.
- **Input text** = the *current* prospectus (485BPOS) + supplements per trust,
fetched from EDGAR.
- **Grounding** = a 122B open model role-checks each gold triple against the
prose (does the entity actually play that role here?), then a lexical pass
guarantees the name is in `input_text`.
- The prospectus text is the *current* version while the gold graph is from
2025 Q3, so ~23 % of triples name a provider that the text describes as a
*former* one. The entity and its role are still present in the text, so these
triples remain valid extraction signal. See the note in
`build_rdf_dataset.py` (`build_trainset`).
## Rebuild from scratch (optional)
You do **not** need this to use the dataset. To regenerate everything (SEC
download + ~5 h LLM grounding), see README.md. In short:
```bash
python build_rdf_dataset.py gold --ncen data/ncen/2025q3
python build_rdf_dataset.py fetch
python build_rdf_dataset.py samples
python llm_extract.py --mode match --backend vllm --in ... --out data/rdf_poc/match_all.jsonl
python build_rdf_dataset.py label
python build_rdf_dataset.py trainset
python build_rdf_dataset.py split --src data/rdf_poc/trainset.jsonl
```