fund_rfid_data/DATASET.md

# Fund Prospectus → RDF Triples — Training Dataset

A text→graph extraction dataset: the **input** is natural-language prose from a
U.S. SEC fund prospectus, the **output** is a graph of entity→entity RDF triples
(service-provider relationships of a fund family). Unlike attribute-style datasets,
the targets are *relationships between named entities*, and the input text is much
larger than the serialized output (mean text:triples ratio ≈ 6–9×).

**Key guarantee:** every target triple's entity name is verified to appear in the
sample's `input_text` (100 % coverage). Triples the grounding model accepted but
whose name is absent from the prose were dropped, so the dataset contains no
target that cannot be derived from its text.
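A quick way to spot-check this guarantee on a loaded record is sketched below. It assumes that an entity's surface name can be recovered from its IRI by stripping the prefix and ignoring punctuation, underscores and case (inferred from the single example `org:AllianceBernstein_L_P` ↔ "AllianceBernstein L.P."); the helper name `missing_entities` is hypothetical and is not part of the build scripts. Since it is not stated here whether the guarantee covers the subject, the object, or both, the sketch simply reports every IRI it cannot find.

```python
import re

def _norm(s: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", s.lower())

def missing_entities(rec: dict) -> list:
    """Return IRIs from target_triples whose name is not found in input_text.

    Assumption: the part of the IRI after the prefix, with underscores and
    punctuation ignored, corresponds to the name as written in the prose.
    """
    haystack = _norm(rec["input_text"])
    missing = []
    for t in rec["target_triples"]:
        for iri in (t["s"], t["o"]):
            name = iri.split(":", 1)[1]  # strip the fund:/trust:/org: prefix
            if _norm(name) not in haystack and iri not in missing:
                missing.append(iri)
    return missing
```

An empty list for every record is consistent with the guarantee; a non-empty list may also just mean the name normalization assumed here is too strict.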

## Files (this is all you need — no rebuild required)

Two context sizes are provided, identical triples and splits, differing only in
how much surrounding prose is kept around each cited entity (the `_3x` variant has
~3× more distractor text per triple — useful to test robustness to longer input).
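If you want to confirm that a standard file and its `_3x` counterpart really carry the same triples, a small comparison like the following should do. The function names are hypothetical; the only fields it relies on are `sample_id` and `target_triples` from the record schema.

```python
import json

def triples_by_sample(path):
    """Map sample_id -> set of (s, p, o) tuples for one JSONL file."""
    out = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                out[rec["sample_id"]] = {
                    (t["s"], t["p"], t["o"]) for t in rec["target_triples"]
                }
    return out

def differing_samples(std_path, wide_path):
    """Return the sample_ids whose triple sets differ (or exist in only one file)."""
    std, wide = triples_by_sample(std_path), triples_by_sample(wide_path)
    return sorted(sid for sid in std.keys() | wide.keys()
                  if std.get(sid) != wide.get(sid))
```

For example, `differing_samples("data/rdf_poc/test.jsonl", "data/rdf_poc/test_3x.jsonl")` is expected to return an empty list if the two variants are in sync.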

**Standard context (~600 chars each side, ~47 tokens/triple):**

| File | Samples | Size | Purpose |
|------|--------:|-----:|---------|
| `data/rdf_poc/trainset.jsonl` | 332 | ~6 MB | full set (all samples) |
| `data/rdf_poc/train.jsonl` | 262 | ~5 MB | training split |
| `data/rdf_poc/val.jsonl` | 35 | ~0.8 MB | validation split |
| `data/rdf_poc/test.jsonl` | 35 | ~0.5 MB | test split |

**3× context (~1800 chars each side, ~132 tokens/triple, median ~3.7k tokens/sample):**

| File | Samples | Size | Purpose |
|------|--------:|-----:|---------|
| `data/rdf_poc/trainset_3x.jsonl` | 332 | ~9 MB | full set (all samples) |
| `data/rdf_poc/train_3x.jsonl` | 262 | ~7 MB | training split |
| `data/rdf_poc/val_3x.jsonl` | 35 | ~1 MB | validation split |
| `data/rdf_poc/test_3x.jsonl` | 35 | ~0.8 MB | test split |

Both variants keep the 100 % name-in-text guarantee. The `_3x` set is produced by
`build_rdf_dataset.py trainset --radius 1800 --out ...`; any other radius works too.
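To make the effect of the radius concrete, here is a minimal sketch of character-window extraction around entity mentions. It is an illustration of the idea only, not the code in `build_rdf_dataset.py`: the function name, the exact-substring matching, the merging of overlapping windows and the `" ... "` separator are all assumptions.

```python
def context_windows(text: str, mentions: list, radius: int = 600) -> str:
    """Keep `radius` characters on each side of every mention, merging overlaps."""
    spans = []
    for m in mentions:
        if not m:
            continue  # ignore empty mention strings
        start = text.find(m)
        while start != -1:
            spans.append((max(0, start - radius),
                          min(len(text), start + len(m) + radius)))
            start = text.find(m, start + 1)
    spans.sort()
    merged = []
    for lo, hi in spans:
        if merged and lo <= merged[-1][1]:
            # overlaps (or touches) the previous window: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return " ... ".join(text[lo:hi] for lo, hi in merged)
```

With a larger `radius` the same mentions pull in more surrounding prose, which is what makes the `_3x` inputs longer while the targets stay unchanged.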

Splits are **by trust (CIK)**: all funds of one trust stay in one split, so the
model cannot memorise a trust's providers from another split (no leakage). The
assignment is a deterministic hash of the CIK and is reproducible.
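The sketch below shows one way a deterministic hash-based assignment can work, plus a leakage check you can run on the loaded splits. The hash function (SHA-256), the bucket scheme and the 10 % / 10 % proportions are assumptions for illustration; the actual rule lives in `build_rdf_dataset.py` and may differ. `assign_split` and `leaked_ciks` are hypothetical names.

```python
import hashlib

def assign_split(cik: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a CIK to a split via a stable hash (same CIK -> same split, every run)."""
    bucket = int(hashlib.sha256(cik.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"

def leaked_ciks(splits: dict) -> set:
    """Return CIKs that occur in more than one split (should be empty).

    `splits` maps a split name to its list of records.
    """
    first_seen, leaked = {}, set()
    for name, records in splits.items():
        for rec in records:
            owner = first_seen.setdefault(rec["cik"], name)
            if owner != name:
                leaked.add(rec["cik"])
    return leaked
```

Running `leaked_ciks({"train": train, "val": val, "test": test})` on the provided files should yield an empty set if the by-trust split holds.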

`gold_graphs.jsonl` (the raw N-CEN gold graph) is also committed for provenance.
The multi-MB intermediate files (`samples_full.jsonl`, raw `prose/`) are **not**
committed — they are only needed to rebuild from scratch (see README.md).

## Record schema

Each line is one JSON object (one fund family / trust):

```jsonc
{
  "sample_id": "0000899774:ALL",          // <CIK>:<scope>
  "cik": "0000899774",                     // SEC Central Index Key of the trust
  "trust_name": "AB MUNICIPAL INCOME FUND II",
  "input_text": "...INVESTMENT ADVISER: AllianceBernstein L.P. is the investment adviser ...",
                                           // focused prospectus excerpts (the model input)
  "ontology": { "Fund": { "advisedBy": ["InvestmentAdviser"], ... } },
                                           // schema of the relations used in this sample
  "target_triples": [                      // the extraction TARGET (a graph)
    { "s": "fund:AB_Massachusetts_Portfolio",
      "p": "advisedBy",
      "o": "org:AllianceBernstein_L_P" },
    ...
  ],
  "target_serialized":
    "<triple_start> AB Massachusetts Portfolio <predicate_marker> advisedBy <object_marker> AllianceBernstein L.P. <triple_end> ...",
                                           // marker-token serialization (thesis Models 2 & 4)
  "target_serialized_plain":
    "AB Massachusetts Portfolio advisedBy AllianceBernstein L.P. . ...",
                                           // plain Turtle-like serialization (thesis Models 1 & 3)
  "stats": { "input_chars": 3766, "n_triples": 21, "text_to_json_ratio": 2.1 }
}
```
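A light structural check against this schema can catch truncated or hand-edited lines before training. The sketch below only tests what the example record shows; the assumption that `stats.n_triples` equals `len(target_triples)` is plausible but not stated, and `validate_record` is a hypothetical helper.

```python
REQUIRED_KEYS = {
    "sample_id", "cik", "trust_name", "input_text", "ontology",
    "target_triples", "target_serialized", "target_serialized_plain", "stats",
}

def validate_record(rec: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    missing = REQUIRED_KEYS - rec.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    if not rec["sample_id"].startswith(rec["cik"] + ":"):
        problems.append("sample_id does not start with '<cik>:'")
    for i, t in enumerate(rec["target_triples"]):
        if set(t) != {"s", "p", "o"}:
            problems.append(f"triple {i} has keys {sorted(t)}")
    # Assumption: n_triples counts the entries of target_triples.
    if rec["stats"].get("n_triples") != len(rec["target_triples"]):
        problems.append("stats.n_triples != len(target_triples)")
    return problems
```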

### Relations (predicates)

`seriesOf` (Fund→Trust), `advisedBy`, `subAdvisedBy`, `transferAgent`,
`custodian`, `administrator` (all Fund→Organization), `underwrittenBy`
(Trust→Distributor). Entity IRIs are prefixed: `fund:`, `trust:`, `org:`.
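The domain/range of each predicate can be turned into a simple sanity check on predicted or gold triples. The table below is read off the list above; that distributors carry the `org:` prefix is an assumption (only three prefixes are listed), and `prefix_violations` is a hypothetical helper.

```python
# Expected (subject prefix, object prefix) per predicate.
EXPECTED_PREFIXES = {
    "seriesOf":       ("fund:", "trust:"),
    "advisedBy":      ("fund:", "org:"),
    "subAdvisedBy":   ("fund:", "org:"),
    "transferAgent":  ("fund:", "org:"),
    "custodian":      ("fund:", "org:"),
    "administrator":  ("fund:", "org:"),
    "underwrittenBy": ("trust:", "org:"),  # assumption: distributors use org:
}

def prefix_violations(triples: list) -> list:
    """Return triples with an unknown predicate or unexpected subject/object prefix."""
    bad = []
    for t in triples:
        expected = EXPECTED_PREFIXES.get(t["p"])
        if expected is None or not (
            t["s"].startswith(expected[0]) and t["o"].startswith(expected[1])
        ):
            bad.append(t)
    return bad
```

Applied to model output, this filters out structurally impossible triples (for example a `custodian` edge starting at a trust) before scoring.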

### Which serialization to train on?

- **Plain** (`target_serialized_plain`) — natural Turtle-like text; use for a
  standard seq2seq / causal-LM extraction target.
- **Marker tokens** (`target_serialized`) — uses the special tokens
  `<triple_start> <predicate_marker> <object_marker> <triple_end>` from the
  thesis (Section 5.2); add these 4 tokens to the tokenizer if you train on it.

Pick one and keep it consistent across train/val/test.
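For the marker serialization, turning model output back into triples is a matter of matching the four tokens. The following is a minimal sketch based on the format shown in the record schema; the function names are hypothetical, and malformed or incomplete triples in the output are simply skipped.

```python
import re

_TRIPLE_RE = re.compile(
    r"<triple_start>\s*(.*?)\s*<predicate_marker>\s*(.*?)\s*"
    r"<object_marker>\s*(.*?)\s*<triple_end>",
    re.DOTALL,
)

def parse_marker(serialized: str) -> list:
    """Extract (subject, predicate, object) name tuples from a marker-token string."""
    return [m.groups() for m in _TRIPLE_RE.finditer(serialized)]

def to_marker(triples: list) -> str:
    """Inverse of parse_marker: serialize (s, p, o) name tuples with marker tokens."""
    return " ".join(
        f"<triple_start> {s} <predicate_marker> {p} <object_marker> {o} <triple_end>"
        for s, p, o in triples
    )
```

The plain form has no explicit delimiter around the predicate, so parsing it back would additionally need the list of predicate names; that is one practical argument for the marker variant when you plan to score parsed output.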

## Load it

Plain Python:

```python
import json

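def load(path):
    # JSONL: one JSON object per line; read as UTF-8 and skip blank lines
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]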

train = load("data/rdf_poc/train.jsonl")
val   = load("data/rdf_poc/val.jsonl")
test  = load("data/rdf_poc/test.jsonl")
print(len(train), "training samples; example relations:",
      [t["p"] for t in train[0]["target_triples"]][:5])
```

HuggingFace `datasets`:

```python
from datasets import load_dataset
ds = load_dataset("json", data_files={
    "train": "data/rdf_poc/train.jsonl",
    "validation": "data/rdf_poc/val.jsonl",
    "test": "data/rdf_poc/test.jsonl",
})
```

## Use it for training (text → triples)

Map each record to an (input, target) pair, then fine-tune any seq2seq or
causal LM:

```python
def to_pair(rec, serialization="plain"):
    prompt = (
        "Extract the fund service-provider relationships as RDF triples.\n\n"
        f"TEXT:\n{rec['input_text']}\n\nTRIPLES:\n"
    )
    target = rec["target_serialized_plain" if serialization == "plain"
                 else "target_serialized"]
    return {"input": prompt, "target": target}

pairs = [to_pair(r) for r in train]
```

A minimal causal-LM fine-tune (HF Transformers/TRL sketch):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
# if training on the marker serialization, register the special tokens first:
# tok.add_special_tokens({"additional_special_tokens":
#   ["<triple_start>","<predicate_marker>","<object_marker>","<triple_end>"]})
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
model.resize_token_embeddings(len(tok))

def encode(rec):
    p = to_pair(rec)
    text = p["input"] + p["target"] + tok.eos_token
    enc = tok(text, truncation=True, max_length=8192)
    enc["labels"] = enc["input_ids"].copy()   # (mask the prompt for loss if desired)
    return enc
# feed `encode`-d train/val to your Trainer / SFTTrainer.
```
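If you do want to mask the prompt so that loss is computed on the triples only, the usual approach is to set the prompt positions in `labels` to `-100`, the index that PyTorch's cross-entropy loss ignores. The sketch below works on plain token-id lists so it stays independent of any particular tokenizer; `build_masked_example` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch cross-entropy loss

def build_masked_example(prompt_ids, target_ids, eos_id, max_length=8192):
    """Concatenate prompt + target + EOS and mask the prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(target_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids) + [eos_id]
    # Truncate from the right; note that a prompt longer than max_length
    # leaves no target tokens to learn from.
    input_ids, labels = input_ids[:max_length], labels[:max_length]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }
```

With a Hugging Face tokenizer you could obtain the inputs as `tok(p["input"])["input_ids"]` and `tok(p["target"], add_special_tokens=False)["input_ids"]`, with `tok.eos_token_id` as `eos_id`. Tokenizing prompt and target separately can split tokens at the boundary slightly differently than tokenizing the joined string.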

### Evaluation

Compare predicted triples against `target_triples` (or parse the serialized
string back to triples). Report precision / recall / F1 per relation. The
companion `score_baseline.py` does fuzzy entity matching (name variants) for
this; treat `test.jsonl` as held-out.
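As a starting point, per-relation precision / recall / F1 with exact matching can be computed as sketched below. This is a stricter stand-in, not a reimplementation of `score_baseline.py`: it compares `(s, o)` pairs literally and does no fuzzy name matching, so it will under-count predictions that differ only in spelling. `prf_per_relation` is a hypothetical name.

```python
from collections import defaultdict

def prf_per_relation(gold: list, pred: list) -> dict:
    """Exact-match precision/recall/F1 per predicate over {"s","p","o"} dicts."""
    gold_by_p, pred_by_p = defaultdict(set), defaultdict(set)
    for t in gold:
        gold_by_p[t["p"]].add((t["s"], t["o"]))
    for t in pred:
        pred_by_p[t["p"]].add((t["s"], t["o"]))
    scores = {}
    for p in sorted(set(gold_by_p) | set(pred_by_p)):
        tp = len(gold_by_p[p] & pred_by_p[p])
        precision = tp / len(pred_by_p[p]) if pred_by_p[p] else 0.0
        recall = tp / len(gold_by_p[p]) if gold_by_p[p] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[p] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

Scores here are per sample; to report corpus-level numbers, pool the gold and predicted triples of all test samples (keeping the sample id as part of the key if entity names can repeat across trusts) before calling it.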

## Provenance & caveats

- **Gold** = SEC N-CEN structured filings (service providers), 2025 Q3.
- **Input text** = the *current* prospectus (485BPOS) + supplements per trust,
  fetched from EDGAR.
- **Grounding** = a 122B-parameter open model role-checks each gold triple
  against the prose (does the entity actually play that role here?), then a
  lexical pass guarantees the name is in `input_text`.
- Because the prospectus is the *current* version while gold is 2025 Q3, ~2–3 %
  of triples name a provider the text describes as a *former* one (entity+role
  still present, so still valid extraction signal). See the note in
  `build_rdf_dataset.py` (`build_trainset`).

## Rebuild from scratch (optional)

You do **not** need this to use the dataset. To regenerate everything (SEC
download + ~5 h LLM grounding), see README.md. In short:

```bash
python build_rdf_dataset.py gold     --ncen data/ncen/2025q3
python build_rdf_dataset.py fetch
python build_rdf_dataset.py samples
python llm_extract.py --mode match --backend vllm --in ... --out data/rdf_poc/match_all.jsonl
python build_rdf_dataset.py label
python build_rdf_dataset.py trainset
python build_rdf_dataset.py split --src data/rdf_poc/trainset.jsonl
```