Florian Herzog 9dc870b8d0 Add 3x-context dataset variant (trainset --radius)

- build_trainset gains --radius (chars each side of the cited name) and --out;
  merge-gap scales with radius. Default 600 unchanged.
- trainset_3x + train/val/test_3x.jsonl: same 10,519 triples and same trust split,
  but ~3x more surrounding prose per triple (~47 -> ~132 tokens/triple, median
  ~3.7k tokens/sample). Keeps the 100% name-in-text guarantee.
- DATASET.md documents both context sizes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-10 16:37:30 +02:00


Fund Prospectus → RDF Triples — Training Dataset

A text→graph extraction dataset: the input is natural-language prose from a U.S. SEC fund prospectus, the output is a graph of entity→entity RDF triples (service-provider relationships of a fund family). Unlike attribute-style datasets, the targets are relationships between named entities, and the input text is much larger than the serialized output (mean text:triples ratio ≈ 6–9×).

Key guarantee: every target triple's entity name is verified to appear in the sample's input_text (100 % coverage). Triples the grounding model accepted but whose name is absent from the prose were dropped, so the dataset contains no target that cannot be derived from its text.

Files (this is all you need — no rebuild required)

Two context sizes are provided. They contain identical triples and splits and differ only in how much surrounding prose is kept around each cited entity. The _3x variant has ~3× more distractor text per triple, which is useful for testing robustness to longer input.

Standard context (~600 chars each side, ~47 tokens/triple):

File	Samples	Size	Purpose
`data/rdf_poc/trainset.jsonl`	332	~6 MB	full set (all samples)
`data/rdf_poc/train.jsonl`	262	~5 MB	training split
`data/rdf_poc/val.jsonl`	35	~0.8 MB	validation split
`data/rdf_poc/test.jsonl`	35	~0.5 MB	test split

3× context (~1800 chars each side, ~132 tokens/triple, median ~3.7k tokens/sample):

File	Samples	Size
`data/rdf_poc/trainset_3x.jsonl`	332	~9 MB
`data/rdf_poc/train_3x.jsonl`	262	~7 MB
`data/rdf_poc/val_3x.jsonl`	35	~1 MB
`data/rdf_poc/test_3x.jsonl`	35	~0.8 MB

Both variants keep the 100 % name-in-text guarantee. The _3x set is produced by build_rdf_dataset.py trainset --radius 1800 --out ...; any other radius works too.
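To picture what the radius controls, here is a minimal sketch of windowing around cited names. It is not the code in build_rdf_dataset.py: the function name `extract_windows`, the exact-match search, and the default `merge_gap = radius // 3` are all assumptions made for illustration. The only thing taken from this document is the idea that a fixed number of characters is kept on each side of a name and that nearby windows are merged with a gap that scales with the radius.

```python
def extract_windows(text, names, radius=600, merge_gap=None):
    """Return excerpts covering `radius` chars on each side of every name hit."""
    if merge_gap is None:
        # Assumption: the gap grows with the radius; the real rule may differ.
        merge_gap = radius // 3

    # Collect one (start, end) span per occurrence of each name (exact match).
    spans = []
    for name in names:
        if not name:
            continue
        pos = text.find(name)
        while pos != -1:
            spans.append((max(0, pos - radius),
                          min(len(text), pos + len(name) + radius)))
            pos = text.find(name, pos + 1)

    # Merge spans that overlap or lie within `merge_gap` chars of each other.
    merged = []
    for lo, hi in sorted(spans):
        if merged and lo - merged[-1][1] <= merge_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return [text[lo:hi] for lo, hi in merged]
```

With a larger radius, neighbouring windows merge into fewer but longer excerpts, which is why the _3x files carry more prose per triple.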

Splits are by trust (CIK): all funds of one trust stay in one split, so the model cannot memorise a trust's providers from another split (no leakage). The assignment is a deterministic hash of the CIK and is reproducible.
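A hash-based assignment of this kind can be sketched as follows. The hash function (SHA-256), the split fractions, and the names `assign_split` and `find_leaked_ciks` are assumptions for illustration; the committed splits come from the project's own script and will not match this sketch. The second helper is a quick way to confirm that no CIK appears in more than one loaded split.

```python
import hashlib

def assign_split(cik, val_frac=0.1, test_frac=0.1):
    """Map a CIK to 'train', 'val' or 'test' via a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(cik.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"

def find_leaked_ciks(splits):
    """Return CIKs that occur in more than one split; `splits` maps name -> records."""
    seen = {}
    leaked = set()
    for split_name, records in splits.items():
        for rec in records:
            owner = seen.setdefault(rec["cik"], split_name)
            if owner != split_name:
                leaked.add(rec["cik"])
    return leaked
```

For example, `find_leaked_ciks({"train": train, "val": val, "test": test})` should return an empty set on the committed files.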

gold_graphs.jsonl (the raw N-CEN gold graph) is also committed for provenance. The multi-MB intermediate files (samples_full.jsonl, raw prose/) are not committed — they are only needed to rebuild from scratch (see README.md).

Record schema

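Each line is one JSON object (one fund family / trust). The // comments below are annotations for the reader and "..." marks elided content; the JSONL files themselves contain plain JSON without comments: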

{
  "sample_id": "0000899774:ALL",          // <CIK>:<scope>
  "cik": "0000899774",                     // SEC Central Index Key of the trust
  "trust_name": "AB MUNICIPAL INCOME FUND II",
  "input_text": "...INVESTMENT ADVISER: AllianceBernstein L.P. is the investment adviser ...",
                                           // focused prospectus excerpts (the model input)
  "ontology": { "Fund": { "advisedBy": ["InvestmentAdviser"], ... } },
                                           // schema of the relations used in this sample
  "target_triples": [                      // the extraction TARGET (a graph)
    { "s": "fund:AB_Massachusetts_Portfolio",
      "p": "advisedBy",
      "o": "org:AllianceBernstein_L_P" },
    ...
  ],
  "target_serialized":
    "<triple_start> AB Massachusetts Portfolio <predicate_marker> advisedBy <object_marker> AllianceBernstein L.P. <triple_end> ...",
                                           // marker-token serialization (thesis Models 2 & 4)
  "target_serialized_plain":
    "AB Massachusetts Portfolio advisedBy AllianceBernstein L.P. . ...",
                                           // plain Turtle-like serialization (thesis Models 1 & 3)
  "stats": { "input_chars": 3766, "n_triples": 21, "text_to_json_ratio": 2.1 }
}

Relations (predicates)

seriesOf (Fund→Trust), advisedBy, subAdvisedBy, transferAgent, custodian, administrator (all Fund→Organization), underwrittenBy (Trust→Distributor). Entity IRIs are prefixed: fund:, trust:, org:.
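The domain/range pairs above can be turned into a quick sanity check on predicted or loaded triples. This is a sketch, not part of the repository: `RELATION_SIGNATURES` and `find_invalid_triples` are hypothetical names, and mapping Distributor to the org: prefix is an assumption (the document lists only the fund:, trust: and org: prefixes).

```python
# Expected (subject prefix, object prefix) for each predicate.
RELATION_SIGNATURES = {
    "seriesOf":       ("fund:",  "trust:"),
    "advisedBy":      ("fund:",  "org:"),
    "subAdvisedBy":   ("fund:",  "org:"),
    "transferAgent":  ("fund:",  "org:"),
    "custodian":      ("fund:",  "org:"),
    "administrator":  ("fund:",  "org:"),
    "underwrittenBy": ("trust:", "org:"),  # assumption: distributors use org:
}

def find_invalid_triples(triples):
    """Return triples with an unknown predicate or an unexpected IRI prefix."""
    bad = []
    for t in triples:
        signature = RELATION_SIGNATURES.get(t["p"])
        if signature is None:
            bad.append(t)
            continue
        subj_prefix, obj_prefix = signature
        if not (t["s"].startswith(subj_prefix) and t["o"].startswith(obj_prefix)):
            bad.append(t)
    return bad
```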

Which serialization to train on?

Plain (target_serialized_plain) — natural Turtle-like text; use for a standard seq2seq / causal-LM extraction target.
Marker tokens (target_serialized) — uses the special tokens <triple_start> <predicate_marker> <object_marker> <triple_end> from the thesis (Section 5.2); add these 4 tokens to the tokenizer if you train on it.

Pick one and keep it consistent across train/val/test.
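If you train on the marker serialization, you will need to turn model output back into triples for scoring. A minimal regex-based sketch is shown below; `parse_marker_serialization` is a hypothetical helper, and it assumes the output follows the marker order shown in the record schema. Malformed or truncated triples are skipped.

```python
import re

# One triple: <triple_start> S <predicate_marker> P <object_marker> O <triple_end>
TRIPLE_RE = re.compile(
    r"<triple_start>\s*(.*?)\s*<predicate_marker>\s*(.*?)\s*"
    r"<object_marker>\s*(.*?)\s*<triple_end>",
    re.DOTALL,
)

def parse_marker_serialization(serialized):
    """Return a list of (subject, predicate, object) name tuples."""
    return [tuple(groups) for groups in TRIPLE_RE.findall(serialized)]
```

Note that the parsed subjects and objects are surface names (e.g. "AllianceBernstein L.P."), not the prefixed IRIs used in target_triples.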

Load it

Plain Python:

import json

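def load(path):
    # JSONL: one JSON object per line; read as UTF-8 and skip blank lines
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]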

train = load("data/rdf_poc/train.jsonl")
val   = load("data/rdf_poc/val.jsonl")
test  = load("data/rdf_poc/test.jsonl")
print(len(train), "training samples; example relations:",
      [t["p"] for t in train[0]["target_triples"]][:5])

HuggingFace datasets:

from datasets import load_dataset
ds = load_dataset("json", data_files={
    "train": "data/rdf_poc/train.jsonl",
    "validation": "data/rdf_poc/val.jsonl",
    "test": "data/rdf_poc/test.jsonl",
})

Use it for training (text → triples)

Map each record to an (input, target) pair, then fine-tune any seq2seq or causal LM:

def to_pair(rec, serialization="plain"):
    prompt = (
        "Extract the fund service-provider relationships as RDF triples.\n\n"
        f"TEXT:\n{rec['input_text']}\n\nTRIPLES:\n"
    )
    target = rec["target_serialized_plain" if serialization == "plain"
                 else "target_serialized"]
    return {"input": prompt, "target": target}

pairs = [to_pair(r) for r in train]

A minimal causal-LM fine-tune (HF Transformers/TRL sketch):

from transformers import AutoTokenizer, AutoModelForCausalLM
tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
# if training on the marker serialization, register the special tokens first:
# tok.add_special_tokens({"additional_special_tokens":
#   ["<triple_start>","<predicate_marker>","<object_marker>","<triple_end>"]})
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
model.resize_token_embeddings(len(tok))

def encode(rec):
    p = to_pair(rec)
    text = p["input"] + p["target"] + tok.eos_token
    enc = tok(text, truncation=True, max_length=8192)
    enc["labels"] = enc["input_ids"].copy()   # (mask the prompt for loss if desired)
    return enc
# feed `encode`-d train/val to your Trainer / SFTTrainer.
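To compute the loss on the triples only, the prompt positions in the labels can be set to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below works on plain token-id lists so it stays independent of any tokenizer; `build_masked_example` is a hypothetical helper. It assumes you tokenize the prompt and the target separately (the target without extra special tokens), which can differ slightly at the boundary from tokenizing the concatenated string.

```python
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss

def build_masked_example(prompt_ids, target_ids, eos_id, max_length=8192):
    """Concatenate prompt + target + EOS and mask the prompt part of the labels."""
    input_ids = (list(prompt_ids) + list(target_ids) + [eos_id])[:max_length]
    # If truncation cut into the prompt, everything that is left is prompt.
    n_prompt = min(len(prompt_ids), len(input_ids))
    labels = [IGNORE_INDEX] * n_prompt + input_ids[n_prompt:]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }
```

Truncation here removes tokens from the end, so with very long inputs the target is cut first; check how many samples exceed max_length before training, especially with the _3x files.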

Evaluation

Compare predicted triples against target_triples (or parse the serialized string back to triples). Report precision / recall / F1 per relation. The companion score_baseline.py does fuzzy entity matching (name variants) for this; treat test.jsonl as held-out.
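For a quick first number, per-relation scores can be computed with exact matching on (s, p, o). This sketch is stricter than the fuzzy matching described for score_baseline.py and is not a substitute for it; `per_relation_prf` is a hypothetical helper that expects triples as dicts with the keys s, p and o, as in target_triples.

```python
def per_relation_prf(pred_triples, gold_triples):
    """Exact-match precision / recall / F1 for each predicate."""
    pred = {(t["s"], t["p"], t["o"]) for t in pred_triples}
    gold = {(t["s"], t["p"], t["o"]) for t in gold_triples}
    scores = {}
    for rel in sorted({p for _, p, _ in pred | gold}):
        pred_rel = {t for t in pred if t[1] == rel}
        gold_rel = {t for t in gold if t[1] == rel}
        tp = len(pred_rel & gold_rel)  # triples predicted and present in gold
        precision = tp / len(pred_rel) if pred_rel else 0.0
        recall = tp / len(gold_rel) if gold_rel else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[rel] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

Duplicated predictions are collapsed by the set conversion, so repeating a correct triple does not inflate the score.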

Provenance & caveats

Gold = SEC N-CEN structured filings (service providers), 2025 Q3.
Input text = the current prospectus (485BPOS) + supplements per trust, fetched from EDGAR.
Grounding = a 122B open model role-checks each gold triple against the prose (does the entity actually play that role here?), then a lexical pass guarantees the name is in input_text.
Because the prospectus is the current version while gold is 2025 Q3, ~2–3 % of triples name a provider the text describes as a former one (entity+role still present, so still valid extraction signal). See the note in build_rdf_dataset.py (build_trainset).

Rebuild from scratch (optional)

You do not need this to use the dataset. To regenerate everything (SEC download + ~5 h LLM grounding), see README.md. In short:

python build_rdf_dataset.py gold     --ncen data/ncen/2025q3
python build_rdf_dataset.py fetch
python build_rdf_dataset.py samples
python llm_extract.py --mode match --backend vllm --in ... --out data/rdf_poc/match_all.jsonl
python build_rdf_dataset.py label
python build_rdf_dataset.py trainset
python build_rdf_dataset.py split --src data/rdf_poc/trainset.jsonl
