fund_rfid_data/dataset_description.pdf at 00f51859e06492cf20a338d87252434561f240c9

Florian Herzog 00f51859e0 Drop non-extractable custodian relation; add per-triple grounded flag

Custodian names (especially foreign sub-custodians) appear only in the
structured N-CEN data, never in the prospectus prose, so they are not a valid
text->triple target. Per fund, the custodian object name occurs in only 28% of
segments, the lowest coverage of any relation. The default is now
--custodian-scope none.
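
One way to read the per-relation coverage figure is "the share of a fund's
text segments that mention at least one object name of that relation". The
sketch below computes that under stated assumptions: the function name
segment_coverage, the triple keys ("relation", "object"), the example data,
and the case-insensitive substring match are all hypothetical and are not
taken from build_rdf_dataset.py.

```python
from collections import defaultdict


def segment_coverage(segments, triples):
    """Per relation, the fraction of segments mentioning any object name.

    Assumption: a "mention" is a case-insensitive substring match.
    """
    lowered = [seg.casefold() for seg in segments]
    names = defaultdict(set)
    for triple in triples:
        names[triple["relation"]].add(triple["object"].casefold())

    coverage = {}
    for relation, objects in names.items():
        hits = sum(any(obj in seg for obj in objects) for seg in lowered)
        coverage[relation] = hits / len(lowered) if lowered else 0.0
    return coverage


# Hypothetical fund with four prospectus segments and two triples.
segments = [
    "Example Advisors LLC manages the Fund.",
    "Shares are sold daily.",
    "Example Advisors LLC receives a management fee.",
    "Risks include market risk.",
]
triples = [
    {"relation": "adviser", "object": "Example Advisors LLC"},
    {"relation": "custodian", "object": "Example Bank N.A."},
]
coverage = segment_coverage(segments, triples)
```

A relation whose coverage stays low across funds, as reported here for
custodian, is a candidate for exclusion from text-extraction targets.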

Every triple now carries a 'grounded' boolean (object name present in the
sample's input text); 80% of triples are grounded across the full build. This
lets training/eval restrict to text-extractable targets.
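
The grounding rule described above can be sketched as follows. This is an
illustration only, not the annotate_grounding() code from
build_rdf_dataset.py: the function names, the sample layout ("input_text",
"triples", "object"), and the case-insensitive substring match are
assumptions made for the example.

```python
from typing import Any


def annotate_grounding_sketch(sample: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the sample whose triples carry a 'grounded' boolean.

    Assumption: a triple is grounded when its object name occurs in the
    sample's input text, compared case-insensitively.
    """
    text = sample["input_text"].casefold()
    triples = []
    for triple in sample["triples"]:
        name = triple["object"].strip().casefold()
        triples.append({**triple, "grounded": bool(name) and name in text})
    return {**sample, "triples": triples}


def grounded_share(samples: list[dict[str, Any]]) -> float:
    """Fraction of all triples flagged as grounded (0.0 if there are none)."""
    flags = [t["grounded"] for s in samples for t in s["triples"]]
    return sum(flags) / len(flags) if flags else 0.0


def grounded_only(sample: dict[str, Any]) -> list[dict[str, Any]]:
    """Keep only text-extractable triples, e.g. for training or evaluation."""
    return [t for t in sample["triples"] if t["grounded"]]


# Hypothetical sample: the adviser is named in the text, the custodian is not.
sample = {
    "input_text": "The Fund's investment adviser is Example Advisors LLC.",
    "triples": [
        {"relation": "adviser", "object": "Example Advisors LLC"},
        {"relation": "custodian", "object": "Example Bank N.A."},
    ],
}
annotated = annotate_grounding_sketch(sample)
```

Keeping the flag on each triple, rather than dropping ungrounded triples at
build time, leaves the choice of whether to filter to the training or
evaluation code.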

- build_rdf_dataset.py: annotate_grounding() + grounded flag in samples/stats
- gold rebuilt without custodian (15,739 -> 12,694 edges)
- dataset_description + README updated (custodian dropped, grounding documented)

Reported by thesis author: Citibank custodians in triples for 0001529390 never
appear in that prospectus text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-05 10:34:14 +02:00

339 KiB