Custodian names (esp. foreign sub-custodians) appear only in structured N-CEN, never in the prospectus prose, so they are not a valid text->triple target. Per-fund the custodian object name occurs in only 28% of segments, the weakest of all relations. Default is now --custodian-scope none.

Every triple now carries a 'grounded' boolean (object name present in the sample's input text); 80% of triples are grounded across the full build. This lets training/eval restrict to text-extractable targets.

- build_rdf_dataset.py: annotate_grounding() + grounded flag in samples/stats
- gold rebuilt without custodian (15,739 -> 12,694 edges)
- dataset_description + README updated (custodian dropped, grounding documented)

Reported by thesis author: Citibank custodians in triples for 0001529390 never appear in that prospectus text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
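
For reviewers, the grounding check described above can be sketched roughly as follows. This is an illustration only, not the code in build_rdf_dataset.py: the function names, the "object" key on each triple, and the case-insensitive, whitespace-normalized substring match are all assumptions made for the example.

```python
from typing import Dict, List


def annotate_grounding_sketch(text: str, triples: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Return copies of the triples with a 'grounded' boolean added.

    Hypothetical stand-in for annotate_grounding(); the real signature
    and matching rule may differ.
    """
    # Assumption: lowercase and collapse whitespace before matching.
    haystack = " ".join(text.lower().split())
    annotated = []
    for triple in triples:
        # Assumption: the object name is stored under the key "object".
        needle = " ".join(triple["object"].lower().split())
        grounded = bool(needle) and needle in haystack
        annotated.append({**triple, "grounded": grounded})
    return annotated


def grounded_share(triples: List[Dict[str, object]]) -> float:
    """Fraction of triples whose object name was found in the text."""
    if not triples:
        return 0.0
    return sum(1 for t in triples if t["grounded"]) / len(triples)


# Made-up sample: the adviser name is in the text, the custodian name is not.
sample_text = "The Fund's investment adviser is Acme Advisors LLC."
sample_triples = [
    {"subject": "Fund", "relation": "adviser", "object": "Acme Advisors LLC"},
    {"subject": "Fund", "relation": "custodian", "object": "Citibank N.A."},
]
annotated = annotate_grounding_sketch(sample_text, sample_triples)
```

With a flag like this on every triple, a training or eval loader can keep only the rows where grounded is true, which is the restriction to text-extractable targets mentioned above.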