Drop non-extractable custodian relation; add per-triple grounded flag

Custodian names (esp. foreign sub-custodians) appear only in structured N-CEN,
never in the prospectus prose, so they are not a valid text->triple target.
Per-fund the custodian object name occurs in only 28% of segments, the weakest
of all relations. Default is now --custodian-scope none.

Every triple now carries a 'grounded' boolean (object name present in the
sample's input text); 80% of triples are grounded across the full build. This
lets training/eval restrict to text-extractable targets.

- build_rdf_dataset.py: annotate_grounding() + grounded flag in samples/stats
- gold rebuilt without custodian (15,739 -> 12,694 edges)
- dataset_description + README updated (custodian dropped, grounding documented)

Reported by thesis author: Citibank custodians in triples for 0001529390 never
appear in that prospectus text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Florian Herzog 2026-06-05 10:34:14 +02:00
parent 63e650fa14
commit 00f51859e0
5 changed files with 566 additions and 512 deletions


@ -54,19 +54,33 @@ Each line of `samples.jsonl` / `train|val|test.jsonl` is a JSON record:
|---|---|
| `input_text` | prospectus prose for the fund (model input) |
| `ontology` | inferred meta-schema (subject type → predicate → object type) |
| `target_triples` | structured `{s,p,o}` list |
| `target_triples` | structured `{s,p,o,grounded}` list (`grounded` = object name appears in `input_text`) |
| `target_serialized` | marker form (`<triple_start>` …) for Models 2/4 |
| `target_serialized_plain` | Turtle-like form, no special tokens, for Models 1/3 |
| `cik`, `series_id`, `fund`, `trust_name` | identifiers |
| `stats` | input/target sizes, triple count, text:json ratio |
| `stats` | input/target sizes, triple count, `n_grounded`, text:json ratio |
## Relations
Entity-to-entity edges (gold from N-CEN / Series-Class):
`seriesOf`, `advisedBy`, `subAdvisedBy`, `transferAgent`, `custodian` (primary
only by default), `administrator`, `underwrittenBy`. Holdings edges
(`holds`/`issuedBy`/`domiciledIn`, gold from N-PORT) are a planned second track
from annual-report (N-CSR) commentary — see the description PDF.
`seriesOf`, `advisedBy`, `subAdvisedBy`, `transferAgent`, `administrator`,
`underwrittenBy`.
**`custodian` is dropped by default** (`--custodian-scope none`): foreign
sub-custodian names appear only in the structured N-CEN table and in *no* prose
document, and the summary prospectus says only "the custodian", so these names are
not extractable from text. Even the primary custodian is typically named only in the
separately filed SAI (N-1A Part B), which is not part of the input. Use
`--custodian-scope primary` or `all` to re-include it if you add the SAI as input.
**Prose-grounding:** every triple carries a `grounded` flag (object name present
in the sample's input). Across the full build ~80 % of triples are grounded
(per relation: advisedBy 93 %, seriesOf/subAdvisedBy/administrator 80–84 %,
transferAgent 72 %, underwrittenBy 62 %). Filter on `grounded` to train/evaluate
only on text-extractable targets.
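
A minimal sketch of such a filter, assuming only the record fields listed in the table above. The helper name `keep_grounded` and the toy record (including its entity ids) are hypothetical and not part of the repository:

```python
import json


def keep_grounded(record: dict) -> dict:
    """Return a copy of a sample record with only prose-grounded triples."""
    kept = [t for t in record["target_triples"] if t.get("grounded")]
    return {**record, "target_triples": kept}


# Toy JSONL line shaped like one record of samples.jsonl (illustrative values).
line = json.dumps({
    "input_text": "Example Advisers LLC serves as the fund's investment adviser.",
    "target_triples": [
        {"s": "fund_1", "p": "advisedBy", "o": "adviser_1", "grounded": True},
        {"s": "fund_1", "p": "transferAgent", "o": "agent_1", "grounded": False},
    ],
    "stats": {"n_triples": 2, "n_grounded": 1},
})

record = json.loads(line)
filtered = keep_grounded(record)
print([t["p"] for t in filtered["target_triples"]])
```

This sketch leaves `target_serialized` and `target_serialized_plain` untouched, so those strings would still cover all triples. A training pipeline that filters on `grounded` would presumably need to re-serialize the kept triples.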
Holdings edges (`holds`/`issuedBy`/`domiciledIn`, gold from N-PORT) are a planned
second track from annual-report (N-CSR) commentary — see the description PDF.
## Data sources


@ -394,6 +394,39 @@ def serialize_triples_plain(triples, entities) -> str:
    return "\n".join(chunks)


_GND_SUFFIX = re.compile(
    r"\b(llc|l\.l\.c|inc|incorporated|corp|corporation|company|co|ltd|limited"
    r"|lp|l\.p|llp|n\.a|na|the|trust)\b", re.I)


def _gnorm(s: str) -> str:
    """Normalize a name for prose-grounding checks (lowercase, strip legal suffixes)."""
    s = _GND_SUFFIX.sub(" ", (s or "").lower())
    s = re.sub(r"[^a-z0-9]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()


def annotate_grounding(triples, entities, text):
    """Mark each triple as prose-grounded if its object name appears in `text`.

    A triple is only useful as a TEXT->triple training target if the fact can be
    found in the input. This adds a boolean `grounded` to each triple (the
    normalized object label occurs as a substring of the normalized input), so
    training/eval can restrict to grounded triples. Returns (annotated_triples,
    n_grounded). Mutates copies, not the input dicts.
    """
    ntext = _gnorm(text)
    out = []
    n_grounded = 0
    for t in triples:
        olbl = entities.get(t["o"], {}).get("label", "")
        no = _gnorm(olbl)
        g = bool(no) and no in ntext
        n_grounded += int(g)
        out.append({**t, "grounded": g})
    return out, n_grounded


def ontology_schema(triples, entities) -> dict:
    """Inferred meta-schema (subject type -> predicate -> object type), per thesis 5.3."""
    schema = defaultdict(lambda: defaultdict(set))
@ -580,6 +613,7 @@ def _build_samples_per_fund():
if not segs: # whole-trust fallback (no section located)
n_fallback_trusts += 1
triples = g["triples"]
triples, n_gnd = annotate_grounding(triples, ents, text)
target = serialize_triples(triples, ents)
target_plain = serialize_triples_plain(triples, ents)
rec = {
@ -590,7 +624,8 @@ def _build_samples_per_fund():
"target_serialized": target,
"target_serialized_plain": target_plain,
"stats": {"input_chars": len(text), "target_chars": len(target),
"n_triples": len(triples), "n_entities": len(ents),
"n_triples": len(triples), "n_grounded": n_gnd,
"n_entities": len(ents),
"text_to_json_ratio": round(len(text) / max(1, len(target)), 1)},
}
out.write(json.dumps(rec, ensure_ascii=False) + "\n")
@ -611,6 +646,7 @@ def _build_samples_per_fund():
for t in triples:
ref.add(t["s"]); ref.add(t["o"])
sub_ents = {k: ents[k] for k in ref if k in ents}
triples, n_gnd = annotate_grounding(triples, sub_ents, seg)
target = serialize_triples(triples, sub_ents)
target_plain = serialize_triples_plain(triples, sub_ents)
rec = {
@ -623,7 +659,8 @@ def _build_samples_per_fund():
"target_serialized": target,
"target_serialized_plain": target_plain,
"stats": {"input_chars": len(seg), "target_chars": len(target),
"n_triples": len(triples), "n_entities": len(sub_ents),
"n_triples": len(triples), "n_grounded": n_gnd,
"n_entities": len(sub_ents),
"text_to_json_ratio": round(len(seg) / max(1, len(target)), 1)},
}
out.write(json.dumps(rec, ensure_ascii=False) + "\n")

File diff suppressed because one or more lines are too long

Binary file not shown.


@ -189,11 +189,14 @@ Fund & \code{seriesOf} & Trust \\
Fund & \code{advisedBy} & InvestmentAdviser \\
Fund & \code{subAdvisedBy} & SubAdviser \\
Fund & \code{transferAgent} & TransferAgent \\
Fund & \code{custodian} & Custodian \\
Fund & \code{administrator} & Administrator \\
Trust & \code{underwrittenBy} & Distributor \\
\addlinespace
Fund & \code{holds} & Security \quad(holdings sub-graph) \\
\multicolumn{3}{@{}l}{\emph{dropped (not prose-grounded, see \S\ref{sec:baselines}):}}\\
Fund & \code{custodian} & Custodian \\
\addlinespace
\multicolumn{3}{@{}l}{\emph{holdings sub-graph (planned 2nd track):}}\\
Fund & \code{holds} & Security \\
Security & \code{issuedBy} & Issuer \\
Security & \code{domiciledIn} & Country \\
Fund & \code{tracksIndex} & Index \\
@ -220,7 +223,6 @@ realised in that sample, e.g.
"advisedBy": ["InvestmentAdviser"],
"subAdvisedBy": ["SubAdviser"],
"transferAgent": ["TransferAgent"],
"custodian": ["Custodian"],
"administrator": ["Administrator"]
},
"Trust": { "underwrittenBy": ["Distributor"] }
@ -242,7 +244,7 @@ realised in that sample, e.g.
\node[ent,above=of trust] (dist) {Distributor};
\node[ent,left=of fund] (adv) {Investment\\Adviser};
\node[ent,below=16mm of adv] (sub) {Sub-\\Adviser};
\node[ent,below=of fund] (cust) {Custodian};
\node[ent,dashed,fill=black!5,below=of fund] (cust) {Custodian};
\node[ent,right=24mm of cust] (ta) {Transfer\\Agent};
\node[ent,right=of fund] (admin) {Administrator};
% --- holdings cluster (far right, separated column) ---
@ -254,7 +256,7 @@ realised in that sample, e.g.
\draw[edge] (trust) -- node[lbl]{underwrittenBy} (dist);
\draw[edge] (fund) -- node[lbl]{advisedBy} (adv);
\draw[edge] (fund) -- node[lbl,pos=0.55]{subAdvisedBy} (sub);
\draw[edge] (fund) -- node[lbl]{custodian} (cust);
\draw[edge,dashed,gray] (fund) -- node[lbl]{custodian} (cust);
\draw[edge] (fund) -- node[lbl,pos=0.55]{transferAgent} (ta);
\draw[edge] (fund) -- node[lbl]{administrator} (admin);
% holds: arc from the Fund's top, over the Administrator, down to Security
@ -264,9 +266,10 @@ realised in that sample, e.g.
\draw[edge] (sec) -- node[lbl]{domiciledIn} (ctry);
\end{tikzpicture}
\caption{Schematic of the target knowledge graph. Left and centre: the
service-provider/structure graph grounded in the prospectus prose. Right column
(Issuer--Security--Country): the holdings sub-graph grounded in annual-report
commentary with N-PORT gold.}
service-provider/structure graph grounded in the prospectus prose. The dashed
\code{custodian} edge is \emph{dropped} from the dataset (not prose-grounded,
\S\ref{sec:baselines}). Right column (Issuer--Security--Country): the holdings
sub-graph grounded in annual-report commentary with N-PORT gold.}
\label{fig:graph}
\end{figure}
@ -322,18 +325,22 @@ thesis (Section~5.2), in which four special tokens delimit triple components and
shared subjects/predicates are factored out, mirroring Turtle's predicate-object
lists:
\begin{lstlisting}[language=,caption={Target serialization for one segmented fund (primary-custodian scope).},captionpos=b]
<triple_start> Small Cap Special Values Fund
\begin{lstlisting}[language=,caption={Target serialization for one segmented fund (custodian dropped).},captionpos=b]
<triple_start> John Hancock Bond Fund
<predicate_marker> seriesOf
<object_marker> VALIC Co I
<object_marker> JOHN HANCOCK SOVEREIGN BOND FUND
<predicate_marker> advisedBy
<object_marker> The Variable Annuity Life Insurance Company
<object_marker> John Hancock Investment Management LLC
<predicate_marker> subAdvisedBy
<object_marker> SunAmerica Asset Management, LLC
<object_marker> Manulife Investment Management (US) LLC
<predicate_marker> administrator
<object_marker> SunAmerica Asset Management, LLC
<predicate_marker> custodian
<object_marker> State Street Bank and Trust Company
<object_marker> John Hancock Investment Management LLC
<predicate_marker> transferAgent
<object_marker> John Hancock Signature Services, Inc.
<triple_end>
<triple_start> JOHN HANCOCK SOVEREIGN BOND FUND
<predicate_marker> underwrittenBy
<object_marker> JOHN HANCOCK INVESTMENT MANAGEMENT DISTRIBUTORS LLC
<triple_end>
\end{lstlisting}
@ -348,8 +355,9 @@ question~1.
Each sample is thus a JSON record with: the input prose (\code{input\_text}), the
inferred ontology (\code{ontology}), the target triples as a structured list
(\code{target\_triples}) and in both serializations, the trust/series identifiers,
and size statistics.
(\code{target\_triples}, each triple carrying a \code{grounded} flag) and in both
serializations, the trust/series identifiers, and size statistics (including the
grounded-triple count).
% ====================================================================
\section{Per-fund segmentation}
@ -417,70 +425,64 @@ fetching more filings of the same CIK, and is left to the full-scale build.
The dataset offers two independent ground-truth regimes.
\paragraph{Model-free gold.} For \code{advisedBy}, \code{subAdvisedBy},
\code{transferAgent}, \code{custodian}, \code{administrator} and
\code{underwrittenBy}, the labels come directly from N-CEN; for \code{seriesOf}
and \code{hasShareClass}, from the Series/Class reference data; for \code{holds},
\code{issuedBy} and \code{domiciledIn}, from N-PORT. No model is involved in
producing these labels, which makes them an unusually trustworthy reference for a
generative-extraction benchmark.
\code{transferAgent}, \code{administrator} and \code{underwrittenBy}, the labels
come directly from N-CEN; for \code{seriesOf}, from the Series/Class reference
data; for the (planned) \code{holds}, \code{issuedBy} and \code{domiciledIn}, from
N-PORT. No model is involved in producing these labels, which makes them an
unusually trustworthy reference for a generative-extraction benchmark.
\paragraph{The custodian relation and edge scoping.} The custodian relation
illustrates a subtlety that any honest benchmark on this data must address. N-CEN
reports, for a global fund, not only its \emph{primary} custodian but the entire
chain of \emph{foreign sub-custodians} --- one bank per market it invests in.
These sub-custodians have two damaging properties. First, they are
\emph{unextractable}: they appear only in the N-CEN table and essentially never in
the prospectus prose (a naive string-match recovers $7\%$ of them), so keeping
them as targets asks the model to extract facts absent from its input. Second,
they \emph{dominate}: with \code{IS\_SUB\_CUSTODIAN${=}$Y} accounting for $88\%$
of custodian rows, they constitute roughly two thirds of \emph{all} edges in the
unfiltered graph, inflating both the target size and the training loss with noise.
The dataset therefore scopes the custodian relation to the \emph{primary}
custodian (\code{IS\_SUB\_CUSTODIAN${\neq}$Y}, a median of one per fund), which is
genuinely prose-grounded --- it is named in the prospectus or its Statement of
Additional Information (e.g.\ ``State Street Bank and Trust Company serves as
custodian''). This single change reduces the corpus from $36{,}880$ to $15{,}739$
edges, all of prose-grounded relation types, and is the configurable default
(\code{--custodian-scope primary}). The full sub-custodian chain remains available
in N-CEN as a structured-only relation outside the text-to-triples task. This is a
dataset-quality decision of the same kind the thesis notes for T-REx and REBEL,
whose non-exhaustive references unfairly penalise correct extractions.
\paragraph{Why the custodian relation is dropped.} A text-to-triple target is only
useful if the fact can be found in the input text. The custodian relation fails
this test and is therefore \emph{excluded} from the dataset
(\code{--custodian-scope none}). N-CEN reports, for a global fund, the entire chain
of \emph{foreign sub-custodians} --- one bank per market it invests in (Citibank
Brazil, Banco de Chile, Cititrust Colombia, \dots). These names appear only in the
structured N-CEN table and in \emph{no} prose document: the summary prospectus
refers to the custodian only generically (``\dots\ including the adviser, the
custodian, and the transfer agent\dots'') and never names it. Even the
\emph{primary} custodian is typically named only in the separately filed Statement
of Additional Information (N-1A Part~B), which is not part of the fetched input.
Measured per fund, the custodian object name occurs in the fund's own prospectus
segment only $28\%$ of the time --- by far the weakest of all relations --- so
keeping it would systematically ask the model to extract facts absent from its
input. The full custodian chain remains available in N-CEN as a structured-only
relation, outside the text-to-triples task; recovering it from text would require
adding the SAI as an input source (a separate crawl).
\paragraph{No-model lower bound.} A trivial string-matching baseline --- emit a
gold edge iff the object's name occurs in the prose --- establishes a \emph{floor}
and measures \emph{how prose-grounded each relation is}. Table~\ref{tab:baseline}
reports this on the proof-of-concept slice after primary-custodian scoping,
multi-book fetching and per-fund segmentation. Because the baseline requires an
\emph{exact substring} match within the fund's \emph{own} section, its recall is a
strict lower bound: a fund's adviser, for instance, must be named in that fund's
segment under a literal spelling. On the full quarter the adviser is recovered
with recall $0.93$ and the micro-averaged $F_1$ reaches $0.79$; the recovered
custodian recall is $0.63$ (up from $0.07$ unscoped and $0.37$ after scoping
alone). The residual gap from $1.0$ is attributable to surface-form variation
(``State Street Bank and Trust Company'' vs.\ ``State Street'') that a trained
model handles but exact matching does not.
\paragraph{Per-triple grounding flag.} Because even the retained relations are not
\emph{always} named in a fund's segment, every triple carries a boolean
\code{grounded} flag: true iff the normalized object name occurs in that sample's
input text. This lets training and evaluation restrict to grounded triples rather
than silently carrying unextractable targets. Across the full build, $80\%$ of
triples are grounded; per relation the rate ranges from $93\%$ (\code{advisedBy})
down to $62\%$ (\code{underwrittenBy}), as shown in Table~\ref{tab:baseline}.
\paragraph{No-model lower bound.} The grounding flag is itself a trivial,
model-free baseline (emit a gold edge iff its object name occurs in the prose);
its per-relation rate is a strict lower bound on recall, since it requires an
\emph{exact substring} match within the fund's own segment and so misses surface
variants (``State Street Bank and Trust Company'' vs.\ ``State Street'') that a
trained model handles. Table~\ref{tab:baseline} reports it on the full build.
\begin{table}[h]
\centering
\caption{No-model string-match baseline on the full 2025\,Q3 build, after
primary-custodian scoping, multi-book fetching and per-fund segmentation
($852$ samples). Precision is $1.00$ by construction; recall is a strict
exact-match lower bound.}
\caption{Per-relation prose-grounding on the full 2025\,Q3 build ($852$ samples,
$8{,}824$ triples; custodian dropped). ``Grounded'' = object name present in the
sample's input; a strict, model-free lower bound on recall.}
\label{tab:baseline}
\small
\begin{tabular}{@{}lrr@{}}
\toprule
Relation & Recall & Gold edges \\
Relation & Triples & Grounded \\
\midrule
\code{advisedBy} & 0.93 & 1{,}673 \\
\code{seriesOf} & 0.84 & 1{,}555 \\
\code{subAdvisedBy} & 0.84 & 946 \\
\code{administrator} & 0.80 & 2{,}066 \\
\code{transferAgent} & 0.72 & 1{,}721 \\
\code{custodian} & 0.63 & 1{,}761 \\
\code{underwrittenBy} & 0.62 & 863 \\
\code{advisedBy} & 1{,}673 & 93\% \\
\code{seriesOf} & 1{,}555 & 84\% \\
\code{subAdvisedBy} & 946 & 84\% \\
\code{administrator} & 2{,}066 & 80\% \\
\code{transferAgent} & 1{,}721 & 72\% \\
\code{underwrittenBy} & 863 & 62\% \\
\midrule
micro-average & 0.65 & 6{,}479 \\
all & 8{,}824 & 80\% \\
\bottomrule
\end{tabular}
\end{table}
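
As a rough consistency check, the overall rate can be approximately recovered as the triple-weighted mean of the per-relation rates in Table~\ref{tab:baseline}. Because the sketch below uses the rounded percentages from the table, the result is only approximate:
\[
\frac{0.93\cdot 1673 + 0.84\cdot 1555 + 0.84\cdot 946 + 0.80\cdot 2066
      + 0.72\cdot 1721 + 0.62\cdot 863}{8824}
\approx \frac{7084}{8824} \approx 0.80 .
\]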
@ -497,21 +499,22 @@ quantified rather than assumed.
\section{Corpus statistics}
% ====================================================================
Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. The
N-CEN gold graph (after primary-custodian scoping) holds $15{,}739$
Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. With
the custodian relation dropped, the N-CEN gold graph holds $12{,}694$
entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all full
prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$
closed-end or interval funds file no standard prospectus) and applying the robust
per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus
$193$ whole-trust fallbacks). The segmented samples have a per-fund median ratio
near $117\!:\!1$ (input prose to target serialization), and across all samples the
$193$ whole-trust fallbacks) containing $8{,}824$ target triples, of which $80\%$
are prose-grounded. The segmented samples have a per-fund median ratio near
$117\!:\!1$ (input prose to target serialization), and across all samples the
median exceeds $400\!:\!1$ --- the inverse of the symmetric-size benchmarks.
\begin{table}[h]
\centering
\caption{Corpus statistics for the full 2025\,Q3 build. Left: N-CEN gold graph
(primary-custodian scope). Right: text-to-triple samples (all prospectus books per
trust, per-fund segmentation).}
\caption{Corpus statistics for the full 2025\,Q3 build (custodian dropped). Left:
N-CEN gold graph. Right: text-to-triple samples (all prospectus books per trust,
per-fund segmentation).}
\label{tab:stats}
\small
\begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
@ -520,9 +523,9 @@ trust, per-fund segmentation).}
\cmidrule(r){1-2}\cmidrule(l){3-4}
Trust graphs & 435 & Samples (total) & 852 \\
Funds (series) & 2{,}421 & \;segmented per-fund & 659 \\
Entity-entity edges & 15{,}739 & \;whole-trust fallback & 193 \\
\;custodian (primary) & 3{,}045 & Trusts fetched & 393 \\
\;advisedBy & 2{,}588 & Prospectus filings & 2{,}326 \\
Entity-entity edges & 12{,}694 & \;whole-trust fallback & 193 \\
\;administrator & 3{,}288 & Target triples & 8{,}824 \\
\;advisedBy & 2{,}588 & \;grounded & 80\% \\
Distributors & 458 & Ratio (median, per-fund) & $117\!:\!1$ \\
\bottomrule
\end{tabular}