Drop non-extractable custodian relation; add per-triple grounded flag

Foreign sub-custodian names appear only in structured N-CEN, never in the
prospectus prose, and the primary custodian is rarely named there either, so
custodian is not a valid text->triple target. Per fund, the custodian object
name occurs in only 28% of segments, the weakest of all relations. Default is
now --custodian-scope none.

Every triple now carries a 'grounded' boolean (object name present in the
sample's input text); 80% of triples are grounded across the full build. This
lets training/eval restrict to text-extractable targets.

- build_rdf_dataset.py: annotate_grounding() + grounded flag in samples/stats
- gold rebuilt without custodian (15,739 -> 12,694 edges)
- dataset_description + README updated (custodian dropped, grounding documented)

Reported by thesis author: Citibank custodians in triples for 0001529390 never
appear in that prospectus text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Florian Herzog 2026-06-05 10:34:14 +02:00
parent 63e650fa14
commit 00f51859e0
5 changed files with 566 additions and 512 deletions


@ -54,19 +54,33 @@ Each line of `samples.jsonl` / `train|val|test.jsonl` is a JSON record:
|---|---| |---|---|
| `input_text` | prospectus prose for the fund (model input) | | `input_text` | prospectus prose for the fund (model input) |
| `ontology` | inferred meta-schema (subject type → predicate → object type) | | `ontology` | inferred meta-schema (subject type → predicate → object type) |
| `target_triples` | structured `{s,p,o}` list | | `target_triples` | structured `{s,p,o,grounded}` list (`grounded` = object name appears in `input_text`) |
| `target_serialized` | marker form (`<triple_start>` …) for Models 2/4 | | `target_serialized` | marker form (`<triple_start>` …) for Models 2/4 |
| `target_serialized_plain` | Turtle-like form, no special tokens, for Models 1/3 | | `target_serialized_plain` | Turtle-like form, no special tokens, for Models 1/3 |
| `cik`, `series_id`, `fund`, `trust_name` | identifiers | | `cik`, `series_id`, `fund`, `trust_name` | identifiers |
| `stats` | input/target sizes, triple count, text:json ratio | | `stats` | input/target sizes, triple count, `n_grounded`, text:json ratio |
## Relations ## Relations
Entity-to-entity edges (gold from N-CEN / Series-Class): Entity-to-entity edges (gold from N-CEN / Series-Class):
`seriesOf`, `advisedBy`, `subAdvisedBy`, `transferAgent`, `custodian` (primary `seriesOf`, `advisedBy`, `subAdvisedBy`, `transferAgent`, `administrator`,
only by default), `administrator`, `underwrittenBy`. Holdings edges `underwrittenBy`.
(`holds`/`issuedBy`/`domiciledIn`, gold from N-PORT) are a planned second track
from annual-report (N-CSR) commentary — see the description PDF. **`custodian` is dropped by default** (`--custodian-scope none`): custodian names
— especially foreign sub-custodians — appear only in the structured N-CEN table
and in *no* prose document that is part of the input (the summary prospectus says only "the custodian"),
so they are not extractable from the input text. The primary custodian is typically named only in the
separately filed SAI (N-1A Part B), which is not part of the input. Use
`--custodian-scope primary` or `all` to re-include it if you add the SAI as input.
**Prose-grounding:** every triple carries a `grounded` flag (object name present
in the sample's input). Across the full build ~80 % of triples are grounded
(per relation: advisedBy 93 %, seriesOf/subAdvisedBy/administrator 80–84 %,
transferAgent 72 %, underwrittenBy 62 %). Filter on `grounded` to train/evaluate
only on text-extractable targets.
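
A minimal sketch of such a filter, assuming each JSONL record follows the field table above (`target_triples` as a list of `{s,p,o,grounded}` dicts). The helper names `grounded_only` and `iter_grounded` are illustrative and not part of `build_rdf_dataset.py`:

```python
import json
from typing import Iterable, Iterator


def grounded_only(record: dict) -> dict:
    """Return a shallow copy of `record` keeping only grounded target triples."""
    kept = [t for t in record.get("target_triples", []) if t.get("grounded")]
    out = dict(record)  # copy, so the caller's record is not modified
    out["target_triples"] = kept
    return out


def iter_grounded(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL lines and yield filtered records, skipping ones left empty."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        rec = grounded_only(json.loads(line))
        if rec["target_triples"]:
            yield rec


if __name__ == "__main__":
    demo = [json.dumps({"fund": "Demo Fund", "target_triples": [
        {"s": "f1", "p": "advisedBy", "o": "e1", "grounded": True},
        {"s": "f1", "p": "transferAgent", "o": "e2", "grounded": False},
    ]})]
    for rec in iter_grounded(demo):
        print(rec["fund"], [t["p"] for t in rec["target_triples"]])
```

In this sketch `target_serialized`, `target_serialized_plain` and `stats` are passed through unchanged; they would have to be regenerated from the filtered list if they are used as training targets.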
Holdings edges (`holds`/`issuedBy`/`domiciledIn`, gold from N-PORT) are a planned
second track from annual-report (N-CSR) commentary — see the description PDF.
## Data sources ## Data sources


@ -394,6 +394,39 @@ def serialize_triples_plain(triples, entities) -> str:
return "\n".join(chunks) return "\n".join(chunks)
_GND_SUFFIX = re.compile(
    r"\b(llc|l\.l\.c|inc|incorporated|corp|corporation|company|co|ltd|limited"
    r"|lp|l\.p|llp|n\.a|na|the|trust)\b", re.I)


def _gnorm(s: str) -> str:
    """Normalize a name for prose-grounding checks (lowercase, strip legal suffixes)."""
    s = _GND_SUFFIX.sub(" ", (s or "").lower())
    s = re.sub(r"[^a-z0-9]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()


def annotate_grounding(triples, entities, text):
    """Mark each triple as prose-grounded if its object name appears in `text`.

    A triple is only useful as a TEXT->triple training target if the fact can be
    found in the input. This adds a boolean `grounded` to each triple (the
    normalized object label occurs as a substring of the normalized input), so
    training/eval can restrict to grounded triples. Returns (annotated_triples,
    n_grounded). Mutates copies, not the input dicts.
    """
    ntext = _gnorm(text)
    out = []
    n_grounded = 0
    for t in triples:
        olbl = entities.get(t["o"], {}).get("label", "")
        no = _gnorm(olbl)
        g = bool(no) and no in ntext
        n_grounded += int(g)
        out.append({**t, "grounded": g})
    return out, n_grounded
def ontology_schema(triples, entities) -> dict: def ontology_schema(triples, entities) -> dict:
"""Inferred meta-schema (subject type -> predicate -> object type), per thesis 5.3.""" """Inferred meta-schema (subject type -> predicate -> object type), per thesis 5.3."""
schema = defaultdict(lambda: defaultdict(set)) schema = defaultdict(lambda: defaultdict(set))
@ -580,6 +613,7 @@ def _build_samples_per_fund():
if not segs: # whole-trust fallback (no section located) if not segs: # whole-trust fallback (no section located)
n_fallback_trusts += 1 n_fallback_trusts += 1
triples = g["triples"] triples = g["triples"]
triples, n_gnd = annotate_grounding(triples, ents, text)
target = serialize_triples(triples, ents) target = serialize_triples(triples, ents)
target_plain = serialize_triples_plain(triples, ents) target_plain = serialize_triples_plain(triples, ents)
rec = { rec = {
@ -590,7 +624,8 @@ def _build_samples_per_fund():
"target_serialized": target, "target_serialized": target,
"target_serialized_plain": target_plain, "target_serialized_plain": target_plain,
"stats": {"input_chars": len(text), "target_chars": len(target), "stats": {"input_chars": len(text), "target_chars": len(target),
"n_triples": len(triples), "n_entities": len(ents), "n_triples": len(triples), "n_grounded": n_gnd,
"n_entities": len(ents),
"text_to_json_ratio": round(len(text) / max(1, len(target)), 1)}, "text_to_json_ratio": round(len(text) / max(1, len(target)), 1)},
} }
out.write(json.dumps(rec, ensure_ascii=False) + "\n") out.write(json.dumps(rec, ensure_ascii=False) + "\n")
@ -611,6 +646,7 @@ def _build_samples_per_fund():
for t in triples: for t in triples:
ref.add(t["s"]); ref.add(t["o"]) ref.add(t["s"]); ref.add(t["o"])
sub_ents = {k: ents[k] for k in ref if k in ents} sub_ents = {k: ents[k] for k in ref if k in ents}
triples, n_gnd = annotate_grounding(triples, sub_ents, seg)
target = serialize_triples(triples, sub_ents) target = serialize_triples(triples, sub_ents)
target_plain = serialize_triples_plain(triples, sub_ents) target_plain = serialize_triples_plain(triples, sub_ents)
rec = { rec = {
@ -623,7 +659,8 @@ def _build_samples_per_fund():
"target_serialized": target, "target_serialized": target,
"target_serialized_plain": target_plain, "target_serialized_plain": target_plain,
"stats": {"input_chars": len(seg), "target_chars": len(target), "stats": {"input_chars": len(seg), "target_chars": len(target),
"n_triples": len(triples), "n_entities": len(sub_ents), "n_triples": len(triples), "n_grounded": n_gnd,
"n_entities": len(sub_ents),
"text_to_json_ratio": round(len(seg) / max(1, len(target)), 1)}, "text_to_json_ratio": round(len(seg) / max(1, len(target)), 1)},
} }
out.write(json.dumps(rec, ensure_ascii=False) + "\n") out.write(json.dumps(rec, ensure_ascii=False) + "\n")

File diff suppressed because one or more lines are too long

Binary file not shown.


@ -189,11 +189,14 @@ Fund & \code{seriesOf} & Trust \\
Fund & \code{advisedBy} & InvestmentAdviser \\ Fund & \code{advisedBy} & InvestmentAdviser \\
Fund & \code{subAdvisedBy} & SubAdviser \\ Fund & \code{subAdvisedBy} & SubAdviser \\
Fund & \code{transferAgent} & TransferAgent \\ Fund & \code{transferAgent} & TransferAgent \\
Fund & \code{custodian} & Custodian \\
Fund & \code{administrator} & Administrator \\ Fund & \code{administrator} & Administrator \\
Trust & \code{underwrittenBy} & Distributor \\ Trust & \code{underwrittenBy} & Distributor \\
\addlinespace \addlinespace
Fund & \code{holds} & Security \quad(holdings sub-graph) \\ \multicolumn{3}{@{}l}{\emph{dropped (not prose-grounded, see \S\ref{sec:baselines}):}}\\
Fund & \code{custodian} & Custodian \\
\addlinespace
\multicolumn{3}{@{}l}{\emph{holdings sub-graph (planned 2nd track):}}\\
Fund & \code{holds} & Security \\
Security & \code{issuedBy} & Issuer \\ Security & \code{issuedBy} & Issuer \\
Security & \code{domiciledIn} & Country \\ Security & \code{domiciledIn} & Country \\
Fund & \code{tracksIndex} & Index \\ Fund & \code{tracksIndex} & Index \\
@ -220,7 +223,6 @@ realised in that sample, e.g.
"advisedBy": ["InvestmentAdviser"], "advisedBy": ["InvestmentAdviser"],
"subAdvisedBy": ["SubAdviser"], "subAdvisedBy": ["SubAdviser"],
"transferAgent": ["TransferAgent"], "transferAgent": ["TransferAgent"],
"custodian": ["Custodian"],
"administrator": ["Administrator"] "administrator": ["Administrator"]
}, },
"Trust": { "underwrittenBy": ["Distributor"] } "Trust": { "underwrittenBy": ["Distributor"] }
@ -242,7 +244,7 @@ realised in that sample, e.g.
\node[ent,above=of trust] (dist) {Distributor}; \node[ent,above=of trust] (dist) {Distributor};
\node[ent,left=of fund] (adv) {Investment\\Adviser}; \node[ent,left=of fund] (adv) {Investment\\Adviser};
\node[ent,below=16mm of adv] (sub) {Sub-\\Adviser}; \node[ent,below=16mm of adv] (sub) {Sub-\\Adviser};
\node[ent,below=of fund] (cust) {Custodian}; \node[ent,dashed,fill=black!5,below=of fund] (cust) {Custodian};
\node[ent,right=24mm of cust] (ta) {Transfer\\Agent}; \node[ent,right=24mm of cust] (ta) {Transfer\\Agent};
\node[ent,right=of fund] (admin) {Administrator}; \node[ent,right=of fund] (admin) {Administrator};
% --- holdings cluster (far right, separated column) --- % --- holdings cluster (far right, separated column) ---
@ -254,7 +256,7 @@ realised in that sample, e.g.
\draw[edge] (trust) -- node[lbl]{underwrittenBy} (dist); \draw[edge] (trust) -- node[lbl]{underwrittenBy} (dist);
\draw[edge] (fund) -- node[lbl]{advisedBy} (adv); \draw[edge] (fund) -- node[lbl]{advisedBy} (adv);
\draw[edge] (fund) -- node[lbl,pos=0.55]{subAdvisedBy} (sub); \draw[edge] (fund) -- node[lbl,pos=0.55]{subAdvisedBy} (sub);
\draw[edge] (fund) -- node[lbl]{custodian} (cust); \draw[edge,dashed,gray] (fund) -- node[lbl]{custodian} (cust);
\draw[edge] (fund) -- node[lbl,pos=0.55]{transferAgent} (ta); \draw[edge] (fund) -- node[lbl,pos=0.55]{transferAgent} (ta);
\draw[edge] (fund) -- node[lbl]{administrator} (admin); \draw[edge] (fund) -- node[lbl]{administrator} (admin);
% holds: arc from the Fund's top, over the Administrator, down to Security % holds: arc from the Fund's top, over the Administrator, down to Security
@ -264,9 +266,10 @@ realised in that sample, e.g.
\draw[edge] (sec) -- node[lbl]{domiciledIn} (ctry); \draw[edge] (sec) -- node[lbl]{domiciledIn} (ctry);
\end{tikzpicture} \end{tikzpicture}
\caption{Schematic of the target knowledge graph. Left and centre: the \caption{Schematic of the target knowledge graph. Left and centre: the
service-provider/structure graph grounded in the prospectus prose. Right column service-provider/structure graph grounded in the prospectus prose. The dashed
(Issuer--Security--Country): the holdings sub-graph grounded in annual-report \code{custodian} edge is \emph{dropped} from the dataset (not prose-grounded,
commentary with N-PORT gold.} \S\ref{sec:baselines}). Right column (Issuer--Security--Country): the holdings
sub-graph grounded in annual-report commentary with N-PORT gold.}
\label{fig:graph} \label{fig:graph}
\end{figure} \end{figure}
@ -322,18 +325,22 @@ thesis (Section~5.2), in which four special tokens delimit triple components and
shared subjects/predicates are factored out, mirroring Turtle's predicate-object shared subjects/predicates are factored out, mirroring Turtle's predicate-object
lists: lists:
\begin{lstlisting}[language=,caption={Target serialization for one segmented fund (primary-custodian scope).},captionpos=b] \begin{lstlisting}[language=,caption={Target serialization for one segmented fund (custodian dropped).},captionpos=b]
<triple_start> Small Cap Special Values Fund <triple_start> John Hancock Bond Fund
<predicate_marker> seriesOf <predicate_marker> seriesOf
<object_marker> VALIC Co I <object_marker> JOHN HANCOCK SOVEREIGN BOND FUND
<predicate_marker> advisedBy <predicate_marker> advisedBy
<object_marker> The Variable Annuity Life Insurance Company <object_marker> John Hancock Investment Management LLC
<predicate_marker> subAdvisedBy <predicate_marker> subAdvisedBy
<object_marker> SunAmerica Asset Management, LLC <object_marker> Manulife Investment Management (US) LLC
<predicate_marker> administrator <predicate_marker> administrator
<object_marker> SunAmerica Asset Management, LLC <object_marker> John Hancock Investment Management LLC
<predicate_marker> custodian <predicate_marker> transferAgent
<object_marker> State Street Bank and Trust Company <object_marker> John Hancock Signature Services, Inc.
<triple_end>
<triple_start> JOHN HANCOCK SOVEREIGN BOND FUND
<predicate_marker> underwrittenBy
<object_marker> JOHN HANCOCK INVESTMENT MANAGEMENT DISTRIBUTORS LLC
<triple_end> <triple_end>
\end{lstlisting} \end{lstlisting}
@ -348,8 +355,9 @@ question~1.
Each sample is thus a JSON record with: the input prose (\code{input\_text}), the Each sample is thus a JSON record with: the input prose (\code{input\_text}), the
inferred ontology (\code{ontology}), the target triples as a structured list inferred ontology (\code{ontology}), the target triples as a structured list
(\code{target\_triples}) and in both serializations, the trust/series identifiers, (\code{target\_triples}, each triple carrying a \code{grounded} flag) and in both
and size statistics. serializations, the trust/series identifiers, and size statistics (including the
grounded-triple count).
% ==================================================================== % ====================================================================
\section{Per-fund segmentation} \section{Per-fund segmentation}
@ -417,70 +425,64 @@ fetching more filings of the same CIK, and is left to the full-scale build.
The dataset offers two independent ground-truth regimes. The dataset offers two independent ground-truth regimes.
\paragraph{Model-free gold.} For \code{advisedBy}, \code{subAdvisedBy}, \paragraph{Model-free gold.} For \code{advisedBy}, \code{subAdvisedBy},
\code{transferAgent}, \code{custodian}, \code{administrator} and \code{transferAgent}, \code{administrator} and \code{underwrittenBy}, the labels
\code{underwrittenBy}, the labels come directly from N-CEN; for \code{seriesOf} come directly from N-CEN; for \code{seriesOf}, from the Series/Class reference
and \code{hasShareClass}, from the Series/Class reference data; for \code{holds}, data; for the (planned) \code{holds}, \code{issuedBy} and \code{domiciledIn}, from
\code{issuedBy} and \code{domiciledIn}, from N-PORT. No model is involved in N-PORT. No model is involved in producing these labels, which makes them an
producing these labels, which makes them an unusually trustworthy reference for a unusually trustworthy reference for a generative-extraction benchmark.
generative-extraction benchmark.
\paragraph{The custodian relation and edge scoping.} The custodian relation \paragraph{Why the custodian relation is dropped.} A text-to-triple target is only
illustrates a subtlety that any honest benchmark on this data must address. N-CEN useful if the fact can be found in the input text. The custodian relation fails
reports, for a global fund, not only its \emph{primary} custodian but the entire this test and is therefore \emph{excluded} from the dataset
chain of \emph{foreign sub-custodians} --- one bank per market it invests in. (\code{--custodian-scope none}). N-CEN reports, for a global fund, the entire chain
These sub-custodians have two damaging properties. First, they are of \emph{foreign sub-custodians} --- one bank per market it invests in (Citibank
\emph{unextractable}: they appear only in the N-CEN table and essentially never in Brazil, Banco de Chile, Cititrust Colombia, \dots). These names appear only in the
the prospectus prose (a naive string-match recovers $7\%$ of them), so keeping structured N-CEN table and in \emph{no} prose document: the summary prospectus
them as targets asks the model to extract facts absent from its input. Second, refers to the custodian only generically (``\dots\ including the adviser, the
they \emph{dominate}: with \code{IS\_SUB\_CUSTODIAN${=}$Y} accounting for $88\%$ custodian, and the transfer agent\dots'') and never names it. Even the
of custodian rows, they constitute roughly two thirds of \emph{all} edges in the \emph{primary} custodian is typically named only in the separately filed Statement
unfiltered graph, inflating both the target size and the training loss with noise. of Additional Information (N-1A Part~B), which is not part of the fetched input.
The dataset therefore scopes the custodian relation to the \emph{primary} Measured per fund, the custodian object name occurs in the fund's own prospectus
custodian (\code{IS\_SUB\_CUSTODIAN${\neq}$Y}, a median of one per fund), which is segment only $28\%$ of the time --- by far the weakest of all relations --- so
genuinely prose-grounded --- it is named in the prospectus or its Statement of keeping it would systematically ask the model to extract facts absent from its
Additional Information (e.g.\ ``State Street Bank and Trust Company serves as input. The full custodian chain remains available in N-CEN as a structured-only
custodian''). This single change reduces the corpus from $36{,}880$ to $15{,}739$ relation, outside the text-to-triples task; recovering it from text would require
edges, all of prose-grounded relation types, and is the configurable default adding the SAI as an input source (a separate crawl).
(\code{--custodian-scope primary}). The full sub-custodian chain remains available
in N-CEN as a structured-only relation outside the text-to-triples task. This is a
dataset-quality decision of the same kind the thesis notes for T-REx and REBEL,
whose non-exhaustive references unfairly penalise correct extractions.
\paragraph{No-model lower bound.} A trivial string-matching baseline --- emit a \paragraph{Per-triple grounding flag.} Because even the retained relations are not
gold edge iff the object's name occurs in the prose --- establishes a \emph{floor} \emph{always} named in a fund's segment, every triple carries a boolean
and measures \emph{how prose-grounded each relation is}. Table~\ref{tab:baseline} \code{grounded} flag: true iff the normalized object name occurs in that sample's
reports this on the proof-of-concept slice after primary-custodian scoping, input text. This lets training and evaluation restrict to grounded triples rather
multi-book fetching and per-fund segmentation. Because the baseline requires an than silently carrying unextractable targets. Across the full build, $80\%$ of
\emph{exact substring} match within the fund's \emph{own} section, its recall is a triples are grounded; per relation the rate ranges from $93\%$ (\code{advisedBy})
strict lower bound: a fund's adviser, for instance, must be named in that fund's down to $62\%$ (\code{underwrittenBy}), as shown in Table~\ref{tab:baseline}.
segment under a literal spelling. On the full quarter the adviser is recovered
with recall $0.93$ and the micro-averaged $F_1$ reaches $0.79$; the recovered \paragraph{No-model lower bound.} The grounding flag is itself a trivial,
custodian recall is $0.63$ (up from $0.07$ unscoped and $0.37$ after scoping model-free baseline (emit a gold edge iff its object name occurs in the prose);
alone). The residual gap from $1.0$ is attributable to surface-form variation its per-relation rate is a strict lower bound on recall, since it requires an
(``State Street Bank and Trust Company'' vs.\ ``State Street'') that a trained \emph{exact substring} match within the fund's own segment and so misses surface
model handles but exact matching does not. variants (``State Street Bank and Trust Company'' vs.\ ``State Street'') that a
trained model handles. Table~\ref{tab:baseline} reports it on the full build.
\begin{table}[h] \begin{table}[h]
\centering \centering
\caption{No-model string-match baseline on the full 2025\,Q3 build, after \caption{Per-relation prose-grounding on the full 2025\,Q3 build ($852$ samples,
primary-custodian scoping, multi-book fetching and per-fund segmentation $8{,}824$ triples; custodian dropped). ``Grounded'' = object name present in the
($852$ samples). Precision is $1.00$ by construction; recall is a strict sample's input; a strict, model-free lower bound on recall.}
exact-match lower bound.}
\label{tab:baseline} \label{tab:baseline}
\small \small
\begin{tabular}{@{}lrr@{}} \begin{tabular}{@{}lrr@{}}
\toprule \toprule
Relation & Recall & Gold edges \\ Relation & Triples & Grounded \\
\midrule \midrule
\code{advisedBy} & 0.93 & 1{,}673 \\ \code{advisedBy} & 1{,}673 & 93\% \\
\code{seriesOf} & 0.84 & 1{,}555 \\ \code{seriesOf} & 1{,}555 & 84\% \\
\code{subAdvisedBy} & 0.84 & 946 \\ \code{subAdvisedBy} & 946 & 84\% \\
\code{administrator} & 0.80 & 2{,}066 \\ \code{administrator} & 2{,}066 & 80\% \\
\code{transferAgent} & 0.72 & 1{,}721 \\ \code{transferAgent} & 1{,}721 & 72\% \\
\code{custodian} & 0.63 & 1{,}761 \\ \code{underwrittenBy} & 863 & 62\% \\
\code{underwrittenBy} & 0.62 & 863 \\
\midrule \midrule
micro-average & 0.65 & 6{,}479 \\ all & 8{,}824 & 80\% \\
\bottomrule \bottomrule
\end{tabular} \end{tabular}
\end{table} \end{table}
@ -497,21 +499,22 @@ quantified rather than assumed.
\section{Corpus statistics} \section{Corpus statistics}
% ==================================================================== % ====================================================================
Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. The Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. With
N-CEN gold graph (after primary-custodian scoping) holds $15{,}739$ the custodian relation dropped, the N-CEN gold graph holds $12{,}694$
entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all full entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all full
prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$ prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$
closed-end or interval funds file no standard prospectus) and applying the robust closed-end or interval funds file no standard prospectus) and applying the robust
per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus
$193$ whole-trust fallbacks). The segmented samples have a per-fund median ratio $193$ whole-trust fallbacks) containing $8{,}824$ target triples, of which $80\%$
near $117\!:\!1$ (input prose to target serialization), and across all samples the are prose-grounded. The segmented samples have a per-fund median ratio near
$117\!:\!1$ (input prose to target serialization), and across all samples the
median exceeds $400\!:\!1$ --- the inverse of the symmetric-size benchmarks. median exceeds $400\!:\!1$ --- the inverse of the symmetric-size benchmarks.
\begin{table}[h] \begin{table}[h]
\centering \centering
\caption{Corpus statistics for the full 2025\,Q3 build. Left: N-CEN gold graph \caption{Corpus statistics for the full 2025\,Q3 build (custodian dropped). Left:
(primary-custodian scope). Right: text-to-triple samples (all prospectus books per N-CEN gold graph. Right: text-to-triple samples (all prospectus books per trust,
trust, per-fund segmentation).} per-fund segmentation).}
\label{tab:stats} \label{tab:stats}
\small \small
\begin{tabular}{@{}lr@{\hskip 3em}lr@{}} \begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
@ -520,9 +523,9 @@ trust, per-fund segmentation).}
\cmidrule(r){1-2}\cmidrule(l){3-4} \cmidrule(r){1-2}\cmidrule(l){3-4}
Trust graphs & 435 & Samples (total) & 852 \\ Trust graphs & 435 & Samples (total) & 852 \\
Funds (series) & 2{,}421 & \;segmented per-fund & 659 \\ Funds (series) & 2{,}421 & \;segmented per-fund & 659 \\
Entity-entity edges & 15{,}739 & \;whole-trust fallback & 193 \\ Entity-entity edges & 12{,}694 & \;whole-trust fallback & 193 \\
\;custodian (primary) & 3{,}045 & Trusts fetched & 393 \\ \;administrator & 3{,}288 & Target triples & 8{,}824 \\
\;advisedBy & 2{,}588 & Prospectus filings & 2{,}326 \\ \;advisedBy & 2{,}588 & \;grounded & 80\% \\
Distributors & 458 & Ratio (median, per-fund) & $117\!:\!1$ \\ Distributors & 458 & Ratio (median, per-fund) & $117\!:\!1$ \\
\bottomrule \bottomrule
\end{tabular} \end{tabular}