Drop non-extractable custodian relation; add per-triple grounded flag

Custodian names (esp. foreign sub-custodians) appear only in structured N-CEN, never in the prospectus prose, so they are not a valid text->triple target. Per-fund the custodian object name occurs in only 28% of segments, the weakest of all relations. Default is now --custodian-scope none.

Every triple now carries a 'grounded' boolean (object name present in the sample's input text); 80% of triples are grounded across the full build. This lets training/eval restrict to text-extractable targets.

- build_rdf_dataset.py: annotate_grounding() + grounded flag in samples/stats
- gold rebuilt without custodian (15,739 -> 12,694 edges)
- dataset_description + README updated (custodian dropped, grounding documented)

Reported by thesis author: Citibank custodians in triples for 0001529390 never appear in that prospectus text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 63e650fa14
commit 00f51859e0

26 README.md
@@ -54,19 +54,33 @@ Each line of `samples.jsonl` / `train|val|test.jsonl` is a JSON record:
 |---|---|
 | `input_text` | prospectus prose for the fund (model input) |
 | `ontology` | inferred meta-schema (subject type → predicate → object type) |
-| `target_triples` | structured `{s,p,o}` list |
+| `target_triples` | structured `{s,p,o,grounded}` list (`grounded` = object name appears in `input_text`) |
 | `target_serialized` | marker form (`<triple_start>` …) for Models 2/4 |
 | `target_serialized_plain` | Turtle-like form, no special tokens, for Models 1/3 |
 | `cik`, `series_id`, `fund`, `trust_name` | identifiers |
-| `stats` | input/target sizes, triple count, text:json ratio |
+| `stats` | input/target sizes, triple count, `n_grounded`, text:json ratio |

 ## Relations

 Entity-to-entity edges (gold from N-CEN / Series-Class):
-`seriesOf`, `advisedBy`, `subAdvisedBy`, `transferAgent`, `custodian` (primary
-only by default), `administrator`, `underwrittenBy`. Holdings edges
-(`holds`/`issuedBy`/`domiciledIn`, gold from N-PORT) are a planned second track
-from annual-report (N-CSR) commentary — see the description PDF.
+`seriesOf`, `advisedBy`, `subAdvisedBy`, `transferAgent`, `administrator`,
+`underwrittenBy`.
+
+**`custodian` is dropped by default** (`--custodian-scope none`): custodian names
+— especially foreign sub-custodians — appear only in the structured N-CEN table
+and in *no* prose document (the summary prospectus says only "the custodian"),
+so they are not extractable from text. The primary custodian is named only in the
+separately-filed SAI (N-1A Part B), which is not part of the input. Use
+`--custodian-scope primary` or `all` to re-include it if you add the SAI as input.
+
+**Prose-grounding:** every triple carries a `grounded` flag (object name present
+in the sample's input). Across the full build ~80 % of triples are grounded
+(per relation: advisedBy 93 %, seriesOf/subAdvisedBy/administrator 80–84 %,
+transferAgent 72 %, underwrittenBy 62 %). Filter on `grounded` to train/evaluate
+only on text-extractable targets.
+
+Holdings edges (`holds`/`issuedBy`/`domiciledIn`, gold from N-PORT) are a planned
+second track from annual-report (N-CSR) commentary — see the description PDF.

 ## Data sources

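Reviewer note: a minimal sketch of the `grounded` filter that the README change describes, assuming records shaped like the field table in that hunk (`target_triples` as a list of `{s,p,o,grounded}` dicts). The helper names `keep_grounded` and `iter_grounded` and all sample values are hypothetical; they are not part of `build_rdf_dataset.py`.

```python
import json


def keep_grounded(record: dict) -> dict:
    """Return a copy of `record` that keeps only triples flagged as grounded."""
    kept = [t for t in record["target_triples"] if t.get("grounded")]
    return {**record, "target_triples": kept}


def iter_grounded(lines):
    """Yield filtered records from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield keep_grounded(json.loads(line))


# Hypothetical one-record sample; identifiers and values are made up.
sample_line = json.dumps({
    "input_text": "The fund's adviser is Example Advisers LLC.",
    "target_triples": [
        {"s": "fund1", "p": "advisedBy", "o": "adv1", "grounded": True},
        {"s": "fund1", "p": "transferAgent", "o": "ta1", "grounded": False},
    ],
    "stats": {"n_triples": 2, "n_grounded": 1},
})

records = list(iter_grounded([sample_line]))
print(records[0]["target_triples"])
```

This sketch leaves `target_serialized` and `target_serialized_plain` untouched, so those strings would still contain ungrounded triples and would need to be regenerated with the project's own serializers before training on grounded targets only.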
@@ -394,6 +394,39 @@ def serialize_triples_plain(triples, entities) -> str:
     return "\n".join(chunks)


+_GND_SUFFIX = re.compile(
+    r"\b(llc|l\.l\.c|inc|incorporated|corp|corporation|company|co|ltd|limited"
+    r"|lp|l\.p|llp|n\.a|na|the|trust)\b", re.I)
+
+
+def _gnorm(s: str) -> str:
+    """Normalize a name for prose-grounding checks (lowercase, strip legal suffixes)."""
+    s = _GND_SUFFIX.sub(" ", (s or "").lower())
+    s = re.sub(r"[^a-z0-9]+", " ", s)
+    return re.sub(r"\s+", " ", s).strip()
+
+
+def annotate_grounding(triples, entities, text):
+    """Mark each triple as prose-grounded if its object name appears in `text`.
+
+    A triple is only useful as a TEXT->triple training target if the fact can be
+    found in the input. This adds a boolean `grounded` to each triple (the
+    normalized object label occurs as a substring of the normalized input), so
+    training/eval can restrict to grounded triples. Returns (annotated_triples,
+    n_grounded). Mutates copies, not the input dicts.
+    """
+    ntext = _gnorm(text)
+    out = []
+    n_grounded = 0
+    for t in triples:
+        olbl = entities.get(t["o"], {}).get("label", "")
+        no = _gnorm(olbl)
+        g = bool(no) and no in ntext
+        n_grounded += int(g)
+        out.append({**t, "grounded": g})
+    return out, n_grounded
+
+
 def ontology_schema(triples, entities) -> dict:
     """Inferred meta-schema (subject type -> predicate -> object type), per thesis 5.3."""
     schema = defaultdict(lambda: defaultdict(set))
@@ -580,6 +613,7 @@ def _build_samples_per_fund():
 if not segs:  # whole-trust fallback (no section located)
     n_fallback_trusts += 1
     triples = g["triples"]
+    triples, n_gnd = annotate_grounding(triples, ents, text)
     target = serialize_triples(triples, ents)
     target_plain = serialize_triples_plain(triples, ents)
     rec = {
@@ -590,7 +624,8 @@ def _build_samples_per_fund():
     "target_serialized": target,
     "target_serialized_plain": target_plain,
     "stats": {"input_chars": len(text), "target_chars": len(target),
-              "n_triples": len(triples), "n_entities": len(ents),
+              "n_triples": len(triples), "n_grounded": n_gnd,
+              "n_entities": len(ents),
               "text_to_json_ratio": round(len(text) / max(1, len(target)), 1)},
 }
 out.write(json.dumps(rec, ensure_ascii=False) + "\n")
@@ -611,6 +646,7 @@ def _build_samples_per_fund():
 for t in triples:
     ref.add(t["s"]); ref.add(t["o"])
 sub_ents = {k: ents[k] for k in ref if k in ents}
+triples, n_gnd = annotate_grounding(triples, sub_ents, seg)
 target = serialize_triples(triples, sub_ents)
 target_plain = serialize_triples_plain(triples, sub_ents)
 rec = {
@@ -623,7 +659,8 @@ def _build_samples_per_fund():
     "target_serialized": target,
     "target_serialized_plain": target_plain,
     "stats": {"input_chars": len(seg), "target_chars": len(target),
-              "n_triples": len(triples), "n_entities": len(sub_ents),
+              "n_triples": len(triples), "n_grounded": n_gnd,
+              "n_entities": len(sub_ents),
               "text_to_json_ratio": round(len(seg) / max(1, len(target)), 1)},
 }
 out.write(json.dumps(rec, ensure_ascii=False) + "\n")

File diff suppressed because one or more lines are too long
Binary file not shown.
@@ -189,11 +189,14 @@ Fund & \code{seriesOf} & Trust \\
 Fund & \code{advisedBy} & InvestmentAdviser \\
 Fund & \code{subAdvisedBy} & SubAdviser \\
 Fund & \code{transferAgent} & TransferAgent \\
-Fund & \code{custodian} & Custodian \\
 Fund & \code{administrator} & Administrator \\
 Trust & \code{underwrittenBy} & Distributor \\
 \addlinespace
-Fund & \code{holds} & Security \quad(holdings sub-graph) \\
+\multicolumn{3}{@{}l}{\emph{dropped (not prose-grounded, see \S\ref{sec:baselines}):}}\\
+Fund & \code{custodian} & Custodian \\
+\addlinespace
+\multicolumn{3}{@{}l}{\emph{holdings sub-graph (planned 2nd track):}}\\
+Fund & \code{holds} & Security \\
 Security & \code{issuedBy} & Issuer \\
 Security & \code{domiciledIn} & Country \\
 Fund & \code{tracksIndex} & Index \\
@@ -220,7 +223,6 @@ realised in that sample, e.g.
     "advisedBy": ["InvestmentAdviser"],
     "subAdvisedBy": ["SubAdviser"],
     "transferAgent": ["TransferAgent"],
-    "custodian": ["Custodian"],
     "administrator": ["Administrator"]
   },
   "Trust": { "underwrittenBy": ["Distributor"] }
@@ -242,7 +244,7 @@ realised in that sample, e.g.
 \node[ent,above=of trust] (dist) {Distributor};
 \node[ent,left=of fund] (adv) {Investment\\Adviser};
 \node[ent,below=16mm of adv] (sub) {Sub-\\Adviser};
-\node[ent,below=of fund] (cust) {Custodian};
+\node[ent,dashed,fill=black!5,below=of fund] (cust) {Custodian};
 \node[ent,right=24mm of cust] (ta) {Transfer\\Agent};
 \node[ent,right=of fund] (admin) {Administrator};
 % --- holdings cluster (far right, separated column) ---
@@ -254,7 +256,7 @@ realised in that sample, e.g.
 \draw[edge] (trust) -- node[lbl]{underwrittenBy} (dist);
 \draw[edge] (fund) -- node[lbl]{advisedBy} (adv);
 \draw[edge] (fund) -- node[lbl,pos=0.55]{subAdvisedBy} (sub);
-\draw[edge] (fund) -- node[lbl]{custodian} (cust);
+\draw[edge,dashed,gray] (fund) -- node[lbl]{custodian} (cust);
 \draw[edge] (fund) -- node[lbl,pos=0.55]{transferAgent} (ta);
 \draw[edge] (fund) -- node[lbl]{administrator} (admin);
 % holds: arc from the Fund's top, over the Administrator, down to Security
@@ -264,9 +266,10 @@ realised in that sample, e.g.
 \draw[edge] (sec) -- node[lbl]{domiciledIn} (ctry);
 \end{tikzpicture}
 \caption{Schematic of the target knowledge graph. Left and centre: the
-service-provider/structure graph grounded in the prospectus prose. Right column
-(Issuer--Security--Country): the holdings sub-graph grounded in annual-report
-commentary with N-PORT gold.}
+service-provider/structure graph grounded in the prospectus prose. The dashed
+\code{custodian} edge is \emph{dropped} from the dataset (not prose-grounded,
+\S\ref{sec:baselines}). Right column (Issuer--Security--Country): the holdings
+sub-graph grounded in annual-report commentary with N-PORT gold.}
 \label{fig:graph}
 \end{figure}

@@ -322,18 +325,22 @@ thesis (Section~5.2), in which four special tokens delimit triple components and
 shared subjects/predicates are factored out, mirroring Turtle's predicate-object
 lists:

-\begin{lstlisting}[language=,caption={Target serialization for one segmented fund (primary-custodian scope).},captionpos=b]
-<triple_start> Small Cap Special Values Fund
+\begin{lstlisting}[language=,caption={Target serialization for one segmented fund (custodian dropped).},captionpos=b]
+<triple_start> John Hancock Bond Fund
 <predicate_marker> seriesOf
-<object_marker> VALIC Co I
+<object_marker> JOHN HANCOCK SOVEREIGN BOND FUND
 <predicate_marker> advisedBy
-<object_marker> The Variable Annuity Life Insurance Company
+<object_marker> John Hancock Investment Management LLC
 <predicate_marker> subAdvisedBy
-<object_marker> SunAmerica Asset Management, LLC
+<object_marker> Manulife Investment Management (US) LLC
 <predicate_marker> administrator
-<object_marker> SunAmerica Asset Management, LLC
-<predicate_marker> custodian
-<object_marker> State Street Bank and Trust Company
+<object_marker> John Hancock Investment Management LLC
+<predicate_marker> transferAgent
+<object_marker> John Hancock Signature Services, Inc.
 <triple_end>
+<triple_start> JOHN HANCOCK SOVEREIGN BOND FUND
+<predicate_marker> underwrittenBy
+<object_marker> JOHN HANCOCK INVESTMENT MANAGEMENT DISTRIBUTORS LLC
+<triple_end>
 \end{lstlisting}

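Reviewer note: to make the marker form in the listing above concrete, here is a minimal sketch of a reader that expands it back into flat `(subject, predicate, object)` tuples. It assumes one marker per line, exactly as the listing is printed; the function name `parse_marker_serialization` is hypothetical, and the project's real serializer may use a different layout.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def parse_marker_serialization(text: str) -> List[Triple]:
    """Expand a marker-form target into flat (subject, predicate, object) tuples."""
    triples: List[Triple] = []
    subject = None
    predicate = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("<triple_start>"):
            subject = line[len("<triple_start>"):].strip()
            predicate = None  # a new subject block resets the predicate
        elif line.startswith("<predicate_marker>"):
            predicate = line[len("<predicate_marker>"):].strip()
        elif line.startswith("<object_marker>"):
            obj = line[len("<object_marker>"):].strip()
            if subject is not None and predicate is not None:
                triples.append((subject, predicate, obj))
        elif line.startswith("<triple_end>"):
            subject = predicate = None
    return triples


# Excerpt of the new listing from the diff (custodian dropped).
listing = """\
<triple_start> John Hancock Bond Fund
<predicate_marker> seriesOf
<object_marker> JOHN HANCOCK SOVEREIGN BOND FUND
<predicate_marker> advisedBy
<object_marker> John Hancock Investment Management LLC
<triple_end>
<triple_start> JOHN HANCOCK SOVEREIGN BOND FUND
<predicate_marker> underwrittenBy
<object_marker> JOHN HANCOCK INVESTMENT MANAGEMENT DISTRIBUTORS LLC
<triple_end>
"""

parsed = parse_marker_serialization(listing)
for s, p, o in parsed:
    print(s, "|", p, "|", o)
```

Because the subject and predicate are factored out, one `<triple_start>` block can yield several triples; the reader simply carries the most recent subject and predicate forward until the next marker replaces them.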
@@ -348,8 +355,9 @@ question~1.

 Each sample is thus a JSON record with: the input prose (\code{input\_text}), the
 inferred ontology (\code{ontology}), the target triples as a structured list
-(\code{target\_triples}) and in both serializations, the trust/series identifiers,
-and size statistics.
+(\code{target\_triples}, each triple carrying a \code{grounded} flag) and in both
+serializations, the trust/series identifiers, and size statistics (including the
+grounded-triple count).

 % ====================================================================
 \section{Per-fund segmentation}
@@ -417,70 +425,64 @@ fetching more filings of the same CIK, and is left to the full-scale build.
 The dataset offers two independent ground-truth regimes.

 \paragraph{Model-free gold.} For \code{advisedBy}, \code{subAdvisedBy},
-\code{transferAgent}, \code{custodian}, \code{administrator} and
-\code{underwrittenBy}, the labels come directly from N-CEN; for \code{seriesOf}
-and \code{hasShareClass}, from the Series/Class reference data; for \code{holds},
-\code{issuedBy} and \code{domiciledIn}, from N-PORT. No model is involved in
-producing these labels, which makes them an unusually trustworthy reference for a
-generative-extraction benchmark.
+\code{transferAgent}, \code{administrator} and \code{underwrittenBy}, the labels
+come directly from N-CEN; for \code{seriesOf}, from the Series/Class reference
+data; for the (planned) \code{holds}, \code{issuedBy} and \code{domiciledIn}, from
+N-PORT. No model is involved in producing these labels, which makes them an
+unusually trustworthy reference for a generative-extraction benchmark.

-\paragraph{The custodian relation and edge scoping.} The custodian relation
-illustrates a subtlety that any honest benchmark on this data must address. N-CEN
-reports, for a global fund, not only its \emph{primary} custodian but the entire
-chain of \emph{foreign sub-custodians} --- one bank per market it invests in.
-These sub-custodians have two damaging properties. First, they are
-\emph{unextractable}: they appear only in the N-CEN table and essentially never in
-the prospectus prose (a naive string-match recovers $7\%$ of them), so keeping
-them as targets asks the model to extract facts absent from its input. Second,
-they \emph{dominate}: with \code{IS\_SUB\_CUSTODIAN${=}$Y} accounting for $88\%$
-of custodian rows, they constitute roughly two thirds of \emph{all} edges in the
-unfiltered graph, inflating both the target size and the training loss with noise.
-The dataset therefore scopes the custodian relation to the \emph{primary}
-custodian (\code{IS\_SUB\_CUSTODIAN${\neq}$Y}, a median of one per fund), which is
-genuinely prose-grounded --- it is named in the prospectus or its Statement of
-Additional Information (e.g.\ ``State Street Bank and Trust Company serves as
-custodian''). This single change reduces the corpus from $36{,}880$ to $15{,}739$
-edges, all of prose-grounded relation types, and is the configurable default
-(\code{--custodian-scope primary}). The full sub-custodian chain remains available
-in N-CEN as a structured-only relation outside the text-to-triples task. This is a
-dataset-quality decision of the same kind the thesis notes for T-REx and REBEL,
-whose non-exhaustive references unfairly penalise correct extractions.
+\paragraph{Why the custodian relation is dropped.} A text-to-triple target is only
+useful if the fact can be found in the input text. The custodian relation fails
+this test and is therefore \emph{excluded} from the dataset
+(\code{--custodian-scope none}). N-CEN reports, for a global fund, the entire chain
+of \emph{foreign sub-custodians} --- one bank per market it invests in (Citibank
+Brazil, Banco de Chile, Cititrust Colombia, \dots). These names appear only in the
+structured N-CEN table and in \emph{no} prose document: the summary prospectus
+refers to the custodian only generically (``\dots\ including the adviser, the
+custodian, and the transfer agent\dots'') and never names it. Even the
+\emph{primary} custodian is typically named only in the separately filed Statement
+of Additional Information (N-1A Part~B), which is not part of the fetched input.
+Measured per fund, the custodian object name occurs in the fund's own prospectus
+segment only $28\%$ of the time --- by far the weakest of all relations --- so
+keeping it would systematically ask the model to extract facts absent from its
+input. The full custodian chain remains available in N-CEN as a structured-only
+relation, outside the text-to-triples task; recovering it from text would require
+adding the SAI as an input source (a separate crawl).

-\paragraph{No-model lower bound.} A trivial string-matching baseline --- emit a
-gold edge iff the object's name occurs in the prose --- establishes a \emph{floor}
-and measures \emph{how prose-grounded each relation is}. Table~\ref{tab:baseline}
-reports this on the proof-of-concept slice after primary-custodian scoping,
-multi-book fetching and per-fund segmentation. Because the baseline requires an
-\emph{exact substring} match within the fund's \emph{own} section, its recall is a
-strict lower bound: a fund's adviser, for instance, must be named in that fund's
-segment under a literal spelling. On the full quarter the adviser is recovered
-with recall $0.93$ and the micro-averaged $F_1$ reaches $0.79$; the recovered
-custodian recall is $0.63$ (up from $0.07$ unscoped and $0.37$ after scoping
-alone). The residual gap from $1.0$ is attributable to surface-form variation
-(``State Street Bank and Trust Company'' vs.\ ``State Street'') that a trained
-model handles but exact matching does not.
+\paragraph{Per-triple grounding flag.} Because even the retained relations are not
+\emph{always} named in a fund's segment, every triple carries a boolean
+\code{grounded} flag: true iff the normalized object name occurs in that sample's
+input text. This lets training and evaluation restrict to grounded triples rather
+than silently carrying unextractable targets. Across the full build, $80\%$ of
+triples are grounded; per relation the rate ranges from $93\%$ (\code{advisedBy})
+down to $62\%$ (\code{underwrittenBy}), as shown in Table~\ref{tab:baseline}.
+
+\paragraph{No-model lower bound.} The grounding flag is itself a trivial,
+model-free baseline (emit a gold edge iff its object name occurs in the prose);
+its per-relation rate is a strict lower bound on recall, since it requires an
+\emph{exact substring} match within the fund's own segment and so misses surface
+variants (``State Street Bank and Trust Company'' vs.\ ``State Street'') that a
+trained model handles. Table~\ref{tab:baseline} reports it on the full build.

 \begin{table}[h]
 \centering
-\caption{No-model string-match baseline on the full 2025\,Q3 build, after
-primary-custodian scoping, multi-book fetching and per-fund segmentation
-($852$ samples). Precision is $1.00$ by construction; recall is a strict
-exact-match lower bound.}
+\caption{Per-relation prose-grounding on the full 2025\,Q3 build ($852$ samples,
+$8{,}824$ triples; custodian dropped). ``Grounded'' = object name present in the
+sample's input; a strict, model-free lower bound on recall.}
 \label{tab:baseline}
 \small
 \begin{tabular}{@{}lrr@{}}
 \toprule
-Relation & Recall & Gold edges \\
+Relation & Triples & Grounded \\
 \midrule
-\code{advisedBy} & 0.93 & 1{,}673 \\
-\code{seriesOf} & 0.84 & 1{,}555 \\
-\code{subAdvisedBy} & 0.84 & 946 \\
-\code{administrator} & 0.80 & 2{,}066 \\
-\code{transferAgent} & 0.72 & 1{,}721 \\
-\code{custodian} & 0.63 & 1{,}761 \\
-\code{underwrittenBy} & 0.62 & 863 \\
+\code{advisedBy} & 1{,}673 & 93\% \\
+\code{seriesOf} & 1{,}555 & 84\% \\
+\code{subAdvisedBy} & 946 & 84\% \\
+\code{administrator} & 2{,}066 & 80\% \\
+\code{transferAgent} & 1{,}721 & 72\% \\
+\code{underwrittenBy} & 863 & 62\% \\
 \midrule
-micro-average & 0.65 & 6{,}479 \\
+all & 8{,}824 & 80\% \\
 \bottomrule
 \end{tabular}
 \end{table}
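Reviewer note: a quick consistency check of the new grounding table, assuming the "all" row is the triple-weighted average of the per-relation rates. The numbers are copied from the added table rows; because the per-relation percentages are already rounded, the result is only approximate.

```python
# (triples, grounded rate) per relation, copied from the new table rows.
table = {
    "advisedBy": (1673, 0.93),
    "seriesOf": (1555, 0.84),
    "subAdvisedBy": (946, 0.84),
    "administrator": (2066, 0.80),
    "transferAgent": (1721, 0.72),
    "underwrittenBy": (863, 0.62),
}

total_triples = sum(n for n, _ in table.values())
# Approximate grounded count: rates are rounded to whole percent in the table.
approx_grounded = sum(n * rate for n, rate in table.values())
overall_rate = approx_grounded / total_triples

print(f"triples: {total_triples}")
print(f"overall grounded rate: {overall_rate:.1%}")
```

The per-relation triple counts add up to the 8,824 reported in the caption, and the weighted rate rounds to the 80% shown in the "all" row.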
@@ -497,21 +499,22 @@ quantified rather than assumed.
 \section{Corpus statistics}
 % ====================================================================

-Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. The
-N-CEN gold graph (after primary-custodian scoping) holds $15{,}739$
+Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. With
+the custodian relation dropped, the N-CEN gold graph holds $12{,}694$
 entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all full
 prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$
 closed-end or interval funds file no standard prospectus) and applying the robust
 per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus
-$193$ whole-trust fallbacks). The segmented samples have a per-fund median ratio
-near $117\!:\!1$ (input prose to target serialization), and across all samples the
+$193$ whole-trust fallbacks) containing $8{,}824$ target triples, of which $80\%$
+are prose-grounded. The segmented samples have a per-fund median ratio near
+$117\!:\!1$ (input prose to target serialization), and across all samples the
 median exceeds $400\!:\!1$ --- the inverse of the symmetric-size benchmarks.

 \begin{table}[h]
 \centering
-\caption{Corpus statistics for the full 2025\,Q3 build. Left: N-CEN gold graph
-(primary-custodian scope). Right: text-to-triple samples (all prospectus books per
-trust, per-fund segmentation).}
+\caption{Corpus statistics for the full 2025\,Q3 build (custodian dropped). Left:
+N-CEN gold graph. Right: text-to-triple samples (all prospectus books per trust,
+per-fund segmentation).}
 \label{tab:stats}
 \small
 \begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
@@ -520,9 +523,9 @@ trust, per-fund segmentation).}
 \cmidrule(r){1-2}\cmidrule(l){3-4}
 Trust graphs & 435 & Samples (total) & 852 \\
 Funds (series) & 2{,}421 & \;segmented per-fund & 659 \\
-Entity-entity edges & 15{,}739 & \;whole-trust fallback & 193 \\
-\;custodian (primary) & 3{,}045 & Trusts fetched & 393 \\
-\;advisedBy & 2{,}588 & Prospectus filings & 2{,}326 \\
+Entity-entity edges & 12{,}694 & \;whole-trust fallback & 193 \\
+\;administrator & 3{,}288 & Target triples & 8{,}824 \\
+\;advisedBy & 2{,}588 & \;grounded & 80\% \\
 Distributors & 458 & Ratio (median, per-fund) & $117\!:\!1$ \\
 \bottomrule
 \end{tabular}