Update dataset description with full 2025Q3 build statistics

Full build: 2,326 prospectus filings across 393 trusts -> 852 samples
(659 segmented per-fund + 193 fallback), trust-level split 655/122/75,
no-model baseline F1=0.79.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Florian Herzog 2026-06-03 11:21:23 +02:00
parent 1993658fb2
commit 63e650fa14
2 changed files with 40 additions and 35 deletions

@ -453,18 +453,18 @@ reports this on the proof-of-concept slice after primary-custodian scoping,
multi-book fetching and per-fund segmentation. Because the baseline requires an
\emph{exact substring} match within the fund's \emph{own} section, its recall is a
strict lower bound: a fund's adviser, for instance, must be named in that fund's
segment under a literal spelling. With the right prospectus books present, the
adviser is recovered with recall $1.00$ and the micro-averaged $F_1$ reaches
$0.65$; the recovered custodian recall is $0.63$ (up from $0.07$ unscoped and
$0.37$ after scoping alone). The residual gap from $1.0$ is attributable to
surface-form variation (``State Street Bank and Trust Company'' vs.\ ``State
Street'') that a trained model handles but exact matching does not.
segment under a literal spelling. On the full quarter the adviser is recovered
with recall $0.93$ and the micro-averaged $F_1$ reaches $0.79$; the recovered
custodian recall is $0.63$ (up from $0.07$ unscoped and $0.37$ after scoping
alone). The residual gap from $1.0$ is attributable to surface-form variation
(``State Street Bank and Trust Company'' vs.\ ``State Street'') that a trained
model handles but exact matching does not.
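As a consistency check on the two summary figures, and assuming the standard
harmonic-mean definition of $F_1$, precision $P = 1.00$ (by construction) and the
micro-averaged recall $R = 0.65$ give
\[
  F_1 \;=\; \frac{2PR}{P+R}
      \;=\; \frac{2 \times 1.00 \times 0.65}{1.00 + 0.65}
      \;=\; \frac{1.30}{1.65}
      \;\approx\; 0.79 .
\]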
\begin{table}[h]
\centering
\caption{No-model string-match baseline on the proof-of-concept slice, after
\caption{No-model string-match baseline on the full 2025\,Q3 build, after
primary-custodian scoping, multi-book fetching and per-fund segmentation
($141$ samples). Precision is $1.00$ by construction; recall is a strict
($852$ samples). Precision is $1.00$ by construction; recall is a strict
exact-match lower bound.}
\label{tab:baseline}
\small
@ -472,15 +472,15 @@ exact-match lower bound.}
\toprule
Relation & Recall & Gold edges \\
\midrule
\code{advisedBy} & 1.00 & 513 \\
\code{subAdvisedBy} & 0.94 & 232 \\
\code{seriesOf} & 0.86 & 496 \\
\code{administrator} & 0.84 & 730 \\
\code{transferAgent} & 0.79 & 537 \\
\code{custodian} & 0.63 & 601 \\
\code{underwrittenBy} & 0.40 & 144 \\
\code{advisedBy} & 0.93 & 1{,}673 \\
\code{seriesOf} & 0.84 & 1{,}555 \\
\code{subAdvisedBy} & 0.84 & 946 \\
\code{administrator} & 0.80 & 2{,}066 \\
\code{transferAgent} & 0.72 & 1{,}721 \\
\code{custodian} & 0.63 & 1{,}761 \\
\code{underwrittenBy} & 0.62 & 863 \\
\midrule
micro-average & 0.48 & 1{,}194 \\
micro-average & 0.65 & 6{,}479 \\
\bottomrule
\end{tabular}
\end{table}
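The exact-match criterion behind Table~\ref{tab:baseline} can be illustrated with a
minimal sketch. The function name \code{exact\_match\_recall} and the
case-sensitive comparison are assumptions made for illustration, not the thesis
implementation; the point is only that a gold name counts as recovered when it
occurs verbatim in the fund's own segment.

```python
def exact_match_recall(segment_text: str, gold_names: list[str]) -> float:
    """Fraction of gold entity names that occur verbatim in one fund's segment.

    Hypothetical helper: only gold names are searched for, so every hit is
    correct (precision 1.00 by construction) and any spelling variant is a miss.
    """
    if not gold_names:
        return 0.0
    # Plain, case-sensitive substring test (an assumption about "literal spelling").
    hits = sum(1 for name in gold_names if name in segment_text)
    return hits / len(gold_names)


segment = (
    "The Fund's custodian is State Street Bank and Trust Company. "
    "The Fund is advised by Acme Advisors LLC."
)
gold = ["State Street Bank and Trust Company", "Acme Advisers, LLC"]
recall = exact_match_recall(segment, gold)
```

In this toy segment the custodian is found under its literal spelling, while the
adviser is missed because of a surface-form difference, which is the failure mode
the recall figures above are bounded by.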
@ -497,37 +497,42 @@ quantified rather than assumed.
\section{Corpus statistics}
% ====================================================================
Table~\ref{tab:stats} summarises one quarter (2025\,Q3) of N-CEN gold (after
primary-custodian scoping) and the proof-of-concept \emph{per-fund} samples. The
gold graph holds $15{,}739$ entity-to-entity edges across $435$ trusts and
$2{,}421$ funds. Fetching all prospectus books and applying the robust segmenter
yields $141$ samples whose per-fund median input is $\sim\!3.7\times10^{4}$
characters against a $\sim\!6.5\times10^{2}$-character target: a median ratio near
$55\!:\!1$, the inverse of the symmetric-size benchmarks.
Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. The
N-CEN gold graph (after primary-custodian scoping) holds $15{,}739$
entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching the full
prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$
closed-end or interval funds file no standard prospectus) and applying the robust
per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus
$193$ whole-trust fallbacks). The segmented samples have a per-fund median
input-to-target ratio near $117\!:\!1$ (input prose to target serialization);
across all samples the median exceeds $400\!:\!1$. This asymmetry is the opposite
of the symmetric-size benchmarks.
\begin{table}[h]
\centering
\caption{Corpus statistics. Left: N-CEN gold graph for 2025\,Q3 (primary-custodian
scope). Right: proof-of-concept samples (multi-book fetch, per-fund segmentation).}
\caption{Corpus statistics for the full 2025\,Q3 build. Left: N-CEN gold graph
(primary-custodian scope). Right: text-to-triple samples (all prospectus books per
trust, per-fund segmentation).}
\label{tab:stats}
\small
\begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
\toprule
\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{PoC samples} \\
\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{Samples (2025\,Q3)} \\
\cmidrule(r){1-2}\cmidrule(l){3-4}
Trust graphs & 435 & Samples (total) & 141 \\
Funds (series) & 2{,}421 & \;segmented per-fund & 135 \\
Entity-entity edges & 15{,}739 & \;whole-trust fallback & 6 \\
\;custodian (primary) & 3{,}045 & Per-fund input (median) & 36{,}856 \\
\;advisedBy & 2{,}588 & Per-fund target (median) & 654 \\
Distributors & 458 & Ratio (median) & $54.8\!:\!1$ \\
Trust graphs & 435 & Samples (total) & 852 \\
Funds (series) & 2{,}421 & \;segmented per-fund & 659 \\
Entity-entity edges & 15{,}739 & \;whole-trust fallback & 193 \\
\;custodian (primary) & 3{,}045 & Trusts fetched & 393 \\
\;advisedBy & 2{,}588 & Prospectus filings & 2{,}326 \\
Distributors & 458 & Ratio (median, per-fund) & $117\!:\!1$ \\
\bottomrule
\end{tabular}
\end{table}
The full quarter therefore yields on the order of $2{,}400$ fund-level graphs
from $435$ prospectus filings; multiple quarters and the dropping of ontology
subsets (per the thesis's augmentation strategy) expand this further.
\paragraph{Train/validation/test split.} Samples are partitioned at the trust
level by a deterministic hash of the CIK: $655$ train, $122$ validation and $75$
test samples (from $268$, $37$ and $36$ trusts respectively), with \emph{no} trust
appearing in more than one split. Adding further quarters and dropping ontology
subsets (per the thesis's augmentation strategy) expand the corpus further.
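A trust-level split of this kind can be sketched as follows. The choice of
SHA-256, the CIK normalisation, the $80/10/10$ bucket thresholds and the names
\code{split\_for\_cik} and \code{split\_samples} are all illustrative assumptions;
the text above specifies only that the hash is deterministic and keyed on the CIK.

```python
import hashlib


def split_for_cik(cik: str, train_pct: int = 80, val_pct: int = 10) -> str:
    """Map a trust's CIK to a split via a stable hash (hypothetical thresholds)."""
    # Normalise so zero-padded and unpadded CIKs fall in the same bucket.
    key = cik.strip().lstrip("0") or "0"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # The first 8 bytes of the digest give a stable integer; reduce it to 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + val_pct:
        return "validation"
    return "test"


def split_samples(samples: list[dict]) -> dict[str, list[dict]]:
    """Group samples (dicts with a 'cik' key) so each trust lands in one split."""
    splits = {"train": [], "validation": [], "test": []}
    for sample in samples:
        splits[split_for_cik(sample["cik"])].append(sample)
    return splits


demo = [
    {"cik": "0000001", "fund": "A"},
    {"cik": "1", "fund": "B"},
    {"cik": "2", "fund": "C"},
]
demo_splits = split_samples(demo)
```

Because the assignment depends only on the CIK, every per-fund sample of a trust
inherits the same split, and the partition is reproducible without storing a
random seed.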
% ====================================================================
\section{Use in the thesis experiments}