Update dataset description with full 2025Q3 build statistics

Full build: 2,326 prospectus filings across 393 trusts -> 852 samples (659 segmented per-fund + 193 fallback), trust-level split 655/122/75, no-model baseline F1=0.79.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 11:21:23 +02:00 · 63e650fa14
commit 63e650fa14
parent 1993658fb2
2 changed files with 40 additions and 35 deletions
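
The F1 figure in the commit message can be cross-checked against the micro-averaged recall reported in the baseline table of the diff. The sketch below assumes the standard harmonic-mean definition of F1 and takes precision as 1.00, as the table caption states; the function name `f1_score` is illustrative and not taken from the repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision is 1.00 by construction for the exact-match baseline.
full_build = f1_score(1.00, 0.65)  # micro-average recall after this commit
poc_slice = f1_score(1.00, 0.48)   # micro-average recall before this commit

print(round(full_build, 2), round(poc_slice, 2))  # prints: 0.79 0.65
```

Both the new value (0.79) and the value it replaces (0.65) follow from the recall column alone, which is why the prose quotes F1 while the table lists recall.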
--- a/dataset_description.pdf
+++ b/dataset_description.pdf
--- a/dataset_description.tex
+++ b/dataset_description.tex
@@ -453,18 +453,18 @@ reports this on the proof-of-concept slice after primary-custodian scoping,
 multi-book fetching and per-fund segmentation. Because the baseline requires an
 \emph{exact substring} match within the fund's \emph{own} section, its recall is a
 strict lower bound: a fund's adviser, for instance, must be named in that fund's
-segment under a literal spelling. With the right prospectus books present, the
-adviser is recovered with recall $1.00$ and the micro-averaged $F_1$ reaches
-$0.65$; the recovered custodian recall is $0.63$ (up from $0.07$ unscoped and
-$0.37$ after scoping alone). The residual gap from $1.0$ is attributable to
-surface-form variation (``State Street Bank and Trust Company'' vs.\ ``State
-Street'') that a trained model handles but exact matching does not.
+segment under a literal spelling. On the full quarter the adviser is recovered
+with recall $0.93$ and the micro-averaged $F_1$ reaches $0.79$ (recall $0.65$ at
+precision $1.00$); custodian recall is $0.63$ (up from $0.07$ unscoped and $0.37$
+after scoping alone). The residual gap from $1.0$ is attributable to surface-form
+variation (``State Street Bank and Trust Company'' vs.\ ``State Street'') that a
+trained model handles but exact matching does not.

 \begin{table}[h]
 \centering
-\caption{No-model string-match baseline on the proof-of-concept slice, after
+\caption{No-model string-match baseline on the full 2025\,Q3 build, after
 primary-custodian scoping, multi-book fetching and per-fund segmentation
-($141$ samples). Precision is $1.00$ by construction; recall is a strict
+($852$ samples). Precision is $1.00$ by construction; recall is a strict
 exact-match lower bound.}
 \label{tab:baseline}
 \small
@@ -472,15 +472,15 @@ exact-match lower bound.}
 \toprule
 Relation & Recall & Gold edges \\
 \midrule
-\code{advisedBy}      & 1.00 & 513 \\
-\code{subAdvisedBy}   & 0.94 & 232 \\
-\code{seriesOf}       & 0.86 & 496 \\
-\code{administrator}  & 0.84 & 730 \\
-\code{transferAgent}  & 0.79 & 537 \\
-\code{custodian}      & 0.63 & 601 \\
-\code{underwrittenBy} & 0.40 & 144 \\
+\code{advisedBy}      & 0.93 & 1{,}673 \\
+\code{seriesOf}       & 0.84 & 1{,}555 \\
+\code{subAdvisedBy}   & 0.84 & 946 \\
+\code{administrator}  & 0.80 & 2{,}066 \\
+\code{transferAgent}  & 0.72 & 1{,}721 \\
+\code{custodian}      & 0.63 & 1{,}761 \\
+\code{underwrittenBy} & 0.62 & 863 \\
 \midrule
-micro-average         & 0.48 & 1{,}194 \\
+micro-average         & 0.65 & 6{,}479 \\
 \bottomrule
 \end{tabular}
 \end{table}
@@ -497,37 +497,42 @@ quantified rather than assumed.
 \section{Corpus statistics}
 % ====================================================================

-Table~\ref{tab:stats} summarises one quarter (2025\,Q3) of N-CEN gold (after
-primary-custodian scoping) and the proof-of-concept \emph{per-fund} samples. The
-gold graph holds $15{,}739$ entity-to-entity edges across $435$ trusts and
-$2{,}421$ funds. Fetching all prospectus books and applying the robust segmenter
-yields $141$ samples whose per-fund median input is $\sim\!3.7\times10^{4}$
-characters against a $\sim\!6.5\times10^{2}$-character target: a median ratio near
-$55\!:\!1$, the inverse of the symmetric-size benchmarks.
+Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. The
+N-CEN gold graph (after primary-custodian scoping) holds $15{,}739$
+entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all
+prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$
+closed-end or interval funds file no standard prospectus) and applying the robust
+per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus
+$193$ whole-trust fallbacks). The segmented samples have a median per-fund ratio
+near $117\!:\!1$ (input prose to target serialization), and across all samples the
+median exceeds $400\!:\!1$: the inverse of the symmetric-size benchmarks.

 \begin{table}[h]
 \centering
-\caption{Corpus statistics. Left: N-CEN gold graph for 2025\,Q3 (primary-custodian
-scope). Right: proof-of-concept samples (multi-book fetch, per-fund segmentation).}
+\caption{Corpus statistics for the full 2025\,Q3 build. Left: N-CEN gold graph
+(primary-custodian scope). Right: text-to-triple samples (all prospectus books per
+trust, per-fund segmentation).}
 \label{tab:stats}
 \small
 \begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
 \toprule
-\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{PoC samples} \\
+\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{Samples (2025\,Q3)} \\
 \cmidrule(r){1-2}\cmidrule(l){3-4}
-Trust graphs        & 435      & Samples (total)        & 141 \\
-Funds (series)      & 2{,}421  & \;segmented per-fund   & 135 \\
-Entity-entity edges & 15{,}739 & \;whole-trust fallback & 6 \\
-\;custodian (primary) & 3{,}045 & Per-fund input (median) & 36{,}856 \\
-\;advisedBy         & 2{,}588  & Per-fund target (median) & 654 \\
-Distributors        & 458      & Ratio (median)         & $54.8\!:\!1$ \\
+Trust graphs        & 435      & Samples (total)         & 852 \\
+Funds (series)      & 2{,}421  & \;segmented per-fund    & 659 \\
+Entity-entity edges & 15{,}739 & \;whole-trust fallback  & 193 \\
+\;custodian (primary) & 3{,}045 & Trusts fetched         & 393 \\
+\;advisedBy         & 2{,}588  & Prospectus filings     & 2{,}326 \\
+Distributors        & 458      & Ratio (median, per-fund) & $117\!:\!1$ \\
 \bottomrule
 \end{tabular}
 \end{table}

-The full quarter therefore yields on the order of $2{,}400$ fund-level graphs
-from $435$ prospectus filings; multiple quarters and the dropping of ontology
-subsets (per the thesis's augmentation strategy) expand this further.
+\paragraph{Train/validation/test split.} Samples are partitioned at the trust
+level by a deterministic hash of the CIK: $655$ train, $122$ validation and $75$
+test samples (from $268$, $37$ and $36$ trusts respectively), with \emph{no} trust
+appearing in more than one split. Adding further quarters and dropping ontology
+subsets (per the thesis's augmentation strategy) would expand the corpus further.

 % ====================================================================
 \section{Use in the thesis experiments}
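
The trust-level split described in the added paragraph can be sketched as follows. This is a minimal illustration, not the repository's implementation: the diff states only that the split is a deterministic hash of the CIK, so the hash function (SHA-256), the 80/10/10 bucket thresholds, the zero-padding of the CIK to ten digits, the function name `assign_split`, and the sample records are all assumptions.

```python
import hashlib
from collections import defaultdict


def assign_split(cik: str, train_pct: int = 80, val_pct: int = 10) -> str:
    """Map a trust's CIK to a split; the same CIK always gets the same split."""
    # Assumption: normalise to a zero-padded 10-digit string so that
    # "101" and "0000000101" hash identically.
    key = str(cik).strip().zfill(10)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Reduce the first 8 bytes of the digest to a bucket in 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + val_pct:
        return "validation"
    return "test"


# Hypothetical per-fund samples; two of them belong to the same trust.
samples = [
    {"cik": "0000000101", "fund": "Fund A"},
    {"cik": "101", "fund": "Fund B"},
    {"cik": "0000000202", "fund": "Fund C"},
]

splits_per_trust = defaultdict(set)
for sample in samples:
    sample["split"] = assign_split(sample["cik"])
    splits_per_trust[sample["cik"].zfill(10)].add(sample["split"])

# No trust may appear in more than one split.
assert all(len(splits) == 1 for splits in splits_per_trust.values())
```

Because the split is keyed on the trust rather than on the individual sample, every fund of a trust lands in the same partition, and the assignment stays stable when samples from further quarters are added.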