Update dataset description with full 2025Q3 build statistics

Full build: 2,326 prospectus filings across 393 trusts -> 852 samples (659 segmented per-fund + 193 fallback), trust-level split 655/122/75, no-model baseline F1=0.79.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 1993658fb2
commit 63e650fa14
Binary file not shown.
@@ -453,18 +453,18 @@ reports this on the proof-of-concept slice after primary-custodian scoping,
 multi-book fetching and per-fund segmentation. Because the baseline requires an
 \emph{exact substring} match within the fund's \emph{own} section, its recall is a
 strict lower bound: a fund's adviser, for instance, must be named in that fund's
-segment under a literal spelling. With the right prospectus books present, the
-adviser is recovered with recall $1.00$ and the micro-averaged $F_1$ reaches
-$0.65$; the recovered custodian recall is $0.63$ (up from $0.07$ unscoped and
-$0.37$ after scoping alone). The residual gap from $1.0$ is attributable to
-surface-form variation (``State Street Bank and Trust Company'' vs.\ ``State
-Street'') that a trained model handles but exact matching does not.
+segment under a literal spelling. On the full quarter the adviser is recovered
+with recall $0.93$ and the micro-averaged $F_1$ reaches $0.79$; the recovered
+custodian recall is $0.63$ (up from $0.07$ unscoped and $0.37$ after scoping
+alone). The residual gap from $1.0$ is attributable to surface-form variation
+(``State Street Bank and Trust Company'' vs.\ ``State Street'') that a trained
+model handles but exact matching does not.
 
 \begin{table}[h]
 \centering
-\caption{No-model string-match baseline on the proof-of-concept slice, after
+\caption{No-model string-match baseline on the full 2025\,Q3 build, after
 primary-custodian scoping, multi-book fetching and per-fund segmentation
-($141$ samples). Precision is $1.00$ by construction; recall is a strict
+($852$ samples). Precision is $1.00$ by construction; recall is a strict
 exact-match lower bound.}
 \label{tab:baseline}
 \small
@@ -472,15 +472,15 @@ exact-match lower bound.}
 \toprule
 Relation & Recall & Gold edges \\
 \midrule
-\code{advisedBy} & 1.00 & 513 \\
-\code{subAdvisedBy} & 0.94 & 232 \\
-\code{seriesOf} & 0.86 & 496 \\
-\code{administrator} & 0.84 & 730 \\
-\code{transferAgent} & 0.79 & 537 \\
-\code{custodian} & 0.63 & 601 \\
-\code{underwrittenBy} & 0.40 & 144 \\
+\code{advisedBy} & 0.93 & 1{,}673 \\
+\code{seriesOf} & 0.84 & 1{,}555 \\
+\code{subAdvisedBy} & 0.84 & 946 \\
+\code{administrator} & 0.80 & 2{,}066 \\
+\code{transferAgent} & 0.72 & 1{,}721 \\
+\code{custodian} & 0.63 & 1{,}761 \\
+\code{underwrittenBy} & 0.62 & 863 \\
 \midrule
-micro-average & 0.48 & 1{,}194 \\
+micro-average & 0.65 & 6{,}479 \\
 \bottomrule
 \end{tabular}
 \end{table}
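
Note on the baseline in the hunks above: the text describes an exact-substring match inside each fund's own segment, with precision 1.00 by construction. A minimal Python sketch of that scoring, under stated assumptions, is given here. The function names, the `(segment_text, gold_edges)` sample layout and the example entity names are hypothetical and are not taken from the repository; the actual baseline may normalise text or count edges differently.

```python
from collections import defaultdict


def exact_match_recall(samples):
    """Per-relation and micro-averaged recall of an exact-substring baseline.

    `samples` is an iterable of (segment_text, gold_edges) pairs, where
    gold_edges is a list of (relation, entity_name) tuples. This layout is
    an assumption made for illustration only.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for segment_text, gold_edges in samples:
        for relation, entity_name in gold_edges:
            totals[relation] += 1
            # A gold edge counts as recovered only if the literal gold
            # spelling occurs in this fund's own segment.
            if entity_name in segment_text:
                hits[relation] += 1
    per_relation = {r: hits[r] / totals[r] for r in totals}
    total_edges = sum(totals.values())
    micro = sum(hits.values()) / total_edges if total_edges else 0.0
    return per_relation, micro


def f1_at_unit_precision(recall):
    """F1 when precision is fixed at 1.0: F1 = 2R / (1 + R)."""
    return 2 * recall / (1 + recall)
```

With precision fixed at 1.00, a micro-averaged recall of 0.65 gives F1 = 1.30 / 1.65, about 0.79, and a recall of 0.48 gives about 0.65, which matches the recall and F1 pairs quoted on both sides of the diff. The surface-form case from the text ("State Street Bank and Trust Company" in gold, "State Street" in the segment) scores as a miss under this check.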
@@ -497,37 +497,42 @@ quantified rather than assumed.
 \section{Corpus statistics}
 % ====================================================================
 
-Table~\ref{tab:stats} summarises one quarter (2025\,Q3) of N-CEN gold (after
-primary-custodian scoping) and the proof-of-concept \emph{per-fund} samples. The
-gold graph holds $15{,}739$ entity-to-entity edges across $435$ trusts and
-$2{,}421$ funds. Fetching all prospectus books and applying the robust segmenter
-yields $141$ samples whose per-fund median input is $\sim\!3.7\times10^{4}$
-characters against a $\sim\!6.5\times10^{2}$-character target: a median ratio near
-$55\!:\!1$, the inverse of the symmetric-size benchmarks.
+Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. The
+N-CEN gold graph (after primary-custodian scoping) holds $15{,}739$
+entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all full
+prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$
+closed-end or interval funds file no standard prospectus) and applying the robust
+per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus
+$193$ whole-trust fallbacks). The segmented samples have a per-fund median ratio
+near $117\!:\!1$ (input prose to target serialization), and across all samples the
+median exceeds $400\!:\!1$ --- the inverse of the symmetric-size benchmarks.
 
 \begin{table}[h]
 \centering
-\caption{Corpus statistics. Left: N-CEN gold graph for 2025\,Q3 (primary-custodian
-scope). Right: proof-of-concept samples (multi-book fetch, per-fund segmentation).}
+\caption{Corpus statistics for the full 2025\,Q3 build. Left: N-CEN gold graph
+(primary-custodian scope). Right: text-to-triple samples (all prospectus books per
+trust, per-fund segmentation).}
 \label{tab:stats}
 \small
 \begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
 \toprule
-\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{PoC samples} \\
+\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{Samples (2025\,Q3)} \\
 \cmidrule(r){1-2}\cmidrule(l){3-4}
-Trust graphs & 435 & Samples (total) & 141 \\
-Funds (series) & 2{,}421 & \;segmented per-fund & 135 \\
-Entity-entity edges & 15{,}739 & \;whole-trust fallback & 6 \\
-\;custodian (primary) & 3{,}045 & Per-fund input (median) & 36{,}856 \\
-\;advisedBy & 2{,}588 & Per-fund target (median) & 654 \\
-Distributors & 458 & Ratio (median) & $54.8\!:\!1$ \\
+Trust graphs & 435 & Samples (total) & 852 \\
+Funds (series) & 2{,}421 & \;segmented per-fund & 659 \\
+Entity-entity edges & 15{,}739 & \;whole-trust fallback & 193 \\
+\;custodian (primary) & 3{,}045 & Trusts fetched & 393 \\
+\;advisedBy & 2{,}588 & Prospectus filings & 2{,}326 \\
+Distributors & 458 & Ratio (median, per-fund) & $117\!:\!1$ \\
 \bottomrule
 \end{tabular}
 \end{table}
 
-The full quarter therefore yields on the order of $2{,}400$ fund-level graphs
-from $435$ prospectus filings; multiple quarters and the dropping of ontology
-subsets (per the thesis's augmentation strategy) expand this further.
+\paragraph{Train/validation/test split.} Partitioned at the trust level by a
+deterministic hash of the CIK: $655$ train, $122$ validation, $75$ test samples
+(from $268$, $37$ and $36$ trusts respectively), with \emph{no} trust appearing in
+more than one split. Multiple quarters and the dropping of ontology subsets (per
+the thesis's augmentation strategy) expand the corpus further.
 
 % ====================================================================
 \section{Use in the thesis experiments}
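
Note on the split paragraph added in the hunk above: it describes a trust-level partition by a deterministic hash of the CIK. One way such a split could be implemented is sketched below in Python. The choice of SHA-256, the bucket percentages, the function names and the `"cik"` sample key are all assumptions for illustration; the diff does not state which hash or thresholds the build actually uses.

```python
import hashlib
from collections import defaultdict


def split_for_cik(cik, val_pct=10, test_pct=10):
    """Map a trust's CIK to 'train', 'validation' or 'test'.

    The CIK is normalised to its integer form so that '0000001234' and
    1234 land in the same bucket. The percentages are hypothetical.
    """
    key = str(int(cik)).encode("utf-8")
    # hashlib is stable across runs and machines, unlike the built-in
    # hash(), which is salted per process for strings.
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def assign_splits(samples):
    """Group samples by split; every sample of one trust shares a split."""
    splits = defaultdict(list)
    for sample in samples:
        splits[split_for_cik(sample["cik"])].append(sample)
    return dict(splits)
```

Because the split is a pure function of the CIK, all per-fund and fallback samples from one trust fall on the same side, and rebuilding the corpus for another quarter keeps each trust in the split it already had.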