Update dataset description with full 2025Q3 build statistics

Full build: 2,326 prospectus filings across 393 trusts -> 852 samples (659 segmented per-fund + 193 fallback), trust-level split 655/122/75, no-model baseline F1=0.79.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 1993658fb2
commit 63e650fa14
Binary file not shown.
@@ -453,18 +453,18 @@ reports this on the proof-of-concept slice after primary-custodian scoping,
 multi-book fetching and per-fund segmentation. Because the baseline requires an
 \emph{exact substring} match within the fund's \emph{own} section, its recall is a
 strict lower bound: a fund's adviser, for instance, must be named in that fund's
-segment under a literal spelling. With the right prospectus books present, the
-adviser is recovered with recall $1.00$ and the micro-averaged $F_1$ reaches
-$0.65$; the recovered custodian recall is $0.63$ (up from $0.07$ unscoped and
-$0.37$ after scoping alone). The residual gap from $1.0$ is attributable to
-surface-form variation (``State Street Bank and Trust Company'' vs.\ ``State
-Street'') that a trained model handles but exact matching does not.
+segment under a literal spelling. On the full quarter the adviser is recovered
+with recall $0.93$ and the micro-averaged $F_1$ reaches $0.79$; the recovered
+custodian recall is $0.63$ (up from $0.07$ unscoped and $0.37$ after scoping
+alone). The residual gap from $1.0$ is attributable to surface-form variation
+(``State Street Bank and Trust Company'' vs.\ ``State Street'') that a trained
+model handles but exact matching does not.
 
 \begin{table}[h]
 \centering
-\caption{No-model string-match baseline on the proof-of-concept slice, after
+\caption{No-model string-match baseline on the full 2025\,Q3 build, after
 primary-custodian scoping, multi-book fetching and per-fund segmentation
-($141$ samples). Precision is $1.00$ by construction; recall is a strict
+($852$ samples). Precision is $1.00$ by construction; recall is a strict
 exact-match lower bound.}
 \label{tab:baseline}
 \small
@@ -472,15 +472,15 @@ exact-match lower bound.}
 \toprule
 Relation & Recall & Gold edges \\
 \midrule
-\code{advisedBy} & 1.00 & 513 \\
-\code{subAdvisedBy} & 0.94 & 232 \\
-\code{seriesOf} & 0.86 & 496 \\
-\code{administrator} & 0.84 & 730 \\
-\code{transferAgent} & 0.79 & 537 \\
-\code{custodian} & 0.63 & 601 \\
-\code{underwrittenBy} & 0.40 & 144 \\
+\code{advisedBy} & 0.93 & 1{,}673 \\
+\code{seriesOf} & 0.84 & 1{,}555 \\
+\code{subAdvisedBy} & 0.84 & 946 \\
+\code{administrator} & 0.80 & 2{,}066 \\
+\code{transferAgent} & 0.72 & 1{,}721 \\
+\code{custodian} & 0.63 & 1{,}761 \\
+\code{underwrittenBy} & 0.62 & 863 \\
 \midrule
-micro-average & 0.48 & 1{,}194 \\
+micro-average & 0.65 & 6{,}479 \\
 \bottomrule
 \end{tabular}
 \end{table}
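Note on the baseline hunks above: the prose reports a micro-averaged $F_1$ of $0.79$ while the table's micro-average row reports recall $0.65$. The two agree once precision is fixed at $1.00$, as the caption states. The Python sketch below illustrates an exact-substring baseline of the kind the paragraph describes and checks that arithmetic. It is a minimal illustration, not the pipeline's code: the function names, the `(relation, entity_name)` edge layout, and the example adviser name are all hypothetical.

```python
from collections import Counter


def exact_match_counts(segment_text, gold_edges):
    """Count gold edges whose entity name occurs verbatim in the fund's own segment.

    gold_edges is an iterable of (relation, entity_name) pairs; this layout is
    an assumption made for the sketch.
    Returns (hits, totals) as Counters keyed by relation.
    """
    hits, totals = Counter(), Counter()
    for relation, name in gold_edges:
        totals[relation] += 1
        if name in segment_text:  # literal, case-sensitive substring test
            hits[relation] += 1
    return hits, totals


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy segment: the custodian appears only under a short surface form.
segment = "The Fund's custodian is State Street. The adviser is Example Advisers LLC."
gold = [
    ("custodian", "State Street Bank and Trust Company"),  # long form: no literal match
    ("advisedBy", "Example Advisers LLC"),                  # hypothetical adviser name
]

hits, totals = exact_match_counts(segment, gold)
micro_recall = sum(hits.values()) / sum(totals.values())

# With precision 1.00 by construction, recall 0.65 gives F1 of about 0.79,
# and the old recall 0.48 gives about 0.65.
f1_new = round(f1(1.0, 0.65), 2)
f1_old = round(f1(1.0, 0.48), 2)
```

The toy segment reproduces the surface-form miss named in the text: "State Street" is present, but the gold spelling "State Street Bank and Trust Company" is not, so the custodian edge counts against recall.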
@@ -497,37 +497,42 @@ quantified rather than assumed.
 \section{Corpus statistics}
 % ====================================================================
 
-Table~\ref{tab:stats} summarises one quarter (2025\,Q3) of N-CEN gold (after
-primary-custodian scoping) and the proof-of-concept \emph{per-fund} samples. The
-gold graph holds $15{,}739$ entity-to-entity edges across $435$ trusts and
-$2{,}421$ funds. Fetching all prospectus books and applying the robust segmenter
-yields $141$ samples whose per-fund median input is $\sim\!3.7\times10^{4}$
-characters against a $\sim\!6.5\times10^{2}$-character target: a median ratio near
-$55\!:\!1$, the inverse of the symmetric-size benchmarks.
+Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. The
+N-CEN gold graph (after primary-custodian scoping) holds $15{,}739$
+entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all full
+prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$
+closed-end or interval funds file no standard prospectus) and applying the robust
+per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus
+$193$ whole-trust fallbacks). The segmented samples have a per-fund median ratio
+near $117\!:\!1$ (input prose to target serialization), and across all samples the
+median exceeds $400\!:\!1$ --- the inverse of the symmetric-size benchmarks.
 
 \begin{table}[h]
 \centering
-\caption{Corpus statistics. Left: N-CEN gold graph for 2025\,Q3 (primary-custodian
-scope). Right: proof-of-concept samples (multi-book fetch, per-fund segmentation).}
+\caption{Corpus statistics for the full 2025\,Q3 build. Left: N-CEN gold graph
+(primary-custodian scope). Right: text-to-triple samples (all prospectus books per
+trust, per-fund segmentation).}
 \label{tab:stats}
 \small
 \begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
 \toprule
-\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{PoC samples} \\
+\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{Samples (2025\,Q3)} \\
 \cmidrule(r){1-2}\cmidrule(l){3-4}
-Trust graphs & 435 & Samples (total) & 141 \\
-Funds (series) & 2{,}421 & \;segmented per-fund & 135 \\
-Entity-entity edges & 15{,}739 & \;whole-trust fallback & 6 \\
-\;custodian (primary) & 3{,}045 & Per-fund input (median) & 36{,}856 \\
-\;advisedBy & 2{,}588 & Per-fund target (median) & 654 \\
-Distributors & 458 & Ratio (median) & $54.8\!:\!1$ \\
+Trust graphs & 435 & Samples (total) & 852 \\
+Funds (series) & 2{,}421 & \;segmented per-fund & 659 \\
+Entity-entity edges & 15{,}739 & \;whole-trust fallback & 193 \\
+\;custodian (primary) & 3{,}045 & Trusts fetched & 393 \\
+\;advisedBy & 2{,}588 & Prospectus filings & 2{,}326 \\
+Distributors & 458 & Ratio (median, per-fund) & $117\!:\!1$ \\
 \bottomrule
 \end{tabular}
 \end{table}
 
-The full quarter therefore yields on the order of $2{,}400$ fund-level graphs
-from $435$ prospectus filings; multiple quarters and the dropping of ontology
-subsets (per the thesis's augmentation strategy) expand this further.
+\paragraph{Train/validation/test split.} Partitioned at the trust level by a
+deterministic hash of the CIK: $655$ train, $122$ validation, $75$ test samples
+(from $268$, $37$ and $36$ trusts respectively), with \emph{no} trust appearing in
+more than one split. Multiple quarters and the dropping of ontology subsets (per
+the thesis's augmentation strategy) expand the corpus further.
 
 % ====================================================================
 \section{Use in the thesis experiments}
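Note on the split paragraph added above: it says the split is assigned at the trust level by a deterministic hash of the CIK, so every sample from a trust lands in the same partition. The Python sketch below shows one way such an assignment could work. The commit does not name the hash function, the bucket thresholds, or the CIK normalisation, so SHA-256, the 10/10 percentages, and the zero-padding are all assumptions made for illustration, and the sample records are invented.

```python
import hashlib


def split_for_cik(cik, val_pct=10, test_pct=10):
    """Map a trust's CIK to 'train', 'validation' or 'test' deterministically.

    Assumptions (not taken from the commit): SHA-256 as the hash, CIKs
    normalised to 10-digit zero-padded strings, 100 buckets with the given
    percentage thresholds.
    """
    normalised = str(cik).strip().zfill(10)
    digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable across runs and machines
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


# Hypothetical per-fund samples; several funds can share one trust (one CIK).
samples = [
    {"cik": "0000000101", "fund": "Fund A"},
    {"cik": "0000000101", "fund": "Fund B"},
    {"cik": "0000000202", "fund": "Fund C"},
    {"cik": 303, "fund": "Fund D"},
]

assignments = {}
for sample in samples:
    split = split_for_cik(sample["cik"])
    # Collect the set of splits each trust appears in; it must stay a singleton.
    assignments.setdefault(str(sample["cik"]).zfill(10), set()).add(split)

no_leakage = all(len(splits) == 1 for splits in assignments.values())
```

Because the assignment depends only on the CIK, rebuilding the corpus or adding further quarters does not move a trust between partitions. The trade-off is that split sizes are only approximately proportional, which is consistent with the uneven 655/122/75 sample counts reported in the paragraph.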