diff --git a/dataset_description.pdf b/dataset_description.pdf
index 0a30536..f6dd5e0 100644
Binary files a/dataset_description.pdf and b/dataset_description.pdf differ
diff --git a/dataset_description.tex b/dataset_description.tex
index df38a9a..10f764c 100644
--- a/dataset_description.tex
+++ b/dataset_description.tex
@@ -453,18 +453,18 @@ reports this on the proof-of-concept slice after primary-custodian scoping,
 multi-book fetching and per-fund segmentation. Because the baseline requires an
 \emph{exact substring} match within the fund's \emph{own} section, its recall is a
 strict lower bound: a fund's adviser, for instance, must be named in that fund's
-segment under a literal spelling. With the right prospectus books present, the
-adviser is recovered with recall $1.00$ and the micro-averaged $F_1$ reaches
-$0.65$; the recovered custodian recall is $0.63$ (up from $0.07$ unscoped and
-$0.37$ after scoping alone). The residual gap from $1.0$ is attributable to
-surface-form variation (``State Street Bank and Trust Company'' vs.\ ``State
-Street'') that a trained model handles but exact matching does not.
+segment under a literal spelling. On the full quarter the adviser is recovered
+with recall $0.93$ and the micro-averaged $F_1$ reaches $0.79$; the recovered
+custodian recall is $0.63$ (up from $0.07$ unscoped and $0.37$ after scoping
+alone). The residual gap from $1.0$ is attributable to surface-form variation
+(``State Street Bank and Trust Company'' vs.\ ``State Street'') that a trained
+model handles but exact matching does not.
 
 \begin{table}[h]
 \centering
-\caption{No-model string-match baseline on the proof-of-concept slice, after
+\caption{No-model string-match baseline on the full 2025\,Q3 build, after
 primary-custodian scoping, multi-book fetching and per-fund segmentation
-($141$ samples). Precision is $1.00$ by construction; recall is a strict
+($852$ samples). Precision is $1.00$ by construction; recall is a strict
 exact-match lower bound.}
 \label{tab:baseline}
 \small
@@ -472,15 +472,15 @@ exact-match lower bound.}
 \toprule
 Relation & Recall & Gold edges \\
 \midrule
-\code{advisedBy} & 1.00 & 513 \\
-\code{subAdvisedBy} & 0.94 & 232 \\
-\code{seriesOf} & 0.86 & 496 \\
-\code{administrator} & 0.84 & 730 \\
-\code{transferAgent} & 0.79 & 537 \\
-\code{custodian} & 0.63 & 601 \\
-\code{underwrittenBy} & 0.40 & 144 \\
+\code{advisedBy} & 0.93 & 1{,}673 \\
+\code{seriesOf} & 0.84 & 1{,}555 \\
+\code{subAdvisedBy} & 0.84 & 946 \\
+\code{administrator} & 0.80 & 2{,}066 \\
+\code{transferAgent} & 0.72 & 1{,}721 \\
+\code{custodian} & 0.63 & 1{,}761 \\
+\code{underwrittenBy} & 0.62 & 863 \\
 \midrule
-micro-average & 0.48 & 1{,}194 \\
+micro-average & 0.65 & 6{,}479 \\
 \bottomrule
 \end{tabular}
 \end{table}
@@ -497,37 +497,42 @@ quantified rather than assumed.
 \section{Corpus statistics}
 % ====================================================================
 
-Table~\ref{tab:stats} summarises one quarter (2025\,Q3) of N-CEN gold (after
-primary-custodian scoping) and the proof-of-concept \emph{per-fund} samples. The
-gold graph holds $15{,}739$ entity-to-entity edges across $435$ trusts and
-$2{,}421$ funds. Fetching all prospectus books and applying the robust segmenter
-yields $141$ samples whose per-fund median input is $\sim\!3.7\times10^{4}$
-characters against a $\sim\!6.5\times10^{2}$-character target: a median ratio near
-$55\!:\!1$, the inverse of the symmetric-size benchmarks.
+Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. The
+N-CEN gold graph (after primary-custodian scoping) holds $15{,}739$
+entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all full
+prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$
+closed-end or interval funds file no standard prospectus) and applying the robust
+per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus
+$193$ whole-trust fallbacks). The segmented samples have a per-fund median ratio
+near $117\!:\!1$ (input prose to target serialization), and across all samples the
+median exceeds $400\!:\!1$ --- the inverse of the symmetric-size benchmarks.
 
 \begin{table}[h]
 \centering
-\caption{Corpus statistics. Left: N-CEN gold graph for 2025\,Q3 (primary-custodian
-scope). Right: proof-of-concept samples (multi-book fetch, per-fund segmentation).}
+\caption{Corpus statistics for the full 2025\,Q3 build. Left: N-CEN gold graph
+(primary-custodian scope). Right: text-to-triple samples (all prospectus books per
+trust, per-fund segmentation).}
 \label{tab:stats}
 \small
 \begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
 \toprule
-\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{PoC samples} \\
+\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{Samples (2025\,Q3)} \\
 \cmidrule(r){1-2}\cmidrule(l){3-4}
-Trust graphs & 435 & Samples (total) & 141 \\
-Funds (series) & 2{,}421 & \;segmented per-fund & 135 \\
-Entity-entity edges & 15{,}739 & \;whole-trust fallback & 6 \\
-\;custodian (primary) & 3{,}045 & Per-fund input (median) & 36{,}856 \\
-\;advisedBy & 2{,}588 & Per-fund target (median) & 654 \\
-Distributors & 458 & Ratio (median) & $54.8\!:\!1$ \\
+Trust graphs & 435 & Samples (total) & 852 \\
+Funds (series) & 2{,}421 & \;segmented per-fund & 659 \\
+Entity-entity edges & 15{,}739 & \;whole-trust fallback & 193 \\
+\;custodian (primary) & 3{,}045 & Trusts fetched & 393 \\
+\;advisedBy & 2{,}588 & Prospectus filings & 2{,}326 \\
+Distributors & 458 & Ratio (median, per-fund) & $117\!:\!1$ \\
 \bottomrule
 \end{tabular}
 \end{table}
 
-The full quarter therefore yields on the order of $2{,}400$ fund-level graphs
-from $435$ prospectus filings; multiple quarters and the dropping of ontology
-subsets (per the thesis's augmentation strategy) expand this further.
+\paragraph{Train/validation/test split.} Partitioned at the trust level by a
+deterministic hash of the CIK: $655$ train, $122$ validation, $75$ test samples
+(from $268$, $37$ and $36$ trusts respectively), with \emph{no} trust appearing in
+more than one split. Multiple quarters and the dropping of ontology subsets (per
+the thesis's augmentation strategy) expand the corpus further.
 
 % ====================================================================
 \section{Use in the thesis experiments}