\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[margin=2.5cm]{geometry}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{tikz}
\usetikzlibrary{positioning,arrows.meta,shapes.geometric}
\usepackage[hidelinks]{hyperref}
\usepackage{caption}
\usepackage{enumitem}
% ---- listing style (matches the Turtle/JSON look of the thesis) ----
\definecolor{kw}{rgb}{0.0,0.3,0.6}
\definecolor{str}{rgb}{0.6,0.2,0.1}
\definecolor{cmt}{rgb}{0.3,0.5,0.3}
\lstset{
basicstyle=\ttfamily\footnotesize,
breaklines=true,
showstringspaces=false,
keywordstyle=\color{kw}\bfseries,
stringstyle=\color{str},
commentstyle=\color{cmt}\itshape,
frame=single,
framesep=4pt,
rulecolor=\color{black!25},
backgroundcolor=\color{black!2},
numbers=left,
numberstyle=\tiny\color{black!40},
xleftmargin=2.2em,
}
\newcommand{\code}[1]{\texttt{\small #1}}
\newcommand{\tstart}{\code{<triple\_start>}}
\newcommand{\predm}{\code{<predicate\_marker>}}
\newcommand{\objm}{\code{<object\_marker>}}
\newcommand{\tend}{\code{<triple\_end>}}
\title{\textbf{A Relationship-Rich Financial Dataset for\\
Text-to-RDF-Triple Extraction:\\
SEC Fund Disclosures as a Knowledge-Graph Source}}
\author{Dr.\ Florian Herzog\\\small Thesis supervisor --- companion technical note to the thesis
\emph{Magical RDF Triples and how to synthetize them}}
\date{\today}
\begin{document}
\maketitle
\begin{abstract}
This note specifies a finance-domain dataset for training and evaluating models
that extract Resource Description Framework (RDF) triples from plain text, the
task at the centre of the accompanying thesis. The dataset is constructed from
mandatory U.S.\ Securities and Exchange Commission (SEC) fund disclosures. Unlike
the Wikidata-derived corpora commonly used for this task --- where the source text
is written \emph{from} the triples and is therefore roughly the same size as its
target --- here a long natural-language prospectus (on the order of
$10^{5}$--$10^{7}$ characters) maps to a compact graph of a few hundred triples,
yielding a realistic text-to-output size ratio on the order of $100\!:\!1$ (per-fund
median near $117\!:\!1$ on the 2025\,Q3 build). Crucially,
the target is a genuine graph of \emph{entity-to-entity relationships} (a fund
\emph{advised by} a management company, \emph{distributed by} an underwriter,
\emph{holding} a security \emph{issued by} an issuer), not a flat list of literal
attributes. Two distinct ground-truth regimes are available: a \emph{model-free
gold} baseline derived from parallel structured SEC filings (N-CEN, N-PORT,
Series/Class reference data), and a \emph{strong-model silver} baseline for the
relations expressed only in prose. We describe the source filings, the target
ontology and graph structure, the holdings sub-graph, the serialization into the
thesis's grammar-terminal token format, and how the resulting samples are used to
train and benchmark the four models under study.
\end{abstract}
\tableofcontents
% ====================================================================
\section{Motivation: the size-ratio and relationship gap}
% ====================================================================
The thesis trains a general-purpose language model to extract serialized RDF
triples from plain text conditioned on an ontology. The quality of such a model
is bounded by the quality of its training data. The benchmarks surveyed in the
thesis (WebNLG, T-REx, REBEL, Wiki-NRE) share two properties that make them weak
proxies for the real extraction problem.
\paragraph{Symmetric size.} In WebNLG, human annotators were instructed to write
text \emph{from} a given set of triples. Consequently each sentence encodes
almost exactly the triples it was generated from: the input text and the target
JSON are of comparable length. The task degenerates towards transliteration and
never exercises the central difficulty of practical information extraction ---
locating a small number of facts inside a large, noisy document.
\paragraph{Attribute-only targets.} Many relation-extraction corpora reduce to
mapping a sentence to a single predicate label, or to a star of literal-valued
attributes around one entity. They contain few \emph{entity-to-entity} edges, and
therefore exercise little of the graph structure that motivates RDF in the first
place.
A suitable dataset must instead satisfy both of the following, simultaneously:
\begin{enumerate}[label=(\roman*)]
\item the input text is substantially larger than the target serialization, so
the model must perform genuine reading comprehension over a long document;
\item the target is a multi-entity-type graph of relationships, so the inferred
ontology contains edges of the form
\code{TypeA\ ---predicate--->\ TypeB}, not only
\code{TypeA\ ---predicate--->\ literal}.
\end{enumerate}
SEC fund disclosures satisfy both, and additionally provide a rare third
property: a \emph{free, non-model ground truth}, because the same facts that
appear in the prose are independently filed by the same registrants in structured
form.
% ====================================================================
\section{Source filings}
% ====================================================================
The dataset draws on five public SEC data sources, summarised in
Table~\ref{tab:sources}. Their division of labour is the key design idea: the
\emph{prose} filings provide the model input, while the \emph{structured} filings
provide the ground-truth graph.
\begin{table}[h]
\centering
\caption{SEC source filings and their role in the dataset.}
\label{tab:sources}
\small
\begin{tabular}{@{}llll@{}}
\toprule
Source & Form & Content & Role \\
\midrule
Prospectus & N-1A (485BPOS, 497) & Investment objective, strategy, & \textbf{input text} \\
& & management, fees (prose) & \\
N-CEN & N-CEN & Service providers, classification & \textbf{gold edges} \\
N-PORT & NPORT-P & Portfolio holdings (quarterly) & \textbf{gold edges} \\
Series/Class CSV & --- & Trust/Series/Class identity & \textbf{gold skeleton} \\
Annual report (MDFP) & N-CSR & Top-holdings commentary (prose) & \textbf{input text (holdings)} \\
\bottomrule
\end{tabular}
\end{table}
\paragraph{Prospectus (N-1A).} The statutory prospectus is a long legal document
describing a fund family. It names, in prose, the fund's investment adviser,
sub-adviser, distributor, transfer agent, portfolio managers and benchmark index,
together with its objective, strategy and fee structure. A single filing covers
all funds (series) of a trust and ranges from roughly $4\times10^{5}$ to
$1\times10^{7}$ characters of extracted text.
\paragraph{N-CEN.} The annual census filing reports, in structured tabular form,
each fund's service providers --- adviser, sub-adviser, custodian, transfer
agent, administrator --- and the trust's principal underwriter, each with a Legal
Entity Identifier (LEI) where available. These rows are the gold standard for the
service-provider edges of the graph.
\paragraph{N-PORT.} The portfolio filing (NPORT-P), used here at the quarterly
granularity of Table~\ref{tab:sources}, reports, per fund, every
security held, with issuer name, identifiers (CUSIP, ISIN, LEI), asset category,
investment country and market value. These rows are the gold standard for the
holdings sub-graph (Section~\ref{sec:holdings}).
\paragraph{Series/Class reference data.} The SEC's Series/Class listing provides
the trust\,$\to$\,series\,$\to$\,share-class identity backbone, gold for the
structural \code{seriesOf} and \code{hasShareClass} edges.
A central property is \emph{redundancy across modality}: a fact such as ``the
fund is advised by Geode Capital Management, LLC'' appears both as a sentence in
the prospectus (the input) and as a structured row in N-CEN (the label). This is
what makes a model-free ground truth possible.
% ====================================================================
\section{Target ontology and graph structure}
% ====================================================================
The target of each sample is a directed, labelled multigraph $G=(E,R)$ in the
sense of the thesis, where nodes are typed entities and edges are RDF predicates.
The entity types and relations are listed in Table~\ref{tab:ontology}.
\begin{table}[h]
\centering
\caption{Target ontology: entity types and entity-to-entity relations.}
\label{tab:ontology}
\small
\begin{tabular}{@{}lll@{}}
\toprule
Subject type & Predicate & Object type \\
\midrule
Fund & \code{seriesOf} & Trust \\
Fund & \code{advisedBy} & InvestmentAdviser \\
Fund & \code{subAdvisedBy} & SubAdviser \\
Fund & \code{transferAgent} & TransferAgent \\
Fund & \code{custodian} & Custodian \\
Fund & \code{administrator} & Administrator \\
Trust & \code{underwrittenBy} & Distributor \\
\addlinespace
Fund & \code{holds} & Security \quad(holdings sub-graph) \\
Security & \code{issuedBy} & Issuer \\
Security & \code{domiciledIn} & Country \\
Fund & \code{tracksIndex} & Index \\
\bottomrule
\end{tabular}
\end{table}
Every relation in Table~\ref{tab:ontology} has an entity as its object, not a
literal. The dataset may optionally be enriched with literal-valued attribute
triples (management fee, net expense ratio, returns, portfolio turnover) drawn
from the XBRL Risk/Return filings; these are deliberately \emph{secondary},
because the purpose of the dataset is to exercise relational structure.
Following the thesis's ontology-inference procedure
(SPARQL meta-schema extraction), the per-sample ontology presented to the model
is the set of distinct \code{(subject type, predicate, object type)} patterns
realised in that sample, e.g.
\noindent\begin{minipage}{\linewidth}
\begin{lstlisting}[language=,caption={Inferred ontology for one fund trust (model input, abbreviated).},captionpos=b]
{
"Fund": {
"seriesOf": ["Trust"],
"advisedBy": ["InvestmentAdviser"],
"subAdvisedBy": ["SubAdviser"],
"transferAgent": ["TransferAgent"],
"custodian": ["Custodian"],
"administrator": ["Administrator"]
},
"Trust": { "underwrittenBy": ["Distributor"] }
}
\end{lstlisting}
\end{minipage}
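The thesis performs this step with SPARQL meta-schema extraction; the following
plain-Python stand-in is only a minimal sketch of the same idea, assuming the
target triples are available as typed records. The \code{TypedTriple} layout and
the function name \code{infer\_ontology} are illustrative assumptions, not part of
the accompanying scripts.
\begin{lstlisting}[language=Python,caption={Illustrative sketch: collecting the distinct type patterns of one sample.},captionpos=b]
from collections import defaultdict
from typing import NamedTuple


class TypedTriple(NamedTuple):
    """Hypothetical record: one gold edge plus the types of both endpoints."""
    subject: str
    subject_type: str
    predicate: str
    obj: str
    object_type: str


def infer_ontology(triples):
    """Collect the distinct (subject type, predicate, object type) patterns."""
    patterns = defaultdict(lambda: defaultdict(set))
    for t in triples:
        patterns[t.subject_type][t.predicate].add(t.object_type)
    # Sort everything so the result is deterministic and JSON-serializable.
    return {
        s_type: {pred: sorted(o_types) for pred, o_types in sorted(preds.items())}
        for s_type, preds in sorted(patterns.items())
    }


triples = [
    TypedTriple("Small Cap Special Values Fund", "Fund", "seriesOf",
                "VALIC Co I", "Trust"),
    TypedTriple("Small Cap Special Values Fund", "Fund", "custodian",
                "State Street Bank and Trust Company", "Custodian"),
]
ontology = infer_ontology(triples)
\end{lstlisting}
Because only patterns realised in the sample are collected, a fund without a
sub-adviser yields an ontology without \code{subAdvisedBy}.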
\begin{figure}[h]
\centering
\begin{tikzpicture}[
node distance=13mm and 30mm,
ent/.style={draw,rounded corners,align=center,font=\footnotesize,
inner sep=3pt,minimum height=7mm,fill=blue!5},
edge/.style={-{Stealth},font=\scriptsize,shorten >=1pt,shorten <=1pt},
lbl/.style={font=\scriptsize,fill=white,inner sep=1.5pt}]
% --- service-provider / structure cluster (centre-left) ---
\node[ent] (fund) {Fund};
\node[ent,above=of fund] (trust) {Trust};
\node[ent,above=of trust] (dist) {Distributor};
\node[ent,left=of fund] (adv) {Investment\\Adviser};
\node[ent,below=16mm of adv] (sub) {Sub-\\Adviser};
\node[ent,below=of fund] (cust) {Custodian};
\node[ent,right=24mm of cust] (ta) {Transfer\\Agent};
\node[ent,right=of fund] (admin) {Administrator};
% --- holdings cluster (far right, separated column) ---
\node[ent,right=26mm of admin] (sec) {Security};
\node[ent,above=of sec] (iss) {Issuer};
\node[ent,below=of sec] (ctry) {Country};
\draw[edge] (fund) -- node[lbl]{seriesOf} (trust);
\draw[edge] (trust) -- node[lbl]{underwrittenBy} (dist);
\draw[edge] (fund) -- node[lbl]{advisedBy} (adv);
\draw[edge] (fund) -- node[lbl,pos=0.55]{subAdvisedBy} (sub);
\draw[edge] (fund) -- node[lbl]{custodian} (cust);
\draw[edge] (fund) -- node[lbl,pos=0.55]{transferAgent} (ta);
\draw[edge] (fund) -- node[lbl]{administrator} (admin);
% holds: arc from the Fund's top, over the Administrator, down to Security
\draw[edge] (fund.north east) to[out=35,in=140]
node[lbl,pos=0.65]{holds} (sec.north west);
\draw[edge] (sec) -- node[lbl]{issuedBy} (iss);
\draw[edge] (sec) -- node[lbl]{domiciledIn} (ctry);
\end{tikzpicture}
\caption{Schematic of the target knowledge graph. Left and centre: the
service-provider/structure graph grounded in the prospectus prose. Right column
(Issuer--Security--Country): the holdings sub-graph grounded in annual-report
commentary with N-PORT gold.}
\label{fig:graph}
\end{figure}
% ====================================================================
\section{The holdings sub-graph}
\label{sec:holdings}
% ====================================================================
Portfolio holdings express the richest relationships in the data --- a fund
\emph{holds} many securities, each \emph{issued by} an issuer
\emph{domiciled in} a country --- and are the natural place to grow the graph
beyond service providers. They require care, however, because holdings are
\emph{not} disclosed in the prospectus: the prospectus describes a fund's
\emph{strategy} (``invests in large-capitalisation equities''), never its
specific positions.
The text-bearing source for holdings is the \textbf{annual or semi-annual report}
(Form N-CSR). It contains two parts:
\begin{itemize}[nosep]
\item the \emph{Schedule of Investments}, a complete table of every holding ---
structured, not prose, and therefore not an extraction target; and
\item the \emph{Management Discussion of Fund Performance} (MDFP), a narrative
in which the portfolio manager names the fund's \emph{top} positions and
explains their contribution (``our largest holdings were Apple,
Microsoft and \dots'').
\end{itemize}
The MDFP is genuine prose and yields real \code{holds} edges for the named
positions. The corresponding \textbf{N-PORT} filing provides the structured gold:
the full holdings table, against which the MDFP-named subset can be verified and
from which \code{issuedBy} and \code{domiciledIn} are taken.
This produces a second, independent text-to-graph task in the same financial
domain: \emph{MDFP commentary $\to$ holdings sub-graph}, with N-PORT as gold.
Because it pairs a \emph{different} document type with a \emph{different} relation
set, including it strengthens the cross-domain generalization claim of the thesis
(Section~3.2.3): a single model is shown to extract two structurally different
graphs in the same domain. Fund fact sheets and portfolio-manager commentaries
published by fund companies are an additional, off-EDGAR prose source for the same
edges, at the cost of having no standardized machine-readable gold.
A practical caveat applies to holdings as it does to service providers
(Section~\ref{sec:baselines}): only the positions \emph{named in prose} are
recoverable from the input. The benchmark therefore scopes the \code{holds}
target to the MDFP-named subset rather than the full N-PORT schedule, to avoid
penalising a model for failing to extract facts that are absent from its input.
% ====================================================================
\section{Serialization and the marker token format}
% ====================================================================
Targets are serialized in the grammar-terminal token format introduced in the
thesis (Section~5.2), in which four special tokens delimit triple components and
shared subjects/predicates are factored out, mirroring Turtle's predicate-object
lists:
\begin{lstlisting}[language=,caption={Target serialization for one segmented fund (primary-custodian scope).},captionpos=b]
<triple_start> Small Cap Special Values Fund
<predicate_marker> seriesOf
<object_marker> VALIC Co I
<predicate_marker> advisedBy
<object_marker> The Variable Annuity Life Insurance Company
<predicate_marker> subAdvisedBy
<object_marker> SunAmerica Asset Management, LLC
<predicate_marker> administrator
<object_marker> SunAmerica Asset Management, LLC
<predicate_marker> custodian
<object_marker> State Street Bank and Trust Company
<triple_end>
\end{lstlisting}
To support the four-model comparison from a single dataset, each sample carries
\emph{two} serializations of the identical triples: \code{target\_serialized},
the marker form above (for Models 2/4, whose vocabulary is extended with the four
grammar-terminal tokens), and \code{target\_serialized\_plain}, a Turtle-like form
using ordinary `\code{;}' and `\code{,}' delimiters and no special tokens (for
Models 1/3). Because the two differ only in whether the delimiters are dedicated
tokens, the comparison isolates exactly the effect under study in research
question~1.
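The relationship between the two serializations can be sketched as follows. This
is a hedged illustration, not the builder's code: the function names, the
one-component-per-line layout and the exact spacing of the plain form are
assumptions, and the sketch does not address how a comma inside an entity name
(``SunAmerica Asset Management, LLC'') is distinguished from the `\code{,}'
delimiter.
\begin{lstlisting}[language=Python,caption={Illustrative sketch: marker and plain serialization of the same triples.},captionpos=b]
def group_triples(triples):
    """Group (subject, predicate, object) triples by subject, then predicate."""
    grouped = {}
    for s, p, o in triples:
        grouped.setdefault(s, {}).setdefault(p, []).append(o)
    return grouped  # dicts preserve insertion order


def serialize_markers(triples):
    """Marker form: shared subjects and predicates are factored out."""
    lines = []
    for subject, preds in group_triples(triples).items():
        lines.append(f"<triple_start> {subject}")
        for pred, objects in preds.items():
            lines.append(f"<predicate_marker> {pred}")
            lines.extend(f"<object_marker> {obj}" for obj in objects)
        lines.append("<triple_end>")
    return "\n".join(lines)


def serialize_plain(triples):
    """Turtle-like form: ';' between predicates, ',' between objects."""
    blocks = []
    for subject, preds in group_triples(triples).items():
        parts = [f"{pred} {' , '.join(objects)}" for pred, objects in preds.items()]
        blocks.append(f"{subject} " + " ; ".join(parts) + " .")
    return "\n".join(blocks)


example = [
    ("Fund A", "advisedBy", "Adviser X"),
    ("Fund A", "custodian", "Bank Y"),
    ("Fund A", "custodian", "Bank Z"),
]
markers = serialize_markers(example)
plain = serialize_plain(example)
\end{lstlisting}
Both functions share \code{group\_triples}, so the two outputs differ only in
their delimiters, which is the property the comparison relies on.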
Each sample is thus a JSON record with: the input prose (\code{input\_text}), the
inferred ontology (\code{ontology}), the target triples as a structured list
(\code{target\_triples}) and in both serializations, the trust/series identifiers,
and size statistics.
% ====================================================================
\section{Per-fund segmentation}
\label{sec:segmentation}
% ====================================================================
A single prospectus filing covers \emph{all} funds of a trust and may exceed
$10^{7}$ characters, beyond any practical context window. Treating one filing as
one sample is also semantically wrong: the target would mix the subgraphs of
dozens of unrelated funds. The dataset therefore segments each filing into
\emph{per-fund} samples, so that one fund's prospectus section maps to that one
fund's subgraph.
\paragraph{Fetching all books of a trust.} Large fund families split their funds
across \emph{several} prospectus books, so the single most recent filing covers
only a fraction of a trust's funds. The fetcher therefore retrieves the most
recent \emph{full} prospectuses (forms 485BPOS/485APOS) for each trust and
concatenates their text; the much shorter 497/497K supplements are used only as
a fallback. On the proof-of-concept slice this raised the number of fetched
books from one per trust to a mean of seven, and the fetched text for a large
ETF trust, for example, from $5\times10^{5}$ to $2.2\times10^{7}$ characters, so
that far more fund sections are present.
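A minimal sketch of this preference rule, assuming each filing is described by a
dictionary with hypothetical keys \code{form} and \code{filed} (an ISO date
string), and reading ``fallback'' as ``used only when a trust has no full
prospectus''; the default of \code{max\_filings} is likewise an assumption.
\begin{lstlisting}[language=Python,caption={Illustrative sketch: preferring full prospectus books over supplements.},captionpos=b]
FULL_FORMS = {"485BPOS", "485APOS"}
SUPPLEMENT_FORMS = {"497", "497K"}


def select_books(filings, max_filings=10):
    """Pick the most recent full prospectuses, else fall back to supplements."""
    # ISO date strings sort chronologically, so a string sort is sufficient.
    newest_first = sorted(filings, key=lambda f: f["filed"], reverse=True)
    full = [f for f in newest_first if f["form"] in FULL_FORMS]
    if full:
        return full[:max_filings]
    supplements = [f for f in newest_first if f["form"] in SUPPLEMENT_FORMS]
    return supplements[:max_filings]
\end{lstlisting}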
\paragraph{Section anchors.} Statutory prospectuses open each fund's block with
the fund name immediately followed by a summary heading. Filers use several
styles, so the segmenter accepts any of: ``Fund Summary'', ``Investment
Objective'', ``Principal Investment Strategies'', the ETF objective sentence
``The Fund seeks\dots'', or a class/ticker header (``Class/Ticker:\dots'').
\paragraph{Boundary selection and a collapse guard.} The segmenter collects
\emph{all} anchored heading positions across the concatenated text, sorts them,
and cuts each segment from one heading to the next. Because a fund name can also
occur in tables of contents and cross-references, na\"ively taking the first
occurrence collapses segments to a few characters; the segmenter therefore
discards any candidate whose resulting segment is shorter than a minimum
($1{,}500$ characters) and, for each fund, keeps the longest surviving segment.
Each segment is paired with that fund's edges plus the fund-anchored
\code{seriesOf} edge and the trust-level \code{underwrittenBy} edge.
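The boundary selection and collapse guard can be sketched as below. The regular
expression, the function name \code{segment\_funds} and the assumption that a
heading follows the fund name after optional whitespace are illustrative; only
the list of heading styles and the $1{,}500$-character minimum come from the
description above.
\begin{lstlisting}[language=Python,caption={Illustrative sketch: anchored segmentation with a minimum-length guard.},captionpos=b]
import re

MIN_SEGMENT_CHARS = 1500  # minimum stated in the text

# Heading styles named in the text; the regex itself is an assumption.
HEADING = (r"(?:Fund Summary|Investment Objective|"
           r"Principal Investment Strategies|The Fund seeks|Class/Ticker:)")


def segment_funds(text, fund_names):
    """Return {fund name: longest segment of at least MIN_SEGMENT_CHARS}."""
    anchors = []  # (start offset, fund name)
    for name in fund_names:
        pattern = re.compile(re.escape(name) + r"\s*" + HEADING)
        anchors.extend((m.start(), name) for m in pattern.finditer(text))
    anchors.sort()

    best = {}
    for i, (start, name) in enumerate(anchors):
        # Each segment runs from one anchored heading to the next.
        end = anchors[i + 1][0] if i + 1 < len(anchors) else len(text)
        segment = text[start:end]
        if len(segment) < MIN_SEGMENT_CHARS:
            continue  # collapse guard: contents entries, cross-references
        if len(segment) > len(best.get(name, "")):
            best[name] = segment
    return best
\end{lstlisting}
A fund whose every candidate falls below the minimum is simply absent from the
result, which lets the caller count it as skipped rather than drop it silently.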
\paragraph{Name-variant matching.} The fund name filed in N-CEN and the heading
printed in the prospectus frequently differ in the legal-form suffix --- a fund
filed as ``\dots\ Fund'' may be headed ``\dots\ ETF'' or ``\dots\ Portfolio''.
The segmenter matches on a set of normalized variants (suffix swapped or dropped)
rather than the exact N-CEN string.
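One possible reading of ``suffix swapped or dropped'' is sketched below; the
suffix list contains only the three forms named above, and the helper
\code{name\_variants} is hypothetical.
\begin{lstlisting}[language=Python,caption={Illustrative sketch: generating name variants for heading matching.},captionpos=b]
SUFFIXES = ("Fund", "ETF", "Portfolio")  # legal-form suffixes named in the text


def name_variants(ncen_name):
    """Return the N-CEN name plus suffix-swapped and suffix-dropped variants."""
    name = " ".join(ncen_name.split())  # collapse runs of whitespace
    variants = {name}
    for suffix in SUFFIXES:
        if name.endswith(" " + suffix):
            stem = name[: -len(suffix)].rstrip()
            variants.add(stem)  # suffix dropped
            variants.update(f"{stem} {other}" for other in SUFFIXES)  # swapped
            break
    return variants
\end{lstlisting}
The suffix-dropped variant is the shortest and therefore the most likely to
over-match, so in practice it is best tried last.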
\paragraph{Coverage and fallback.} Where a fund's section cannot be located it is
skipped and counted (never silently dropped); where no section in a trust can be
located, the builder emits a single whole-trust fallback sample. On the
proof-of-concept slice, fetching all books and applying the robust segmenter
turns $14$ trusts into $141$ samples ($135$ cleanly segmented per-fund plus $6$
whole-trust fallbacks), with a per-fund median input of $\sim\!3.7\times10^{4}$
characters against a $\sim\!6.5\times10^{2}$-character target --- a per-fund
text-to-JSON ratio with a median near $55\!:\!1$. The residual misses are
dominated not by segmentation but by an \emph{entity-resolution} gap: some trusts
file their prospectuses under a different CIK (or fund brand) than their N-CEN
report, so the N-CEN fund names do not appear in the fetched text at all. Closing
that gap requires joining on the SEC Series identifier across CIKs rather than
fetching more filings of the same CIK, and is left to the full-scale build.
% ====================================================================
\section{Ground truth and baselines}
\label{sec:baselines}
% ====================================================================
The dataset offers two independent ground-truth regimes.
\paragraph{Model-free gold.} For \code{advisedBy}, \code{subAdvisedBy},
\code{transferAgent}, \code{custodian}, \code{administrator} and
\code{underwrittenBy}, the labels come directly from N-CEN; for \code{seriesOf}
and \code{hasShareClass}, from the Series/Class reference data; for \code{holds},
\code{issuedBy} and \code{domiciledIn}, from N-PORT. No model is involved in
producing these labels, which makes them an unusually trustworthy reference for a
generative-extraction benchmark.
\paragraph{The custodian relation and edge scoping.} The custodian relation
illustrates a subtlety that any honest benchmark on this data must address. N-CEN
reports, for a global fund, not only its \emph{primary} custodian but the entire
chain of \emph{foreign sub-custodians} --- one bank per market it invests in.
These sub-custodians have two damaging properties. First, they are
\emph{unextractable}: they appear only in the N-CEN table and essentially never in
the prospectus prose (a naive string-match recovers $7\%$ of them), so keeping
them as targets asks the model to extract facts absent from its input. Second,
they \emph{dominate}: with \code{IS\_SUB\_CUSTODIAN${=}$Y} accounting for $88\%$
of custodian rows, they constitute well over half of \emph{all} edges in the
unfiltered graph ($21{,}141$ of $36{,}880$, about $57\%$), inflating both the
target size and the training loss with noise.
The dataset therefore scopes the custodian relation to the \emph{primary}
custodian (\code{IS\_SUB\_CUSTODIAN${\neq}$Y}, a median of one per fund), which is
genuinely prose-grounded --- it is named in the prospectus or its Statement of
Additional Information (e.g.\ ``State Street Bank and Trust Company serves as
custodian''). This single change reduces the corpus from $36{,}880$ to $15{,}739$
edges, all of prose-grounded relation types, and is the configurable default
(\code{--custodian-scope primary}). The full sub-custodian chain remains available
in N-CEN as a structured-only relation outside the text-to-triples task. This is a
dataset-quality decision of the same kind the thesis notes for T-REx and REBEL,
whose non-exhaustive references unfairly penalise correct extractions.
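As a sketch of the scoping rule, assuming N-CEN custodian rows are loaded as
dictionaries: only the \code{IS\_SUB\_CUSTODIAN} flag and the \code{primary} scope
value are taken from the text, while the key \code{CUSTODIAN\_NAME}, the scope
values \code{all} and \code{none}, and the function name are assumptions.
\begin{lstlisting}[language=Python,caption={Illustrative sketch: scoping custodian rows to the primary custodian.},captionpos=b]
def scope_custodians(rows, scope="primary"):
    """Filter N-CEN custodian rows; scope is 'primary', 'all' or 'none'."""
    if scope == "all":
        return list(rows)
    if scope == "none":
        return []
    if scope != "primary":
        raise ValueError(f"unknown custodian scope: {scope!r}")
    # Primary custodians are the rows NOT flagged as sub-custodians.
    return [r for r in rows if r.get("IS_SUB_CUSTODIAN") != "Y"]


rows = [
    {"CUSTODIAN_NAME": "State Street Bank and Trust Company",
     "IS_SUB_CUSTODIAN": "N"},
    {"CUSTODIAN_NAME": "Foreign Bank A", "IS_SUB_CUSTODIAN": "Y"},
]
primary = scope_custodians(rows)
\end{lstlisting}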
\paragraph{No-model lower bound.} A trivial string-matching baseline --- emit a
gold edge iff the object's name occurs in the prose --- establishes a \emph{floor}
and measures \emph{how prose-grounded each relation is}. Table~\ref{tab:baseline}
reports this on the full 2025\,Q3 build after primary-custodian scoping,
multi-book fetching and per-fund segmentation. Because the baseline requires an
\emph{exact substring} match within the fund's \emph{own} section, its recall is a
strict lower bound: a fund's adviser, for instance, must be named in that fund's
segment under a literal spelling. The adviser is recovered with recall $0.93$ and
the micro-averaged $F_1$ reaches $0.79$; custodian recall is $0.63$ (up from
$0.07$ unscoped and $0.37$ after scoping alone). The residual gap from $1.0$ is
attributable to surface-form variation (``State Street Bank and Trust Company''
vs.\ ``State Street'') that a trained model handles but exact matching does not.
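The baseline is simple enough to state in a few lines. The sketch below assumes
gold edges are \code{(subject, predicate, object name)} tuples paired with the
text of the fund's own segment; the function names are illustrative and need not
match \code{score\_baseline.py}.
\begin{lstlisting}[language=Python,caption={Illustrative sketch: the no-model string-match baseline.},captionpos=b]
def string_match_baseline(segment, gold_edges):
    """Emit a gold edge iff its object name occurs verbatim in the segment."""
    return [edge for edge in gold_edges if edge[2] in segment]


def micro_recall_f1(samples):
    """samples: iterable of (segment text, gold edges) pairs."""
    hits = total = 0
    for segment, gold in samples:
        hits += len(string_match_baseline(segment, gold))
        total += len(gold)
    recall = hits / total if total else 0.0
    # Only gold edges are ever emitted, so precision is 1 whenever hits > 0.
    f1 = 2 * recall / (1 + recall) if hits else 0.0
    return recall, f1


segment = ("State Street Bank and Trust Company serves as custodian. "
           "The adviser is Example Advisers LLC.")
gold = [
    ("Fund A", "custodian", "State Street Bank and Trust Company"),
    ("Fund A", "advisedBy", "Example Advisers, LLC"),  # comma breaks the match
]
recall, f1 = micro_recall_f1([(segment, gold)])
\end{lstlisting}
The second edge in the example is missed only because of a comma, which is the
kind of surface-form variation that keeps the exact-match recall below $1.0$.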
\begin{table}[h]
\centering
\caption{No-model string-match baseline on the full 2025\,Q3 build, after
primary-custodian scoping, multi-book fetching and per-fund segmentation
($852$ samples). Precision is $1.00$ by construction; recall is a strict
exact-match lower bound.}
\label{tab:baseline}
\small
\begin{tabular}{@{}lrr@{}}
\toprule
Relation & Recall & Gold edges \\
\midrule
\code{advisedBy} & 0.93 & 1{,}673 \\
\code{seriesOf} & 0.84 & 1{,}555 \\
\code{subAdvisedBy} & 0.84 & 946 \\
\code{administrator} & 0.80 & 2{,}066 \\
\code{transferAgent} & 0.72 & 1{,}721 \\
\code{custodian} & 0.63 & 1{,}761 \\
\code{underwrittenBy} & 0.62 & 863 \\
\midrule
micro-average & 0.65 & 6{,}479 \\
\bottomrule
\end{tabular}
\end{table}
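The reported $F_1$ follows from the micro-averaged recall alone. Since every
emitted edge is a gold edge, precision is $P=1$, and with the micro-averaged
recall $R=0.65$ of Table~\ref{tab:baseline}:
\[
F_1 \;=\; \frac{2PR}{P+R} \;=\; \frac{2R}{1+R}
\;=\; \frac{1.30}{1.65} \;\approx\; 0.79 .
\]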
\paragraph{Strong-model silver.} For relations that are expressed only in prose
and lack a clean structured source --- portfolio managers (\code{managedBy}),
the named benchmark index (\code{tracksIndex}), and MDFP-named holdings --- a
strong reference model (e.g.\ GPT-4 or Claude Opus) produces silver labels.
Because the structured relations have model-free gold, the silver-labelling model
can itself be \emph{measured} on the overlapping gold edges, so its reliability is
quantified rather than assumed.
% ====================================================================
\section{Corpus statistics}
% ====================================================================
Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. The
N-CEN gold graph (after primary-custodian scoping) holds $15{,}739$
entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all full
prospectus books for every trust ($2{,}326$ filings across $393$ trusts; the
remaining $42$ are closed-end or interval funds that file no standard prospectus)
and applying the robust
per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus
$193$ whole-trust fallbacks). The segmented samples have a per-fund median ratio
near $117\!:\!1$ (input prose to target serialization), and across all samples the
median exceeds $400\!:\!1$ --- the inverse of the symmetric-size benchmarks.
\begin{table}[h]
\centering
\caption{Corpus statistics for the full 2025\,Q3 build. Left: N-CEN gold graph
(primary-custodian scope). Right: text-to-triple samples (all prospectus books per
trust, per-fund segmentation).}
\label{tab:stats}
\small
\begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
\toprule
\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{Samples (2025\,Q3)} \\
\cmidrule(r){1-2}\cmidrule(l){3-4}
Trust graphs & 435 & Samples (total) & 852 \\
Funds (series) & 2{,}421 & \;segmented per-fund & 659 \\
Entity-entity edges & 15{,}739 & \;whole-trust fallback & 193 \\
\;custodian (primary) & 3{,}045 & Trusts fetched & 393 \\
\;advisedBy & 2{,}588 & Prospectus filings & 2{,}326 \\
Distributors & 458 & Ratio (median, per-fund) & $117\!:\!1$ \\
\bottomrule
\end{tabular}
\end{table}
\paragraph{Train/validation/test split.} The samples are partitioned at the trust
level by a deterministic hash of the CIK: $655$ train, $122$ validation and $75$
test samples (from $268$, $37$ and $36$ trusts respectively), with \emph{no} trust
appearing in more than one split. Adding further quarters and dropping ontology
subsets (per the thesis's augmentation strategy) can expand the corpus further.
% ====================================================================
\section{Use in the thesis experiments}
% ====================================================================
Each sample is a triple $(x,\,\sigma,\,y)$ where $x$ is the prospectus prose,
$\sigma$ is the inferred ontology, and $y$ is the serialized triple graph (marker
or plain form, depending on the model).
The model is trained to compute $y = f_{\theta}(x,\sigma)$. This dataset feeds the
four-model comparison of the thesis directly:
\begin{itemize}[nosep]
\item \textbf{Model 1/3} (decoder-only / encoder-decoder, no extra tokens):
trained on the plain serialization \code{target\_serialized\_plain}.
\item \textbf{Model 2/4} (with grammar-terminal tokens): trained on the marker
serialization \code{target\_serialized}, with the four markers
\tstart, \predm, \objm, \tend{} added to the vocabulary as single tokens,
testing research question~1 (do dedicated terminal tokens reduce loss and
raise $F_1$).
\end{itemize}
\paragraph{Splits.} The dataset is partitioned into train/validation/test at the
\emph{trust} level ($80/10/10$), assigned by a deterministic hash of the trust
CIK. Splitting by trust rather than by fund prevents leakage: funds of one trust
share advisers, distributors and custodians, so a fund-level split would let the
model memorise trust-specific entities seen in training and inflate test scores.
The builder verifies that no trust appears in more than one split.
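A deterministic trust-level assignment could look like the sketch below. The note
does not name the hash function, so SHA-256 over the CIK string and the bucket
boundaries are assumptions.
\begin{lstlisting}[language=Python,caption={Illustrative sketch: deterministic trust-level split by hashed CIK.},captionpos=b]
import hashlib


def assign_split(cik, train=0.8, validation=0.1):
    """Map a trust CIK to 'train', 'validation' or 'test' deterministically."""
    digest = hashlib.sha256(str(cik).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"
\end{lstlisting}
Every fund inherits the split of its trust's CIK, so no trust can straddle two
splits, and the assignment is stable across rebuilds and across quarters.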
Because the dataset's input is far longer than its output and its target is a
relational graph, it stresses precisely the capabilities the thesis cares about:
long-context reading comprehension and faithful generation of entity-to-entity
structure. Evaluation uses triple-level precision, recall and $F_1$ against the
model-free gold, matched on \code{(subject type, predicate, normalized object
label)} so that IRI-slug differences do not create spurious errors. The same
metric scores the strong-model silver baseline, giving a like-for-like comparison
between the finetuned models and a state-of-the-art prompted extractor.
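A minimal sketch of this metric follows. The normalization (lower-casing,
punctuation stripping, whitespace collapsing) is an assumption standing in for
whatever label normalization the scoring script applies, and predictions and gold
are assumed to be \code{(subject type, predicate, object label)} tuples.
\begin{lstlisting}[language=Python,caption={Illustrative sketch: triple-level precision, recall and F1.},captionpos=b]
import re


def normalize_label(label):
    """Lower-case, replace punctuation by spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", label.lower()).split())


def triple_prf(predicted, gold):
    """Score on (subject type, predicate, normalized object label)."""
    def key(triple):
        return (triple[0], triple[1], normalize_label(triple[2]))

    pred_keys = {key(t) for t in predicted}
    gold_keys = {key(t) for t in gold}
    tp = len(pred_keys & gold_keys)
    precision = tp / len(pred_keys) if pred_keys else 0.0
    recall = tp / len(gold_keys) if gold_keys else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
\end{lstlisting}
Matching on sets means that a repeated prediction is counted once, and that
``Geode Capital Management, LLC'' and ``Geode Capital Management LLC'' are
treated as the same object under the assumed normalization.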
% ====================================================================
\section{Reproducibility}
% ====================================================================
The dataset is built by two scripts accompanying this note.
\code{build\_rdf\_dataset.py} has three stages: \code{gold} parses the local
N-CEN flat files into per-trust gold graphs (with \code{--custodian-scope} to
choose primary-only, all, or no custodian edges); \code{fetch} downloads all
recent full prospectus books per trust from EDGAR (rate-limited, \code{gzip}-aware,
\code{--max-filings} per trust) and concatenates them; \code{samples} segments the
prose per fund and joins it with the gold into the $(x,\sigma,y)$ records described
above. \code{score\_baseline.py} computes the no-model string-match baseline and
scores any strong-model predictions against the gold. All inputs are public SEC
filings; no licensing restriction applies to the derived dataset.
\end{document}