\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[margin=2.5cm]{geometry}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{tikz}
\usetikzlibrary{positioning,arrows.meta,shapes.geometric}
\usepackage[hidelinks]{hyperref}
\usepackage{caption}
\usepackage{enumitem}
% ---- listing style (matches the Turtle/JSON look of the thesis) ----
\definecolor{kw}{rgb}{0.0,0.3,0.6}
\definecolor{str}{rgb}{0.6,0.2,0.1}
\definecolor{cmt}{rgb}{0.3,0.5,0.3}
\lstset{
basicstyle=\ttfamily\footnotesize,
breaklines=true,
showstringspaces=false,
keywordstyle=\color{kw}\bfseries,
stringstyle=\color{str},
commentstyle=\color{cmt}\itshape,
frame=single,
framesep=4pt,
rulecolor=\color{black!25},
backgroundcolor=\color{black!2},
numbers=left,
numberstyle=\tiny\color{black!40},
xleftmargin=2.2em,
}
\newcommand{\code}[1]{\texttt{\small #1}}
\newcommand{\tstart}{\code{<triple\_start>}}
\newcommand{\predm}{\code{<predicate\_marker>}}
\newcommand{\objm}{\code{<object\_marker>}}
\newcommand{\tend}{\code{<triple\_end>}}
\title{\textbf{A Relationship-Rich Financial Dataset for\\
Text-to-RDF-Triple Extraction:\\
SEC Fund Disclosures as a Knowledge-Graph Source}}
\author{Dr.\ Florian Herzog\\\small Thesis supervisor --- companion technical note to the thesis
\emph{Magical RDF Triples and how to synthetize them}}
\date{\today}
\begin{document}
\maketitle
\begin{abstract}
This note specifies a finance-domain dataset for training and evaluating models
that extract Resource Description Framework (RDF) triples from plain text, the
task at the centre of the accompanying thesis. The dataset is constructed from
mandatory U.S.\ Securities and Exchange Commission (SEC) fund disclosures. In
the Wikidata-derived corpora commonly used for this task, the source text is
written \emph{from} the triples and is therefore roughly the same size as its
target. Here, by contrast, a long natural-language prospectus (on the order of
$10^{5}$--$10^{7}$ characters per filing) maps to a compact graph of triples,
yielding a realistic per-fund text-to-output size ratio with a median above
$50\!:\!1$ (near $117\!:\!1$ on the full build). Crucially,
the target is a genuine graph of \emph{entity-to-entity relationships} (a fund
\emph{advised by} a management company, \emph{distributed by} an underwriter,
\emph{holding} a security \emph{issued by} an issuer), not a flat list of literal
attributes. Two distinct ground-truth regimes are available: a \emph{model-free
gold} baseline derived from parallel structured SEC filings (N-CEN, N-PORT,
Series/Class reference data), and a \emph{strong-model silver} baseline for the
relations expressed only in prose. We describe the source filings, the target
ontology and graph structure, the holdings sub-graph, the serialization into the
thesis's grammar-terminal token format, and how the resulting samples are used to
train and benchmark the four models under study.
\end{abstract}
\tableofcontents
% ====================================================================
\section{Motivation: the size-ratio and relationship gap}
% ====================================================================
The thesis trains a general-purpose language model to extract serialized RDF
triples from plain text conditioned on an ontology. The quality of such a model
is bounded by the quality of its training data. The benchmarks surveyed in the
thesis (WebNLG, T-REx, REBEL, Wiki-NRE) share two properties that make them weak
proxies for the real extraction problem.
\paragraph{Symmetric size.} In WebNLG, human annotators were instructed to write
text \emph{from} a given set of triples. Consequently each sentence encodes
almost exactly the triples it was generated from: the input text and the target
JSON are of comparable length. The task degenerates towards transliteration and
never exercises the central difficulty of practical information extraction ---
locating a small number of facts inside a large, noisy document.
\paragraph{Attribute-only targets.} Many relation-extraction corpora reduce to
mapping a sentence to a single predicate label, or to a star of literal-valued
attributes around one entity. They contain few \emph{entity-to-entity} edges, and
therefore exercise little of the graph structure that motivates RDF in the first
place.
A suitable dataset must instead satisfy both of the following, simultaneously:
\begin{enumerate}[label=(\roman*)]
\item the input text is substantially larger than the target serialization, so
the model must perform genuine reading comprehension over a long document;
\item the target is a multi-entity-type graph of relationships, so the inferred
ontology contains edges of the form
\code{TypeA\ ---predicate--->\ TypeB}, not only
\code{TypeA\ ---predicate--->\ literal}.
\end{enumerate}
SEC fund disclosures satisfy both, and additionally provide a rare third
property: a \emph{free, non-model ground truth}, because the same facts that
appear in the prose are independently filed by the same registrants in structured
form.
% ====================================================================
\section{Source filings}
% ====================================================================
The dataset draws on five public SEC data sources, summarised in
Table~\ref{tab:sources}. Their division of labour is the key design idea: the
\emph{prose} filings provide the model input, while the \emph{structured} filings
provide the ground-truth graph.
\begin{table}[h]
\centering
\caption{SEC source filings and their role in the dataset.}
\label{tab:sources}
\small
\begin{tabular}{@{}llll@{}}
\toprule
Source & Form & Content & Role \\
\midrule
Prospectus & N-1A (485BPOS, 497) & Investment objective, strategy, & \textbf{input text} \\
& & management, fees (prose) & \\
N-CEN & N-CEN & Service providers, classification & \textbf{gold edges} \\
N-PORT & NPORT-P & Portfolio holdings (quarterly) & \textbf{gold edges} \\
Series/Class CSV & --- & Trust/Series/Class identity & \textbf{gold skeleton} \\
Annual report (MDFP) & N-CSR & Top-holdings commentary (prose) & \textbf{input text (holdings)} \\
\bottomrule
\end{tabular}
\end{table}
\paragraph{Prospectus (N-1A).} The statutory prospectus is a long legal document
describing a fund family. It names, in prose, the fund's investment adviser,
sub-adviser, distributor, transfer agent, portfolio managers and benchmark index,
together with its objective, strategy and fee structure. A single filing covers
many funds (series) of a trust, though not necessarily all of them
(Section~\ref{sec:segmentation}), and ranges from roughly $4\times10^{5}$ to
$1\times10^{7}$ characters of extracted text.
\paragraph{N-CEN.} The annual census filing reports, in structured tabular form,
each fund's service providers --- adviser, sub-adviser, custodian, transfer
agent, administrator --- and the trust's principal underwriter, each with a Legal
Entity Identifier (LEI) where available. These rows are the gold standard for the
service-provider edges of the graph.
\paragraph{N-PORT.} The portfolio-holdings filing, prepared monthly and used
here at the quarterly granularity listed in Table~\ref{tab:sources}, reports, per
fund, every security held, with issuer name, identifiers (CUSIP, ISIN, LEI),
asset category,
investment country and market value. These rows are the gold standard for the
holdings sub-graph (Section~\ref{sec:holdings}).
\paragraph{Series/Class reference data.} The SEC's Series/Class listing provides
the trust\,$\to$\,series\,$\to$\,share-class identity backbone, gold for the
structural \code{seriesOf} and \code{hasShareClass} edges.
A central property is \emph{redundancy across modality}: a fact such as ``the
fund is advised by Geode Capital Management, LLC'' appears both as a sentence in
the prospectus (the input) and as a structured row in N-CEN (the label). This is
what makes a model-free ground truth possible.
% ====================================================================
\section{Target ontology and graph structure}
% ====================================================================
The target of each sample is a directed, labelled multigraph $G=(E,R)$ in the
sense of the thesis, where nodes are typed entities and edges are RDF predicates.
The entity types and relations are listed in Table~\ref{tab:ontology}.
\begin{table}[h]
\centering
\caption{Target ontology: entity types and entity-to-entity relations.}
\label{tab:ontology}
\small
\begin{tabular}{@{}lll@{}}
\toprule
Subject type & Predicate & Object type \\
\midrule
Fund & \code{seriesOf} & Trust \\
Fund & \code{advisedBy} & InvestmentAdviser \\
Fund & \code{subAdvisedBy} & SubAdviser \\
Fund & \code{transferAgent} & TransferAgent \\
Fund & \code{administrator} & Administrator \\
Trust & \code{underwrittenBy} & Distributor \\
\addlinespace
\multicolumn{3}{@{}l}{\emph{dropped (not prose-grounded, see \S\ref{sec:baselines}):}}\\
Fund & \code{custodian} & Custodian \\
\addlinespace
\multicolumn{3}{@{}l}{\emph{holdings sub-graph (planned 2nd track):}}\\
Fund & \code{holds} & Security \\
Security & \code{issuedBy} & Issuer \\
Security & \code{domiciledIn} & Country \\
Fund & \code{tracksIndex} & Index \\
\bottomrule
\end{tabular}
\end{table}
Every relation in Table~\ref{tab:ontology} has an entity as its object, not a
literal. The dataset may optionally be enriched with literal-valued attribute
triples (management fee, net expense ratio, returns, portfolio turnover) drawn
from the XBRL Risk/Return filings; these are deliberately \emph{secondary},
because the purpose of the dataset is to exercise relational structure.
Following the thesis's ontology-inference procedure
(SPARQL meta-schema extraction), the per-sample ontology presented to the model
is the set of distinct \code{(subject type, predicate, object type)} patterns
realised in that sample, e.g.
\noindent\begin{minipage}{\linewidth}
\begin{lstlisting}[language=,caption={Inferred ontology for one fund trust (model input, abbreviated).},captionpos=b]
{
"Fund": {
"seriesOf": ["Trust"],
"advisedBy": ["InvestmentAdviser"],
"subAdvisedBy": ["SubAdviser"],
"transferAgent": ["TransferAgent"],
"administrator": ["Administrator"]
},
"Trust": { "underwrittenBy": ["Distributor"] }
}
\end{lstlisting}
\end{minipage}
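
The pattern collection itself can be illustrated without a triple store. The
following is a minimal Python sketch, not the thesis's SPARQL procedure: it
assumes the sample's triples are already available as
\code{(subject type, predicate, object type)} tuples, and the function name
\code{infer\_ontology} is hypothetical.

\begin{lstlisting}[language=Python,caption={Sketch: collecting the per-sample ontology from typed triples (illustrative only).},captionpos=b]
from collections import defaultdict


def infer_ontology(typed_triples):
    """Collect the distinct (subject type, predicate, object type) patterns."""
    patterns = defaultdict(lambda: defaultdict(set))
    for subj_type, predicate, obj_type in typed_triples:
        patterns[subj_type][predicate].add(obj_type)
    # Sort object types so that the emitted JSON is deterministic.
    return {s: {p: sorted(objs) for p, objs in preds.items()}
            for s, preds in patterns.items()}


# Illustrative input; the duplicate pattern is recorded only once.
typed_triples = [
    ("Fund", "seriesOf", "Trust"),
    ("Fund", "advisedBy", "InvestmentAdviser"),
    ("Fund", "advisedBy", "InvestmentAdviser"),
    ("Trust", "underwrittenBy", "Distributor"),
]
ontology = infer_ontology(typed_triples)
\end{lstlisting}

Sorting the object types keeps the serialized ontology stable across runs, which
matters because the ontology is part of the model input.
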
\begin{figure}[h]
\centering
\begin{tikzpicture}[
node distance=13mm and 30mm,
ent/.style={draw,rounded corners,align=center,font=\footnotesize,
inner sep=3pt,minimum height=7mm,fill=blue!5},
edge/.style={-{Stealth},font=\scriptsize,shorten >=1pt,shorten <=1pt},
lbl/.style={font=\scriptsize,fill=white,inner sep=1.5pt}]
% --- service-provider / structure cluster (centre-left) ---
\node[ent] (fund) {Fund};
\node[ent,above=of fund] (trust) {Trust};
\node[ent,above=of trust] (dist) {Distributor};
\node[ent,left=of fund] (adv) {Investment\\Adviser};
\node[ent,below=16mm of adv] (sub) {Sub-\\Adviser};
\node[ent,dashed,fill=black!5,below=of fund] (cust) {Custodian};
\node[ent,right=24mm of cust] (ta) {Transfer\\Agent};
\node[ent,right=of fund] (admin) {Administrator};
% --- holdings cluster (far right, separated column) ---
\node[ent,right=26mm of admin] (sec) {Security};
\node[ent,above=of sec] (iss) {Issuer};
\node[ent,below=of sec] (ctry) {Country};
\draw[edge] (fund) -- node[lbl]{seriesOf} (trust);
\draw[edge] (trust) -- node[lbl]{underwrittenBy} (dist);
\draw[edge] (fund) -- node[lbl]{advisedBy} (adv);
\draw[edge] (fund) -- node[lbl,pos=0.55]{subAdvisedBy} (sub);
\draw[edge,dashed,gray] (fund) -- node[lbl]{custodian} (cust);
\draw[edge] (fund) -- node[lbl,pos=0.55]{transferAgent} (ta);
\draw[edge] (fund) -- node[lbl]{administrator} (admin);
% holds: arc from the Fund's top, over the Administrator, down to Security
\draw[edge] (fund.north east) to[out=35,in=140]
node[lbl,pos=0.65]{holds} (sec.north west);
\draw[edge] (sec) -- node[lbl]{issuedBy} (iss);
\draw[edge] (sec) -- node[lbl]{domiciledIn} (ctry);
\end{tikzpicture}
\caption{Schematic of the target knowledge graph. Left and centre: the
service-provider/structure graph grounded in the prospectus prose. The dashed
\code{custodian} edge is \emph{dropped} from the dataset (not prose-grounded,
\S\ref{sec:baselines}). Right column (Issuer--Security--Country): the holdings
sub-graph grounded in annual-report commentary with N-PORT gold.}
\label{fig:graph}
\end{figure}
% ====================================================================
\section{The holdings sub-graph}
\label{sec:holdings}
% ====================================================================
Portfolio holdings express the richest relationships in the data --- a fund
\emph{holds} many securities, each \emph{issued by} an issuer
\emph{domiciled in} a country --- and are the natural place to grow the graph
beyond service providers. They require care, however, because holdings are
\emph{not} disclosed in the prospectus: the prospectus describes a fund's
\emph{strategy} (``invests in large-capitalisation equities''), never its
specific positions.
The text-bearing source for holdings is the \textbf{annual or semi-annual report}
(Form N-CSR). It contains two parts:
\begin{itemize}[nosep]
\item the \emph{Schedule of Investments}, a complete table of every holding ---
structured, not prose, and therefore not an extraction target; and
\item the \emph{Management Discussion of Fund Performance} (MDFP), a narrative
in which the portfolio manager names the fund's \emph{top} positions and
explains their contribution (``our largest holdings were Apple,
Microsoft and \dots'').
\end{itemize}
The MDFP is genuine prose and yields real \code{holds} edges for the named
positions. The corresponding \textbf{N-PORT} filing provides the structured gold:
the full holdings table, against which the MDFP-named subset can be verified and
from which \code{issuedBy} and \code{domiciledIn} are taken.
This produces a second, independent text-to-graph task in the same financial
domain: \emph{MDFP commentary $\to$ holdings sub-graph}, with N-PORT as gold.
Because it pairs a \emph{different} document type with a \emph{different} relation
set, including it supports the generalization claim of the thesis
(Section~3.2.3): a single model can then be evaluated on two structurally
different graphs within the same domain. Fund fact sheets and portfolio-manager
commentaries published by fund companies are an additional, off-EDGAR prose
source for the same edges, at the cost of having no standardized
machine-readable gold.
A practical caveat applies to holdings as it does to service providers
(Section~\ref{sec:baselines}): only the positions \emph{named in prose} are
recoverable from the input. The benchmark therefore scopes the \code{holds}
target to the MDFP-named subset rather than the full N-PORT schedule, to avoid
penalising a model for failing to extract facts that are absent from its input.
% ====================================================================
\section{Serialization and the marker token format}
% ====================================================================
Targets are serialized in the grammar-terminal token format introduced in the
thesis (Section~5.2), in which four special tokens delimit triple components and
shared subjects/predicates are factored out, mirroring Turtle's predicate-object
lists:
\begin{lstlisting}[language=,caption={Target serialization for one segmented fund (custodian dropped).},captionpos=b]
<triple_start> John Hancock Bond Fund
<predicate_marker> seriesOf
<object_marker> JOHN HANCOCK SOVEREIGN BOND FUND
<predicate_marker> advisedBy
<object_marker> John Hancock Investment Management LLC
<predicate_marker> subAdvisedBy
<object_marker> Manulife Investment Management (US) LLC
<predicate_marker> administrator
<object_marker> John Hancock Investment Management LLC
<predicate_marker> transferAgent
<object_marker> John Hancock Signature Services, Inc.
<triple_end>
<triple_start> JOHN HANCOCK SOVEREIGN BOND FUND
<predicate_marker> underwrittenBy
<object_marker> JOHN HANCOCK INVESTMENT MANAGEMENT DISTRIBUTORS LLC
<triple_end>
\end{lstlisting}
To support the four-model comparison from a single dataset, each sample carries
\emph{two} serializations of the identical triples: \code{target\_serialized},
the marker form above (for Models 2/4, whose vocabulary is extended with the four
grammar-terminal tokens), and \code{target\_serialized\_plain}, a Turtle-like form
using ordinary `\code{;}' and `\code{,}' delimiters and no special tokens (for
Models 1/3). Because the two differ only in whether the delimiters are dedicated
tokens, the comparison isolates exactly the effect under study in research
question~1.

Each sample is thus a JSON record containing the input prose
(\code{input\_text}), the inferred ontology (\code{ontology}), the target triples
both as a structured list (\code{target\_triples}, each triple carrying a
\code{grounded} flag) and in the two serializations, the trust/series
identifiers, and size statistics (including the grounded-triple count).
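
The factoring of shared subjects and predicates can be sketched in a few lines of
Python. This is an illustration under stated assumptions rather than the
builder's code: triples are taken to be plain \code{(subject, predicate, object)}
string tuples, the function names are hypothetical, the line layout of the marker
form follows the listing above, and the exact layout of the plain form (here
`\code{;}' between predicates, `\code{,}' between objects and a closing
`\code{.}') is assumed.

\begin{lstlisting}[language=Python,caption={Sketch: emitting the marker and plain serializations from one triple list (illustrative only).},captionpos=b]
T_START, P_MARK, O_MARK, T_END = (
    "<triple_start>", "<predicate_marker>", "<object_marker>", "<triple_end>")


def group(triples):
    """Group triples by subject, then predicate, preserving input order."""
    grouped = {}
    for s, p, o in triples:
        grouped.setdefault(s, {}).setdefault(p, []).append(o)
    return grouped


def serialize_marker(triples):
    """Marker form: one block per subject, predicates and objects factored."""
    lines = []
    for s, preds in group(triples).items():
        lines.append(f"{T_START} {s}")
        for p, objs in preds.items():
            lines.append(f"{P_MARK} {p}")
            lines.extend(f"{O_MARK} {o}" for o in objs)
        lines.append(T_END)
    return "\n".join(lines)


def serialize_plain(triples):
    """Plain form. ASSUMPTION: ';' between predicates, ',' between objects."""
    blocks = []
    for s, preds in group(triples).items():
        parts = [f"{p} " + " , ".join(objs) for p, objs in preds.items()]
        blocks.append(f"{s} " + " ; ".join(parts) + " .")
    return "\n".join(blocks)


triples = [
    ("John Hancock Bond Fund", "seriesOf",
     "JOHN HANCOCK SOVEREIGN BOND FUND"),
    ("John Hancock Bond Fund", "advisedBy",
     "John Hancock Investment Management LLC"),
    ("JOHN HANCOCK SOVEREIGN BOND FUND", "underwrittenBy",
     "JOHN HANCOCK INVESTMENT MANAGEMENT DISTRIBUTORS LLC"),
]
marker = serialize_marker(triples)
plain = serialize_plain(triples)
\end{lstlisting}

Both functions share the same grouping step, so the two outputs necessarily
encode the identical triples and differ only in their delimiters.
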
% ====================================================================
\section{Per-fund segmentation}
\label{sec:segmentation}
% ====================================================================
A single prospectus filing covers \emph{many} funds of a trust and may exceed
$10^{7}$ characters, beyond any practical context window. Treating one filing as
one sample is also semantically wrong: the target would mix the subgraphs of
dozens of unrelated funds. The dataset therefore segments each filing into
\emph{per-fund} samples, so that one fund's prospectus section maps to that one
fund's subgraph.
\paragraph{Fetching all books of a trust.} Large fund families split their funds
across \emph{several} prospectus books, so the single most recent filing covers
only a fraction of a trust's funds. The fetcher therefore retrieves the most
recent \emph{full} prospectuses (forms 485BPOS/485APOS) for each trust and
concatenates their text; the much shorter 497/497K supplements are used only as
a fallback. On the proof-of-concept slice this raised the number of fetched books
per trust from one to a mean of seven (for one large ETF trust, from
$5\times10^{5}$ to $2.2\times10^{7}$ characters of text), so that far more fund
sections are present.
\paragraph{Section anchors.} Statutory prospectuses open each fund's block with
the fund name immediately followed by a summary heading. Filers use several
styles, so the segmenter accepts any of: ``Fund Summary'', ``Investment
Objective'', ``Principal Investment Strategies'', the ETF objective sentence
``The Fund seeks\dots'', or a class/ticker header (``Class/Ticker:\dots'').
\paragraph{Boundary selection and a collapse guard.} The segmenter collects
\emph{all} anchored heading positions across the concatenated text, sorts them,
and cuts each segment from one heading to the next. Because a fund name can also
occur in tables of contents and cross-references, na\"ively taking the first
occurrence collapses segments to a few characters; the segmenter therefore
discards any candidate whose resulting segment is shorter than a minimum
($1{,}500$ characters) and, for each fund, keeps the longest surviving segment.
Each segment is paired with that fund's edges plus the fund-anchored
\code{seriesOf} edge and the trust-level \code{underwrittenBy} edge.
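
The boundary selection and the collapse guard can be sketched as follows. This is
a simplified Python illustration, not the segmenter in
\code{build\_rdf\_dataset.py}: the anchor pattern is reduced to the headings
listed above, the look-ahead window \code{LOOKAHEAD} is an assumed parameter, the
function name is hypothetical, and fund names are matched literally (name
variants are ignored here).

\begin{lstlisting}[language=Python,caption={Sketch: anchored boundary selection with a minimum-length collapse guard (illustrative only).},captionpos=b]
import re

MIN_SEGMENT_CHARS = 1500  # minimum segment length stated in the text
LOOKAHEAD = 200           # ASSUMPTION: heading must follow the name this soon
ANCHOR = re.compile(
    r"Fund Summary|Investment Objective|Principal Investment Strategies"
    r"|The Fund seeks|Class/Ticker", re.IGNORECASE)


def segment_per_fund(text, fund_names):
    """Return {fund name: longest surviving segment}."""
    # 1. Collect every position where a fund name is followed by an anchor.
    hits = []
    for name in fund_names:
        for m in re.finditer(re.escape(name), text):
            if ANCHOR.search(text[m.end():m.end() + LOOKAHEAD]):
                hits.append((m.start(), name))
    hits.sort()
    # 2. Cut from each heading to the next one; guard against collapse.
    best = {}
    for i, (start, name) in enumerate(hits):
        end = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        candidate = text[start:end]
        if len(candidate) < MIN_SEGMENT_CHARS:
            continue  # table-of-contents entry or cross-reference
        if len(candidate) > len(best.get(name, "")):
            best[name] = candidate
    return best


# Synthetic example: a table of contents followed by two fund sections.
toc = "Alpha Fund Fund Summary 3\nBeta Fund Fund Summary 9\n"
body = ("Alpha Fund\nFund Summary\n" + "a" * 2000
        + "\nBeta Fund\nFund Summary\n" + "b" * 3000)
segments = segment_per_fund(toc + body, ["Alpha Fund", "Beta Fund"])
\end{lstlisting}

In the synthetic example the two table-of-contents entries are anchored as well,
but the segments they would start are only a few characters long and are
discarded by the guard, so each fund keeps its real section.
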
\paragraph{Name-variant matching.} The fund name filed in N-CEN and the heading
printed in the prospectus frequently differ in the legal-form suffix --- a fund
filed as ``\dots\ Fund'' may be headed ``\dots\ ETF'' or ``\dots\ Portfolio''.
The segmenter matches on a set of normalized variants (suffix swapped or dropped)
rather than the exact N-CEN string.
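
A minimal Python sketch of such a variant set is given below. The suffix list
contains only the three forms named above and is not claimed to be exhaustive;
the normalization (whitespace collapsing only) and the function name are
assumptions made for illustration.

\begin{lstlisting}[language=Python,caption={Sketch: generating suffix-swapped and suffix-dropped name variants (illustrative only).},captionpos=b]
import re

# Legal-form suffixes from the examples in the text (not exhaustive).
SUFFIXES = ("Fund", "ETF", "Portfolio")


def name_variants(ncen_name):
    """Return the name plus its suffix-dropped and suffix-swapped variants."""
    name = re.sub(r"\s+", " ", ncen_name).strip()
    variants = {name}
    for suffix in SUFFIXES:
        if name.lower().endswith(" " + suffix.lower()):
            stem = name[:-len(suffix)].rstrip()
            variants.add(stem)                                 # dropped
            variants.update(f"{stem} {s}" for s in SUFFIXES)   # swapped
            break
    return variants


variants = name_variants("Example  Growth Fund")
\end{lstlisting}

A heading then counts as a match for the fund if any member of the returned set
occurs at an anchored position.
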
\paragraph{Coverage and fallback.} Where a fund's section cannot be located it is
skipped and counted (never silently dropped); where no section in a trust can be
located, the builder emits a single whole-trust fallback sample. On the
proof-of-concept slice, fetching all books and applying the robust segmenter
turns $14$ trusts into $141$ samples ($135$ cleanly segmented per-fund plus $6$
whole-trust fallbacks), with a per-fund median input of $\sim\!3.7\times10^{4}$
characters against a $\sim\!6.5\times10^{2}$-character target --- a per-fund
text-to-JSON ratio with a median near $55\!:\!1$. The residual misses are
dominated not by segmentation but by an \emph{entity-resolution} gap: some trusts
file their prospectuses under a different CIK (or fund brand) than their N-CEN
report, so the N-CEN fund names do not appear in the fetched text at all. Closing
that gap requires joining on the SEC Series identifier across CIKs rather than
fetching more filings of the same CIK, and is left to the full-scale build.
% ====================================================================
\section{Ground truth and baselines}
\label{sec:baselines}
% ====================================================================
The dataset offers two independent ground-truth regimes.
\paragraph{Model-free gold.} For \code{advisedBy}, \code{subAdvisedBy},
\code{transferAgent}, \code{administrator} and \code{underwrittenBy}, the labels
come directly from N-CEN; for \code{seriesOf}, from the Series/Class reference
data; for the (planned) \code{holds}, \code{issuedBy} and \code{domiciledIn}, from
N-PORT. No model is involved in producing these labels, which makes them an
unusually trustworthy reference for a generative-extraction benchmark.
\paragraph{Why the custodian relation is dropped.} A text-to-triple target is only
useful if the fact can be found in the input text. The custodian relation fails
this test and is therefore \emph{excluded} from the dataset
(\code{--custodian-scope none}). N-CEN reports, for a global fund, the entire chain
of \emph{foreign sub-custodians} --- one bank per market it invests in (Citibank
Brazil, Banco de Chile, Cititrust Colombia, \dots). These names appear only in the
structured N-CEN table and in \emph{no} prose document: the summary prospectus
refers to the custodian only generically (``\dots\ including the adviser, the
custodian, and the transfer agent\dots'') and never names it. Even the
\emph{primary} custodian is typically named only in the separately filed Statement
of Additional Information (N-1A Part~B), which is not part of the fetched input.
Measured per fund, the custodian object name occurs in the fund's own prospectus
segment only $28\%$ of the time --- by far the weakest of all relations --- so
keeping it would systematically ask the model to extract facts absent from its
input. The full custodian chain remains available in N-CEN as a structured-only
relation, outside the text-to-triples task; recovering it from text would require
adding the SAI as an input source (a separate crawl).
\paragraph{Per-triple grounding flag.} Because even the retained relations are not
\emph{always} named in a fund's segment, every triple carries a boolean
\code{grounded} flag: true iff the normalized object name occurs in that sample's
input text. This lets training and evaluation restrict to grounded triples rather
than silently carrying unextractable targets. Across the full build, $80\%$ of
triples are grounded; per relation the rate ranges from $93\%$ (\code{advisedBy})
down to $62\%$ (\code{underwrittenBy}), as shown in Table~\ref{tab:baseline}.
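
The flag can be computed with a simple normalized substring test. The Python
sketch below is an illustration, not the builder's implementation: the
normalization rule (lower-casing, punctuation stripped, whitespace collapsed),
the triple key \code{object} and the function names are assumptions, while
\code{input\_text}, \code{target\_triples} and \code{grounded} follow the record
layout described earlier.

\begin{lstlisting}[language=Python,caption={Sketch: annotating each triple with a grounded flag (illustrative only).},captionpos=b]
import re


def normalize(s):
    """ASSUMPTION: lower-case, drop punctuation, collapse whitespace."""
    s = re.sub(r"[^a-z0-9 ]+", " ", s.lower())
    return re.sub(r"\s+", " ", s).strip()


def flag_grounded(sample):
    """Set triple['grounded'] in place; return the grounded-triple count."""
    text = normalize(sample["input_text"])
    for triple in sample["target_triples"]:
        triple["grounded"] = normalize(triple["object"]) in text
    return sum(t["grounded"] for t in sample["target_triples"])


sample = {
    "input_text": "The Fund is advised by Geode Capital Management, LLC. "
                  "Shares are sold through the distributor.",
    "target_triples": [
        {"predicate": "advisedBy",
         "object": "Geode Capital Management, LLC"},
        {"predicate": "transferAgent",
         "object": "Example Transfer Services, Inc."},  # hypothetical name
    ],
}
n_grounded = flag_grounded(sample)
\end{lstlisting}

A plain substring test ignores word boundaries, so very short object names could
match spuriously; a production version would likely add a boundary check.
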
\paragraph{No-model lower bound.} The grounding flag is itself a trivial,
model-free baseline (emit a gold edge iff its object name occurs in the prose).
Its per-relation rate is a conservative lower bound on the share of triples
recoverable from the text, and hence on attainable recall: it requires a literal
substring match of the normalized object name within the fund's own segment and
so misses surface variants (``State Street Bank and Trust Company'' vs.\ ``State
Street'') that a trained model can resolve. Table~\ref{tab:baseline} reports it
on the full build.
\begin{table}[h]
\centering
\caption{Per-relation prose-grounding on the full 2025\,Q3 build ($852$ samples,
$8{,}824$ triples; custodian dropped). ``Grounded'' = object name present in the
sample's input; a strict, model-free lower bound on recall.}
\label{tab:baseline}
\small
\begin{tabular}{@{}lrr@{}}
\toprule
Relation & Triples & Grounded \\
\midrule
\code{advisedBy} & 1{,}673 & 93\% \\
\code{seriesOf} & 1{,}555 & 84\% \\
\code{subAdvisedBy} & 946 & 84\% \\
\code{administrator} & 2{,}066 & 80\% \\
\code{transferAgent} & 1{,}721 & 72\% \\
\code{underwrittenBy} & 863 & 62\% \\
\midrule
all & 8{,}824 & 80\% \\
\bottomrule
\end{tabular}
\end{table}
\paragraph{Strong-model silver.} For relations that are expressed only in prose
and lack a clean structured source --- portfolio managers (\code{managedBy}),
the named benchmark index (\code{tracksIndex}), and MDFP-named holdings --- a
strong reference model (e.g.\ GPT-4 or Claude Opus) produces silver labels.
Because the structured relations have model-free gold, the silver-labelling model
can itself be \emph{measured} on the overlapping gold edges, so its reliability is
quantified rather than assumed.
% ====================================================================
\section{Corpus statistics}
% ====================================================================
Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. With
the custodian relation dropped, the N-CEN gold graph holds $12{,}694$
entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all full
prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$
closed-end or interval funds file no standard prospectus) and applying the robust
per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus
$193$ whole-trust fallbacks) containing $8{,}824$ target triples, of which $80\%$
are prose-grounded. The segmented samples have a per-fund median ratio near
$117\!:\!1$ (input prose to target serialization), and across all samples the
median exceeds $400\!:\!1$, the opposite regime to the symmetric-size benchmarks.
\begin{table}[h]
\centering
\caption{Corpus statistics for the full 2025\,Q3 build (custodian dropped). Left:
N-CEN gold graph. Right: text-to-triple samples (all prospectus books per trust,
per-fund segmentation).}
\label{tab:stats}
\small
\begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
\toprule
\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{Samples (2025\,Q3)} \\
\cmidrule(r){1-2}\cmidrule(l){3-4}
Trust graphs & 435 & Samples (total) & 852 \\
Funds (series) & 2{,}421 & \;segmented per-fund & 659 \\
Entity-entity edges & 12{,}694 & \;whole-trust fallback & 193 \\
\;administrator & 3{,}288 & Target triples & 8{,}824 \\
\;advisedBy & 2{,}588 & \;grounded & 80\% \\
Distributors & 458 & Ratio (median, per-fund) & $117\!:\!1$ \\
\bottomrule
\end{tabular}
\end{table}
\paragraph{Train/validation/test split.} The samples are partitioned at the trust
level by a deterministic hash of the CIK: $655$ train, $122$ validation and $75$
test samples (from $268$, $37$ and $36$ trusts respectively), with \emph{no} trust
appearing in more than one split. Adding further quarters and dropping ontology
subsets (per the thesis's augmentation strategy) would expand the corpus further.
% ====================================================================
\section{Use in the thesis experiments}
% ====================================================================
Each sample is a triple $(x,\,\sigma,\,y)$ where $x$ is the prospectus prose,
$\sigma$ is the inferred ontology, and $y$ is the serialized triple graph (in
marker or plain form, depending on the model). The model is trained to compute
$y = f_{\theta}(x,\sigma)$. This dataset feeds the four-model comparison of the
thesis directly:
\begin{itemize}[nosep]
\item \textbf{Models 1/3} (decoder-only / encoder-decoder, no extra tokens):
trained on the plain serialization \code{target\_serialized\_plain}.
\item \textbf{Models 2/4} (with grammar-terminal tokens): trained on the marker
serialization \code{target\_serialized}, with the four markers
\tstart, \predm, \objm, \tend{} added to the vocabulary as single tokens,
testing research question~1 (do dedicated terminal tokens reduce loss and
raise $F_1$).
\end{itemize}
\paragraph{Splits.} The dataset is partitioned into train/validation/test at the
\emph{trust} level (nominally $80/10/10$), assigned by a deterministic hash of the
trust CIK. Splitting by trust rather than by fund prevents leakage: funds of one
trust share advisers, distributors and other service providers, so a fund-level
split would let the model memorise trust-specific entities seen in training and
inflate test scores.
The builder verifies that no trust appears in more than one split.
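
A hash-based assignment of this kind can be sketched in Python as follows. The
choice of SHA-256, the zero-padding of the CIK to ten digits, the thresholds and
the record keys \code{cik} and \code{split} are assumptions made for
illustration; the note only states that the hash is deterministic and keyed on
the trust CIK.

\begin{lstlisting}[language=Python,caption={Sketch: deterministic trust-level split assignment and leakage check (illustrative only).},captionpos=b]
import hashlib


def split_of(cik, train=0.8, val=0.1):
    """Map a trust CIK to a split via a stable hash (not the salted hash())."""
    key = str(cik).zfill(10).encode("ascii")
    digest = hashlib.sha256(key).hexdigest()
    u = int(digest[:8], 16) / 2**32  # pseudo-uniform value in [0, 1)
    if u < train:
        return "train"
    return "validation" if u < train + val else "test"


def splits_are_disjoint(samples):
    """True iff every CIK occurs in exactly one split."""
    seen = {}
    for s in samples:
        seen.setdefault(s["cik"], set()).add(s["split"])
    return all(len(splits) == 1 for splits in seen.values())


# Hypothetical CIKs; all samples of one trust receive the same split.
samples = [{"cik": cik, "split": split_of(cik)}
           for cik in ("0000000101", "0000000101", "0000000202")]
ok = splits_are_disjoint(samples)
\end{lstlisting}

Because the split is a pure function of the CIK, rebuilding the dataset or adding
further quarters leaves every trust in the split it had before.
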
Because the dataset's input is far longer than its output and its target is a
relational graph, it stresses precisely the capabilities the thesis cares about:
long-context reading comprehension and faithful generation of entity-to-entity
structure. Evaluation uses triple-level precision, recall and $F_1$ against the
model-free gold, matched on \code{(subject type, predicate, normalized object
label)} so that IRI-slug differences do not create spurious errors. The same
metric scores the strong-model silver baseline, giving a like-for-like comparison
between the finetuned models and a state-of-the-art prompted extractor.
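
The metric can be sketched as a set comparison over normalized match keys. The
Python code below is a minimal illustration, not \code{score\_baseline.py}: the
label normalization, the dictionary keys and the function names are assumptions,
and both inputs are treated as sets, so duplicate predictions count once.

\begin{lstlisting}[language=Python,caption={Sketch: triple-level precision, recall and F1 on normalized match keys (illustrative only).},captionpos=b]
import re


def norm(label):
    """ASSUMPTION: lower-case and reduce punctuation runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def match_key(triple):
    return (triple["subject_type"], triple["predicate"],
            norm(triple["object"]))


def prf1(predicted, gold):
    """Return (precision, recall, F1) over the sets of match keys."""
    p = {match_key(t) for t in predicted}
    g = {match_key(t) for t in gold}
    tp = len(p & g)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = [
    {"subject_type": "Fund", "predicate": "advisedBy",
     "object": "Geode Capital Management, LLC"},
    {"subject_type": "Fund", "predicate": "transferAgent",
     "object": "Example Services Inc."},      # hypothetical name
]
predicted = [
    {"subject_type": "Fund", "predicate": "advisedBy",
     "object": "GEODE CAPITAL MANAGEMENT LLC"},
    {"subject_type": "Fund", "predicate": "administrator",
     "object": "Example Admin Co."},          # hypothetical name
]
precision, recall, f1 = prf1(predicted, gold)
\end{lstlisting}

In the example the adviser edge matches despite differences in case and
punctuation, giving one true positive out of two predicted and two gold triples.
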
% ====================================================================
\section{Reproducibility}
% ====================================================================
The dataset is built by two scripts accompanying this note.
\code{build\_rdf\_dataset.py} has three stages: \code{gold} parses the local
N-CEN flat files into per-trust gold graphs (with \code{--custodian-scope} to
choose primary-only, all, or no custodian edges); \code{fetch} downloads all
recent full prospectus books per trust from EDGAR (rate-limited, \code{gzip}-aware,
\code{--max-filings} per trust) and concatenates them; \code{samples} segments the
prose per fund and joins it with the gold into the $(x,\sigma,y)$ records described
above. \code{score\_baseline.py} computes the no-model string-match baseline and
scores any strong-model predictions against the gold. All inputs are public SEC
filings; no licensing restriction applies to the derived dataset.
\end{document}