\documentclass[11pt,a4paper]{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[margin=2.5cm]{geometry}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{tikz}
\usetikzlibrary{positioning,arrows.meta,shapes.geometric}
\usepackage[hidelinks]{hyperref}
\usepackage{caption}
\usepackage{enumitem}

% ---- listing style (matches the Turtle/JSON look of the thesis) ----
\definecolor{kw}{rgb}{0.0,0.3,0.6}
\definecolor{str}{rgb}{0.6,0.2,0.1}
\definecolor{cmt}{rgb}{0.3,0.5,0.3}
\lstset{
  basicstyle=\ttfamily\footnotesize,
  breaklines=true,
  showstringspaces=false,
  keywordstyle=\color{kw}\bfseries,
  stringstyle=\color{str},
  commentstyle=\color{cmt}\itshape,
  frame=single,
  framesep=4pt,
  rulecolor=\color{black!25},
  backgroundcolor=\color{black!2},
  numbers=left,
  numberstyle=\tiny\color{black!40},
  xleftmargin=2.2em,
}

\newcommand{\code}[1]{\texttt{\small #1}}
\newcommand{\tstart}{\code{<triple\_start>}}
\newcommand{\predm}{\code{<predicate\_marker>}}
\newcommand{\objm}{\code{<object\_marker>}}
\newcommand{\tend}{\code{<triple\_end>}}

\title{\textbf{A Relationship-Rich Financial Dataset for\\
Text-to-RDF-Triple Extraction:\\
SEC Fund Disclosures as a Knowledge-Graph Source}}
\author{Dr.\ Florian Herzog\\\small Thesis supervisor --- companion technical note to the thesis
\emph{Magical RDF Triples and how to synthetize them}}
\date{\today}

\begin{document}
\maketitle

\begin{abstract}
This note specifies a finance-domain dataset for training and evaluating models
that extract Resource Description Framework (RDF) triples from plain text, the
task at the centre of the accompanying thesis. The dataset is constructed from
mandatory U.S.\ Securities and Exchange Commission (SEC) fund disclosures. Unlike
the Wikidata-derived corpora commonly used for this task --- where the source text
is written \emph{from} the triples and is therefore roughly the same size as its
target --- here a long natural-language prospectus (on the order of
$10^{5}$--$10^{7}$ characters) maps to a compact graph of a few hundred triples,
yielding a realistic text-to-output size ratio of $20\!:\!1$ or more (a per-fund
median near $117\!:\!1$ in the full build). Crucially,
the target is a genuine graph of \emph{entity-to-entity relationships} (a fund
\emph{advised by} a management company, \emph{distributed by} an underwriter,
\emph{holding} a security \emph{issued by} an issuer), not a flat list of literal
attributes. Two distinct ground-truth regimes are available: a \emph{model-free
gold} baseline derived from parallel structured SEC filings (N-CEN, N-PORT,
Series/Class reference data), and a \emph{strong-model silver} baseline for the
relations expressed only in prose. We describe the source filings, the target
ontology and graph structure, the holdings sub-graph, the serialization into the
thesis's grammar-terminal token format, and how the resulting samples are used to
train and benchmark the four models under study.
\end{abstract}

\tableofcontents

% ====================================================================
\section{Motivation: the size-ratio and relationship gap}
% ====================================================================

The thesis trains a general-purpose language model to extract serialized RDF
triples from plain text conditioned on an ontology. The quality of such a model
is bounded by the quality of its training data. The benchmarks surveyed in the
thesis (WebNLG, T-REx, REBEL, Wiki-NRE) share two properties that make them weak
proxies for the real extraction problem.

\paragraph{Symmetric size.} In WebNLG, human annotators were instructed to write
text \emph{from} a given set of triples. Consequently each sentence encodes
almost exactly the triples it was generated from: the input text and the target
JSON are of comparable length. The task degenerates towards transliteration and
never exercises the central difficulty of practical information extraction ---
locating a small number of facts inside a large, noisy document.

\paragraph{Attribute-only targets.} Many relation-extraction corpora reduce to
mapping a sentence to a single predicate label, or to a star of literal-valued
attributes around one entity. They contain few \emph{entity-to-entity} edges, and
therefore exercise little of the graph structure that motivates RDF in the first
place.

A suitable dataset must instead satisfy both of the following, simultaneously:
\begin{enumerate}[label=(\roman*)]
  \item the input text is substantially larger than the target serialization, so
        the model must perform genuine reading comprehension over a long document;
  \item the target is a multi-entity-type graph of relationships, so the inferred
        ontology contains edges of the form
        \code{TypeA\ ---predicate--->\ TypeB}, not only
        \code{TypeA\ ---predicate--->\ literal}.
\end{enumerate}

SEC fund disclosures satisfy both, and additionally provide a rare third
property: a \emph{free, non-model ground truth}, because the same facts that
appear in the prose are independently filed by the same registrants in structured
form.

% ====================================================================
\section{Source filings}
% ====================================================================

The dataset draws on five public SEC data sources, summarised in
Table~\ref{tab:sources}. Their division of labour is the key design idea: the
\emph{prose} filings provide the model input, while the \emph{structured} filings
provide the ground-truth graph.

\begin{table}[h]
\centering
\caption{SEC source filings and their role in the dataset.}
\label{tab:sources}
\small
\begin{tabular}{@{}llll@{}}
\toprule
Source & Form & Content & Role \\
\midrule
Prospectus & N-1A (485BPOS, 497) & Investment objective, strategy, & \textbf{input text} \\
 & & management, fees (prose) & \\
N-CEN & N-CEN & Service providers, classification & \textbf{gold edges} \\
N-PORT & NPORT-P & Portfolio holdings (quarterly) & \textbf{gold edges} \\
Series/Class CSV & --- & Trust/Series/Class identity & \textbf{gold skeleton} \\
Annual report (MDFP) & N-CSR & Top-holdings commentary (prose) & \textbf{input text (holdings)} \\
\bottomrule
\end{tabular}
\end{table}

\paragraph{Prospectus (N-1A).} The statutory prospectus is a long legal document
describing a fund family. It names, in prose, the fund's investment adviser,
sub-adviser, distributor, transfer agent, portfolio managers and benchmark index,
together with its objective, strategy and fee structure. A single filing covers
all funds (series) of a trust and ranges from roughly $4\times10^{5}$ to
$1\times10^{7}$ characters of extracted text.

\paragraph{N-CEN.} The annual census filing reports, in structured tabular form,
each fund's service providers --- adviser, sub-adviser, custodian, transfer
agent, administrator --- and the trust's principal underwriter, each with a Legal
Entity Identifier (LEI) where available. These rows are the gold standard for the
service-provider edges of the graph.

\paragraph{N-PORT.} The monthly portfolio filing reports, per fund, every
security held, with issuer name, identifiers (CUSIP, ISIN, LEI), asset category,
investment country and market value. These rows are the gold standard for the
holdings sub-graph (Section~\ref{sec:holdings}).

\paragraph{Series/Class reference data.} The SEC's Series/Class listing provides
the trust\,$\to$\,series\,$\to$\,share-class identity backbone, gold for the
structural \code{seriesOf} and \code{hasShareClass} edges.

A central property is \emph{redundancy across modality}: a fact such as ``the
fund is advised by Geode Capital Management, LLC'' appears both as a sentence in
the prospectus (the input) and as a structured row in N-CEN (the label). This is
what makes a model-free ground truth possible.

% ====================================================================
\section{Target ontology and graph structure}
% ====================================================================

The target of each sample is a directed, labelled multigraph $G=(E,R)$ in the
sense of the thesis, where nodes are typed entities and edges are RDF predicates.
The entity types and relations are listed in Table~\ref{tab:ontology}.

\begin{table}[h]
\centering
\caption{Target ontology: entity types and entity-to-entity relations.}
\label{tab:ontology}
\small
\begin{tabular}{@{}lll@{}}
\toprule
Subject type & Predicate & Object type \\
\midrule
Fund & \code{seriesOf} & Trust \\
Fund & \code{advisedBy} & InvestmentAdviser \\
Fund & \code{subAdvisedBy} & SubAdviser \\
Fund & \code{transferAgent} & TransferAgent \\
Fund & \code{administrator} & Administrator \\
Trust & \code{underwrittenBy} & Distributor \\
\addlinespace
\multicolumn{3}{@{}l}{\emph{dropped (not prose-grounded, see \S\ref{sec:baselines}):}}\\
Fund & \code{custodian} & Custodian \\
\addlinespace
\multicolumn{3}{@{}l}{\emph{holdings sub-graph (planned 2nd track):}}\\
Fund & \code{holds} & Security \\
Security & \code{issuedBy} & Issuer \\
Security & \code{domiciledIn} & Country \\
Fund & \code{tracksIndex} & Index \\
\bottomrule
\end{tabular}
\end{table}

Every relation in Table~\ref{tab:ontology} has an entity as its object, not a
literal. The dataset may optionally be enriched with literal-valued attribute
triples (management fee, net expense ratio, returns, portfolio turnover) drawn
from the XBRL Risk/Return filings; these are deliberately \emph{secondary},
because the purpose of the dataset is to exercise relational structure.

Following the thesis's ontology-inference procedure
(SPARQL meta-schema extraction), the per-sample ontology presented to the model
is the set of distinct \code{(subject type, predicate, object type)} patterns
realised in that sample, e.g.

\noindent\begin{minipage}{\linewidth}
\begin{lstlisting}[language=,caption={Inferred ontology for one fund trust (model input, abbreviated).},captionpos=b]
{
  "Fund": {
    "seriesOf": ["Trust"],
    "advisedBy": ["InvestmentAdviser"],
    "subAdvisedBy": ["SubAdviser"],
    "transferAgent": ["TransferAgent"],
    "administrator": ["Administrator"]
  },
  "Trust": { "underwrittenBy": ["Distributor"] }
}
\end{lstlisting}
\end{minipage}
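
As a minimal sketch of this step, the Python function below collects the distinct
patterns from a list of typed triples and emits a nested mapping of the same shape
as the ontology listing above. It is an illustration only: the thesis derives the
ontology by SPARQL meta-schema extraction, and the function name
\code{infer\_ontology}, the five-field tuple layout and the entity names are
assumptions made for this example.

\begin{lstlisting}[language=Python,caption={Illustrative sketch (assumed data layout): inferring the per-sample ontology.},captionpos=b]
from collections import defaultdict

def infer_ontology(typed_triples):
    """Collect distinct (subject type, predicate, object type) patterns."""
    patterns = defaultdict(lambda: defaultdict(set))
    for s_type, _subject, predicate, o_type, _obj in typed_triples:
        patterns[s_type][predicate].add(o_type)
    # Sort object types so that the output is deterministic.
    return {s: {p: sorted(o) for p, o in preds.items()}
            for s, preds in patterns.items()}

# Hypothetical entities, for illustration only.
triples = [
    ("Fund", "Bond Fund", "advisedBy", "InvestmentAdviser", "Adviser LLC"),
    ("Fund", "Bond Fund", "seriesOf", "Trust", "Bond Trust"),
    ("Fund", "Equity Fund", "advisedBy", "InvestmentAdviser", "Adviser LLC"),
    ("Trust", "Bond Trust", "underwrittenBy", "Distributor", "Distributors LLC"),
]
ontology = infer_ontology(triples)
\end{lstlisting}

Two funds sharing the same adviser type contribute a single
\code{Fund}/\code{advisedBy} pattern, which is why the ontology stays small even
for trusts with many funds.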

\begin{figure}[h]
\centering
\begin{tikzpicture}[
  node distance=13mm and 30mm,
  ent/.style={draw,rounded corners,align=center,font=\footnotesize,
              inner sep=3pt,minimum height=7mm,fill=blue!5},
  edge/.style={-{Stealth},font=\scriptsize,shorten >=1pt,shorten <=1pt},
  lbl/.style={font=\scriptsize,fill=white,inner sep=1.5pt}]
% --- service-provider / structure cluster (centre-left) ---
\node[ent] (fund) {Fund};
\node[ent,above=of fund] (trust) {Trust};
\node[ent,above=of trust] (dist) {Distributor};
\node[ent,left=of fund] (adv) {Investment\\Adviser};
\node[ent,below=16mm of adv] (sub) {Sub-\\Adviser};
\node[ent,dashed,fill=black!5,below=of fund] (cust) {Custodian};
\node[ent,right=24mm of cust] (ta) {Transfer\\Agent};
\node[ent,right=of fund] (admin) {Administrator};
% --- holdings cluster (far right, separated column) ---
\node[ent,right=26mm of admin] (sec) {Security};
\node[ent,above=of sec] (iss) {Issuer};
\node[ent,below=of sec] (ctry) {Country};

\draw[edge] (fund) -- node[lbl]{seriesOf} (trust);
\draw[edge] (trust) -- node[lbl]{underwrittenBy} (dist);
\draw[edge] (fund) -- node[lbl]{advisedBy} (adv);
\draw[edge] (fund) -- node[lbl,pos=0.55]{subAdvisedBy} (sub);
\draw[edge,dashed,gray] (fund) -- node[lbl]{custodian} (cust);
\draw[edge] (fund) -- node[lbl,pos=0.55]{transferAgent} (ta);
\draw[edge] (fund) -- node[lbl]{administrator} (admin);
% holds: arc from the Fund's top, over the Administrator, down to Security
\draw[edge] (fund.north east) to[out=35,in=140]
  node[lbl,pos=0.65]{holds} (sec.north west);
\draw[edge] (sec) -- node[lbl]{issuedBy} (iss);
\draw[edge] (sec) -- node[lbl]{domiciledIn} (ctry);
\end{tikzpicture}
\caption{Schematic of the target knowledge graph. Left and centre: the
service-provider/structure graph grounded in the prospectus prose. The dashed
\code{custodian} edge is \emph{dropped} from the dataset (not prose-grounded,
\S\ref{sec:baselines}). Right column (Issuer--Security--Country): the holdings
sub-graph grounded in annual-report commentary with N-PORT gold.}
\label{fig:graph}
\end{figure}

% ====================================================================
\section{The holdings sub-graph}
\label{sec:holdings}
% ====================================================================

Portfolio holdings express the richest relationships in the data --- a fund
\emph{holds} many securities, each \emph{issued by} an issuer and
\emph{domiciled in} a country --- and are the natural place to grow the graph
beyond service providers. They require care, however, because holdings are
\emph{not} disclosed in the prospectus: the prospectus describes a fund's
\emph{strategy} (``invests in large-capitalisation equities''), never its
specific positions.

The text-bearing source for holdings is the \textbf{annual or semi-annual report}
(Form N-CSR). It contains two parts:
\begin{itemize}[nosep]
  \item the \emph{Schedule of Investments}, a complete table of every holding ---
        structured, not prose, and therefore not an extraction target; and
  \item the \emph{Management Discussion of Fund Performance} (MDFP), a narrative
        in which the portfolio manager names the fund's \emph{top} positions and
        explains their contribution (``our largest holdings were Apple,
        Microsoft and \dots'').
\end{itemize}
The MDFP is genuine prose and yields real \code{holds} edges for the named
positions. The corresponding \textbf{N-PORT} filing provides the structured gold:
the full holdings table, against which the MDFP-named subset can be verified and
from which \code{issuedBy} and \code{domiciledIn} are taken.

This produces a second, independent text-to-graph task in the same financial
domain: \emph{MDFP commentary $\to$ holdings sub-graph}, with N-PORT as gold.
Because it pairs a \emph{different} document type with a \emph{different} relation
set, including it strengthens the cross-domain generalization claim of the thesis
(Section~3.2.3): a single model is shown to extract two structurally different
graphs in the same domain. Fund fact sheets and portfolio-manager commentaries
published by fund companies are an additional, off-EDGAR prose source for the same
edges, at the cost of having no standardized machine-readable gold.

A practical caveat applies to holdings as it does to service providers
(Section~\ref{sec:baselines}): only the positions \emph{named in prose} are
recoverable from the input. The benchmark therefore scopes the \code{holds}
target to the MDFP-named subset rather than the full N-PORT schedule, to avoid
penalising a model for failing to extract facts that are absent from its input.

% ====================================================================
\section{Serialization and the marker token format}
% ====================================================================

Targets are serialized in the grammar-terminal token format introduced in the
thesis (Section~5.2), in which four special tokens delimit triple components and
shared subjects/predicates are factored out, mirroring Turtle's predicate-object
lists:

\begin{lstlisting}[language=,caption={Target serialization for one segmented fund (custodian dropped).},captionpos=b]
<triple_start> John Hancock Bond Fund
<predicate_marker> seriesOf
<object_marker> JOHN HANCOCK SOVEREIGN BOND FUND
<predicate_marker> advisedBy
<object_marker> John Hancock Investment Management LLC
<predicate_marker> subAdvisedBy
<object_marker> Manulife Investment Management (US) LLC
<predicate_marker> administrator
<object_marker> John Hancock Investment Management LLC
<predicate_marker> transferAgent
<object_marker> John Hancock Signature Services, Inc.
<triple_end>
<triple_start> JOHN HANCOCK SOVEREIGN BOND FUND
<predicate_marker> underwrittenBy
<object_marker> JOHN HANCOCK INVESTMENT MANAGEMENT DISTRIBUTORS LLC
<triple_end>
\end{lstlisting}

To support the four-model comparison from a single dataset, each sample carries
\emph{two} serializations of the identical triples: \code{target\_serialized},
the marker form above (for Models 2/4, whose vocabulary is extended with the four
grammar-terminal tokens), and \code{target\_serialized\_plain}, a Turtle-like form
using ordinary `\code{;}' and `\code{,}' delimiters and no special tokens (for
Models 1/3). Because the two differ only in whether the delimiters are dedicated
tokens, the comparison isolates exactly the effect under study in research
question~1.
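
The two serializations can be sketched in Python as follows. This is a hedged
illustration rather than the builder's code: the function names, the
one-marker-per-line layout and the exact whitespace of the plain form are
assumptions, and only the four marker tokens and the `\code{;}'/`\code{,}'
delimiters are taken from the description above.

\begin{lstlisting}[language=Python,caption={Illustrative sketch (assumed layout): marker and plain serialization of the same triples.},captionpos=b]
def group_triples(triples):
    """Group (subject, predicate, object) triples by subject, then predicate."""
    grouped = {}  # dicts keep insertion order (Python 3.7+)
    for s, p, o in triples:
        grouped.setdefault(s, {}).setdefault(p, []).append(o)
    return grouped

def serialize_marker(triples):
    """Marker form: subject and predicate are emitted once per group."""
    lines = []
    for s, preds in group_triples(triples).items():
        lines.append(f"<triple_start> {s}")
        for p, objs in preds.items():
            lines.append(f"<predicate_marker> {p}")
            lines.extend(f"<object_marker> {o}" for o in objs)
        lines.append("<triple_end>")
    return "\n".join(lines)

def serialize_plain(triples):
    """Turtle-like form: ';' between predicates, ',' between objects."""
    blocks = []
    for s, preds in group_triples(triples).items():
        body = " ;\n".join(f"  {p} {' , '.join(objs)}"
                           for p, objs in preds.items())
        blocks.append(f"{s}\n{body} .")
    return "\n".join(blocks)

# Hypothetical entities, for illustration only.
triples = [
    ("Fund A", "seriesOf", "Trust T"),
    ("Fund A", "advisedBy", "Adviser X"),
    ("Trust T", "underwrittenBy", "Distributor D"),
]
marker = serialize_marker(triples)
plain = serialize_plain(triples)
\end{lstlisting}

Both functions share \code{group\_triples}, so the two outputs encode the same
triples in the same order and differ only in their delimiters. Object names that
themselves contain a comma would need quoting or escaping in the plain form; how
the dataset handles that case is not specified here.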

Each sample is thus a JSON record containing the input prose
(\code{input\_text}), the inferred ontology (\code{ontology}), the target triples
both as a structured list (\code{target\_triples}, each triple carrying a
\code{grounded} flag) and in the two serializations, the trust and series
identifiers, and size statistics (including the grounded-triple count).

% ====================================================================
\section{Per-fund segmentation}
\label{sec:segmentation}
% ====================================================================

A single prospectus filing covers \emph{all} funds of a trust and may exceed
$10^{7}$ characters, beyond any practical context window. Treating one filing as
one sample is also semantically wrong: the target would mix the subgraphs of
dozens of unrelated funds. The dataset therefore segments each filing into
\emph{per-fund} samples, so that one fund's prospectus section maps to that one
fund's subgraph.

\paragraph{Fetching all books of a trust.} Large fund families split their funds
across \emph{several} prospectus books, so the single most recent filing covers
only a fraction of a trust's funds. The fetcher therefore retrieves the most
recent \emph{full} prospectuses (forms 485BPOS/485APOS) for each trust ---
preferring them over the much shorter 497/497K supplements, which are used only as
a fallback --- and concatenates their text. On the proof-of-concept slice this
raised the number of fetched books from one per trust to a mean of seven (for a
large ETF trust, from $5\times10^{5}$ to $2.2\times10^{7}$ characters of text), so
that far more fund sections are present.

\paragraph{Section anchors.} Statutory prospectuses open each fund's block with
the fund name immediately followed by a summary heading. Filers use several
styles, so the segmenter accepts any of: ``Fund Summary'', ``Investment
Objective'', ``Principal Investment Strategies'', the ETF objective sentence
``The Fund seeks\dots'', or a class/ticker header (``Class/Ticker:\dots'').

\paragraph{Boundary selection and a collapse guard.} The segmenter collects
\emph{all} anchored heading positions across the concatenated text, sorts them,
and cuts each segment from one heading to the next. Because a fund name can also
occur in tables of contents and cross-references, na\"ively taking the first
occurrence collapses segments to a few characters; the segmenter therefore
discards any candidate whose resulting segment is shorter than a minimum
($1{,}500$ characters) and, for each fund, keeps the longest surviving segment.
Each segment is paired with that fund's edges plus the fund-anchored
\code{seriesOf} edge and the trust-level \code{underwrittenBy} edge.
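
The boundary rule and the collapse guard can be sketched as follows. The anchor
list is assumed to have been produced beforehand by the heading matcher; the
function and variable names are hypothetical, and only the $1{,}500$-character
minimum is taken from the text.

\begin{lstlisting}[language=Python,caption={Illustrative sketch (hypothetical names): boundary selection with a collapse guard.},captionpos=b]
MIN_SEGMENT_CHARS = 1500  # minimum segment length stated in the text

def segment_funds(text, anchors):
    """anchors: list of (position, fund_name), one per anchored heading.

    Returns {fund_name: (start, end)}, keeping for each fund the longest
    segment that survives the minimum-length guard.
    """
    ordered = sorted(anchors)
    best = {}
    for i, (start, fund) in enumerate(ordered):
        # A segment runs from this heading to the next one (or end of text).
        end = ordered[i + 1][0] if i + 1 < len(ordered) else len(text)
        length = end - start
        if length < MIN_SEGMENT_CHARS:
            continue  # collapse guard: table-of-contents hits, cross-references
        if fund not in best or length > best[fund][1] - best[fund][0]:
            best[fund] = (start, end)
    return best

# Toy input: the first "Fund A" hit is a 50-character table-of-contents entry.
text = "x" * 10000
anchors = [(100, "Fund A"), (150, "Fund B"), (2000, "Fund A"), (6000, "Fund B")]
segments = segment_funds(text, anchors)
\end{lstlisting}

In the toy input the early \code{Fund A} hit is discarded by the guard, and
\code{Fund B} keeps its longer second segment rather than its first occurrence.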

\paragraph{Name-variant matching.} The fund name filed in N-CEN and the heading
printed in the prospectus frequently differ in the legal-form suffix --- a fund
filed as ``\dots\ Fund'' may be headed ``\dots\ ETF'' or ``\dots\ Portfolio''.
The segmenter matches on a set of normalized variants (suffix swapped or dropped)
rather than the exact N-CEN string.
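
A minimal sketch of such variant matching is given below. The suffix list is
limited to the three forms named above, and the normalization (whitespace
collapsing and lower-casing) as well as the function names are assumptions; the
actual segmenter may use a longer suffix list or different rules.

\begin{lstlisting}[language=Python,caption={Illustrative sketch (assumed normalization): matching fund-name variants.},captionpos=b]
import re

# Assumed suffix list; the text names only Fund, ETF and Portfolio.
SUFFIXES = ("Fund", "ETF", "Portfolio")

def name_variants(ncen_name):
    """Return normalized variants with the suffix swapped or dropped."""
    base = re.sub(r"\s+", " ", ncen_name).strip()
    stem = base
    for suffix in SUFFIXES:
        if base.lower().endswith(" " + suffix.lower()):
            stem = base[: -len(suffix)].rstrip()
            break
    variants = {base, stem} | {f"{stem} {s}" for s in SUFFIXES}
    return {v.lower() for v in variants if v}

def heading_matches(heading, ncen_name):
    """True if the prospectus heading equals any variant of the N-CEN name."""
    normalized = re.sub(r"\s+", " ", heading).strip().lower()
    return normalized in name_variants(ncen_name)

# Hypothetical fund name, for illustration only.
variants = name_variants("Alpha Growth Fund")
swapped = heading_matches("Alpha  Growth ETF", "Alpha Growth Fund")
other = heading_matches("Beta Growth ETF", "Alpha Growth Fund")
\end{lstlisting}

The suffix-dropped variant is the most permissive of the set and is the one most
likely to collide with another fund of the same family, so it is best combined
with the section-anchor check rather than used on its own.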

\paragraph{Coverage and fallback.} Where a fund's section cannot be located it is
skipped and counted (never silently dropped); where no section in a trust can be
located, the builder emits a single whole-trust fallback sample. On the
proof-of-concept slice, fetching all books and applying the robust segmenter
turns $14$ trusts into $141$ samples ($135$ cleanly segmented per-fund plus $6$
whole-trust fallbacks), with a per-fund median input of $\sim\!3.7\times10^{4}$
characters against a $\sim\!6.5\times10^{2}$-character target, giving a per-fund
text-to-target ratio with a median near $55\!:\!1$. The residual misses are
dominated not by segmentation but by an \emph{entity-resolution} gap: some trusts
file their prospectuses under a different CIK (or fund brand) than their N-CEN
report, so the N-CEN fund names do not appear in the fetched text at all. Closing
that gap requires joining on the SEC Series identifier across CIKs rather than
fetching more filings of the same CIK, and is left to the full-scale build.

% ====================================================================
\section{Ground truth and baselines}
\label{sec:baselines}
% ====================================================================

The dataset offers two independent ground-truth regimes.

\paragraph{Model-free gold.} For \code{advisedBy}, \code{subAdvisedBy},
\code{transferAgent}, \code{administrator} and \code{underwrittenBy}, the labels
come directly from N-CEN; for \code{seriesOf}, from the Series/Class reference
data; for the (planned) \code{holds}, \code{issuedBy} and \code{domiciledIn}, from
N-PORT. No model is involved in producing these labels, which makes them an
unusually trustworthy reference for a generative-extraction benchmark.

\paragraph{Why the custodian relation is dropped.} A text-to-triple target is only
useful if the fact can be found in the input text. The custodian relation fails
this test and is therefore \emph{excluded} from the dataset
(\code{--custodian-scope none}). N-CEN reports, for a global fund, the entire chain
of \emph{foreign sub-custodians} --- one bank per market it invests in (Citibank
Brazil, Banco de Chile, Cititrust Colombia, \dots). These names appear only in the
structured N-CEN table and in \emph{no} prose document: the summary prospectus
refers to the custodian only generically (``\dots\ including the adviser, the
custodian, and the transfer agent\dots'') and never names it. Even the
\emph{primary} custodian is typically named only in the separately filed Statement
of Additional Information (N-1A Part~B), which is not part of the fetched input.
Measured per fund, the custodian object name occurs in the fund's own prospectus
segment only $28\%$ of the time --- by far the weakest of all relations --- so
keeping it would systematically ask the model to extract facts absent from its
input. The full custodian chain remains available in N-CEN as a structured-only
relation, outside the text-to-triples task; recovering it from text would require
adding the SAI as an input source (a separate crawl).

\paragraph{Per-triple grounding flag.} Because even the retained relations are not
\emph{always} named in a fund's segment, every triple carries a boolean
\code{grounded} flag: true iff the normalized object name occurs in that sample's
input text. This lets training and evaluation restrict to grounded triples rather
than silently carrying unextractable targets. Across the full build, $80\%$ of
triples are grounded; per relation the rate ranges from $93\%$ (\code{advisedBy})
down to $62\%$ (\code{underwrittenBy}), as shown in Table~\ref{tab:baseline}.
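
The flag can be computed with a few lines of Python. The sketch below is an
assumption-laden illustration, not the builder's implementation: the
normalization rule (lower-casing, punctuation replaced by spaces), the function
names and the \code{object} key are hypothetical, while \code{input\_text},
\code{target\_triples} and \code{grounded} are the record fields described
earlier.

\begin{lstlisting}[language=Python,caption={Illustrative sketch (assumed normalization): annotating the grounded flag.},captionpos=b]
import re

def normalize(s):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", s.lower())).strip()

def flag_grounded(sample):
    """Set triple['grounded'] to True iff the normalized object name occurs
    in the normalized input text; return the grounded fraction."""
    text = normalize(sample["input_text"])
    triples = sample["target_triples"]
    for t in triples:
        t["grounded"] = normalize(t["object"]) in text
    grounded = sum(t["grounded"] for t in triples)
    return grounded / len(triples) if triples else 0.0

# The second object name is hypothetical and absent from the text.
sample = {
    "input_text": "The Fund is advised by Geode Capital Management, LLC.",
    "target_triples": [
        {"predicate": "advisedBy",
         "object": "Geode Capital Management, LLC"},
        {"predicate": "transferAgent",
         "object": "Example Transfer Services Inc."},
    ],
}
rate = flag_grounded(sample)
\end{lstlisting}

A plain substring test does not enforce word boundaries, so a very short object
name could match inside a longer word; a stricter variant would compare token
sequences instead.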

\paragraph{No-model lower bound.} The grounding flag is itself a trivial,
model-free baseline (emit a gold edge iff its object name occurs in the prose);
its per-relation rate is a conservative lower bound on the recall attainable from
the text, since it requires the normalized object name to occur as an
\emph{exact substring} within the fund's own segment and so misses surface
variants (``State Street Bank and Trust Company'' vs.\ ``State Street'') that a
trained model handles. Table~\ref{tab:baseline} reports it on the full build.

\begin{table}[h]
\centering
\caption{Per-relation prose-grounding on the full 2025\,Q3 build ($852$ samples,
$8{,}824$ triples; custodian dropped). ``Grounded'' = object name present in the
sample's input; a conservative, model-free lower bound on attainable recall.}
\label{tab:baseline}
\small
\begin{tabular}{@{}lrr@{}}
\toprule
Relation & Triples & Grounded \\
\midrule
\code{advisedBy} & 1{,}673 & 93\% \\
\code{seriesOf} & 1{,}555 & 84\% \\
\code{subAdvisedBy} & 946 & 84\% \\
\code{administrator} & 2{,}066 & 80\% \\
\code{transferAgent} & 1{,}721 & 72\% \\
\code{underwrittenBy} & 863 & 62\% \\
\midrule
all & 8{,}824 & 80\% \\
\bottomrule
\end{tabular}
\end{table}

\paragraph{Strong-model silver.} For relations that are expressed only in prose
and lack a clean structured source --- portfolio managers (\code{managedBy}),
the named benchmark index (\code{tracksIndex}), and MDFP-named holdings --- a
strong reference model (e.g.\ GPT-4 or Claude Opus) produces silver labels.
Because the structured relations have model-free gold, the silver-labelling model
can itself be \emph{measured} on the overlapping gold edges, so its reliability is
quantified rather than assumed.

% ====================================================================
\section{Corpus statistics}
% ====================================================================

Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. With
the custodian relation dropped, the N-CEN gold graph holds $12{,}694$
entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all full
prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$
closed-end or interval funds file no standard prospectus) and applying the robust
per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus
$193$ whole-trust fallbacks) containing $8{,}824$ target triples, of which $80\%$
are prose-grounded. The segmented samples have a per-fund median ratio near
$117\!:\!1$ (input prose to target serialization), and across all samples the
median exceeds $400\!:\!1$, the opposite regime from the symmetric-size
benchmarks.

\begin{table}[h]
\centering
\caption{Corpus statistics for the full 2025\,Q3 build (custodian dropped). Left:
N-CEN gold graph. Right: text-to-triple samples (all prospectus books per trust,
per-fund segmentation).}
\label{tab:stats}
\small
\begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
\toprule
\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{Samples (2025\,Q3)} \\
\cmidrule(r){1-2}\cmidrule(l){3-4}
Trust graphs & 435 & Samples (total) & 852 \\
Funds (series) & 2{,}421 & \;segmented per-fund & 659 \\
Entity-entity edges & 12{,}694 & \;whole-trust fallback & 193 \\
\;administrator & 3{,}288 & Target triples & 8{,}824 \\
\;advisedBy & 2{,}588 & \;grounded & 80\% \\
Distributors & 458 & Ratio (median, per-fund) & $117\!:\!1$ \\
\bottomrule
\end{tabular}
\end{table}

\paragraph{Train/validation/test split.} The samples are partitioned at the trust
level by a deterministic hash of the CIK: $655$ train, $122$ validation, $75$ test
samples (from $268$, $37$ and $36$ trusts respectively), with \emph{no} trust
appearing in more than one split. Adding further quarters and dropping ontology
subsets (per the thesis's augmentation strategy) expand the corpus further.

% ====================================================================
\section{Use in the thesis experiments}
% ====================================================================

Each sample is a triple $(x,\,\sigma,\,y)$ where $x$ is the prospectus prose,
$\sigma$ is the inferred ontology, and $y$ is the serialized triple graph (in
marker or plain form, depending on the model).
The model is trained to compute $y = f_{\theta}(x,\sigma)$. This dataset feeds the
four-model comparison of the thesis directly:

\begin{itemize}[nosep]
  \item \textbf{Model 1/3} (decoder-only / encoder-decoder, no extra tokens):
        trained on the plain serialization \code{target\_serialized\_plain}.
  \item \textbf{Model 2/4} (with grammar-terminal tokens): trained on the marker
        serialization \code{target\_serialized}, with the four markers
        \tstart, \predm, \objm, \tend{} added to the vocabulary as single tokens,
        testing research question~1 (do dedicated terminal tokens reduce loss and
        raise $F_1$).
\end{itemize}

\paragraph{Splits.} The dataset is partitioned into train/validation/test at the
\emph{trust} level ($80/10/10$), assigned by a deterministic hash of the trust
CIK. Splitting by trust rather than by fund prevents leakage: funds of one trust
share advisers, distributors and custodians, so a fund-level split would let the
model memorise trust-specific entities seen in training and inflate test scores.
The builder verifies that no trust appears in more than one split.
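
A trust-level split of this kind can be sketched as below. The choice of SHA-256,
the bucket arithmetic, the function names and the \code{cik} key are assumptions
for illustration; the text only states that the hash is deterministic and that the
proportions are $80/10/10$.

\begin{lstlisting}[language=Python,caption={Illustrative sketch (assumed hash): deterministic trust-level split with a leakage check.},captionpos=b]
import hashlib

def split_for_trust(cik, train=0.8, validation=0.1):
    """Map a trust CIK to a split via a stable hash (not Python's hash())."""
    digest = hashlib.sha256(str(cik).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"

def assign_splits(samples):
    """samples: dicts with a 'cik' key; returns {split: [samples]}."""
    splits = {"train": [], "validation": [], "test": []}
    for sample in samples:
        splits[split_for_trust(sample["cik"])].append(sample)
    # Leakage check: no trust may appear in more than one split.
    seen = {}
    for name, members in splits.items():
        for sample in members:
            assert seen.setdefault(sample["cik"], name) == name
    return splits

# Hypothetical CIKs, for illustration only.
samples = [
    {"cik": "0000000001", "fund": "A"},
    {"cik": "0000000001", "fund": "B"},
    {"cik": "0000000002", "fund": "C"},
]
splits = assign_splits(samples)
\end{lstlisting}

Python's built-in \code{hash()} is salted per process for strings, which is why a
cryptographic digest is used here: the same CIK lands in the same split on every
run and on every machine.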

Because the dataset's input is far longer than its output and its target is a
relational graph, it stresses precisely the capabilities the thesis cares about:
long-context reading comprehension and faithful generation of entity-to-entity
structure. Evaluation uses triple-level precision, recall and $F_1$ against the
model-free gold, matched on \code{(subject type, predicate, normalized object
label)} so that IRI-slug differences do not create spurious errors. The same
metric scores the strong-model silver baseline, giving a like-for-like comparison
between the finetuned models and a state-of-the-art prompted extractor.
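
The metric can be sketched in Python as follows, using set-based matching on the
key described above. The dictionary keys, the normalization rule and the function
names are assumptions made for this example and need not match
\code{score\_baseline.py}.

\begin{lstlisting}[language=Python,caption={Illustrative sketch (assumed record layout): triple-level precision, recall and F1.},captionpos=b]
import re

def normalize_label(label):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", label.lower())).strip()

def triple_key(triple):
    """Matching key: (subject type, predicate, normalized object label)."""
    return (triple["subject_type"], triple["predicate"],
            normalize_label(triple["object"]))

def precision_recall_f1(predicted, gold):
    pred_keys = {triple_key(t) for t in predicted}
    gold_keys = {triple_key(t) for t in gold}
    true_positives = len(pred_keys & gold_keys)
    precision = true_positives / len(pred_keys) if pred_keys else 0.0
    recall = true_positives / len(gold_keys) if gold_keys else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical triples: one match despite casing/punctuation, one miss.
gold = [
    {"subject_type": "Fund", "predicate": "advisedBy",
     "object": "Adviser X, LLC"},
    {"subject_type": "Fund", "predicate": "transferAgent",
     "object": "Agent Y Inc."},
]
predicted = [
    {"subject_type": "Fund", "predicate": "advisedBy",
     "object": "ADVISER X LLC"},
    {"subject_type": "Fund", "predicate": "administrator",
     "object": "Adviser X LLC"},
]
precision, recall, f1 = precision_recall_f1(predicted, gold)
\end{lstlisting}

Because the keys are collected into sets, duplicate predictions of the same edge
are counted once, so a model cannot raise its score by repeating a correct triple.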

% ====================================================================
\section{Reproducibility}
% ====================================================================

The dataset is built by two scripts accompanying this note.
\code{build\_rdf\_dataset.py} has three stages: \code{gold} parses the local
N-CEN flat files into per-trust gold graphs (with \code{--custodian-scope} to
choose primary-only, all, or no custodian edges); \code{fetch} downloads all
recent full prospectus books per trust from EDGAR (rate-limited, \code{gzip}-aware,
\code{--max-filings} per trust) and concatenates them; \code{samples} segments the
prose per fund and joins it with the gold into the $(x,\sigma,y)$ records described
above. \code{score\_baseline.py} computes the no-model string-match baseline and
scores any strong-model predictions against the gold. All inputs are public SEC
filings; no licensing restriction applies to the derived dataset.

\end{document}