\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[margin=2.5cm]{geometry}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{tikz}
\usetikzlibrary{positioning,arrows.meta,shapes.geometric}
\usepackage[hidelinks]{hyperref}
\usepackage{caption}
\usepackage{enumitem}

% ---- listing style (matches the Turtle/JSON look of the thesis) ----
\definecolor{kw}{rgb}{0.0,0.3,0.6}
\definecolor{str}{rgb}{0.6,0.2,0.1}
\definecolor{cmt}{rgb}{0.3,0.5,0.3}
\lstset{
  basicstyle=\ttfamily\footnotesize,
  breaklines=true,
  showstringspaces=false,
  keywordstyle=\color{kw}\bfseries,
  stringstyle=\color{str},
  commentstyle=\color{cmt}\itshape,
  frame=single,
  framesep=4pt,
  rulecolor=\color{black!25},
  backgroundcolor=\color{black!2},
  numbers=left,
  numberstyle=\tiny\color{black!40},
  xleftmargin=2.2em,
}

\newcommand{\code}[1]{\texttt{\small #1}}
\newcommand{\tstart}{\code{}}
\newcommand{\predm}{\code{}}
\newcommand{\objm}{\code{}}
\newcommand{\tend}{\code{}}

\title{\textbf{A Relationship-Rich Financial Dataset for\\
Text-to-RDF-Triple Extraction:\\
SEC Fund Disclosures as a Knowledge-Graph Source}}
\author{Dr.\ Florian Herzog\\\small Thesis supervisor --- companion technical note to the thesis \emph{Magical RDF Triples and how to synthetize them}}
\date{\today}

\begin{document}
\maketitle

\begin{abstract}
This note specifies a finance-domain dataset for training and evaluating models that extract Resource Description Framework (RDF) triples from plain text, the task at the centre of the accompanying thesis. The dataset is constructed from mandatory U.S.\ Securities and Exchange Commission (SEC) fund disclosures.
Unlike the Wikidata-derived corpora commonly used for this task --- where the source text is written \emph{from} the triples and is therefore roughly the same size as its target --- here a long natural-language prospectus (on the order of $10^{5}$--$10^{7}$ characters) maps to a compact graph of a few hundred triples, yielding a realistic text-to-output size ratio of roughly $20\!:\!1$. Crucially, the target is a genuine graph of \emph{entity-to-entity relationships} (a fund \emph{advised by} a management company, \emph{distributed by} an underwriter, \emph{holding} a security \emph{issued by} an issuer), not a flat list of literal attributes. Two distinct ground-truth regimes are available: a \emph{model-free gold} baseline derived from parallel structured SEC filings (N-CEN, N-PORT, Series/Class reference data), and a \emph{strong-model silver} baseline for the relations expressed only in prose. We describe the source filings, the target ontology and graph structure, the holdings sub-graph, the serialization into the thesis's grammar-terminal token format, and how the resulting samples are used to train and benchmark the four models under study.
\end{abstract}

\tableofcontents

% ====================================================================
\section{Motivation: the size-ratio and relationship gap}
% ====================================================================

The thesis trains a general-purpose language model to extract serialized RDF triples from plain text conditioned on an ontology. The quality of such a model is bounded by the quality of its training data. The benchmarks surveyed in the thesis (WebNLG, T-REx, REBEL, Wiki-NRE) share two properties that make them weak proxies for the real extraction problem.

\paragraph{Symmetric size.} In WebNLG, human annotators were instructed to write text \emph{from} a given set of triples.
Consequently each sentence encodes almost exactly the triples it was generated from: the input text and the target JSON are of comparable length. The task degenerates towards transliteration and never exercises the central difficulty of practical information extraction --- locating a small number of facts inside a large, noisy document.

\paragraph{Attribute-only targets.} Many relation-extraction corpora reduce to mapping a sentence to a single predicate label, or to a star of literal-valued attributes around one entity. They contain few \emph{entity-to-entity} edges, and therefore exercise little of the graph structure that motivates RDF in the first place.

A suitable dataset must instead satisfy both of the following conditions simultaneously:
\begin{enumerate}[label=(\roman*)]
  \item the input text is substantially larger than the target serialization, so the model must perform genuine reading comprehension over a long document;
  \item the target is a multi-entity-type graph of relationships, so the inferred ontology contains edges of the form \code{TypeA\ ---predicate--->\ TypeB}, not only \code{TypeA\ ---predicate--->\ literal}.
\end{enumerate}

SEC fund disclosures satisfy both, and additionally provide a rare third property: a \emph{free, non-model ground truth}, because the same facts that appear in the prose are independently filed by the same registrants in structured form.

% ====================================================================
\section{Source filings}
% ====================================================================

The dataset draws on five public SEC data sources, summarised in Table~\ref{tab:sources}. Their division of labour is the key design idea: the \emph{prose} filings provide the model input, while the \emph{structured} filings provide the ground-truth graph.
\begin{table}[h] \centering \caption{SEC source filings and their role in the dataset.} \label{tab:sources} \small \begin{tabular}{@{}llll@{}} \toprule Source & Form & Content & Role \\ \midrule Prospectus & N-1A (485BPOS, 497) & Investment objective, strategy, & \textbf{input text} \\ & & management, fees (prose) & \\ N-CEN & N-CEN & Service providers, classification & \textbf{gold edges} \\ N-PORT & NPORT-P & Portfolio holdings (quarterly) & \textbf{gold edges} \\ Series/Class CSV & --- & Trust/Series/Class identity & \textbf{gold skeleton} \\ Annual report (MDFP) & N-CSR & Top-holdings commentary (prose) & \textbf{input text (holdings)} \\ \bottomrule \end{tabular} \end{table} \paragraph{Prospectus (N-1A).} The statutory prospectus is a long legal document describing a fund family. It names, in prose, the fund's investment adviser, sub-adviser, distributor, transfer agent, portfolio managers and benchmark index, together with its objective, strategy and fee structure. A single filing covers all funds (series) of a trust and ranges from roughly $4\times10^{5}$ to $1\times10^{7}$ characters of extracted text. \paragraph{N-CEN.} The annual census filing reports, in structured tabular form, each fund's service providers --- adviser, sub-adviser, custodian, transfer agent, administrator --- and the trust's principal underwriter, each with a Legal Entity Identifier (LEI) where available. These rows are the gold standard for the service-provider edges of the graph. \paragraph{N-PORT.} The monthly portfolio filing reports, per fund, every security held, with issuer name, identifiers (CUSIP, ISIN, LEI), asset category, investment country and market value. These rows are the gold standard for the holdings sub-graph (Section~\ref{sec:holdings}). \paragraph{Series/Class reference data.} The SEC's Series/Class listing provides the trust\,$\to$\,series\,$\to$\,share-class identity backbone, gold for the structural \code{seriesOf} and \code{hasShareClass} edges. 
A central property is \emph{redundancy across modality}: a fact such as ``the fund is advised by Geode Capital Management, LLC'' appears both as a sentence in the prospectus (the input) and as a structured row in N-CEN (the label). This is what makes a model-free ground truth possible.

% ====================================================================
\section{Target ontology and graph structure}
% ====================================================================

The target of each sample is a directed, labelled multigraph $G=(E,R)$ in the sense of the thesis, where nodes are typed entities and edges are RDF predicates. The entity types and relations are listed in Table~\ref{tab:ontology}.

\begin{table}[h]
\centering
\caption{Target ontology: entity types and entity-to-entity relations.}
\label{tab:ontology}
\small
\begin{tabular}{@{}lll@{}}
\toprule
Subject type & Predicate & Object type \\
\midrule
Fund & \code{seriesOf} & Trust \\
Fund & \code{advisedBy} & InvestmentAdviser \\
Fund & \code{subAdvisedBy} & SubAdviser \\
Fund & \code{transferAgent} & TransferAgent \\
Fund & \code{custodian} & Custodian \\
Fund & \code{administrator} & Administrator \\
Trust & \code{underwrittenBy} & Distributor \\
\addlinespace
Fund & \code{holds} & Security \quad(holdings sub-graph) \\
Security & \code{issuedBy} & Issuer \\
Security & \code{domiciledIn} & Country \\
Fund & \code{tracksIndex} & Index \\
\bottomrule
\end{tabular}
\end{table}

Every relation in Table~\ref{tab:ontology} has an entity as its object, not a literal. The dataset may optionally be enriched with literal-valued attribute triples (management fee, net expense ratio, returns, portfolio turnover) drawn from the XBRL Risk/Return filings; these are deliberately \emph{secondary}, because the purpose of the dataset is to exercise relational structure.
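
The type-level schema implied by such a graph can be stated compactly. As a sketch in assumed notation (the typing function $\tau$ is introduced here for illustration and is not taken from the thesis), let $\tau$ map each entity in $E$ to its entity type. The set of patterns realised in a graph $G=(E,R)$ is then
\[
  \sigma(G) \;=\; \bigl\{\, (\tau(s),\, p,\, \tau(o)) \;:\; (s,p,o) \in R \,\bigr\}.
\]
For example, an edge from a fund to its investment adviser under \code{advisedBy} contributes the single pattern $(\text{Fund},\ \texttt{advisedBy},\ \text{InvestmentAdviser})$, and every further \code{advisedBy} edge of the same types adds nothing new, so $\sigma(G)$ stays small even when $R$ is large.
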
Following the thesis's ontology-inference procedure (SPARQL meta-schema extraction), the per-sample ontology presented to the model is the set of distinct \code{(subject type, predicate, object type)} patterns realised in that sample, e.g.

\noindent\begin{minipage}{\linewidth}
\begin{lstlisting}[language=,caption={Inferred ontology for one fund trust (model input, abbreviated).},captionpos=b]
{
  "Fund": {
    "seriesOf": ["Trust"],
    "advisedBy": ["InvestmentAdviser"],
    "subAdvisedBy": ["SubAdviser"],
    "transferAgent": ["TransferAgent"],
    "custodian": ["Custodian"],
    "administrator": ["Administrator"]
  },
  "Trust": {
    "underwrittenBy": ["Distributor"]
  }
}
\end{lstlisting}
\end{minipage}

\begin{figure}[h]
\centering
\begin{tikzpicture}[
  node distance=13mm and 30mm,
  ent/.style={draw,rounded corners,align=center,font=\footnotesize,
              inner sep=3pt,minimum height=7mm,fill=blue!5},
  edge/.style={-{Stealth},font=\scriptsize,shorten >=1pt,shorten <=1pt},
  lbl/.style={font=\scriptsize,fill=white,inner sep=1.5pt}]
  % --- service-provider / structure cluster (centre-left) ---
  \node[ent] (fund) {Fund};
  \node[ent,above=of fund] (trust) {Trust};
  \node[ent,above=of trust] (dist) {Distributor};
  \node[ent,left=of fund] (adv) {Investment\\Adviser};
  \node[ent,below=16mm of adv] (sub) {Sub-\\Adviser};
  \node[ent,below=of fund] (cust) {Custodian};
  \node[ent,right=24mm of cust] (ta) {Transfer\\Agent};
  \node[ent,right=of fund] (admin) {Administrator};
  % --- holdings cluster (far right, separated column) ---
  \node[ent,right=26mm of admin] (sec) {Security};
  \node[ent,above=of sec] (iss) {Issuer};
  \node[ent,below=of sec] (ctry) {Country};
  \draw[edge] (fund) -- node[lbl]{seriesOf} (trust);
  \draw[edge] (trust) -- node[lbl]{underwrittenBy} (dist);
  \draw[edge] (fund) -- node[lbl]{advisedBy} (adv);
  \draw[edge] (fund) -- node[lbl,pos=0.55]{subAdvisedBy} (sub);
  \draw[edge] (fund) -- node[lbl]{custodian} (cust);
  \draw[edge] (fund) -- node[lbl,pos=0.55]{transferAgent} (ta);
  \draw[edge] (fund) -- node[lbl]{administrator} (admin);
  % holds: arc from the Fund's top, over the Administrator, down to Security
  \draw[edge] (fund.north east) to[out=35,in=140] node[lbl,pos=0.65]{holds} (sec.north west);
  \draw[edge] (sec) -- node[lbl]{issuedBy} (iss);
  \draw[edge] (sec) -- node[lbl]{domiciledIn} (ctry);
\end{tikzpicture}
\caption{Schematic of the target knowledge graph. Left and centre: the service-provider/structure graph grounded in the prospectus prose. Right column (Issuer--Security--Country): the holdings sub-graph grounded in annual-report commentary with N-PORT gold.}
\label{fig:graph}
\end{figure}

% ====================================================================
\section{The holdings sub-graph}
\label{sec:holdings}
% ====================================================================

Portfolio holdings express the richest relationships in the data --- a fund \emph{holds} many securities, each \emph{issued by} an issuer \emph{domiciled in} a country --- and are the natural place to grow the graph beyond service providers. They require care, however, because holdings are \emph{not} disclosed in the prospectus: the prospectus describes a fund's \emph{strategy} (``invests in large-capitalisation equities''), never its specific positions.

The text-bearing source for holdings is the \textbf{annual or semi-annual report} (Form N-CSR). It contains two parts:
\begin{itemize}[nosep]
  \item the \emph{Schedule of Investments}, a complete table of every holding --- structured, not prose, and therefore not an extraction target; and
  \item the \emph{Management's Discussion of Fund Performance} (MDFP), a narrative in which the portfolio manager names the fund's \emph{top} positions and explains their contribution (``our largest holdings were Apple, Microsoft and \dots'').
\end{itemize}

The MDFP is genuine prose and yields real \code{holds} edges for the named positions.
The corresponding \textbf{N-PORT} filing provides the structured gold: the full holdings table, against which the MDFP-named subset can be verified and from which \code{issuedBy} and \code{domiciledIn} are taken. This produces a second, independent text-to-graph task in the same financial domain: \emph{MDFP commentary $\to$ holdings sub-graph}, with N-PORT as gold. Because it pairs a \emph{different} document type with a \emph{different} relation set, including it strengthens the cross-domain generalization claim of the thesis (Section~3.2.3): a single model is shown to extract two structurally different graphs in the same domain. Fund fact sheets and portfolio-manager commentaries published by fund companies are an additional, off-EDGAR prose source for the same edges, at the cost of having no standardized machine-readable gold. A practical caveat applies to holdings as it does to service providers (Section~\ref{sec:baselines}): only the positions \emph{named in prose} are recoverable from the input. The benchmark therefore scopes the \code{holds} target to the MDFP-named subset rather than the full N-PORT schedule, to avoid penalising a model for failing to extract facts that are absent from its input. 
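
The scoping rule can be written out explicitly. As a sketch in assumed notation (the set names and the mention predicate $\operatorname{named}$ are introduced here for illustration), let $H_f$ be the set of securities that N-PORT reports for fund $f$ and let $x_f$ be the fund's MDFP text. The scoped target keeps only the named positions:
\[
  H_f^{\mathrm{MDFP}} \;=\; \bigl\{\, s \in H_f \;:\; \operatorname{named}(s, x_f) \,\bigr\},
  \qquad
  Y_f \;=\; \bigl\{\, (f,\ \texttt{holds},\ s) \;:\; s \in H_f^{\mathrm{MDFP}} \,\bigr\}.
\]
Here $\operatorname{named}(s, x_f)$ holds when the commentary mentions security $s$ or its issuer under some surface form; how that is decided (exact match, alias list or manual check) is left open in this sketch. Recall is measured against $Y_f$ rather than against all of $H_f$, so a model that extracts every named position reaches a recall of $1$ even when $H_f^{\mathrm{MDFP}}$ is only a small part of $H_f$.
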
% ====================================================================
\section{Serialization and the marker token format}
% ====================================================================

Targets are serialized in the grammar-terminal token format introduced in the thesis (Section~5.2), in which four special tokens delimit triple components and shared subjects/predicates are factored out, mirroring Turtle's predicate-object lists:

\begin{lstlisting}[language=,caption={Target serialization for one segmented fund (primary-custodian scope).},captionpos=b]
Small Cap Special Values Fund seriesOf VALIC Co I advisedBy The Variable Annuity Life Insurance Company subAdvisedBy SunAmerica Asset Management, LLC administrator SunAmerica Asset Management, LLC custodian State Street Bank and Trust Company
\end{lstlisting}

To support the four-model comparison from a single dataset, each sample carries \emph{two} serializations of the identical triples: \code{target\_serialized}, the marker form above (for Models 2/4, whose vocabulary is extended with the four grammar-terminal tokens), and \code{target\_serialized\_plain}, a Turtle-like form using ordinary `\code{;}' and `\code{,}' delimiters and no special tokens (for Models 1/3). Because the two differ only in whether the delimiters are dedicated tokens, the comparison isolates exactly the effect under study in research question~1.

Each sample is thus a JSON record containing the input prose (\code{input\_text}), the inferred ontology (\code{ontology}), the target triples both as a structured list (\code{target\_triples}) and in the two serializations above, the trust/series identifiers, and size statistics.

% ====================================================================
\section{Per-fund segmentation}
\label{sec:segmentation}
% ====================================================================

A single prospectus filing covers \emph{all} funds of a trust and may exceed $10^{7}$ characters, beyond any practical context window.
Treating one filing as one sample is also semantically wrong: the target would mix the subgraphs of dozens of unrelated funds. The dataset therefore segments each filing into \emph{per-fund} samples, so that one fund's prospectus section maps to that one fund's subgraph. \paragraph{Fetching all books of a trust.} Large fund families split their funds across \emph{several} prospectus books, so the single most recent filing covers only a fraction of a trust's funds. The fetcher therefore retrieves the most recent \emph{full} prospectuses (forms 485BPOS/485APOS) for each trust --- preferring them over the much shorter 497/497K supplements, which are used only as a fallback --- and concatenates their text. On the proof-of-concept slice this raised the fetched text from one book per trust to a mean of seven, e.g.\ from $5\times10^{5}$ to $2.2\times10^{7}$ characters for a large ETF trust, so that far more fund sections are present. \paragraph{Section anchors.} Statutory prospectuses open each fund's block with the fund name immediately followed by a summary heading. Filers use several styles, so the segmenter accepts any of: ``Fund Summary'', ``Investment Objective'', ``Principal Investment Strategies'', the ETF objective sentence ``The Fund seeks\dots'', or a class/ticker header (``Class/Ticker:\dots''). \paragraph{Boundary selection and a collapse guard.} The segmenter collects \emph{all} anchored heading positions across the concatenated text, sorts them, and cuts each segment from one heading to the next. Because a fund name can also occur in tables of contents and cross-references, na\"ively taking the first occurrence collapses segments to a few characters; the segmenter therefore discards any candidate whose resulting segment is shorter than a minimum ($1{,}500$ characters) and, for each fund, keeps the longest surviving segment. 
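In symbols, and as a sketch in assumed notation (the names $a_k$, $\ell_k$, $K_f$ and $L_{\min}$ are introduced here for illustration, as is the convention that the last segment runs to the end of the text): let $a_1 < a_2 < \dots < a_n$ be the sorted anchor positions, let $a_{n+1}$ be the length of the concatenated text, and let $\ell_k = a_{k+1} - a_k$ be the length of the segment starting at $a_k$. If $K_f$ is the set of indices whose heading matches fund $f$, the segment kept for $f$ starts at
\[
  k^{\star}_{f} \;=\; \operatorname*{arg\,max}_{\,k \in K_f,\ \ell_k \ge L_{\min}} \ell_k,
  \qquad L_{\min} = 1500,
\]
and $f$ counts as unlocated when no $k \in K_f$ satisfies $\ell_k \ge L_{\min}$. A table-of-contents hit that is followed, say, $40$ characters later by the next entry has $\ell_k = 40 < L_{\min}$ and is discarded, whereas the true section heading, followed by a long run of prose, survives the filter.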
Each segment is paired with that fund's edges plus the fund-anchored \code{seriesOf} edge and the trust-level \code{underwrittenBy} edge.

\paragraph{Name-variant matching.} The fund name filed in N-CEN and the heading printed in the prospectus frequently differ in the legal-form suffix --- a fund filed as ``\dots\ Fund'' may be headed ``\dots\ ETF'' or ``\dots\ Portfolio''. The segmenter matches on a set of normalized variants (suffix swapped or dropped) rather than the exact N-CEN string.

\paragraph{Coverage and fallback.} Where a fund's section cannot be located it is skipped and counted (never silently dropped); where no section in a trust can be located, the builder emits a single whole-trust fallback sample. On the proof-of-concept slice, fetching all books and applying the robust segmenter turns $14$ trusts into $141$ samples ($135$ cleanly segmented per-fund plus $6$ whole-trust fallbacks), with a per-fund median input of $\sim\!3.7\times10^{4}$ characters against a $\sim\!6.5\times10^{2}$-character target --- a per-fund text-to-JSON ratio with a median near $55\!:\!1$. The residual misses are dominated not by segmentation but by an \emph{entity-resolution} gap: some trusts file their prospectuses under a different CIK (or fund brand) than their N-CEN report, so the N-CEN fund names do not appear in the fetched text at all. Closing that gap requires joining on the SEC Series identifier across CIKs rather than fetching more filings of the same CIK, and is left to the full-scale build.

% ====================================================================
\section{Ground truth and baselines}
\label{sec:baselines}
% ====================================================================

The dataset offers two independent ground-truth regimes.
\paragraph{Model-free gold.} For \code{advisedBy}, \code{subAdvisedBy}, \code{transferAgent}, \code{custodian}, \code{administrator} and \code{underwrittenBy}, the labels come directly from N-CEN; for \code{seriesOf} and \code{hasShareClass}, from the Series/Class reference data; for \code{holds}, \code{issuedBy} and \code{domiciledIn}, from N-PORT. No model is involved in producing these labels, which makes them an unusually trustworthy reference for a generative-extraction benchmark. \paragraph{The custodian relation and edge scoping.} The custodian relation illustrates a subtlety that any honest benchmark on this data must address. N-CEN reports, for a global fund, not only its \emph{primary} custodian but the entire chain of \emph{foreign sub-custodians} --- one bank per market it invests in. These sub-custodians have two damaging properties. First, they are \emph{unextractable}: they appear only in the N-CEN table and essentially never in the prospectus prose (a naive string-match recovers $7\%$ of them), so keeping them as targets asks the model to extract facts absent from its input. Second, they \emph{dominate}: with \code{IS\_SUB\_CUSTODIAN${=}$Y} accounting for $88\%$ of custodian rows, they constitute roughly two thirds of \emph{all} edges in the unfiltered graph, inflating both the target size and the training loss with noise. The dataset therefore scopes the custodian relation to the \emph{primary} custodian (\code{IS\_SUB\_CUSTODIAN${\neq}$Y}, a median of one per fund), which is genuinely prose-grounded --- it is named in the prospectus or its Statement of Additional Information (e.g.\ ``State Street Bank and Trust Company serves as custodian''). This single change reduces the corpus from $36{,}880$ to $15{,}739$ edges, all of prose-grounded relation types, and is the configurable default (\code{--custodian-scope primary}). The full sub-custodian chain remains available in N-CEN as a structured-only relation outside the text-to-triples task. 
This is a dataset-quality decision of the same kind the thesis notes for T-REx and REBEL, whose non-exhaustive references unfairly penalise correct extractions.

\paragraph{No-model lower bound.} A trivial string-matching baseline --- emit a gold edge iff the object's name occurs in the prose --- establishes a \emph{floor} and measures \emph{how prose-grounded each relation is}. Table~\ref{tab:baseline} reports this on the full 2025\,Q3 build after primary-custodian scoping, multi-book fetching and per-fund segmentation. Because the baseline requires an \emph{exact substring} match within the fund's \emph{own} section, its recall is a strict lower bound: a fund's adviser, for instance, must be named in that fund's segment under a literal spelling. On the full quarter the adviser is recovered with recall $0.93$ and the micro-averaged $F_1$ reaches $0.79$ (with precision $P=1$ by construction and micro-averaged recall $R=0.65$, $F_1 = 2R/(1+R) \approx 0.79$); the recovered custodian recall is $0.63$ (up from $0.07$ unscoped and $0.37$ after scoping alone). The residual gap from $1.0$ is attributable to surface-form variation (``State Street Bank and Trust Company'' vs.\ ``State Street'') that a trained model can handle but exact matching cannot.

\begin{table}[h]
\centering
\caption{No-model string-match baseline on the full 2025\,Q3 build, after primary-custodian scoping, multi-book fetching and per-fund segmentation ($852$ samples). Precision is $1.00$ by construction; recall is a strict exact-match lower bound.}
\label{tab:baseline}
\small
\begin{tabular}{@{}lrr@{}}
\toprule
Relation & Recall & Gold edges \\
\midrule
\code{advisedBy} & 0.93 & 1{,}673 \\
\code{seriesOf} & 0.84 & 1{,}555 \\
\code{subAdvisedBy} & 0.84 & 946 \\
\code{administrator} & 0.80 & 2{,}066 \\
\code{transferAgent} & 0.72 & 1{,}721 \\
\code{custodian} & 0.63 & 1{,}761 \\
\code{underwrittenBy} & 0.62 & 863 \\
\midrule
micro-average & 0.65 & 6{,}479 \\
\bottomrule
\end{tabular}
\end{table}

\paragraph{Strong-model silver.} For relations that are expressed only in prose and lack a clean structured source --- portfolio managers (\code{managedBy}), the named benchmark index (\code{tracksIndex}), and MDFP-named holdings --- a strong reference model (e.g.\ GPT-4 or Claude Opus) produces silver labels. Because the structured relations have model-free gold, the silver-labelling model can itself be \emph{measured} on the overlapping gold edges, so its reliability is quantified rather than assumed.

% ====================================================================
\section{Corpus statistics}
% ====================================================================

Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. The N-CEN gold graph (after primary-custodian scoping) holds $15{,}739$ entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all full prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$ closed-end or interval funds file no standard prospectus) and applying the robust per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus $193$ whole-trust fallbacks). The segmented samples have a per-fund median ratio near $117\!:\!1$ (input prose to target serialization), and across all samples the median exceeds $400\!:\!1$ --- the inverse of the symmetric-size benchmarks.
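
The ratios quoted above are per-sample quantities. As a sketch in assumed notation (the symbols $\rho_i$ and $\tilde\rho$ are introduced here for illustration), write $\lvert x_i\rvert$ for the character length of the input prose of sample $i$ and $\lvert y_i\rvert$ for that of its target:
\[
  \rho_i \;=\; \frac{\lvert x_i\rvert}{\lvert y_i\rvert},
  \qquad
  \tilde\rho \;=\; \operatorname{median}_{i}\, \rho_i .
\]
The median of the ratios need not equal the ratio of the medians. On the proof-of-concept slice of Section~\ref{sec:segmentation}, for example, the median lengths give $3.7\times10^{4} / (6.5\times10^{2}) \approx 57$, whereas the reported median of the per-fund ratios is near $55$.
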
\begin{table}[h]
\centering
\caption{Corpus statistics for the full 2025\,Q3 build. Left: N-CEN gold graph (primary-custodian scope). Right: text-to-triple samples (all prospectus books per trust, per-fund segmentation).}
\label{tab:stats}
\small
\begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
\toprule
\multicolumn{2}{c}{Gold graph (2025\,Q3)} & \multicolumn{2}{c}{Samples (2025\,Q3)} \\
\cmidrule(r){1-2}\cmidrule(l){3-4}
Trust graphs & 435 & Samples (total) & 852 \\
Funds (series) & 2{,}421 & \;segmented per-fund & 659 \\
Entity-entity edges & 15{,}739 & \;whole-trust fallback & 193 \\
\;custodian (primary) & 3{,}045 & Trusts fetched & 393 \\
\;advisedBy & 2{,}588 & Prospectus filings & 2{,}326 \\
Distributors & 458 & Ratio (median, per-fund) & $117\!:\!1$ \\
\bottomrule
\end{tabular}
\end{table}

\paragraph{Train/validation/test split.} The samples are partitioned at the trust level by a deterministic hash of the CIK: $655$ train, $122$ validation, $75$ test samples (from $268$, $37$ and $36$ trusts respectively), with \emph{no} trust appearing in more than one split. Adding further quarters and dropping ontology subsets (per the thesis's augmentation strategy) can expand the corpus further.

% ====================================================================
\section{Use in the thesis experiments}
% ====================================================================

Each sample is a triple $(x,\,\sigma,\,y)$ where $x$ is the prospectus prose, $\sigma$ is the inferred ontology, and $y$ is the marker-serialized triple graph. The model is trained to compute $y = f_{\theta}(x,\sigma)$. This dataset feeds the four-model comparison of the thesis directly:
\begin{itemize}[nosep]
  \item \textbf{Models 1/3} (decoder-only / encoder-decoder, no extra tokens): trained on the plain serialization \code{target\_serialized\_plain}.
  \item \textbf{Models 2/4} (with grammar-terminal tokens): trained on the marker serialization \code{target\_serialized}, with the four markers \tstart, \predm, \objm, \tend{} added to the vocabulary as single tokens, testing research question~1 (do dedicated terminal tokens reduce loss and raise $F_1$?).
\end{itemize}

\paragraph{Splits.} The dataset is partitioned into train/validation/test at the \emph{trust} level ($80/10/10$), assigned by a deterministic hash of the trust CIK. Splitting by trust rather than by fund prevents leakage: funds of one trust share advisers, distributors and custodians, so a fund-level split would let the model memorise trust-specific entities seen in training and inflate test scores. The builder verifies that no trust appears in more than one split.

Because the dataset's input is far longer than its output and its target is a relational graph, it stresses precisely the capabilities the thesis cares about: long-context reading comprehension and faithful generation of entity-to-entity structure. Evaluation uses triple-level precision, recall and $F_1$ against the model-free gold, matched on \code{(subject type, predicate, normalized object label)} so that IRI-slug differences do not create spurious errors. The same metric scores the strong-model silver baseline, giving a like-for-like comparison between the finetuned models and a state-of-the-art prompted extractor.

% ====================================================================
\section{Reproducibility}
% ====================================================================

The dataset is built by two scripts accompanying this note.
\code{build\_rdf\_dataset.py} has three stages: \code{gold} parses the local N-CEN flat files into per-trust gold graphs (with \code{--custodian-scope} to choose primary-only, all, or no custodian edges); \code{fetch} downloads all recent full prospectus books per trust from EDGAR (rate-limited, \code{gzip}-aware, \code{--max-filings} per trust) and concatenates them; \code{samples} segments the prose per fund and joins it with the gold into the $(x,\sigma,y)$ records described above. \code{score\_baseline.py} computes the no-model string-match baseline and scores any strong-model predictions against the gold. All inputs are public SEC filings; no licensing restriction applies to the derived dataset. \end{document}