Drop non-extractable custodian relation; add per-triple grounded flag

Custodian names (esp. foreign sub-custodians) appear only in structured N-CEN,
never in the prospectus prose, so they are not a valid text->triple target.
Per-fund the custodian object name occurs in only 28% of segments, the weakest
of all relations. Default is now --custodian-scope none.

Every triple now carries a 'grounded' boolean (object name present in the
sample's input text); 80% of triples are grounded across the full build. This
lets training/eval restrict to text-extractable targets.

- build_rdf_dataset.py: annotate_grounding() + grounded flag in samples/stats
- gold rebuilt without custodian (15,739 -> 12,694 edges)
- dataset_description + README updated (custodian dropped, grounding documented)

Reported by thesis author: Citibank custodians in triples for 0001529390 never
appear in that prospectus text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 10:34:14 +02:00 · 00f51859e0
commit 00f51859e0
parent 63e650fa14
5 changed files with 566 additions and 512 deletions
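
The commit message says the new flag lets training/eval restrict to text-extractable targets. A minimal sketch of such a filter follows, assuming records shaped like the README's `samples.jsonl` description (a `target_triples` list of `{s,p,o,grounded}` dicts). The helper names `keep_grounded` and `iter_grounded` are hypothetical and not part of this commit.

```python
import json


def keep_grounded(record: dict) -> dict:
    """Return a shallow copy of a sample keeping only grounded target triples.

    Hypothetical helper: only `target_triples` is filtered. The serialized
    targets and `stats` are left untouched and would still describe all triples.
    """
    kept = [t for t in record.get("target_triples", []) if t.get("grounded")]
    out = dict(record)  # shallow copy so the caller's record is not modified
    out["target_triples"] = kept
    return out


def iter_grounded(path: str):
    """Yield filtered samples from a JSONL file, one record per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield keep_grounded(json.loads(line))
```

Note that `target_serialized` and `target_serialized_plain` are built from the full triple list, so a training pipeline that filters this way would presumably need to re-serialize the kept triples (for example with the repository's own `serialize_triples`) rather than reuse the stored strings.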
--- a/README.md
+++ b/README.md
@@ -54,19 +54,33 @@ Each line of `samples.jsonl` / `train|val|test.jsonl` is a JSON record:
 |---|---|
 | `input_text` | prospectus prose for the fund (model input) |
 | `ontology` | inferred meta-schema (subject type → predicate → object type) |
-| `target_triples` | structured `{s,p,o}` list |
+| `target_triples` | structured `{s,p,o,grounded}` list (`grounded` = object name appears in `input_text`) |
 | `target_serialized` | marker form (`<triple_start>` …) for Models 2/4 |
 | `target_serialized_plain` | Turtle-like form, no special tokens, for Models 1/3 |
 | `cik`, `series_id`, `fund`, `trust_name` | identifiers |
-| `stats` | input/target sizes, triple count, text:json ratio |
+| `stats` | input/target sizes, triple count, `n_grounded`, text:json ratio |

 ## Relations

 Entity-to-entity edges (gold from N-CEN / Series-Class):
-`seriesOf`, `advisedBy`, `subAdvisedBy`, `transferAgent`, `custodian` (primary
-only by default), `administrator`, `underwrittenBy`. Holdings edges
-(`holds`/`issuedBy`/`domiciledIn`, gold from N-PORT) are a planned second track
-from annual-report (N-CSR) commentary — see the description PDF.
+`seriesOf`, `advisedBy`, `subAdvisedBy`, `transferAgent`, `administrator`,
+`underwrittenBy`.
+
+**`custodian` is dropped by default** (`--custodian-scope none`): custodian names
+— especially foreign sub-custodians — appear only in the structured N-CEN table
+and in *no* prose document (the summary prospectus says only "the custodian"),
+so they are not extractable from text. The primary custodian is named only in the
+separately-filed SAI (N-1A Part B), which is not part of the input. Use
+`--custodian-scope primary` or `all` to re-include it if you add the SAI as input.
+
+**Prose-grounding:** every triple carries a `grounded` flag (object name present
+in the sample's input). Across the full build ~80 % of triples are grounded
+(per relation: advisedBy 93 %, seriesOf/subAdvisedBy/administrator 80–84 %,
+transferAgent 72 %, underwrittenBy 62 %). Filter on `grounded` to train/evaluate
+only on text-extractable targets.
+
+Holdings edges (`holds`/`issuedBy`/`domiciledIn`, gold from N-PORT) are a planned
+second track from annual-report (N-CSR) commentary — see the description PDF.

 ## Data sources

--- a/build_rdf_dataset.py
+++ b/build_rdf_dataset.py
@@ -394,6 +394,39 @@ def serialize_triples_plain(triples, entities) -> str:
    return "\n".join(chunks)


+_GND_SUFFIX = re.compile(
+    r"\b(llc|l\.l\.c|inc|incorporated|corp|corporation|company|co|ltd|limited"
+    r"|lp|l\.p|llp|n\.a|na|the|trust)\b", re.I)
+
+
+def _gnorm(s: str) -> str:
+    """Normalize a name for prose-grounding checks (lowercase, strip legal suffixes)."""
+    s = _GND_SUFFIX.sub(" ", (s or "").lower())
+    s = re.sub(r"[^a-z0-9]+", " ", s)
+    return re.sub(r"\s+", " ", s).strip()
+
+
+def annotate_grounding(triples, entities, text):
+    """Mark each triple as prose-grounded if its object name appears in `text`.
+
+    A triple is only useful as a TEXT->triple training target if the fact can be
+    found in the input. This adds a boolean `grounded` to each triple (the
+    normalized object label occurs as a substring of the normalized input), so
+    training/eval can restrict to grounded triples. Returns (annotated_triples,
+    n_grounded). Mutates copies, not the input dicts.
+    """
+    ntext = _gnorm(text)
+    out = []
+    n_grounded = 0
+    for t in triples:
+        olbl = entities.get(t["o"], {}).get("label", "")
+        no = _gnorm(olbl)
+        g = bool(no) and no in ntext
+        n_grounded += int(g)
+        out.append({**t, "grounded": g})
+    return out, n_grounded
+
+
 def ontology_schema(triples, entities) -> dict:
    """Inferred meta-schema (subject type -> predicate -> object type), per thesis 5.3."""
    schema = defaultdict(lambda: defaultdict(set))
@@ -580,6 +613,7 @@ def _build_samples_per_fund():
            if not segs:  # whole-trust fallback (no section located)
                n_fallback_trusts += 1
                triples = g["triples"]
+                triples, n_gnd = annotate_grounding(triples, ents, text)
                target = serialize_triples(triples, ents)
                target_plain = serialize_triples_plain(triples, ents)
                rec = {
@@ -590,7 +624,8 @@ def _build_samples_per_fund():
                    "target_serialized": target,
                    "target_serialized_plain": target_plain,
                    "stats": {"input_chars": len(text), "target_chars": len(target),
-                              "n_triples": len(triples), "n_entities": len(ents),
+                              "n_triples": len(triples), "n_grounded": n_gnd,
+                              "n_entities": len(ents),
                              "text_to_json_ratio": round(len(text) / max(1, len(target)), 1)},
                }
                out.write(json.dumps(rec, ensure_ascii=False) + "\n")
@@ -611,6 +646,7 @@ def _build_samples_per_fund():
                for t in triples:
                    ref.add(t["s"]); ref.add(t["o"])
                sub_ents = {k: ents[k] for k in ref if k in ents}
+                triples, n_gnd = annotate_grounding(triples, sub_ents, seg)
                target = serialize_triples(triples, sub_ents)
                target_plain = serialize_triples_plain(triples, sub_ents)
                rec = {
@@ -623,7 +659,8 @@ def _build_samples_per_fund():
                    "target_serialized": target,
                    "target_serialized_plain": target_plain,
                    "stats": {"input_chars": len(seg), "target_chars": len(target),
-                              "n_triples": len(triples), "n_entities": len(sub_ents),
+                              "n_triples": len(triples), "n_grounded": n_gnd,
+                              "n_entities": len(sub_ents),
                              "text_to_json_ratio": round(len(seg) / max(1, len(target)), 1)},
                }
                out.write(json.dumps(rec, ensure_ascii=False) + "\n")
--- a/data/rdf_poc/gold_graphs.jsonl
+++ b/data/rdf_poc/gold_graphs.jsonl
--- a/dataset_description.pdf
+++ b/dataset_description.pdf
--- a/dataset_description.tex
+++ b/dataset_description.tex
@@ -189,11 +189,14 @@ Fund  & \code{seriesOf}       & Trust \\
 Fund  & \code{advisedBy}      & InvestmentAdviser \\
 Fund  & \code{subAdvisedBy}   & SubAdviser \\
 Fund  & \code{transferAgent}  & TransferAgent \\
-Fund  & \code{custodian}      & Custodian \\
 Fund  & \code{administrator}  & Administrator \\
 Trust & \code{underwrittenBy} & Distributor \\
 \addlinespace
-Fund     & \code{holds}        & Security \quad(holdings sub-graph) \\
+\multicolumn{3}{@{}l}{\emph{dropped (not prose-grounded, see \S\ref{sec:baselines}):}}\\
+Fund  & \code{custodian}      & Custodian \\
+\addlinespace
+\multicolumn{3}{@{}l}{\emph{holdings sub-graph (planned 2nd track):}}\\
+Fund     & \code{holds}        & Security \\
 Security & \code{issuedBy}     & Issuer \\
 Security & \code{domiciledIn}  & Country \\
 Fund     & \code{tracksIndex}  & Index \\
@@ -220,7 +223,6 @@ realised in that sample, e.g.
    "advisedBy":     ["InvestmentAdviser"],
    "subAdvisedBy":  ["SubAdviser"],
    "transferAgent": ["TransferAgent"],
-    "custodian":     ["Custodian"],
    "administrator": ["Administrator"]
  },
  "Trust": { "underwrittenBy": ["Distributor"] }
@@ -242,7 +244,7 @@ realised in that sample, e.g.
  \node[ent,above=of trust] (dist) {Distributor};
  \node[ent,left=of fund]  (adv)  {Investment\\Adviser};
  \node[ent,below=16mm of adv] (sub) {Sub-\\Adviser};
-  \node[ent,below=of fund] (cust) {Custodian};
+  \node[ent,dashed,fill=black!5,below=of fund] (cust) {Custodian};
  \node[ent,right=24mm of cust] (ta) {Transfer\\Agent};
  \node[ent,right=of fund] (admin) {Administrator};
  % --- holdings cluster (far right, separated column) ---
@@ -254,7 +256,7 @@ realised in that sample, e.g.
  \draw[edge] (trust) -- node[lbl]{underwrittenBy} (dist);
  \draw[edge] (fund) -- node[lbl]{advisedBy} (adv);
  \draw[edge] (fund) -- node[lbl,pos=0.55]{subAdvisedBy} (sub);
-  \draw[edge] (fund) -- node[lbl]{custodian} (cust);
+  \draw[edge,dashed,gray] (fund) -- node[lbl]{custodian} (cust);
  \draw[edge] (fund) -- node[lbl,pos=0.55]{transferAgent} (ta);
  \draw[edge] (fund) -- node[lbl]{administrator} (admin);
  % holds: arc from the Fund's top, over the Administrator, down to Security
@@ -264,9 +266,10 @@ realised in that sample, e.g.
  \draw[edge] (sec) -- node[lbl]{domiciledIn} (ctry);
 \end{tikzpicture}
 \caption{Schematic of the target knowledge graph. Left and centre: the
-service-provider/structure graph grounded in the prospectus prose. Right column
-(Issuer--Security--Country): the holdings sub-graph grounded in annual-report
-commentary with N-PORT gold.}
+service-provider/structure graph grounded in the prospectus prose. The dashed
+\code{custodian} edge is \emph{dropped} from the dataset (not prose-grounded,
+\S\ref{sec:baselines}). Right column (Issuer--Security--Country): the holdings
+sub-graph grounded in annual-report commentary with N-PORT gold.}
 \label{fig:graph}
 \end{figure}

@@ -322,18 +325,22 @@ thesis (Section~5.2), in which four special tokens delimit triple components and
 shared subjects/predicates are factored out, mirroring Turtle's predicate-object
 lists:

-\begin{lstlisting}[language=,caption={Target serialization for one segmented fund (primary-custodian scope).},captionpos=b]
-<triple_start> Small Cap Special Values Fund
+\begin{lstlisting}[language=,caption={Target serialization for one segmented fund (custodian dropped).},captionpos=b]
+<triple_start> John Hancock Bond Fund
  <predicate_marker> seriesOf
-    <object_marker> VALIC Co I
+    <object_marker> JOHN HANCOCK SOVEREIGN BOND FUND
  <predicate_marker> advisedBy
-    <object_marker> The Variable Annuity Life Insurance Company
+    <object_marker> John Hancock Investment Management LLC
  <predicate_marker> subAdvisedBy
-    <object_marker> SunAmerica Asset Management, LLC
+    <object_marker> Manulife Investment Management (US) LLC
  <predicate_marker> administrator
-    <object_marker> SunAmerica Asset Management, LLC
-  <predicate_marker> custodian
-    <object_marker> State Street Bank and Trust Company
+    <object_marker> John Hancock Investment Management LLC
+  <predicate_marker> transferAgent
+    <object_marker> John Hancock Signature Services, Inc.
+<triple_end>
+<triple_start> JOHN HANCOCK SOVEREIGN BOND FUND
+  <predicate_marker> underwrittenBy
+    <object_marker> JOHN HANCOCK INVESTMENT MANAGEMENT DISTRIBUTORS LLC
 <triple_end>
 \end{lstlisting}

@@ -348,8 +355,9 @@ question~1.

 Each sample is thus a JSON record with: the input prose (\code{input\_text}), the
 inferred ontology (\code{ontology}), the target triples as a structured list
-(\code{target\_triples}) and in both serializations, the trust/series identifiers,
-and size statistics.
+(\code{target\_triples}, each triple carrying a \code{grounded} flag) and in both
+serializations, the trust/series identifiers, and size statistics (including the
+grounded-triple count).

 % ====================================================================
 \section{Per-fund segmentation}
@@ -417,70 +425,64 @@ fetching more filings of the same CIK, and is left to the full-scale build.
 The dataset offers two independent ground-truth regimes.

 \paragraph{Model-free gold.} For \code{advisedBy}, \code{subAdvisedBy},
-\code{transferAgent}, \code{custodian}, \code{administrator} and
-\code{underwrittenBy}, the labels come directly from N-CEN; for \code{seriesOf}
-and \code{hasShareClass}, from the Series/Class reference data; for \code{holds},
-\code{issuedBy} and \code{domiciledIn}, from N-PORT. No model is involved in
-producing these labels, which makes them an unusually trustworthy reference for a
-generative-extraction benchmark.
+\code{transferAgent}, \code{administrator} and \code{underwrittenBy}, the labels
+come directly from N-CEN; for \code{seriesOf}, from the Series/Class reference
+data; for the (planned) \code{holds}, \code{issuedBy} and \code{domiciledIn}, from
+N-PORT. No model is involved in producing these labels, which makes them an
+unusually trustworthy reference for a generative-extraction benchmark.

-\paragraph{The custodian relation and edge scoping.} The custodian relation
-illustrates a subtlety that any honest benchmark on this data must address. N-CEN
-reports, for a global fund, not only its \emph{primary} custodian but the entire
-chain of \emph{foreign sub-custodians} --- one bank per market it invests in.
-These sub-custodians have two damaging properties. First, they are
-\emph{unextractable}: they appear only in the N-CEN table and essentially never in
-the prospectus prose (a naive string-match recovers $7\%$ of them), so keeping
-them as targets asks the model to extract facts absent from its input. Second,
-they \emph{dominate}: with \code{IS\_SUB\_CUSTODIAN${=}$Y} accounting for $88\%$
-of custodian rows, they constitute roughly two thirds of \emph{all} edges in the
-unfiltered graph, inflating both the target size and the training loss with noise.
-The dataset therefore scopes the custodian relation to the \emph{primary}
-custodian (\code{IS\_SUB\_CUSTODIAN${\neq}$Y}, a median of one per fund), which is
-genuinely prose-grounded --- it is named in the prospectus or its Statement of
-Additional Information (e.g.\ ``State Street Bank and Trust Company serves as
-custodian''). This single change reduces the corpus from $36{,}880$ to $15{,}739$
-edges, all of prose-grounded relation types, and is the configurable default
-(\code{--custodian-scope primary}). The full sub-custodian chain remains available
-in N-CEN as a structured-only relation outside the text-to-triples task. This is a
-dataset-quality decision of the same kind the thesis notes for T-REx and REBEL,
-whose non-exhaustive references unfairly penalise correct extractions.
+\paragraph{Why the custodian relation is dropped.} A text-to-triple target is only
+useful if the fact can be found in the input text. The custodian relation fails
+this test and is therefore \emph{excluded} from the dataset
+(\code{--custodian-scope none}). N-CEN reports, for a global fund, the entire chain
+of \emph{foreign sub-custodians} --- one bank per market it invests in (Citibank
+Brazil, Banco de Chile, Cititrust Colombia, \dots). These names appear only in the
+structured N-CEN table and in \emph{no} prose document: the summary prospectus
+refers to the custodian only generically (``\dots\ including the adviser, the
+custodian, and the transfer agent\dots'') and never names it. Even the
+\emph{primary} custodian is typically named only in the separately filed Statement
+of Additional Information (N-1A Part~B), which is not part of the fetched input.
+Measured per fund, the custodian object name occurs in the fund's own prospectus
+segment only $28\%$ of the time --- by far the weakest of all relations --- so
+keeping it would systematically ask the model to extract facts absent from its
+input. The full custodian chain remains available in N-CEN as a structured-only
+relation, outside the text-to-triples task; recovering it from text would require
+adding the SAI as an input source (a separate crawl).

-\paragraph{No-model lower bound.} A trivial string-matching baseline --- emit a
-gold edge iff the object's name occurs in the prose --- establishes a \emph{floor}
-and measures \emph{how prose-grounded each relation is}. Table~\ref{tab:baseline}
-reports this on the proof-of-concept slice after primary-custodian scoping,
-multi-book fetching and per-fund segmentation. Because the baseline requires an
-\emph{exact substring} match within the fund's \emph{own} section, its recall is a
-strict lower bound: a fund's adviser, for instance, must be named in that fund's
-segment under a literal spelling. On the full quarter the adviser is recovered
-with recall $0.93$ and the micro-averaged $F_1$ reaches $0.79$; the recovered
-custodian recall is $0.63$ (up from $0.07$ unscoped and $0.37$ after scoping
-alone). The residual gap from $1.0$ is attributable to surface-form variation
-(``State Street Bank and Trust Company'' vs.\ ``State Street'') that a trained
-model handles but exact matching does not.
+\paragraph{Per-triple grounding flag.} Because even the retained relations are not
+\emph{always} named in a fund's segment, every triple carries a boolean
+\code{grounded} flag: true iff the normalized object name occurs in that sample's
+input text. This lets training and evaluation restrict to grounded triples rather
+than silently carrying unextractable targets. Across the full build, $80\%$ of
+triples are grounded; per relation the rate ranges from $93\%$ (\code{advisedBy})
+down to $62\%$ (\code{underwrittenBy}), as shown in Table~\ref{tab:baseline}.
+
+\paragraph{No-model lower bound.} The grounding flag is itself a trivial,
+model-free baseline (emit a gold edge iff its object name occurs in the prose);
+its per-relation rate is a strict lower bound on recall, since it requires an
+\emph{exact substring} match within the fund's own segment and so misses surface
+variants (``State Street Bank and Trust Company'' vs.\ ``State Street'') that a
+trained model handles. Table~\ref{tab:baseline} reports it on the full build.

 \begin{table}[h]
 \centering
-\caption{No-model string-match baseline on the full 2025\,Q3 build, after
-primary-custodian scoping, multi-book fetching and per-fund segmentation
-($852$ samples). Precision is $1.00$ by construction; recall is a strict
-exact-match lower bound.}
+\caption{Per-relation prose-grounding on the full 2025\,Q3 build ($852$ samples,
+$8{,}824$ triples; custodian dropped). ``Grounded'' = object name present in the
+sample's input; a strict, model-free lower bound on recall.}
 \label{tab:baseline}
 \small
 \begin{tabular}{@{}lrr@{}}
 \toprule
-Relation & Recall & Gold edges \\
+Relation & Triples & Grounded \\
 \midrule
-\code{advisedBy}      & 0.93 & 1{,}673 \\
-\code{seriesOf}       & 0.84 & 1{,}555 \\
-\code{subAdvisedBy}   & 0.84 & 946 \\
-\code{administrator}  & 0.80 & 2{,}066 \\
-\code{transferAgent}  & 0.72 & 1{,}721 \\
-\code{custodian}      & 0.63 & 1{,}761 \\
-\code{underwrittenBy} & 0.62 & 863 \\
+\code{advisedBy}      & 1{,}673 & 93\% \\
+\code{seriesOf}       & 1{,}555 & 84\% \\
+\code{subAdvisedBy}   & 946     & 84\% \\
+\code{administrator}  & 2{,}066 & 80\% \\
+\code{transferAgent}  & 1{,}721 & 72\% \\
+\code{underwrittenBy} & 863     & 62\% \\
 \midrule
-micro-average         & 0.65 & 6{,}479 \\
+all                   & 8{,}824 & 80\% \\
 \bottomrule
 \end{tabular}
 \end{table}
@@ -497,21 +499,22 @@ quantified rather than assumed.
 \section{Corpus statistics}
 % ====================================================================

-Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. The
-N-CEN gold graph (after primary-custodian scoping) holds $15{,}739$
+Table~\ref{tab:stats} summarises one full quarter (2025\,Q3) of the dataset. With
+the custodian relation dropped, the N-CEN gold graph holds $12{,}694$
 entity-to-entity edges across $435$ trusts and $2{,}421$ funds. Fetching all full
 prospectus books for every trust ($2{,}326$ filings across $393$ trusts; $42$
 closed-end or interval funds file no standard prospectus) and applying the robust
 per-fund segmenter yields $852$ samples ($659$ cleanly segmented per-fund plus
-$193$ whole-trust fallbacks). The segmented samples have a per-fund median ratio
-near $117\!:\!1$ (input prose to target serialization), and across all samples the
+$193$ whole-trust fallbacks) containing $8{,}824$ target triples, of which $80\%$
+are prose-grounded. The segmented samples have a per-fund median ratio near
+$117\!:\!1$ (input prose to target serialization), and across all samples the
 median exceeds $400\!:\!1$ --- the inverse of the symmetric-size benchmarks.

 \begin{table}[h]
 \centering
-\caption{Corpus statistics for the full 2025\,Q3 build. Left: N-CEN gold graph
-(primary-custodian scope). Right: text-to-triple samples (all prospectus books per
-trust, per-fund segmentation).}
+\caption{Corpus statistics for the full 2025\,Q3 build (custodian dropped). Left:
+N-CEN gold graph. Right: text-to-triple samples (all prospectus books per trust,
+per-fund segmentation).}
 \label{tab:stats}
 \small
 \begin{tabular}{@{}lr@{\hskip 3em}lr@{}}
@@ -520,9 +523,9 @@ trust, per-fund segmentation).}
 \cmidrule(r){1-2}\cmidrule(l){3-4}
 Trust graphs        & 435      & Samples (total)         & 852 \\
 Funds (series)      & 2{,}421  & \;segmented per-fund    & 659 \\
-Entity-entity edges & 15{,}739 & \;whole-trust fallback  & 193 \\
-\;custodian (primary) & 3{,}045 & Trusts fetched         & 393 \\
-\;advisedBy         & 2{,}588  & Prospectus filings     & 2{,}326 \\
+Entity-entity edges & 12{,}694 & \;whole-trust fallback  & 193 \\
+\;administrator     & 3{,}288  & Target triples          & 8{,}824 \\
+\;advisedBy         & 2{,}588  & \;grounded              & 80\% \\
 Distributors        & 458      & Ratio (median, per-fund) & $117\!:\!1$ \\
 \bottomrule
 \end{tabular}
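
The per-relation grounding percentages reported in the README and in the baseline table can in principle be recomputed from the built samples. The following is a rough sketch under the assumption that each record exposes `target_triples` with `p` and `grounded` keys as documented above. The function name `grounding_rates` and the synthetic `"all"` key are illustrative choices, not something this commit adds.

```python
from collections import Counter


def grounding_rates(records):
    """Return {predicate: fraction of triples flagged grounded}, plus an "all" entry.

    Illustrative helper: `records` is any iterable of sample dicts that carry a
    `target_triples` list whose items have `p` and `grounded` keys.
    """
    total, grounded = Counter(), Counter()
    for rec in records:
        for t in rec.get("target_triples", []):
            total[t["p"]] += 1
            grounded[t["p"]] += int(bool(t.get("grounded")))
    rates = {p: grounded[p] / total[p] for p in total}
    n_all = sum(total.values())
    if n_all:  # avoid division by zero on an empty input
        rates["all"] = sum(grounded.values()) / n_all
    return rates
```

Run over every record of `samples.jsonl`, the `"all"` entry should correspond to the overall grounded share quoted in the commit message, provided the file matches the build described here.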