fund_rfid_data

herzogfloria/fund_rfid_data

Commit Graph

Author	SHA1	Message	Date

Florian Herzog

63e650fa14

Update dataset description with full 2025Q3 build statistics

Full build: 2,326 prospectus filings across 393 trusts -> 852 samples
(659 segmented per-fund + 193 fallback), trust-level split 655/122/75,
no-model baseline F1=0.79.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-03 11:21:23 +02:00

Florian Herzog

1993658fb2

Add SEC fund prospectus -> RDF triple dataset pipeline

Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures, the dataset for the thesis 'Magical RDF Triples and
how to synthetize them'.

- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
  all books per trust), samples (per-fund segmentation, marker + plain
  serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-03 10:31:35 +02:00
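
The "trust-level 80/10/10, no leakage" split named in the commit message above could look roughly like the following. This is a minimal sketch, not the code in build_rdf_dataset.py: the `trust_id` field, the split names `train`/`dev`/`test`, and the hash-based assignment are all assumptions made for illustration. The point it shows is that assigning whole trusts (rather than individual samples) to a split keeps every sample of a trust on the same side.

```python
import hashlib
from collections import defaultdict


def assign_split(trust_id: str, ratios=(0.8, 0.1, 0.1)) -> str:
    """Map a trust identifier to a split, deterministically.

    Hypothetical helper: hashes the trust ID into [0, 1) and compares it
    against cumulative ratios, so the same trust always gets the same split.
    """
    digest = hashlib.sha256(trust_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # 32-bit prefix scaled to [0, 1)
    if bucket < ratios[0]:
        return "train"
    if bucket < ratios[0] + ratios[1]:
        return "dev"
    return "test"


def split_samples(samples):
    """Group samples by the split of their trust (assumed key: 'trust_id')."""
    splits = defaultdict(list)
    for sample in samples:
        splits[assign_split(sample["trust_id"])].append(sample)
    return dict(splits)
```

Because the ratios apply to trusts and trusts contribute different numbers of samples, the per-split sample counts need not be exactly 80/10/10, which is consistent with the 655/122/75 sample counts reported in the later commit.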

2 Commits
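
The commit messages report a baseline F1 but do not say how it is computed. One common way to score triple extraction is set-based F1 over exact-match (subject, predicate, object) tuples; the sketch below illustrates that under the assumption of exact matching and is not taken from score_baseline.py. The function name `triple_f1` and the example triples are hypothetical.

```python
def triple_f1(predicted, gold):
    """Set-based F1 over (subject, predicate, object) tuples.

    Assumption: a predicted triple counts as correct only if it matches a
    gold triple exactly; duplicates are collapsed by converting to sets.
    """
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # nothing to extract and nothing extracted
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative (made-up) triples: two of three gold triples are recovered.
gold = [
    ("FundA", "hasAdviser", "AdviserX"),
    ("FundA", "hasCustodian", "BankY"),
    ("FundA", "partOfTrust", "TrustZ"),
]
predicted = gold[:2]
score = triple_f1(predicted, gold)
```

Here precision is 1 and recall is 2/3, so the score is 0.8. A real scorer might normalize entity strings or aggregate counts across samples before computing F1; those choices are left open here.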