fund_rfid_data/requirements.txt at 00f51859e06492cf20a338d87252434561f240c9 - fund_rfid_data - FHGR Git

herzogfloria/fund_rfid_data

Florian Herzog 1993658fb2 Add SEC fund prospectus -> RDF triple dataset pipeline

Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures, the dataset for the thesis 'Magical RDF Triples and
how to synthetize them'.

- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
  all books per trust), samples (per-fund segmentation, marker + plain
  serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)
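
A minimal sketch of what a trust-level 80/10/10 split without leakage could look like; this is not the code in build_rdf_dataset.py. The function name `split_by_trust`, the `trust_id` field, and the split names `train`/`dev`/`test` are assumptions made for illustration. The idea is to shuffle the trusts rather than the individual samples, so that every sample from one trust lands in the same split.

```python
import random


def split_by_trust(samples, seed=0, train=0.8, dev=0.1):
    """Split samples so that each trust appears in exactly one split.

    `samples` is a list of dicts with a hypothetical "trust_id" key.
    The ratios apply to the number of trusts, not the number of samples.
    """
    # Sort first so the shuffle is reproducible for a given seed.
    trusts = sorted({s["trust_id"] for s in samples})
    random.Random(seed).shuffle(trusts)

    n_train = int(len(trusts) * train)
    n_dev = int(len(trusts) * dev)

    # Assign whole trusts to splits; the remainder goes to test.
    assignment = {}
    for i, trust in enumerate(trusts):
        if i < n_train:
            assignment[trust] = "train"
        elif i < n_train + n_dev:
            assignment[trust] = "dev"
        else:
            assignment[trust] = "test"

    splits = {"train": [], "dev": [], "test": []}
    for sample in samples:
        splits[assignment[sample["trust_id"]]].append(sample)
    return splits
```

Because the ratios are applied at the trust level, the sample-level proportions only approximate 80/10/10 when trusts differ in size.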

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-03 10:31:35 +02:00

7 lines

142 B

Plaintext


requests>=2.31.0
beautifulsoup4>=4.12.0
lxml>=5.1.0
pandas>=2.2.0
tqdm>=4.66.0
# sqlite3 is part of the Python standard library; no extra package needed