Builds a relationship-rich finance dataset for text-to-RDF-triple extraction
from SEC fund disclosures. It accompanies the thesis 'Magical RDF Triples and
how to synthetize them'.
- build_rdf_dataset.py: gold (N-CEN graphs), fetch (EDGAR prospectus prose,
all books per trust), samples (per-fund segmentation, marker + plain
serializations), split (trust-level 80/10/10, no leakage)
- score_baseline.py: no-model string-match baseline + strong-model scorer
- dataset_description.{tex,pdf}: scientific description of the dataset
- data/rdf_poc/gold_graphs.jsonl: structured gold knowledge graph (2025Q3)
- Large prose/sample files and raw SEC downloads are gitignored (reproducible)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
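
The trust-level 80/10/10 split named in the commit message can be sketched as follows. This is a minimal illustration and not the code in build_rdf_dataset.py: the function name `split_by_trust`, the `trust_id` field, and the `seed` parameter are all hypothetical. The idea is to shuffle the unique trusts once, cut that list 80/10/10, and then route every sample to the split of its trust, so no trust contributes text to more than one split.

```python
import random


def split_by_trust(samples, seed=0, train=0.8, val=0.1, key="trust_id"):
    """Split samples 80/10/10 by trust so no trust spans two splits.

    `samples` is a list of dicts; `key` names the (hypothetical) field
    holding the trust identifier.
    """
    # Sort first so the shuffle is reproducible regardless of input order.
    trusts = sorted({s[key] for s in samples})
    random.Random(seed).shuffle(trusts)

    n_train = int(len(trusts) * train)
    n_val = int(len(trusts) * val)

    # Assign each trust to exactly one split; the remainder goes to test.
    bucket = {}
    for i, trust in enumerate(trusts):
        if i < n_train:
            bucket[trust] = "train"
        elif i < n_train + n_val:
            bucket[trust] = "val"
        else:
            bucket[trust] = "test"

    # Every sample follows its trust, which is what prevents leakage.
    splits = {"train": [], "val": [], "test": []}
    for s in samples:
        splits[bucket[s[key]]].append(s)
    return splits
```

Because the proportions apply to trusts rather than samples, the per-split sample counts will only approximate 80/10/10 when trusts differ in size.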
39 lines
643 B
Plaintext
# ---- Large / derived data (reproducible via build_rdf_dataset.py) ----
# Raw prospectus prose fetched from EDGAR (GBs)
data/rdf_poc/prose/
# Generated training samples and splits (embed raw SEC text, 100s of MB)
data/rdf_poc/samples.jsonl
data/rdf_poc/train.jsonl
data/rdf_poc/val.jsonl
data/rdf_poc/test.jsonl
# Raw SEC bulk downloads (re-downloadable from sec.gov)
data/ncen/
data/nport/
data/xbrl_rr/

# SQLite working DB
fund_data.db
fund_data.db-shm
fund_data.db-wal

# Archives
*.zip

# Python
__pycache__/
*.pyc
*.pyo

# LaTeX build artifacts
*.aux
*.log
*.out
*.toc
*.fls
*.fdb_latexmk
*.synctex.gz

# OS / editor
.DS_Store
.claude/