schuetoliver/CDS202-Atlas

DotNaos a50680766a Add Dataset explaination

2025-06-12 16:43:42 +02:00

Atlas Librarian

A comprehensive content processing and management system for extracting, chunking, and vectorizing information from various sources.

Dataset

For privacy and storage reasons, the dataset is not included in this repo.

If you would still like to take a look at it, you can quickly download a zip file of a course from Moodle manually:

For our course, simply use: https://moodle.fhgr.ch/course/downloadcontent.php?contextid=1159522

The zip is the output of the scraper, i.e. the extractor continues to work with this structure.
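
To get a feel for the structure the extractor receives, a downloaded course zip can be inspected with Python's standard library. This is only a minimal sketch: the helper name `summarize_zip` and the idea of grouping by top-level folder are assumptions for illustration, not part of the Librarian code.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_zip(zip_path):
    """Count the files under each top-level folder of a zip archive.

    Hypothetical helper for a first look at a Moodle course export;
    the real extractor may traverse the archive differently.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            parts = PurePosixPath(info.filename).parts
            # Files at the archive root are grouped under "."
            top_level = parts[0] if len(parts) > 1 else "."
            counts[top_level] += 1
    return dict(counts)
```

Calling `summarize_zip("course.zip")` (the file name is a placeholder) returns a dictionary that maps each top-level folder to its number of files.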

💬 Questions & Answers

📦 How large is the data?

~500 MB of raw data (approx. 250 MB per semester, depending on the number of uploaded images; currently 2 semesters)

🧑‍💻 How many lines of code does your tool have so far?

~20,000 lines of code
- Python: ~6,000
- SQL (database): ~3,500
- Website: ~10,000

⏱️ Does it always take 40 minutes?

Approx. 40 minutes on a MacBook Pro with 16 GB RAM
~20 minutes on a Windows workstation with 64 GB RAM
<1 minute for just one or two courses
With a GPU: even faster runs are possible, but the batch size may need to be reduced because the VRAM might not be sufficient.
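
One common way to handle a batch size that turns out to be too large is to halve it and retry. The sketch below illustrates that idea only; `embed_in_batches` and `embed_batch` are hypothetical names, and `MemoryError` stands in for whatever out-of-memory error the actual embedding backend raises.

```python
def embed_in_batches(texts, embed_batch, batch_size=64, min_batch_size=1):
    """Embed texts batch by batch, halving the batch size on MemoryError.

    `embed_batch` is any callable that takes a list of texts and returns
    one result per text (assumption: it raises MemoryError when the batch
    does not fit into memory).
    """
    results = []
    start = 0
    while start < len(texts):
        batch = texts[start:start + batch_size]
        try:
            results.extend(embed_batch(batch))
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink any further
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results
```

The reduced batch size is kept for the rest of the run, so a single failure does not trigger repeated retries on later batches.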

📊 Why are the charts green?

Green means that a task completed successfully.
The colors come from the progress indicator of Prefect (workflow orchestrator; the display is predefined):
- Blue = running
- Green = success
- Red = failed
Each bar represents one task run (e.g. processing a single file)

Overview

Atlas Librarian is a modular system designed to process, organize, and make searchable large amounts of content through web scraping, content extraction, chunking, and vector embeddings.

Project Structure

atlas/
├── lexikon/                 # Frontend app
├── librarian/
│   ├── atlas-librarian/     # Main application
│   ├── librarian-core/      # Core functionality and storage
│   └── plugins/
│       ├── librarian-chunker/    # Content chunking
│       ├── librarian-extractor/  # Content extraction with AI
│       ├── librarian-scraper/    # Web scraping and crawling
│       ├── librarian-summarizer/ # Daily AI summarization
│       └── librarian-vspace/     # Vector space operations

Components

Atlas Librarian: Main application with API, web app, and recipe management
Librarian Core: Shared utilities, storage, and Supabase integration
Chunker Plugin: Splits content into processable chunks
Extractor Plugin: Extracts and sanitizes content using AI
Scraper Plugin: Crawls and downloads web content
Summarizer Plugin: Daily AI summarization
VSpace Plugin: Vector embeddings, similarity search and clustering concatenation
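
As an illustration of what "splitting content into processable chunks" can look like, here is a minimal sketch of word-based chunking with overlap. The function name `chunk_text` and the parameters `max_words` and `overlap` are assumptions for this example; the actual Chunker Plugin may use a different strategy.

```python
def chunk_text(text, max_words=200, overlap=20):
    """Split text into chunks of at most `max_words` words.

    Consecutive chunks share `overlap` words so that context spanning a
    chunk boundary is not lost. Illustrative sketch only.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

With `max_words=4` and `overlap=1`, a ten-word text yields three chunks, each starting with the last word of the previous one.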

Getting Started

Clone the repository
Install dependencies for each component
Configure environment variables
Run the main application

Features

Web content scraping and crawling
AI-powered content extraction and sanitization
Intelligent content chunking
Vector embeddings for semantic search
Supabase integration for data storage
Modular plugin architecture
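
To make "vector embeddings for semantic search" concrete, the following sketch ranks stored vectors by cosine similarity to a query vector. It is plain Python for illustration; the names `cosine_similarity` and `top_k` are hypothetical, and how the VSpace Plugin and Supabase actually perform the search is not described here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Return the k (key, score) pairs most similar to `query`.

    `vectors` maps a chunk identifier to its embedding vector.
    """
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]
```

In practice the query vector would come from the same embedding model as the stored chunks, so that distances between them are meaningful.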

For detailed documentation, see the individual component directories.
