
# Atlas Librarian

A comprehensive content processing and management system for extracting, chunking, and vectorizing information from various sources.

## Dataset
For privacy and storage reasons, the dataset is not included in this repository.

If you would still like to take a look at it, you can quickly download a ZIP file of a course from Moodle manually.

For our course: https://moodle.fhgr.ch/course/downloadcontent.php?contextid=1159522

The ZIP file is the scraper's output, i.e. the extractor continues to work with this structure.
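
To get a quick feel for what such a course ZIP contains before handing it to the extractor, a small inspection script can help. The following is a minimal sketch using only the Python standard library; the function name `summarize_zip` and the file name `course.zip` are hypothetical and not part of Atlas Librarian.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_zip(path):
    """Count files per top-level folder and per file extension in a ZIP archive."""
    folders = Counter()
    extensions = Counter()
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            name = PurePosixPath(info.filename)
            # Files directly in the archive root have no parent folder.
            top = name.parts[0] if len(name.parts) > 1 else "(root)"
            folders[top] += 1
            extensions[name.suffix.lower() or "(none)"] += 1
    return folders, extensions


if __name__ == "__main__":
    # "course.zip" is a placeholder for the file downloaded from Moodle.
    folders, extensions = summarize_zip("course.zip")
    print(folders.most_common(10))
    print(extensions.most_common(10))
```

The script only reads the archive's table of contents, so it runs quickly even on larger downloads.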

---

## 💬 Questions & Answers

### 📦 How large is the data?
- **~500 MB of raw data** (about **250 MB per semester**, depending on the number of uploaded images; currently 2 semesters)

---

### 🧑‍💻 How many lines of code does your tool have so far?
- **~20,000 lines of code**
  - Python: **~6,000**
  - SQL (database): **~3,500**
  - Website: **~10,000**

---

### ⏱️ Does it always take 40 minutes?
- **About 40 minutes** on a MacBook Pro with 16 GB RAM
- **~20 minutes** on a Windows workstation with 64 GB RAM
- **<1 minute** for only one or two courses
- **With a GPU**: Even faster runs are possible, but the batch size may need to be reduced because the VRAM might not be sufficient.
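
The batch-size trade-off mentioned above can be sketched as follows. This is an illustration only, not the project's implementation: `embed_all` and the `embed_batch` callable are hypothetical names, and `MemoryError` stands in for whatever out-of-memory exception the GPU framework in use actually raises.

```python
def embed_all(texts, embed_batch, batch_size=64):
    """Embed texts in batches, halving the batch size when memory runs out.

    embed_batch is a hypothetical callable that takes a list of strings and
    returns one embedding per string.
    """
    results = []
    start = 0
    while start < len(texts):
        batch = texts[start:start + batch_size]
        try:
            results.extend(embed_batch(batch))
        except MemoryError:
            if batch_size == 1:
                raise  # even a single item does not fit, give up
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results
```

Once reduced, the batch size stays small for the rest of the run, which avoids hitting the same limit repeatedly.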

---

### 📊 Why are the charts green?
- **Green** means that a task completed successfully.
- The colors come from the **progress indicator** of [Prefect](https://www.prefect.io/) (workflow orchestrator; the presentation is predefined):
  - **Blue** = running
  - **Green** = success
  - **Red** = failed
- **Each bar** represents one task run (e.g. processing a single file)

---

## Overview

Atlas Librarian is a modular system designed to process, organize, and make searchable large amounts of content through web scraping, content extraction, chunking, and vector embeddings.

## Project Structure

```
atlas/
├── lexikon/                 # Frontend app
├── librarian/
│   ├── atlas-librarian/     # Main application
│   ├── librarian-core/      # Core functionality and storage
│   └── plugins/
│       ├── librarian-chunker/    # Content chunking
│       ├── librarian-extractor/  # Content extraction with AI
│       ├── librarian-scraper/    # Web scraping and crawling
│       ├── librarian-summarizer/ # Daily AI summarization
│       └── librarian-vspace/     # Vector space operations
```

## Components

- **Atlas Librarian**: Main application with API, web app, and recipe management
- **Librarian Core**: Shared utilities, storage, and Supabase integration
- **Chunker Plugin**: Splits content into processable chunks
- **Extractor Plugin**: Extracts and sanitizes content using AI
- **Scraper Plugin**: Crawls and downloads web content
- **Summarizer Plugin**: Daily AI summarization
- **VSpace Plugin**: Vector embeddings, similarity search, and clustering concatenation
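
The Chunker Plugin's actual strategy is not documented in this README. As a rough illustration of what splitting content into processable chunks can mean, here is a minimal sketch of a fixed-size character window with overlap; the function name `chunk_text` and its parameters are hypothetical and not the plugin's API.

```python
def chunk_text(text, max_chars=200, overlap=20):
    """Split text into chunks of at most max_chars characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a chunk boundary still appears in full context in one of them.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    chunks = []
    step = max_chars - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", max_chars=4, overlap=1):
        print(chunk)
```

A real chunker would more likely split on sentence, paragraph, or token boundaries; the character window is used here only because it needs no extra dependencies.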

## Getting Started

1. Clone the repository
2. Install dependencies for each component
3. Configure environment variables
4. Run the main application

## Features

- Web content scraping and crawling
- AI-powered content extraction and sanitization
- Intelligent content chunking
- Vector embeddings for semantic search
- Supabase integration for data storage
- Modular plugin architecture
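
To illustrate the idea behind vector embeddings for semantic search, the sketch below ranks stored vectors by cosine similarity to a query vector. It is a plain-Python toy under stated assumptions: the names `cosine_similarity` and `top_k` and the in-memory dictionary are hypothetical, and this README does not say how the VSpace Plugin or the Supabase integration actually perform the search.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Return the k (key, score) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
    store = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
    print(top_k([1.0, 0.0], store, k=2))
```

This linear scan is fine for a handful of vectors; larger collections are usually served by a vector index in the database rather than by application code.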

---

*For detailed documentation, see the individual component directories.*