CDS202-Atlas/README.md
2025-06-12 16:23:18 +02:00

# Atlas Librarian
A comprehensive content processing and management system for extracting, chunking, and vectorizing information from various sources.

---
## 💬 Questions & Answers
### 📦 How large is the data?
- **~500 MB of raw data** (approx. **250 MB per semester**, depending on the number of uploaded images; currently 2 semesters)
---
### 🧑‍💻 How many lines of code does your tool have so far?
- **~20,000 lines of code** in total:
  - Python: **~6,000**
  - SQL (database): **~3,500**
  - Website: **~10,000**
---
### ⏱️ Does it always take 40 minutes?
- **About 40 minutes** on a MacBook Pro with 16 GB RAM
- **~20 minutes** on a Windows workstation with 64 GB RAM
- **<1 minute** when processing only one or two courses
- **With a GPU**: potentially even faster, but the batch size may need to be reduced because the available VRAM may not be sufficient.
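
The batch size trade-off mentioned for GPU runs can be sketched as follows. This is a minimal illustration, not the project's actual code: `embed_in_batches`, `embed_batch`, and `fake_embed` are hypothetical names, and `MemoryError` stands in for whatever out-of-memory exception a real embedding backend raises. The helper halves the batch size whenever a batch does not fit and retries from the same position.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    items: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed all items, halving the batch size whenever a batch runs out of memory."""
    results: List[List[float]] = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(embed_batch(batch))
        except MemoryError:
            # Stand-in for a GPU out-of-memory error (assumption for this sketch).
            if batch_size == 1:
                raise  # even a single item does not fit, nothing left to reduce
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    # Hypothetical embedder that can only hold 4 items at a time.
    if len(batch) > 4:
        raise MemoryError("batch too large")
    return [[float(len(text))] for text in batch]


# Starts at 16, falls back to 8, then 4, and finishes with that size.
vectors = embed_in_batches(["a", "bb", "ccc"] * 3, fake_embed, batch_size=16)
```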
---
### 📊 Why are the charts green?
- **Green** means that a task completed successfully.
- The colors come from the **progress indicator** of [Prefect](https://www.prefect.io/) (the workflow orchestrator; the color scheme is predefined):
  - **Blue** = running
  - **Green** = success
  - **Red** = failed
- **Each bar** represents one task run (e.g. processing a single file)
---
## Overview
Atlas Librarian is a modular system that processes, organizes, and indexes large amounts of content for search, using web scraping, content extraction, chunking, and vector embeddings.
## Project Structure
```
atlas/
├── librarian/
│   ├── atlas-librarian/          # Main application
│   ├── librarian-core/           # Core functionality and storage
│   └── plugins/
│       ├── librarian-chunker/    # Content chunking
│       ├── librarian-extractor/  # Content extraction with AI
│       ├── librarian-scraper/    # Web scraping and crawling
│       ├── librarian-summarizer/ # Daily AI summarization
│       └── librarian-vspace/     # Vector space operations
```
## Components
- **Atlas Librarian**: Main application with API, web app, and recipe management
- **Librarian Core**: Shared utilities, storage, and Supabase integration
- **Chunker Plugin**: Splits content into processable chunks
- **Extractor Plugin**: Extracts and sanitizes content using AI
- **Scraper Plugin**: Crawls and downloads web content
- **Summarizer Plugin**: Produces daily AI-generated summaries
- **VSpace Plugin**: Vector embeddings and similarity search
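
To make the chunking and similarity-search steps more concrete, here is a minimal sketch in plain Python. It assumes fixed-size character windows with overlap and cosine similarity over embedding vectors; the actual plugins may use different strategies, and `chunk_text` and `cosine_similarity` are illustrative names, not part of the project's API.

```python
import math
from typing import List, Sequence


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into character windows of `size` that overlap by `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once a window has reached the end, so no chunk is fully contained in the previous one.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


chunks = chunk_text("abcdefghij", size=4, overlap=2)
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

The overlap keeps some shared context between neighboring chunks, so a sentence that straddles a boundary still appears intact in at least one of them.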
## Getting Started
1. Clone the repository
2. Install dependencies for each component
3. Configure environment variables
4. Run the main application
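
Before step 4, it can help to verify that the environment is configured. The sketch below is only an illustration: the variable names `SUPABASE_URL` and `SUPABASE_KEY` are assumptions based on the Supabase integration and are not taken from the project, so check the component directories for the names actually used.

```python
import os
from typing import Iterable, List, Mapping

# Hypothetical variable names; the real ones are defined by each component.
REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_KEY")


def missing_env_vars(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real process environment.
missing = missing_env_vars(REQUIRED_VARS, env={"SUPABASE_URL": "https://example.supabase.co"})
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Passing the environment as a parameter keeps the check easy to test without touching the real process environment.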
## Features
- Web content scraping and crawling
- AI-powered content extraction and sanitization
- Intelligent content chunking
- Vector embeddings for semantic search
- Supabase integration for data storage
- Modular plugin architecture
---
*For detailed documentation, see the individual component directories.*