Atlas Librarian

A comprehensive content processing and management system for extracting, chunking, and vectorizing information from various sources.

💬 Questions & Answers

📦 How large is the data?

~500 MB of raw data (about 250 MB per semester, depending on the number of uploaded images; currently 2 semesters)

🧑‍💻 How many lines of code does your tool have so far?

~20,000 lines of code
- Python: ~6,000
- SQL (database): ~3,500
- Website: ~10,000

⏱️ Does it always take 40 minutes?

About 40 minutes on a MacBook Pro with 16 GB RAM
~20 minutes on a Windows workstation with 64 GB RAM
<1 minute for only one or two courses
With a GPU: even faster runs are possible, but the batch size may need to be reduced because the available VRAM may not be sufficient.
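
The batch-size trade-off can be illustrated with a small sketch. It is not taken from the Atlas Librarian code base: `embed_with_backoff` and `fake_embed` are hypothetical names, and the simulated memory limit of 4 items is an assumption made for the example. The idea is to halve the batch size whenever a batch does not fit into memory and to retry from the same position.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_with_backoff(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 64,
) -> List[Vector]:
    """Embed texts in batches, halving the batch size on MemoryError."""
    vectors: List[Vector] = []
    start = 0
    while start < len(texts):
        batch = texts[start:start + batch_size]
        try:
            vectors.extend(embed_batch(batch))
        except MemoryError:
            if batch_size == 1:
                raise  # even a single item does not fit
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return vectors


def fake_embed(batch: Sequence[str]) -> List[Vector]:
    # Hypothetical stand-in for a real model: pretends that only
    # 4 items fit into the simulated VRAM at once.
    if len(batch) > 4:
        raise MemoryError("batch too large for simulated VRAM")
    return [[float(len(text))] for text in batch]
```

A real embedding backend would typically raise its own out-of-memory exception type rather than `MemoryError`, so the `except` clause would need to be adapted accordingly.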

📊 Why are the charts green?

Green means that a task completed successfully.
The colors come from the progress indicator of Prefect (the workflow orchestrator; the display is predefined):
- Blue = running
- Green = success
- Red = failed
Each bar represents one task run (e.g. processing a single file).

Overview

Atlas Librarian is a modular system that processes and organizes large amounts of content and makes it searchable through web scraping, content extraction, chunking, and vector embeddings.
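
As a rough illustration of how these stages fit together, the sketch below chains four toy functions so that each stage consumes the output of the previous one. All function names and their bodies are hypothetical stand-ins, not the actual plugin interfaces: the real scraper downloads pages, the real extractor is AI-assisted, and real embeddings come from a model.

```python
import re
from functools import reduce
from typing import Any, Callable, List, Sequence

Stage = Callable[[Any], Any]


def scrape(url: str) -> str:
    # Hypothetical stand-in: a real scraper would download the page.
    return f"<html><body><h1>Course</h1><p>Notes from {url}</p></body></html>"


def extract(html: str) -> str:
    # Strip tags and collapse whitespace (toy version of content extraction).
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.split())


def chunk(text: str) -> List[str]:
    # Toy chunker: three words per chunk.
    words = text.split()
    return [" ".join(words[i:i + 3]) for i in range(0, len(words), 3)]


def embed(chunks: Sequence[str]) -> List[List[float]]:
    # Toy "embedding": character count and word count, not a real model.
    return [[float(len(c)), float(len(c.split()))] for c in chunks]


def run_pipeline(source: Any, stages: Sequence[Stage]) -> Any:
    """Feed the output of each stage into the next one."""
    return reduce(lambda data, stage: stage(data), stages, source)


vectors = run_pipeline("example.org", [scrape, extract, chunk, embed])
```

Keeping each stage a plain callable is one way to let stages be swapped or tested in isolation, which matches the modular intent described above.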

Project Structure

atlas/
├── librarian/
│   ├── atlas-librarian/     # Main application
│   ├── librarian-core/      # Core functionality and storage
│   └── plugins/
│       ├── librarian-chunker/    # Content chunking
│       ├── librarian-extractor/  # Content extraction with AI
│       ├── librarian-scraper/    # Web scraping and crawling
│       ├── librarian-summarizer/ # Daily AI summarization
│       └── librarian-vspace/     # Vector space operations

Components

Atlas Librarian: Main application with API, web app, and recipe management
Librarian Core: Shared utilities, storage, and Supabase integration
Chunker Plugin: Splits content into processable chunks
Extractor Plugin: Extracts and sanitizes content using AI
Scraper Plugin: Crawls and downloads web content
Summarizer Plugin: Daily AI summarization
VSpace Plugin: Vector embeddings, similarity search, and clustering concatenation
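
To make the chunking step more concrete, here is a minimal sketch of one common strategy: fixed-size word windows with overlap. This is an assumption for illustration only; the Chunker Plugin may split content differently, and `chunk_words` with its `size` and `overlap` parameters is a hypothetical helper.

```python
from typing import List


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary visible in both neighbouring chunks, at the cost of storing some words twice.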

Getting Started

Clone the repository
Install dependencies for each component
Configure environment variables
Run the main application
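
For the environment-variable step, a small loader that fails early on missing values can save debugging time. The sketch below is an assumption, not the project's actual configuration code: the variable names `SUPABASE_URL` and `SUPABASE_KEY` are hypothetical, so check the component directories for the names the application really expects.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    supabase_url: str
    supabase_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required settings and fail early if any of them is missing."""
    required = ("SUPABASE_URL", "SUPABASE_KEY")  # hypothetical variable names
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        supabase_url=env["SUPABASE_URL"],
        supabase_key=env["SUPABASE_KEY"],
    )
```

Passing the environment in as a mapping makes the loader easy to test without touching the real process environment.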

Features

Web content scraping and crawling
AI-powered content extraction and sanitization
Intelligent content chunking
Vector embeddings for semantic search
Supabase integration for data storage
Modular plugin architecture
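
Semantic search over vector embeddings usually comes down to ranking stored vectors by their similarity to a query vector. The following sketch uses cosine similarity over an in-memory dictionary; it is illustrative only, the names `cosine_similarity` and `top_k` are hypothetical, and the actual system may delegate this ranking to the database instead.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k entries of `index` most similar to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy two-dimensional "embeddings"; real ones have hundreds of dimensions.
index = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
best = top_k([1.0, 0.1], index, k=2)
```

A linear scan like this is fine for small collections; larger ones generally need an approximate nearest-neighbour index.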

For detailed documentation, see the individual component directories.
