
# LLM Inferenz Server — Qwen3.5-35B-A3B

Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
(MoE, 35B total / 3B active per token), served via **vLLM** inside an
**Apptainer** container on a GPU server. Two front-ends are provided:
**Open WebUI** (server-hosted ChatGPT-like UI) and a **Streamlit app**
(local chat + file editor with code execution).

## Architecture

```
Students
  │
  ├── Browser ──► Open WebUI (silicon.fhgr.ch:7081)
  │                  │  ChatGPT-like UI, user accounts, chat history
  │                  │
  ├── Streamlit ─────┤  Local app with file editor & code runner
  │                  │
  └── SDK / curl ────┘
                     ▼
          ┌─────────────────────────────┐
          │  silicon.fhgr.ch:7080       │
          │  OpenAI-compatible API      │
          ├─────────────────────────────┤
          │  vLLM Server (nightly)      │
          │  Apptainer container (.sif) │
          ├─────────────────────────────┤
          │  Qwen3.5-35B-A3B weights    │
          │  (bind-mounted from host)   │
          ├─────────────────────────────┤
          │  2× NVIDIA L40S (46 GB ea.) │
          │  Tensor Parallel = 2        │
          └─────────────────────────────┘
```

## Hardware

The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each).
The inference server uses **2 GPUs** with tensor parallelism, leaving 2 GPUs free.

| Component | Value |
|-----------|-------|
| GPUs used | 2× NVIDIA L40S |
| VRAM available | ~92 GB total (2 × 46 GB) |
| Model size (BF16) | ~67 GB |
| Active params/token | 3B (MoE) |
| Context length | 32,768 tokens |
| Port | 7080 |
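
As a rough sanity check on these numbers, the VRAM left over for the KV cache can be estimated from the table values and the default `GPU_MEM_UTIL` of `0.92`. This is a back-of-the-envelope sketch: the function name is made up for illustration, and the estimate ignores activation memory, CUDA graphs, and fragmentation, so the real headroom will be smaller.

```python
def kv_cache_headroom_gb(num_gpus: int = 2,
                         vram_per_gpu_gb: float = 46.0,
                         gpu_mem_util: float = 0.92,
                         model_size_gb: float = 67.0) -> float:
    """Rough VRAM (in GB) left after loading the BF16 weights.

    ASSUMPTION: everything that is not weights is available for KV cache.
    In practice activations and runtime overhead take a share as well.
    """
    budget_gb = num_gpus * vram_per_gpu_gb * gpu_mem_util
    return budget_gb - model_size_gb


print(f"{kv_cache_headroom_gb():.1f} GB")            # 2 GPUs: prints "17.6 GB"
print(f"{kv_cache_headroom_gb(num_gpus=4):.1f} GB")  # 4 GPUs: prints "102.3 GB"
```

The gap between the two results is why `TENSOR_PARALLEL=4` is worth considering when longer contexts or more concurrent users are needed.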

## Prerequisites

- **Apptainer** (formerly Singularity) installed on the server
- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
- **~80 GB disk** for model weights + ~8 GB for the container image
- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)

> **Note**: No `pip` or `python` is needed on the host — everything runs inside
> the Apptainer container.

---

## Step-by-Step Setup

### Step 0: SSH into the Server

```bash
ssh herzogfloria@silicon.fhgr.ch
```

### Step 1: Clone the Repository

```bash
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```

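> **Note**: `git` is not installed on the host. If `vllm_qwen.sif` already
> exists (it is built in Step 3), run `git` through the container:
> `apptainer exec vllm_qwen.sif git clone ...`
> On a fresh server, copy the files via `scp` from your local machine instead.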

### Step 2: Check GPU and Environment

```bash
nvidia-smi
apptainer --version
df -h ~
```

### Step 3: Build the Apptainer Container

```bash
bash 01_build_container.sh
```

The script pulls the `vllm/vllm-openai:nightly` Docker image (required for
Qwen3.5 support), installs the latest `transformers` from source, and packages
everything into `vllm_qwen.sif` (~8 GB). The build takes 15-20 minutes.

### Step 4: Download the Model (~67 GB)

```bash
bash 02_download_model.sh
```

The script downloads the Qwen3.5-35B-A3B weights using `huggingface-cli`
**inside the container** and stores them at `~/models/Qwen3.5-35B-A3B`.
The download takes 5-30 minutes depending on bandwidth.

### Step 5: Start the Server

**Interactive (foreground) — recommended with tmux:**
```bash
tmux new -s llm
bash 03_start_server.sh
# Ctrl+B, then D to detach
```

**Background with logging:**
```bash
bash 04_start_server_background.sh
tail -f logs/vllm_server_*.log
```

The model takes 2-5 minutes to load into GPU memory. It's ready when you see:
```
INFO:     Uvicorn running on http://0.0.0.0:7080
```
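
Instead of watching the log, readiness can also be polled over HTTP. The following is a minimal sketch using only the standard library; the function names and the timeout values are illustrative choices, not part of the repository. It treats the server as ready once `GET /v1/models` answers with HTTP 200, so it will report "not ready" if an `API_KEY` is set and no key is sent.

```python
import time
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str = "http://localhost:7080") -> bool:
    """Return True if GET /v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError (e.g. 401), refused connections and timeouts
        # are all OSError subclasses: treat them as "not ready yet".
        return False


def wait_until_ready(probe: Callable[[], bool] = models_endpoint_ok,
                     timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> bool:
    """Call `probe` until it succeeds or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` on the server blocks for up to ten minutes, which comfortably covers the 2-5 minute load time mentioned above.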

### Step 6: Test the Server

From another terminal on the server:
```bash
curl http://localhost:7080/v1/models
```

Quick chat test:
```bash
curl http://localhost:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
```
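
The same chat test can be written in Python without extra packages. This sketch mirrors the `curl` call above using `urllib`; the helper names are hypothetical, and the `Authorization: Bearer` header is only needed if the server was started with an `API_KEY`.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:7080/v1"
MODEL = "qwen3.5-35b-a3b"


def build_chat_request(prompt: str,
                       max_tokens: int = 128,
                       api_key: Optional[str] = None) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions without sending it."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's answer.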

### Step 7: Set Up Open WebUI (ChatGPT-like Interface)

Open WebUI provides a full-featured chat interface that runs on the server.
Students access it via a browser — no local setup required.

**Pull the container:**
```bash
bash 06_setup_openwebui.sh
```

**Start (foreground with tmux):**
```bash
tmux new -s webui
bash 07_start_openwebui.sh
# Ctrl+B, then D to detach
```

**Start (background with logging):**
```bash
bash 08_start_openwebui_background.sh
tail -f logs/openwebui_*.log
```

Open WebUI is ready when you see `Uvicorn running` in the logs.
Access it at `http://silicon.fhgr.ch:7081`.

> **Important**: The first user to sign up becomes the **admin**. Create your
> own account before sharing the URL with students.

### Step 8: Share with Students

Distribute `STUDENT_GUIDE.md` with connection details:
- **Open WebUI**: `http://silicon.fhgr.ch:7081` (recommended for most students)
- **API Base URL**: `http://silicon.fhgr.ch:7080/v1` (for SDK / programmatic use)
- **Model name**: `qwen3.5-35b-a3b`

---

## Open WebUI

A server-hosted ChatGPT-like interface backed by the vLLM inference server.
Runs as an Apptainer container on port **7081**.

### Features

- User accounts with persistent chat history (stored in `openwebui-data/`)
- Auto-discovers models from the vLLM backend
- Streaming responses, markdown rendering, code highlighting
- Admin panel for managing users, models, and settings
- No local setup needed — students just open a browser

### Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `7081` | HTTP port for the UI |
| `VLLM_BASE_URL` | `http://localhost:7080/v1` | vLLM API endpoint |
| `VLLM_API_KEY` | `EMPTY` | API key (if vLLM requires one) |
| `DATA_DIR` | `./openwebui-data` | Persistent storage (DB, uploads) |

### Management

```bash
# Start in background
bash 08_start_openwebui_background.sh

# View logs
tail -f logs/openwebui_*.log

# Stop
bash 09_stop_openwebui.sh

# Reconnect to tmux session
tmux attach -t webui
```

### Data Persistence

All user data (accounts, chats, settings) is stored in `openwebui-data/`.
This directory is bind-mounted into the container, so data survives
container restarts. Back it up regularly.
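
One simple way to take such a backup is a timestamped `tar.gz` of the whole directory. The sketch below is an assumption about how this could be done, not a script shipped with the repository; the `backups` target directory and the function name are made up. Since the directory contains a live database, it is safer to stop Open WebUI (`09_stop_openwebui.sh`) before archiving.

```python
import tarfile
import time
from pathlib import Path


def backup_data_dir(data_dir: str = "openwebui-data",
                    dest_dir: str = "backups") -> Path:
    """Archive `data_dir` into `dest_dir/<name>-<timestamp>.tar.gz`."""
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"{src} is not a directory")

    dest = Path(dest_dir)  # hypothetical location, adjust as needed
    dest.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{src.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the data dir.
        tar.add(src, arcname=src.name)
    return archive
```

Restoring is then a matter of extracting the archive next to the scripts and restarting Open WebUI.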

---

## Streamlit App

A web-based chat and file editor that connects to the inference server.
Students run it on their own machines.

### Setup

```bash
pip install -r requirements.txt
```

Or with a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Run

```bash
streamlit run app.py
```

Opens at `http://localhost:8501` with two tabs:

- **Chat** — Conversational interface with streaming responses. Save the
  model's last response directly into a workspace file (code auto-extracted).
- **File Editor** — Create/edit `.py`, `.tex`, `.html`, or any text file.
  Use "Generate with LLM" to modify files via natural language instructions.
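
Streaming responses from an OpenAI-compatible endpoint arrive as server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The sketch below shows how such a stream can be reduced to text. It is not taken from `app.py` (whose implementation is not shown here); the function name and the sample lines are illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments contained in a chat-completions SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and SSE comments carry no text
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Illustrative sample, shaped like a real stream but not captured from one.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo!"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints "Hello!"
```

In a real client the `lines` argument would be the decoded lines of the HTTP response for a request sent with `"stream": true`.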

### Sidebar Controls

| Parameter | Default | Range | Purpose |
|-----------|---------|-------|---------|
| Thinking Mode | Off | Toggle | Chain-of-thought reasoning (slower, better for complex tasks) |
| Temperature | 0.7 | 0.0 – 2.0 | Creativity vs determinism |
| Max Tokens | 4096 | 256 – 16384 | Maximum response length |
| Top P | 0.95 | 0.0 – 1.0 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | 0.0 – 2.0 | Penalize repeated topics |
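
These controls correspond to fields of a chat-completions request. The sketch below shows one plausible mapping, clamping each value to the range from the table. It is an assumption about how the app could build its request, not a copy of `app.py`. In particular, passing the thinking toggle as `chat_template_kwargs` with `enable_thinking` is the mechanism vLLM offers for Qwen3-family chat templates; whether this model's template honors it should be verified.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict `value` to the closed interval [low, high]."""
    return max(low, min(high, value))


def sampling_params(thinking: bool = False,
                    temperature: float = 0.7,
                    max_tokens: int = 4096,
                    top_p: float = 0.95,
                    presence_penalty: float = 0.0) -> dict:
    """Translate the sidebar values into request-body fields."""
    return {
        "temperature": clamp(temperature, 0.0, 2.0),
        "max_tokens": int(clamp(max_tokens, 256, 16384)),
        "top_p": clamp(top_p, 0.0, 1.0),
        "presence_penalty": clamp(presence_penalty, 0.0, 2.0),
        # ASSUMPTION: the thinking switch is forwarded to the chat template.
        "chat_template_kwargs": {"enable_thinking": bool(thinking)},
    }


print(sampling_params(temperature=3.5)["temperature"])  # prints 2.0
```

The returned dictionary can be merged into the request body next to `model` and `messages`.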

---

## Server Configuration

All configuration is via environment variables passed to `03_start_server.sh`:

| Variable          | Default                          | Description                    |
|-------------------|----------------------------------|--------------------------------|
| `MODEL_DIR`       | `~/models/Qwen3.5-35B-A3B`      | Path to model weights          |
| `PORT`            | `7080`                           | HTTP port                      |
| `MAX_MODEL_LEN`   | `32768`                          | Max context length (tokens)    |
| `GPU_MEM_UTIL`    | `0.92`                           | Fraction of GPU memory to use  |
| `API_KEY`         | *(empty = no auth)*              | API key for authentication     |
| `TENSOR_PARALLEL` | `2`                              | Number of GPUs                 |

### Examples

```bash
# Increase context length
MAX_MODEL_LEN=65536 bash 03_start_server.sh

# Add API key authentication
API_KEY="your-secret-key" bash 03_start_server.sh

# Use all 4 GPUs (more KV cache headroom)
TENSOR_PARALLEL=4 bash 03_start_server.sh
```
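
The contents of `03_start_server.sh` are not reproduced in this README, so the following is only a sketch of how these variables could translate into `vllm serve` flags. The function name, the hard-coded served model name, and the assumption that `MODEL_DIR` is visible at the same path inside the container are illustrative.

```python
import os
from typing import Mapping


def vllm_command(env: Mapping[str, str] = os.environ) -> list[str]:
    """Build a `vllm serve` argument list from the documented variables."""
    model_dir = os.path.expanduser(
        env.get("MODEL_DIR", "~/models/Qwen3.5-35B-A3B"))
    cmd = [
        "vllm", "serve", model_dir,
        "--host", "0.0.0.0",  # listen on all interfaces, not just localhost
        "--port", env.get("PORT", "7080"),
        "--max-model-len", env.get("MAX_MODEL_LEN", "32768"),
        "--gpu-memory-utilization", env.get("GPU_MEM_UTIL", "0.92"),
        "--tensor-parallel-size", env.get("TENSOR_PARALLEL", "2"),
        # ASSUMPTION: the name students use in requests is set here.
        "--served-model-name", "qwen3.5-35b-a3b",
    ]
    api_key = env.get("API_KEY", "")
    if api_key:  # empty means no authentication
        cmd += ["--api-key", api_key]
    return cmd


print(" ".join(vllm_command({"MODEL_DIR": "/models/qwen", "MAX_MODEL_LEN": "65536"})))
```

Variables that are not set fall back to the defaults from the table, which matches how the `VAR=value bash 03_start_server.sh` examples override one setting at a time.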

---

## Server Management

```bash
# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

# Reconnect to tmux session
tmux attach -t llm
```

---

## Files Overview

| File                               | Purpose                                              |
|------------------------------------|------------------------------------------------------|
| `vllm_qwen.def`                   | Apptainer container definition (vLLM nightly + deps) |
| `01_build_container.sh`            | Builds the Apptainer `.sif` image                    |
| `02_download_model.sh`             | Downloads model weights (runs inside container)      |
| `03_start_server.sh`               | Starts vLLM server (foreground)                      |
| `04_start_server_background.sh`    | Starts vLLM server in background with logging        |
| `05_stop_server.sh`                | Stops the background vLLM server                     |
| `06_setup_openwebui.sh`            | Pulls the Open WebUI container image                 |
| `07_start_openwebui.sh`            | Starts Open WebUI (foreground)                       |
| `08_start_openwebui_background.sh` | Starts Open WebUI in background with logging         |
| `09_stop_openwebui.sh`             | Stops the background Open WebUI                      |
| `app.py`                           | Streamlit chat & file editor web app                 |
| `requirements.txt`                 | Python dependencies for the Streamlit app            |
| `test_server.py`                   | Tests the running server via CLI                     |
| `STUDENT_GUIDE.md`                 | Instructions for students                            |

---

## Troubleshooting

### "CUDA out of memory"
- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)

### Container build fails
- Ensure internet access and sufficient disk space (~20 GB for build cache)
- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:nightly`

### "No NVIDIA GPU detected"
- Verify `nvidia-smi` works on the host
- Ensure `--nv` flag is present (already in scripts)
- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`

### "Model type qwen3_5_moe not recognized"
- The container needs `vllm/vllm-openai:nightly` (not `:latest`)
- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`

### Students can't connect
- Check firewall: ports 7080-7090 must be open (the API uses 7080, Open WebUI uses 7081)
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must be on the university network or VPN

### Slow generation with many users
- Expected — vLLM batches requests but throughput is finite
- The MoE architecture (3B active) helps with per-token speed
- Disable thinking mode for faster simple responses
- Monitor: `curl http://localhost:7080/metrics`
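
The `/metrics` endpoint returns Prometheus text format. A small parser can pull out the number of running and queued requests, which is a quick signal that the server is saturated. The sketch below assumes the gauge names `vllm:num_requests_running` and `vllm:num_requests_waiting`, which may differ between vLLM versions; the sample text is illustrative, not captured from this server.

```python
def parse_metric(text: str, name: str) -> float:
    """Sum all samples of metric `name` in Prometheus text format."""
    total = 0.0
    found = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if line.startswith(name + "{"):
            rest = line[line.rindex("}") + 1:]  # value follows the labels
        elif line.startswith(name + " "):
            rest = line[len(name):]             # metric without labels
        else:
            continue
        total += float(rest.split()[0])
        found = True
    if not found:
        raise KeyError(name)
    return total


# Illustrative sample only.
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="qwen3.5-35b-a3b"} 12.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="qwen3.5-35b-a3b"} 3.0
"""
print(parse_metric(SAMPLE, "vllm:num_requests_waiting"))  # prints 3.0
```

A waiting count that stays above zero means requests are queuing behind the current batch, which students experience as a delay before the first token.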

### Open WebUI won't start
- Ensure the vLLM server is running first on port 7080
- Check that port 7081 is not already in use: `ss -tlnp | grep 7081`
- Check logs: `tail -50 logs/openwebui_*.log`
- If the database is corrupted, reset it with `rm openwebui-data/webui.db` and restart. This deletes all accounts, chats, and settings, so restore from a backup if one exists

### Open WebUI shows no models
- Verify vLLM is reachable: `curl http://localhost:7080/v1/models`
- Open WebUI stores the OpenAI API base URL on first launch; if the vLLM
  endpoint changes later, update it in the Open WebUI Admin Panel > Settings > Connections

### Syncing files to the server
- No `git` or `pip` on the host — use `scp` from your local machine:
```bash
scp app.py 03_start_server.sh herzogfloria@silicon.fhgr.ch:~/LLM_local/
```