# LLM Inferenz Server — Qwen3.5
Self-hosted LLM inference for ~15 concurrent students, served via **vLLM**
inside an **Apptainer** container on a GPU server. Two models are available
(one at a time):
| Model | Params | Active | Weights | GPUs |
|-------|--------|--------|---------|------|
| **Qwen3.5-35B-A3B** | 35B MoE | 3B | ~67 GB BF16 | 2× L40S (TP=2) |
| **Qwen3.5-122B-A10B-FP8** | 122B MoE | 10B | ~125 GB FP8 | 4× L40S (TP=4) |
Two front-ends are provided: **Open WebUI** (server-hosted ChatGPT-like UI)
and a **Streamlit app** (local chat + file editor with code execution).
## Architecture
```
Students
  ├── Browser ───► Open WebUI (silicon.fhgr.ch:7081)
  │                  │  ChatGPT-like UI, user accounts, chat history
  │                  │
  ├── Streamlit ────┤  Local app with file editor & code runner
  │                  │
  └── SDK / curl ───┘
┌──────────────────────────────┐
│ silicon.fhgr.ch:7080 │
│ OpenAI-compatible API │
├──────────────────────────────┤
│ vLLM Server (nightly) │
│ Apptainer container (.sif) │
├──────────────────────────────┤
│ Model weights │
│ (bind-mounted from host) │
├──────────────────────────────┤
│ 4× NVIDIA L40S (46 GB ea.) │
│ 184 GB total VRAM │
└──────────────────────────────┘
```
## Hardware
The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each,
184 GB total). Only one model runs at a time on port 7080.
| | Qwen3.5-35B-A3B | Qwen3.5-122B-A10B-FP8 |
|---|---|---|
| GPUs used | 2× L40S (TP=2) | 4× L40S (TP=4) |
| VRAM used | ~92 GB | ~184 GB |
| Weight size | ~67 GB (BF16) | ~125 GB (FP8) |
| Active params/token | 3B (MoE) | 10B (MoE) |
| Context length | 32,768 tokens | 32,768 tokens |
| Port | 7080 | 7080 |
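The VRAM budget behind these numbers is simple arithmetic: vLLM pre-allocates `GPU_MEM_UTIL` (0.92 in the start scripts) of each GPU, the weights take a fixed share, and whatever remains holds the KV cache. A rough sketch using the figures from the table above (`kv_cache_headroom_gb` is just an illustrative helper, not part of the repo):

```python
# Back-of-envelope check: how much VRAM is left for the KV cache after
# loading the weights across the tensor-parallel group?
GPU_VRAM_GB = 46      # per L40S, as reported by nvidia-smi
GPU_MEM_UTIL = 0.92   # fraction vLLM is allowed to use (start-script default)

def kv_cache_headroom_gb(weights_gb: float, num_gpus: int) -> float:
    """VRAM remaining for the KV cache after weights, summed over all GPUs."""
    usable = num_gpus * GPU_VRAM_GB * GPU_MEM_UTIL
    return usable - weights_gb

print(f"35B  (TP=2): {kv_cache_headroom_gb(67, 2):.1f} GB for KV cache")
print(f"122B (TP=4): {kv_cache_headroom_gb(125, 4):.1f} GB for KV cache")
```

More headroom means more concurrent requests and longer contexts before vLLM starts preempting, which is why the FP8 122B deployment still has room at `MAX_MODEL_LEN=65536`.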
## Prerequisites
- **Apptainer** (formerly Singularity) installed on the server
- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
- **~200 GB disk** for model weights (both models) + ~8 GB for the container image
- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)
> **Note**: No `pip` or `python` is needed on the host — everything runs inside
> the Apptainer container.
---
## Step-by-Step Setup
### Step 0: SSH into the Server
```bash
ssh <name>@silicon.fhgr.ch
```
### Step 1: Clone the Repository
```bash
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```
> **Note**: `git` is not installed on the host. On a fresh setup, copy the
> files via `scp` from your local machine (the container is only built in
> Step 3). Once `vllm_qwen.sif` exists, you can also run git through it:
> `apptainer exec vllm_qwen.sif git clone ...`
### Step 2: Check GPU and Environment
```bash
nvidia-smi
apptainer --version
df -h ~
```
### Step 3: Build the Apptainer Container
```bash
bash 01_build_container.sh
```
Pulls the `vllm/vllm-openai:nightly` Docker image (required for Qwen3.5
support), installs latest `transformers` from source, and packages everything
into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
### Step 4: Download Model Weights
**35B model (~67 GB):**
```bash
bash 02_download_model.sh
```
**122B model (~125 GB):**
```bash
bash 10_download_model_122b.sh
```
Both use `huggingface-cli` **inside the container**. Stored at
`~/models/Qwen3.5-35B-A3B` and `~/models/Qwen3.5-122B-A10B-FP8` respectively.
### Step 5: Start the Server
Only one model can run at a time on port 7080. Choose one:
**35B model (2 GPUs, faster per-token, smaller):**
```bash
bash 03_start_server.sh # foreground
bash 04_start_server_background.sh # background
```
**122B model (4 GPUs, more capable, FP8):**
```bash
bash 11_start_server_122b.sh # foreground
bash 12_start_server_122b_background.sh # background
```
**To switch models:**
```bash
bash 05_stop_server.sh # stop whichever is running
bash 11_start_server_122b.sh # start the other one
```
The model takes 2-5 minutes (35B) or 5-10 minutes (122B) to load. It's ready
when you see:
```
INFO: Uvicorn running on http://0.0.0.0:7080
```
### Step 6: Test the Server
From another terminal on the server:
```bash
curl http://localhost:7080/v1/models
```
Quick chat test:
```bash
curl http://localhost:7080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
```
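The same smoke test can be scripted with only the Python standard library, which is handy because the served model id changes depending on which model is running. The helper names (`served_model_id`, `smoke_test`) are illustrative; the response shapes follow the OpenAI-compatible API that vLLM exposes:

```python
# Stdlib-only smoke test: discover the served model, then send one chat turn.
import json
import urllib.request

BASE = "http://localhost:7080/v1"

def served_model_id(models_json: dict) -> str:
    """Extract the first model id from a /v1/models response."""
    return models_json["data"][0]["id"]

def smoke_test() -> str:
    """Ask the running server 'Hello!' and return the reply text."""
    with urllib.request.urlopen(f"{BASE}/models", timeout=10) as r:
        model = served_model_id(json.load(r))
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128,
    }
    req = urllib.request.Request(
        f"{BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as r:
        reply = json.load(r)
    return reply["choices"][0]["message"]["content"]
```

Because the code never hard-codes the model name, the same script works whether the 35B or the 122B model is up.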
### Step 7: Set Up Open WebUI (ChatGPT-like Interface)
Open WebUI provides a full-featured chat interface that runs on the server.
Students access it via a browser — no local setup required.
**Pull the container:**
```bash
bash 06_setup_openwebui.sh
```
**Start (foreground with tmux):**
```bash
tmux new -s webui
bash 07_start_openwebui.sh
# Ctrl+B, then D to detach
```
**Start (background with logging):**
```bash
bash 08_start_openwebui_background.sh
tail -f logs/openwebui_*.log
```
Open WebUI is ready when you see `Uvicorn running` in the logs.
Access it at `http://silicon.fhgr.ch:7081`.
> **Important**: The first user to sign up becomes the **admin**. Sign up
> yourself first before sharing the URL with students.
### Step 8: Share with Students
Distribute `STUDENT_GUIDE.md` with connection details:
- **Open WebUI**: `http://silicon.fhgr.ch:7081` (recommended for most students)
- **API Base URL**: `http://silicon.fhgr.ch:7080/v1` (for SDK / programmatic use)
- **Model name**: `qwen3.5-35b-a3b` or `qwen3.5-122b-a10b-fp8` (depending on which is running)
---
## Open WebUI
A server-hosted ChatGPT-like interface backed by the vLLM inference server.
Runs as an Apptainer container on port **7081**.
### Features
- User accounts with persistent chat history (stored in `openwebui-data/`)
- Auto-discovers models from the vLLM backend
- Streaming responses, markdown rendering, code highlighting
- Admin panel for managing users, models, and settings
- No local setup needed — students just open a browser
### Configuration
| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `7081` | HTTP port for the UI |
| `VLLM_BASE_URL` | `http://localhost:7080/v1` | vLLM API endpoint |
| `VLLM_API_KEY` | `EMPTY` | API key (if vLLM requires one) |
| `DATA_DIR` | `./openwebui-data` | Persistent storage (DB, uploads) |
### Management
```bash
# Start in background
bash 08_start_openwebui_background.sh
# View logs
tail -f logs/openwebui_*.log
# Stop
bash 09_stop_openwebui.sh
# Reconnect to tmux session
tmux attach -t webui
```
### Data Persistence
All user data (accounts, chats, settings) is stored in `openwebui-data/`.
This directory is bind-mounted into the container, so data survives
container restarts. Back it up regularly.
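One way to automate that backup, as a sketch (the directory name is this guide's default; run it from `~/LLM_local`, e.g. via cron):

```python
# Create a timestamped .tar.gz snapshot of the Open WebUI data directory.
import tarfile
import time
from pathlib import Path

def backup(data_dir: str = "openwebui-data", dest: str = ".") -> Path:
    """Archive data_dir into dest and return the archive path."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    archive = Path(dest) / f"openwebui-data_{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps a clean top-level folder inside the archive
        tar.add(data_dir, arcname=Path(data_dir).name)
    return archive

if __name__ == "__main__":
    print("wrote", backup())
```

Restoring is the reverse: stop Open WebUI, extract the archive over `openwebui-data/`, and restart.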
---
## Streamlit App
A web-based chat and file editor that connects to the inference server.
Students run it on their own machines.
### Setup
```bash
pip install -r requirements.txt
```
Or with a virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
### Run
```bash
streamlit run app.py
```
Opens at `http://localhost:8501` with two tabs:
- **Chat** — Conversational interface with streaming responses. Save the
model's last response directly into a workspace file (code auto-extracted).
- **File Editor** — Create/edit `.py`, `.tex`, `.html`, or any text file.
Use "Generate with LLM" to modify files via natural language instructions.
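The "code auto-extracted" save works by pulling fenced code blocks out of the model's reply. A minimal sketch of that idea (not necessarily `app.py`'s exact logic):

```python
# Extract fenced code blocks from a model reply; fall back to the raw text.
import re

TICKS = "`" * 3  # literal triple backticks, built up to keep this fence intact
FENCE = re.compile(TICKS + r"[\w+-]*\n(.*?)" + TICKS, re.DOTALL)

def extract_code(reply: str) -> str:
    """Return all fenced code blocks joined, or the reply itself if none."""
    blocks = FENCE.findall(reply)
    return "\n\n".join(b.rstrip("\n") for b in blocks) if blocks else reply
```

Falling back to the raw reply matters: if the model answers without a fence, saving still produces a usable file instead of an empty one.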
### Sidebar Controls
| Parameter | Default | Range | Purpose |
|-----------|---------|-------|---------|
| Thinking Mode | Off | Toggle | Chain-of-thought reasoning (slower, better for complex tasks) |
| Temperature | 0.7 | 0.0–2.0 | Creativity vs. determinism |
| Max Tokens | 4096 | 256–16384 | Maximum response length |
| Top P | 0.95 | 0.0–1.0 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | 0.0–2.0 | Penalize repeated topics |
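These controls map one-to-one onto OpenAI-style sampling parameters in the request body. A sketch of how the defaults could be merged into a chat payload (`chat_payload` is an illustrative helper, not code from `app.py`; Thinking Mode is model/template-specific and omitted here):

```python
# Sidebar defaults expressed as OpenAI-style sampling parameters.
DEFAULTS = {
    "temperature": 0.7,
    "max_tokens": 4096,
    "top_p": 0.95,
    "presence_penalty": 0.0,
}

def chat_payload(model: str, prompt: str, **overrides) -> dict:
    """Build a /v1/chat/completions request body from defaults + overrides."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **DEFAULTS,
        **overrides,  # e.g. temperature=0.0 for deterministic output
    }
```

Any field the user changes in the sidebar simply overrides the corresponding default before the request is sent.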
---
## Server Configuration
Both start scripts accept the same environment variables:
| Variable | 35B default | 122B default | Description |
|----------|-------------|--------------|-------------|
| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | `~/models/Qwen3.5-122B-A10B-FP8` | Model weights path |
| `PORT` | `7080` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | `32768` | Max context length |
| `GPU_MEM_UTIL` | `0.92` | `0.92` | GPU memory fraction |
| `API_KEY` | *(none)* | *(none)* | API key for auth |
| `TENSOR_PARALLEL` | `2` | `4` | Number of GPUs |
### Examples
```bash
# Increase context length (35B)
MAX_MODEL_LEN=65536 bash 03_start_server.sh
# Increase context length (122B — has room with FP8)
MAX_MODEL_LEN=65536 bash 11_start_server_122b.sh
# Add API key authentication (works for either model)
API_KEY="your-secret-key" bash 11_start_server_122b.sh
```
---
## Server Management
```bash
# Start in background
bash 04_start_server_background.sh
# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool
# View logs
tail -f logs/vllm_server_*.log
# Stop
bash 05_stop_server.sh
# Monitor GPU usage
watch -n 2 nvidia-smi
# Reconnect to tmux session
tmux attach -t llm
```
---
## Files Overview
| File | Purpose |
|------------------------------------|------------------------------------------------------|
| `vllm_qwen.def` | Apptainer container definition (vLLM nightly + deps) |
| `01_build_container.sh` | Builds the Apptainer `.sif` image |
| `02_download_model.sh` | Downloads 35B model weights |
| `03_start_server.sh` | Starts 35B vLLM server (foreground, TP=2) |
| `04_start_server_background.sh` | Starts 35B server in background with logging |
| `05_stop_server.sh` | Stops whichever background vLLM server is running |
| `06_setup_openwebui.sh` | Pulls the Open WebUI container image |
| `07_start_openwebui.sh` | Starts Open WebUI (foreground) |
| `08_start_openwebui_background.sh` | Starts Open WebUI in background with logging |
| `09_stop_openwebui.sh` | Stops the background Open WebUI |
| `10_download_model_122b.sh` | Downloads 122B FP8 model weights |
| `11_start_server_122b.sh` | Starts 122B vLLM server (foreground, TP=4) |
| `12_start_server_122b_background.sh` | Starts 122B server in background with logging |
| `app.py` | Streamlit chat & file editor web app |
| `requirements.txt` | Python dependencies for the Streamlit app |
| `test_server.py` | Tests the running server via CLI |
| `STUDENT_GUIDE.md` | Instructions for students |
---
## Troubleshooting
### "CUDA out of memory"
- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)
### Container build fails
- Ensure internet access and sufficient disk space (~20 GB for build cache)
- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:nightly`
### "No NVIDIA GPU detected"
- Verify `nvidia-smi` works on the host
- Ensure `--nv` flag is present (already in scripts)
- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
### "Model type qwen3_5_moe not recognized"
- The container needs `vllm/vllm-openai:nightly` (not `:latest`)
- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
### Students can't connect
- Check firewall: ports 7080-7090 must be open
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must be on the university network or VPN
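A quick reachability check students can run locally before reporting a problem (hostname and ports are the ones used in this guide):

```python
# TCP reachability probe for the class server's UI and API ports.
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refused, or timed out
        return False

# Example (requires university network or VPN):
# print(port_open("silicon.fhgr.ch", 7080), port_open("silicon.fhgr.ch", 7081))
```

If both ports report closed from the university network, the issue is on the server side; if they are open there but closed from home, the VPN is missing.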
### Slow generation with many users
- Expected — vLLM batches requests but throughput is finite
- The MoE architecture (3B or 10B active params, depending on the model) helps with per-token speed
- Disable thinking mode for faster simple responses
- Monitor: `curl http://localhost:7080/metrics`
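The `/metrics` endpoint returns Prometheus exposition text, which is easy to read without extra tooling. A small parser sketch (metric names vary between vLLM versions, so `vllm:num_requests_running` below is an example, not a guarantee):

```python
# Parse 'name value' lines of Prometheus exposition text into a dict.
def parse_metrics(text: str) -> dict:
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass  # ignore lines that are not simple name/value pairs
    return out
```

Combined with `urllib.request.urlopen("http://localhost:7080/metrics")`, this lets you watch queue depth and running-request counts during class without setting up Prometheus.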
### Open WebUI won't start
- Ensure the vLLM server is running first on port 7080
- Check that port 7081 is not already in use: `ss -tlnp | grep 7081`
- Check logs: `tail -50 logs/openwebui_*.log`
- If the database is corrupted, reset: `rm openwebui-data/webui.db` and restart
### Open WebUI shows no models
- Verify vLLM is reachable: `curl http://localhost:7080/v1/models`
- The OpenAI API base URL is set on first launch; if changed later, update
it in the Open WebUI Admin Panel > Settings > Connections
### Syncing files to the server
- No `git` or `pip` on the host — use `scp` from your local machine:
```bash
scp app.py 03_start_server.sh <name>@silicon.fhgr.ch:~/LLM_local/
```