# LLM Inference Server: Qwen3.5

Self-hosted LLM inference for ~15 concurrent students, served via **vLLM**
inside an **Apptainer** container on a GPU server. Two models are available,
but only one can be served at a time:

| Model | Params | Active | Weights | GPUs |
|-------|--------|--------|---------|------|
| **Qwen3.5-35B-A3B** | 35B MoE | 3B | ~67 GB BF16 | 2× L40S (TP=2) |
| **Qwen3.5-122B-A10B-FP8** | 122B MoE | 10B | ~125 GB FP8 | 4× L40S (TP=4) |

Two front-ends are provided: **Open WebUI** (server-hosted ChatGPT-like UI)
and a **Streamlit app** (local chat + file editor with code execution).

## Architecture

```
Students
   │
   ├── Browser ──► Open WebUI (silicon.fhgr.ch:7081)
   │                  │  ChatGPT-like UI, user accounts, chat history
   │                  │
   ├── Streamlit ─────┤  Local app with file editor & code runner
   │                  │
   └── SDK / curl ────┘
                      ▼
       ┌──────────────────────────────┐
       │  silicon.fhgr.ch:7080        │
       │  OpenAI-compatible API       │
       ├──────────────────────────────┤
       │  vLLM Server (nightly)       │
       │  Apptainer container (.sif)  │
       ├──────────────────────────────┤
       │  Model weights               │
       │  (bind-mounted from host)    │
       ├──────────────────────────────┤
       │  4× NVIDIA L40S (46 GB ea.)  │
       │  184 GB total VRAM           │
       └──────────────────────────────┘
```

## Hardware

The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each,
184 GB total). Only one model runs at a time on port 7080.

| | Qwen3.5-35B-A3B | Qwen3.5-122B-A10B-FP8 |
|---|---|---|
| GPUs used | 2× L40S (TP=2) | 4× L40S (TP=4) |
| VRAM used | ~92 GB | ~184 GB |
| Weight size | ~67 GB (BF16) | ~125 GB (FP8) |
| Active params/token | 3B (MoE) | 10B (MoE) |
| Context length | 32,768 tokens | 32,768 tokens |
| Port | 7080 | 7080 |

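The figures in the table imply roughly how much VRAM remains for the KV cache once the weights are loaded. The sketch below works through that arithmetic. It assumes the default `GPU_MEM_UTIL` of `0.92` listed under Server Configuration and ignores activation buffers and other runtime overhead, so the results are rough upper bounds. The function name `kv_cache_headroom_gb` is purely illustrative.

```python
def kv_cache_headroom_gb(num_gpus: int, vram_per_gpu_gb: float,
                         weights_gb: float, gpu_mem_util: float = 0.92) -> float:
    """Approximate VRAM (GB) left for the KV cache after loading the weights.

    Assumption: activations, CUDA graphs and framework overhead are ignored,
    so the real headroom is somewhat smaller.
    """
    usable = num_gpus * vram_per_gpu_gb * gpu_mem_util
    return usable - weights_gb


# Values taken from the hardware table above.
print(f"35B  (2 GPUs): {kv_cache_headroom_gb(2, 46, 67):.1f} GB")   # 17.6 GB
print(f"122B (4 GPUs): {kv_cache_headroom_gb(4, 46, 125):.1f} GB")  # 44.3 GB
```

Under these assumptions the 122B FP8 model, although larger, leaves more room for cached context because it spreads across all four GPUs.
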
## Prerequisites

- **Apptainer** (formerly Singularity) installed on the server
- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
- **~200 GB disk** for model weights (both models) + ~8 GB for the container image
- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)

> **Note**: No `pip` or `python` is needed on the host; everything runs inside
> the Apptainer container.

---

## Step-by-Step Setup

### Step 0: SSH into the Server

```bash
ssh <name>@silicon.fhgr.ch
```

### Step 1: Clone the Repository

```bash
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```

> **Note**: `git` is not installed on the host. Either run it through the
> container (`apptainer exec vllm_qwen.sif git clone ...`, which requires an
> existing image) or copy the files via `scp` from your local machine.

### Step 2: Check GPU and Environment

```bash
nvidia-smi
apptainer --version
df -h ~
```

### Step 3: Build the Apptainer Container

```bash
bash 01_build_container.sh
```

This pulls the `vllm/vllm-openai:nightly` Docker image (required for Qwen3.5
support), installs the latest `transformers` from source, and packages
everything into `vllm_qwen.sif` (~8 GB). The build takes 15-20 minutes.

### Step 4: Download Model Weights

**35B model (~67 GB):**
```bash
bash 02_download_model.sh
```

**122B model (~125 GB):**
```bash
bash 10_download_model_122b.sh
```

Both scripts run `huggingface-cli` **inside the container** and store the
weights at `~/models/Qwen3.5-35B-A3B` and `~/models/Qwen3.5-122B-A10B-FP8`
respectively.

### Step 5: Start the Server

Only one model can run at a time on port 7080. Choose one:

**35B model (2 GPUs, faster per-token, smaller):**
```bash
bash 03_start_server.sh              # foreground
bash 04_start_server_background.sh   # background
```

**122B model (4 GPUs, more capable, FP8):**
```bash
bash 11_start_server_122b.sh              # foreground
bash 12_start_server_122b_background.sh   # background
```

**To switch models:**
```bash
bash 05_stop_server.sh           # stop whichever is running
bash 11_start_server_122b.sh     # start the other one
```

The model takes 2-5 minutes (35B) or 5-10 minutes (122B) to load. It's ready
when you see:
```
INFO:     Uvicorn running on http://0.0.0.0:7080
```

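Instead of watching the log for the `Uvicorn running` line, readiness can also be polled over HTTP. The following is a minimal sketch, assuming the server was started without `API_KEY` and that `/v1/models` answers with HTTP 200 once the model is loaded. The helper names `server_is_up` and `wait_until_ready`, and the 15-minute default timeout, are illustrative choices rather than part of the repository.

```python
import http.client
import time
import urllib.request
from typing import Callable


def server_is_up(base_url: str = "http://localhost:7080", timeout: float = 5.0) -> bool:
    """Return True if GET /v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, malformed reply, ...
        return False


def wait_until_ready(probe: Callable[[], bool] = server_is_up,
                     max_wait_s: float = 900.0, interval_s: float = 10.0) -> bool:
    """Call `probe` repeatedly until it succeeds or `max_wait_s` has elapsed."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` after one of the background start scripts blocks until the API responds or the timeout expires.
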
### Step 6: Test the Server

From another terminal on the server:
```bash
curl http://localhost:7080/v1/models
```

Quick chat test:
```bash
curl http://localhost:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
```

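The same chat test can be written in Python using only the standard library. This is a sketch that mirrors the `curl` call above and is not the contents of `test_server.py`. The function names are illustrative, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "qwen3.5-35b-a3b",
                       base_url: str = "http://localhost:7080/v1",
                       max_tokens: int = 128) -> urllib.request.Request:
    """Build the POST request for /chat/completions (same body as the curl test)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print a short reply. Pass `model="qwen3.5-122b-a10b-fp8"` when the 122B model is the one being served.
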
### Step 7: Set Up Open WebUI (ChatGPT-like Interface)

Open WebUI provides a full-featured chat interface that runs on the server.
Students access it via a browser; no local setup is required.

**Pull the container:**
```bash
bash 06_setup_openwebui.sh
```

**Start (foreground with tmux):**
```bash
tmux new -s webui
bash 07_start_openwebui.sh
# Ctrl+B, then D to detach
```

**Start (background with logging):**
```bash
bash 08_start_openwebui_background.sh
tail -f logs/openwebui_*.log
```

Open WebUI is ready when you see `Uvicorn running` in the logs.
Access it at `http://silicon.fhgr.ch:7081`.

> **Important**: The first user to sign up becomes the **admin**. Sign up
> yourself first before sharing the URL with students.

### Step 8: Share with Students

Distribute `STUDENT_GUIDE.md` with connection details:
- **Open WebUI**: `http://silicon.fhgr.ch:7081` (recommended for most students)
- **API Base URL**: `http://silicon.fhgr.ch:7080/v1` (for SDK / programmatic use)
- **Model name**: `qwen3.5-35b-a3b` or `qwen3.5-122b-a10b-fp8` (depending on which is running)

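Because the model name depends on which server is running, client code can read it from `/v1/models` instead of hard-coding it. The sketch below parses such a response. The sample JSON is an abbreviated, illustrative payload in the OpenAI list format, and `served_model_ids` is a hypothetical helper name.

```python
import json


def served_model_ids(models_response: str) -> list[str]:
    """Extract the model IDs from the JSON body returned by GET /v1/models."""
    data = json.loads(models_response)
    return [entry["id"] for entry in data.get("data", [])]


# Abbreviated, illustrative response body (real responses carry more fields).
sample = '{"object": "list", "data": [{"id": "qwen3.5-35b-a3b", "object": "model"}]}'
print(served_model_ids(sample))  # ['qwen3.5-35b-a3b']
```

A student script can then use the first returned ID as the `model` field of its requests.
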
---

## Open WebUI

A server-hosted ChatGPT-like interface backed by the vLLM inference server.
Runs as an Apptainer container on port **7081**.

### Features

- User accounts with persistent chat history (stored in `openwebui-data/`)
- Auto-discovers models from the vLLM backend
- Streaming responses, markdown rendering, code highlighting
- Admin panel for managing users, models, and settings
- No local setup needed: students just open a browser

### Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `7081` | HTTP port for the UI |
| `VLLM_BASE_URL` | `http://localhost:7080/v1` | vLLM API endpoint |
| `VLLM_API_KEY` | `EMPTY` | API key (if vLLM requires one) |
| `DATA_DIR` | `./openwebui-data` | Persistent storage (DB, uploads) |

### Management

```bash
# Start in background
bash 08_start_openwebui_background.sh

# View logs
tail -f logs/openwebui_*.log

# Stop
bash 09_stop_openwebui.sh

# Reconnect to tmux session
tmux attach -t webui
```

### Data Persistence

All user data (accounts, chats, settings) is stored in `openwebui-data/`.
This directory is bind-mounted into the container, so data survives
container restarts. Back it up regularly.

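One possible way to take such a backup is a timestamped archive of the data directory. The sketch below is a suggestion and not a script from the repository: the `backups/` target directory and the function name `backup_data_dir` are assumptions. For a consistent copy of the database, it is probably safest to stop Open WebUI (`bash 09_stop_openwebui.sh`) before running it.

```python
import shutil
import time
from pathlib import Path


def backup_data_dir(data_dir="openwebui-data", backup_root="backups") -> Path:
    """Create <backup_root>/openwebui-data_<timestamp>.tar.gz and return its path."""
    src = Path(data_dir).resolve()
    if not src.is_dir():
        raise FileNotFoundError(f"{src} is not a directory")
    dest_root = Path(backup_root).resolve()
    dest_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    archive = shutil.make_archive(
        str(dest_root / f"{src.name}_{stamp}"),  # archive path without extension
        "gztar",
        root_dir=str(src.parent),  # paths inside the archive are relative to this
        base_dir=src.name,         # only this directory is archived
    )
    return Path(archive)
```

Running `backup_data_dir()` from the repository directory writes the archive under `backups/`; copying it to another machine is left to the operator.
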
---

## Streamlit App

A web-based chat and file editor that connects to the inference server.
Students run it on their own machines.

### Setup

```bash
pip install -r requirements.txt
```

Or with a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Run

```bash
streamlit run app.py
```

Opens at `http://localhost:8501` with two tabs:

- **Chat**: Conversational interface with streaming responses. Save the
  model's last response directly into a workspace file (code auto-extracted).
- **File Editor**: Create/edit `.py`, `.tex`, `.html`, or any text file.
  Use "Generate with LLM" to modify files via natural language instructions.

### Sidebar Controls

| Parameter | Default | Range | Purpose |
|-----------|---------|-------|---------|
| Thinking Mode | Off | Toggle | Chain-of-thought reasoning (slower, better for complex tasks) |
| Temperature | 0.7 | 0.0 – 2.0 | Creativity vs determinism |
| Max Tokens | 4096 | 256 – 16384 | Maximum response length |
| Top P | 0.95 | 0.0 – 1.0 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | 0.0 – 2.0 | Penalize repeated topics |

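For students calling the API directly, the sidebar settings correspond to fields of the chat completion request. The sketch below shows one plausible mapping and is not taken from `app.py`. In particular, passing the thinking toggle as `chat_template_kwargs` with an `enable_thinking` flag is the mechanism vLLM documents for Qwen3-family models, and it is assumed here to apply to these models as well. The helper names are illustrative.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict `value` to the closed interval [low, high]."""
    return max(low, min(high, value))


def build_sampling_params(thinking: bool = False, temperature: float = 0.7,
                          max_tokens: int = 4096, top_p: float = 0.95,
                          presence_penalty: float = 0.0) -> dict:
    """Translate sidebar values into request fields, enforcing the table's ranges."""
    return {
        "temperature": clamp(temperature, 0.0, 2.0),
        "max_tokens": int(clamp(max_tokens, 256, 16384)),
        "top_p": clamp(top_p, 0.0, 1.0),
        "presence_penalty": clamp(presence_penalty, 0.0, 2.0),
        # Assumption: the chat template honours `enable_thinking`.
        "chat_template_kwargs": {"enable_thinking": bool(thinking)},
    }
```

The returned dictionary can be merged into the JSON body of a `/v1/chat/completions` request next to `model` and `messages`.
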
---

## Server Configuration

The 35B and 122B start scripts accept the same environment variables:

| Variable | 35B default | 122B default | Description |
|----------|-------------|--------------|-------------|
| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | `~/models/Qwen3.5-122B-A10B-FP8` | Model weights path |
| `PORT` | `7080` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | `32768` | Max context length |
| `GPU_MEM_UTIL` | `0.92` | `0.92` | GPU memory fraction |
| `API_KEY` | *(none)* | *(none)* | API key for auth |
| `TENSOR_PARALLEL` | `2` | `4` | Number of GPUs |

### Examples

```bash
# Increase context length (35B)
MAX_MODEL_LEN=65536 bash 03_start_server.sh

# Increase context length (122B; the FP8 weights leave room)
MAX_MODEL_LEN=65536 bash 11_start_server_122b.sh

# Add API key authentication (works for either model)
API_KEY="your-secret-key" bash 11_start_server_122b.sh
```

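When an API key is set, clients have to send it with every request. The sketch below assumes the start scripts pass `API_KEY` through to vLLM's API key option, in which case the server expects a standard `Authorization: Bearer <key>` header. The helper names are illustrative.

```python
import urllib.request
from typing import Optional


def auth_headers(api_key: Optional[str] = None) -> dict:
    """Return the HTTP headers needed for an (optionally) key-protected server."""
    if not api_key:
        return {}
    return {"Authorization": f"Bearer {api_key}"}


def models_request(api_key: Optional[str] = None,
                   base_url: str = "http://localhost:7080/v1") -> urllib.request.Request:
    """Build a GET /models request that carries the key, if one is given."""
    return urllib.request.Request(f"{base_url}/models", headers=auth_headers(api_key))
```

Passing `models_request("your-secret-key")` to `urllib.request.urlopen` should then succeed, whereas a request without the header would be expected to be rejected with HTTP 401.
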
---

## Server Management

```bash
# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

# Reconnect to tmux session
tmux attach -t llm
```

---

## Files Overview

| File                                 | Purpose                                              |
|--------------------------------------|------------------------------------------------------|
| `vllm_qwen.def`                      | Apptainer container definition (vLLM nightly + deps) |
| `01_build_container.sh`              | Builds the Apptainer `.sif` image                    |
| `02_download_model.sh`               | Downloads 35B model weights                          |
| `03_start_server.sh`                 | Starts 35B vLLM server (foreground, TP=2)            |
| `04_start_server_background.sh`      | Starts 35B server in background with logging         |
| `05_stop_server.sh`                  | Stops whichever background vLLM server is running    |
| `06_setup_openwebui.sh`              | Pulls the Open WebUI container image                 |
| `07_start_openwebui.sh`              | Starts Open WebUI (foreground)                       |
| `08_start_openwebui_background.sh`   | Starts Open WebUI in background with logging         |
| `09_stop_openwebui.sh`               | Stops the background Open WebUI                      |
| `10_download_model_122b.sh`          | Downloads 122B FP8 model weights                     |
| `11_start_server_122b.sh`            | Starts 122B vLLM server (foreground, TP=4)           |
| `12_start_server_122b_background.sh` | Starts 122B server in background with logging        |
| `app.py`                             | Streamlit chat & file editor web app                 |
| `requirements.txt`                   | Python dependencies for the Streamlit app            |
| `test_server.py`                     | Tests the running server via CLI                     |
| `STUDENT_GUIDE.md`                   | Instructions for students                            |

---

## Troubleshooting

### "CUDA out of memory"
- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)

### Container build fails
- Ensure internet access and sufficient disk space (~20 GB for build cache)
- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:nightly`

### "No NVIDIA GPU detected"
- Verify `nvidia-smi` works on the host
- Ensure `--nv` flag is present (already in scripts)
- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`

### "Model type qwen3_5_moe not recognized"
- The container needs `vllm/vllm-openai:nightly` (not `:latest`)
- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`

### Students can't connect
- Check firewall: ports 7080-7090 must be open
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must be on the university network or VPN

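A quick way to separate network problems from server problems is to test whether the two ports accept TCP connections from a student's machine. The following is a minimal sketch using only the standard library; `port_is_open` is an illustrative helper, not part of the repository.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable network, or DNS lookup failure.
        return False
```

Calling `port_is_open("silicon.fhgr.ch", 7080)` and `port_is_open("silicon.fhgr.ch", 7081)` from the student's network shows whether the API and Open WebUI are reachable. If both return `False` while `curl` works on the server itself, the firewall or VPN is the likely cause.
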
### Slow generation with many users
- This is expected: vLLM batches requests, but throughput is finite
- The MoE architecture (3B active parameters for the 35B model, 10B for the 122B model) helps with per-token speed
- Disable thinking mode for faster simple responses
- Monitor: `curl http://localhost:7080/metrics`

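The `/metrics` endpoint returns Prometheus text, which is tedious to read by eye. The sketch below pulls out two gauges that recent vLLM versions expose, `vllm:num_requests_running` and `vllm:num_requests_waiting`. Treat these names as assumptions and check them against the actual output of the installed version. The sample text is illustrative, and the parser assumes lines carry no trailing timestamps.

```python
def parse_gauges(metrics_text: str, names: tuple) -> dict:
    """Sum the values of the named metrics across all label sets."""
    totals = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        base = metric.split("{", 1)[0]  # drop the {label="..."} part
        if base in names:
            try:
                totals[base] = totals.get(base, 0.0) + float(value)
            except ValueError:
                continue  # ignore lines whose value is not numeric
    return totals


# Illustrative sample, not captured from the real server.
sample_metrics = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="qwen3.5-35b-a3b"} 12.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="qwen3.5-35b-a3b"} 3.0
"""
wanted = ("vllm:num_requests_running", "vllm:num_requests_waiting")
print(parse_gauges(sample_metrics, wanted))
```

A persistently non-zero waiting count suggests the server is saturated and requests are queueing.
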
### Open WebUI won't start
- Ensure the vLLM server is running first on port 7080
- Check that port 7081 is not already in use: `ss -tlnp | grep 7081`
- Check logs: `tail -50 logs/openwebui_*.log`
- If the database is corrupted, reset: `rm openwebui-data/webui.db` and restart

### Open WebUI shows no models
- Verify vLLM is reachable: `curl http://localhost:7080/v1/models`
- The OpenAI API base URL is set on first launch; if changed later, update
  it in the Open WebUI Admin Panel > Settings > Connections

### Syncing files to the server
- There is no `git` or `pip` on the host, so use `scp` from your local machine:
  ```bash
  scp app.py 03_start_server.sh <name>@silicon.fhgr.ch:~/LLM_local/
  ```