# LLM Inferenz Server — Qwen3.5-35B-A3B

Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
(MoE, 35B total / 3B active per token), served via **vLLM** inside an
**Apptainer** container on a GPU server.

## Architecture

```
Students (OpenAI SDK / curl)
              │
              ▼
┌──────────────────────────────┐
│  silicon.fhgr.ch:7080        │
│  OpenAI-compatible API       │
├──────────────────────────────┤
│  vLLM Server (nightly)       │
│  Apptainer container (.sif)  │
├──────────────────────────────┤
│  Qwen3.5-35B-A3B weights     │
│  (bind-mounted from host)    │
├──────────────────────────────┤
│  2× NVIDIA L40S (46 GB ea.)  │
│  Tensor Parallel = 2         │
└──────────────────────────────┘
```

## Hardware

The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each).
The inference server uses **2 GPUs** with tensor parallelism, leaving 2 GPUs free.

| Component           | Value          |
|---------------------|----------------|
| GPUs used           | 2× NVIDIA L40S |
| VRAM used           | ~92 GB total   |
| Model size (BF16)   | ~67 GB         |
| Active params/token | 3B (MoE)       |
| Context length      | 32,768 tokens  |
| Port                | 7080           |

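
As a sanity check, the numbers above combine into a rough per-GPU memory budget. This is a back-of-envelope sketch: the even weight split across both GPUs is an assumption (vLLM shards layers roughly evenly under tensor parallelism), and real usage will differ somewhat.

```python
# Rough per-GPU VRAM budget for the 2-GPU tensor-parallel setup.
# All inputs come from the tables in this README; the even weight
# split across GPUs is an assumption.
WEIGHTS_GB = 67        # model size in BF16
VRAM_PER_GPU_GB = 46   # NVIDIA L40S
NUM_GPUS = 2           # TENSOR_PARALLEL default
GPU_MEM_UTIL = 0.92    # default fraction vLLM is allowed to use

usable = VRAM_PER_GPU_GB * GPU_MEM_UTIL   # memory vLLM may claim per GPU
weights = WEIGHTS_GB / NUM_GPUS           # weight shard per GPU
kv_headroom = usable - weights            # what remains for the KV cache

print(f"usable per GPU:    {usable:.1f} GB")
print(f"weights per GPU:   {weights:.1f} GB")
print(f"KV-cache headroom: {kv_headroom:.1f} GB")
```

Roughly speaking, that headroom is what bounds how many concurrent requests fit in the KV cache at `MAX_MODEL_LEN=32768`.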
## Prerequisites

- **Apptainer** (formerly Singularity) installed on the server
- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
- **~80 GB disk** for model weights + ~8 GB for the container image
- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)

> **Note**: No `pip` or `python` is needed on the host — everything runs inside
> the Apptainer container.

---

## Step-by-Step Setup

### Step 0: SSH into the Server

```bash
ssh herzogfloria@silicon.fhgr.ch
```

### Step 1: Clone the Repository

```bash
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```

### Step 2: Check GPU and Environment

```bash
nvidia-smi           # GPUs visible?
apptainer --version  # Apptainer installed?
df -h ~              # enough free disk space?
```

### Step 3: Build the Apptainer Container

```bash
bash 01_build_container.sh
```

Pulls the `vllm/vllm-openai:latest` Docker image, upgrades vLLM to nightly
(required for Qwen3.5 support), installs the latest `transformers` from source,
and packages everything into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.

### Step 4: Download the Model (~67 GB)

```bash
bash 02_download_model.sh
```

Downloads the Qwen3.5-35B-A3B weights using `huggingface-cli` **inside the
container**. Stored at `~/models/Qwen3.5-35B-A3B`. Takes 5-30 minutes
depending on bandwidth.

### Step 5: Start the Server

**Interactive (foreground) — recommended with tmux:**

```bash
tmux new -s llm
bash 03_start_server.sh
# Ctrl+B, then D to detach
```

**Background with logging:**

```bash
bash 04_start_server_background.sh
tail -f logs/vllm_server_*.log
```

The model takes 2-5 minutes to load into GPU memory. It's ready when you see:

```
INFO: Uvicorn running on http://0.0.0.0:7080
```

### Step 6: Test the Server

From another terminal on the server:

```bash
curl http://localhost:7080/v1/models
```

Or run the full test (uses the `openai` SDK inside the container):

```bash
apptainer exec --writable-tmpfs vllm_qwen.sif python3 test_server.py
```

### Step 7: Share with Students

Distribute `STUDENT_GUIDE.md` with the connection details:

- **Base URL**: `http://silicon.fhgr.ch:7080/v1`
- **Model name**: `qwen3.5-35b-a3b`

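
For students who don't have the `openai` package installed, a request can also be made with nothing but the Python standard library. This is a sketch using the base URL and model name above; the prompt and `max_tokens` value are only illustrations, and the `Authorization` header is needed only if the server was started with an `API_KEY`.

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"
payload = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        # Only needed if the server was started with an API_KEY:
        # "Authorization": "Bearer your-secret-key",
    },
)
try:
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
except OSError as exc:  # unreachable, e.g. off the university network/VPN
    print(f"request failed: {exc}")
```
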
---

## Configuration

All configuration is via environment variables passed to `03_start_server.sh`:

| Variable          | Default                    | Description                   |
|-------------------|----------------------------|-------------------------------|
| `MODEL_DIR`       | `~/models/Qwen3.5-35B-A3B` | Path to model weights         |
| `PORT`            | `7080`                     | HTTP port                     |
| `MAX_MODEL_LEN`   | `32768`                    | Max context length (tokens)   |
| `GPU_MEM_UTIL`    | `0.92`                     | Fraction of GPU memory to use |
| `API_KEY`         | *(empty = no auth)*        | API key for authentication    |
| `TENSOR_PARALLEL` | `2`                        | Number of GPUs                |

### Examples

```bash
# Increase context length
MAX_MODEL_LEN=65536 bash 03_start_server.sh

# Add API key authentication
API_KEY="your-secret-key" bash 03_start_server.sh

# Use all 4 GPUs (more KV cache headroom)
TENSOR_PARALLEL=4 bash 03_start_server.sh
```

---

## Server Management

```bash
# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

# Reconnect to tmux session
tmux attach -t llm
```

---

## Files Overview

| File                            | Purpose                                              |
|---------------------------------|------------------------------------------------------|
| `vllm_qwen.def`                 | Apptainer container definition (vLLM nightly + deps) |
| `01_build_container.sh`         | Builds the Apptainer `.sif` image                    |
| `02_download_model.sh`          | Downloads model weights (runs inside container)      |
| `03_start_server.sh`            | Starts vLLM server (foreground)                      |
| `04_start_server_background.sh` | Starts server in background with logging             |
| `05_stop_server.sh`             | Stops the background server                          |
| `test_server.py`                | Tests the running server                             |
| `STUDENT_GUIDE.md`              | Instructions for students                            |

---

## Troubleshooting

### "CUDA out of memory"

- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)

### Container build fails

- Ensure internet access and sufficient disk space (~20 GB for build cache)
- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:latest`

### "No NVIDIA GPU detected"

- Verify `nvidia-smi` works on the host
- Ensure the `--nv` flag is present (already in the scripts)
- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`

### "Model type qwen3_5_moe not recognized"

- The container needs vLLM nightly and the latest `transformers`
- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`

### Students can't connect

- Check the firewall: ports 7080-7090 must be open
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must be on the university network or VPN

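
A quick way for a student to tell a network problem from a server problem is a plain TCP check against the published host and port (a sketch; adjust the timeout as needed):

```python
import socket

HOST, PORT = "silicon.fhgr.ch", 7080
try:
    # Just open and close a TCP connection; no HTTP involved.
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"{HOST}:{PORT} is reachable")
except OSError as exc:
    print(f"cannot reach {HOST}:{PORT}: {exc} (VPN / firewall?)")
```

If this fails from a student laptop but `curl http://localhost:7080/v1/models` succeeds on the server itself, the issue is the firewall or VPN, not vLLM.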
### Slow generation with many users

- Expected — vLLM batches requests, but throughput is finite
- The MoE architecture (3B active) helps with per-token speed
- Monitor: `curl http://localhost:7080/metrics`
