# LLM Inferenz Server — Qwen3.5-35B-A3B
Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
(MoE, 35B total / 3B active per token), served via **vLLM** inside an
**Apptainer** container on a GPU server.
## Architecture
```
Students (OpenAI SDK / curl)
┌──────────────────────────────┐
│    silicon.fhgr.ch:7080      │
│    OpenAI-compatible API     │
├──────────────────────────────┤
│    vLLM Server (nightly)     │
│  Apptainer container (.sif)  │
├──────────────────────────────┤
│   Qwen3.5-35B-A3B weights    │
│   (bind-mounted from host)   │
├──────────────────────────────┤
│ 2× NVIDIA L40S (46 GB ea.)   │
│     Tensor Parallel = 2      │
└──────────────────────────────┘
```
## Hardware
The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each).
The inference server uses **2 GPUs** with tensor parallelism, leaving 2 GPUs free.
| Component | Value |
|-----------|-------|
| GPUs used | 2× NVIDIA L40S |
| VRAM used | ~92 GB total |
| Model size (BF16) | ~67 GB |
| Active params/token | 3B (MoE) |
| Context length | 32,768 tokens |
| Port | 7080 |
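The numbers in the table fit together as a quick back-of-envelope check (using the default `GPU_MEM_UTIL` of `0.92` from the configuration section below):

```python
# Rough VRAM budget for the 2-GPU setup in the table above.
vram_per_gpu_gb = 46       # usable VRAM per L40S (as reported here)
num_gpus = 2               # TENSOR_PARALLEL = 2
gpu_mem_util = 0.92        # default GPU_MEM_UTIL
weights_gb = 67            # Qwen3.5-35B-A3B in BF16

budget_gb = vram_per_gpu_gb * num_gpus * gpu_mem_util  # what vLLM may allocate
kv_cache_gb = budget_gb - weights_gb                   # headroom left for KV cache

print(f"budget: {budget_gb:.1f} GB, KV cache headroom: {kv_cache_gb:.1f} GB")
# budget: 84.6 GB, KV cache headroom: 17.6 GB
```

The ~17 GB of KV cache headroom is what bounds concurrent context; this is why the troubleshooting section suggests lowering `MAX_MODEL_LEN` or `GPU_MEM_UTIL` on OOM.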
## Prerequisites
- **Apptainer** (formerly Singularity) installed on the server
- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
- **~80 GB disk** for model weights + ~8 GB for the container image
- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)
> **Note**: No `pip` or `python` is needed on the host — everything runs inside
> the Apptainer container.
---
## Step-by-Step Setup
### Step 0: SSH into the Server
```bash
ssh herzogfloria@silicon.fhgr.ch
```
### Step 1: Clone the Repository
```bash
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```
### Step 2: Check GPU and Environment
```bash
nvidia-smi
apptainer --version
df -h ~
```
### Step 3: Build the Apptainer Container
```bash
bash 01_build_container.sh
```
Pulls the `vllm/vllm-openai:latest` Docker image, upgrades vLLM to nightly
(required for Qwen3.5 support), installs latest `transformers` from source,
and packages everything into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
### Step 4: Download the Model (~67 GB)
```bash
bash 02_download_model.sh
```
Downloads Qwen3.5-35B-A3B weights using `huggingface-cli` **inside the
container**. Stored at `~/models/Qwen3.5-35B-A3B`. Takes 5-30 minutes
depending on bandwidth.
### Step 5: Start the Server
**Interactive (foreground) — recommended with tmux:**
```bash
tmux new -s llm
bash 03_start_server.sh
# Ctrl+B, then D to detach
```
**Background with logging:**
```bash
bash 04_start_server_background.sh
tail -f logs/vllm_server_*.log
```
The model takes 2-5 minutes to load into GPU memory. It's ready when you see:
```
INFO: Uvicorn running on http://0.0.0.0:7080
```
### Step 6: Test the Server
From another terminal on the server:
```bash
curl http://localhost:7080/v1/models
```
Or run the full test (uses `openai` SDK inside the container):
```bash
apptainer exec --writable-tmpfs vllm_qwen.sif python3 test_server.py
```
### Step 7: Share with Students
Distribute `STUDENT_GUIDE.md` with connection details:
- **Base URL**: `http://silicon.fhgr.ch:7080/v1`
- **Model name**: `qwen3.5-35b-a3b`
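For reference, the request a student client sends looks like the following stdlib-only sketch (equivalent to what the OpenAI SDK does under the hood; the helper names here are illustrative, and the API key header is only needed if one was configured):

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"  # from this guide
MODEL = "qwen3.5-35b-a3b"

def chat_request(prompt: str) -> urllib.request.Request:
    """Build the POST request for a chat completion."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def send(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    with urllib.request.urlopen(chat_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (run on the university network or via VPN):
#   print(send("Say hello in German."))
```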
---
## Configuration
All configuration is via environment variables passed to `03_start_server.sh`:
| Variable | Default | Description |
|-------------------|----------------------------------|--------------------------------|
| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | Path to model weights |
| `PORT` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | Max context length (tokens) |
| `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use |
| `API_KEY` | *(empty = no auth)* | API key for authentication |
| `TENSOR_PARALLEL` | `2` | Number of GPUs |
### Examples
```bash
# Increase context length
MAX_MODEL_LEN=65536 bash 03_start_server.sh
# Add API key authentication
API_KEY="your-secret-key" bash 03_start_server.sh
# Use all 4 GPUs (more KV cache headroom)
TENSOR_PARALLEL=4 bash 03_start_server.sh
```
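Under the hood, `03_start_server.sh` presumably maps these variables onto the standard vLLM server flags, roughly like this sketch (the actual script is not reproduced here; only the defaulting logic and flag names are assumed):

```python
import os

# Defaults mirror the table above; environment variables override them.
cfg = {
    "MODEL_DIR":       os.environ.get("MODEL_DIR", os.path.expanduser("~/models/Qwen3.5-35B-A3B")),
    "PORT":            os.environ.get("PORT", "7080"),
    "MAX_MODEL_LEN":   os.environ.get("MAX_MODEL_LEN", "32768"),
    "GPU_MEM_UTIL":    os.environ.get("GPU_MEM_UTIL", "0.92"),
    "TENSOR_PARALLEL": os.environ.get("TENSOR_PARALLEL", "2"),
    "API_KEY":         os.environ.get("API_KEY", ""),
}

cmd = [
    "vllm", "serve", cfg["MODEL_DIR"],
    "--served-model-name", "qwen3.5-35b-a3b",
    "--host", "0.0.0.0",                       # bind beyond localhost for students
    "--port", cfg["PORT"],
    "--max-model-len", cfg["MAX_MODEL_LEN"],
    "--gpu-memory-utilization", cfg["GPU_MEM_UTIL"],
    "--tensor-parallel-size", cfg["TENSOR_PARALLEL"],
]
if cfg["API_KEY"]:                             # empty API_KEY means no auth
    cmd += ["--api-key", cfg["API_KEY"]]

print(" ".join(cmd))
```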
---
## Server Management
```bash
# Start in background
bash 04_start_server_background.sh
# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool
# View logs
tail -f logs/vllm_server_*.log
# Stop
bash 05_stop_server.sh
# Monitor GPU usage
watch -n 2 nvidia-smi
# Reconnect to tmux session
tmux attach -t llm
```
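The `curl` readiness check above can also be scripted; this sketch parses the standard OpenAI-style `/v1/models` response body (the response shape is assumed to be the usual vLLM one):

```python
import json

def served_models(models_json: str) -> list[str]:
    """Extract model IDs from a /v1/models response body."""
    return [m["id"] for m in json.loads(models_json).get("data", [])]

# Typical shape of the response once the server is up:
sample = '{"object": "list", "data": [{"id": "qwen3.5-35b-a3b", "object": "model"}]}'
print(served_models(sample))  # ['qwen3.5-35b-a3b']
```

An empty list (or a connection error) means the model is still loading or the server is down.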
---
## Files Overview
| File | Purpose |
|----------------------------------|------------------------------------------------------|
| `vllm_qwen.def` | Apptainer container definition (vLLM nightly + deps) |
| `01_build_container.sh` | Builds the Apptainer `.sif` image |
| `02_download_model.sh` | Downloads model weights (runs inside container) |
| `03_start_server.sh` | Starts vLLM server (foreground) |
| `04_start_server_background.sh` | Starts server in background with logging |
| `05_stop_server.sh` | Stops the background server |
| `test_server.py` | Tests the running server |
| `STUDENT_GUIDE.md` | Instructions for students |
---
## Troubleshooting
### "CUDA out of memory"
- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)
### Container build fails
- Ensure internet access and sufficient disk space (~20 GB for build cache)
- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:latest`
### "No NVIDIA GPU detected"
- Verify `nvidia-smi` works on the host
- Ensure `--nv` flag is present (already in scripts)
- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
### "Model type qwen3_5_moe not recognized"
- The container needs vLLM nightly and latest transformers
- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
### Students can't connect
- Check firewall: ports 7080-7090 must be open
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must be on the university network or VPN
### Slow generation with many users
- Expected — vLLM batches requests but throughput is finite
- The MoE architecture (3B active) helps with per-token speed
- Monitor: `curl http://localhost:7080/metrics`
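The `/metrics` endpoint returns Prometheus text format; a minimal parser for spot-checking load might look like this (metric names such as `vllm:num_requests_running` follow vLLM's exporter but are version-dependent, so treat them as assumptions):

```python
def parse_prom(text: str) -> dict[str, float]:
    """Parse simple Prometheus text-format lines into {metric_name: value}.

    Ignores comments and assumes label values contain no spaces,
    which holds for the simple gauges checked here.
    """
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name, _, value = line.rpartition(" ")
        out[name.split("{")[0]] = float(value)   # strip any {labels}
    return out

# Example lines as vLLM might emit them:
sample = """# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running{model_name="qwen3.5-35b-a3b"} 3.0
vllm:num_requests_waiting{model_name="qwen3.5-35b-a3b"} 1.0
"""
m = parse_prom(sample)
print(m)  # running vs. queued requests
```

A growing `num_requests_waiting` is the clearest sign the server is saturated.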