herzogflorian 076001b07f Add vLLM inference setup for Qwen3.5-35B-A3B on Apptainer
Scripts to build container, download model, and serve Qwen3.5-35B-A3B
via vLLM with OpenAI-compatible API on port 7080. Configured for 2x
NVIDIA L40S GPUs with tensor parallelism, supporting ~15 concurrent
students.

Made-with: Cursor
2026-03-02 14:43:39 +01:00


# LLM Local — Qwen3.5-27B Inference Server
Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-27B**,
served via **vLLM** inside an **Apptainer** container on a GPU server.
## Architecture
```
Students (OpenAI SDK / curl)
            │
            ▼
┌─────────────────────────┐
│  silicon.fhgr.ch:7080   │
│  OpenAI-compatible API  │
├─────────────────────────┤
│       vLLM Server       │
│  (Apptainer container)  │
├─────────────────────────┤
│   Qwen3.5-27B weights   │
│     (bind-mounted)      │
├─────────────────────────┤
│       NVIDIA GPU        │
└─────────────────────────┘
```
## Prerequisites
- **GPU**: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended).
Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead.
- **Apptainer** (formerly Singularity) installed on the server.
- **NVIDIA drivers** + **nvidia-container-cli** for GPU passthrough.
- **~60 GB disk space** for model weights + ~15 GB for the container image.
- **Network**: Students must be on the university network or VPN.
## Hardware Sizing
| Component | Minimum | Recommended |
|-----------|----------------|-----------------|
| GPU VRAM | 80 GB (1× A100)| 80 GB (1× H100) |
| RAM | 64 GB | 128 GB |
| Disk | 100 GB free | 200 GB free |
> **If your GPU has less than 80 GB VRAM**, you have two options:
> 1. Use a **quantized** version (e.g., AWQ/GPTQ 4-bit — ~16 GB VRAM)
> 2. Use **tensor parallelism** across multiple GPUs (set `TENSOR_PARALLEL=2`)
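The VRAM figures above follow from simple arithmetic. This sketch (assuming "27B" means roughly 27 billion parameters; real checkpoints vary slightly) shows where the ~56 GB BF16 and ~16 GB 4-bit estimates come from:

```python
# Back-of-the-envelope VRAM estimate for model weights.
# Assumption: "27B" ≈ 27e9 parameters; actual checkpoint sizes vary.

def weights_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

bf16 = weights_gb(27e9, 2.0)   # BF16: 2 bytes per parameter
awq4 = weights_gb(27e9, 0.5)   # 4-bit AWQ/GPTQ: 0.5 bytes per parameter

print(f"BF16 weights : ~{bf16:.0f} GB")   # ~54 GB, plus KV cache -> ~56+ GB total
print(f"4-bit weights: ~{awq4:.0f} GB")   # ~14 GB, plus overhead -> ~16 GB total
```

The gap between weight memory and total VRAM is what vLLM uses for the KV cache, which is why a GPU with exactly the weight size is not enough.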
---
## Step-by-Step Setup
### Step 0: SSH into the Server
```bash
ssh herzogfloria@silicon.fhgr.ch
```
### Step 1: Clone This Repository
```bash
# Or copy the files to the server
git clone <your-repo-url> ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```
### Step 2: Check GPU and Environment
```bash
# Verify GPU is visible
nvidia-smi
# Verify Apptainer is installed
apptainer --version
# Check available disk space
df -h ~
```
### Step 3: Download the Model (~60 GB)
```bash
# Install huggingface-cli if not available
pip install --user "huggingface_hub[cli]"
# Download Qwen3.5-27B
bash 01_download_model.sh
# Default target: ~/models/Qwen3.5-27B
```
This downloads the full BF16 weights. Takes 20-60 minutes depending on bandwidth.
### Step 4: Build the Apptainer Container
```bash
bash 02_build_container.sh
```
This pulls the `vllm/vllm-openai:latest` Docker image and converts it to a `.sif` file.
Takes 10-20 minutes. The resulting `vllm_qwen.sif` is ~12-15 GB.
> **Tip**: If building fails due to network/proxy issues, you can pull the Docker image
> first and convert manually:
> ```bash
> apptainer pull docker://vllm/vllm-openai:latest
> ```
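The repository ships the definition file `vllm_qwen.def`. For a Docker-based build like this one, such a file is typically little more than the following sketch (illustrative — check the actual file in the repo):

```
Bootstrap: docker
From: vllm/vllm-openai:latest

%labels
    Purpose vLLM inference server for Qwen3.5-27B
```

Given a definition file, the equivalent manual build is `apptainer build vllm_qwen.sif vllm_qwen.def`.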
### Step 5: Start the Server
**Interactive (foreground):**
```bash
bash 03_start_server.sh
```
**Background (recommended for production):**
```bash
bash 04_start_server_background.sh
```
The server takes 2-5 minutes to load the model into GPU memory. Monitor with:
```bash
tail -f logs/vllm_server_*.log
```
Look for a line like this (Apptainer does not remap ports, so the port matches your `PORT` setting — 7080 by default):
```
INFO: Uvicorn running on http://0.0.0.0:7080
```
### Step 6: Test the Server
```bash
# Quick health check
curl http://localhost:7080/v1/models
# Full test
pip install openai
python test_server.py
```
### Step 7: Share with Students
Distribute the `STUDENT_GUIDE.md` file or share the connection details:
- **27B Base URL**: `http://silicon.fhgr.ch:7080/v1` — model name: `qwen3.5-27b`
- **35B Base URL**: `http://silicon.fhgr.ch:7081/v1` — model name: `qwen3.5-35b-a3b`
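Students who want to verify connectivity before installing the OpenAI SDK can send a request with nothing but the Python standard library. This is a sketch against the 27B endpoint above, using the standard OpenAI chat-completions payload format:

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"  # 27B endpoint from above

def build_request(prompt: str, model: str = "qwen3.5-27b") -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request (not yet sent)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        # If the server was started with API_KEY, also add:
        # "Authorization": "Bearer <your-key>"
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (only works from the university network/VPN while the server is up):
# print(ask("Say hello."))
```

The same request works through the official `openai` package by pointing its `base_url` at the URL above; `test_server.py` demonstrates that route.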
---
## Configuration
All configuration is via environment variables in `03_start_server.sh`:
| Variable | Default | Description |
|-------------------|------------------------------|-------------------------------------|
| `MODEL_DIR` | `~/models/Qwen3.5-27B` | Path to model weights |
| `PORT` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | Max context length (tokens) |
| `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use |
| `API_KEY` | *(empty = no auth)* | API key for authentication |
| `TENSOR_PARALLEL` | `1` | Number of GPUs |
### Context Length Tuning
The default `MAX_MODEL_LEN=32768` is conservative and ensures stable operation for 15
concurrent users. If you have plenty of VRAM headroom:
```bash
MAX_MODEL_LEN=65536 bash 03_start_server.sh
```
Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require
significantly more GPU memory for KV cache.
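The KV-cache cost behind this trade-off is easy to estimate. The architecture numbers below (layer count, KV heads, head dimension) are illustrative assumptions, not the published Qwen3.5-27B config, but they show why doubling the context roughly doubles cache memory per sequence:

```python
# Per-token KV cache: 2 tensors (K and V) per layer, each holding
# kv_heads * head_dim values. All architecture numbers used below are
# illustrative assumptions, not the real model config.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes per token; dtype_bytes=2 assumes BF16."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128)
print(per_token)  # 262144 bytes = 256 KiB per token

for ctx in (16384, 32768, 65536):
    gib = per_token * ctx / 2**30
    print(f"{ctx:>6} tokens -> {gib:.0f} GiB KV cache per full-length sequence")
```

Under these assumptions a single 32k-token sequence costs 8 GiB of cache, so the headroom left after loading the weights directly limits both context length and the number of sequences vLLM can batch.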
### Adding Authentication
```bash
API_KEY="your-secret-key-here" bash 03_start_server.sh
```
Students then use this key in their `api_key` parameter.
### Multi-GPU Setup
If you have multiple GPUs:
```bash
TENSOR_PARALLEL=2 bash 03_start_server.sh
```
---
## Server Management
```bash
# Start in background
bash 04_start_server_background.sh
# Check if running
curl -s http://localhost:7080/v1/models | python -m json.tool
# View logs
tail -f logs/vllm_server_*.log
# Stop
bash 05_stop_server.sh
# Monitor GPU usage
watch -n 2 nvidia-smi
```
### Running Persistently with tmux
For a robust setup that survives SSH disconnects:
```bash
ssh herzogfloria@silicon.fhgr.ch
tmux new -s llm_server
bash 03_start_server.sh
# Press Ctrl+B, then D to detach
# Reconnect later:
tmux attach -t llm_server
```
---
## Files Overview
| File                            | Purpose                                      |
|---------------------------------|----------------------------------------------|
| `vllm_qwen.def`                 | Apptainer container definition               |
| `01_download_model.sh`          | Downloads model weights from Hugging Face    |
| `02_build_container.sh`         | Builds the Apptainer `.sif` image            |
| `03_start_server.sh`            | Starts the vLLM server (foreground)          |
| `04_start_server_background.sh` | Starts the server in background with logging |
| `05_stop_server.sh`             | Stops the background server                  |
| `test_server.py`                | Tests the running server                     |
| `STUDENT_GUIDE.md`              | Instructions for students                    |
---
## Troubleshooting
### "CUDA out of memory"
- Reduce `MAX_MODEL_LEN` (e.g., 16384)
- Reduce `GPU_MEM_UTIL` (e.g., 0.85)
- Use a quantized model variant
### Container build fails
- Ensure you have internet access and sufficient disk space (~20 GB for build cache)
- Try: `apptainer pull docker://vllm/vllm-openai:latest` first
### "No NVIDIA GPU detected"
- Check that `nvidia-smi` works outside the container
- Ensure `--nv` flag is passed (already in scripts)
- Verify nvidia-container-cli: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
### Server starts but students can't connect
- Check firewall: `sudo ufw allow 7080:7090/tcp` or equivalent
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must use the server's hostname/IP, not `localhost`
### Slow generation with many users
- This is expected — vLLM batches requests but throughput is finite
- Consider reducing `max_tokens` in student requests
- Monitor with: `curl http://localhost:7080/metrics`