# LLM Local — Qwen3.5-27B Inference Server

Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-27B**,
served via **vLLM** inside an **Apptainer** container on a GPU server.
## Architecture

```
Students (OpenAI SDK / curl)
            │
            ▼
┌─────────────────────────┐
│  silicon.fhgr.ch:7080   │
│  OpenAI-compatible API  │
├─────────────────────────┤
│       vLLM Server       │
│  (Apptainer container)  │
├─────────────────────────┤
│   Qwen3.5-27B weights   │
│     (bind-mounted)      │
├─────────────────────────┤
│       NVIDIA GPU        │
└─────────────────────────┘
```
## Prerequisites

- **GPU**: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended).
  Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead.
- **Apptainer** (formerly Singularity) installed on the server.
- **NVIDIA drivers** + **nvidia-container-cli** for GPU passthrough.
- **~60 GB disk space** for model weights + ~15 GB for the container image.
- **Network**: Students must be on the university network or VPN.
## Hardware Sizing

| Component | Minimum         | Recommended     |
|-----------|-----------------|-----------------|
| GPU VRAM  | 80 GB (1× A100) | 80 GB (1× H100) |
| RAM       | 64 GB           | 128 GB          |
| Disk      | 100 GB free     | 200 GB free     |

> **If your GPU has less than 80 GB VRAM**, you have two options:
> 1. Use a **quantized** version (e.g., AWQ/GPTQ 4-bit — ~16 GB VRAM)
> 2. Use **tensor parallelism** across multiple GPUs (set `TENSOR_PARALLEL=2`)
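The KV-cache overhead mentioned above follows standard arithmetic. A minimal sketch of the estimate — the layer/head dimensions below are illustrative placeholders, *not* the real Qwen3.5-27B values; read the actual numbers from the checkpoint's `config.json`:

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    # K and V each store layers * kv_heads * head_dim values per token;
    # dtype_bytes=2 corresponds to BF16/FP16 cache entries.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Illustrative dimensions only (hypothetical, for the arithmetic).
layers, kv_heads, head_dim = 64, 8, 128
per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim)

# Worst case: all 15 students at the full 32768-token context at once.
# vLLM's paged KV cache only allocates blocks actually in use, which is
# why the defaults work in practice despite this pessimistic bound.
total_gib = 15 * 32768 * per_token / 2**30
print(f"{per_token} bytes/token -> ~{total_gib:.0f} GiB worst-case KV cache")
```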
---
## Step-by-Step Setup

### Step 0: SSH into the Server

```bash
ssh herzogfloria@silicon.fhgr.ch
```
### Step 1: Clone This Repository

```bash
# Or copy the files to the server
git clone <your-repo-url> ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```

### Step 2: Check GPU and Environment

```bash
# Verify GPU is visible
nvidia-smi

# Verify Apptainer is installed
apptainer --version

# Check available disk space
df -h ~
```
### Step 3: Download the Model (~60 GB)

```bash
# Install the Hugging Face CLI if not available
# (quotes keep the [cli] extra from being glob-expanded by the shell)
pip install --user "huggingface_hub[cli]"

# Download Qwen3.5-27B
bash 01_download_model.sh
# Default target: ~/models/Qwen3.5-27B
```

This downloads the full BF16 weights. Takes 20-60 minutes depending on bandwidth.
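To sanity-check that the download completed, you can sum the shard sizes on disk. A stdlib-only sketch, assuming the default target path from above and that the weights ship as `*.safetensors` shards:

```python
from pathlib import Path

def weights_size_gib(model_dir: str) -> float:
    # Sum the sizes of all safetensors shards in the model directory.
    path = Path(model_dir).expanduser()
    total = sum(p.stat().st_size for p in path.glob("*.safetensors"))
    return total / 2**30

if __name__ == "__main__":
    # Expect roughly 55-60 GiB for the full BF16 weights.
    print(f"{weights_size_gib('~/models/Qwen3.5-27B'):.1f} GiB")
```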
### Step 4: Build the Apptainer Container

```bash
bash 02_build_container.sh
```

This pulls the `vllm/vllm-openai:latest` Docker image and converts it to a `.sif` file.
Takes 10-20 minutes. The resulting `vllm_qwen.sif` is ~12-15 GB.

> **Tip**: If building fails due to network/proxy issues, you can pull the Docker image
> first and convert manually:
> ```bash
> apptainer pull docker://vllm/vllm-openai:latest
> ```
### Step 5: Start the Server

**Interactive (foreground):**
```bash
bash 03_start_server.sh
```

**Background (recommended for production):**
```bash
bash 04_start_server_background.sh
```

The server takes 2-5 minutes to load the model into GPU memory. Monitor with:
```bash
tail -f logs/vllm_server_*.log
```

Look for the line (the port matches the `PORT` setting, 7080 by default):
```
INFO: Uvicorn running on http://0.0.0.0:7080
```
### Step 6: Test the Server

```bash
# Quick health check
curl http://localhost:7080/v1/models

# Full test
pip install openai
python test_server.py
```
### Step 7: Share with Students

Distribute the `STUDENT_GUIDE.md` file or share the connection details:

- **27B Base URL**: `http://silicon.fhgr.ch:7080/v1` — model name: `qwen3.5-27b`
- **35B Base URL**: `http://silicon.fhgr.ch:7081/v1` — model name: `qwen3.5-35b-a3b`
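Students who cannot (or prefer not to) install the `openai` package can call the API with nothing but the standard library. A minimal sketch against the 27B endpoint listed above:

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"   # 27B endpoint from above
MODEL = "qwen3.5-27b"

def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 128) -> urllib.request.Request:
    # Assemble a POST against the OpenAI-compatible /chat/completions route.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(BASE_URL, MODEL, "Say hello in one sentence.")
# To actually send it (requires the server to be reachable):
#   with urllib.request.urlopen(req, timeout=120) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

If `API_KEY` is set on the server, add an `Authorization: Bearer <key>` header to the request.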
---

## Configuration

All configuration is via environment variables in `03_start_server.sh`:

| Variable          | Default                | Description                   |
|-------------------|------------------------|-------------------------------|
| `MODEL_DIR`       | `~/models/Qwen3.5-27B` | Path to model weights         |
| `PORT`            | `7080`                 | HTTP port                     |
| `MAX_MODEL_LEN`   | `32768`                | Max context length (tokens)   |
| `GPU_MEM_UTIL`    | `0.92`                 | Fraction of GPU memory to use |
| `API_KEY`         | *(empty = no auth)*    | API key for authentication    |
| `TENSOR_PARALLEL` | `1`                    | Number of GPUs                |
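These variables presumably translate into vLLM's standard OpenAI-server CLI flags inside `03_start_server.sh`. A hypothetical sketch of that mapping — the authoritative version is the script itself:

```python
import os

# Hypothetical mapping from the env vars above to vLLM's standard
# OpenAI-server flags; check 03_start_server.sh for the real one.
def build_vllm_args(env: dict) -> list:
    return [
        "--model", env.get("MODEL_DIR", os.path.expanduser("~/models/Qwen3.5-27B")),
        "--port", env.get("PORT", "7080"),
        "--max-model-len", env.get("MAX_MODEL_LEN", "32768"),
        "--gpu-memory-utilization", env.get("GPU_MEM_UTIL", "0.92"),
        "--tensor-parallel-size", env.get("TENSOR_PARALLEL", "1"),
    ]

# Overriding one variable leaves the other defaults intact:
args = build_vllm_args({"TENSOR_PARALLEL": "2"})
```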
### Context Length Tuning

The default `MAX_MODEL_LEN=32768` is conservative and ensures stable operation for 15
concurrent users. If you have plenty of VRAM headroom:

```bash
MAX_MODEL_LEN=65536 bash 03_start_server.sh
```

Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require
significantly more GPU memory for KV cache.
### Adding Authentication

```bash
API_KEY="your-secret-key-here" bash 03_start_server.sh
```

Students then use this key in their `api_key` parameter.
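One convenient way to mint a strong key (a sketch, not part of the provided scripts):

```python
import secrets

# Generate a URL-safe random key suitable for the API_KEY variable.
key = secrets.token_urlsafe(32)
print(key)
```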
### Multi-GPU Setup

If you have multiple GPUs:

```bash
TENSOR_PARALLEL=2 bash 03_start_server.sh
```

---
## Server Management

```bash
# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi
```
### Running Persistently with tmux

For a robust setup that survives SSH disconnects:

```bash
ssh herzogfloria@silicon.fhgr.ch
tmux new -s llm_server
bash 03_start_server.sh
# Press Ctrl+B, then D to detach

# Reconnect later:
tmux attach -t llm_server
```

---
## Files Overview

| File                            | Purpose                                   |
|---------------------------------|-------------------------------------------|
| `vllm_qwen.def`                 | Apptainer container definition            |
| `01_download_model.sh`          | Downloads model weights from Hugging Face |
| `02_build_container.sh`         | Builds the Apptainer `.sif` image         |
| `03_start_server.sh`            | Starts vLLM server (foreground)           |
| `04_start_server_background.sh` | Starts server in background with logging  |
| `05_stop_server.sh`             | Stops the background server               |
| `test_server.py`                | Tests the running server                  |
| `STUDENT_GUIDE.md`              | Instructions for students                 |

---
## Troubleshooting

### "CUDA out of memory"
- Reduce `MAX_MODEL_LEN` (e.g., 16384)
- Reduce `GPU_MEM_UTIL` (e.g., 0.85)
- Use a quantized model variant

### Container build fails
- Ensure you have internet access and sufficient disk space (~20 GB for build cache)
- Try `apptainer pull docker://vllm/vllm-openai:latest` first

### "No NVIDIA GPU detected"
- Check that `nvidia-smi` works outside the container
- Ensure the `--nv` flag is passed (already in the scripts)
- Verify GPU passthrough: `apptainer exec --nv vllm_qwen.sif nvidia-smi`

### Server starts but students can't connect
- Check the firewall: `sudo ufw allow 7080:7090/tcp` or equivalent
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must use the server's hostname/IP, not `localhost`

### Slow generation with many users
- This is expected — vLLM batches requests, but throughput is finite
- Consider reducing `max_tokens` in student requests
- Monitor with: `curl http://localhost:7080/metrics`
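To see how backed up the server actually is, the Prometheus text served at `/metrics` can be scanned for queue-depth gauges. A small stdlib sketch — the `vllm:num_requests_*` gauge names are the ones recent vLLM versions typically export, so verify them against your server's real `/metrics` output:

```python
# Hypothetical helper: pull queue-depth gauges out of Prometheus
# text-exposition output such as vLLM's /metrics endpoint.
def parse_gauges(text: str, names) -> dict:
    out = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        for name in names:
            if line.startswith(name):
                # The sample value is the last space-separated field.
                out[name] = float(line.rsplit(" ", 1)[-1])
    return out

# Works on a captured sample of the exposition format:
sample = (
    "# TYPE vllm:num_requests_running gauge\n"
    "vllm:num_requests_running 3.0\n"
    "vllm:num_requests_waiting 7.0\n"
)
gauges = parse_gauges(sample, ["vllm:num_requests_running",
                               "vllm:num_requests_waiting"])
print(gauges)
```

A sustained non-zero `waiting` count means requests are queuing and students will see higher latency.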