LLM Local — Qwen3.5-27B Inference Server

Self-hosted LLM inference for ~15 concurrent students using Qwen3.5-27B, served via vLLM inside an Apptainer container on a GPU server.

Architecture

Students (OpenAI SDK / curl)
        │
        ▼
  ┌─────────────────────────┐
  │  silicon.fhgr.ch:7080   │
  │  OpenAI-compatible API  │
  ├─────────────────────────┤
  │  vLLM Server            │
  │  (Apptainer container)  │
  ├─────────────────────────┤
  │  Qwen3.5-27B weights    │
  │  (bind-mounted)         │
  ├─────────────────────────┤
  │  NVIDIA GPU             │
  └─────────────────────────┘

Prerequisites

  • GPU: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended). Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead.
  • Apptainer (formerly Singularity) installed on the server.
  • NVIDIA drivers + nvidia-container-cli for GPU passthrough.
  • ~60 GB disk space for model weights + ~15 GB for the container image.
  • Network: Students must be on the university network or VPN.

Hardware Sizing

Component   Minimum            Recommended
GPU VRAM    80 GB (1× A100)    80 GB (1× H100)
RAM         64 GB              128 GB
Disk        100 GB free        200 GB free

If your GPU has less than 80 GB VRAM, you have two options:

  1. Use a quantized version (e.g., AWQ/GPTQ 4-bit — ~16 GB VRAM)
  2. Use tensor parallelism across multiple GPUs (set TENSOR_PARALLEL=2)
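A quick back-of-envelope check for option 1: the weight footprint scales linearly with bits per parameter. The sketch below assumes roughly 27 billion parameters and ignores KV cache and activation overhead, which vLLM reserves on top of the weights.

```python
# Rough check: will the weights alone fit in VRAM at a given precision?
# Assumes ~27e9 parameters; quantized formats carry extra overhead
# (scales, zero points), so real usage runs a bit higher.

def weight_vram_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

if __name__ == "__main__":
    n = 27e9
    print(f"BF16 : {weight_vram_gb(n, 16):.0f} GB")   # ~54 GB
    print(f"4-bit: {weight_vram_gb(n, 4):.0f} GB")    # ~14 GB
```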

Step-by-Step Setup

Step 0: SSH into the Server

ssh herzogfloria@silicon.fhgr.ch

Step 1: Clone This Repository

# Or copy the files to the server
git clone <your-repo-url> ~/LLM_local
cd ~/LLM_local
chmod +x *.sh

Step 2: Check GPU and Environment

# Verify GPU is visible
nvidia-smi

# Verify Apptainer is installed
apptainer --version

# Check available disk space
df -h ~

Step 3: Download the Model (~60 GB)

# Install huggingface-cli if not available
pip install --user huggingface_hub[cli]

# Download Qwen3.5-27B
bash 01_download_model.sh
# Default target: ~/models/Qwen3.5-27B

This downloads the full BF16 weights. Takes 20-60 minutes depending on bandwidth.

Step 4: Build the Apptainer Container

bash 02_build_container.sh

This pulls the vllm/vllm-openai:latest Docker image and converts it to a .sif file. Takes 10-20 minutes. The resulting vllm_qwen.sif is ~12-15 GB.

Tip: If building fails due to network/proxy issues, you can pull the Docker image first and convert manually:

apptainer pull docker://vllm/vllm-openai:latest

Step 5: Start the Server

Interactive (foreground):

bash 03_start_server.sh

Background (recommended for production):

bash 04_start_server_background.sh

The server takes 2-5 minutes to load the model into GPU memory. Monitor with:

tail -f logs/vllm_server_*.log

Look for a line like the following (the port matches your PORT setting, 7080 by default):

INFO:     Uvicorn running on http://0.0.0.0:7080

Step 6: Test the Server

# Quick health check
curl http://localhost:7080/v1/models

# Full test
pip install openai
python test_server.py
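For students without the openai package installed, the endpoint can also be called with nothing but the standard library. A minimal sketch (the build_chat_request helper is illustrative, not part of test_server.py):

```python
# Minimal client using only the standard library; base URL and model
# name match the connection details elsewhere in this README.
import json
import urllib.request

BASE_URL = "http://localhost:7080/v1"

def build_chat_request(prompt: str, model: str = "qwen3.5-27b",
                       max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-compatible /chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

if __name__ == "__main__":
    payload = build_chat_request("Say hello in one sentence.")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```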

Step 7: Share with Students

Distribute the STUDENT_GUIDE.md file or share the connection details:

  • 27B Base URL: http://silicon.fhgr.ch:7080/v1 — model name: qwen3.5-27b
  • 35B Base URL: http://silicon.fhgr.ch:7081/v1 — model name: qwen3.5-35b-a3b

Configuration

All configuration is via environment variables in 03_start_server.sh:

Variable         Default                Description
MODEL_DIR        ~/models/Qwen3.5-27B   Path to model weights
PORT             7080                   HTTP port
MAX_MODEL_LEN    32768                  Max context length (tokens)
GPU_MEM_UTIL     0.92                   Fraction of GPU memory to use
API_KEY          (empty = no auth)      API key for authentication
TENSOR_PARALLEL  1                      Number of GPUs
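For reference, these variables plausibly map onto standard vLLM CLI flags as sketched below; this is illustrative, not the exact command in 03_start_server.sh:

```python
# Illustrative mapping from the env vars above to vLLM serve flags.
import os

def build_vllm_args(env: dict) -> list[str]:
    args = [
        "--model", env.get("MODEL_DIR", os.path.expanduser("~/models/Qwen3.5-27B")),
        "--served-model-name", "qwen3.5-27b",
        "--port", env.get("PORT", "7080"),
        "--max-model-len", env.get("MAX_MODEL_LEN", "32768"),
        "--gpu-memory-utilization", env.get("GPU_MEM_UTIL", "0.92"),
        "--tensor-parallel-size", env.get("TENSOR_PARALLEL", "1"),
    ]
    if env.get("API_KEY"):  # empty string means no auth flag at all
        args += ["--api-key", env["API_KEY"]]
    return args
```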

Context Length Tuning

The default MAX_MODEL_LEN=32768 is conservative and ensures stable operation for 15 concurrent users. If you have plenty of VRAM headroom:

MAX_MODEL_LEN=65536 bash 03_start_server.sh

Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require significantly more GPU memory for KV cache.
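To see why, a rough KV cache estimate helps: every token stores one key and one value vector per layer. The architecture numbers below are placeholders, not the real model config; substitute the values from the model's config.json.

```python
# Back-of-envelope KV cache sizing. n_layers / n_kv_heads / head_dim
# are HYPOTHETICAL placeholders -- read the real ones from config.json.
def kv_cache_gb(n_tokens, n_layers=60, n_kv_heads=8, head_dim=128,
                bytes_per_value=2):
    # 2 tensors (K and V) per layer per token, 2 bytes each in BF16
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens / 1e9

if __name__ == "__main__":
    for ctx in (32768, 65536, 262144):
        print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB per sequence")
```

The cost grows linearly with context length, which is why raising MAX_MODEL_LEN eats VRAM quickly once many sequences are in flight.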

Adding Authentication

API_KEY="your-secret-key-here" bash 03_start_server.sh

Students then use this key in their api_key parameter.

Multi-GPU Setup

If you have multiple GPUs:

TENSOR_PARALLEL=2 bash 03_start_server.sh

Server Management

# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

Running Persistently with tmux

For a robust setup that survives SSH disconnects:

ssh herzogfloria@silicon.fhgr.ch
tmux new -s llm_server
bash 03_start_server.sh
# Press Ctrl+B, then D to detach

# Reconnect later:
tmux attach -t llm_server

Files Overview

File                           Purpose
vllm_qwen.def                  Apptainer container definition
01_download_model.sh           Downloads model weights from Hugging Face
02_build_container.sh          Builds the Apptainer .sif image
03_start_server.sh             Starts vLLM server (foreground)
04_start_server_background.sh  Starts server in background with logging
05_stop_server.sh              Stops the background server
test_server.py                 Tests the running server
STUDENT_GUIDE.md               Instructions for students

Troubleshooting

"CUDA out of memory"

  • Reduce MAX_MODEL_LEN (e.g., 16384)
  • Reduce GPU_MEM_UTIL (e.g., 0.85)
  • Use a quantized model variant

Container build fails

  • Ensure you have internet access and sufficient disk space (~20 GB for build cache)
  • Try: apptainer pull docker://vllm/vllm-openai:latest first

"No NVIDIA GPU detected"

  • Check that nvidia-smi works outside the container
  • Ensure --nv flag is passed (already in scripts)
  • Verify nvidia-container-cli: apptainer exec --nv vllm_qwen.sif nvidia-smi

Server starts but students can't connect

  • Check firewall: sudo ufw allow 7080:7090/tcp or equivalent
  • Verify the server binds to 0.0.0.0 (not just localhost)
  • Students must use the server's hostname/IP, not localhost
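A quick way for a student to distinguish "server down" from "firewall blocking the port" is a plain TCP reachability check from their own machine:

```python
# TCP reachability check: True means the port accepts connections,
# False means it is closed, filtered, or the host is unreachable.
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host, port = "silicon.fhgr.ch", 7080
    print(f"{host}:{port} reachable: {port_open(host, port)}")
```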

Slow generation with many users

  • This is expected — vLLM batches requests but throughput is finite
  • Consider reducing max_tokens in student requests
  • Monitor with: curl http://localhost:7080/metrics
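The /metrics endpoint serves Prometheus text format, which is easy to inspect programmatically. A tiny parser sketch (the metric names in SAMPLE are illustrative examples; check your server's actual output):

```python
# Minimal parser for Prometheus text exposition format:
# skips comments, keeps "name value" pairs as floats.
def parse_metrics(text: str) -> dict:
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass
    return out

SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running 3.0
vllm:num_requests_waiting 12.0
"""

if __name__ == "__main__":
    print(parse_metrics(SAMPLE))
```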