LLM Local — Qwen3.5-27B Inference Server

Self-hosted LLM inference for ~15 concurrent students using Qwen3.5-27B, served via vLLM inside an Apptainer container on a GPU server.

Architecture

Students (OpenAI SDK / curl)
        │
        ▼
  ┌─────────────────────────┐
  │  silicon.fhgr.ch:7080   │
  │  OpenAI-compatible API  │
  ├─────────────────────────┤
  │  vLLM Server            │
  │  (Apptainer container)  │
  ├─────────────────────────┤
  │  Qwen3.5-27B weights    │
  │  (bind-mounted)         │
  ├─────────────────────────┤
  │  NVIDIA GPU             │
  └─────────────────────────┘

Prerequisites

  • GPU: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended). Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead.
  • Apptainer (formerly Singularity) installed on the server.
  • NVIDIA drivers + nvidia-container-cli for GPU passthrough.
  • ~60 GB disk space for model weights + ~15 GB for the container image.
  • Network: Students must be on the university network or VPN.

Hardware Sizing

Component   Minimum            Recommended
GPU VRAM    80 GB (1× A100)    80 GB (1× H100)
RAM         64 GB              128 GB
Disk        100 GB free        200 GB free

If your GPU has less than 80 GB VRAM, you have two options:

  1. Use a quantized version (e.g., AWQ/GPTQ 4-bit — ~16 GB VRAM)
  2. Use tensor parallelism across multiple GPUs (set TENSOR_PARALLEL=2)
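A quick back-of-envelope check for option 1: the weight footprint scales linearly with bits per parameter. The sketch below assumes roughly 27 billion parameters and ignores KV cache and activation overhead, which vLLM reserves on top of the weights.

```python
# Rough check: will the weights alone fit in VRAM at a given precision?
# Assumes ~27e9 parameters; quantized formats carry extra overhead
# (scales, zero points), so real usage runs a bit higher.

def weight_vram_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

if __name__ == "__main__":
    n = 27e9
    print(f"BF16 : {weight_vram_gb(n, 16):.0f} GB")   # ~54 GB
    print(f"4-bit: {weight_vram_gb(n, 4):.0f} GB")    # ~14 GB
```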

Step-by-Step Setup

Step 0: SSH into the Server

ssh herzogfloria@silicon.fhgr.ch

Step 1: Clone This Repository

# Or copy the files to the server
git clone <your-repo-url> ~/LLM_local
cd ~/LLM_local
chmod +x *.sh

Step 2: Check GPU and Environment

# Verify GPU is visible
nvidia-smi

# Verify Apptainer is installed
apptainer --version

# Check available disk space
df -h ~

Step 3: Download the Model (~60 GB)

# Install huggingface-cli if not available
pip install --user huggingface_hub[cli]

# Download Qwen3.5-27B
bash 01_download_model.sh
# Default target: ~/models/Qwen3.5-27B

This downloads the full BF16 weights. Takes 20-60 minutes depending on bandwidth.

Step 4: Build the Apptainer Container

bash 02_build_container.sh

This pulls the vllm/vllm-openai:latest Docker image and converts it to a .sif file. Takes 10-20 minutes. The resulting vllm_qwen.sif is ~12-15 GB.

Tip: If building fails due to network/proxy issues, you can pull the Docker image first and convert manually:

apptainer pull docker://vllm/vllm-openai:latest

Step 5: Start the Server

Interactive (foreground):

bash 03_start_server.sh

Background (recommended for production):

bash 04_start_server_background.sh

The server takes 2-5 minutes to load the model into GPU memory. Monitor with:

tail -f logs/vllm_server_*.log

Look for a line like the following (the port matches your PORT setting, 7080 by default):

INFO:     Uvicorn running on http://0.0.0.0:7080

Step 6: Test the Server

# Quick health check
curl http://localhost:7080/v1/models

# Full test
pip install openai
python test_server.py
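For students without the openai package installed, the endpoint can also be called with nothing but the standard library. A minimal sketch (the build_chat_request helper is illustrative, not part of test_server.py):

```python
# Minimal client using only the standard library; base URL and model
# name match the connection details elsewhere in this README.
import json
import urllib.request

BASE_URL = "http://localhost:7080/v1"

def build_chat_request(prompt: str, model: str = "qwen3.5-27b",
                       max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-compatible /chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

if __name__ == "__main__":
    payload = build_chat_request("Say hello in one sentence.")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```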

Step 7: Share with Students

Distribute the STUDENT_GUIDE.md file or share the connection details:

  • 27B Base URL: http://silicon.fhgr.ch:7080/v1 — model name: qwen3.5-27b
  • 35B Base URL: http://silicon.fhgr.ch:7081/v1 — model name: qwen3.5-35b-a3b

Configuration

All configuration is via environment variables in 03_start_server.sh:

Variable         Default                Description
MODEL_DIR        ~/models/Qwen3.5-27B   Path to model weights
PORT             7080                   HTTP port
MAX_MODEL_LEN    32768                  Max context length (tokens)
GPU_MEM_UTIL     0.92                   Fraction of GPU memory to use
API_KEY          (empty = no auth)      API key for authentication
TENSOR_PARALLEL  1                      Number of GPUs
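For reference, these variables plausibly map onto standard vLLM CLI flags as sketched below; this is illustrative, not the exact command in 03_start_server.sh:

```python
# Illustrative mapping from the env vars above to vLLM serve flags.
import os

def build_vllm_args(env: dict) -> list[str]:
    args = [
        "--model", env.get("MODEL_DIR", os.path.expanduser("~/models/Qwen3.5-27B")),
        "--served-model-name", "qwen3.5-27b",
        "--port", env.get("PORT", "7080"),
        "--max-model-len", env.get("MAX_MODEL_LEN", "32768"),
        "--gpu-memory-utilization", env.get("GPU_MEM_UTIL", "0.92"),
        "--tensor-parallel-size", env.get("TENSOR_PARALLEL", "1"),
    ]
    if env.get("API_KEY"):  # empty string means no auth flag at all
        args += ["--api-key", env["API_KEY"]]
    return args
```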

Context Length Tuning

The default MAX_MODEL_LEN=32768 is conservative and ensures stable operation for 15 concurrent users. If you have plenty of VRAM headroom:

MAX_MODEL_LEN=65536 bash 03_start_server.sh

Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require significantly more GPU memory for KV cache.
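To see why, a rough KV cache estimate helps: every token stores one key and one value vector per layer. The architecture numbers below are placeholders, not the real model config; substitute the values from the model's config.json.

```python
# Back-of-envelope KV cache sizing. n_layers / n_kv_heads / head_dim
# are HYPOTHETICAL placeholders -- read the real ones from config.json.
def kv_cache_gb(n_tokens, n_layers=60, n_kv_heads=8, head_dim=128,
                bytes_per_value=2):
    # 2 tensors (K and V) per layer per token, 2 bytes each in BF16
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens / 1e9

if __name__ == "__main__":
    for ctx in (32768, 65536, 262144):
        print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB per sequence")
```

The cost grows linearly with context length, which is why raising MAX_MODEL_LEN eats VRAM quickly once many sequences are in flight.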

Adding Authentication

API_KEY="your-secret-key-here" bash 03_start_server.sh

Students then use this key in their api_key parameter.

Multi-GPU Setup

If you have multiple GPUs:

TENSOR_PARALLEL=2 bash 03_start_server.sh

Server Management

# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

Running Persistently with tmux

For a robust setup that survives SSH disconnects:

ssh herzogfloria@silicon.fhgr.ch
tmux new -s llm_server
bash 03_start_server.sh
# Press Ctrl+B, then D to detach

# Reconnect later:
tmux attach -t llm_server

Files Overview

File                           Purpose
vllm_qwen.def                  Apptainer container definition
01_download_model.sh           Downloads model weights from Hugging Face
02_build_container.sh          Builds the Apptainer .sif image
03_start_server.sh             Starts vLLM server (foreground)
04_start_server_background.sh  Starts server in background with logging
05_stop_server.sh              Stops the background server
test_server.py                 Tests the running server
STUDENT_GUIDE.md               Instructions for students

Troubleshooting

"CUDA out of memory"

  • Reduce MAX_MODEL_LEN (e.g., 16384)
  • Reduce GPU_MEM_UTIL (e.g., 0.85)
  • Use a quantized model variant

Container build fails

  • Ensure you have internet access and sufficient disk space (~20 GB for build cache)
  • Try: apptainer pull docker://vllm/vllm-openai:latest first

"No NVIDIA GPU detected"

  • Check that nvidia-smi works outside the container
  • Ensure --nv flag is passed (already in scripts)
  • Verify nvidia-container-cli: apptainer exec --nv vllm_qwen.sif nvidia-smi

Server starts but students can't connect

  • Check firewall: sudo ufw allow 7080:7090/tcp or equivalent
  • Verify the server binds to 0.0.0.0 (not just localhost)
  • Students must use the server's hostname/IP, not localhost
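A quick way for a student to distinguish "server down" from "firewall blocking the port" is a plain TCP reachability check from their own machine:

```python
# TCP reachability check: True means the port accepts connections,
# False means it is closed, filtered, or the host is unreachable.
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host, port = "silicon.fhgr.ch", 7080
    print(f"{host}:{port} reachable: {port_open(host, port)}")
```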

Slow generation with many users

  • This is expected — vLLM batches requests but throughput is finite
  • Consider reducing max_tokens in student requests
  • Monitor with: curl http://localhost:7080/metrics
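The /metrics endpoint serves Prometheus text format, which is easy to inspect programmatically. A tiny parser sketch (the metric names in SAMPLE are illustrative examples; check your server's actual output):

```python
# Minimal parser for Prometheus text exposition format:
# skips comments, keeps "name value" pairs as floats.
def parse_metrics(text: str) -> dict:
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass
    return out

SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running 3.0
vllm:num_requests_waiting 12.0
"""

if __name__ == "__main__":
    print(parse_metrics(SAMPLE))
```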