# LLM Local — Qwen3.5-27B Inference Server
Self-hosted LLM inference for ~15 concurrent students using Qwen3.5-27B, served via vLLM inside an Apptainer container on a GPU server.
## Architecture

```
Students (OpenAI SDK / curl)
            │
            ▼
┌─────────────────────────┐
│  silicon.fhgr.ch:7080   │
│  OpenAI-compatible API  │
├─────────────────────────┤
│       vLLM Server       │
│  (Apptainer container)  │
├─────────────────────────┤
│  Qwen3.5-27B weights    │
│     (bind-mounted)      │
├─────────────────────────┤
│       NVIDIA GPU        │
└─────────────────────────┘
```
## Prerequisites
- GPU: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended). Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead.
- Apptainer (formerly Singularity) installed on the server.
- NVIDIA drivers + nvidia-container-cli for GPU passthrough.
- ~60 GB disk space for model weights + ~15 GB for the container image.
- Network: Students must be on the university network or VPN.
## Hardware Sizing
| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 80 GB (1× A100) | 80 GB (1× H100) |
| RAM | 64 GB | 128 GB |
| Disk | 100 GB free | 200 GB free |
If your GPU has less than 80 GB VRAM, you have two options:

- Use a quantized variant (e.g., AWQ/GPTQ 4-bit, ~16 GB VRAM)
- Use tensor parallelism across multiple GPUs (set `TENSOR_PARALLEL=2`)
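To sanity-check the sizing numbers above: weight memory is roughly parameter count times bytes per parameter. A minimal sketch (weights only; KV cache and activation overhead come on top, which is why the prerequisites quote higher figures):

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (decimal), ignoring KV cache/activations."""
    return params_billion * bytes_per_param

print(weight_gb(27, 2.0))  # BF16, 2 bytes/param -> 54.0
print(weight_gb(27, 0.5))  # 4-bit AWQ/GPTQ, ~0.5 bytes/param -> 13.5
```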
## Step-by-Step Setup
### Step 0: SSH into the Server

```bash
ssh herzogfloria@silicon.fhgr.ch
```
### Step 1: Clone This Repository

```bash
# Or copy the files to the server
git clone <your-repo-url> ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```
### Step 2: Check GPU and Environment

```bash
# Verify GPU is visible
nvidia-smi

# Verify Apptainer is installed
apptainer --version

# Check available disk space
df -h ~
```
### Step 3: Download the Model (~60 GB)

```bash
# Install the Hugging Face CLI if not available
# (quotes prevent the shell from globbing the brackets)
pip install --user "huggingface_hub[cli]"

# Download Qwen3.5-27B
bash 01_download_model.sh
# Default target: ~/models/Qwen3.5-27B
```
This downloads the full BF16 weights. Takes 20-60 minutes depending on bandwidth.
### Step 4: Build the Apptainer Container

```bash
bash 02_build_container.sh
```

This pulls the `vllm/vllm-openai:latest` Docker image and converts it to a `.sif` file. Takes 10-20 minutes. The resulting `vllm_qwen.sif` is ~12-15 GB.

Tip: If the build fails due to network/proxy issues, pull the Docker image first and convert manually:

```bash
apptainer pull docker://vllm/vllm-openai:latest
```
### Step 5: Start the Server

Interactive (foreground):

```bash
bash 03_start_server.sh
```

Background (recommended for production):

```bash
bash 04_start_server_background.sh
```

The server takes 2-5 minutes to load the model into GPU memory. Monitor with:

```bash
tail -f logs/vllm_server_*.log
```

Look for the line:

```
INFO:     Uvicorn running on http://0.0.0.0:7080
```
### Step 6: Test the Server

```bash
# Quick health check
curl http://localhost:7080/v1/models

# Full test
pip install openai
python test_server.py
```
### Step 7: Share with Students

Distribute the STUDENT_GUIDE.md file or share the connection details:

- 27B Base URL: `http://silicon.fhgr.ch:7080/v1` (model name: `qwen3.5-27b`)
- 35B Base URL: `http://silicon.fhgr.ch:7081/v1` (model name: `qwen3.5-35b-a3b`)
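For students who prefer raw HTTP over the OpenAI SDK, the request shape is easy to reproduce. A sketch of the JSON body a client POSTs to `<base-url>/chat/completions` (model name taken from the connection details above; the `max_tokens` default is illustrative):

```python
import json

def chat_body(prompt: str, model: str = "qwen3.5-27b", max_tokens: int = 256) -> str:
    """Build the JSON body for a POST to <base-url>/chat/completions."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

body = chat_body("Hello!")
```

The same body works with `curl -H 'Content-Type: application/json' -d "$body" http://silicon.fhgr.ch:7080/v1/chat/completions`.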
## Configuration

All configuration is via environment variables in `03_start_server.sh`:

| Variable | Default | Description |
|---|---|---|
| `MODEL_DIR` | `~/models/Qwen3.5-27B` | Path to model weights |
| `PORT` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | Max context length (tokens) |
| `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use |
| `API_KEY` | (empty = no auth) | API key for authentication |
| `TENSOR_PARALLEL` | `1` | Number of GPUs |
### Context Length Tuning

The default `MAX_MODEL_LEN=32768` is conservative and ensures stable operation for 15 concurrent users. If you have plenty of VRAM headroom:

```bash
MAX_MODEL_LEN=65536 bash 03_start_server.sh
```
Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require significantly more GPU memory for KV cache.
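To see why longer contexts are expensive: per-token KV-cache size is roughly 2 (K and V) × layers × KV heads × head dim × bytes per element. The architecture numbers below are placeholders, not Qwen3.5-27B's actual configuration; substitute the values from the model's `config.json`:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GB for one sequence of `tokens` tokens."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return tokens * per_token / 1e9

# Placeholder architecture: 48 layers, 8 KV heads (GQA), head_dim 128, BF16
print(kv_cache_gb(32768, 48, 8, 128))  # ~6.4 GB for a single full-length sequence
```

Multiply by the number of concurrent sequences to see why `MAX_MODEL_LEN` dominates memory planning.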
### Adding Authentication

```bash
API_KEY="your-secret-key-here" bash 03_start_server.sh
```

Students then pass this key as the `api_key` parameter of their client.
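Under the hood, the OpenAI SDK sends the key as a standard Bearer token, which is what vLLM's OpenAI-compatible server checks. A sketch of the equivalent header for raw HTTP clients (e.g., `curl -H`):

```python
def auth_headers(api_key: str) -> dict:
    """Authorization header as sent by the OpenAI SDK for a given api_key."""
    return {"Authorization": f"Bearer {api_key}"}

print(auth_headers("your-secret-key-here"))
```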
### Multi-GPU Setup

If you have multiple GPUs:

```bash
TENSOR_PARALLEL=2 bash 03_start_server.sh
```
## Server Management

```bash
# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi
```
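The `/v1/models` health check returns an OpenAI-style model list. A small helper to extract the served model ids from that JSON, useful in a monitoring script (response shape follows the OpenAI list format; the sample body is illustrative):

```python
import json

def served_models(body: str) -> list[str]:
    """Return the model ids from a /v1/models response body."""
    return [m["id"] for m in json.loads(body)["data"]]

sample = '{"object": "list", "data": [{"id": "qwen3.5-27b", "object": "model"}]}'
print(served_models(sample))  # ['qwen3.5-27b']
```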
## Running Persistently with tmux

For a robust setup that survives SSH disconnects:

```bash
ssh herzogfloria@silicon.fhgr.ch
tmux new -s llm_server
bash 03_start_server.sh
# Press Ctrl+B, then D to detach

# Reconnect later:
tmux attach -t llm_server
```
## Files Overview

| File | Purpose |
|---|---|
| `vllm_qwen.def` | Apptainer container definition |
| `01_download_model.sh` | Downloads model weights from Hugging Face |
| `02_build_container.sh` | Builds the Apptainer `.sif` image |
| `03_start_server.sh` | Starts vLLM server (foreground) |
| `04_start_server_background.sh` | Starts server in background with logging |
| `05_stop_server.sh` | Stops the background server |
| `test_server.py` | Tests the running server |
| `STUDENT_GUIDE.md` | Instructions for students |
## Troubleshooting

### "CUDA out of memory"

- Reduce `MAX_MODEL_LEN` (e.g., 16384)
- Reduce `GPU_MEM_UTIL` (e.g., 0.85)
- Use a quantized model variant
### Container build fails

- Ensure you have internet access and sufficient disk space (~20 GB for build cache)
- Try `apptainer pull docker://vllm/vllm-openai:latest` first
### "No NVIDIA GPU detected"

- Check that `nvidia-smi` works outside the container
- Ensure the `--nv` flag is passed (already in scripts)
- Verify GPU passthrough: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
### Server starts but students can't connect

- Check firewall: `sudo ufw allow 7080:7090/tcp` or equivalent
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must use the server's hostname/IP, not `localhost`
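A quick way for a student to check reachability from their own machine before debugging further; a minimal TCP probe sketch (hostname and port taken from this guide):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From a student machine on the university network / VPN:
# port_open("silicon.fhgr.ch", 7080)
```

If this returns False while `curl localhost:7080/v1/models` works on the server, the problem is the firewall or network path, not vLLM.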
### Slow generation with many users

- This is expected: vLLM batches requests, but throughput is finite
- Consider reducing `max_tokens` in student requests
- Monitor with `curl http://localhost:7080/metrics`
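The `/metrics` endpoint serves Prometheus text format. A minimal parser sketch for ad-hoc inspection (the metric names in the sample are illustrative; check the names your vLLM version actually exports):

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {metric: value} (simple value lines only)."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        out[name] = float(value)
    return out

sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running{model_name="qwen3.5-27b"} 3.0
vllm:num_requests_waiting{model_name="qwen3.5-27b"} 1.0
"""
queue = parse_metrics(sample)
```

Watching the running vs. waiting counts tells you whether slowness is raw generation speed or queueing.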