# LLM Inferenz Server — Qwen3.5-35B-A3B

Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B** (MoE, 35B total / 3B active per token), served via **vLLM** inside an **Apptainer** container on a GPU server.

## Architecture

```
Students (OpenAI SDK / curl)
              │
              ▼
┌──────────────────────────────┐
│  silicon.fhgr.ch:7080        │
│  OpenAI-compatible API       │
├──────────────────────────────┤
│  vLLM Server (nightly)       │
│  Apptainer container (.sif)  │
├──────────────────────────────┤
│  Qwen3.5-35B-A3B weights     │
│  (bind-mounted from host)    │
├──────────────────────────────┤
│  2× NVIDIA L40S (46 GB ea.)  │
│  Tensor Parallel = 2         │
└──────────────────────────────┘
```

## Hardware

The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each). The inference server uses **2 GPUs** with tensor parallelism, leaving 2 GPUs free.

| Component | Value |
|-----------|-------|
| GPUs used | 2× NVIDIA L40S |
| VRAM used | ~92 GB total |
| Model size (BF16) | ~67 GB |
| Active params/token | 3B (MoE) |
| Context length | 32,768 tokens |
| Port | 7080 |

## Prerequisites

- **Apptainer** (formerly Singularity) installed on the server
- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
- **~80 GB disk** for model weights + ~8 GB for the container image
- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)

> **Note**: No `pip` or `python` is needed on the host — everything runs inside
> the Apptainer container.
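The hardware numbers above can be sanity-checked with a back-of-the-envelope calculation: vLLM claims roughly `GPU_MEM_UTIL` of total VRAM, and whatever the weights don't occupy is left mostly for the KV cache. A rough sketch (all numbers from the table above; this is an estimate only — vLLM's real accounting also reserves memory for activations and CUDA graphs):

```python
# Rough KV-cache headroom estimate for the default 2-GPU setup.
# Figures come from the Hardware table; GPU_MEM_UTIL is the script default.
VRAM_PER_GPU_GB = 46       # per NVIDIA L40S, as reported by nvidia-smi
NUM_GPUS = 2               # tensor parallel degree
GPU_MEM_UTIL = 0.92        # fraction of VRAM vLLM is allowed to claim
MODEL_SIZE_GB = 67         # Qwen3.5-35B-A3B weights in BF16

budget_gb = VRAM_PER_GPU_GB * NUM_GPUS * GPU_MEM_UTIL  # memory vLLM may use
kv_cache_gb = budget_gb - MODEL_SIZE_GB                # headroom for KV cache

print(f"vLLM budget: {budget_gb:.1f} GB, KV-cache headroom: {kv_cache_gb:.1f} GB")
```

The ~17 GB of headroom is what bounds concurrent context: raising `MAX_MODEL_LEN` or serving more simultaneous students eats into it, which is why `TENSOR_PARALLEL=4` is suggested later for more KV cache room.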
---

## Step-by-Step Setup

### Step 0: SSH into the Server

```bash
ssh herzogfloria@silicon.fhgr.ch
```

### Step 1: Clone the Repository

```bash
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```

### Step 2: Check GPU and Environment

```bash
nvidia-smi
apptainer --version
df -h ~
```

### Step 3: Build the Apptainer Container

```bash
bash 01_build_container.sh
```

Pulls the `vllm/vllm-openai:latest` Docker image, upgrades vLLM to nightly (required for Qwen3.5 support), installs the latest `transformers` from source, and packages everything into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.

### Step 4: Download the Model (~67 GB)

```bash
bash 02_download_model.sh
```

Downloads the Qwen3.5-35B-A3B weights using `huggingface-cli` **inside the container**. Stored at `~/models/Qwen3.5-35B-A3B`. Takes 5-30 minutes depending on bandwidth.

### Step 5: Start the Server

**Interactive (foreground) — recommended with tmux:**

```bash
tmux new -s llm
bash 03_start_server.sh
# Ctrl+B, then D to detach
```

**Background with logging:**

```bash
bash 04_start_server_background.sh
tail -f logs/vllm_server_*.log
```

The model takes 2-5 minutes to load into GPU memory.
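Rather than watching the log by eye during those minutes, a small standard-library poller can block until the API answers. A sketch (the URL and port match this setup's defaults; the `timeout` and `interval` values are arbitrary choices):

```python
# Poll the vLLM server until its OpenAI-compatible API responds.
import time
import urllib.request
import urllib.error

def wait_for_server(base_url="http://localhost:7080", timeout=300, interval=5):
    """Return True once GET {base_url}/v1/models succeeds, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as r:
                if r.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (model still loading)
        time.sleep(interval)
    return False
```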
It's ready when you see:

```
INFO:     Uvicorn running on http://0.0.0.0:7080
```

### Step 6: Test the Server

From another terminal on the server:

```bash
curl http://localhost:7080/v1/models
```

Or run the full test (uses the `openai` SDK inside the container):

```bash
apptainer exec --writable-tmpfs vllm_qwen.sif python3 test_server.py
```

### Step 7: Share with Students

Distribute `STUDENT_GUIDE.md` with connection details:

- **Base URL**: `http://silicon.fhgr.ch:7080/v1`
- **Model name**: `qwen3.5-35b-a3b`

---

## Configuration

All configuration is via environment variables passed to `03_start_server.sh`:

| Variable | Default | Description |
|-------------------|----------------------------|--------------------------------|
| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | Path to model weights |
| `PORT` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | Max context length (tokens) |
| `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use |
| `API_KEY` | *(empty = no auth)* | API key for authentication |
| `TENSOR_PARALLEL` | `2` | Number of GPUs |

### Examples

```bash
# Increase context length
MAX_MODEL_LEN=65536 bash 03_start_server.sh

# Add API key authentication
API_KEY="your-secret-key" bash 03_start_server.sh

# Use all 4 GPUs (more KV cache headroom)
TENSOR_PARALLEL=4 bash 03_start_server.sh
```

---

## Server Management

```bash
# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

# Reconnect to tmux session
tmux attach -t llm
```

---

## Files Overview

| File | Purpose |
|----------------------------------|------------------------------------------------------|
| `vllm_qwen.def` | Apptainer container definition (vLLM nightly + deps) |
| `01_build_container.sh` | Builds the Apptainer `.sif` image |
| `02_download_model.sh` | Downloads model weights (runs inside container) |
| `03_start_server.sh` | Starts vLLM server (foreground) |
| `04_start_server_background.sh` | Starts server in background with logging |
| `05_stop_server.sh` | Stops the background server |
| `test_server.py` | Tests the running server |
| `STUDENT_GUIDE.md` | Instructions for students |

---

## Troubleshooting

### "CUDA out of memory"

- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)

### Container build fails

- Ensure internet access and sufficient disk space (~20 GB for build cache)
- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:latest`

### "No NVIDIA GPU detected"

- Verify `nvidia-smi` works on the host
- Ensure the `--nv` flag is present (already in the scripts)
- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`

### "Model type qwen3_5_moe not recognized"

- The container needs vLLM nightly and the latest `transformers`
- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`

### Students can't connect

- Check firewall: ports 7080-7090 must be open
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must be on the university network or VPN

### Slow generation with many users

- Expected — vLLM batches requests, but throughput is finite
- The MoE architecture (3B active) helps with per-token speed
- Monitor: `curl http://localhost:7080/metrics`
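
When debugging connectivity, it helps to reproduce a request exactly as a student's client would send it, without involving the `openai` SDK. A minimal stdlib sketch (the prompt is a placeholder; the `Authorization` header is only needed if the server was started with `API_KEY`):

```python
# Build the same chat-completion request a student client would send.
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"

payload = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    # If API_KEY is set, also add: "Authorization": "Bearer <key>"
    headers={"Content-Type": "application/json"},
)

# Uncomment on the university network / VPN:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

If this fails where `curl http://localhost:7080/v1/models` succeeds on the server itself, the problem is almost certainly the firewall or the bind address, per the checklist above.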