# LLM Local — Qwen3.5-27B Inference Server

Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-27B**, served via **vLLM** inside an **Apptainer** container on a GPU server.

## Architecture

```
Students (OpenAI SDK / curl)
            │
            ▼
┌─────────────────────────┐
│  silicon.fhgr.ch:7080   │
│  OpenAI-compatible API  │
├─────────────────────────┤
│       vLLM Server       │
│  (Apptainer container)  │
├─────────────────────────┤
│  Qwen3.5-27B weights    │
│     (bind-mounted)      │
├─────────────────────────┤
│       NVIDIA GPU        │
└─────────────────────────┘
```

## Prerequisites

- **GPU**: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended). Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead.
- **Apptainer** (formerly Singularity) installed on the server.
- **NVIDIA drivers** + **nvidia-container-cli** for GPU passthrough.
- **~60 GB disk space** for model weights + ~15 GB for the container image.
- **Network**: Students must be on the university network or VPN.

## Hardware Sizing

| Component | Minimum         | Recommended     |
|-----------|-----------------|-----------------|
| GPU VRAM  | 80 GB (1× A100) | 80 GB (1× H100) |
| RAM       | 64 GB           | 128 GB          |
| Disk      | 100 GB free     | 200 GB free     |

> **If your GPU has less than 80 GB VRAM**, you have two options:
>
> 1. Use a **quantized** version (e.g., AWQ/GPTQ 4-bit — ~16 GB VRAM)
> 2. Use **tensor parallelism** across multiple GPUs (set `TENSOR_PARALLEL=2`)

---

## Step-by-Step Setup

### Step 0: SSH into the Server

```bash
ssh herzogfloria@silicon.fhgr.ch
```

### Step 1: Clone This Repository

```bash
# Clone the repository (or copy the files to the server manually)
git clone <repository-url> ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```

### Step 2: Check GPU and Environment

```bash
# Verify GPU is visible
nvidia-smi

# Verify Apptainer is installed
apptainer --version

# Check available disk space
df -h ~
```

### Step 3: Download the Model (~60 GB)

```bash
# Install huggingface-cli if not available
pip install --user "huggingface_hub[cli]"

# Download Qwen3.5-27B
bash 01_download_model.sh
# Default target: ~/models/Qwen3.5-27B
```

This downloads the full BF16 weights. It takes 20-60 minutes depending on bandwidth.

### Step 4: Build the Apptainer Container

```bash
bash 02_build_container.sh
```

This pulls the `vllm/vllm-openai:latest` Docker image and converts it to a `.sif` file. It takes 10-20 minutes; the resulting `vllm_qwen.sif` is ~12-15 GB.

> **Tip**: If the build fails due to network/proxy issues, you can pull the Docker image
> first and convert it manually:
>
> ```bash
> apptainer pull docker://vllm/vllm-openai:latest
> ```

### Step 5: Start the Server

**Interactive (foreground):**

```bash
bash 03_start_server.sh
```

**Background (recommended for production):**

```bash
bash 04_start_server_background.sh
```

The server takes 2-5 minutes to load the model into GPU memory.
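Instead of watching the log by hand, a script can block until the API actually answers. A minimal readiness-poller sketch in Python (standard library only; assumes the default port `7080` from this guide):

```python
import time
import urllib.request

def wait_until_ready(probe, timeout=300, interval=5):
    """Call probe() every `interval` seconds until it returns True
    or `timeout` seconds have passed; return whether it succeeded."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

def vllm_probe(url="http://localhost:7080/v1/models"):
    """True once the OpenAI-compatible endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

# Usage: wait_until_ready(vllm_probe) blocks until the model has loaded.
```

This is handy at the end of a deployment script, so follow-up steps (smoke tests, announcements to students) only run once the model is actually serving.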
Monitor with:

```bash
tail -f logs/vllm_server_*.log
```

Look for the line:

```
INFO:     Uvicorn running on http://0.0.0.0:7080
```

### Step 6: Test the Server

```bash
# Quick health check
curl http://localhost:7080/v1/models

# Full test
pip install openai
python test_server.py
```

### Step 7: Share with Students

Distribute the `STUDENT_GUIDE.md` file or share the connection details:

- **27B Base URL**: `http://silicon.fhgr.ch:7080/v1` — model name: `qwen3.5-27b`
- **35B Base URL**: `http://silicon.fhgr.ch:7081/v1` — model name: `qwen3.5-35b-a3b`

---

## Configuration

All configuration is via environment variables in `03_start_server.sh`:

| Variable          | Default                | Description                   |
|-------------------|------------------------|-------------------------------|
| `MODEL_DIR`       | `~/models/Qwen3.5-27B` | Path to model weights         |
| `PORT`            | `7080`                 | HTTP port                     |
| `MAX_MODEL_LEN`   | `32768`                | Max context length (tokens)   |
| `GPU_MEM_UTIL`    | `0.92`                 | Fraction of GPU memory to use |
| `API_KEY`         | *(empty = no auth)*    | API key for authentication    |
| `TENSOR_PARALLEL` | `1`                    | Number of GPUs                |

### Context Length Tuning

The default `MAX_MODEL_LEN=32768` is conservative and ensures stable operation for 15 concurrent users. If you have plenty of VRAM headroom:

```bash
MAX_MODEL_LEN=65536 bash 03_start_server.sh
```

Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require significantly more GPU memory for the KV cache.

### Adding Authentication

```bash
API_KEY="your-secret-key-here" bash 03_start_server.sh
```

Students then use this key in their `api_key` parameter.
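From the student side, these settings map onto a plain OpenAI-style HTTP request. A minimal client sketch using only the Python standard library (base URL, model name, and key placeholder are the values from this guide; `build_request` and `ask` are illustrative helper names):

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"   # 27B endpoint from this guide
API_KEY = "your-secret-key-here"              # placeholder; any string if auth is off

def build_request(prompt, model="qwen3.5-27b", max_tokens=256):
    """Assemble the JSON body and headers for a /chat/completions call."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    }
    return body, headers

def ask(prompt):
    """Send the request and return the assistant's reply text."""
    body, headers = build_request(prompt)
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (on the university network): print(ask("Say hello in one sentence."))
```

The `Authorization: Bearer …` header is what the `api_key` parameter of the OpenAI SDK sends under the hood, so curl, the SDK, and this sketch all hit the same endpoint the same way.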
### Multi-GPU Setup

If you have multiple GPUs:

```bash
TENSOR_PARALLEL=2 bash 03_start_server.sh
```

---

## Server Management

```bash
# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi
```

### Running Persistently with tmux

For a robust setup that survives SSH disconnects:

```bash
ssh herzogfloria@silicon.fhgr.ch
tmux new -s llm_server
bash 03_start_server.sh
# Press Ctrl+B, then D to detach

# Reconnect later:
tmux attach -t llm_server
```

---

## Files Overview

| File                            | Purpose                                   |
|---------------------------------|-------------------------------------------|
| `vllm_qwen.def`                 | Apptainer container definition            |
| `01_download_model.sh`          | Downloads model weights from Hugging Face |
| `02_build_container.sh`         | Builds the Apptainer `.sif` image         |
| `03_start_server.sh`            | Starts vLLM server (foreground)           |
| `04_start_server_background.sh` | Starts server in background with logging  |
| `05_stop_server.sh`             | Stops the background server               |
| `test_server.py`                | Tests the running server                  |
| `STUDENT_GUIDE.md`              | Instructions for students                 |

---

## Troubleshooting

### "CUDA out of memory"

- Reduce `MAX_MODEL_LEN` (e.g., 16384)
- Reduce `GPU_MEM_UTIL` (e.g., 0.85)
- Use a quantized model variant

### Container build fails

- Ensure you have internet access and sufficient disk space (~20 GB for the build cache)
- Try `apptainer pull docker://vllm/vllm-openai:latest` first

### "No NVIDIA GPU detected"

- Check that `nvidia-smi` works outside the container
- Ensure the `--nv` flag is passed (already in the scripts)
- Verify GPU passthrough: `apptainer exec --nv vllm_qwen.sif nvidia-smi`

### Server starts but students can't connect

- Check the firewall: `sudo ufw allow 7080:7090/tcp` or equivalent
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must use the server's hostname/IP,
not `localhost`

### Slow generation with many users

- This is expected — vLLM batches requests, but total throughput is finite
- Consider reducing `max_tokens` in student requests
- Monitor with `curl http://localhost:7080/metrics`
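The link between `MAX_MODEL_LEN`, `GPU_MEM_UTIL`, and out-of-memory errors is mostly KV-cache size. A back-of-the-envelope estimator helps pick a context length before restarting the server. The layer/head dimensions below are illustrative placeholders, not the actual Qwen3.5-27B architecture — substitute the values from the model's `config.json`:

```python
def kv_cache_gib(context_len, n_layers=64, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Rough KV-cache size in GiB for ONE sequence at full context length.
    Per token: 2 tensors (K and V) * layers * KV heads * head dim * bytes."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return context_len * per_token_bytes / 2**30

# With these placeholder dimensions, each token costs 256 KiB of cache:
for ctx in (16384, 32768, 65536):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gib(ctx):.0f} GiB per full-length sequence")
```

With many concurrent students, vLLM shares a cache pool across requests, so the usable context per user shrinks as concurrency grows — which is why the guide's default of 32768 is the safer choice for 15 users.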