# LLM Inferenz Server — Qwen3.5

Self-hosted LLM inference for ~15 concurrent students, served via **vLLM** inside an **Apptainer** container on a GPU server. Two models are available (one at a time):

| Model | Params | Active | Weights | GPUs |
|-------|--------|--------|---------|------|
| **Qwen3.5-35B-A3B** | 35B MoE | 3B | ~67 GB BF16 | 2× L40S (TP=2) |
| **Qwen3.5-122B-A10B-FP8** | 122B MoE | 10B | ~125 GB FP8 | 4× L40S (TP=4) |

Two front-ends are provided: **Open WebUI** (server-hosted ChatGPT-like UI) and a **Streamlit app** (local chat + file editor with code execution).

## Architecture

```
Students
  │
  ├── Browser ──► Open WebUI (silicon.fhgr.ch:7081)
  │                   │       ChatGPT-like UI, user accounts, chat history
  │                   │
  ├── Streamlit ──────┤       Local app with file editor & code runner
  │                   │
  └── SDK / curl ─────┘
                      ▼
        ┌──────────────────────────────┐
        │  silicon.fhgr.ch:7080        │
        │  OpenAI-compatible API       │
        ├──────────────────────────────┤
        │  vLLM Server (nightly)       │
        │  Apptainer container (.sif)  │
        ├──────────────────────────────┤
        │  Model weights               │
        │  (bind-mounted from host)    │
        ├──────────────────────────────┤
        │  4× NVIDIA L40S (46 GB ea.)  │
        │  184 GB total VRAM           │
        └──────────────────────────────┘
```

## Hardware

The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each, 184 GB total). Only one model runs at a time on port 7080.
| | Qwen3.5-35B-A3B | Qwen3.5-122B-A10B-FP8 |
|---|---|---|
| GPUs used | 2× L40S (TP=2) | 4× L40S (TP=4) |
| VRAM used | ~92 GB | ~184 GB |
| Weight size | ~67 GB (BF16) | ~125 GB (FP8) |
| Active params/token | 3B (MoE) | 10B (MoE) |
| Context length | 32,768 tokens | 32,768 tokens |
| Port | 7080 | 7080 |

## Prerequisites

- **Apptainer** (formerly Singularity) installed on the server
- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
- **~200 GB disk** for model weights (both models) + ~8 GB for the container image
- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)

> **Note**: No `pip` or `python` is needed on the host — everything runs inside
> the Apptainer container.

---

## Step-by-Step Setup

### Step 0: SSH into the Server

```bash
ssh <username>@silicon.fhgr.ch
```

### Step 1: Clone the Repository

```bash
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```

> **Note**: `git` is not installed on the host. Use the container:
> `apptainer exec vllm_qwen.sif git clone ...`
> Or copy files via `scp` from your local machine.

### Step 2: Check GPU and Environment

```bash
nvidia-smi
apptainer --version
df -h ~
```

### Step 3: Build the Apptainer Container

```bash
bash 01_build_container.sh
```

Pulls the `vllm/vllm-openai:nightly` Docker image (required for Qwen3.5 support), installs the latest `transformers` from source, and packages everything into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.

### Step 4: Download Model Weights

**35B model (~67 GB):**

```bash
bash 02_download_model.sh
```

**122B model (~125 GB):**

```bash
bash 10_download_model_122b.sh
```

Both use `huggingface-cli` **inside the container**. Weights are stored at `~/models/Qwen3.5-35B-A3B` and `~/models/Qwen3.5-122B-A10B-FP8` respectively.

### Step 5: Start the Server

Only one model can run at a time on port 7080.
Choose one:

**35B model (2 GPUs, faster per-token, smaller):**

```bash
bash 03_start_server.sh              # foreground
bash 04_start_server_background.sh   # background
```

**122B model (4 GPUs, more capable, FP8):**

```bash
bash 11_start_server_122b.sh              # foreground
bash 12_start_server_122b_background.sh   # background
```

**To switch models:**

```bash
bash 05_stop_server.sh          # stop whichever is running
bash 11_start_server_122b.sh    # start the other one
```

The model takes 2-5 minutes (35B) or 5-10 minutes (122B) to load. It's ready when you see:

```
INFO:     Uvicorn running on http://0.0.0.0:7080
```

### Step 6: Test the Server

From another terminal on the server:

```bash
curl http://localhost:7080/v1/models
```

Quick chat test:

```bash
curl http://localhost:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
```

### Step 7: Set Up Open WebUI (ChatGPT-like Interface)

Open WebUI provides a full-featured chat interface that runs on the server. Students access it via a browser — no local setup required.

**Pull the container:**

```bash
bash 06_setup_openwebui.sh
```

**Start (foreground with tmux):**

```bash
tmux new -s webui
bash 07_start_openwebui.sh
# Ctrl+B, then D to detach
```

**Start (background with logging):**

```bash
bash 08_start_openwebui_background.sh
tail -f logs/openwebui_*.log
```

Open WebUI is ready when you see `Uvicorn running` in the logs. Access it at `http://silicon.fhgr.ch:7081`.

> **Important**: The first user to sign up becomes the **admin**. Sign up
> yourself first before sharing the URL with students.
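Before sharing the server with students, you can also exercise the API from Python instead of curl. A minimal stdlib-only sketch of the Step 6 chat test (the model name assumes the 35B model is the one currently running):

```python
# Stdlib-only equivalent of the Step 6 curl test. Change BASE_URL to
# http://silicon.fhgr.ch:7080/v1 when calling from outside the server.
import json
import urllib.request

BASE_URL = "http://localhost:7080/v1"

def build_chat_request(prompt: str, model: str = "qwen3.5-35b-a3b",
                       max_tokens: int = 128) -> dict:
    """Assemble the JSON body for a POST to /chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """Send one user message and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (needs the server running):
#   print(chat("Hello!"))
```

Pass `model="qwen3.5-122b-a10b-fp8"` instead when the 122B model is the one running.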
### Step 8: Share with Students

Distribute `STUDENT_GUIDE.md` with connection details:

- **Open WebUI**: `http://silicon.fhgr.ch:7081` (recommended for most students)
- **API Base URL**: `http://silicon.fhgr.ch:7080/v1` (for SDK / programmatic use)
- **Model name**: `qwen3.5-35b-a3b` or `qwen3.5-122b-a10b-fp8` (depending on which is running)

---

## Open WebUI

A server-hosted ChatGPT-like interface backed by the vLLM inference server. Runs as an Apptainer container on port **7081**.

### Features

- User accounts with persistent chat history (stored in `openwebui-data/`)
- Auto-discovers models from the vLLM backend
- Streaming responses, markdown rendering, code highlighting
- Admin panel for managing users, models, and settings
- No local setup needed — students just open a browser

### Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `7081` | HTTP port for the UI |
| `VLLM_BASE_URL` | `http://localhost:7080/v1` | vLLM API endpoint |
| `VLLM_API_KEY` | `EMPTY` | API key (if vLLM requires one) |
| `DATA_DIR` | `./openwebui-data` | Persistent storage (DB, uploads) |

### Management

```bash
# Start in background
bash 08_start_openwebui_background.sh

# View logs
tail -f logs/openwebui_*.log

# Stop
bash 09_stop_openwebui.sh

# Reconnect to tmux session
tmux attach -t webui
```

### Data Persistence

All user data (accounts, chats, settings) is stored in `openwebui-data/`. This directory is bind-mounted into the container, so data survives container restarts. Back it up regularly.

---

## Streamlit App

A web-based chat and file editor that connects to the inference server. Students run it on their own machines.
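One notable feature: when the app saves a model response to a workspace file, any code is auto-extracted from the markdown fences. A minimal sketch of such an extractor (illustrative only, not the actual `app.py` implementation):

```python
import re

def extract_code(response_text: str) -> str:
    """Collect the contents of all fenced code blocks in a model response.

    If the response contains no fences, return it unchanged so plain-text
    answers can still be saved to a file.
    """
    blocks = re.findall(r"```[\w+-]*\n(.*?)```", response_text, re.DOTALL)
    return "\n".join(blocks) if blocks else response_text
```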
### Setup

```bash
pip install -r requirements.txt
```

Or with a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Run

```bash
streamlit run app.py
```

Opens at `http://localhost:8501` with two tabs:

- **Chat** — Conversational interface with streaming responses. Save the model's last response directly into a workspace file (code auto-extracted).
- **File Editor** — Create/edit `.py`, `.tex`, `.html`, or any text file. Use "Generate with LLM" to modify files via natural language instructions.

### Sidebar Controls

| Parameter | Default | Range | Purpose |
|-----------|---------|-------|---------|
| Thinking Mode | Off | Toggle | Chain-of-thought reasoning (slower, better for complex tasks) |
| Temperature | 0.7 | 0.0 – 2.0 | Creativity vs determinism |
| Max Tokens | 4096 | 256 – 16384 | Maximum response length |
| Top P | 0.95 | 0.0 – 1.0 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | 0.0 – 2.0 | Penalize repeated topics |

---

## Server Configuration

Both start scripts accept the same environment variables:

| Variable | 35B default | 122B default | Description |
|----------|-------------|--------------|-------------|
| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | `~/models/Qwen3.5-122B-A10B-FP8` | Model weights path |
| `PORT` | `7080` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | `32768` | Max context length |
| `GPU_MEM_UTIL` | `0.92` | `0.92` | GPU memory fraction |
| `API_KEY` | *(none)* | *(none)* | API key for auth |
| `TENSOR_PARALLEL` | `2` | `4` | Number of GPUs |

### Examples

```bash
# Increase context length (35B)
MAX_MODEL_LEN=65536 bash 03_start_server.sh

# Increase context length (122B — has room with FP8)
MAX_MODEL_LEN=65536 bash 11_start_server_122b.sh

# Add API key authentication (works for either model)
API_KEY="your-secret-key" bash 11_start_server_122b.sh
```

---

## Server Management

```bash
# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

# Reconnect to tmux session
tmux attach -t llm
```

---

## Files Overview

| File | Purpose |
|------|---------|
| `vllm_qwen.def` | Apptainer container definition (vLLM nightly + deps) |
| `01_build_container.sh` | Builds the Apptainer `.sif` image |
| `02_download_model.sh` | Downloads 35B model weights |
| `03_start_server.sh` | Starts 35B vLLM server (foreground, TP=2) |
| `04_start_server_background.sh` | Starts 35B server in background with logging |
| `05_stop_server.sh` | Stops whichever background vLLM server is running |
| `06_setup_openwebui.sh` | Pulls the Open WebUI container image |
| `07_start_openwebui.sh` | Starts Open WebUI (foreground) |
| `08_start_openwebui_background.sh` | Starts Open WebUI in background with logging |
| `09_stop_openwebui.sh` | Stops the background Open WebUI |
| `10_download_model_122b.sh` | Downloads 122B FP8 model weights |
| `11_start_server_122b.sh` | Starts 122B vLLM server (foreground, TP=4) |
| `12_start_server_122b_background.sh` | Starts 122B server in background with logging |
| `app.py` | Streamlit chat & file editor web app |
| `requirements.txt` | Python dependencies for the Streamlit app |
| `test_server.py` | Tests the running server via CLI |
| `STUDENT_GUIDE.md` | Instructions for students |

---

## Troubleshooting

### "CUDA out of memory"

- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)

### Container build fails

- Ensure internet access and sufficient disk space (~20 GB for build cache)
- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:nightly`

### "No NVIDIA GPU detected"

- Verify `nvidia-smi` works on the host
- Ensure the `--nv` flag is present (already in the scripts)
- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`

### "Model type qwen3_5_moe not recognized"

- The container needs `vllm/vllm-openai:nightly` (not `:latest`)
- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`

### Students can't connect

- Check the firewall: ports 7080-7090 must be open
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must be on the university network or VPN

### Slow generation with many users

- Expected — vLLM batches requests, but throughput is finite
- The MoE architecture (3B active) helps with per-token speed
- Disable thinking mode for faster simple responses
- Monitor: `curl http://localhost:7080/metrics`

### Open WebUI won't start

- Ensure the vLLM server is running first on port 7080
- Check that port 7081 is not already in use: `ss -tlnp | grep 7081`
- Check the logs: `tail -50 logs/openwebui_*.log`
- If the database is corrupted, reset it: `rm openwebui-data/webui.db` and restart

### Open WebUI shows no models

- Verify vLLM is reachable: `curl http://localhost:7080/v1/models`
- The OpenAI API base URL is set on first launch; if it changes later, update it in the Open WebUI Admin Panel > Settings > Connections

### Syncing files to the server

No `git` or `pip` on the host — use `scp` from your local machine:

```bash
scp app.py 03_start_server.sh <username>@silicon.fhgr.ch:~/LLM_local/
```