
LLM Inferenz Server — Qwen3.5

Self-hosted LLM inference for ~15 concurrent students, served via vLLM inside an Apptainer container on a GPU server. Two models are available (one at a time):

Model                   Params     Active   Weights       GPUs
Qwen3.5-35B-A3B         35B MoE    3B       ~67 GB BF16   2× L40S (TP=2)
Qwen3.5-122B-A10B-FP8   122B MoE   10B      ~125 GB FP8   4× L40S (TP=4)

Two front-ends are provided: Open WebUI (server-hosted ChatGPT-like UI) and a Streamlit app (local chat + file editor with code execution).

Architecture

Students
  │
  ├── Browser ──► Open WebUI (silicon.fhgr.ch:7081)
  │                  │  ChatGPT-like UI, user accounts, chat history
  │                  │
  ├── Streamlit ─────┤  Local app with file editor & code runner
  │                  │
  └── SDK / curl ────┘
                     ▼
          ┌──────────────────────────────┐
          │  silicon.fhgr.ch:7080        │
          │  OpenAI-compatible API       │
          ├──────────────────────────────┤
          │  vLLM Server (nightly)       │
          │  Apptainer container (.sif)  │
          ├──────────────────────────────┤
          │  Model weights               │
          │  (bind-mounted from host)    │
          ├──────────────────────────────┤
          │  4× NVIDIA L40S (46 GB ea.)  │
          │  184 GB total VRAM           │
          └──────────────────────────────┘

Hardware

The server silicon.fhgr.ch has 4× NVIDIA L40S GPUs (46 GB VRAM each, 184 GB total). Only one model runs at a time on port 7080.

                     Qwen3.5-35B-A3B    Qwen3.5-122B-A10B-FP8
GPUs used            2× L40S (TP=2)     4× L40S (TP=4)
VRAM used            ~92 GB             ~184 GB
Weight size          ~67 GB (BF16)      ~125 GB (FP8)
Active params/token  3B (MoE)           10B (MoE)
Context length       32,768 tokens      32,768 tokens
Port                 7080               7080
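As a back-of-envelope check of why the 35B model needs exactly two GPUs, the arithmetic below uses only the numbers from the table above (a sketch, not vLLM's exact memory accounting):

```python
# Rough VRAM budget for the 35B model on 2× L40S. vLLM claims
# GPU_MEM_UTIL of total VRAM; whatever the weights don't occupy
# is left over for KV cache and activations.
weights_gb = 67                     # BF16 weight size (from table)
gpus, vram_each = 2, 46             # 2× L40S
budget = gpus * vram_each * 0.92    # default GPU_MEM_UTIL
kv_headroom = budget - weights_gb
print(round(budget, 1), round(kv_headroom, 1))  # → 84.6 17.6
```

With only ~18 GB left for KV cache across both GPUs, this also explains why reducing MAX_MODEL_LEN is the first remedy for out-of-memory errors.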

Prerequisites

  • Apptainer (formerly Singularity) installed on the server
  • NVIDIA drivers with GPU passthrough support (--nv flag)
  • ~200 GB disk for model weights (both models) + ~8 GB for the container image
  • Network access to Hugging Face (for model download) and Docker Hub (for container build)

Note: No pip or python is needed on the host — everything runs inside the Apptainer container.


Step-by-Step Setup

Step 0: SSH into the Server

ssh <name>@silicon.fhgr.ch

Step 1: Clone the Repository

git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh

Note: git is not installed on the host. Clone from inside the container (apptainer exec vllm_qwen.sif git clone ...) or copy the files via scp from your local machine.

Step 2: Check GPU and Environment

nvidia-smi
apptainer --version
df -h ~

Step 3: Build the Apptainer Container

bash 01_build_container.sh

Pulls the vllm/vllm-openai:nightly Docker image (required for Qwen3.5 support), installs the latest transformers from source, and packages everything into vllm_qwen.sif (~8 GB). The build takes 15-20 minutes.

Step 4: Download Model Weights

35B model (~67 GB):

bash 02_download_model.sh

122B model (~125 GB):

bash 10_download_model_122b.sh

Both scripts run huggingface-cli inside the container and store the weights at ~/models/Qwen3.5-35B-A3B and ~/models/Qwen3.5-122B-A10B-FP8, respectively.

Step 5: Start the Server

Only one model can run at a time on port 7080. Choose one:

35B model (2 GPUs, faster per-token, smaller):

bash 03_start_server.sh                  # foreground
bash 04_start_server_background.sh       # background

122B model (4 GPUs, more capable, FP8):

bash 11_start_server_122b.sh             # foreground
bash 12_start_server_122b_background.sh  # background

To switch models:

bash 05_stop_server.sh           # stop whichever is running
bash 11_start_server_122b.sh     # start the other one

The model takes 2-5 minutes (35B) or 5-10 minutes (122B) to load. It's ready when you see:

INFO:     Uvicorn running on http://0.0.0.0:7080

Step 6: Test the Server

From another terminal on the server:

curl http://localhost:7080/v1/models

Quick chat test:

curl http://localhost:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
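The same request can be sent from Python using only the standard library (a minimal sketch; the model name matches the 35B server — swap in qwen3.5-122b-a10b-fp8 when that model is running, and use silicon.fhgr.ch instead of localhost from outside the server):

```python
import json
import urllib.request

BASE_URL = "http://localhost:7080/v1"
MODEL = "qwen3.5-35b-a3b"

def build_payload(prompt, max_tokens=128):
    """Assemble the JSON body for /v1/chat/completions."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt):
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The official openai SDK works the same way: point its base_url at the endpoint above and pass any non-empty string as the API key (unless API_KEY is set on the server).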

Step 7: Set Up Open WebUI (ChatGPT-like Interface)

Open WebUI provides a full-featured chat interface that runs on the server. Students access it via a browser — no local setup required.

Pull the container:

bash 06_setup_openwebui.sh

Start (foreground with tmux):

tmux new -s webui
bash 07_start_openwebui.sh
# Ctrl+B, then D to detach

Start (background with logging):

bash 08_start_openwebui_background.sh
tail -f logs/openwebui_*.log

Open WebUI is ready when you see Uvicorn running in the logs. Access it at http://silicon.fhgr.ch:7081.

Important: The first user to sign up becomes the admin. Sign up yourself before sharing the URL with students.

Step 8: Share with Students

Distribute STUDENT_GUIDE.md with connection details:

  • Open WebUI: http://silicon.fhgr.ch:7081 (recommended for most students)
  • API Base URL: http://silicon.fhgr.ch:7080/v1 (for SDK / programmatic use)
  • Model name: qwen3.5-35b-a3b or qwen3.5-122b-a10b-fp8 (depending on which is running)

Open WebUI

A server-hosted ChatGPT-like interface backed by the vLLM inference server. Runs as an Apptainer container on port 7081.

Features

  • User accounts with persistent chat history (stored in openwebui-data/)
  • Auto-discovers models from the vLLM backend
  • Streaming responses, markdown rendering, code highlighting
  • Admin panel for managing users, models, and settings
  • No local setup needed — students just open a browser

Configuration

Variable        Default                   Description
PORT            7081                      HTTP port for the UI
VLLM_BASE_URL   http://localhost:7080/v1  vLLM API endpoint
VLLM_API_KEY    EMPTY                     API key (if vLLM requires one)
DATA_DIR        ./openwebui-data          Persistent storage (DB, uploads)

Management

# Start in background
bash 08_start_openwebui_background.sh

# View logs
tail -f logs/openwebui_*.log

# Stop
bash 09_stop_openwebui.sh

# Reconnect to tmux session
tmux attach -t webui

Data Persistence

All user data (accounts, chats, settings) is stored in openwebui-data/. This directory is bind-mounted into the container, so data survives container restarts. Back it up regularly.


Streamlit App

A web-based chat and file editor that connects to the inference server. Students run it on their own machines.

Setup

pip install -r requirements.txt

Or with a virtual environment:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run

streamlit run app.py

Opens at http://localhost:8501 with two tabs:

  • Chat — Conversational interface with streaming responses. Save the model's last response directly into a workspace file (code auto-extracted).
  • File Editor — Create/edit .py, .tex, .html, or any text file. Use "Generate with LLM" to modify files via natural language instructions.
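The "code auto-extracted" step could look roughly like this — a sketch only, as app.py's actual implementation may differ:

```python
import re

FENCE = "`" * 3   # build the fence marker without writing a literal one here
CODE_RE = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)

def extract_code(response: str) -> str:
    """Return the first fenced code block in a reply, else the reply as-is."""
    m = CODE_RE.search(response)
    return m.group(1) if m else response

reply = f"Here you go:\n{FENCE}python\nprint('hi')\n{FENCE}\nEnjoy!"
print(extract_code(reply))  # → print('hi')
```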

Sidebar Controls

Parameter         Default  Range      Purpose
Thinking Mode     Off      Toggle     Chain-of-thought reasoning (slower, better for complex tasks)
Temperature       0.7      0.0-2.0    Creativity vs. determinism
Max Tokens        4096     256-16384  Maximum response length
Top P             0.95     0.0-1.0    Nucleus sampling threshold
Presence Penalty  0.0      0.0-2.0    Penalize repeated topics
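The sampling parameters map directly onto fields of the OpenAI-compatible request body. Thinking Mode is typically a per-request toggle; if the backend follows the Qwen3/vLLM convention it is passed via chat_template_kwargs — an assumption here, so verify the key against the Qwen3.5 chat template actually deployed:

```python
# Hypothetical request body with thinking mode disabled, following the
# Qwen3/vLLM convention (chat_template_kwargs). Check your server's docs
# before relying on this key for Qwen3.5.
body = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.95,
    "chat_template_kwargs": {"enable_thinking": False},  # skip chain-of-thought
}
print(body["chat_template_kwargs"])  # → {'enable_thinking': False}
```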

Server Configuration

Both start scripts accept the same environment variables:

Variable         35B default               122B default                    Description
MODEL_DIR        ~/models/Qwen3.5-35B-A3B  ~/models/Qwen3.5-122B-A10B-FP8  Model weights path
PORT             7080                      7080                            HTTP port
MAX_MODEL_LEN    32768                     32768                           Max context length
GPU_MEM_UTIL     0.92                      0.92                            GPU memory fraction
API_KEY          (none)                    (none)                          API key for auth
TENSOR_PARALLEL  2                         4                               Number of GPUs

Examples

# Increase context length (35B)
MAX_MODEL_LEN=65536 bash 03_start_server.sh

# Increase context length (122B — has room with FP8)
MAX_MODEL_LEN=65536 bash 11_start_server_122b.sh

# Add API key authentication (works for either model)
API_KEY="your-secret-key" bash 11_start_server_122b.sh

Server Management

# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

# Reconnect to tmux session
tmux attach -t llm

Files Overview

File                                Purpose
vllm_qwen.def                       Apptainer container definition (vLLM nightly + deps)
01_build_container.sh               Builds the Apptainer .sif image
02_download_model.sh                Downloads 35B model weights
03_start_server.sh                  Starts 35B vLLM server (foreground, TP=2)
04_start_server_background.sh       Starts 35B server in background with logging
05_stop_server.sh                   Stops whichever background vLLM server is running
06_setup_openwebui.sh               Pulls the Open WebUI container image
07_start_openwebui.sh               Starts Open WebUI (foreground)
08_start_openwebui_background.sh    Starts Open WebUI in background with logging
09_stop_openwebui.sh                Stops the background Open WebUI
10_download_model_122b.sh           Downloads 122B FP8 model weights
11_start_server_122b.sh             Starts 122B vLLM server (foreground, TP=4)
12_start_server_122b_background.sh  Starts 122B server in background with logging
app.py                              Streamlit chat & file editor web app
requirements.txt                    Python dependencies for the Streamlit app
test_server.py                      Tests the running server via CLI
STUDENT_GUIDE.md                    Instructions for students

Troubleshooting

"CUDA out of memory"

  • Reduce MAX_MODEL_LEN (e.g., 16384)
  • Reduce GPU_MEM_UTIL (e.g., 0.85)

Container build fails

  • Ensure internet access and sufficient disk space (~20 GB for build cache)
  • Try pulling manually first: apptainer pull docker://vllm/vllm-openai:nightly

"No NVIDIA GPU detected"

  • Verify nvidia-smi works on the host
  • Ensure --nv flag is present (already in scripts)
  • Test: apptainer exec --nv vllm_qwen.sif nvidia-smi

"Model type qwen3_5_moe not recognized"

  • The container needs vllm/vllm-openai:nightly (not :latest)
  • Rebuild the container: rm vllm_qwen.sif && bash 01_build_container.sh

Students can't connect

  • Check firewall: ports 7080-7090 must be open
  • Verify the server binds to 0.0.0.0 (not just localhost)
  • Students must be on the university network or VPN

Slow generation with many users

  • Expected — vLLM batches requests but throughput is finite
  • The MoE architecture (3B or 10B active parameters per token) keeps per-token generation fast
  • Disable thinking mode for faster simple responses
  • Monitor: curl http://localhost:7080/metrics
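The /metrics endpoint returns Prometheus text exposition; a minimal parser sketch for ad-hoc inspection (metric names such as vllm:num_requests_running vary across vLLM versions, and a real deployment would scrape this with Prometheus instead):

```python
# Parse Prometheus text exposition into {metric_with_labels: value}.
def parse_metrics(text: str) -> dict:
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue                      # skip HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass                          # ignore malformed lines
    return out

sample = ("# HELP vllm:num_requests_running Running requests.\n"
          "vllm:num_requests_running 3.0")
print(parse_metrics(sample))  # → {'vllm:num_requests_running': 3.0}
```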

Open WebUI won't start

  • Ensure the vLLM server is running first on port 7080
  • Check that port 7081 is not already in use: ss -tlnp | grep 7081
  • Check logs: tail -50 logs/openwebui_*.log
  • If the database is corrupted, reset: rm openwebui-data/webui.db and restart

Open WebUI shows no models

  • Verify vLLM is reachable: curl http://localhost:7080/v1/models
  • The OpenAI API base URL is set on first launch; if changed later, update it in the Open WebUI Admin Panel > Settings > Connections

Syncing files to the server

  • No git or pip on the host — use scp from your local machine:
    scp app.py 03_start_server.sh <name>@silicon.fhgr.ch:~/LLM_local/