LLM Inference Server — Qwen3.5
Self-hosted LLM inference for ~15 concurrent students, served via vLLM inside an Apptainer container on a GPU server. Two models are available (one at a time):
| Model | Total params | Active params/token | Weights | GPUs |
|---|---|---|---|---|
| Qwen3.5-35B-A3B | 35B MoE | 3B | ~67 GB BF16 | 2× L40S (TP=2) |
| Qwen3.5-122B-A10B-FP8 | 122B MoE | 10B | ~125 GB FP8 | 4× L40S (TP=4) |
Two front-ends are provided: Open WebUI (server-hosted ChatGPT-like UI) and a Streamlit app (local chat + file editor with code execution).
Architecture
Students
│
├── Browser ──► Open WebUI (silicon.fhgr.ch:7081)
│ │ ChatGPT-like UI, user accounts, chat history
│ │
├── Streamlit ─────┤ Local app with file editor & code runner
│ │
└── SDK / curl ────┘
▼
┌──────────────────────────────┐
│ silicon.fhgr.ch:7080 │
│ OpenAI-compatible API │
├──────────────────────────────┤
│ vLLM Server (nightly) │
│ Apptainer container (.sif) │
├──────────────────────────────┤
│ Model weights │
│ (bind-mounted from host) │
├──────────────────────────────┤
│ 4× NVIDIA L40S (46 GB ea.) │
│ 184 GB total VRAM │
└──────────────────────────────┘
Hardware
The server silicon.fhgr.ch has 4× NVIDIA L40S GPUs (46 GB VRAM each,
184 GB total). Only one model runs at a time on port 7080.
| | Qwen3.5-35B-A3B | Qwen3.5-122B-A10B-FP8 |
|---|---|---|
| GPUs used | 2× L40S (TP=2) | 4× L40S (TP=4) |
| VRAM used | ~92 GB | ~184 GB |
| Weight size | ~67 GB (BF16) | ~125 GB (FP8) |
| Active params/token | 3B (MoE) | 10B (MoE) |
| Context length | 32,768 tokens | 32,768 tokens |
| Port | 7080 | 7080 |
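As a sanity check on the numbers above, here is the rough per-GPU memory budget for the 35B configuration (illustrative arithmetic only — real vLLM allocation also covers activations and CUDA graphs):

```python
# Rough per-GPU memory budget for the 35B model on 2x L40S.
# Numbers come from the table above; GPU_MEM_UTIL=0.92 is the
# start-script default documented later in this README.
GPU_VRAM_GB = 46
GPU_MEM_UTIL = 0.92          # fraction of VRAM vLLM is allowed to use
WEIGHTS_GB = 67              # BF16 weights, sharded across GPUs
TENSOR_PARALLEL = 2

budget_per_gpu = GPU_VRAM_GB * GPU_MEM_UTIL          # ~42.3 GB usable
weights_per_gpu = WEIGHTS_GB / TENSOR_PARALLEL       # ~33.5 GB
kv_cache_per_gpu = budget_per_gpu - weights_per_gpu  # ~8.8 GB for KV cache

print(f"usable:   {budget_per_gpu:.1f} GB/GPU")
print(f"weights:  {weights_per_gpu:.1f} GB/GPU")
print(f"kv cache: {kv_cache_per_gpu:.1f} GB/GPU")
```

The leftover after weights is what vLLM can spend on KV cache, which is why the 122B model needs all four GPUs.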
Prerequisites
- Apptainer (formerly Singularity) installed on the server
- NVIDIA drivers with GPU passthrough support (`--nv` flag)
- ~200 GB disk for model weights (both models) + ~8 GB for the container image
- Network access to Hugging Face (for model download) and Docker Hub (for container build)

Note: No `pip` or `python` is needed on the host — everything runs inside the Apptainer container.
Step-by-Step Setup
Step 0: SSH into the Server
ssh herzogfloria@silicon.fhgr.ch
Step 1: Clone the Repository
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
Note: `git` is not installed on the host. Use the container: `apptainer exec vllm_qwen.sif git clone ...`. Or copy files via `scp` from your local machine.
Step 2: Check GPU and Environment
nvidia-smi
apptainer --version
df -h ~
Step 3: Build the Apptainer Container
bash 01_build_container.sh
Pulls the vllm/vllm-openai:nightly Docker image (required for Qwen3.5
support), installs latest transformers from source, and packages everything
into vllm_qwen.sif (~8 GB). Takes 15-20 minutes.
Step 4: Download Model Weights
35B model (~67 GB):
bash 02_download_model.sh
122B model (~125 GB):
bash 10_download_model_122b.sh
Both use huggingface-cli inside the container. Stored at
~/models/Qwen3.5-35B-A3B and ~/models/Qwen3.5-122B-A10B-FP8 respectively.
Step 5: Start the Server
Only one model can run at a time on port 7080. Choose one:
35B model (2 GPUs, faster per-token, smaller):
bash 03_start_server.sh # foreground
bash 04_start_server_background.sh # background
122B model (4 GPUs, more capable, FP8):
bash 11_start_server_122b.sh # foreground
bash 12_start_server_122b_background.sh # background
To switch models:
bash 05_stop_server.sh # stop whichever is running
bash 11_start_server_122b.sh # start the other one
The model takes 2-5 minutes (35B) or 5-10 minutes (122B) to load. It's ready when you see:
INFO: Uvicorn running on http://0.0.0.0:7080
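If you script the startup (e.g. to launch Open WebUI only after the model is loaded), you can poll the API instead of watching the log. A minimal sketch — the URL default, retry count, and interval are arbitrary choices, not values taken from the start scripts:

```shell
# Poll the OpenAI-compatible API until the model has finished loading.
# Usage: wait_for_vllm [url] [tries] [interval_seconds]
wait_for_vllm() {
  local url="${1:-http://localhost:7080/v1/models}"
  local tries="${2:-60}"
  local interval="${3:-10}"
  local i
  for i in $(seq "$tries"); do
    if curl -sf "$url" >/dev/null 2>&1; then
      echo "vLLM is ready"
      return 0
    fi
    sleep "$interval"
  done
  echo "vLLM did not become ready in time" >&2
  return 1
}
```

For example, `wait_for_vllm && bash 08_start_openwebui_background.sh` starts Open WebUI as soon as the API answers.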
Step 6: Test the Server
From another terminal on the server:
curl http://localhost:7080/v1/models
Quick chat test:
curl http://localhost:7080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
Step 7: Set Up Open WebUI (ChatGPT-like Interface)
Open WebUI provides a full-featured chat interface that runs on the server. Students access it via a browser — no local setup required.
Pull the container:
bash 06_setup_openwebui.sh
Start (foreground with tmux):
tmux new -s webui
bash 07_start_openwebui.sh
# Ctrl+B, then D to detach
Start (background with logging):
bash 08_start_openwebui_background.sh
tail -f logs/openwebui_*.log
Open WebUI is ready when you see Uvicorn running in the logs.
Access it at http://silicon.fhgr.ch:7081.
Important: The first user to sign up becomes the admin. Sign up yourself first before sharing the URL with students.
Step 8: Share with Students
Distribute STUDENT_GUIDE.md with connection details:
- Open WebUI: `http://silicon.fhgr.ch:7081` (recommended for most students)
- API Base URL: `http://silicon.fhgr.ch:7080/v1` (for SDK / programmatic use)
- Model name: `qwen3.5-35b-a3b` or `qwen3.5-122b-a10b-fp8` (depending on which is running)
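For programmatic use, any OpenAI-compatible client works against the API base URL. A dependency-free sketch using only the Python standard library; the Bearer header is needed only if the server was started with `API_KEY`:

```python
import json
import urllib.request

def build_chat_request(model, prompt, max_tokens=128, temperature=0.7):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(base_url, model, prompt, api_key=None):
    """POST the payload to the vLLM server and return the reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    req = urllib.request.Request(f"{base_url}/chat/completions",
                                 data=body, headers=headers)
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

For example: `print(chat("http://silicon.fhgr.ch:7080/v1", "qwen3.5-35b-a3b", "Hello!"))`.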
Open WebUI
A server-hosted ChatGPT-like interface backed by the vLLM inference server. Runs as an Apptainer container on port 7081.
Features
- User accounts with persistent chat history (stored in `openwebui-data/`)
- Auto-discovers models from the vLLM backend
- Streaming responses, markdown rendering, code highlighting
- Admin panel for managing users, models, and settings
- No local setup needed — students just open a browser
Configuration
| Variable | Default | Description |
|---|---|---|
| `PORT` | `7081` | HTTP port for the UI |
| `VLLM_BASE_URL` | `http://localhost:7080/v1` | vLLM API endpoint |
| `VLLM_API_KEY` | `EMPTY` | API key (if vLLM requires one) |
| `DATA_DIR` | `./openwebui-data` | Persistent storage (DB, uploads) |
Management
# Start in background
bash 08_start_openwebui_background.sh
# View logs
tail -f logs/openwebui_*.log
# Stop
bash 09_stop_openwebui.sh
# Reconnect to tmux session
tmux attach -t webui
Data Persistence
All user data (accounts, chats, settings) is stored in openwebui-data/.
This directory is bind-mounted into the container, so data survives
container restarts. Back it up regularly.
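One simple way to take that backup is a timestamped tarball — a sketch, assuming you run it from the clone directory (`~/LLM_local` from Step 1). Stop Open WebUI first (`bash 09_stop_openwebui.sh`) so the SQLite DB is not written mid-copy, and restart it afterwards:

```shell
# Snapshot a data directory into a timestamped .tar.gz next to it
# and print the archive name.
backup_dir() {
  local src="$1"
  local out="${src%/}-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
  tar czf "$out" "$src" && echo "$out"
}
```

Usage: `backup_dir openwebui-data` produces e.g. `openwebui-data-backup-20250101-120000.tar.gz`, which you can copy off the server with `scp`.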
Streamlit App
A web-based chat and file editor that connects to the inference server. Students run it on their own machines.
Setup
pip install -r requirements.txt
Or with a virtual environment:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Run
streamlit run app.py
Opens at http://localhost:8501 with two tabs:
- Chat — Conversational interface with streaming responses. Save the model's last response directly into a workspace file (code auto-extracted).
- File Editor — Create/edit `.py`, `.tex`, `.html`, or any text file. Use "Generate with LLM" to modify files via natural language instructions.
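The "code auto-extracted" step can be done with a small regex over the model's markdown reply. A sketch of one way to do it (not necessarily how `app.py` implements it) — return the first fenced code block, or the whole reply if there is none:

```python
import re

# Matches a fenced block: opening fence with optional language tag,
# lazily captured body, closing fence.
FENCE_RE = re.compile(r"```[a-zA-Z0-9_+-]*\n(.*?)```", re.DOTALL)

def extract_code(reply: str) -> str:
    """Return the first fenced code block, or the full reply if none."""
    match = FENCE_RE.search(reply)
    return match.group(1).strip() if match else reply.strip()
```

This keeps plain-text answers usable while stripping the markdown scaffolding from code answers before saving to a workspace file.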
Sidebar Controls
| Parameter | Default | Range | Purpose |
|---|---|---|---|
| Thinking Mode | Off | Toggle | Chain-of-thought reasoning (slower, better for complex tasks) |
| Temperature | 0.7 | 0.0 – 2.0 | Creativity vs determinism |
| Max Tokens | 4096 | 256 – 16384 | Maximum response length |
| Top P | 0.95 | 0.0 – 1.0 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | 0.0 – 2.0 | Penalize repeated topics |
Server Configuration
Both start scripts accept the same environment variables:
| Variable | 35B default | 122B default | Description |
|---|---|---|---|
| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | `~/models/Qwen3.5-122B-A10B-FP8` | Model weights path |
| `PORT` | `7080` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | `32768` | Max context length |
| `GPU_MEM_UTIL` | `0.92` | `0.92` | GPU memory fraction |
| `API_KEY` | (none) | (none) | API key for auth |
| `TENSOR_PARALLEL` | `2` | `4` | Number of GPUs |
Examples
# Increase context length (35B)
MAX_MODEL_LEN=65536 bash 03_start_server.sh
# Increase context length (122B — has room with FP8)
MAX_MODEL_LEN=65536 bash 11_start_server_122b.sh
# Add API key authentication (works for either model)
API_KEY="your-secret-key" bash 11_start_server_122b.sh
Server Management
# Start in background
bash 04_start_server_background.sh
# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool
# View logs
tail -f logs/vllm_server_*.log
# Stop
bash 05_stop_server.sh
# Monitor GPU usage
watch -n 2 nvidia-smi
# Reconnect to tmux session
tmux attach -t llm
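For logging GPU memory over time rather than watching interactively, `nvidia-smi` has a machine-readable CSV mode. A small parser sketch; the query flags are standard `nvidia-smi` options:

```python
import subprocess

def gpu_memory_used_mib(csv_text=None):
    """Return per-GPU used memory in MiB.

    If csv_text is None, query nvidia-smi; otherwise parse the given
    CSV output (one number per line, one line per GPU).
    """
    if csv_text is None:
        csv_text = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True).stdout
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]
```

Calling `gpu_memory_used_mib()` on the host returns something like `[42310, 42310, 0, 0]` while the 35B model is running on two of the four GPUs.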
Files Overview
| File | Purpose |
|---|---|
| `vllm_qwen.def` | Apptainer container definition (vLLM nightly + deps) |
| `01_build_container.sh` | Builds the Apptainer `.sif` image |
| `02_download_model.sh` | Downloads 35B model weights |
| `03_start_server.sh` | Starts 35B vLLM server (foreground, TP=2) |
| `04_start_server_background.sh` | Starts 35B server in background with logging |
| `05_stop_server.sh` | Stops whichever background vLLM server is running |
| `06_setup_openwebui.sh` | Pulls the Open WebUI container image |
| `07_start_openwebui.sh` | Starts Open WebUI (foreground) |
| `08_start_openwebui_background.sh` | Starts Open WebUI in background with logging |
| `09_stop_openwebui.sh` | Stops the background Open WebUI |
| `10_download_model_122b.sh` | Downloads 122B FP8 model weights |
| `11_start_server_122b.sh` | Starts 122B vLLM server (foreground, TP=4) |
| `12_start_server_122b_background.sh` | Starts 122B server in background with logging |
| `app.py` | Streamlit chat & file editor web app |
| `requirements.txt` | Python dependencies for the Streamlit app |
| `test_server.py` | Tests the running server via CLI |
| `STUDENT_GUIDE.md` | Instructions for students |
Troubleshooting
"CUDA out of memory"
- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)
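Why reducing `MAX_MODEL_LEN` helps: the KV cache a full-length sequence can occupy grows linearly with context length. Illustrative arithmetic with placeholder layer/head numbers (not the real Qwen3.5 architecture config):

```python
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes
# All architecture numbers below are made up for illustration.
layers, kv_heads, head_dim, dtype_bytes = 48, 4, 128, 2  # BF16 = 2 bytes

bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # 98,304 B

for max_len in (32768, 16384):
    gib = bytes_per_token * max_len / 2**30
    print(f"max_model_len={max_len}: ~{gib:.1f} GiB per full-length sequence")
```

Halving the context length halves the worst-case cache per sequence, so more requests fit in the memory left over after the weights.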
Container build fails
- Ensure internet access and sufficient disk space (~20 GB for build cache)
- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:nightly`
"No NVIDIA GPU detected"
- Verify `nvidia-smi` works on the host
- Ensure the `--nv` flag is present (already in scripts)
- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
"Model type qwen3_5_moe not recognized"
- The container needs `vllm/vllm-openai:nightly` (not `:latest`)
- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
Students can't connect
- Check firewall: ports 7080-7090 must be open
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must be on the university network or VPN
Slow generation with many users
- Expected — vLLM batches requests but throughput is finite
- The MoE architecture (only 3B or 10B params active per token) helps per-token speed
- Disable thinking mode for faster simple responses
- Monitor: `curl http://localhost:7080/metrics`
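The `/metrics` endpoint serves Prometheus text format, which is easy to scrape in a script. A sketch of a minimal parser; the metric names shown in the usage note are vLLM's at time of writing, so check the raw output on your version:

```python
def scrape_metric(text, name):
    """Return the first value of a metric from Prometheus text-format
    output, or None if it is absent. Handles labeled lines like
    metric_name{label="x"} 3.0 by taking the last whitespace-separated
    token as the value."""
    for line in text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(None, 1)[-1])
    return None
```

Fetch the body with `curl http://localhost:7080/metrics` (or `urllib.request.urlopen`), then e.g. `scrape_metric(body, "vllm:num_requests_running")` and `scrape_metric(body, "vllm:num_requests_waiting")` show current load.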
Open WebUI won't start
- Ensure the vLLM server is running first on port 7080
- Check that port 7081 is not already in use: `ss -tlnp | grep 7081`
- Check logs: `tail -50 logs/openwebui_*.log`
- If the database is corrupted, reset: `rm openwebui-data/webui.db` and restart
Open WebUI shows no models
- Verify vLLM is reachable: `curl http://localhost:7080/v1/models`
- The OpenAI API base URL is set on first launch; if changed later, update it in the Open WebUI Admin Panel > Settings > Connections
Syncing files to the server
- No `git` or `pip` on the host — use `scp` from your local machine: `scp app.py 03_start_server.sh herzogfloria@silicon.fhgr.ch:~/LLM_local/`