herzogflorian f4fdaab732 Add Open WebUI integration and enhance Streamlit app

- Add Open WebUI scripts (06-09) for server-hosted ChatGPT-like interface
  connected to the vLLM backend on port 7081
- Add context window management to chat (auto-trim, token counter, progress bar)
- Add terminal output panel to file editor for running Python/LaTeX files
- Update README with Open WebUI setup, architecture diagram, and troubleshooting
- Update STUDENT_GUIDE with step-by-step Open WebUI login instructions

Made-with: Cursor

2026-03-02 18:48:51 +01:00

README.md

LLM Inferenz Server — Qwen3.5-35B-A3B

Self-hosted LLM inference for ~15 concurrent students using Qwen3.5-35B-A3B (MoE, 35B total / 3B active per token), served via vLLM inside an Apptainer container on a GPU server. Two front-ends are provided: Open WebUI (server-hosted ChatGPT-like UI) and a Streamlit app (local chat + file editor with code execution).

Architecture

Students
  │
  ├── Browser ──► Open WebUI (silicon.fhgr.ch:7081)
  │                  │  ChatGPT-like UI, user accounts, chat history
  │                  │
  ├── Streamlit ─────┤  Local app with file editor & code runner
  │                  │
  └── SDK / curl ────┘
                     ▼
          ┌──────────────────────────────┐
          │  silicon.fhgr.ch:7080        │
          │  OpenAI-compatible API       │
          ├──────────────────────────────┤
          │  vLLM Server (nightly)       │
          │  Apptainer container (.sif)  │
          ├──────────────────────────────┤
          │  Qwen3.5-35B-A3B weights     │
          │  (bind-mounted from host)    │
          ├──────────────────────────────┤
          │  2× NVIDIA L40S (46 GB ea.)  │
          │  Tensor Parallel = 2         │
          └──────────────────────────────┘

Hardware

The server silicon.fhgr.ch has 4× NVIDIA L40S GPUs (46 GB VRAM each). The inference server uses 2 GPUs with tensor parallelism, leaving 2 GPUs free.

Component	Value
GPUs used	2× NVIDIA L40S
VRAM available	~92 GB total (2 × 46 GB)
Model size (BF16)	~67 GB
Active params/token	3B (MoE)
Context length	32,768 tokens
Port	7080
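The memory budget implied by this table can be checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate only: the function name is illustrative, the 0.92 factor is the default GPU_MEM_UTIL from the Server Configuration section, and activation memory, CUDA graphs, and framework overhead are ignored, so the result is an upper bound on the space left for the KV cache.

```python
def kv_cache_headroom_gb(num_gpus: int, vram_per_gpu_gb: float,
                         gpu_mem_util: float, weights_gb: float) -> float:
    """Memory left after loading the weights, in GB (rough upper bound)."""
    budget_gb = num_gpus * vram_per_gpu_gb * gpu_mem_util
    return budget_gb - weights_gb


# Values from this README: 2x L40S at 46 GB, GPU_MEM_UTIL=0.92, ~67 GB of BF16 weights.
headroom = kv_cache_headroom_gb(2, 46, 0.92, 67)
print(f"~{headroom:.1f} GB left for KV cache and activations")
```

With four GPUs the same formula gives roughly 102 GB, which is the "more KV cache headroom" mentioned for TENSOR_PARALLEL=4.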

Prerequisites

Apptainer (formerly Singularity) installed on the server
NVIDIA drivers with GPU passthrough support (--nv flag)
~80 GB disk for model weights + ~8 GB for the container image
Network access to Hugging Face (for model download) and Docker Hub (for container build)

Note: No pip or python is needed on the host; everything runs inside the Apptainer container.

Step-by-Step Setup

Step 0: SSH into the Server

ssh herzogfloria@silicon.fhgr.ch

Step 1: Clone the Repository

git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh

Note: git is not installed on the host. Use the container: apptainer exec vllm_qwen.sif git clone ... This only works once vllm_qwen.sif has been built (Step 3), so on a fresh server copy the files via scp from your local machine instead.

Step 2: Check GPU and Environment

nvidia-smi
apptainer --version
df -h ~

Step 3: Build the Apptainer Container

bash 01_build_container.sh

Pulls the vllm/vllm-openai:nightly Docker image (required for Qwen3.5 support), installs the latest transformers from source, and packages everything into vllm_qwen.sif (~8 GB). The build takes 15-20 minutes.

Step 4: Download the Model (~67 GB)

bash 02_download_model.sh

Downloads Qwen3.5-35B-A3B weights using huggingface-cli inside the container. Stored at ~/models/Qwen3.5-35B-A3B. Takes 5-30 minutes depending on bandwidth.

Step 5: Start the Server

Interactive (foreground) — recommended with tmux:

tmux new -s llm
bash 03_start_server.sh
# Ctrl+B, then D to detach

Background with logging:

bash 04_start_server_background.sh
tail -f logs/vllm_server_*.log

The model takes 2-5 minutes to load into GPU memory. It's ready when you see:

INFO:     Uvicorn running on http://0.0.0.0:7080
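Instead of watching the log, readiness can also be polled over HTTP. The following is a minimal sketch, not part of the repository: the function names, the 10-minute timeout, and the 5-second interval are assumptions, and it relies only on the /v1/models endpoint used in Step 6. If the server was started with API_KEY set, an unauthenticated probe will be rejected and reported as not ready.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str = "http://localhost:7080/v1") -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool] = models_endpoint_ok,
                     timeout_s: float = 600.0, interval_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() every interval_s seconds until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_until_ready() with no arguments blocks until the local server answers or the timeout expires; the probe and sleep parameters are injectable so the loop can be exercised without a running server.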

Step 6: Test the Server

From another terminal on the server:

curl http://localhost:7080/v1/models

Quick chat test:

curl http://localhost:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
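The same request can be sent from Python with only the standard library. This is a sketch under the assumption that the server follows the usual OpenAI chat-completions request and response shape; the helper names are illustrative and are not taken from test_server.py.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:7080/v1",
                       model: str = "qwen3.5-35b-a3b",
                       max_tokens: int = 128,
                       api_key: str = "") -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:  # only needed when the server was started with API_KEY set
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


# Usage (requires the server from Step 5 to be running):
#   with urllib.request.urlopen(build_chat_request("Hello!"), timeout=120) as resp:
#       print(extract_reply(resp.read()))
```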

Step 7: Set Up Open WebUI (ChatGPT-like Interface)

Open WebUI provides a full-featured chat interface that runs on the server. Students access it via a browser — no local setup required.

Pull the container:

bash 06_setup_openwebui.sh

Start (foreground with tmux):

tmux new -s webui
bash 07_start_openwebui.sh
# Ctrl+B, then D to detach

Start (background with logging):

bash 08_start_openwebui_background.sh
tail -f logs/openwebui_*.log

Open WebUI is ready when you see Uvicorn running in the logs. Access it at http://silicon.fhgr.ch:7081.

Important: The first user to sign up becomes the admin. Sign up yourself first before sharing the URL with students.

Step 8: Share with Students

Distribute STUDENT_GUIDE.md with connection details:

Open WebUI: http://silicon.fhgr.ch:7081 (recommended for most students)
API Base URL: http://silicon.fhgr.ch:7080/v1 (for SDK / programmatic use)
Model name: qwen3.5-35b-a3b

Open WebUI

A server-hosted ChatGPT-like interface backed by the vLLM inference server. Runs as an Apptainer container on port 7081.

Features

User accounts with persistent chat history (stored in openwebui-data/)
Auto-discovers models from the vLLM backend
Streaming responses, markdown rendering, code highlighting
Admin panel for managing users, models, and settings
No local setup needed — students just open a browser

Configuration

Variable	Default	Description
`PORT`	`7081`	HTTP port for the UI
`VLLM_BASE_URL`	`http://localhost:7080/v1`	vLLM API endpoint
`VLLM_API_KEY`	`EMPTY`	API key (if vLLM requires one)
`DATA_DIR`	`./openwebui-data`	Persistent storage (DB, uploads)

Management

# Start in background
bash 08_start_openwebui_background.sh

# View logs
tail -f logs/openwebui_*.log

# Stop
bash 09_stop_openwebui.sh

# Reconnect to tmux session
tmux attach -t webui

Data Persistence

All user data (accounts, chats, settings) is stored in openwebui-data/. This directory is bind-mounted into the container, so data survives container restarts. Back it up regularly.
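One possible way to take such a backup is a timestamped archive of the whole directory. The sketch below is an assumption rather than a script shipped with the repository: the function name and the backups/ target folder are hypothetical, and it needs to run somewhere Python 3 is available. Stopping Open WebUI first (09_stop_openwebui.sh) is the safer choice, since a database file copied while it is being written may be inconsistent.

```python
import tarfile
import time
from pathlib import Path


def backup_data_dir(data_dir: str = "openwebui-data",
                    backup_dir: str = "backups") -> Path:
    """Write a timestamped .tar.gz of data_dir into backup_dir and return its path."""
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"data directory not found: {src}")
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{src.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=src.name)  # keep the top-level folder name in the archive
    return archive
```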

Streamlit App

A web-based chat and file editor that connects to the inference server. Students run it on their own machines.

Setup

pip install -r requirements.txt

Or with a virtual environment:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run

streamlit run app.py

Opens at http://localhost:8501 with two tabs:

Chat — Conversational interface with streaming responses. Save the model's last response directly into a workspace file (code auto-extracted).
File Editor — Create/edit .py, .tex, .html, or any text file. Use "Generate with LLM" to modify files via natural language instructions.
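The "code auto-extracted" behaviour of the Chat tab can be pictured as pulling the first fenced block out of the model's Markdown reply. How app.py actually does this is not shown here, so the following is only a sketch of one plausible approach; the function name and the fallback to the full text are assumptions.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep this example readable

# Opening fence, optional language tag, newline, then the body up to the closing fence.
_BLOCK = re.compile(FENCE + r"[ \t]*([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced code block in response, or the full text if none."""
    match = _BLOCK.search(response)
    if match is None:
        return response.strip()
    return match.group(2).rstrip("\n") + "\n"
```

A reply that contains several blocks would need a policy for choosing or concatenating them; this sketch simply takes the first.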

Sidebar Controls

Parameter	Default	Range	Purpose
Thinking Mode	Off	Toggle	Chain-of-thought reasoning (slower, better for complex tasks)
Temperature	0.7	0.0 – 2.0	Creativity vs determinism
Max Tokens	4096	256 – 16384	Maximum response length
Top P	0.95	0.0 – 1.0	Nucleus sampling threshold
Presence Penalty	0.0	0.0 – 2.0	Penalize repeated topics
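The numeric controls correspond to standard fields of an OpenAI-style chat completion request. The sketch below copies the defaults and ranges from the table and clamps out-of-range values; the helper name is illustrative, and Thinking Mode is left out because the way app.py toggles it is not documented here.

```python
# Allowed ranges copied from the Sidebar Controls table.
RANGES = {
    "temperature": (0.0, 2.0),
    "max_tokens": (256, 16384),
    "top_p": (0.0, 1.0),
    "presence_penalty": (0.0, 2.0),
}
DEFAULTS = {"temperature": 0.7, "max_tokens": 4096, "top_p": 0.95, "presence_penalty": 0.0}


def sampling_params(**overrides) -> dict:
    """Merge overrides into the defaults, clamping each value to its allowed range."""
    unknown = set(overrides) - set(RANGES)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    return {name: min(max(value, RANGES[name][0]), RANGES[name][1])
            for name, value in params.items()}
```

The returned dictionary can be merged into the request body next to "model" and "messages".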

Server Configuration

All configuration is via environment variables passed to 03_start_server.sh:

Variable	Default	Description
`MODEL_DIR`	`~/models/Qwen3.5-35B-A3B`	Path to model weights
`PORT`	`7080`	HTTP port
`MAX_MODEL_LEN`	`32768`	Max context length (tokens)
`GPU_MEM_UTIL`	`0.92`	Fraction of GPU memory to use
`API_KEY`	(empty = no auth)	API key for authentication
`TENSOR_PARALLEL`	`2`	Number of GPUs
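To make the mapping concrete, here is one way these variables could translate into a vLLM command line. This is an assumption about what 03_start_server.sh does, not a copy of it: the script also wraps the call in apptainer with bind mounts, the model path inside the container may differ, and the served model name is inferred from the name used in the curl test. The flags themselves are regular `vllm serve` options.

```python
import os
from typing import Mapping


def vllm_command(env: Mapping[str, str] = os.environ) -> list[str]:
    """Translate the documented environment variables into a `vllm serve` argument list."""
    model_dir = os.path.expanduser(env.get("MODEL_DIR", "~/models/Qwen3.5-35B-A3B"))
    cmd = [
        "vllm", "serve", model_dir,
        "--host", "0.0.0.0",
        "--port", env.get("PORT", "7080"),
        "--max-model-len", env.get("MAX_MODEL_LEN", "32768"),
        "--gpu-memory-utilization", env.get("GPU_MEM_UTIL", "0.92"),
        "--tensor-parallel-size", env.get("TENSOR_PARALLEL", "2"),
        "--served-model-name", "qwen3.5-35b-a3b",
    ]
    if env.get("API_KEY"):  # empty or unset means no authentication
        cmd += ["--api-key", env["API_KEY"]]
    return cmd
```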

Examples

# Increase context length
MAX_MODEL_LEN=65536 bash 03_start_server.sh

# Add API key authentication
API_KEY="your-secret-key" bash 03_start_server.sh

# Use all 4 GPUs (more KV cache headroom)
TENSOR_PARALLEL=4 bash 03_start_server.sh

Server Management

# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

# Reconnect to tmux session
tmux attach -t llm

Files Overview

File	Purpose
`vllm_qwen.def`	Apptainer container definition (vLLM nightly + deps)
`01_build_container.sh`	Builds the Apptainer `.sif` image
`02_download_model.sh`	Downloads model weights (runs inside container)
`03_start_server.sh`	Starts vLLM server (foreground)
`04_start_server_background.sh`	Starts vLLM server in background with logging
`05_stop_server.sh`	Stops the background vLLM server
`06_setup_openwebui.sh`	Pulls the Open WebUI container image
`07_start_openwebui.sh`	Starts Open WebUI (foreground)
`08_start_openwebui_background.sh`	Starts Open WebUI in background with logging
`09_stop_openwebui.sh`	Stops the background Open WebUI
`app.py`	Streamlit chat & file editor web app
`requirements.txt`	Python dependencies for the Streamlit app
`test_server.py`	Tests the running server via CLI
`STUDENT_GUIDE.md`	Instructions for students

Troubleshooting

"CUDA out of memory"

Reduce MAX_MODEL_LEN (e.g., 16384)
Reduce GPU_MEM_UTIL (e.g., 0.85)

Container build fails

Ensure internet access and sufficient disk space (~20 GB for build cache)
Try pulling manually first: apptainer pull docker://vllm/vllm-openai:nightly

"No NVIDIA GPU detected"

Verify nvidia-smi works on the host
Ensure --nv flag is present (already in scripts)
Test: apptainer exec --nv vllm_qwen.sif nvidia-smi

"Model type qwen3_5_moe not recognized"

The container needs vllm/vllm-openai:nightly (not :latest)
Rebuild the container: rm vllm_qwen.sif && bash 01_build_container.sh

Students can't connect

Check firewall: ports 7080-7090 must be open
Verify the server binds to 0.0.0.0 (not just localhost)
Students must be on the university network or VPN
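A student can run a quick reachability test before asking for help. The sketch below only checks whether a TCP connection can be opened; the function names are illustrative, and the default host and ports are the ones listed in this README.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, timed out, unreachable, or DNS failure
        return False


def check_course_ports(host: str = "silicon.fhgr.ch", ports=(7080, 7081)) -> dict:
    """Map each port to whether it is reachable from this machine."""
    return {port: can_connect(host, port) for port in ports}
```

If both ports report False off campus but True on the university network, the cause is most likely the VPN requirement rather than the server.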

Slow generation with many users

Expected — vLLM batches requests but throughput is finite
The MoE architecture (3B active) helps with per-token speed
Disable thinking mode for faster simple responses
Monitor: curl http://localhost:7080/metrics
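The /metrics endpoint returns Prometheus text format, which is long to read by eye. A small parser can reduce it to the queue figures that matter under load. This sketch assumes the gauges are named vllm:num_requests_running and vllm:num_requests_waiting, which may differ between vLLM versions, and that samples carry no trailing timestamp.

```python
def parse_metrics(text: str, wanted=("vllm:num_requests_running",
                                     "vllm:num_requests_waiting")) -> dict:
    """Sum the samples of each wanted metric in Prometheus text-format output."""
    totals = {name: 0.0 for name in wanted}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in totals:
            totals[name] += float(value)
    return totals
```

Pass it the text returned by the curl command above; a persistently high waiting count suggests that requests are queuing behind the running batch.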

Open WebUI won't start

Ensure the vLLM server is running first on port 7080
Check that port 7081 is not already in use: ss -tlnp | grep 7081
Check logs: tail -50 logs/openwebui_*.log
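If the database is corrupted, reset it: rm openwebui-data/webui.db and restart. This discards the accounts and chat history stored in that file, so back up openwebui-data/ first.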

Open WebUI shows no models

Verify vLLM is reachable: curl http://localhost:7080/v1/models
The OpenAI API base URL is set on first launch; if changed later, update it in the Open WebUI Admin Panel > Settings > Connections
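When checking the curl output, the relevant part is the list of model IDs. The sketch below extracts them from the response body, assuming the usual OpenAI-style list shape with a "data" array; the function name is illustrative.

```python
import json


def model_ids(models_response: str) -> list[str]:
    """Extract model IDs from the JSON body returned by GET /v1/models."""
    body = json.loads(models_response)
    return [entry["id"] for entry in body.get("data", [])]
```

If qwen3.5-35b-a3b is in the returned list but Open WebUI still shows nothing, the problem is more likely the connection URL configured in Open WebUI than the vLLM server.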

Syncing files to the server

No git or pip on the host — use scp from your local machine:

scp app.py 03_start_server.sh herzogfloria@silicon.fhgr.ch:~/LLM_local/
