
LLM Inferenz Server — Qwen3.5-35B-A3B

Self-hosted LLM inference for ~15 concurrent students using Qwen3.5-35B-A3B (MoE, 35B total / 3B active per token), served via vLLM inside an Apptainer container on a GPU server.

Architecture

Students (OpenAI SDK / curl)
        │
        ▼
  ┌──────────────────────────────┐
  │  silicon.fhgr.ch:7080        │
  │  OpenAI-compatible API       │
  ├──────────────────────────────┤
  │  vLLM Server (nightly)       │
  │  Apptainer container (.sif)  │
  ├──────────────────────────────┤
  │  Qwen3.5-35B-A3B weights     │
  │  (bind-mounted from host)    │
  ├──────────────────────────────┤
  │  2× NVIDIA L40S (46 GB ea.)  │
  │  Tensor Parallel = 2         │
  └──────────────────────────────┘

Hardware

The server silicon.fhgr.ch has 4× NVIDIA L40S GPUs (46 GB VRAM each). The inference server uses 2 GPUs with tensor parallelism, leaving 2 GPUs free.

| Component | Value |
|---|---|
| GPUs used | 2× NVIDIA L40S |
| VRAM used | ~92 GB total |
| Model size (BF16) | ~67 GB |
| Active params/token | 3B (MoE) |
| Context length | 32,768 tokens |
| Port | 7080 |
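The VRAM figures can be sanity-checked with back-of-the-envelope arithmetic. A sketch (ballpark only — vLLM's actual allocation also covers activations and CUDA graphs, and the ~67 GB checkpoint on disk is slightly smaller than the naive parameter count suggests):

```python
# Rough VRAM budget for Qwen3.5-35B-A3B in BF16 on 2x L40S.
total_params = 35e9        # total parameters (MoE; only ~3B active per token)
bytes_per_param = 2        # BF16
gpus = 2
vram_per_gpu = 46          # GB per L40S
gpu_mem_util = 0.92        # GPU_MEM_UTIL default in this setup

usable = gpus * vram_per_gpu * gpu_mem_util      # memory vLLM will claim
weights = total_params * bytes_per_param / 1e9   # naive weight footprint in GB
kv_cache_headroom = usable - weights             # what's left for the KV cache

print(f"usable: {usable:.1f} GB, weights: {weights:.1f} GB, "
      f"KV-cache headroom: {kv_cache_headroom:.1f} GB")
```

The modest headroom is why `MAX_MODEL_LEN` is capped at 32,768 with two GPUs; running with `TENSOR_PARALLEL=4` roughly doubles the KV-cache budget.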

Prerequisites

  • Apptainer (formerly Singularity) installed on the server
  • NVIDIA drivers with GPU passthrough support (--nv flag)
  • ~80 GB disk for model weights + ~8 GB for the container image
  • Network access to Hugging Face (for model download) and Docker Hub (for container build)

Note: No pip or python is needed on the host — everything runs inside the Apptainer container.


Step-by-Step Setup

Step 0: SSH into the Server

ssh herzogfloria@silicon.fhgr.ch

Step 1: Clone the Repository

git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh

Step 2: Check GPU and Environment

nvidia-smi
apptainer --version
df -h ~

Step 3: Build the Apptainer Container

bash 01_build_container.sh

Pulls the vllm/vllm-openai:latest Docker image, upgrades vLLM to nightly (required for Qwen3.5 support), installs latest transformers from source, and packages everything into vllm_qwen.sif (~8 GB). Takes 15-20 minutes.
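The definition file is roughly shaped like the following (a hedged sketch, not the exact contents of vllm_qwen.def; the nightly wheel index and the transformers source install shown are the standard vLLM/Hugging Face mechanisms):

```
Bootstrap: docker
From: vllm/vllm-openai:latest

%post
    # Upgrade to a vLLM nightly wheel (needed for Qwen3.5 / qwen3_5_moe support)
    pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
    # Latest transformers from source for the new model type
    pip install -U git+https://github.com/huggingface/transformers
```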

Step 4: Download the Model (~67 GB)

bash 02_download_model.sh

Downloads Qwen3.5-35B-A3B weights using huggingface-cli inside the container. Stored at ~/models/Qwen3.5-35B-A3B. Takes 5-30 minutes depending on bandwidth.

Step 5: Start the Server

Interactive (foreground) — recommended with tmux:

tmux new -s llm
bash 03_start_server.sh
# Ctrl+B, then D to detach

Background with logging:

bash 04_start_server_background.sh
tail -f logs/vllm_server_*.log

The model takes 2-5 minutes to load into GPU memory. It's ready when you see:

INFO:     Uvicorn running on http://0.0.0.0:7080
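Instead of watching the log, readiness can also be polled from a script. A minimal sketch using only the Python standard library (probing the `/v1/models` endpoint configured above):

```python
import time
from urllib import request, error

def server_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True once GET /v1/models answers with HTTP 200."""
    try:
        with request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (error.URLError, OSError):
        return False

def wait_for_server(base_url: str, max_wait: float = 300.0,
                    interval: float = 5.0) -> bool:
    """Poll until the server is up or max_wait seconds have passed."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if server_ready(base_url):
            return True
        time.sleep(interval)
    return False
```

For example, `wait_for_server("http://localhost:7080")` returns once the model has finished loading, which takes the 2-5 minutes noted above.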

Step 6: Test the Server

From another terminal on the server:

curl http://localhost:7080/v1/models

Or run the full test (uses the openai SDK inside the container):

apptainer exec --writable-tmpfs vllm_qwen.sif python3 test_server.py

Step 7: Share with Students

Distribute STUDENT_GUIDE.md with connection details:

  • Base URL: http://silicon.fhgr.ch:7080/v1
  • Model name: qwen3.5-35b-a3b
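Students using the openai SDK just point `base_url` at the address above. For anyone who cannot install packages, the same endpoint works with nothing but the standard library — a sketch (endpoint and model name as listed; `API_KEY` is only needed if the server was started with one):

```python
import json
from urllib import request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"
MODEL = "qwen3.5-35b-a3b"

def build_chat_request(messages, model=MODEL, base_url=BASE_URL, api_key=None):
    """Build an OpenAI-compatible /chat/completions request."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    payload = json.dumps({"model": model, "messages": messages}).encode()
    return request.Request(f"{base_url}/chat/completions", data=payload,
                           headers=headers, method="POST")

def chat(prompt: str) -> str:
    """Send a single user prompt and return the assistant's reply."""
    req = build_chat_request([{"role": "user", "content": prompt}])
    with request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

`print(chat("Hello!"))` then returns a completion, provided the server is running and the client is on the university network or VPN.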

Configuration

All configuration is via environment variables passed to 03_start_server.sh:

| Variable | Default | Description |
|---|---|---|
| MODEL_DIR | ~/models/Qwen3.5-35B-A3B | Path to model weights |
| PORT | 7080 | HTTP port |
| MAX_MODEL_LEN | 32768 | Max context length (tokens) |
| GPU_MEM_UTIL | 0.92 | Fraction of GPU memory to use |
| API_KEY | (empty = no auth) | API key for authentication |
| TENSOR_PARALLEL | 2 | Number of GPUs |
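Internally these variables map onto vLLM's standard CLI flags. A sketch of that mapping (the actual 03_start_server.sh may differ in detail; the flag names are vLLM's documented ones):

```python
import os

def build_vllm_command(env=None):
    """Compose a `vllm serve` command line from the environment variables above."""
    env = os.environ if env is None else env
    model_dir = env.get("MODEL_DIR", os.path.expanduser("~/models/Qwen3.5-35B-A3B"))
    cmd = [
        "vllm", "serve", model_dir,
        "--host", "0.0.0.0",                 # bind to all interfaces, not just localhost
        "--port", env.get("PORT", "7080"),
        "--max-model-len", env.get("MAX_MODEL_LEN", "32768"),
        "--gpu-memory-utilization", env.get("GPU_MEM_UTIL", "0.92"),
        "--tensor-parallel-size", env.get("TENSOR_PARALLEL", "2"),
        "--served-model-name", "qwen3.5-35b-a3b",
    ]
    if env.get("API_KEY"):                   # empty API_KEY means no auth
        cmd += ["--api-key", env["API_KEY"]]
    return cmd
```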

Examples

# Increase context length
MAX_MODEL_LEN=65536 bash 03_start_server.sh

# Add API key authentication
API_KEY="your-secret-key" bash 03_start_server.sh

# Use all 4 GPUs (more KV cache headroom)
TENSOR_PARALLEL=4 bash 03_start_server.sh

Server Management

# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

# Reconnect to tmux session
tmux attach -t llm

Files Overview

| File | Purpose |
|---|---|
| vllm_qwen.def | Apptainer container definition (vLLM nightly + deps) |
| 01_build_container.sh | Builds the Apptainer .sif image |
| 02_download_model.sh | Downloads model weights (runs inside container) |
| 03_start_server.sh | Starts vLLM server (foreground) |
| 04_start_server_background.sh | Starts server in background with logging |
| 05_stop_server.sh | Stops the background server |
| test_server.py | Tests the running server |
| STUDENT_GUIDE.md | Instructions for students |

Troubleshooting

"CUDA out of memory"

  • Reduce MAX_MODEL_LEN (e.g., 16384)
  • Reduce GPU_MEM_UTIL (e.g., 0.85)

Container build fails

  • Ensure internet access and sufficient disk space (~20 GB for build cache)
  • Try pulling manually first: apptainer pull docker://vllm/vllm-openai:latest

"No NVIDIA GPU detected"

  • Verify nvidia-smi works on the host
  • Ensure --nv flag is present (already in scripts)
  • Test: apptainer exec --nv vllm_qwen.sif nvidia-smi

"Model type qwen3_5_moe not recognized"

  • The container needs vLLM nightly and latest transformers
  • Rebuild the container: rm vllm_qwen.sif && bash 01_build_container.sh

Students can't connect

  • Check firewall: ports 7080-7090 must be open
  • Verify the server binds to 0.0.0.0 (not just localhost)
  • Students must be on the university network or VPN

Slow generation with many users

  • Expected — vLLM batches requests but throughput is finite
  • The MoE architecture (3B active) helps with per-token speed
  • Monitor: curl http://localhost:7080/metrics
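The /metrics endpoint returns Prometheus-format text. A small sketch for pulling out the queue-related gauges (metric names such as `vllm:num_requests_running` are the ones vLLM exports, but verify against your build's actual output):

```python
def parse_gauges(metrics_text: str, prefix: str = "vllm:num_requests"):
    """Extract matching gauge values from Prometheus-format /metrics output."""
    gauges = {}
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.startswith(prefix):
            continue  # skip comments and unrelated metrics
        name, _, value = line.rpartition(" ")
        gauges[name] = float(value)
    return gauges

# Illustrative sample of the exposition format (not a real server response):
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running{model_name="qwen3.5-35b-a3b"} 3.0
vllm:num_requests_waiting{model_name="qwen3.5-35b-a3b"} 7.0
"""
print(parse_gauges(sample))
```

Watching running vs. waiting counts shows whether slowness comes from queueing (waiting > 0 for long stretches) or from raw generation speed.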