herzogflorian deee5038d1 Update README to reflect current project state

Add Streamlit app section with setup, usage, and sidebar controls.
Document nightly Docker image requirement, scp workflow for server
sync, and practical troubleshooting tips from setup experience.

Made-with: Cursor

2026-03-02 16:42:33 +01:00


LLM Inferenz Server — Qwen3.5-35B-A3B

Self-hosted LLM inference for ~15 concurrent students using Qwen3.5-35B-A3B (MoE, 35B total / 3B active per token), served via vLLM inside an Apptainer container on a GPU server. Includes a Streamlit web app for chat and file editing.

Architecture

Students (Streamlit App / OpenAI SDK / curl)
        │
        ▼
  ┌──────────────────────────────┐
  │  silicon.fhgr.ch:7080       │
  │  OpenAI-compatible API      │
  ├──────────────────────────────┤
  │  vLLM Server (nightly)      │
  │  Apptainer container (.sif) │
  ├──────────────────────────────┤
  │  Qwen3.5-35B-A3B weights    │
  │  (bind-mounted from host)   │
  ├──────────────────────────────┤
  │  2× NVIDIA L40S (46 GB ea.) │
  │  Tensor Parallel = 2        │
  └──────────────────────────────┘

Hardware

The server silicon.fhgr.ch has 4× NVIDIA L40S GPUs (46 GB VRAM each). The inference server uses 2 GPUs with tensor parallelism, leaving 2 GPUs free.

Component	Value
GPUs used	2× NVIDIA L40S
VRAM used	~92 GB total
Model size (BF16)	~67 GB
Active params/token	3B (MoE)
Context length	32,768 tokens
Port	7080
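
As a rough sanity check on the table above, the VRAM left over for the KV cache can be estimated from these figures. This is a back-of-the-envelope sketch, not a measurement: it assumes the default GPU_MEM_UTIL of 0.92 from the Server Configuration section, ignores activations and other runtime overhead, and the function name is purely illustrative.

```python
def kv_cache_headroom_gb(gpus=2, vram_per_gpu_gb=46.0, gpu_mem_util=0.92, weights_gb=67.0):
    """Rough upper bound on the VRAM (in GB) left for KV cache after loading weights.

    Activations, CUDA graphs, and framework overhead are ignored, so the real
    headroom will be smaller than the value returned here.
    """
    budget_gb = gpus * vram_per_gpu_gb * gpu_mem_util  # memory vLLM may claim
    return budget_gb - weights_gb


# Compare the default 2-GPU setup with a 4-GPU setup.
for gpus in (2, 4):
    print(f"{gpus} GPUs: ~{kv_cache_headroom_gb(gpus=gpus):.1f} GB headroom")
```

With the defaults this gives roughly 17.6 GB for two GPUs, which is why lowering MAX_MODEL_LEN or adding GPUs are the main levers when memory runs short.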

Prerequisites

Apptainer (formerly Singularity) installed on the server
NVIDIA drivers with GPU passthrough support (--nv flag)
~80 GB disk for model weights + ~8 GB for the container image
Network access to Hugging Face (for model download) and Docker Hub (for container build)

Note: No pip or Python is needed on the host; everything runs inside the Apptainer container.

Step-by-Step Setup

Step 0: SSH into the Server

ssh herzogfloria@silicon.fhgr.ch

Step 1: Clone the Repository

git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh

Note: git is not installed on the host. Either copy the files via scp from your local machine, or, once the container from Step 3 exists, run git inside it: apptainer exec vllm_qwen.sif git clone ...

Step 2: Check GPU and Environment

nvidia-smi
apptainer --version
df -h ~

Step 3: Build the Apptainer Container

bash 01_build_container.sh

Pulls the vllm/vllm-openai:nightly Docker image (required for Qwen3.5 support), installs the latest transformers from source, and packages everything into vllm_qwen.sif (~8 GB). The build takes about 15-20 minutes.

Step 4: Download the Model (~67 GB)

bash 02_download_model.sh

Downloads Qwen3.5-35B-A3B weights using huggingface-cli inside the container. Stored at ~/models/Qwen3.5-35B-A3B. Takes 5-30 minutes depending on bandwidth.

Step 5: Start the Server

Interactive (foreground) — recommended with tmux:

tmux new -s llm
bash 03_start_server.sh
# Ctrl+B, then D to detach

Background with logging:

bash 04_start_server_background.sh
tail -f logs/vllm_server_*.log

The model takes 2-5 minutes to load into GPU memory. It's ready when you see:

INFO:     Uvicorn running on http://0.0.0.0:7080
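
Instead of watching the log, readiness can also be checked by polling the models endpoint. The following is a minimal sketch using only the Python standard library; the function names, the retry count, and the delay are assumptions chosen to cover the 2-5 minute load time, and it assumes the server was started without API_KEY (with a key set, the unauthenticated probe would be rejected).

```python
import time
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:7080/v1/models", timeout=5.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(probe=server_is_up, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts are used up."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_ready() with the defaults waits up to about five minutes and returns True as soon as the server responds.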

Step 6: Test the Server

From another terminal on the server:

curl http://localhost:7080/v1/models

Quick chat test:

curl http://localhost:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'

Distribute STUDENT_GUIDE.md with connection details:

Base URL: http://silicon.fhgr.ch:7080/v1
Model name: qwen3.5-35b-a3b
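
For students who prefer Python over curl, the same chat request can be sent with the standard library alone. This is a sketch under the connection details above; the helper names build_chat_request and chat are illustrative, and the Authorization header only applies if the server was started with API_KEY set.

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"
MODEL = "qwen3.5-35b-a3b"


def build_chat_request(prompt, api_key=None, max_tokens=128):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the server enforces an API key.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

A call such as chat("Hello!") mirrors the curl test above. Because the API is OpenAI-compatible, the OpenAI SDK pointed at the same base URL should work as well.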

Streamlit App

A web-based chat and file editor that connects to the inference server. Students run it on their own machines.

Setup

pip install -r requirements.txt

Or with a virtual environment:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run

streamlit run app.py

Opens at http://localhost:8501 with two tabs:

Chat — Conversational interface with streaming responses. Save the model's last response directly into a workspace file (code auto-extracted).
File Editor — Create/edit .py, .tex, .html, or any text file. Use "Generate with LLM" to modify files via natural language instructions.
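
To illustrate what "code auto-extracted" could mean in practice, here is one possible way to pull fenced code blocks out of a model response before saving it. This is a hypothetical sketch and not necessarily how app.py does it; the function name and the fallback to the raw text are assumptions.

```python
import re

# Three backticks, built programmatically so this example stays easy to embed.
FENCE = "`" * 3
# An opening fence with an optional language tag, the body, then a closing fence.
CODE_BLOCK = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response):
    """Return the contents of all fenced code blocks, or the raw text if there are none."""
    blocks = [body.strip("\n") for body in CODE_BLOCK.findall(response)]
    return "\n\n".join(blocks) if blocks else response
```

Falling back to the full response keeps plain-text answers (for example a LaTeX snippet without fences) from being lost.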

Sidebar Controls

Parameter	Default	Range	Purpose
Thinking Mode	Off	Toggle	Chain-of-thought reasoning (slower, better for complex tasks)
Temperature	0.7	0.0 – 2.0	Creativity vs determinism
Max Tokens	4096	256 – 16384	Maximum response length
Top P	0.95	0.0 – 1.0	Nucleus sampling threshold
Presence Penalty	0.0	0.0 – 2.0	Penalize repeated topics
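
The sidebar values correspond to fields of an OpenAI-style chat completion request. The sketch below shows one way such a payload might be assembled, with values clamped to the ranges in the table. The function and constant names are illustrative, and passing the thinking toggle through chat_template_kwargs with an enable_thinking flag is an assumption based on how vLLM commonly exposes this for Qwen models; app.py may do it differently.

```python
SIDEBAR_RANGES = {
    "temperature": (0.0, 2.0),
    "max_tokens": (256, 16384),
    "top_p": (0.0, 1.0),
    "presence_penalty": (0.0, 2.0),
}
SIDEBAR_DEFAULTS = {
    "temperature": 0.7,
    "max_tokens": 4096,
    "top_p": 0.95,
    "presence_penalty": 0.0,
}


def build_payload(messages, thinking=False, model="qwen3.5-35b-a3b", **overrides):
    """Build a chat completion payload, clamping each value to its sidebar range."""
    unknown = set(overrides) - set(SIDEBAR_RANGES)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {"model": model, "messages": messages}
    for name, default in SIDEBAR_DEFAULTS.items():
        low, high = SIDEBAR_RANGES[name]
        payload[name] = min(max(overrides.get(name, default), low), high)
    # Assumption: thinking mode is switched via the chat template.
    payload["chat_template_kwargs"] = {"enable_thinking": thinking}
    return payload
```

For example, build_payload(messages, temperature=0.2, thinking=True) would request more deterministic output with reasoning enabled.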

Server Configuration

All configuration is via environment variables passed to 03_start_server.sh:

Variable	Default	Description
`MODEL_DIR`	`~/models/Qwen3.5-35B-A3B`	Path to model weights
`PORT`	`7080`	HTTP port
`MAX_MODEL_LEN`	`32768`	Max context length (tokens)
`GPU_MEM_UTIL`	`0.92`	Fraction of GPU memory to use
`API_KEY`	(empty = no auth)	API key for authentication
`TENSOR_PARALLEL`	`2`	Number of GPUs

Examples

# Increase context length
MAX_MODEL_LEN=65536 bash 03_start_server.sh

# Add API key authentication
API_KEY="your-secret-key" bash 03_start_server.sh

# Use all 4 GPUs (more KV cache headroom)
TENSOR_PARALLEL=4 bash 03_start_server.sh
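
For orientation, the variables above plausibly map onto standard vLLM server flags as sketched below. This is an illustration of the mapping, not the contents of 03_start_server.sh: the function name is hypothetical, the served model name is taken from the connection details, and the real script additionally wraps the command in an Apptainer call with bind mounts.

```python
import os

DEFAULTS = {
    "MODEL_DIR": "~/models/Qwen3.5-35B-A3B",
    "PORT": "7080",
    "MAX_MODEL_LEN": "32768",
    "GPU_MEM_UTIL": "0.92",
    "API_KEY": "",
    "TENSOR_PARALLEL": "2",
}


def build_vllm_args(env=None):
    """Translate the environment variables into a vLLM command line (as a list)."""
    env = os.environ if env is None else env
    cfg = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    args = [
        "vllm", "serve", os.path.expanduser(cfg["MODEL_DIR"]),
        "--host", "0.0.0.0",
        "--port", cfg["PORT"],
        "--served-model-name", "qwen3.5-35b-a3b",
        "--max-model-len", cfg["MAX_MODEL_LEN"],
        "--gpu-memory-utilization", cfg["GPU_MEM_UTIL"],
        "--tensor-parallel-size", cfg["TENSOR_PARALLEL"],
    ]
    if cfg["API_KEY"]:
        # An empty API_KEY means the server runs without authentication.
        args += ["--api-key", cfg["API_KEY"]]
    return args
```

Printing " ".join(build_vllm_args()) shows the command that the defaults would produce under these assumptions.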

Server Management

# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

# Reconnect to tmux session
tmux attach -t llm

Files Overview

File	Purpose
`vllm_qwen.def`	Apptainer container definition (vLLM nightly + deps)
`01_build_container.sh`	Builds the Apptainer `.sif` image
`02_download_model.sh`	Downloads model weights (runs inside container)
`03_start_server.sh`	Starts vLLM server (foreground)
`04_start_server_background.sh`	Starts server in background with logging
`05_stop_server.sh`	Stops the background server
`app.py`	Streamlit chat & file editor web app
`requirements.txt`	Python dependencies for the Streamlit app
`test_server.py`	Tests the running server via CLI
`STUDENT_GUIDE.md`	Instructions for students

Troubleshooting

"CUDA out of memory"

Reduce MAX_MODEL_LEN (e.g., 16384)
Reduce GPU_MEM_UTIL (e.g., 0.85)

Container build fails

Ensure internet access and sufficient disk space (~20 GB for build cache)
Try pulling manually first: apptainer pull docker://vllm/vllm-openai:nightly

"No NVIDIA GPU detected"

Verify nvidia-smi works on the host
Ensure --nv flag is present (already in scripts)
Test: apptainer exec --nv vllm_qwen.sif nvidia-smi

"Model type qwen3_5_moe not recognized"

The container needs vllm/vllm-openai:nightly (not :latest)
Rebuild the container: rm vllm_qwen.sif && bash 01_build_container.sh

Students can't connect

Check firewall: ports 7080-7090 must be open
Verify the server binds to 0.0.0.0 (not just localhost)
Students must be on the university network or VPN
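
To tell a network problem apart from a server problem, a student can first check whether the port is reachable at all. A minimal sketch using the standard library (the function name is illustrative):

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

If port_is_open("silicon.fhgr.ch", 7080) returns False from a student machine while curl works on the server itself, the cause is most likely the firewall, the bind address, or a missing VPN connection.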

Slow generation with many users

Expected — vLLM batches requests but throughput is finite
The MoE architecture (3B active) helps with per-token speed
Disable thinking mode for faster simple responses
Monitor: curl http://localhost:7080/metrics
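
The metrics endpoint returns Prometheus text format, which is easy to summarize with a few lines of Python. The parser below is a simplified sketch (it does not handle optional timestamps), and the sample text is made up for illustration; the gauge names vllm:num_requests_running and vllm:num_requests_waiting are the ones vLLM typically exposes, but check the actual output of your server.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: summed value}."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set, if any
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue  # not a numeric sample line
    return totals


# Illustrative sample, not real server output.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="qwen3.5-35b-a3b"} 3.0
vllm:num_requests_waiting{model_name="qwen3.5-35b-a3b"} 1.0
"""

metrics = parse_metrics(SAMPLE)
print(metrics.get("vllm:num_requests_running"), metrics.get("vllm:num_requests_waiting"))
```

A persistently non-zero waiting count suggests the server is saturated and requests are queuing.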

Syncing files to the server

Neither git nor pip is available on the host, so copy files with scp from your local machine:

scp app.py 03_start_server.sh herzogfloria@silicon.fhgr.ch:~/LLM_local/
