herzogflorian deee5038d1 Update README to reflect current project state
Add Streamlit app section with setup, usage, and sidebar controls.
Document nightly Docker image requirement, scp workflow for server
sync, and practical troubleshooting tips from setup experience.

Made-with: Cursor
2026-03-02 16:42:33 +01:00

LLM Inference Server: Qwen3.5-35B-A3B

Self-hosted LLM inference for ~15 concurrent students using Qwen3.5-35B-A3B (MoE, 35B total / 3B active per token), served via vLLM inside an Apptainer container on a GPU server. Includes a Streamlit web app for chat and file editing.

Architecture

Students (Streamlit App / OpenAI SDK / curl)
        │
        ▼
  ┌──────────────────────────────┐
  │  silicon.fhgr.ch:7080       │
  │  OpenAI-compatible API      │
  ├──────────────────────────────┤
  │  vLLM Server (nightly)      │
  │  Apptainer container (.sif) │
  ├──────────────────────────────┤
  │  Qwen3.5-35B-A3B weights    │
  │  (bind-mounted from host)   │
  ├──────────────────────────────┤
  │  2× NVIDIA L40S (46 GB ea.) │
  │  Tensor Parallel = 2        │
  └──────────────────────────────┘

Hardware

The server silicon.fhgr.ch has 4× NVIDIA L40S GPUs (46 GB VRAM each). The inference server uses 2 GPUs with tensor parallelism, leaving 2 GPUs free.

| Component | Value |
|---|---|
| GPUs used | 2× NVIDIA L40S |
| VRAM used | ~92 GB total |
| Model size (BF16) | ~67 GB |
| Active params/token | 3B (MoE) |
| Context length | 32,768 tokens |
| Port | 7080 |
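
The figures in this table imply a rough memory budget for the KV cache and activations. The arithmetic below is a back-of-the-envelope sketch: it uses the 46 GB per GPU and ~67 GB BF16 weights from the table, plus the 0.92 memory fraction that is the GPU_MEM_UTIL default under Server Configuration. The function name is illustrative, and real headroom will be somewhat lower because of CUDA context and framework overhead.

```python
def kv_cache_headroom_gb(num_gpus: int, vram_per_gpu_gb: float,
                         gpu_mem_util: float, weights_gb: float) -> float:
    """Estimate memory left for KV cache and activations, in GB."""
    # Memory vLLM is allowed to claim across all GPUs in the tensor-parallel group.
    usable = num_gpus * vram_per_gpu_gb * gpu_mem_util
    # Whatever the weights do not occupy is available for everything else.
    return usable - weights_gb


if __name__ == "__main__":
    # 2 x 46 GB x 0.92 = 84.64 GB usable, minus 67 GB of weights
    print(f"TP=2: {kv_cache_headroom_gb(2, 46, 0.92, 67):.1f} GB")
    # Same weights spread over all 4 GPUs
    print(f"TP=4: {kv_cache_headroom_gb(4, 46, 0.92, 67):.1f} GB")
```

With the defaults this comes to roughly 17.6 GB on 2 GPUs and roughly 102.3 GB on 4 GPUs, which is why raising the GPU count or lowering the context length changes how many concurrent requests fit.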

Prerequisites

  • Apptainer (formerly Singularity) installed on the server
  • NVIDIA drivers with GPU passthrough support (--nv flag)
  • ~80 GB disk for model weights + ~8 GB for the container image
  • Network access to Hugging Face (for model download) and Docker Hub (for container build)

Note: No pip or Python is needed on the host; everything runs inside the Apptainer container.


Step-by-Step Setup

Step 0: SSH into the Server

ssh herzogfloria@silicon.fhgr.ch

Step 1: Clone the Repository

git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh

Note: git is not installed on the host. Either run it through the container (apptainer exec vllm_qwen.sif git clone ...), which requires the image from Step 3 to exist already, or copy the files via scp from your local machine.

Step 2: Check GPU and Environment

nvidia-smi
apptainer --version
df -h ~

Step 3: Build the Apptainer Container

bash 01_build_container.sh

Pulls the vllm/vllm-openai:nightly Docker image (required for Qwen3.5 support), installs the latest transformers from source, and packages everything into vllm_qwen.sif (~8 GB). The build takes 15-20 minutes.

Step 4: Download the Model (~67 GB)

bash 02_download_model.sh

Downloads the Qwen3.5-35B-A3B weights using huggingface-cli inside the container. The weights are stored at ~/models/Qwen3.5-35B-A3B. The download takes 5-30 minutes depending on bandwidth.

Step 5: Start the Server

Interactive (foreground) — recommended with tmux:

tmux new -s llm
bash 03_start_server.sh
# Ctrl+B, then D to detach

Background with logging:

bash 04_start_server_background.sh
tail -f logs/vllm_server_*.log

The model takes 2-5 minutes to load into GPU memory. It's ready when you see:

INFO:     Uvicorn running on http://0.0.0.0:7080
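
Instead of watching the log for that line, readiness can also be checked by polling the API. The sketch below is one possible approach, not part of the repository: it polls the /v1/models endpoint used in Step 6 until it answers with HTTP 200. The function names, the 10-minute limit, and the 10-second interval are assumptions. If the server was started with API_KEY set, an unauthenticated probe like this one will keep failing.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(url, max_wait=600.0, interval=10.0,
                     probe=server_is_up, sleep=time.sleep) -> bool:
    """Poll until probe(url) succeeds or max_wait seconds of sleeping have passed."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


if __name__ == "__main__":
    ok = wait_until_ready("http://localhost:7080/v1/models")
    print("ready" if ok else "timed out")
```

The probe and sleep parameters exist only so the loop can be exercised without a running server.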

Step 6: Test the Server

From another terminal on the server:

curl http://localhost:7080/v1/models

Quick chat test:

curl http://localhost:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
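
For scripted checks, the same request can be sent from Python using only the standard library. This is a minimal sketch of the curl call above; the function names are illustrative, and the Authorization header is only relevant if the server was started with API_KEY set (see Server Configuration).

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:7080/v1",
                       model: str = "qwen3.5-35b-a3b",
                       max_tokens: int = 128,
                       api_key: str = "") -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:  # only needed when the server enforces an API key
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Hello!"))
```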

Step 7: Share with Students

Distribute STUDENT_GUIDE.md with connection details:

  • Base URL: http://silicon.fhgr.ch:7080/v1
  • Model name: qwen3.5-35b-a3b
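
Students using the OpenAI SDK only need these two values. The sketch below assumes the third-party openai package (pip install openai) is installed on the student's machine; the helper names are illustrative. The "EMPTY" placeholder is an assumption for the case where the server runs without API_KEY, since the SDK expects some key value to be supplied.

```python
BASE_URL = "http://silicon.fhgr.ch:7080/v1"
MODEL = "qwen3.5-35b-a3b"


def client_settings(api_key: str = "") -> dict:
    """Keyword arguments for openai.OpenAI(...)."""
    # A placeholder key is passed when the server has no authentication.
    return {"base_url": BASE_URL, "api_key": api_key or "EMPTY"}


def ask(prompt: str, api_key: str = "") -> str:
    """Send one user message and return the model's reply."""
    from openai import OpenAI  # third-party package, imported lazily

    client = OpenAI(**client_settings(api_key))
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(ask("Hello!"))
```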

Streamlit App

A web-based chat and file editor that connects to the inference server. Students run it on their own machines.

Setup

pip install -r requirements.txt

Or with a virtual environment:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run

streamlit run app.py

Opens at http://localhost:8501 with two tabs:

  • Chat: Conversational interface with streaming responses. The model's last response can be saved directly into a workspace file; code is extracted from it automatically.
  • File Editor: Create or edit .py, .tex, .html, or any other text file. Use "Generate with LLM" to modify files via natural-language instructions.
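
How app.py extracts code from a response is not documented here. One plausible approach, shown purely as an illustration, is to take the first fenced code block from the reply and fall back to the full text if there is none. The function name and the fallback behaviour are assumptions.

```python
import re

# Three backticks, built programmatically so this example stays easy to embed.
FENCE = "`" * 3

# Opening fence with optional language tag, then the body up to the closing fence.
_CODE_BLOCK = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced code block, or the full text if none is found."""
    match = _CODE_BLOCK.search(response)
    return match.group(1) if match else response


if __name__ == "__main__":
    reply = "Here you go:\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nDone."
    print(extract_code(reply), end="")
```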

Sidebar Controls

| Parameter | Default | Range | Purpose |
|---|---|---|---|
| Thinking Mode | Off | Toggle | Chain-of-thought reasoning (slower, better for complex tasks) |
| Temperature | 0.7 | 0.0 - 2.0 | Creativity vs determinism |
| Max Tokens | 4096 | 256 - 16384 | Maximum response length |
| Top P | 0.95 | 0.0 - 1.0 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | 0.0 - 2.0 | Penalize repeated topics |
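
These controls correspond to fields of a chat completion request. The sketch below shows one way they could be mapped, clamping each value to the range in the table; it is not taken from app.py. In particular, toggling thinking through chat_template_kwargs with enable_thinking follows the convention used for Qwen3 under vLLM and is an assumption for this model.

```python
# Allowed ranges, copied from the Sidebar Controls table.
RANGES = {
    "temperature": (0.0, 2.0),
    "max_tokens": (256, 16384),
    "top_p": (0.0, 1.0),
    "presence_penalty": (0.0, 2.0),
}


def clamp(name, value):
    """Force value into the documented range for the given parameter."""
    low, high = RANGES[name]
    return max(low, min(high, value))


def build_payload(messages, thinking=False, temperature=0.7, max_tokens=4096,
                  top_p=0.95, presence_penalty=0.0, model="qwen3.5-35b-a3b"):
    """Build a /v1/chat/completions request body from the sidebar values."""
    return {
        "model": model,
        "messages": messages,
        "temperature": clamp("temperature", temperature),
        "max_tokens": int(clamp("max_tokens", max_tokens)),
        "top_p": clamp("top_p", top_p),
        "presence_penalty": clamp("presence_penalty", presence_penalty),
        # Assumption: thinking is switched via a chat-template argument.
        "chat_template_kwargs": {"enable_thinking": bool(thinking)},
        "stream": True,  # the chat tab streams responses
    }


if __name__ == "__main__":
    print(build_payload([{"role": "user", "content": "Hello!"}]))
```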

Server Configuration

All configuration is via environment variables passed to 03_start_server.sh:

| Variable | Default | Description |
|---|---|---|
| MODEL_DIR | ~/models/Qwen3.5-35B-A3B | Path to model weights |
| PORT | 7080 | HTTP port |
| MAX_MODEL_LEN | 32768 | Max context length (tokens) |
| GPU_MEM_UTIL | 0.92 | Fraction of GPU memory to use |
| API_KEY | (empty = no auth) | API key for authentication |
| TENSOR_PARALLEL | 2 | Number of GPUs |

Examples

# Increase context length
MAX_MODEL_LEN=65536 bash 03_start_server.sh

# Add API key authentication
API_KEY="your-secret-key" bash 03_start_server.sh

# Use all 4 GPUs (more KV cache headroom)
TENSOR_PARALLEL=4 bash 03_start_server.sh
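
To make the mapping concrete, the sketch below translates these variables into a vllm serve argument list. It is written in Python for illustration only and is not the content of 03_start_server.sh; the flags are standard vLLM options, but which ones the script actually passes, and the path under which the weights appear inside the container, are assumptions.

```python
import os

# Defaults from the Server Configuration table.
DEFAULTS = {
    "MODEL_DIR": "~/models/Qwen3.5-35B-A3B",
    "PORT": "7080",
    "MAX_MODEL_LEN": "32768",
    "GPU_MEM_UTIL": "0.92",
    "API_KEY": "",
    "TENSOR_PARALLEL": "2",
}


def vllm_args(env=None):
    """Translate the environment variables into a `vllm serve` argument list."""
    source = os.environ if env is None else env
    # Unset or empty variables fall back to the documented defaults.
    cfg = {key: source.get(key) or default for key, default in DEFAULTS.items()}
    args = [
        "vllm", "serve", os.path.expanduser(cfg["MODEL_DIR"]),
        "--served-model-name", "qwen3.5-35b-a3b",
        "--host", "0.0.0.0",
        "--port", cfg["PORT"],
        "--max-model-len", cfg["MAX_MODEL_LEN"],
        "--gpu-memory-utilization", cfg["GPU_MEM_UTIL"],
        "--tensor-parallel-size", cfg["TENSOR_PARALLEL"],
    ]
    if cfg["API_KEY"]:  # empty means no authentication
        args += ["--api-key", cfg["API_KEY"]]
    return args


if __name__ == "__main__":
    print(" ".join(vllm_args()))
```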

Server Management

# Start in background
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool

# View logs
tail -f logs/vllm_server_*.log

# Stop
bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

# Reconnect to tmux session
tmux attach -t llm

Files Overview

| File | Purpose |
|---|---|
| vllm_qwen.def | Apptainer container definition (vLLM nightly + deps) |
| 01_build_container.sh | Builds the Apptainer .sif image |
| 02_download_model.sh | Downloads model weights (runs inside container) |
| 03_start_server.sh | Starts vLLM server (foreground) |
| 04_start_server_background.sh | Starts server in background with logging |
| 05_stop_server.sh | Stops the background server |
| app.py | Streamlit chat & file editor web app |
| requirements.txt | Python dependencies for the Streamlit app |
| test_server.py | Tests the running server via CLI |
| STUDENT_GUIDE.md | Instructions for students |

Troubleshooting

"CUDA out of memory"

  • Reduce MAX_MODEL_LEN (e.g., 16384)
  • Reduce GPU_MEM_UTIL (e.g., 0.85)

Container build fails

  • Ensure internet access and sufficient disk space (~20 GB for build cache)
  • Try pulling manually first: apptainer pull docker://vllm/vllm-openai:nightly

"No NVIDIA GPU detected"

  • Verify nvidia-smi works on the host
  • Ensure --nv flag is present (already in scripts)
  • Test: apptainer exec --nv vllm_qwen.sif nvidia-smi

"Model type qwen3_5_moe not recognized"

  • The container needs vllm/vllm-openai:nightly (not :latest)
  • Rebuild the container: rm vllm_qwen.sif && bash 01_build_container.sh

Students can't connect

  • Check firewall: ports 7080-7090 must be open
  • Verify the server binds to 0.0.0.0 (not just localhost)
  • Students must be on the university network or VPN
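
A quick way for a student to tell a network problem from a server problem is a plain TCP connection test against the port. The helper below is a small sketch using only the standard library; the function name and the 3-second timeout are arbitrary choices.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered by a firewall, unresolved host name, or timed out.
        return False


if __name__ == "__main__":
    reachable = port_open("silicon.fhgr.ch", 7080)
    print("port 7080 reachable" if reachable else "port 7080 NOT reachable")
```

If the port is unreachable from a student machine but reachable from the server itself, the cause is most likely the firewall, the bind address, or a missing VPN connection.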

Slow generation with many users

  • Expected — vLLM batches requests but throughput is finite
  • The MoE architecture (3B active) helps with per-token speed
  • Disable thinking mode for faster simple responses
  • Monitor: curl http://localhost:7080/metrics
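
The /metrics endpoint returns Prometheus text, which is hard to read at a glance. The sketch below pulls out two gauges that indicate load, the number of running and of queued requests. The metric names match what vLLM has exposed in recent versions but may differ in a nightly build, so treat them as assumptions; the parser also assumes lines carry no trailing timestamp.

```python
import urllib.request

WANTED = ("vllm:num_requests_running", "vllm:num_requests_waiting")


def parse_metrics(text: str, wanted=WANTED) -> dict:
    """Extract selected metric values from Prometheus exposition text."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Format: name{label="..."} value  (the value is the last field)
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in wanted:
            # Sum across label sets in case several series share a name.
            values[name] = values.get(name, 0.0) + float(value)
    return values


if __name__ == "__main__":
    with urllib.request.urlopen("http://localhost:7080/metrics", timeout=5) as resp:
        print(parse_metrics(resp.read().decode("utf-8")))
```

A persistently non-zero waiting count suggests requests are queuing because the KV cache or batch capacity is exhausted.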

Syncing files to the server

  • No git or pip on the host — use scp from your local machine:
scp app.py 03_start_server.sh herzogfloria@silicon.fhgr.ch:~/LLM_local/