herzogflorian deee5038d1 Update README to reflect current project state
Add Streamlit app section with setup, usage, and sidebar controls.
Document nightly Docker image requirement, scp workflow for server
sync, and practical troubleshooting tips from setup experience.

Made-with: Cursor
2026-03-02 16:42:33 +01:00

# LLM Inferenz Server — Qwen3.5-35B-A3B
Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
(MoE, 35B total / 3B active per token), served via **vLLM** inside an
**Apptainer** container on a GPU server. Includes a **Streamlit web app** for
chat and file editing.
## Architecture
```
Students (Streamlit App / OpenAI SDK / curl)
┌──────────────────────────────┐
│ silicon.fhgr.ch:7080         │
│ OpenAI-compatible API        │
├──────────────────────────────┤
│ vLLM Server (nightly)        │
│ Apptainer container (.sif)   │
├──────────────────────────────┤
│ Qwen3.5-35B-A3B weights      │
│ (bind-mounted from host)     │
├──────────────────────────────┤
│ 2× NVIDIA L40S (46 GB ea.)   │
│ Tensor Parallel = 2          │
└──────────────────────────────┘
```
## Hardware
The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each).
The inference server uses **2 GPUs** with tensor parallelism, leaving 2 GPUs free.
| Component | Value |
|-----------|-------|
| GPUs used | 2× NVIDIA L40S |
| VRAM used | ~92 GB total |
| Model size (BF16) | ~67 GB |
| Active params/token | 3B (MoE) |
| Context length | 32,768 tokens |
| Port | 7080 |
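The VRAM figures above can be sanity-checked with some quick arithmetic. This is a back-of-the-envelope sketch using the numbers from this README plus the default `GPU_MEM_UTIL` of 0.92 from the configuration section; real usage varies with vLLM version and kernel overhead:

```python
# Rough VRAM budget for the 2-GPU configuration (illustrative only).
GPUS = 2
VRAM_PER_GPU_GB = 46
GPU_MEM_UTIL = 0.92      # default fraction vLLM is allowed to claim
WEIGHTS_GB = 67          # BF16 model size

total_gb = GPUS * VRAM_PER_GPU_GB       # 92 GB physical
usable_gb = total_gb * GPU_MEM_UTIL     # ~84.6 GB vLLM will claim
kv_cache_gb = usable_gb - WEIGHTS_GB    # ~17.6 GB left for KV cache

print(f"KV cache headroom: ~{kv_cache_gb:.1f} GB")
```

This headroom is what limits concurrent context; raising `MAX_MODEL_LEN` or adding users eats into it, which is why `TENSOR_PARALLEL=4` is suggested later for more KV cache.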
## Prerequisites
- **Apptainer** (formerly Singularity) installed on the server
- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
- **~80 GB disk** for model weights + ~8 GB for the container image
- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)
> **Note**: No `pip` or `python` is needed on the host — everything runs inside
> the Apptainer container.
---
## Step-by-Step Setup
### Step 0: SSH into the Server
```bash
ssh herzogfloria@silicon.fhgr.ch
```
### Step 1: Clone the Repository
```bash
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```
> **Note**: `git` is not installed on the host. Either copy the files over
> via `scp` from your local machine, or — once the container from Step 3
> exists — run the clone through it:
> `apptainer exec vllm_qwen.sif git clone ...`
### Step 2: Check GPU and Environment
```bash
nvidia-smi
apptainer --version
df -h ~
```
### Step 3: Build the Apptainer Container
```bash
bash 01_build_container.sh
```
Pulls the `vllm/vllm-openai:nightly` Docker image (required for Qwen3.5
support), installs latest `transformers` from source, and packages everything
into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
### Step 4: Download the Model (~67 GB)
```bash
bash 02_download_model.sh
```
Downloads Qwen3.5-35B-A3B weights using `huggingface-cli` **inside the
container**. Stored at `~/models/Qwen3.5-35B-A3B`. Takes 5-30 minutes
depending on bandwidth.
### Step 5: Start the Server
**Interactive (foreground) — recommended with tmux:**
```bash
tmux new -s llm
bash 03_start_server.sh
# Ctrl+B, then D to detach
```
**Background with logging:**
```bash
bash 04_start_server_background.sh
tail -f logs/vllm_server_*.log
```
The model takes 2-5 minutes to load into GPU memory. It's ready when you see:
```
INFO: Uvicorn running on http://0.0.0.0:7080
```
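Instead of watching the log by hand, you can poll the `/v1/models` endpoint from Step 6 until the server answers. This is not one of the repo's scripts, just a small stdlib-only sketch (the function name is illustrative):

```python
import time
from urllib import request, error

def wait_until_ready(url: str, timeout: float = 300.0,
                     interval: float = 5.0) -> bool:
    """Poll the models endpoint until the server answers or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with request.urlopen(url, timeout=interval):
                return True  # got an HTTP response: server is up
        except (error.URLError, OSError):
            time.sleep(interval)  # not up yet, retry
    return False

# wait_until_ready("http://localhost:7080/v1/models")  # run on the server
```

A 300-second default matches the 2-5 minute load time quoted above.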
### Step 6: Test the Server
From another terminal on the server:
```bash
curl http://localhost:7080/v1/models
```
Quick chat test:
```bash
curl http://localhost:7080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
```
### Step 7: Share with Students
Distribute `STUDENT_GUIDE.md` with connection details:
- **Base URL**: `http://silicon.fhgr.ch:7080/v1`
- **Model name**: `qwen3.5-35b-a3b`
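For students who prefer Python over curl, the same chat request can be made with nothing but the standard library (the OpenAI SDK works too, pointed at the same base URL). The helper names here are illustrative, not from `STUDENT_GUIDE.md`:

```python
import json
from urllib import request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"   # from this README
MODEL = "qwen3.5-35b-a3b"

def build_chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Assemble an OpenAI-compatible chat-completions payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """Send one chat turn and return the assistant's reply."""
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Hello!")  # requires network access to the server
```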
---
## Streamlit App
A web-based chat and file editor that connects to the inference server.
Students run it on their own machines.
### Setup
```bash
pip install -r requirements.txt
```
Or with a virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
### Run
```bash
streamlit run app.py
```
Opens at `http://localhost:8501` with two tabs:
- **Chat** — Conversational interface with streaming responses. Save the
model's last response directly into a workspace file (code auto-extracted).
- **File Editor** — Create/edit `.py`, `.tex`, `.html`, or any text file.
Use "Generate with LLM" to modify files via natural language instructions.
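The "code auto-extracted" step above amounts to pulling the first fenced block out of the model's Markdown reply. `app.py`'s actual implementation may differ; this is a minimal sketch of the idea:

```python
import re

def extract_code(markdown_text: str) -> str:
    """Return the first fenced code block, or the whole text if none."""
    match = re.search(r"```[\w+-]*\n(.*?)```", markdown_text, re.DOTALL)
    return match.group(1) if match else markdown_text
```

Falling back to the full text means plain-prose responses can still be saved to a workspace file unchanged.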
### Sidebar Controls
| Parameter | Default | Range | Purpose |
|-----------|---------|-------|---------|
| Thinking Mode | Off | Toggle | Chain-of-thought reasoning (slower, better for complex tasks) |
| Temperature | 0.7 | 0.0–2.0 | Creativity vs. determinism |
| Max Tokens | 4096 | 256–16384 | Maximum response length |
| Top P | 0.95 | 0.0–1.0 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | 0.0–2.0 | Penalize repeated topics |
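These sidebar values map directly onto standard chat-completions request fields that vLLM accepts. A small sketch of the mapping (defaults mirror the table; the helper name is illustrative, not from `app.py`):

```python
def sampling_params(temperature: float = 0.7,
                    max_tokens: int = 4096,
                    top_p: float = 0.95,
                    presence_penalty: float = 0.0) -> dict:
    """Bundle the sidebar values into chat-completions parameters."""
    return {
        "temperature": temperature,
        "max_tokens": max_tokens,
        "top_p": top_p,
        "presence_penalty": presence_penalty,
    }

# Deterministic, short completions suit quick factual questions:
fast = sampling_params(temperature=0.0, max_tokens=256)
```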
---
## Server Configuration
All configuration is via environment variables passed to `03_start_server.sh`:
| Variable | Default | Description |
|-------------------|----------------------------------|--------------------------------|
| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | Path to model weights |
| `PORT` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | Max context length (tokens) |
| `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use |
| `API_KEY` | *(empty = no auth)* | API key for authentication |
| `TENSOR_PARALLEL` | `2` | Number of GPUs |
### Examples
```bash
# Increase context length
MAX_MODEL_LEN=65536 bash 03_start_server.sh
# Add API key authentication
API_KEY="your-secret-key" bash 03_start_server.sh
# Use all 4 GPUs (more KV cache headroom)
TENSOR_PARALLEL=4 bash 03_start_server.sh
```
---
## Server Management
```bash
# Start in background
bash 04_start_server_background.sh
# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool
# View logs
tail -f logs/vllm_server_*.log
# Stop
bash 05_stop_server.sh
# Monitor GPU usage
watch -n 2 nvidia-smi
# Reconnect to tmux session
tmux attach -t llm
```
---
## Files Overview
| File | Purpose |
|----------------------------------|------------------------------------------------------|
| `vllm_qwen.def` | Apptainer container definition (vLLM nightly + deps) |
| `01_build_container.sh` | Builds the Apptainer `.sif` image |
| `02_download_model.sh` | Downloads model weights (runs inside container) |
| `03_start_server.sh` | Starts vLLM server (foreground) |
| `04_start_server_background.sh` | Starts server in background with logging |
| `05_stop_server.sh` | Stops the background server |
| `app.py` | Streamlit chat & file editor web app |
| `requirements.txt` | Python dependencies for the Streamlit app |
| `test_server.py` | Tests the running server via CLI |
| `STUDENT_GUIDE.md` | Instructions for students |
---
## Troubleshooting
### "CUDA out of memory"
- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)
### Container build fails
- Ensure internet access and sufficient disk space (~20 GB for build cache)
- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:nightly`
### "No NVIDIA GPU detected"
- Verify `nvidia-smi` works on the host
- Ensure `--nv` flag is present (already in scripts)
- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
### "Model type qwen3_5_moe not recognized"
- The container needs `vllm/vllm-openai:nightly` (not `:latest`)
- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
### Students can't connect
- Check firewall: ports 7080-7090 must be open
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must be on the university network or VPN
### Slow generation with many users
- Expected — vLLM batches requests but throughput is finite
- The MoE architecture (3B active) helps with per-token speed
- Disable thinking mode for faster simple responses
- Monitor: `curl http://localhost:7080/metrics`
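The `/metrics` endpoint returns Prometheus text format, which is easy to reduce to a name-to-value dict for a quick load check. A stdlib-only sketch (the metric names shown in the test are examples of what vLLM exposes; the exact set varies by version):

```python
def parse_prometheus(text: str) -> dict:
    """Parse simple Prometheus exposition lines into {metric: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blanks
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore lines without a numeric sample
    return metrics
```

Watching the running/waiting request counts over time tells you whether slowness is queueing (too many users) or raw generation speed.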
### Syncing files to the server
- No `git` or `pip` on the host — use `scp` from your local machine:
```bash
scp app.py 03_start_server.sh herzogfloria@silicon.fhgr.ch:~/LLM_local/
```