Add Streamlit chat app, update container to vLLM nightly

- Add app.py: Streamlit UI with chat and file editor tabs
- Add requirements.txt: streamlit + openai dependencies
- Update vllm_qwen.def: use nightly image for Qwen3.5 support
- Update README.md: reflect 35B-A3B model, correct script names
- Update STUDENT_GUIDE.md: add app usage and thinking mode docs
- Update .gitignore: exclude .venv/ and workspace/

Made-with: Cursor
herzogflorian 2026-03-02 16:30:04 +01:00
parent 076001b07f
commit 9e1e0c0751
7 changed files with 351 additions and 147 deletions

.gitignore (vendored, 6 changes)

@@ -10,5 +10,11 @@ models/
# HuggingFace cache
.cache/
# Python venv
.venv/
# Streamlit workspace files
workspace/
# macOS
.DS_Store

README.md (225 changes)

@@ -1,7 +1,8 @@
-# LLM Local — Qwen3.5-27B Inference Server
+# LLM Inferenz Server — Qwen3.5-35B-A3B
-Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-27B**,
-served via **vLLM** inside an **Apptainer** container on a GPU server.
+Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
+(MoE, 35B total / 3B active per token), served via **vLLM** inside an
+**Apptainer** container on a GPU server.
## Architecture
@@ -9,40 +10,44 @@ served via **vLLM** inside an **Apptainer** container on a GPU server.
Students (OpenAI SDK / curl)
-┌─────────────────────────┐
+┌──────────────────────────────┐
 │    silicon.fhgr.ch:7080      │
 │    OpenAI-compatible API     │
-├─────────────────────────┤
-│      vLLM Server        │
-│ (Apptainer container)   │
-├─────────────────────────┤
-│  Qwen3.5-27B weights    │
-│    (bind-mounted)       │
-├─────────────────────────┤
-│      NVIDIA GPU         │
-└─────────────────────────┘
+├──────────────────────────────┤
+│    vLLM Server (nightly)     │
+│  Apptainer container (.sif)  │
+├──────────────────────────────┤
+│   Qwen3.5-35B-A3B weights    │
+│   (bind-mounted from host)   │
+├──────────────────────────────┤
+│ 2× NVIDIA L40S (46 GB ea.)   │
+│     Tensor Parallel = 2      │
+└──────────────────────────────┘
```
+## Hardware
+The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each).
+The inference server uses **2 GPUs** with tensor parallelism, leaving 2 GPUs free.
+| Component | Value |
+|-----------|-------|
+| GPUs used | 2× NVIDIA L40S |
+| VRAM used | ~92 GB total |
+| Model size (BF16) | ~67 GB |
+| Active params/token | 3B (MoE) |
+| Context length | 32,768 tokens |
+| Port | 7080 |
 ## Prerequisites
-- **GPU**: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended).
-  Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead.
-- **Apptainer** (formerly Singularity) installed on the server.
-- **NVIDIA drivers** + **nvidia-container-cli** for GPU passthrough.
-- **~60 GB disk space** for model weights + ~15 GB for the container image.
-- **Network**: Students must be on the university network or VPN.
+- **Apptainer** (formerly Singularity) installed on the server
+- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
+- **~80 GB disk** for model weights + ~8 GB for the container image
+- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)
-## Hardware Sizing
-| Component | Minimum | Recommended |
-|-----------|----------------|-----------------|
-| GPU VRAM | 80 GB (1× A100)| 80 GB (1× H100) |
-| RAM | 64 GB | 128 GB |
-| Disk | 100 GB free | 200 GB free |
-> **If your GPU has less than 80 GB VRAM**, you have two options:
-> 1. Use a **quantized** version (e.g., AWQ/GPTQ 4-bit — ~16 GB VRAM)
-> 2. Use **tensor parallelism** across multiple GPUs (set `TENSOR_PARALLEL=2`)
+> **Note**: No `pip` or `python` is needed on the host — everything runs inside
+> the Apptainer container.
---
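The VRAM figures in the hardware table above follow from simple back-of-envelope arithmetic. A sketch, assuming a round 35B parameter count in BF16 (the real checkpoint is slightly smaller, hence the ~67 GB in the table):

```python
def bf16_weight_gb(n_params: float) -> float:
    """Approximate weight memory in GB for BF16 (2 bytes per parameter)."""
    return n_params * 2 / 1e9

# ~35B parameters in BF16 -> ~70 GB of raw weights, split across 2 GPUs.
weights = bf16_weight_gb(35e9)

# Two L40S GPUs at 46 GB each, with GPU_MEM_UTIL=0.92 reserved for vLLM;
# whatever is left after the weights becomes KV-cache headroom.
budget = 2 * 46 * 0.92
kv_cache_headroom = budget - weights
print(round(weights, 1), round(budget, 1), round(kv_cache_headroom, 1))
```

The remaining headroom is what bounds how many concurrent 32k-token contexts the server can hold.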
@@ -54,11 +59,10 @@ Students (OpenAI SDK / curl)
ssh herzogfloria@silicon.fhgr.ch
```
-### Step 1: Clone This Repository
+### Step 1: Clone the Repository
```bash
-# Or copy the files to the server
-git clone <your-repo-url> ~/LLM_local
+git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```
@@ -66,124 +70,95 @@ chmod +x *.sh
### Step 2: Check GPU and Environment
```bash
# Verify GPU is visible
nvidia-smi
# Verify Apptainer is installed
apptainer --version
# Check available disk space
df -h ~
```
-### Step 3: Download the Model (~60 GB)
+### Step 3: Build the Apptainer Container
```bash
-# Install huggingface-cli if not available
-pip install --user huggingface_hub[cli]
-# Download Qwen3.5-27B
-bash 01_download_model.sh
-# Default target: ~/models/Qwen3.5-27B
+bash 01_build_container.sh
```
-This downloads the full BF16 weights. Takes 20-60 minutes depending on bandwidth.
+Pulls the `vllm/vllm-openai:latest` Docker image, upgrades vLLM to nightly
+(required for Qwen3.5 support), installs latest `transformers` from source,
+and packages everything into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
-### Step 4: Build the Apptainer Container
+### Step 4: Download the Model (~67 GB)
```bash
-bash 02_build_container.sh
+bash 02_download_model.sh
```
-This pulls the `vllm/vllm-openai:latest` Docker image and converts it to a `.sif` file.
-Takes 10-20 minutes. The resulting `vllm_qwen.sif` is ~12-15 GB.
-> **Tip**: If building fails due to network/proxy issues, you can pull the Docker image
-> first and convert manually:
-> ```bash
-> apptainer pull docker://vllm/vllm-openai:latest
-> ```
+Downloads Qwen3.5-35B-A3B weights using `huggingface-cli` **inside the
+container**. Stored at `~/models/Qwen3.5-35B-A3B`. Takes 5-30 minutes
+depending on bandwidth.
### Step 5: Start the Server
-**Interactive (foreground):**
+**Interactive (foreground) — recommended with tmux:**
```bash
+tmux new -s llm
bash 03_start_server.sh
+# Ctrl+B, then D to detach
```
-**Background (recommended for production):**
+**Background with logging:**
```bash
bash 04_start_server_background.sh
```
-The server takes 2-5 minutes to load the model into GPU memory. Monitor with:
```bash
tail -f logs/vllm_server_*.log
```
-Look for the line:
+The model takes 2-5 minutes to load into GPU memory. It's ready when you see:
```
-INFO: Uvicorn running on http://0.0.0.0:8000
+INFO: Uvicorn running on http://0.0.0.0:7080
```
### Step 6: Test the Server
From another terminal on the server:
```bash
# Quick health check
curl http://localhost:7080/v1/models
```
-# Full test
-pip install openai
-python test_server.py
+Or run the full test (uses `openai` SDK inside the container):
+```bash
+apptainer exec --writable-tmpfs vllm_qwen.sif python3 test_server.py
+```
### Step 7: Share with Students
-Distribute the `STUDENT_GUIDE.md` file or share the connection details:
-- **27B Base URL**: `http://silicon.fhgr.ch:7080/v1` — model name: `qwen3.5-27b`
-- **35B Base URL**: `http://silicon.fhgr.ch:7081/v1` — model name: `qwen3.5-35b-a3b`
+Distribute `STUDENT_GUIDE.md` with connection details:
+- **Base URL**: `http://silicon.fhgr.ch:7080/v1`
+- **Model name**: `qwen3.5-35b-a3b`
---
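For reference when sharing these details, the student-side call shape can be sketched as follows. Only the payload builder runs offline; the commented lines show the live call with the `openai` SDK:

```python
def build_chat_request(prompt: str,
                       model: str = "qwen3.5-35b-a3b",
                       max_tokens: int = 512) -> dict:
    """Assemble the keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# Against the live server (requires `pip install openai` and university network/VPN):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://silicon.fhgr.ch:7080/v1", api_key="EMPTY")
#   reply = client.chat.completions.create(**build_chat_request("Hello!"))
#   print(reply.choices[0].message.content)
print(build_chat_request("Hello!")["model"])
```

`EMPTY` is a placeholder key, since the server runs without authentication by default.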
## Configuration
-All configuration is via environment variables in `03_start_server.sh`:
+All configuration is via environment variables passed to `03_start_server.sh`:
| Variable | Default | Description |
-|-------------------|------------------------------|-------------------------------------|
-| `MODEL_DIR` | `~/models/Qwen3.5-27B` | Path to model weights |
+|-------------------|----------------------------------|--------------------------------|
+| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | Path to model weights |
| `PORT` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | Max context length (tokens) |
| `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use |
| `API_KEY` | *(empty = no auth)* | API key for authentication |
-| `TENSOR_PARALLEL` | `1` | Number of GPUs |
+| `TENSOR_PARALLEL` | `2` | Number of GPUs |
-### Context Length Tuning
-The default `MAX_MODEL_LEN=32768` is conservative and ensures stable operation for 15
-concurrent users. If you have plenty of VRAM headroom:
+### Examples
```bash
+# Increase context length
MAX_MODEL_LEN=65536 bash 03_start_server.sh
-```
-Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require
-significantly more GPU memory for KV cache.
+# Add API key authentication
+API_KEY="your-secret-key" bash 03_start_server.sh
-### Adding Authentication
-```bash
-API_KEY="your-secret-key-here" bash 03_start_server.sh
```
Students then use this key in their `api_key` parameter.
### Multi-GPU Setup
-If you have multiple GPUs:
```bash
-TENSOR_PARALLEL=2 bash 03_start_server.sh
+# Use all 4 GPUs (more KV cache headroom)
+TENSOR_PARALLEL=4 bash 03_start_server.sh
```
---
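The examples above pass variables inline; internally such start scripts typically resolve each variable against a fall-back default. A sketch of that resolution logic (the defaults mirror the configuration table; the actual internals of `03_start_server.sh` are assumed, not copied):

```python
# Defaults as documented in the configuration table above.
DEFAULTS = {
    "MODEL_DIR": "~/models/Qwen3.5-35B-A3B",
    "PORT": "7080",
    "MAX_MODEL_LEN": "32768",
    "GPU_MEM_UTIL": "0.92",
    "API_KEY": "",
    "TENSOR_PARALLEL": "2",
}

def effective_config(env: dict) -> dict:
    """Caller-supplied environment variables override the defaults."""
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}

# e.g. `TENSOR_PARALLEL=4 bash 03_start_server.sh`:
cfg = effective_config({"TENSOR_PARALLEL": "4"})
print(cfg["TENSOR_PARALLEL"], cfg["PORT"])
```

Any variable not set by the caller simply keeps the documented default.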
@@ -195,7 +170,7 @@ TENSOR_PARALLEL=2 bash 03_start_server.sh
bash 04_start_server_background.sh
# Check if running
-curl -s http://localhost:7080/v1/models | python -m json.tool
+curl -s http://localhost:7080/v1/models | python3 -m json.tool
# View logs
tail -f logs/vllm_server_*.log
@@ -205,20 +180,9 @@ bash 05_stop_server.sh
# Monitor GPU usage
watch -n 2 nvidia-smi
```
-### Running Persistently with tmux
-For a robust setup that survives SSH disconnects:
-```bash
-ssh herzogfloria@silicon.fhgr.ch
-tmux new -s llm_server
-bash 03_start_server.sh
-# Press Ctrl+B, then D to detach
-# Reconnect later:
-tmux attach -t llm_server
+# Reconnect to tmux session
+tmux attach -t llm
```
---
@@ -226,12 +190,12 @@ tmux attach -t llm_server
## Files Overview
| File | Purpose |
-|------------------------------|------------------------------------------- |
-| `vllm_qwen.def` | Apptainer container definition |
-| `01_download_model.sh` | Downloads model weights from Hugging Face |
-| `02_build_container.sh` | Builds the Apptainer .sif image |
+|----------------------------------|------------------------------------------------------|
+| `vllm_qwen.def` | Apptainer container definition (vLLM nightly + deps) |
+| `01_build_container.sh` | Builds the Apptainer `.sif` image |
+| `02_download_model.sh` | Downloads model weights (runs inside container) |
| `03_start_server.sh` | Starts vLLM server (foreground) |
-| `04_start_server_background.sh` | Starts server in background with logging|
+| `04_start_server_background.sh` | Starts server in background with logging |
| `05_stop_server.sh` | Stops the background server |
| `test_server.py` | Tests the running server |
| `STUDENT_GUIDE.md` | Instructions for students |
@@ -241,25 +205,28 @@ tmux attach -t llm_server
## Troubleshooting
### "CUDA out of memory"
-- Reduce `MAX_MODEL_LEN` (e.g., 16384)
-- Reduce `GPU_MEM_UTIL` (e.g., 0.85)
-- Use a quantized model variant
+- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
+- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)
### Container build fails
-- Ensure you have internet access and sufficient disk space (~20 GB for build cache)
-- Try: `apptainer pull docker://vllm/vllm-openai:latest` first
+- Ensure internet access and sufficient disk space (~20 GB for build cache)
+- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:latest`
### "No NVIDIA GPU detected"
-- Check that `nvidia-smi` works outside the container
-- Ensure `--nv` flag is passed (already in scripts)
-- Verify nvidia-container-cli: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
+- Verify `nvidia-smi` works on the host
+- Ensure `--nv` flag is present (already in scripts)
+- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
-### Server starts but students can't connect
-- Check firewall: `sudo ufw allow 7080:7090/tcp` or equivalent
+### "Model type qwen3_5_moe not recognized"
+- The container needs vLLM nightly and latest transformers
+- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
+### Students can't connect
+- Check firewall: ports 7080-7090 must be open
 - Verify the server binds to `0.0.0.0` (not just localhost)
 - Students must use the server's hostname/IP, not `localhost`
+- Students must be on the university network or VPN
### Slow generation with many users
-- This is expected — vLLM batches requests but throughput is finite
-- Consider reducing `max_tokens` in student requests
-- Monitor with: `curl http://localhost:7080/metrics`
+- Expected — vLLM batches requests but throughput is finite
+- The MoE architecture (3B active) helps with per-token speed
+- Monitor: `curl http://localhost:7080/metrics`
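The `/metrics` endpoint serves Prometheus text format. A small sketch of pulling one number out of it client-side (the sample lines and metric names below are illustrative, not verbatim vLLM output):

```python
def parse_prometheus(text: str) -> dict:
    """Parse simple Prometheus text-format lines into {name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Illustrative sample; fetch the real thing with:
#   curl http://localhost:7080/metrics
sample = """\
# HELP vllm:num_requests_running Number of requests currently running
vllm:num_requests_running 3.0
vllm:num_requests_waiting 1.0
"""
print(parse_prometheus(sample)["vllm:num_requests_running"])
```

This plain parser ignores label sets; for anything serious, scrape the endpoint with Prometheus itself.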

STUDENT_GUIDE.md

@@ -107,6 +107,55 @@ curl http://silicon.fhgr.ch:7080/v1/chat/completions \
---
## Streamlit Chat & File Editor App
A simple web UI is included for chatting with the model and editing files.
### Setup
```bash
pip install streamlit openai
```
### Run
```bash
streamlit run app.py
```
This opens a browser with two tabs:
- **Chat** — Conversational interface with streaming responses. You can save
the model's last response directly to a file.
- **File Editor** — Create and edit `.py`, `.tex`, `.html`, or any text file.
Use the "Generate with LLM" button to have the model modify your file based
on an instruction (e.g. "add error handling" or "fix the LaTeX formatting").
Files are stored in a `workspace/` folder next to `app.py`.
> **Tip**: The app runs on your local machine and connects to the server — you
> don't need to install anything on the GPU server.
---
## Thinking Mode
By default, the model "thinks" before answering (internal chain-of-thought).
This is great for complex reasoning but adds latency for simple questions.
To disable thinking and get faster direct responses, add this to your API call:
```python
response = client.chat.completions.create(
model="qwen3.5-35b-a3b",
messages=[...],
max_tokens=1024,
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```
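With thinking left enabled, the reasoning trace is commonly wrapped in `<think>...</think>` tags inside the returned text. A small sketch for stripping it client-side (the tag convention is an assumption and may vary with the chat template):

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks, leaving only the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>The user wants a short answer.</think>Paris."
print(strip_thinking(raw))
```

Responses without the tags pass through unchanged, so it is safe to apply unconditionally.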
---
## Troubleshooting
| Issue | Solution |

app.py (new file, 181 lines)

@@ -0,0 +1,181 @@
"""
Streamlit Chat & File Editor for Qwen3.5-35B-A3B
A minimal interface to:
1. Chat with the local LLM (OpenAI-compatible API)
2. Edit, save, and generate code / LaTeX files
Usage:
pip install streamlit openai
streamlit run app.py
"""
import re
import streamlit as st
from openai import OpenAI
from pathlib import Path
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
API_BASE = st.sidebar.text_input("API Base URL", "http://silicon.fhgr.ch:7080/v1")
API_KEY = st.sidebar.text_input("API Key", "EMPTY", type="password")
MODEL = "qwen3.5-35b-a3b"
WORKSPACE = Path("workspace")
WORKSPACE.mkdir(exist_ok=True)
client = OpenAI(base_url=API_BASE, api_key=API_KEY)
LANG_MAP = {
".py": "python", ".tex": "latex", ".js": "javascript",
".html": "html", ".css": "css", ".sh": "bash",
".json": "json", ".yaml": "yaml", ".yml": "yaml",
}
def extract_code(text: str, lang: str = "") -> str:
"""Extract the first fenced code block from markdown text.
Falls back to the full text if no code block is found."""
pattern = r"```(?:\w*)\n(.*?)```"
match = re.search(pattern, text, re.DOTALL)
if match:
return match.group(1).strip()
return text.strip()
# ---------------------------------------------------------------------------
# Sidebar — File Manager
# ---------------------------------------------------------------------------
st.sidebar.markdown("---")
st.sidebar.header("File Manager")
new_filename = st.sidebar.text_input("New file name", placeholder="main.tex")
if st.sidebar.button("Create File") and new_filename:
(WORKSPACE / new_filename).touch()
st.sidebar.success(f"Created {new_filename}")
st.rerun()
files = sorted(WORKSPACE.iterdir()) if WORKSPACE.exists() else []
file_names = [f.name for f in files if f.is_file()]
selected_file = st.sidebar.selectbox("Open file", file_names if file_names else ["(no files)"])
# ---------------------------------------------------------------------------
# Main Layout — Two Tabs
# ---------------------------------------------------------------------------
tab_chat, tab_editor = st.tabs(["Chat", "File Editor"])
# ---------------------------------------------------------------------------
# Tab 1: Chat
# ---------------------------------------------------------------------------
with tab_chat:
st.header("Chat with Qwen3.5")
if "messages" not in st.session_state:
st.session_state.messages = []
for msg in st.session_state.messages:
with st.chat_message(msg["role"]):
st.markdown(msg["content"])
if prompt := st.chat_input("Ask anything..."):
st.session_state.messages.append({"role": "user", "content": prompt})
with st.chat_message("user"):
st.markdown(prompt)
with st.chat_message("assistant"):
placeholder = st.empty()
full_response = ""
stream = client.chat.completions.create(
model=MODEL,
messages=st.session_state.messages,
max_tokens=8092,
temperature=0.2,
stream=True,
extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
for chunk in stream:
delta = chunk.choices[0].delta.content or ""
full_response += delta
placeholder.markdown(full_response + "")
placeholder.markdown(full_response)
st.session_state.messages.append({"role": "assistant", "content": full_response})
if st.session_state.messages:
col_clear, col_save = st.columns([1, 3])
with col_clear:
if st.button("Clear Chat"):
st.session_state.messages = []
st.rerun()
with col_save:
if selected_file and selected_file != "(no files)":
if st.button(f"Save code → {selected_file}"):
last = st.session_state.messages[-1]["content"]
suffix = Path(selected_file).suffix
lang = LANG_MAP.get(suffix, "")
code = extract_code(last, lang)
(WORKSPACE / selected_file).write_text(code)
st.success(f"Extracted code saved to workspace/{selected_file}")
# ---------------------------------------------------------------------------
# Tab 2: File Editor
# ---------------------------------------------------------------------------
with tab_editor:
st.header("File Editor")
if selected_file and selected_file != "(no files)":
file_path = WORKSPACE / selected_file
content = file_path.read_text() if file_path.exists() else ""
suffix = file_path.suffix
lang = LANG_MAP.get(suffix, "text")
st.code(content, language=lang if lang != "text" else None, line_numbers=True)
edited = st.text_area(
"Edit below:",
value=content,
height=400,
key=f"editor_{selected_file}_{hash(content)}",
)
col_save, col_gen = st.columns(2)
with col_save:
if st.button("Save File"):
file_path.write_text(edited)
st.success(f"Saved {selected_file}")
st.rerun()
with col_gen:
gen_prompt = st.text_input(
"Generation instruction",
placeholder="e.g. Add error handling / Fix the LaTeX formatting",
key="gen_prompt",
)
if st.button("Generate with LLM") and gen_prompt:
with st.spinner("Generating..."):
response = client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": (
f"You are a coding assistant. The user has a {lang} file. "
"Return ONLY the raw file content inside a single code block. "
"No explanations, no comments about changes."
)},
{"role": "user", "content": (
f"Here is my {lang} file:\n\n```\n{edited}\n```\n\n"
f"Instruction: {gen_prompt}"
)},
],
max_tokens=16384,
temperature=0.6,
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
result = response.choices[0].message.content
code = extract_code(result, lang)
file_path.write_text(code)
st.success("File updated by LLM")
st.rerun()
else:
st.info("Create a file in the sidebar to start editing.")

requirements.txt (new file, 2 lines)

@@ -0,0 +1,2 @@
streamlit
openai

test_server.py

@@ -38,9 +38,9 @@ def main():
response = client.chat.completions.create(
model=model,
messages=[
{"role": "user", "content": "What is 2 + 2? Answer in one sentence."}
{"role": "user", "content": "Create a latex document that derives and explains the principle component analysis (pca). Make a self contain document with introduction, derivation, examples of applications. This is for computer science undergraduate class."}
],
-max_tokens=256,
+max_tokens=16384,
temperature=0.7,
)
print(f" Response: {response.choices[0].message.content}")
@@ -53,7 +53,7 @@ def main():
messages=[
{"role": "user", "content": "Count from 1 to 5."}
],
-max_tokens=128,
+max_tokens=16384,
temperature=0.7,
stream=True,
)

vllm_qwen.def

@@ -1,10 +1,10 @@
Bootstrap: docker
-From: vllm/vllm-openai:latest
+From: vllm/vllm-openai:nightly
%labels
Author herzogfloria
Description vLLM nightly inference server for Qwen3.5-35B-A3B
-Version 2.0
+Version 3.0
%environment
export HF_HOME=/tmp/hf_cache
@@ -12,7 +12,6 @@ From: vllm/vllm-openai:latest
%post
apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*
pip install --no-cache-dir vllm --extra-index-url https://wheels.vllm.ai/nightly
pip install --no-cache-dir "transformers @ git+https://github.com/huggingface/transformers.git@main"
pip install --no-cache-dir huggingface_hub[cli]