Add Streamlit chat app, update container to vLLM nightly

- Add app.py: Streamlit UI with chat and file editor tabs
- Add requirements.txt: streamlit + openai dependencies
- Update vllm_qwen.def: use nightly image for Qwen3.5 support
- Update README.md: reflect 35B-A3B model, correct script names
- Update STUDENT_GUIDE.md: add app usage and thinking mode docs
- Update .gitignore: exclude .venv/ and workspace/

Made-with: Cursor
herzogflorian 2026-03-02 16:30:04 +01:00
parent 076001b07f
commit 9e1e0c0751
7 changed files with 351 additions and 147 deletions

.gitignore

@@ -10,5 +10,11 @@ models/
# HuggingFace cache
.cache/

# Python venv
.venv/

# Streamlit workspace files
workspace/

# macOS
.DS_Store

README.md

@ -1,7 +1,8 @@
# LLM Local — Qwen3.5-27B Inference Server # LLM Inferenz Server — Qwen3.5-35B-A3B
Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-27B**, Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
served via **vLLM** inside an **Apptainer** container on a GPU server. (MoE, 35B total / 3B active per token), served via **vLLM** inside an
**Apptainer** container on a GPU server.
## Architecture ## Architecture
@ -9,40 +10,44 @@ served via **vLLM** inside an **Apptainer** container on a GPU server.
Students (OpenAI SDK / curl) Students (OpenAI SDK / curl)
┌─────────────────────────┐ ┌──────────────────────────────
│ silicon.fhgr.ch:7080 │ │ silicon.fhgr.ch:7080 │
│ OpenAI-compatible API │ │ OpenAI-compatible API │
├─────────────────────────┤ ├──────────────────────────────┤
│ vLLM Server │ │ vLLM Server (nightly) │
│ (Apptainer container) │ │ Apptainer container (.sif) │
├─────────────────────────┤ ├──────────────────────────────┤
│ Qwen3.5-27B weights │ │ Qwen3.5-35B-A3B weights │
│ (bind-mounted) │ │ (bind-mounted from host) │
├─────────────────────────┤ ├──────────────────────────────┤
│ NVIDIA GPU │ │ 2× NVIDIA L40S (46 GB ea.) │
└─────────────────────────┘ │ Tensor Parallel = 2 │
└──────────────────────────────┘
``` ```
## Hardware
The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each).
The inference server uses **2 GPUs** with tensor parallelism, leaving 2 GPUs free.
| Component | Value |
|-----------|-------|
| GPUs used | 2× NVIDIA L40S |
| VRAM used | ~92 GB total |
| Model size (BF16) | ~67 GB |
| Active params/token | 3B (MoE) |
| Context length | 32,768 tokens |
| Port | 7080 |
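The table's numbers can be sanity-checked with back-of-envelope arithmetic (a sketch assuming ~35e9 total parameters and the 2 × 46 GB / `GPU_MEM_UTIL=0.92` figures above; actual footprint also includes buffers and activation memory):

```python
# Rough BF16 footprint for Qwen3.5-35B-A3B: 2 bytes per parameter.
params = 35e9
bytes_per_param = 2  # BF16

weights_gib = params * bytes_per_param / 2**30
print(f"Weights: ~{weights_gib:.0f} GiB")  # ~65 GiB, close to the ~67 GB above

# With tensor parallelism across 2 GPUs and GPU_MEM_UTIL=0.92 of
# 2 x 46 GiB reserved by vLLM, the remainder is KV-cache headroom.
budget_gib = 2 * 46 * 0.92
kv_cache_gib = budget_gib - weights_gib
print(f"KV-cache headroom: ~{kv_cache_gib:.0f} GiB")
```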
## Prerequisites

- **Apptainer** (formerly Singularity) installed on the server
- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
- **~80 GB disk** for model weights + ~8 GB for the container image
- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)

> **Note**: No `pip` or `python` is needed on the host — everything runs inside
> the Apptainer container.
---
@@ -54,11 +59,10 @@
ssh herzogfloria@silicon.fhgr.ch
```

### Step 1: Clone the Repository

```bash
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```
@@ -66,124 +70,95 @@

### Step 2: Check GPU and Environment

```bash
nvidia-smi
apptainer --version
df -h ~
```
### Step 3: Build the Apptainer Container

```bash
bash 01_build_container.sh
```

Pulls the `vllm/vllm-openai:nightly` Docker image (required for Qwen3.5
support), installs the latest `transformers` from source, and packages
everything into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.

### Step 4: Download the Model (~67 GB)

```bash
bash 02_download_model.sh
```

Downloads Qwen3.5-35B-A3B weights using `huggingface-cli` **inside the
container**. Stored at `~/models/Qwen3.5-35B-A3B`. Takes 5-30 minutes
depending on bandwidth.
### Step 5: Start the Server

**Interactive (foreground) — recommended with tmux:**

```bash
tmux new -s llm
bash 03_start_server.sh
# Ctrl+B, then D to detach
```

**Background with logging:**

```bash
bash 04_start_server_background.sh
tail -f logs/vllm_server_*.log
```

The model takes 2-5 minutes to load into GPU memory. It's ready when you see:

```
INFO: Uvicorn running on http://0.0.0.0:7080
```
### Step 6: Test the Server

From another terminal on the server:

```bash
curl http://localhost:7080/v1/models
```

Or run the full test (uses `openai` SDK inside the container):

```bash
apptainer exec --writable-tmpfs vllm_qwen.sif python3 test_server.py
```

### Step 7: Share with Students

Distribute `STUDENT_GUIDE.md` with connection details:

- **Base URL**: `http://silicon.fhgr.ch:7080/v1`
- **Model name**: `qwen3.5-35b-a3b`
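A minimal client sketch using only the Python standard library (hostname, port, and model name taken from the connection details above; the commented-out call at the end requires network access to the server):

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"

# Request body for POST /v1/chat/completions — the same payload the
# openai SDK sends from client.chat.completions.create(...).
payload = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Explain tensor parallelism in one sentence."}],
    "max_tokens": 256,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(req.full_url)  # http://silicon.fhgr.ch:7080/v1/chat/completions

# On the university network / VPN, send it with:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

Students using the `openai` SDK pass the same `base_url` and model name instead.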
---
## Configuration

All configuration is via environment variables passed to `03_start_server.sh`:

| Variable | Default | Description |
|-------------------|----------------------------|-------------------------------|
| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | Path to model weights |
| `PORT` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | Max context length (tokens) |
| `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use |
| `API_KEY` | *(empty = no auth)* | API key for authentication |
| `TENSOR_PARALLEL` | `2` | Number of GPUs |

### Examples

```bash
# Increase context length
MAX_MODEL_LEN=65536 bash 03_start_server.sh

# Add API key authentication
API_KEY="your-secret-key" bash 03_start_server.sh

# Use all 4 GPUs (more KV cache headroom)
TENSOR_PARALLEL=4 bash 03_start_server.sh
```
---
@@ -195,7 +170,7 @@
bash 04_start_server_background.sh

# Check if running
curl -s http://localhost:7080/v1/models | python3 -m json.tool

# View logs
tail -f logs/vllm_server_*.log

@@ -205,20 +180,9 @@ bash 05_stop_server.sh

# Monitor GPU usage
watch -n 2 nvidia-smi

# Reconnect to tmux session
tmux attach -t llm
```
---

@@ -226,10 +190,10 @@
## Files Overview

| File | Purpose |
|----------------------------------|-------------------------------------------------------|
| `vllm_qwen.def` | Apptainer container definition (vLLM nightly + deps) |
| `01_build_container.sh` | Builds the Apptainer `.sif` image |
| `02_download_model.sh` | Downloads model weights (runs inside container) |
| `03_start_server.sh` | Starts vLLM server (foreground) |
| `04_start_server_background.sh` | Starts server in background with logging |
| `05_stop_server.sh` | Stops the background server |
@@ -241,25 +205,28 @@

## Troubleshooting

### "CUDA out of memory"
- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)

### Container build fails
- Ensure internet access and sufficient disk space (~20 GB for build cache)
- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:nightly`
### "No NVIDIA GPU detected" ### "No NVIDIA GPU detected"
- Check that `nvidia-smi` works outside the container - Verify `nvidia-smi` works on the host
- Ensure `--nv` flag is passed (already in scripts) - Ensure `--nv` flag is present (already in scripts)
- Verify nvidia-container-cli: `apptainer exec --nv vllm_qwen.sif nvidia-smi` - Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
### Server starts but students can't connect ### "Model type qwen3_5_moe not recognized"
- Check firewall: `sudo ufw allow 7080:7090/tcp` or equivalent - The container needs vLLM nightly and latest transformers
- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
### Students can't connect
- Check firewall: ports 7080-7090 must be open
- Verify the server binds to `0.0.0.0` (not just localhost) - Verify the server binds to `0.0.0.0` (not just localhost)
- Students must use the server's hostname/IP, not `localhost` - Students must be on the university network or VPN
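To distinguish "server down" from "firewall/VPN blocks the port", a raw TCP probe run from the student's machine helps (a diagnostic sketch; host and port from this guide):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# False from the student's machine but True on the server itself means
# the firewall or VPN is the problem, not vLLM.
print(port_open("silicon.fhgr.ch", 7080, timeout=2.0))
```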
### Slow generation with many users
- Expected — vLLM batches requests but throughput is finite
- The MoE architecture (3B active) helps with per-token speed
- Monitor: `curl http://localhost:7080/metrics`
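The `/metrics` endpoint returns Prometheus text format. A small parser sketch for eyeballing queue depth (the sample metric names below are illustrative and vary by vLLM version):

```python
def parse_prometheus(text: str) -> dict[str, float]:
    """Parse 'name value' lines from Prometheus exposition text,
    skipping comments; labels stay part of the metric name."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # skip non-numeric lines
    return metrics

# Illustrative sample — real output comes from `curl localhost:7080/metrics`.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running{model_name="qwen3.5-35b-a3b"} 3.0
vllm:num_requests_waiting{model_name="qwen3.5-35b-a3b"} 12.0
"""
m = parse_prometheus(sample)
print(m)  # running and waiting request counts keyed by labeled metric name
```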

STUDENT_GUIDE.md

@@ -107,6 +107,55 @@ curl http://silicon.fhgr.ch:7080/v1/chat/completions \

---
## Streamlit Chat & File Editor App
A simple web UI is included for chatting with the model and editing files.
### Setup
```bash
pip install streamlit openai
```
### Run
```bash
streamlit run app.py
```
This opens a browser with two tabs:
- **Chat** — Conversational interface with streaming responses. You can save
the model's last response directly to a file.
- **File Editor** — Create and edit `.py`, `.tex`, `.html`, or any text file.
Use the "Generate with LLM" button to have the model modify your file based
on an instruction (e.g. "add error handling" or "fix the LaTeX formatting").
Files are stored in a `workspace/` folder next to `app.py`.
> **Tip**: The app runs on your local machine and connects to the server — you
> don't need to install anything on the GPU server.
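The save-to-file feature keeps only the first fenced code block of the model's reply; the extraction logic in `app.py` boils down to this sketch:

```python
import re

def extract_code(text: str) -> str:
    """Return the first fenced code block, or the full text if none."""
    match = re.search(r"```(?:\w*)\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

reply = "Here is the fix:\n```python\nprint('hello')\n```\nLet me know!"
print(extract_code(reply))  # print('hello')
```

So if the model replies with prose around a code block, only the code lands in your file.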
---
## Thinking Mode
By default, the model "thinks" before answering (internal chain-of-thought).
This is great for complex reasoning but adds latency for simple questions.
To disable thinking and get faster direct responses, add this to your API call:
```python
response = client.chat.completions.create(
model="qwen3.5-35b-a3b",
messages=[...],
max_tokens=1024,
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```
---
## Troubleshooting ## Troubleshooting
| Issue | Solution | | Issue | Solution |

app.py (new file)

@@ -0,0 +1,181 @@
"""
Streamlit Chat & File Editor for Qwen3.5-35B-A3B

A minimal interface to:
1. Chat with the local LLM (OpenAI-compatible API)
2. Edit, save, and generate code / LaTeX files

Usage:
    pip install streamlit openai
    streamlit run app.py
"""
import re
from pathlib import Path

import streamlit as st
from openai import OpenAI

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
API_BASE = st.sidebar.text_input("API Base URL", "http://silicon.fhgr.ch:7080/v1")
API_KEY = st.sidebar.text_input("API Key", "EMPTY", type="password")
MODEL = "qwen3.5-35b-a3b"

WORKSPACE = Path("workspace")
WORKSPACE.mkdir(exist_ok=True)

client = OpenAI(base_url=API_BASE, api_key=API_KEY)

LANG_MAP = {
    ".py": "python", ".tex": "latex", ".js": "javascript",
    ".html": "html", ".css": "css", ".sh": "bash",
    ".json": "json", ".yaml": "yaml", ".yml": "yaml",
}


def extract_code(text: str, lang: str = "") -> str:
    """Extract the first fenced code block from markdown text.
    Falls back to the full text if no code block is found."""
    pattern = r"```(?:\w*)\n(.*?)```"
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return text.strip()


# ---------------------------------------------------------------------------
# Sidebar — File Manager
# ---------------------------------------------------------------------------
st.sidebar.markdown("---")
st.sidebar.header("File Manager")

new_filename = st.sidebar.text_input("New file name", placeholder="main.tex")
if st.sidebar.button("Create File") and new_filename:
    (WORKSPACE / new_filename).touch()
    st.sidebar.success(f"Created {new_filename}")
    st.rerun()

files = sorted(WORKSPACE.iterdir()) if WORKSPACE.exists() else []
file_names = [f.name for f in files if f.is_file()]
selected_file = st.sidebar.selectbox("Open file", file_names if file_names else ["(no files)"])

# ---------------------------------------------------------------------------
# Main Layout — Two Tabs
# ---------------------------------------------------------------------------
tab_chat, tab_editor = st.tabs(["Chat", "File Editor"])

# ---------------------------------------------------------------------------
# Tab 1: Chat
# ---------------------------------------------------------------------------
with tab_chat:
    st.header("Chat with Qwen3.5")

    if "messages" not in st.session_state:
        st.session_state.messages = []

    for msg in st.session_state.messages:
        with st.chat_message(msg["role"]):
            st.markdown(msg["content"])

    if prompt := st.chat_input("Ask anything..."):
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)

        with st.chat_message("assistant"):
            placeholder = st.empty()
            full_response = ""
            stream = client.chat.completions.create(
                model=MODEL,
                messages=st.session_state.messages,
                max_tokens=8092,
                temperature=0.2,
                stream=True,
                extra_body={"chat_template_kwargs": {"enable_thinking": True}},
            )
            for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                full_response += delta
                placeholder.markdown(full_response + "▌")
            placeholder.markdown(full_response)

        st.session_state.messages.append({"role": "assistant", "content": full_response})

    if st.session_state.messages:
        col_clear, col_save = st.columns([1, 3])
        with col_clear:
            if st.button("Clear Chat"):
                st.session_state.messages = []
                st.rerun()
        with col_save:
            if selected_file and selected_file != "(no files)":
                if st.button(f"Save code → {selected_file}"):
                    last = st.session_state.messages[-1]["content"]
                    suffix = Path(selected_file).suffix
                    lang = LANG_MAP.get(suffix, "")
                    code = extract_code(last, lang)
                    (WORKSPACE / selected_file).write_text(code)
                    st.success(f"Extracted code saved to workspace/{selected_file}")

# ---------------------------------------------------------------------------
# Tab 2: File Editor
# ---------------------------------------------------------------------------
with tab_editor:
    st.header("File Editor")

    if selected_file and selected_file != "(no files)":
        file_path = WORKSPACE / selected_file
        content = file_path.read_text() if file_path.exists() else ""
        suffix = file_path.suffix
        lang = LANG_MAP.get(suffix, "text")

        st.code(content, language=lang if lang != "text" else None, line_numbers=True)

        edited = st.text_area(
            "Edit below:",
            value=content,
            height=400,
            key=f"editor_{selected_file}_{hash(content)}",
        )

        col_save, col_gen = st.columns(2)
        with col_save:
            if st.button("Save File"):
                file_path.write_text(edited)
                st.success(f"Saved {selected_file}")
                st.rerun()
        with col_gen:
            gen_prompt = st.text_input(
                "Generation instruction",
                placeholder="e.g. Add error handling / Fix the LaTeX formatting",
                key="gen_prompt",
            )
            if st.button("Generate with LLM") and gen_prompt:
                with st.spinner("Generating..."):
                    response = client.chat.completions.create(
                        model=MODEL,
                        messages=[
                            {"role": "system", "content": (
                                f"You are a coding assistant. The user has a {lang} file. "
                                "Return ONLY the raw file content inside a single code block. "
                                "No explanations, no comments about changes."
                            )},
                            {"role": "user", "content": (
                                f"Here is my {lang} file:\n\n```\n{edited}\n```\n\n"
                                f"Instruction: {gen_prompt}"
                            )},
                        ],
                        max_tokens=16384,
                        temperature=0.6,
                        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
                    )
                result = response.choices[0].message.content
                code = extract_code(result, lang)
                file_path.write_text(code)
                st.success("File updated by LLM")
                st.rerun()
    else:
        st.info("Create a file in the sidebar to start editing.")

requirements.txt (new file)

@@ -0,0 +1,2 @@
streamlit
openai

test_server.py

@@ -38,9 +38,9 @@ def main():
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": "Create a LaTeX document that derives and explains principal component analysis (PCA). Make a self-contained document with an introduction, derivation, and examples of applications. This is for a computer science undergraduate class."}
        ],
        max_tokens=16384,
        temperature=0.7,
    )
    print(f" Response: {response.choices[0].message.content}")

@@ -53,7 +53,7 @@ def main():
        messages=[
            {"role": "user", "content": "Count from 1 to 5."}
        ],
        max_tokens=16384,
        temperature=0.7,
        stream=True,
    )

vllm_qwen.def

@@ -1,10 +1,10 @@
Bootstrap: docker
From: vllm/vllm-openai:nightly

%labels
    Author herzogfloria
    Description vLLM nightly inference server for Qwen3.5-35B-A3B
    Version 3.0

%environment
    export HF_HOME=/tmp/hf_cache

@@ -12,7 +12,6 @@
%post
    apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*
    pip install --no-cache-dir "transformers @ git+https://github.com/huggingface/transformers.git@main"
    pip install --no-cache-dir huggingface_hub[cli]