Add Streamlit chat app, update container to vLLM nightly

- Add app.py: Streamlit UI with chat and file editor tabs
- Add requirements.txt: streamlit + openai dependencies
- Update vllm_qwen.def: use nightly image for Qwen3.5 support
- Update README.md: reflect 35B-A3B model, correct script names
- Update STUDENT_GUIDE.md: add app usage and thinking mode docs
- Update .gitignore: exclude .venv/ and workspace/

Made-with: Cursor
herzogflorian 2026-03-02 16:30:04 +01:00
parent 076001b07f
commit 9e1e0c0751
7 changed files with 351 additions and 147 deletions

.gitignore (vendored, 6 changes)

@@ -10,5 +10,11 @@ models/
# HuggingFace cache
.cache/
# Python venv
.venv/
# Streamlit workspace files
workspace/
# macOS
.DS_Store

README.md (225 changes)

@@ -1,7 +1,8 @@
-# LLM Local — Qwen3.5-27B Inference Server
+# LLM Inferenz Server — Qwen3.5-35B-A3B
-Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-27B**,
-served via **vLLM** inside an **Apptainer** container on a GPU server.
+Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
+(MoE, 35B total / 3B active per token), served via **vLLM** inside an
+**Apptainer** container on a GPU server.
## Architecture
@@ -9,40 +10,44 @@ served via **vLLM** inside an **Apptainer** container on a GPU server.
Students (OpenAI SDK / curl)
-┌─────────────────────────┐
+┌──────────────────────────────┐
 │    silicon.fhgr.ch:7080      │
 │    OpenAI-compatible API     │
-├─────────────────────────┤
-│      vLLM Server        │
-│ (Apptainer container)   │
-├─────────────────────────┤
-│  Qwen3.5-27B weights    │
-│    (bind-mounted)       │
-├─────────────────────────┤
-│      NVIDIA GPU         │
-└─────────────────────────┘
+├──────────────────────────────┤
+│    vLLM Server (nightly)     │
+│  Apptainer container (.sif)  │
+├──────────────────────────────┤
+│   Qwen3.5-35B-A3B weights    │
+│   (bind-mounted from host)   │
+├──────────────────────────────┤
+│ 2× NVIDIA L40S (46 GB ea.)   │
+│     Tensor Parallel = 2      │
+└──────────────────────────────┘
```
+## Hardware
+The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each).
+The inference server uses **2 GPUs** with tensor parallelism, leaving 2 GPUs free.
+| Component | Value |
+|-----------|-------|
+| GPUs used | 2× NVIDIA L40S |
+| VRAM used | ~92 GB total |
+| Model size (BF16) | ~67 GB |
+| Active params/token | 3B (MoE) |
+| Context length | 32,768 tokens |
+| Port | 7080 |
 ## Prerequisites
-- **GPU**: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended).
-  Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead.
-- **Apptainer** (formerly Singularity) installed on the server.
-- **NVIDIA drivers** + **nvidia-container-cli** for GPU passthrough.
-- **~60 GB disk space** for model weights + ~15 GB for the container image.
-- **Network**: Students must be on the university network or VPN.
+- **Apptainer** (formerly Singularity) installed on the server
+- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
+- **~80 GB disk** for model weights + ~8 GB for the container image
+- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)
-## Hardware Sizing
-| Component | Minimum | Recommended |
-|-----------|----------------|-----------------|
-| GPU VRAM | 80 GB (1× A100)| 80 GB (1× H100) |
-| RAM | 64 GB | 128 GB |
-| Disk | 100 GB free | 200 GB free |
-> **If your GPU has less than 80 GB VRAM**, you have two options:
-> 1. Use a **quantized** version (e.g., AWQ/GPTQ 4-bit — ~16 GB VRAM)
-> 2. Use **tensor parallelism** across multiple GPUs (set `TENSOR_PARALLEL=2`)
+> **Note**: No `pip` or `python` is needed on the host — everything runs inside
+> the Apptainer container.
---
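The VRAM figures in the hardware table above follow from simple back-of-envelope arithmetic. A sketch, assuming a round 35B parameter count in BF16 (the real checkpoint is slightly smaller, hence the ~67 GB in the table):

```python
def bf16_weight_gb(n_params: float) -> float:
    """Approximate weight memory in GB for BF16 (2 bytes per parameter)."""
    return n_params * 2 / 1e9

# ~35B parameters in BF16 -> ~70 GB of raw weights, split across 2 GPUs.
weights = bf16_weight_gb(35e9)

# Two L40S GPUs at 46 GB each, with GPU_MEM_UTIL=0.92 reserved for vLLM;
# whatever is left after the weights becomes KV-cache headroom.
budget = 2 * 46 * 0.92
kv_cache_headroom = budget - weights
print(round(weights, 1), round(budget, 1), round(kv_cache_headroom, 1))
```

The remaining headroom is what bounds how many concurrent 32k-token contexts the server can hold.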
@@ -54,11 +59,10 @@ Students (OpenAI SDK / curl)
ssh herzogfloria@silicon.fhgr.ch
```
-### Step 1: Clone This Repository
+### Step 1: Clone the Repository
```bash
-# Or copy the files to the server
-git clone <your-repo-url> ~/LLM_local
+git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```
@@ -66,124 +70,95 @@ chmod +x *.sh
### Step 2: Check GPU and Environment
```bash
# Verify GPU is visible
nvidia-smi
# Verify Apptainer is installed
apptainer --version
# Check available disk space
df -h ~
```
-### Step 3: Download the Model (~60 GB)
+### Step 3: Build the Apptainer Container
```bash
-# Install huggingface-cli if not available
-pip install --user huggingface_hub[cli]
-# Download Qwen3.5-27B
-bash 01_download_model.sh
-# Default target: ~/models/Qwen3.5-27B
+bash 01_build_container.sh
```
-This downloads the full BF16 weights. Takes 20-60 minutes depending on bandwidth.
+Pulls the `vllm/vllm-openai:latest` Docker image, upgrades vLLM to nightly
+(required for Qwen3.5 support), installs latest `transformers` from source,
+and packages everything into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
-### Step 4: Build the Apptainer Container
+### Step 4: Download the Model (~67 GB)
```bash
-bash 02_build_container.sh
+bash 02_download_model.sh
```
-This pulls the `vllm/vllm-openai:latest` Docker image and converts it to a `.sif` file.
-Takes 10-20 minutes. The resulting `vllm_qwen.sif` is ~12-15 GB.
-> **Tip**: If building fails due to network/proxy issues, you can pull the Docker image
-> first and convert manually:
-> ```bash
-> apptainer pull docker://vllm/vllm-openai:latest
-> ```
+Downloads Qwen3.5-35B-A3B weights using `huggingface-cli` **inside the
+container**. Stored at `~/models/Qwen3.5-35B-A3B`. Takes 5-30 minutes
+depending on bandwidth.
### Step 5: Start the Server
-**Interactive (foreground):**
+**Interactive (foreground) — recommended with tmux:**
```bash
+tmux new -s llm
bash 03_start_server.sh
+# Ctrl+B, then D to detach
```
-**Background (recommended for production):**
+**Background with logging:**
```bash
bash 04_start_server_background.sh
```
-The server takes 2-5 minutes to load the model into GPU memory. Monitor with:
```bash
tail -f logs/vllm_server_*.log
```
-Look for the line:
+The model takes 2-5 minutes to load into GPU memory. It's ready when you see:
```
-INFO: Uvicorn running on http://0.0.0.0:8000
+INFO: Uvicorn running on http://0.0.0.0:7080
```
### Step 6: Test the Server
From another terminal on the server:
```bash
# Quick health check
curl http://localhost:7080/v1/models
```
-# Full test
-pip install openai
-python test_server.py
+Or run the full test (uses `openai` SDK inside the container):
+```bash
+apptainer exec --writable-tmpfs vllm_qwen.sif python3 test_server.py
+```
### Step 7: Share with Students
-Distribute the `STUDENT_GUIDE.md` file or share the connection details:
-- **27B Base URL**: `http://silicon.fhgr.ch:7080/v1` — model name: `qwen3.5-27b`
-- **35B Base URL**: `http://silicon.fhgr.ch:7081/v1` — model name: `qwen3.5-35b-a3b`
+Distribute `STUDENT_GUIDE.md` with connection details:
+- **Base URL**: `http://silicon.fhgr.ch:7080/v1`
+- **Model name**: `qwen3.5-35b-a3b`
---
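For reference when sharing these details, the student-side call shape can be sketched as follows. Only the payload builder runs offline; the commented lines show the live call with the `openai` SDK:

```python
def build_chat_request(prompt: str,
                       model: str = "qwen3.5-35b-a3b",
                       max_tokens: int = 512) -> dict:
    """Assemble the keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# Against the live server (requires `pip install openai` and university network/VPN):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://silicon.fhgr.ch:7080/v1", api_key="EMPTY")
#   reply = client.chat.completions.create(**build_chat_request("Hello!"))
#   print(reply.choices[0].message.content)
print(build_chat_request("Hello!")["model"])
```

`EMPTY` is a placeholder key, since the server runs without authentication by default.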
## Configuration
-All configuration is via environment variables in `03_start_server.sh`:
+All configuration is via environment variables passed to `03_start_server.sh`:
| Variable | Default | Description |
-|-------------------|------------------------------|-------------------------------------|
-| `MODEL_DIR` | `~/models/Qwen3.5-27B` | Path to model weights |
+|-------------------|----------------------------------|--------------------------------|
+| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | Path to model weights |
| `PORT` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | Max context length (tokens) |
| `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use |
| `API_KEY` | *(empty = no auth)* | API key for authentication |
-| `TENSOR_PARALLEL` | `1` | Number of GPUs |
+| `TENSOR_PARALLEL` | `2` | Number of GPUs |
-### Context Length Tuning
-The default `MAX_MODEL_LEN=32768` is conservative and ensures stable operation for 15
-concurrent users. If you have plenty of VRAM headroom:
+### Examples
```bash
+# Increase context length
MAX_MODEL_LEN=65536 bash 03_start_server.sh
-```
-Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require
-significantly more GPU memory for KV cache.
+# Add API key authentication
+API_KEY="your-secret-key" bash 03_start_server.sh
-### Adding Authentication
-```bash
-API_KEY="your-secret-key-here" bash 03_start_server.sh
```
Students then use this key in their `api_key` parameter.
### Multi-GPU Setup
-If you have multiple GPUs:
```bash
-TENSOR_PARALLEL=2 bash 03_start_server.sh
+# Use all 4 GPUs (more KV cache headroom)
+TENSOR_PARALLEL=4 bash 03_start_server.sh
```
---
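The examples above pass variables inline; internally such start scripts typically resolve each variable against a fall-back default. A sketch of that resolution logic (the defaults mirror the configuration table; the actual internals of `03_start_server.sh` are assumed, not copied):

```python
# Defaults as documented in the configuration table above.
DEFAULTS = {
    "MODEL_DIR": "~/models/Qwen3.5-35B-A3B",
    "PORT": "7080",
    "MAX_MODEL_LEN": "32768",
    "GPU_MEM_UTIL": "0.92",
    "API_KEY": "",
    "TENSOR_PARALLEL": "2",
}

def effective_config(env: dict) -> dict:
    """Caller-supplied environment variables override the defaults."""
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}

# e.g. `TENSOR_PARALLEL=4 bash 03_start_server.sh`:
cfg = effective_config({"TENSOR_PARALLEL": "4"})
print(cfg["TENSOR_PARALLEL"], cfg["PORT"])
```

Any variable not set by the caller simply keeps the documented default.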
@@ -195,7 +170,7 @@ TENSOR_PARALLEL=2 bash 03_start_server.sh
bash 04_start_server_background.sh
# Check if running
-curl -s http://localhost:7080/v1/models | python -m json.tool
+curl -s http://localhost:7080/v1/models | python3 -m json.tool
# View logs
tail -f logs/vllm_server_*.log
@@ -205,20 +180,9 @@ bash 05_stop_server.sh
# Monitor GPU usage
watch -n 2 nvidia-smi
```
-### Running Persistently with tmux
-For a robust setup that survives SSH disconnects:
-```bash
-ssh herzogfloria@silicon.fhgr.ch
-tmux new -s llm_server
-bash 03_start_server.sh
-# Press Ctrl+B, then D to detach
-# Reconnect later:
-tmux attach -t llm_server
+# Reconnect to tmux session
+tmux attach -t llm
```
---
@@ -226,12 +190,12 @@ tmux attach -t llm_server
## Files Overview
| File | Purpose |
-|------------------------------|------------------------------------------- |
-| `vllm_qwen.def` | Apptainer container definition |
-| `01_download_model.sh` | Downloads model weights from Hugging Face |
-| `02_build_container.sh` | Builds the Apptainer .sif image |
+|----------------------------------|------------------------------------------------------|
+| `vllm_qwen.def` | Apptainer container definition (vLLM nightly + deps) |
+| `01_build_container.sh` | Builds the Apptainer `.sif` image |
+| `02_download_model.sh` | Downloads model weights (runs inside container) |
| `03_start_server.sh` | Starts vLLM server (foreground) |
-| `04_start_server_background.sh` | Starts server in background with logging|
+| `04_start_server_background.sh` | Starts server in background with logging |
| `05_stop_server.sh` | Stops the background server |
| `test_server.py` | Tests the running server |
| `STUDENT_GUIDE.md` | Instructions for students |
@@ -241,25 +205,28 @@ tmux attach -t llm_server
## Troubleshooting
### "CUDA out of memory"
-- Reduce `MAX_MODEL_LEN` (e.g., 16384)
-- Reduce `GPU_MEM_UTIL` (e.g., 0.85)
-- Use a quantized model variant
+- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
+- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)
### Container build fails
-- Ensure you have internet access and sufficient disk space (~20 GB for build cache)
-- Try: `apptainer pull docker://vllm/vllm-openai:latest` first
+- Ensure internet access and sufficient disk space (~20 GB for build cache)
+- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:latest`
### "No NVIDIA GPU detected"
-- Check that `nvidia-smi` works outside the container
-- Ensure `--nv` flag is passed (already in scripts)
-- Verify nvidia-container-cli: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
+- Verify `nvidia-smi` works on the host
+- Ensure `--nv` flag is present (already in scripts)
+- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
-### Server starts but students can't connect
-- Check firewall: `sudo ufw allow 7080:7090/tcp` or equivalent
+### "Model type qwen3_5_moe not recognized"
+- The container needs vLLM nightly and latest transformers
+- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
+### Students can't connect
+- Check firewall: ports 7080-7090 must be open
 - Verify the server binds to `0.0.0.0` (not just localhost)
 - Students must use the server's hostname/IP, not `localhost`
+- Students must be on the university network or VPN
### Slow generation with many users
-- This is expected — vLLM batches requests but throughput is finite
-- Consider reducing `max_tokens` in student requests
-- Monitor with: `curl http://localhost:7080/metrics`
+- Expected — vLLM batches requests but throughput is finite
+- The MoE architecture (3B active) helps with per-token speed
+- Monitor: `curl http://localhost:7080/metrics`
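The `/metrics` endpoint serves Prometheus text format. A small sketch of pulling one number out of it client-side (the sample lines and metric names below are illustrative, not verbatim vLLM output):

```python
def parse_prometheus(text: str) -> dict:
    """Parse simple Prometheus text-format lines into {name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Illustrative sample; fetch the real thing with:
#   curl http://localhost:7080/metrics
sample = """\
# HELP vllm:num_requests_running Number of requests currently running
vllm:num_requests_running 3.0
vllm:num_requests_waiting 1.0
"""
print(parse_prometheus(sample)["vllm:num_requests_running"])
```

This plain parser ignores label sets; for anything serious, scrape the endpoint with Prometheus itself.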

STUDENT_GUIDE.md

@@ -107,6 +107,55 @@ curl http://silicon.fhgr.ch:7080/v1/chat/completions \
---
## Streamlit Chat & File Editor App
A simple web UI is included for chatting with the model and editing files.
### Setup
```bash
pip install streamlit openai
```
### Run
```bash
streamlit run app.py
```
This opens a browser with two tabs:
- **Chat** — Conversational interface with streaming responses. You can save
the model's last response directly to a file.
- **File Editor** — Create and edit `.py`, `.tex`, `.html`, or any text file.
Use the "Generate with LLM" button to have the model modify your file based
on an instruction (e.g. "add error handling" or "fix the LaTeX formatting").
Files are stored in a `workspace/` folder next to `app.py`.
> **Tip**: The app runs on your local machine and connects to the server — you
> don't need to install anything on the GPU server.
---
## Thinking Mode
By default, the model "thinks" before answering (internal chain-of-thought).
This is great for complex reasoning but adds latency for simple questions.
To disable thinking and get faster direct responses, add this to your API call:
```python
response = client.chat.completions.create(
model="qwen3.5-35b-a3b",
messages=[...],
max_tokens=1024,
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```
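With thinking left enabled, the reasoning trace is commonly wrapped in `<think>...</think>` tags inside the returned text. A small sketch for stripping it client-side (the tag convention is an assumption and may vary with the chat template):

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks, leaving only the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>The user wants a short answer.</think>Paris."
print(strip_thinking(raw))
```

Responses without the tags pass through unchanged, so it is safe to apply unconditionally.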
---
## Troubleshooting
| Issue | Solution |

app.py (new file, 181 lines)

@@ -0,0 +1,181 @@
"""
Streamlit Chat & File Editor for Qwen3.5-35B-A3B
A minimal interface to:
1. Chat with the local LLM (OpenAI-compatible API)
2. Edit, save, and generate code / LaTeX files
Usage:
pip install streamlit openai
streamlit run app.py
"""
import re
import streamlit as st
from openai import OpenAI
from pathlib import Path
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
API_BASE = st.sidebar.text_input("API Base URL", "http://silicon.fhgr.ch:7080/v1")
API_KEY = st.sidebar.text_input("API Key", "EMPTY", type="password")
MODEL = "qwen3.5-35b-a3b"
WORKSPACE = Path("workspace")
WORKSPACE.mkdir(exist_ok=True)
client = OpenAI(base_url=API_BASE, api_key=API_KEY)
LANG_MAP = {
".py": "python", ".tex": "latex", ".js": "javascript",
".html": "html", ".css": "css", ".sh": "bash",
".json": "json", ".yaml": "yaml", ".yml": "yaml",
}
def extract_code(text: str, lang: str = "") -> str:
"""Extract the first fenced code block from markdown text.
Falls back to the full text if no code block is found."""
pattern = r"```(?:\w*)\n(.*?)```"
match = re.search(pattern, text, re.DOTALL)
if match:
return match.group(1).strip()
return text.strip()
# ---------------------------------------------------------------------------
# Sidebar — File Manager
# ---------------------------------------------------------------------------
st.sidebar.markdown("---")
st.sidebar.header("File Manager")
new_filename = st.sidebar.text_input("New file name", placeholder="main.tex")
if st.sidebar.button("Create File") and new_filename:
(WORKSPACE / new_filename).touch()
st.sidebar.success(f"Created {new_filename}")
st.rerun()
files = sorted(WORKSPACE.iterdir()) if WORKSPACE.exists() else []
file_names = [f.name for f in files if f.is_file()]
selected_file = st.sidebar.selectbox("Open file", file_names if file_names else ["(no files)"])
# ---------------------------------------------------------------------------
# Main Layout — Two Tabs
# ---------------------------------------------------------------------------
tab_chat, tab_editor = st.tabs(["Chat", "File Editor"])
# ---------------------------------------------------------------------------
# Tab 1: Chat
# ---------------------------------------------------------------------------
with tab_chat:
st.header("Chat with Qwen3.5")
if "messages" not in st.session_state:
st.session_state.messages = []
for msg in st.session_state.messages:
with st.chat_message(msg["role"]):
st.markdown(msg["content"])
if prompt := st.chat_input("Ask anything..."):
st.session_state.messages.append({"role": "user", "content": prompt})
with st.chat_message("user"):
st.markdown(prompt)
with st.chat_message("assistant"):
placeholder = st.empty()
full_response = ""
stream = client.chat.completions.create(
model=MODEL,
messages=st.session_state.messages,
max_tokens=8092,
temperature=0.2,
stream=True,
extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
for chunk in stream:
delta = chunk.choices[0].delta.content or ""
full_response += delta
placeholder.markdown(full_response + "")
placeholder.markdown(full_response)
st.session_state.messages.append({"role": "assistant", "content": full_response})
if st.session_state.messages:
col_clear, col_save = st.columns([1, 3])
with col_clear:
if st.button("Clear Chat"):
st.session_state.messages = []
st.rerun()
with col_save:
if selected_file and selected_file != "(no files)":
if st.button(f"Save code → {selected_file}"):
last = st.session_state.messages[-1]["content"]
suffix = Path(selected_file).suffix
lang = LANG_MAP.get(suffix, "")
code = extract_code(last, lang)
(WORKSPACE / selected_file).write_text(code)
st.success(f"Extracted code saved to workspace/{selected_file}")
# ---------------------------------------------------------------------------
# Tab 2: File Editor
# ---------------------------------------------------------------------------
with tab_editor:
st.header("File Editor")
if selected_file and selected_file != "(no files)":
file_path = WORKSPACE / selected_file
content = file_path.read_text() if file_path.exists() else ""
suffix = file_path.suffix
lang = LANG_MAP.get(suffix, "text")
st.code(content, language=lang if lang != "text" else None, line_numbers=True)
edited = st.text_area(
"Edit below:",
value=content,
height=400,
key=f"editor_{selected_file}_{hash(content)}",
)
col_save, col_gen = st.columns(2)
with col_save:
if st.button("Save File"):
file_path.write_text(edited)
st.success(f"Saved {selected_file}")
st.rerun()
with col_gen:
gen_prompt = st.text_input(
"Generation instruction",
placeholder="e.g. Add error handling / Fix the LaTeX formatting",
key="gen_prompt",
)
if st.button("Generate with LLM") and gen_prompt:
with st.spinner("Generating..."):
response = client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": (
f"You are a coding assistant. The user has a {lang} file. "
"Return ONLY the raw file content inside a single code block. "
"No explanations, no comments about changes."
)},
{"role": "user", "content": (
f"Here is my {lang} file:\n\n```\n{edited}\n```\n\n"
f"Instruction: {gen_prompt}"
)},
],
max_tokens=16384,
temperature=0.6,
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
result = response.choices[0].message.content
code = extract_code(result, lang)
file_path.write_text(code)
st.success("File updated by LLM")
st.rerun()
else:
st.info("Create a file in the sidebar to start editing.")

requirements.txt (new file, 2 lines)

@@ -0,0 +1,2 @@
streamlit
openai

test_server.py

@@ -38,9 +38,9 @@ def main():
response = client.chat.completions.create(
model=model,
messages=[
{"role": "user", "content": "What is 2 + 2? Answer in one sentence."}
{"role": "user", "content": "Create a latex document that derives and explains the principle component analysis (pca). Make a self contain document with introduction, derivation, examples of applications. This is for computer science undergraduate class."}
],
-max_tokens=256,
+max_tokens=16384,
temperature=0.7,
)
print(f" Response: {response.choices[0].message.content}")
@@ -53,7 +53,7 @@ def main():
messages=[
{"role": "user", "content": "Count from 1 to 5."}
],
-max_tokens=128,
+max_tokens=16384,
temperature=0.7,
stream=True,
)

vllm_qwen.def

@@ -1,10 +1,10 @@
Bootstrap: docker
-From: vllm/vllm-openai:latest
+From: vllm/vllm-openai:nightly
%labels
Author herzogfloria
Description vLLM nightly inference server for Qwen3.5-35B-A3B
-Version 2.0
+Version 3.0
%environment
export HF_HOME=/tmp/hf_cache
@@ -12,7 +12,6 @@ From: vllm/vllm-openai:latest
%post
apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*
pip install --no-cache-dir vllm --extra-index-url https://wheels.vllm.ai/nightly
pip install --no-cache-dir "transformers @ git+https://github.com/huggingface/transformers.git@main"
pip install --no-cache-dir huggingface_hub[cli]