Add Streamlit chat app, update container to vLLM nightly

- Add app.py: Streamlit UI with chat and file editor tabs
- Add requirements.txt: streamlit + openai dependencies
- Update vllm_qwen.def: use nightly image for Qwen3.5 support
- Update README.md: reflect 35B-A3B model, correct script names
- Update STUDENT_GUIDE.md: add app usage and thinking mode docs
- Update .gitignore: exclude .venv/ and workspace/

Made-with: Cursor
This commit is contained in:
parent 076001b07f · commit 9e1e0c0751
**.gitignore** (+6 lines)
````diff
@@ -10,5 +10,11 @@ models/
 # HuggingFace cache
 .cache/
 
+# Python venv
+.venv/
+
+# Streamlit workspace files
+workspace/
+
 # macOS
 .DS_Store
````
|||||||
249
README.md
249
README.md
````diff
@@ -1,7 +1,8 @@
-# LLM Local — Qwen3.5-27B Inference Server
+# LLM Inferenz Server — Qwen3.5-35B-A3B
 
-Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-27B**,
-served via **vLLM** inside an **Apptainer** container on a GPU server.
+Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
+(MoE, 35B total / 3B active per token), served via **vLLM** inside an
+**Apptainer** container on a GPU server.
 
 ## Architecture
 
````
````diff
@@ -9,40 +10,44 @@ served via **vLLM** inside an **Apptainer** container on a GPU server.
 Students (OpenAI SDK / curl)
         │
         ▼
-┌─────────────────────────┐
-│  silicon.fhgr.ch:7080   │
-│  OpenAI-compatible API  │
-├─────────────────────────┤
-│       vLLM Server       │
-│  (Apptainer container)  │
-├─────────────────────────┤
-│   Qwen3.5-27B weights   │
-│     (bind-mounted)      │
-├─────────────────────────┤
-│       NVIDIA GPU        │
-└─────────────────────────┘
+┌──────────────────────────────┐
+│     silicon.fhgr.ch:7080     │
+│    OpenAI-compatible API     │
+├──────────────────────────────┤
+│    vLLM Server (nightly)     │
+│  Apptainer container (.sif)  │
+├──────────────────────────────┤
+│   Qwen3.5-35B-A3B weights    │
+│   (bind-mounted from host)   │
+├──────────────────────────────┤
+│  2× NVIDIA L40S (46 GB ea.)  │
+│     Tensor Parallel = 2      │
+└──────────────────────────────┘
 ```
 
+## Hardware
+
+The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each).
+The inference server uses **2 GPUs** with tensor parallelism, leaving 2 GPUs free.
+
+| Component | Value |
+|-----------|-------|
+| GPUs used | 2× NVIDIA L40S |
+| VRAM used | ~92 GB total |
+| Model size (BF16) | ~67 GB |
+| Active params/token | 3B (MoE) |
+| Context length | 32,768 tokens |
+| Port | 7080 |
+
 ## Prerequisites
 
-- **GPU**: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended).
-  Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead.
-- **Apptainer** (formerly Singularity) installed on the server.
-- **NVIDIA drivers** + **nvidia-container-cli** for GPU passthrough.
-- **~60 GB disk space** for model weights + ~15 GB for the container image.
-- **Network**: Students must be on the university network or VPN.
+- **Apptainer** (formerly Singularity) installed on the server
+- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
+- **~80 GB disk** for model weights + ~8 GB for the container image
+- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)
 
-## Hardware Sizing
-
-| Component | Minimum | Recommended |
-|-----------|----------------|-----------------|
-| GPU VRAM | 80 GB (1× A100)| 80 GB (1× H100) |
-| RAM | 64 GB | 128 GB |
-| Disk | 100 GB free | 200 GB free |
-
-> **If your GPU has less than 80 GB VRAM**, you have two options:
-> 1. Use a **quantized** version (e.g., AWQ/GPTQ 4-bit — ~16 GB VRAM)
-> 2. Use **tensor parallelism** across multiple GPUs (set `TENSOR_PARALLEL=2`)
+> **Note**: No `pip` or `python` is needed on the host — everything runs inside
+> the Apptainer container.
 
 ---
 
````
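The ~67 GB model-size row in the README's hardware table can be sanity-checked with quick arithmetic; the sketch below assumes 2 bytes per BF16 parameter and ignores KV cache and activation overhead:

```python
# Back-of-envelope check of the BF16 weight footprint
# (assumption: 2 bytes per parameter, all other overhead ignored).
total_params = 35e9          # Qwen3.5-35B-A3B total parameters
bytes_per_param = 2          # BF16
weight_bytes = total_params * bytes_per_param

weight_gib = weight_bytes / 2**30
print(f"~{weight_gib:.0f} GiB of weights")   # ~65 GiB, in line with the ~67 GB figure

# Two L40S GPUs at GPU_MEM_UTIL=0.92 give the usable memory budget:
budget_gib = 2 * 46 * 0.92
print(f"~{budget_gib:.0f} GiB usable across 2 GPUs")
```

The gap between the ~65 GiB of weights and the ~85 GiB usable budget is what vLLM can assign to KV cache for concurrent requests.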
````diff
@@ -54,11 +59,10 @@ Students (OpenAI SDK / curl)
 ssh herzogfloria@silicon.fhgr.ch
 ```
 
-### Step 1: Clone This Repository
+### Step 1: Clone the Repository
 
 ```bash
-# Or copy the files to the server
-git clone <your-repo-url> ~/LLM_local
+git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
 cd ~/LLM_local
 chmod +x *.sh
 ```
````
````diff
@@ -66,124 +70,95 @@ chmod +x *.sh
 ### Step 2: Check GPU and Environment
 
 ```bash
-# Verify GPU is visible
 nvidia-smi
 
-# Verify Apptainer is installed
 apptainer --version
 
-# Check available disk space
 df -h ~
 ```
 
-### Step 3: Download the Model (~60 GB)
+### Step 3: Build the Apptainer Container
 
 ```bash
-# Install huggingface-cli if not available
-pip install --user huggingface_hub[cli]
-
-# Download Qwen3.5-27B
-bash 01_download_model.sh
-# Default target: ~/models/Qwen3.5-27B
+bash 01_build_container.sh
 ```
 
-This downloads the full BF16 weights. Takes 20-60 minutes depending on bandwidth.
+Pulls the `vllm/vllm-openai:latest` Docker image, upgrades vLLM to nightly
+(required for Qwen3.5 support), installs latest `transformers` from source,
+and packages everything into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
 
-### Step 4: Build the Apptainer Container
+### Step 4: Download the Model (~67 GB)
 
 ```bash
-bash 02_build_container.sh
+bash 02_download_model.sh
 ```
 
-This pulls the `vllm/vllm-openai:latest` Docker image and converts it to a `.sif` file.
-Takes 10-20 minutes. The resulting `vllm_qwen.sif` is ~12-15 GB.
-
-> **Tip**: If building fails due to network/proxy issues, you can pull the Docker image
-> first and convert manually:
-> ```bash
-> apptainer pull docker://vllm/vllm-openai:latest
-> ```
+Downloads Qwen3.5-35B-A3B weights using `huggingface-cli` **inside the
+container**. Stored at `~/models/Qwen3.5-35B-A3B`. Takes 5-30 minutes
+depending on bandwidth.
 
 ### Step 5: Start the Server
 
-**Interactive (foreground):**
+**Interactive (foreground) — recommended with tmux:**
 ```bash
+tmux new -s llm
 bash 03_start_server.sh
+# Ctrl+B, then D to detach
 ```
 
-**Background (recommended for production):**
+**Background with logging:**
 ```bash
 bash 04_start_server_background.sh
-```
-
-The server takes 2-5 minutes to load the model into GPU memory. Monitor with:
-```bash
 tail -f logs/vllm_server_*.log
 ```
 
-Look for the line:
+The model takes 2-5 minutes to load into GPU memory. It's ready when you see:
 ```
-INFO: Uvicorn running on http://0.0.0.0:8000
+INFO: Uvicorn running on http://0.0.0.0:7080
 ```
 
 ### Step 6: Test the Server
 
+From another terminal on the server:
 ```bash
-# Quick health check
 curl http://localhost:7080/v1/models
+```
 
-# Full test
-pip install openai
-python test_server.py
+Or run the full test (uses `openai` SDK inside the container):
+```bash
+apptainer exec --writable-tmpfs vllm_qwen.sif python3 test_server.py
 ```
 
 ### Step 7: Share with Students
 
-Distribute the `STUDENT_GUIDE.md` file or share the connection details:
-- **27B Base URL**: `http://silicon.fhgr.ch:7080/v1` — model name: `qwen3.5-27b`
-- **35B Base URL**: `http://silicon.fhgr.ch:7081/v1` — model name: `qwen3.5-35b-a3b`
+Distribute `STUDENT_GUIDE.md` with connection details:
+- **Base URL**: `http://silicon.fhgr.ch:7080/v1`
+- **Model name**: `qwen3.5-35b-a3b`
 
 ---
 
 ## Configuration
 
-All configuration is via environment variables in `03_start_server.sh`:
+All configuration is via environment variables passed to `03_start_server.sh`:
 
 | Variable | Default | Description |
-|-------------------|------------------------------|-------------------------------------|
-| `MODEL_DIR` | `~/models/Qwen3.5-27B` | Path to model weights |
+|-------------------|----------------------------------|--------------------------------|
+| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | Path to model weights |
 | `PORT` | `7080` | HTTP port |
 | `MAX_MODEL_LEN` | `32768` | Max context length (tokens) |
 | `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use |
 | `API_KEY` | *(empty = no auth)* | API key for authentication |
-| `TENSOR_PARALLEL` | `1` | Number of GPUs |
+| `TENSOR_PARALLEL` | `2` | Number of GPUs |
 
-### Context Length Tuning
+### Examples
 
-The default `MAX_MODEL_LEN=32768` is conservative and ensures stable operation for 15
-concurrent users. If you have plenty of VRAM headroom:
-
 ```bash
+# Increase context length
 MAX_MODEL_LEN=65536 bash 03_start_server.sh
-```
-
-Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require
-significantly more GPU memory for KV cache.
 
-### Adding Authentication
+# Add API key authentication
+API_KEY="your-secret-key" bash 03_start_server.sh
 
-```bash
-API_KEY="your-secret-key-here" bash 03_start_server.sh
-```
-
-Students then use this key in their `api_key` parameter.
-
-### Multi-GPU Setup
-
-If you have multiple GPUs:
-
-```bash
-TENSOR_PARALLEL=2 bash 03_start_server.sh
+# Use all 4 GPUs (more KV cache headroom)
+TENSOR_PARALLEL=4 bash 03_start_server.sh
 ```
 
 ---
 
````
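The `VAR=value bash script.sh` overrides in the README's configuration examples rely on the script falling back to documented defaults. A minimal sketch of the parameter-expansion idiom presumably used inside `03_start_server.sh` (the script internals are not shown in this diff):

```shell
# Fall back to the documented defaults unless the caller supplied a value,
# e.g. via `MAX_MODEL_LEN=65536 bash 03_start_server.sh`.
MAX_MODEL_LEN="${MAX_MODEL_LEN:-32768}"
TENSOR_PARALLEL="${TENSOR_PARALLEL:-2}"
PORT="${PORT:-7080}"

echo "len=${MAX_MODEL_LEN} tp=${TENSOR_PARALLEL} port=${PORT}"
```

Run without overrides, this prints the defaults from the table; prefixing any variable on the command line replaces just that one value.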
````diff
@@ -195,7 +170,7 @@ TENSOR_PARALLEL=2 bash 03_start_server.sh
 bash 04_start_server_background.sh
 
 # Check if running
-curl -s http://localhost:7080/v1/models | python -m json.tool
+curl -s http://localhost:7080/v1/models | python3 -m json.tool
 
 # View logs
 tail -f logs/vllm_server_*.log
````
````diff
@@ -205,61 +180,53 @@ bash 05_stop_server.sh
 
 # Monitor GPU usage
 watch -n 2 nvidia-smi
-```
 
-### Running Persistently with tmux
-
-For a robust setup that survives SSH disconnects:
-
-```bash
-ssh herzogfloria@silicon.fhgr.ch
-tmux new -s llm_server
-bash 03_start_server.sh
-# Press Ctrl+B, then D to detach
-
-# Reconnect later:
-tmux attach -t llm_server
+# Reconnect to tmux session
+tmux attach -t llm
 ```
 
 ---
 
 ## Files Overview
 
 | File | Purpose |
-|------------------------------|------------------------------------------- |
-| `vllm_qwen.def` | Apptainer container definition |
-| `01_download_model.sh` | Downloads model weights from Hugging Face |
-| `02_build_container.sh` | Builds the Apptainer .sif image |
-| `03_start_server.sh` | Starts vLLM server (foreground) |
-| `04_start_server_background.sh` | Starts server in background with logging|
+|----------------------------------|------------------------------------------------------|
+| `vllm_qwen.def` | Apptainer container definition (vLLM nightly + deps) |
+| `01_build_container.sh` | Builds the Apptainer `.sif` image |
+| `02_download_model.sh` | Downloads model weights (runs inside container) |
+| `03_start_server.sh` | Starts vLLM server (foreground) |
+| `04_start_server_background.sh` | Starts server in background with logging |
 | `05_stop_server.sh` | Stops the background server |
 | `test_server.py` | Tests the running server |
 | `STUDENT_GUIDE.md` | Instructions for students |
 
 ---
 
 ## Troubleshooting
 
 ### "CUDA out of memory"
-- Reduce `MAX_MODEL_LEN` (e.g., 16384)
-- Reduce `GPU_MEM_UTIL` (e.g., 0.85)
-- Use a quantized model variant
+- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
+- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)
 
 ### Container build fails
-- Ensure you have internet access and sufficient disk space (~20 GB for build cache)
-- Try: `apptainer pull docker://vllm/vllm-openai:latest` first
+- Ensure internet access and sufficient disk space (~20 GB for build cache)
+- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:latest`
 
 ### "No NVIDIA GPU detected"
-- Check that `nvidia-smi` works outside the container
-- Ensure `--nv` flag is passed (already in scripts)
-- Verify nvidia-container-cli: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
+- Verify `nvidia-smi` works on the host
+- Ensure `--nv` flag is present (already in scripts)
+- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
 
-### Server starts but students can't connect
-- Check firewall: `sudo ufw allow 7080:7090/tcp` or equivalent
+### "Model type qwen3_5_moe not recognized"
+- The container needs vLLM nightly and latest transformers
+- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
+
+### Students can't connect
+- Check firewall: ports 7080-7090 must be open
 - Verify the server binds to `0.0.0.0` (not just localhost)
-- Students must use the server's hostname/IP, not `localhost`
+- Students must be on the university network or VPN
 
 ### Slow generation with many users
-- This is expected — vLLM batches requests but throughput is finite
-- Consider reducing `max_tokens` in student requests
-- Monitor with: `curl http://localhost:7080/metrics`
+- Expected — vLLM batches requests but throughput is finite
+- The MoE architecture (3B active) helps with per-token speed
+- Monitor: `curl http://localhost:7080/metrics`
````
**STUDENT_GUIDE.md**

````diff
@@ -107,6 +107,55 @@ curl http://silicon.fhgr.ch:7080/v1/chat/completions \
 
 ---
 
+## Streamlit Chat & File Editor App
+
+A simple web UI is included for chatting with the model and editing files.
+
+### Setup
+
+```bash
+pip install streamlit openai
+```
+
+### Run
+
+```bash
+streamlit run app.py
+```
+
+This opens a browser with two tabs:
+
+- **Chat** — Conversational interface with streaming responses. You can save
+  the model's last response directly to a file.
+- **File Editor** — Create and edit `.py`, `.tex`, `.html`, or any text file.
+  Use the "Generate with LLM" button to have the model modify your file based
+  on an instruction (e.g. "add error handling" or "fix the LaTeX formatting").
+
+Files are stored in a `workspace/` folder next to `app.py`.
+
+> **Tip**: The app runs on your local machine and connects to the server — you
+> don't need to install anything on the GPU server.
+
+---
+
+## Thinking Mode
+
+By default, the model "thinks" before answering (internal chain-of-thought).
+This is great for complex reasoning but adds latency for simple questions.
+
+To disable thinking and get faster direct responses, add this to your API call:
+
+```python
+response = client.chat.completions.create(
+    model="qwen3.5-35b-a3b",
+    messages=[...],
+    max_tokens=1024,
+    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
+)
+```
+
+---
+
 ## Troubleshooting
 
 | Issue | Solution |
````
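The `extra_body` argument in the thinking-mode snippet is merged by the `openai` Python SDK into the top-level JSON request body, which is how a non-standard field like `chat_template_kwargs` reaches vLLM. A sketch of the resulting wire payload (field order and the example message are illustrative):

```python
import json

# Standard kwargs as in the snippet above, plus the extra_body fields,
# which the openai SDK merges into the same top-level JSON object.
kwargs = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Hi"}],
    "max_tokens": 1024,
}
extra_body = {"chat_template_kwargs": {"enable_thinking": False}}

payload = {**kwargs, **extra_body}
print(json.dumps(payload, indent=2))
```

vLLM reads `chat_template_kwargs` and passes it to the model's chat template, so the toggle never appears in the message text itself.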
**app.py** (new file, +181 lines)
````python
"""
Streamlit Chat & File Editor for Qwen3.5-35B-A3B

A minimal interface to:
1. Chat with the local LLM (OpenAI-compatible API)
2. Edit, save, and generate code / LaTeX files

Usage:
    pip install streamlit openai
    streamlit run app.py
"""

import re
import streamlit as st
from openai import OpenAI
from pathlib import Path

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
API_BASE = st.sidebar.text_input("API Base URL", "http://silicon.fhgr.ch:7080/v1")
API_KEY = st.sidebar.text_input("API Key", "EMPTY", type="password")
MODEL = "qwen3.5-35b-a3b"
WORKSPACE = Path("workspace")
WORKSPACE.mkdir(exist_ok=True)

client = OpenAI(base_url=API_BASE, api_key=API_KEY)

LANG_MAP = {
    ".py": "python", ".tex": "latex", ".js": "javascript",
    ".html": "html", ".css": "css", ".sh": "bash",
    ".json": "json", ".yaml": "yaml", ".yml": "yaml",
}


def extract_code(text: str, lang: str = "") -> str:
    """Extract the first fenced code block from markdown text.

    Falls back to the full text if no code block is found."""
    pattern = r"```(?:\w*)\n(.*?)```"
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return text.strip()


# ---------------------------------------------------------------------------
# Sidebar — File Manager
# ---------------------------------------------------------------------------
st.sidebar.markdown("---")
st.sidebar.header("File Manager")

new_filename = st.sidebar.text_input("New file name", placeholder="main.tex")
if st.sidebar.button("Create File") and new_filename:
    (WORKSPACE / new_filename).touch()
    st.sidebar.success(f"Created {new_filename}")
    st.rerun()

files = sorted(WORKSPACE.iterdir()) if WORKSPACE.exists() else []
file_names = [f.name for f in files if f.is_file()]
selected_file = st.sidebar.selectbox("Open file", file_names if file_names else ["(no files)"])

# ---------------------------------------------------------------------------
# Main Layout — Two Tabs
# ---------------------------------------------------------------------------
tab_chat, tab_editor = st.tabs(["Chat", "File Editor"])

# ---------------------------------------------------------------------------
# Tab 1: Chat
# ---------------------------------------------------------------------------
with tab_chat:
    st.header("Chat with Qwen3.5")

    if "messages" not in st.session_state:
        st.session_state.messages = []

    for msg in st.session_state.messages:
        with st.chat_message(msg["role"]):
            st.markdown(msg["content"])

    if prompt := st.chat_input("Ask anything..."):
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)

        with st.chat_message("assistant"):
            placeholder = st.empty()
            full_response = ""

            stream = client.chat.completions.create(
                model=MODEL,
                messages=st.session_state.messages,
                max_tokens=8092,
                temperature=0.2,
                stream=True,
                extra_body={"chat_template_kwargs": {"enable_thinking": True}},
            )
            for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                full_response += delta
                placeholder.markdown(full_response + "▌")
            placeholder.markdown(full_response)

        st.session_state.messages.append({"role": "assistant", "content": full_response})

    if st.session_state.messages:
        col_clear, col_save = st.columns([1, 3])
        with col_clear:
            if st.button("Clear Chat"):
                st.session_state.messages = []
                st.rerun()
        with col_save:
            if selected_file and selected_file != "(no files)":
                if st.button(f"Save code → {selected_file}"):
                    last = st.session_state.messages[-1]["content"]
                    suffix = Path(selected_file).suffix
                    lang = LANG_MAP.get(suffix, "")
                    code = extract_code(last, lang)
                    (WORKSPACE / selected_file).write_text(code)
                    st.success(f"Extracted code saved to workspace/{selected_file}")

# ---------------------------------------------------------------------------
# Tab 2: File Editor
# ---------------------------------------------------------------------------
with tab_editor:
    st.header("File Editor")

    if selected_file and selected_file != "(no files)":
        file_path = WORKSPACE / selected_file
        content = file_path.read_text() if file_path.exists() else ""
        suffix = file_path.suffix
        lang = LANG_MAP.get(suffix, "text")

        st.code(content, language=lang if lang != "text" else None, line_numbers=True)

        edited = st.text_area(
            "Edit below:",
            value=content,
            height=400,
            key=f"editor_{selected_file}_{hash(content)}",
        )

        col_save, col_gen = st.columns(2)

        with col_save:
            if st.button("Save File"):
                file_path.write_text(edited)
                st.success(f"Saved {selected_file}")
                st.rerun()

        with col_gen:
            gen_prompt = st.text_input(
                "Generation instruction",
                placeholder="e.g. Add error handling / Fix the LaTeX formatting",
                key="gen_prompt",
            )
            if st.button("Generate with LLM") and gen_prompt:
                with st.spinner("Generating..."):
                    response = client.chat.completions.create(
                        model=MODEL,
                        messages=[
                            {"role": "system", "content": (
                                f"You are a coding assistant. The user has a {lang} file. "
                                "Return ONLY the raw file content inside a single code block. "
                                "No explanations, no comments about changes."
                            )},
                            {"role": "user", "content": (
                                f"Here is my {lang} file:\n\n```\n{edited}\n```\n\n"
                                f"Instruction: {gen_prompt}"
                            )},
                        ],
                        max_tokens=16384,
                        temperature=0.6,
                        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
                    )
                    result = response.choices[0].message.content
                    code = extract_code(result, lang)
                    file_path.write_text(code)
                    st.success("File updated by LLM")
                    st.rerun()
    else:
        st.info("Create a file in the sidebar to start editing.")
````
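The `extract_code` helper in app.py is pure string processing, so its behavior is easy to check in isolation. The same regex, exercised against a typical model reply (the sample strings are illustrative):

````python
import re

def extract_code(text: str, lang: str = "") -> str:
    """First fenced code block, or the whole text if none (mirrors app.py)."""
    match = re.search(r"```(?:\w*)\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

# A reply with surrounding prose: only the fenced block is kept.
reply = "Here is the fix:\n```python\nprint('hi')\n```\nLet me know!"
assert extract_code(reply) == "print('hi')"

# No fences: the full text is returned unchanged (minus whitespace).
assert extract_code("no fences here") == "no fences here"
````

This fallback matters for the "Save code" button: if the model answers in plain prose, the whole answer is written to the file rather than nothing.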
**requirements.txt** (new file, +2 lines)

```
streamlit
openai
```
**test_server.py**

````diff
@@ -38,9 +38,9 @@ def main():
     response = client.chat.completions.create(
         model=model,
         messages=[
-            {"role": "user", "content": "What is 2 + 2? Answer in one sentence."}
+            {"role": "user", "content": "Create a latex document that derives and explains the principle component analysis (pca). Make a self contain document with introduction, derivation, examples of applications. This is for computer science undergraduate class."}
         ],
-        max_tokens=256,
+        max_tokens=16384,
         temperature=0.7,
     )
     print(f" Response: {response.choices[0].message.content}")
@@ -53,7 +53,7 @@ def main():
         messages=[
             {"role": "user", "content": "Count from 1 to 5."}
         ],
-        max_tokens=128,
+        max_tokens=16384,
         temperature=0.7,
         stream=True,
     )
````
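The streamed variant in test_server.py concatenates `delta.content` pieces chunk by chunk. With stand-in chunk objects (the helper below is hypothetical, not the `openai` SDK types) the accumulation logic looks like:

```python
from types import SimpleNamespace

def make_chunk(content):
    # Stand-in for one streamed chunk: choices[0].delta.content may be None
    # (e.g. the final role/stop chunk), so `or ""` guards the concatenation.
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

stream = [make_chunk("1, "), make_chunk("2, "), make_chunk("3"), make_chunk(None)]

full = ""
for chunk in stream:
    full += chunk.choices[0].delta.content or ""

print(full)  # 1, 2, 3
```

The same `or ""` guard appears in app.py's chat loop; dropping it raises a `TypeError` on the terminal chunk.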
**vllm_qwen.def**

````diff
@@ -1,10 +1,10 @@
 Bootstrap: docker
-From: vllm/vllm-openai:latest
+From: vllm/vllm-openai:nightly
 
 %labels
     Author herzogfloria
     Description vLLM nightly inference server for Qwen3.5-35B-A3B
-    Version 2.0
+    Version 3.0
 
 %environment
     export HF_HOME=/tmp/hf_cache
@@ -12,7 +12,6 @@ From: vllm/vllm-openai:latest
 
 %post
     apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*
-    pip install --no-cache-dir vllm --extra-index-url https://wheels.vllm.ai/nightly
     pip install --no-cache-dir "transformers @ git+https://github.com/huggingface/transformers.git@main"
    pip install --no-cache-dir huggingface_hub[cli]
 
````