Update README to reflect current project state

Add Streamlit app section with setup, usage, and sidebar controls.
Document nightly Docker image requirement, scp workflow for server
sync, and practical troubleshooting tips from setup experience.

Made-with: Cursor
herzogflorian 2026-03-02 16:42:33 +01:00
parent 12f9e3ac9b
commit deee5038d1


Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
(MoE, 35B total / 3B active per token), served via **vLLM** inside an
**Apptainer** container on a GPU server. Includes a **Streamlit web app** for
chat and file editing.
## Architecture
```
Students (Streamlit App / OpenAI SDK / curl)
┌──────────────────────────────┐
chmod +x *.sh
```
> **Note**: `git` is not installed on the host. Use the container:
> `apptainer exec vllm_qwen.sif git clone ...`
> Or copy files via `scp` from your local machine.
### Step 2: Check GPU and Environment
```bash
bash 01_build_container.sh
```
Pulls the `vllm/vllm-openai:nightly` Docker image (required for Qwen3.5
support), installs latest `transformers` from source, and packages everything
into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
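The build script itself is not reproduced here, but as a rough sketch, an Apptainer definition file along these lines would produce an equivalent image (the file name and the exact `pip` invocation are assumptions, not the script's actual contents):

```
Bootstrap: docker
From: vllm/vllm-openai:nightly

%post
    # Install transformers from source for Qwen3.5 support (assumed step)
    pip install --no-cache-dir git+https://github.com/huggingface/transformers.git
```

Built with `apptainer build vllm_qwen.sif vllm_qwen.def`.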
### Step 4: Download the Model (~67 GB)
curl http://localhost:7080/v1/models
```
Quick chat test:
```bash
curl http://localhost:7080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
```
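For students working from Python without the `openai` SDK, the same request can be assembled with only the standard library (model name and port are taken from the curl example above; the actual send is commented out because it needs the running server):

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt, max_tokens=128):
    """Build the same POST request the curl example sends."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:7080", "qwen3.5-35b-a3b", "Hello!")
# answer = urllib.request.urlopen(req)  # requires the server to be running
```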
### Step 7: Share with Students
---
## Streamlit App
A web-based chat and file editor that connects to the inference server.
Students run it on their own machines.
### Setup
```bash
pip install -r requirements.txt
```
Or with a virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
### Run
```bash
streamlit run app.py
```
Opens at `http://localhost:8501` with two tabs:
- **Chat** — Conversational interface with streaming responses. Save the
model's last response directly into a workspace file (code auto-extracted).
- **File Editor** — Create/edit `.py`, `.tex`, `.html`, or any text file.
Use "Generate with LLM" to modify files via natural language instructions.
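The "code auto-extracted" behaviour mentioned above can be approximated in a few lines of Python; this is an illustrative sketch, not the app's actual implementation:

```python
import re

def extract_code(response_text):
    """Return the body of the first fenced code block in a model response,
    or the whole text when no fence is found (hypothetical helper)."""
    match = re.search(r"`{3}[\w+-]*\n(.*?)`{3}", response_text, re.DOTALL)
    return match.group(1) if match else response_text
```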
### Sidebar Controls
| Parameter | Default | Range | Purpose |
|-----------|---------|-------|---------|
| Thinking Mode | Off | Toggle | Chain-of-thought reasoning (slower, better for complex tasks) |
| Temperature | 0.7 | 0.0–2.0 | Creativity vs. determinism |
| Max Tokens | 4096 | 256–16384 | Maximum response length |
| Top P | 0.95 | 0.0–1.0 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | 0.0–2.0 | Penalize repeated topics |
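The numeric sidebar values map directly onto OpenAI-style sampling parameters (Thinking Mode is a toggle and is presumably handled separately, e.g. via the chat template). A hedged sketch of how the app might merge slider values into a request payload — defaults and ranges are taken from the table; the function name is hypothetical:

```python
# Defaults and allowed ranges from the sidebar table: (default, low, high).
SIDEBAR_RANGES = {
    "temperature": (0.7, 0.0, 2.0),
    "max_tokens": (4096, 256, 16384),
    "top_p": (0.95, 0.0, 1.0),
    "presence_penalty": (0.0, 0.0, 2.0),
}

def sampling_params(**overrides):
    """Merge user overrides with defaults, rejecting out-of-range values."""
    params = {}
    for name, (default, low, high) in SIDEBAR_RANGES.items():
        value = overrides.get(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} outside [{low}, {high}]")
        params[name] = value
    return params
```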
---
## Server Configuration
All configuration is via environment variables passed to `03_start_server.sh`:
| `03_start_server.sh` | Starts vLLM server (foreground) |
| `04_start_server_background.sh` | Starts server in background with logging |
| `05_stop_server.sh` | Stops the background server |
| `app.py` | Streamlit chat & file editor web app |
| `requirements.txt` | Python dependencies for the Streamlit app |
| `test_server.py` | Tests the running server via CLI |
| `STUDENT_GUIDE.md` | Instructions for students |
---
### Container build fails
- Ensure internet access and sufficient disk space (~20 GB for build cache)
- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:nightly`
### "No NVIDIA GPU detected"
- Verify `nvidia-smi` works on the host
- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
### "Model type qwen3_5_moe not recognized"
- The container needs `vllm/vllm-openai:nightly` (not `:latest`)
- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
### Students can't connect
### Slow generation with many users
- Expected — vLLM batches requests but throughput is finite
- The MoE architecture (3B active) helps with per-token speed
- Disable thinking mode for faster simple responses
- Monitor: `curl http://localhost:7080/metrics`
### Syncing files to the server
- No `git` or `pip` on the host — use `scp` from your local machine:
```bash
scp app.py 03_start_server.sh herzogfloria@silicon.fhgr.ch:~/LLM_local/
```