Update README to reflect current project state

Add Streamlit app section with setup, usage, and sidebar controls.
Document nightly Docker image requirement, scp workflow for server
sync, and practical troubleshooting tips from setup experience.

Made-with: Cursor
herzogflorian 2026-03-02 16:42:33 +01:00
parent 12f9e3ac9b
commit deee5038d1


Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
(MoE, 35B total / 3B active per token), served via **vLLM** inside an
**Apptainer** container on a GPU server. Includes a **Streamlit web app** for
chat and file editing.
## Architecture
```
Students (Streamlit App / OpenAI SDK / curl)
┌──────────────────────────────┐
chmod +x *.sh
```
> **Note**: `git` is not installed on the host. Use the container:
> `apptainer exec vllm_qwen.sif git clone ...`
> Or copy files via `scp` from your local machine.
### Step 2: Check GPU and Environment
```bash
bash 01_build_container.sh
```
Pulls the `vllm/vllm-openai:nightly` Docker image (required for Qwen3.5
support), installs latest `transformers` from source, and packages everything
into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
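The build script itself is not reproduced here, but as a rough sketch, an Apptainer definition file along these lines would produce an equivalent image (the file name and the exact `pip` invocation are assumptions, not the script's actual contents):

```
Bootstrap: docker
From: vllm/vllm-openai:nightly

%post
    # Install transformers from source for Qwen3.5 support (assumed step)
    pip install --no-cache-dir git+https://github.com/huggingface/transformers.git
```

Built with `apptainer build vllm_qwen.sif vllm_qwen.def`.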
### Step 4: Download the Model (~67 GB)
curl http://localhost:7080/v1/models
```
Quick chat test:
```bash
curl http://localhost:7080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
```
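For students working from Python without the `openai` SDK, the same request can be assembled with only the standard library (model name and port are taken from the curl example above; the actual send is commented out because it needs the running server):

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt, max_tokens=128):
    """Build the same POST request the curl example sends."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:7080", "qwen3.5-35b-a3b", "Hello!")
# answer = urllib.request.urlopen(req)  # requires the server to be running
```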
### Step 7: Share with Students
---
## Streamlit App
A web-based chat and file editor that connects to the inference server.
Students run it on their own machines.
### Setup
```bash
pip install -r requirements.txt
```
Or with a virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
### Run
```bash
streamlit run app.py
```
Opens at `http://localhost:8501` with two tabs:
- **Chat** — Conversational interface with streaming responses. Save the
model's last response directly into a workspace file (code auto-extracted).
- **File Editor** — Create/edit `.py`, `.tex`, `.html`, or any text file.
Use "Generate with LLM" to modify files via natural language instructions.
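The "code auto-extracted" behaviour mentioned above can be approximated in a few lines of Python; this is an illustrative sketch, not the app's actual implementation:

```python
import re

def extract_code(response_text):
    """Return the body of the first fenced code block in a model response,
    or the whole text when no fence is found (hypothetical helper)."""
    match = re.search(r"`{3}[\w+-]*\n(.*?)`{3}", response_text, re.DOTALL)
    return match.group(1) if match else response_text
```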
### Sidebar Controls
| Parameter | Default | Range | Purpose |
|-----------|---------|-------|---------|
| Thinking Mode | Off | Toggle | Chain-of-thought reasoning (slower, better for complex tasks) |
| Temperature | 0.7 | 0.0–2.0 | Creativity vs. determinism |
| Max Tokens | 4096 | 256–16384 | Maximum response length |
| Top P | 0.95 | 0.0–1.0 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | 0.0–2.0 | Penalize repeated topics |
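The numeric sidebar values map directly onto OpenAI-style sampling parameters (Thinking Mode is a toggle and is presumably handled separately, e.g. via the chat template). A hedged sketch of how the app might merge slider values into a request payload — defaults and ranges are taken from the table; the function name is hypothetical:

```python
# Defaults and allowed ranges from the sidebar table: (default, low, high).
SIDEBAR_RANGES = {
    "temperature": (0.7, 0.0, 2.0),
    "max_tokens": (4096, 256, 16384),
    "top_p": (0.95, 0.0, 1.0),
    "presence_penalty": (0.0, 0.0, 2.0),
}

def sampling_params(**overrides):
    """Merge user overrides with defaults, rejecting out-of-range values."""
    params = {}
    for name, (default, low, high) in SIDEBAR_RANGES.items():
        value = overrides.get(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} outside [{low}, {high}]")
        params[name] = value
    return params
```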
---
## Server Configuration
All configuration is via environment variables passed to `03_start_server.sh`:
| `03_start_server.sh` | Starts vLLM server (foreground) |
| `04_start_server_background.sh` | Starts server in background with logging |
| `05_stop_server.sh` | Stops the background server |
| `app.py` | Streamlit chat & file editor web app |
| `requirements.txt` | Python dependencies for the Streamlit app |
| `test_server.py` | Tests the running server via CLI |
| `STUDENT_GUIDE.md` | Instructions for students |
---
### Container build fails
- Ensure internet access and sufficient disk space (~20 GB for build cache)
- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:nightly`
### "No NVIDIA GPU detected"
- Verify `nvidia-smi` works on the host
- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
### "Model type qwen3_5_moe not recognized"
- The container needs `vllm/vllm-openai:nightly` (not `:latest`)
- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
### Students can't connect
### Slow generation with many users
- Expected — vLLM batches requests but throughput is finite
- The MoE architecture (3B active) helps with per-token speed
- Disable thinking mode for faster simple responses
- Monitor: `curl http://localhost:7080/metrics`
### Syncing files to the server
- No `git` or `pip` on the host — use `scp` from your local machine:
```bash
scp app.py 03_start_server.sh herzogfloria@silicon.fhgr.ch:~/LLM_local/
```