diff --git a/README.md b/README.md
index 75d452d..7c6dd8d 100644
--- a/README.md
+++ b/README.md
@@ -2,12 +2,13 @@
 Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
 (MoE, 35B total / 3B active per token), served via **vLLM** inside an
-**Apptainer** container on a GPU server.
+**Apptainer** container on a GPU server. Includes a **Streamlit web app** for
+chat and file editing.
 
 ## Architecture
 
 ```
-Students (OpenAI SDK / curl)
+Students (Streamlit App / OpenAI SDK / curl)
        │
        ▼
 ┌──────────────────────────────┐
@@ -67,6 +68,10 @@ cd ~/LLM_local
 chmod +x *.sh
 ```
 
+> **Note**: `git` is not installed on the host. Use the container:
+> `apptainer exec vllm_qwen.sif git clone ...`
+> Or copy files via `scp` from your local machine.
+
 ### Step 2: Check GPU and Environment
 
 ```bash
@@ -81,9 +86,9 @@ df -h ~
 bash 01_build_container.sh
 ```
 
-Pulls the `vllm/vllm-openai:latest` Docker image, upgrades vLLM to nightly
-(required for Qwen3.5 support), installs latest `transformers` from source,
-and packages everything into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
+Pulls the `vllm/vllm-openai:nightly` Docker image (required for Qwen3.5
+support), installs latest `transformers` from source, and packages everything
+into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
 ### Step 4: Download the Model (~67 GB)
 
@@ -122,9 +127,11 @@ From another terminal on the server:
 curl http://localhost:7080/v1/models
 ```
 
-Or run the full test (uses `openai` SDK inside the container):
+Quick chat test:
 
 ```bash
-apptainer exec --writable-tmpfs vllm_qwen.sif python3 test_server.py
+curl http://localhost:7080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"qwen3.5-35b-a3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'
 ```
 
 ### Step 7: Share with Students
 
@@ -135,7 +142,51 @@ Distribute `STUDENT_GUIDE.md` with connection details:
 
 ---
 
-## Configuration
+## Streamlit App
+
+A web-based chat and file editor that connects to the inference server.
+Students run it on their own machines.
+
+### Setup
+
+```bash
+pip install -r requirements.txt
+```
+
+Or with a virtual environment:
+
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+```
+
+### Run
+
+```bash
+streamlit run app.py
+```
+
+Opens at `http://localhost:8501` with two tabs:
+
+- **Chat** — Conversational interface with streaming responses. Save the
+  model's last response directly into a workspace file (code auto-extracted).
+- **File Editor** — Create/edit `.py`, `.tex`, `.html`, or any text file.
+  Use "Generate with LLM" to modify files via natural language instructions.
+
+### Sidebar Controls
+
+| Parameter | Default | Range | Purpose |
+|-----------|---------|-------|---------|
+| Thinking Mode | Off | Toggle | Chain-of-thought reasoning (slower, better for complex tasks) |
+| Temperature | 0.7 | 0.0 – 2.0 | Creativity vs determinism |
+| Max Tokens | 4096 | 256 – 16384 | Maximum response length |
+| Top P | 0.95 | 0.0 – 1.0 | Nucleus sampling threshold |
+| Presence Penalty | 0.0 | 0.0 – 2.0 | Penalize repeated topics |
+
+---
+
+## Server Configuration
 
 All configuration is via environment variables passed to `03_start_server.sh`:
 
@@ -197,7 +248,9 @@ tmux attach -t llm
 | `03_start_server.sh` | Starts vLLM server (foreground) |
 | `04_start_server_background.sh` | Starts server in background with logging |
 | `05_stop_server.sh` | Stops the background server |
-| `test_server.py` | Tests the running server |
+| `app.py` | Streamlit chat & file editor web app |
+| `requirements.txt` | Python dependencies for the Streamlit app |
+| `test_server.py` | Tests the running server via CLI |
 | `STUDENT_GUIDE.md` | Instructions for students |
 
 ---
 
@@ -210,7 +263,7 @@ tmux attach -t llm
 ### Container build fails
 - Ensure internet access and sufficient disk space (~20 GB for build cache)
-- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:latest`
+- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:nightly`
 
 ### "No NVIDIA GPU detected"
 - Verify `nvidia-smi` works on the host
@@ -218,7 +271,7 @@ tmux attach -t llm
 - Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
 
 ### "Model type qwen3_5_moe not recognized"
-- The container needs vLLM nightly and latest transformers
+- The container needs `vllm/vllm-openai:nightly` (not `:latest`)
 - Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
 
 ### Students can't connect
@@ -229,4 +282,11 @@ tmux attach -t llm
 ### Slow generation with many users
 - Expected — vLLM batches requests but throughput is finite
 - The MoE architecture (3B active) helps with per-token speed
+- Disable thinking mode for faster simple responses
 - Monitor: `curl http://localhost:7080/metrics`
+
+### Syncing files to the server
+- No `git` or `pip` on the host — use `scp` from your local machine:
+```bash
+scp app.py 03_start_server.sh herzogfloria@silicon.fhgr.ch:~/LLM_local/
+```
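
The "Quick chat test" `curl` call in this diff can also be scripted. Below is a minimal stdlib-only Python sketch that sends the same request; the port `7080`, model name `qwen3.5-35b-a3b`, and payload fields come from the diff above, while the function names and the host default are illustrative assumptions (error handling omitted):

```python
import json
import urllib.request

BASE_URL = "http://localhost:7080/v1"  # assumption: adjust host/port to your server


def build_chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build the same payload as the 'Quick chat test' curl example."""
    return {
        "model": "qwen3.5-35b-a3b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str) -> str:
    """POST to /v1/chat/completions and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running server (Step 5); prints the model's reply.
    print(chat("Hello!"))
```

Because the payload construction is separated from the transport, the same request dict can be reused with the `openai` SDK by pointing its `base_url` at the server, as `STUDENT_GUIDE.md` connections do.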