Add Streamlit chat app, update container to vLLM nightly

- Add app.py: Streamlit UI with chat and file editor tabs
- Add requirements.txt: streamlit + openai dependencies
- Update vllm_qwen.def: use nightly image for Qwen3.5 support
- Update README.md: reflect 35B-A3B model, correct script names
- Update STUDENT_GUIDE.md: add app usage and thinking mode docs
- Update .gitignore: exclude .venv/ and workspace/

Made-with: Cursor
This commit is contained in:
parent 076001b07f · commit 9e1e0c0751
**.gitignore** (+6 lines)
````diff
@@ -10,5 +10,11 @@ models/
 # HuggingFace cache
 .cache/
 
+# Python venv
+.venv/
+
+# Streamlit workspace files
+workspace/
+
 # macOS
 .DS_Store
````
|||||||
249
README.md
249
README.md
````diff
@@ -1,7 +1,8 @@
-# LLM Local — Qwen3.5-27B Inference Server
+# LLM Inferenz Server — Qwen3.5-35B-A3B
 
-Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-27B**,
-served via **vLLM** inside an **Apptainer** container on a GPU server.
+Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
+(MoE, 35B total / 3B active per token), served via **vLLM** inside an
+**Apptainer** container on a GPU server.
 
 ## Architecture
 
````
````diff
@@ -9,40 +10,44 @@ served via **vLLM** inside an **Apptainer** container on a GPU server.
 Students (OpenAI SDK / curl)
         │
         ▼
-┌─────────────────────────┐
-│  silicon.fhgr.ch:7080   │
-│  OpenAI-compatible API  │
-├─────────────────────────┤
-│       vLLM Server       │
-│  (Apptainer container)  │
-├─────────────────────────┤
-│   Qwen3.5-27B weights   │
-│     (bind-mounted)      │
-├─────────────────────────┤
-│       NVIDIA GPU        │
-└─────────────────────────┘
+┌──────────────────────────────┐
+│     silicon.fhgr.ch:7080     │
+│    OpenAI-compatible API     │
+├──────────────────────────────┤
+│    vLLM Server (nightly)     │
+│  Apptainer container (.sif)  │
+├──────────────────────────────┤
+│   Qwen3.5-35B-A3B weights    │
+│   (bind-mounted from host)   │
+├──────────────────────────────┤
+│  2× NVIDIA L40S (46 GB ea.)  │
+│     Tensor Parallel = 2      │
+└──────────────────────────────┘
 ```
 
+## Hardware
+
+The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each).
+The inference server uses **2 GPUs** with tensor parallelism, leaving 2 GPUs free.
+
+| Component | Value |
+|-----------|-------|
+| GPUs used | 2× NVIDIA L40S |
+| VRAM used | ~92 GB total |
+| Model size (BF16) | ~67 GB |
+| Active params/token | 3B (MoE) |
+| Context length | 32,768 tokens |
+| Port | 7080 |
+
 ## Prerequisites
 
-- **GPU**: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended).
-  Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead.
-- **Apptainer** (formerly Singularity) installed on the server.
-- **NVIDIA drivers** + **nvidia-container-cli** for GPU passthrough.
-- **~60 GB disk space** for model weights + ~15 GB for the container image.
-- **Network**: Students must be on the university network or VPN.
+- **Apptainer** (formerly Singularity) installed on the server
+- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
+- **~80 GB disk** for model weights + ~8 GB for the container image
+- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)
 
-## Hardware Sizing
-
-| Component | Minimum | Recommended |
-|-----------|----------------|-----------------|
-| GPU VRAM | 80 GB (1× A100)| 80 GB (1× H100) |
-| RAM | 64 GB | 128 GB |
-| Disk | 100 GB free | 200 GB free |
-
-> **If your GPU has less than 80 GB VRAM**, you have two options:
-> 1. Use a **quantized** version (e.g., AWQ/GPTQ 4-bit — ~16 GB VRAM)
-> 2. Use **tensor parallelism** across multiple GPUs (set `TENSOR_PARALLEL=2`)
+> **Note**: No `pip` or `python` is needed on the host — everything runs inside
+> the Apptainer container.
 
 ---
 
````
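The ~67 GB model-size row in the README's hardware table can be sanity-checked with quick arithmetic; the sketch below assumes 2 bytes per BF16 parameter and ignores KV cache and activation overhead:

```python
# Back-of-envelope check of the BF16 weight footprint
# (assumption: 2 bytes per parameter, all other overhead ignored).
total_params = 35e9          # Qwen3.5-35B-A3B total parameters
bytes_per_param = 2          # BF16
weight_bytes = total_params * bytes_per_param

weight_gib = weight_bytes / 2**30
print(f"~{weight_gib:.0f} GiB of weights")   # ~65 GiB, in line with the ~67 GB figure

# Two L40S GPUs at GPU_MEM_UTIL=0.92 give the usable memory budget:
budget_gib = 2 * 46 * 0.92
print(f"~{budget_gib:.0f} GiB usable across 2 GPUs")
```

The gap between the ~65 GiB of weights and the ~85 GiB usable budget is what vLLM can assign to KV cache for concurrent requests.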
````diff
@@ -54,11 +59,10 @@ Students (OpenAI SDK / curl)
 ssh herzogfloria@silicon.fhgr.ch
 ```
 
-### Step 1: Clone This Repository
+### Step 1: Clone the Repository
 
 ```bash
-# Or copy the files to the server
-git clone <your-repo-url> ~/LLM_local
+git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
 cd ~/LLM_local
 chmod +x *.sh
 ```
````
````diff
@@ -66,124 +70,95 @@ chmod +x *.sh
 ### Step 2: Check GPU and Environment
 
 ```bash
-# Verify GPU is visible
 nvidia-smi
 
-# Verify Apptainer is installed
 apptainer --version
 
-# Check available disk space
 df -h ~
 ```
 
-### Step 3: Download the Model (~60 GB)
+### Step 3: Build the Apptainer Container
 
 ```bash
-# Install huggingface-cli if not available
-pip install --user huggingface_hub[cli]
-
-# Download Qwen3.5-27B
-bash 01_download_model.sh
-# Default target: ~/models/Qwen3.5-27B
+bash 01_build_container.sh
 ```
 
-This downloads the full BF16 weights. Takes 20-60 minutes depending on bandwidth.
+Pulls the `vllm/vllm-openai:latest` Docker image, upgrades vLLM to nightly
+(required for Qwen3.5 support), installs latest `transformers` from source,
+and packages everything into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
 
-### Step 4: Build the Apptainer Container
+### Step 4: Download the Model (~67 GB)
 
 ```bash
-bash 02_build_container.sh
+bash 02_download_model.sh
 ```
 
-This pulls the `vllm/vllm-openai:latest` Docker image and converts it to a `.sif` file.
-Takes 10-20 minutes. The resulting `vllm_qwen.sif` is ~12-15 GB.
-
-> **Tip**: If building fails due to network/proxy issues, you can pull the Docker image
-> first and convert manually:
-> ```bash
-> apptainer pull docker://vllm/vllm-openai:latest
-> ```
+Downloads Qwen3.5-35B-A3B weights using `huggingface-cli` **inside the
+container**. Stored at `~/models/Qwen3.5-35B-A3B`. Takes 5-30 minutes
+depending on bandwidth.
 
 ### Step 5: Start the Server
 
-**Interactive (foreground):**
+**Interactive (foreground) — recommended with tmux:**
 ```bash
+tmux new -s llm
 bash 03_start_server.sh
+# Ctrl+B, then D to detach
 ```
 
-**Background (recommended for production):**
+**Background with logging:**
 ```bash
 bash 04_start_server_background.sh
-```
-
-The server takes 2-5 minutes to load the model into GPU memory. Monitor with:
-```bash
 tail -f logs/vllm_server_*.log
 ```
 
-Look for the line:
+The model takes 2-5 minutes to load into GPU memory. It's ready when you see:
 ```
-INFO: Uvicorn running on http://0.0.0.0:8000
+INFO: Uvicorn running on http://0.0.0.0:7080
 ```
 
 ### Step 6: Test the Server
 
+From another terminal on the server:
 ```bash
-# Quick health check
 curl http://localhost:7080/v1/models
+```
 
-# Full test
-pip install openai
-python test_server.py
+Or run the full test (uses `openai` SDK inside the container):
+```bash
+apptainer exec --writable-tmpfs vllm_qwen.sif python3 test_server.py
 ```
 
 ### Step 7: Share with Students
 
-Distribute the `STUDENT_GUIDE.md` file or share the connection details:
-- **27B Base URL**: `http://silicon.fhgr.ch:7080/v1` — model name: `qwen3.5-27b`
-- **35B Base URL**: `http://silicon.fhgr.ch:7081/v1` — model name: `qwen3.5-35b-a3b`
+Distribute `STUDENT_GUIDE.md` with connection details:
+- **Base URL**: `http://silicon.fhgr.ch:7080/v1`
+- **Model name**: `qwen3.5-35b-a3b`
 
 ---
 
 ## Configuration
 
-All configuration is via environment variables in `03_start_server.sh`:
+All configuration is via environment variables passed to `03_start_server.sh`:
 
 | Variable | Default | Description |
-|-------------------|------------------------------|-------------------------------------|
-| `MODEL_DIR` | `~/models/Qwen3.5-27B` | Path to model weights |
+|-------------------|----------------------------------|--------------------------------|
+| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | Path to model weights |
 | `PORT` | `7080` | HTTP port |
 | `MAX_MODEL_LEN` | `32768` | Max context length (tokens) |
 | `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use |
 | `API_KEY` | *(empty = no auth)* | API key for authentication |
-| `TENSOR_PARALLEL` | `1` | Number of GPUs |
+| `TENSOR_PARALLEL` | `2` | Number of GPUs |
 
-### Context Length Tuning
+### Examples
 
-The default `MAX_MODEL_LEN=32768` is conservative and ensures stable operation for 15
-concurrent users. If you have plenty of VRAM headroom:
-
 ```bash
+# Increase context length
 MAX_MODEL_LEN=65536 bash 03_start_server.sh
-```
-
-Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require
-significantly more GPU memory for KV cache.
 
-### Adding Authentication
+# Add API key authentication
+API_KEY="your-secret-key" bash 03_start_server.sh
 
-```bash
-API_KEY="your-secret-key-here" bash 03_start_server.sh
-```
-
-Students then use this key in their `api_key` parameter.
-
-### Multi-GPU Setup
-
-If you have multiple GPUs:
-
-```bash
-TENSOR_PARALLEL=2 bash 03_start_server.sh
+# Use all 4 GPUs (more KV cache headroom)
+TENSOR_PARALLEL=4 bash 03_start_server.sh
 ```
 
 ---
 
````
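The `VAR=value bash script.sh` overrides in the README's configuration examples rely on the script falling back to documented defaults. A minimal sketch of the parameter-expansion idiom presumably used inside `03_start_server.sh` (the script internals are not shown in this diff):

```shell
# Fall back to the documented defaults unless the caller supplied a value,
# e.g. via `MAX_MODEL_LEN=65536 bash 03_start_server.sh`.
MAX_MODEL_LEN="${MAX_MODEL_LEN:-32768}"
TENSOR_PARALLEL="${TENSOR_PARALLEL:-2}"
PORT="${PORT:-7080}"

echo "len=${MAX_MODEL_LEN} tp=${TENSOR_PARALLEL} port=${PORT}"
```

Run without overrides, this prints the defaults from the table; prefixing any variable on the command line replaces just that one value.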
````diff
@@ -195,7 +170,7 @@ TENSOR_PARALLEL=2 bash 03_start_server.sh
 bash 04_start_server_background.sh
 
 # Check if running
-curl -s http://localhost:7080/v1/models | python -m json.tool
+curl -s http://localhost:7080/v1/models | python3 -m json.tool
 
 # View logs
 tail -f logs/vllm_server_*.log
````
````diff
@@ -205,61 +180,53 @@ bash 05_stop_server.sh
 
 # Monitor GPU usage
 watch -n 2 nvidia-smi
-```
 
-### Running Persistently with tmux
-
-For a robust setup that survives SSH disconnects:
-
-```bash
-ssh herzogfloria@silicon.fhgr.ch
-tmux new -s llm_server
-bash 03_start_server.sh
-# Press Ctrl+B, then D to detach
-
-# Reconnect later:
-tmux attach -t llm_server
+# Reconnect to tmux session
+tmux attach -t llm
 ```
 
 ---
 
 ## Files Overview
 
 | File | Purpose |
-|------------------------------|------------------------------------------- |
-| `vllm_qwen.def` | Apptainer container definition |
-| `01_download_model.sh` | Downloads model weights from Hugging Face |
-| `02_build_container.sh` | Builds the Apptainer .sif image |
-| `03_start_server.sh` | Starts vLLM server (foreground) |
-| `04_start_server_background.sh` | Starts server in background with logging|
+|----------------------------------|------------------------------------------------------|
+| `vllm_qwen.def` | Apptainer container definition (vLLM nightly + deps) |
+| `01_build_container.sh` | Builds the Apptainer `.sif` image |
+| `02_download_model.sh` | Downloads model weights (runs inside container) |
+| `03_start_server.sh` | Starts vLLM server (foreground) |
+| `04_start_server_background.sh` | Starts server in background with logging |
 | `05_stop_server.sh` | Stops the background server |
 | `test_server.py` | Tests the running server |
 | `STUDENT_GUIDE.md` | Instructions for students |
 
 ---
 
 ## Troubleshooting
 
 ### "CUDA out of memory"
-- Reduce `MAX_MODEL_LEN` (e.g., 16384)
-- Reduce `GPU_MEM_UTIL` (e.g., 0.85)
-- Use a quantized model variant
+- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
+- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)
 
 ### Container build fails
-- Ensure you have internet access and sufficient disk space (~20 GB for build cache)
-- Try: `apptainer pull docker://vllm/vllm-openai:latest` first
+- Ensure internet access and sufficient disk space (~20 GB for build cache)
+- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:latest`
 
 ### "No NVIDIA GPU detected"
-- Check that `nvidia-smi` works outside the container
-- Ensure `--nv` flag is passed (already in scripts)
-- Verify nvidia-container-cli: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
+- Verify `nvidia-smi` works on the host
+- Ensure `--nv` flag is present (already in scripts)
+- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
 
-### Server starts but students can't connect
-- Check firewall: `sudo ufw allow 7080:7090/tcp` or equivalent
+### "Model type qwen3_5_moe not recognized"
+- The container needs vLLM nightly and latest transformers
+- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
+
+### Students can't connect
+- Check firewall: ports 7080-7090 must be open
 - Verify the server binds to `0.0.0.0` (not just localhost)
-- Students must use the server's hostname/IP, not `localhost`
+- Students must be on the university network or VPN
 
 ### Slow generation with many users
-- This is expected — vLLM batches requests but throughput is finite
-- Consider reducing `max_tokens` in student requests
-- Monitor with: `curl http://localhost:7080/metrics`
+- Expected — vLLM batches requests but throughput is finite
+- The MoE architecture (3B active) helps with per-token speed
+- Monitor: `curl http://localhost:7080/metrics`
````
**STUDENT_GUIDE.md**

````diff
@@ -107,6 +107,55 @@ curl http://silicon.fhgr.ch:7080/v1/chat/completions \
 
 ---
 
+## Streamlit Chat & File Editor App
+
+A simple web UI is included for chatting with the model and editing files.
+
+### Setup
+
+```bash
+pip install streamlit openai
+```
+
+### Run
+
+```bash
+streamlit run app.py
+```
+
+This opens a browser with two tabs:
+
+- **Chat** — Conversational interface with streaming responses. You can save
+  the model's last response directly to a file.
+- **File Editor** — Create and edit `.py`, `.tex`, `.html`, or any text file.
+  Use the "Generate with LLM" button to have the model modify your file based
+  on an instruction (e.g. "add error handling" or "fix the LaTeX formatting").
+
+Files are stored in a `workspace/` folder next to `app.py`.
+
+> **Tip**: The app runs on your local machine and connects to the server — you
+> don't need to install anything on the GPU server.
+
+---
+
+## Thinking Mode
+
+By default, the model "thinks" before answering (internal chain-of-thought).
+This is great for complex reasoning but adds latency for simple questions.
+
+To disable thinking and get faster direct responses, add this to your API call:
+
+```python
+response = client.chat.completions.create(
+    model="qwen3.5-35b-a3b",
+    messages=[...],
+    max_tokens=1024,
+    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
+)
+```
+
+---
+
 ## Troubleshooting
 
 | Issue | Solution |
````
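The `extra_body` argument in the thinking-mode snippet is merged by the `openai` Python SDK into the top-level JSON request body, which is how a non-standard field like `chat_template_kwargs` reaches vLLM. A sketch of the resulting wire payload (field order and the example message are illustrative):

```python
import json

# Standard kwargs as in the snippet above, plus the extra_body fields,
# which the openai SDK merges into the same top-level JSON object.
kwargs = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Hi"}],
    "max_tokens": 1024,
}
extra_body = {"chat_template_kwargs": {"enable_thinking": False}}

payload = {**kwargs, **extra_body}
print(json.dumps(payload, indent=2))
```

vLLM reads `chat_template_kwargs` and passes it to the model's chat template, so the toggle never appears in the message text itself.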
**app.py** (new file, +181 lines)
````python
"""
Streamlit Chat & File Editor for Qwen3.5-35B-A3B

A minimal interface to:
1. Chat with the local LLM (OpenAI-compatible API)
2. Edit, save, and generate code / LaTeX files

Usage:
    pip install streamlit openai
    streamlit run app.py
"""

import re
import streamlit as st
from openai import OpenAI
from pathlib import Path

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
API_BASE = st.sidebar.text_input("API Base URL", "http://silicon.fhgr.ch:7080/v1")
API_KEY = st.sidebar.text_input("API Key", "EMPTY", type="password")
MODEL = "qwen3.5-35b-a3b"
WORKSPACE = Path("workspace")
WORKSPACE.mkdir(exist_ok=True)

client = OpenAI(base_url=API_BASE, api_key=API_KEY)

LANG_MAP = {
    ".py": "python", ".tex": "latex", ".js": "javascript",
    ".html": "html", ".css": "css", ".sh": "bash",
    ".json": "json", ".yaml": "yaml", ".yml": "yaml",
}


def extract_code(text: str, lang: str = "") -> str:
    """Extract the first fenced code block from markdown text.

    Falls back to the full text if no code block is found."""
    pattern = r"```(?:\w*)\n(.*?)```"
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return text.strip()


# ---------------------------------------------------------------------------
# Sidebar — File Manager
# ---------------------------------------------------------------------------
st.sidebar.markdown("---")
st.sidebar.header("File Manager")

new_filename = st.sidebar.text_input("New file name", placeholder="main.tex")
if st.sidebar.button("Create File") and new_filename:
    (WORKSPACE / new_filename).touch()
    st.sidebar.success(f"Created {new_filename}")
    st.rerun()

files = sorted(WORKSPACE.iterdir()) if WORKSPACE.exists() else []
file_names = [f.name for f in files if f.is_file()]
selected_file = st.sidebar.selectbox("Open file", file_names if file_names else ["(no files)"])

# ---------------------------------------------------------------------------
# Main Layout — Two Tabs
# ---------------------------------------------------------------------------
tab_chat, tab_editor = st.tabs(["Chat", "File Editor"])

# ---------------------------------------------------------------------------
# Tab 1: Chat
# ---------------------------------------------------------------------------
with tab_chat:
    st.header("Chat with Qwen3.5")

    if "messages" not in st.session_state:
        st.session_state.messages = []

    for msg in st.session_state.messages:
        with st.chat_message(msg["role"]):
            st.markdown(msg["content"])

    if prompt := st.chat_input("Ask anything..."):
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)

        with st.chat_message("assistant"):
            placeholder = st.empty()
            full_response = ""

            stream = client.chat.completions.create(
                model=MODEL,
                messages=st.session_state.messages,
                max_tokens=8092,
                temperature=0.2,
                stream=True,
                extra_body={"chat_template_kwargs": {"enable_thinking": True}},
            )
            for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                full_response += delta
                placeholder.markdown(full_response + "▌")
            placeholder.markdown(full_response)

        st.session_state.messages.append({"role": "assistant", "content": full_response})

    if st.session_state.messages:
        col_clear, col_save = st.columns([1, 3])
        with col_clear:
            if st.button("Clear Chat"):
                st.session_state.messages = []
                st.rerun()
        with col_save:
            if selected_file and selected_file != "(no files)":
                if st.button(f"Save code → {selected_file}"):
                    last = st.session_state.messages[-1]["content"]
                    suffix = Path(selected_file).suffix
                    lang = LANG_MAP.get(suffix, "")
                    code = extract_code(last, lang)
                    (WORKSPACE / selected_file).write_text(code)
                    st.success(f"Extracted code saved to workspace/{selected_file}")

# ---------------------------------------------------------------------------
# Tab 2: File Editor
# ---------------------------------------------------------------------------
with tab_editor:
    st.header("File Editor")

    if selected_file and selected_file != "(no files)":
        file_path = WORKSPACE / selected_file
        content = file_path.read_text() if file_path.exists() else ""
        suffix = file_path.suffix
        lang = LANG_MAP.get(suffix, "text")

        st.code(content, language=lang if lang != "text" else None, line_numbers=True)

        edited = st.text_area(
            "Edit below:",
            value=content,
            height=400,
            key=f"editor_{selected_file}_{hash(content)}",
        )

        col_save, col_gen = st.columns(2)

        with col_save:
            if st.button("Save File"):
                file_path.write_text(edited)
                st.success(f"Saved {selected_file}")
                st.rerun()

        with col_gen:
            gen_prompt = st.text_input(
                "Generation instruction",
                placeholder="e.g. Add error handling / Fix the LaTeX formatting",
                key="gen_prompt",
            )
            if st.button("Generate with LLM") and gen_prompt:
                with st.spinner("Generating..."):
                    response = client.chat.completions.create(
                        model=MODEL,
                        messages=[
                            {"role": "system", "content": (
                                f"You are a coding assistant. The user has a {lang} file. "
                                "Return ONLY the raw file content inside a single code block. "
                                "No explanations, no comments about changes."
                            )},
                            {"role": "user", "content": (
                                f"Here is my {lang} file:\n\n```\n{edited}\n```\n\n"
                                f"Instruction: {gen_prompt}"
                            )},
                        ],
                        max_tokens=16384,
                        temperature=0.6,
                        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
                    )
                    result = response.choices[0].message.content
                    code = extract_code(result, lang)
                    file_path.write_text(code)
                    st.success("File updated by LLM")
                    st.rerun()
    else:
        st.info("Create a file in the sidebar to start editing.")
````
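The `extract_code` helper in app.py is pure string processing, so its behavior is easy to check in isolation. The same regex, exercised against a typical model reply (the sample strings are illustrative):

````python
import re

def extract_code(text: str, lang: str = "") -> str:
    """First fenced code block, or the whole text if none (mirrors app.py)."""
    match = re.search(r"```(?:\w*)\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

# A reply with surrounding prose: only the fenced block is kept.
reply = "Here is the fix:\n```python\nprint('hi')\n```\nLet me know!"
assert extract_code(reply) == "print('hi')"

# No fences: the full text is returned unchanged (minus whitespace).
assert extract_code("no fences here") == "no fences here"
````

This fallback matters for the "Save code" button: if the model answers in plain prose, the whole answer is written to the file rather than nothing.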
**requirements.txt** (new file, +2 lines)

```
streamlit
openai
```
**test_server.py**

````diff
@@ -38,9 +38,9 @@ def main():
     response = client.chat.completions.create(
         model=model,
         messages=[
-            {"role": "user", "content": "What is 2 + 2? Answer in one sentence."}
+            {"role": "user", "content": "Create a latex document that derives and explains the principle component analysis (pca). Make a self contain document with introduction, derivation, examples of applications. This is for computer science undergraduate class."}
         ],
-        max_tokens=256,
+        max_tokens=16384,
         temperature=0.7,
     )
     print(f" Response: {response.choices[0].message.content}")
@@ -53,7 +53,7 @@ def main():
         messages=[
             {"role": "user", "content": "Count from 1 to 5."}
         ],
-        max_tokens=128,
+        max_tokens=16384,
         temperature=0.7,
         stream=True,
     )
````
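The streamed variant in test_server.py concatenates `delta.content` pieces chunk by chunk. With stand-in chunk objects (the helper below is hypothetical, not the `openai` SDK types) the accumulation logic looks like:

```python
from types import SimpleNamespace

def make_chunk(content):
    # Stand-in for one streamed chunk: choices[0].delta.content may be None
    # (e.g. the final role/stop chunk), so `or ""` guards the concatenation.
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

stream = [make_chunk("1, "), make_chunk("2, "), make_chunk("3"), make_chunk(None)]

full = ""
for chunk in stream:
    full += chunk.choices[0].delta.content or ""

print(full)  # 1, 2, 3
```

The same `or ""` guard appears in app.py's chat loop; dropping it raises a `TypeError` on the terminal chunk.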
**vllm_qwen.def**

````diff
@@ -1,10 +1,10 @@
 Bootstrap: docker
-From: vllm/vllm-openai:latest
+From: vllm/vllm-openai:nightly
 
 %labels
     Author herzogfloria
     Description vLLM nightly inference server for Qwen3.5-35B-A3B
-    Version 2.0
+    Version 3.0
 
 %environment
     export HF_HOME=/tmp/hf_cache
@@ -12,7 +12,6 @@ From: vllm/vllm-openai:latest
 
 %post
     apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*
-    pip install --no-cache-dir vllm --extra-index-url https://wheels.vllm.ai/nightly
     pip install --no-cache-dir "transformers @ git+https://github.com/huggingface/transformers.git@main"
    pip install --no-cache-dir huggingface_hub[cli]
 
````