Add Streamlit chat app, update container to vLLM nightly
- Add app.py: Streamlit UI with chat and file editor tabs
- Add requirements.txt: streamlit + openai dependencies
- Update vllm_qwen.def: use nightly image for Qwen3.5 support
- Update README.md: reflect 35B-A3B model, correct script names
- Update STUDENT_GUIDE.md: add app usage and thinking mode docs
- Update .gitignore: exclude .venv/ and workspace/

Made-with: Cursor
Parent: 076001b07f
Commit: 9e1e0c0751

.gitignore (vendored): 6 lines changed
@@ -10,5 +10,11 @@ models/
 # HuggingFace cache
 .cache/
 
+# Python venv
+.venv/
+
+# Streamlit workspace files
+workspace/
+
 # macOS
 .DS_Store
README.md: 249 lines changed
@@ -1,7 +1,8 @@
-# LLM Local — Qwen3.5-27B Inference Server
+# LLM Inferenz Server — Qwen3.5-35B-A3B
 
-Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-27B**,
-served via **vLLM** inside an **Apptainer** container on a GPU server.
+Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B**
+(MoE, 35B total / 3B active per token), served via **vLLM** inside an
+**Apptainer** container on a GPU server.
 
 ## Architecture
 
@@ -9,40 +10,44 @@ served via **vLLM** inside an **Apptainer** container on a GPU server.
 Students (OpenAI SDK / curl)
         │
         ▼
-┌─────────────────────────┐
-│  silicon.fhgr.ch:7080   │
-│  OpenAI-compatible API  │
-├─────────────────────────┤
-│      vLLM Server        │
-│  (Apptainer container)  │
-├─────────────────────────┤
-│  Qwen3.5-27B weights    │
-│     (bind-mounted)      │
-├─────────────────────────┤
-│       NVIDIA GPU        │
-└─────────────────────────┘
+┌──────────────────────────────┐
+│     silicon.fhgr.ch:7080     │
+│    OpenAI-compatible API     │
+├──────────────────────────────┤
+│    vLLM Server (nightly)     │
+│  Apptainer container (.sif)  │
+├──────────────────────────────┤
+│   Qwen3.5-35B-A3B weights    │
+│   (bind-mounted from host)   │
+├──────────────────────────────┤
+│  2× NVIDIA L40S (46 GB ea.)  │
+│     Tensor Parallel = 2      │
+└──────────────────────────────┘
 ```
 
+## Hardware
+
+The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each).
+The inference server uses **2 GPUs** with tensor parallelism, leaving 2 GPUs free.
+
+| Component | Value |
+|-----------|-------|
+| GPUs used | 2× NVIDIA L40S |
+| VRAM used | ~92 GB total |
+| Model size (BF16) | ~67 GB |
+| Active params/token | 3B (MoE) |
+| Context length | 32,768 tokens |
+| Port | 7080 |
+
 ## Prerequisites
 
-- **GPU**: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended).
-  Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead.
-- **Apptainer** (formerly Singularity) installed on the server.
-- **NVIDIA drivers** + **nvidia-container-cli** for GPU passthrough.
-- **~60 GB disk space** for model weights + ~15 GB for the container image.
-- **Network**: Students must be on the university network or VPN.
+- **Apptainer** (formerly Singularity) installed on the server
+- **NVIDIA drivers** with GPU passthrough support (`--nv` flag)
+- **~80 GB disk** for model weights + ~8 GB for the container image
+- **Network access** to Hugging Face (for model download) and Docker Hub (for container build)
 
-## Hardware Sizing
-
-| Component | Minimum         | Recommended     |
-|-----------|-----------------|-----------------|
-| GPU VRAM  | 80 GB (1× A100) | 80 GB (1× H100) |
-| RAM       | 64 GB           | 128 GB          |
-| Disk      | 100 GB free     | 200 GB free     |
-
-> **If your GPU has less than 80 GB VRAM**, you have two options:
-> 1. Use a **quantized** version (e.g., AWQ/GPTQ 4-bit — ~16 GB VRAM)
-> 2. Use **tensor parallelism** across multiple GPUs (set `TENSOR_PARALLEL=2`)
+> **Note**: No `pip` or `python` is needed on the host — everything runs inside
+> the Apptainer container.
 
 ---
 
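The VRAM figures above can be sanity-checked with back-of-envelope arithmetic (a rough sketch assuming BF16 = 2 bytes per parameter and an even split across GPUs; KV cache and activation memory come on top):

```python
# Rough VRAM estimate for the Qwen3.5-35B-A3B weights in BF16.
# Assumption: all 35B parameters (every MoE expert) stay resident in VRAM,
# even though only ~3B are active per token.
PARAMS = 35e9
BYTES_PER_PARAM = 2        # BF16
TENSOR_PARALLEL = 2        # weights sharded across 2 GPUs

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9    # decimal GB
per_gpu_gb = weights_gb / TENSOR_PARALLEL

print(f"~{weights_gb:.0f} GB of weights, ~{per_gpu_gb:.0f} GB per GPU")
# → ~70 GB of weights, ~35 GB per GPU
```

This lines up with the ~67 GB on-disk size; whatever VRAM remains below `GPU_MEM_UTIL` is used for KV cache.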
@@ -54,11 +59,10 @@ Students (OpenAI SDK / curl)
 ssh herzogfloria@silicon.fhgr.ch
 ```
 
-### Step 1: Clone This Repository
+### Step 1: Clone the Repository
 
 ```bash
-# Or copy the files to the server
-git clone <your-repo-url> ~/LLM_local
+git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
 cd ~/LLM_local
 chmod +x *.sh
 ```
@@ -66,124 +70,95 @@ chmod +x *.sh
 ### Step 2: Check GPU and Environment
 
 ```bash
 # Verify GPU is visible
 nvidia-smi
 
 # Verify Apptainer is installed
 apptainer --version
 
 # Check available disk space
 df -h ~
 ```
 
-### Step 3: Download the Model (~60 GB)
+### Step 3: Build the Apptainer Container
 
 ```bash
-# Install huggingface-cli if not available
-pip install --user huggingface_hub[cli]
-
-# Download Qwen3.5-27B
-bash 01_download_model.sh
-# Default target: ~/models/Qwen3.5-27B
+bash 01_build_container.sh
 ```
 
-This downloads the full BF16 weights. Takes 20-60 minutes depending on bandwidth.
+Pulls the `vllm/vllm-openai:nightly` Docker image (nightly vLLM is required
+for Qwen3.5 support), installs the latest `transformers` from source, and
+packages everything into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
 
-### Step 4: Build the Apptainer Container
+### Step 4: Download the Model (~67 GB)
 
 ```bash
-bash 02_build_container.sh
+bash 02_download_model.sh
 ```
 
-This pulls the `vllm/vllm-openai:latest` Docker image and converts it to a `.sif` file.
-Takes 10-20 minutes. The resulting `vllm_qwen.sif` is ~12-15 GB.
-
-> **Tip**: If building fails due to network/proxy issues, you can pull the Docker image
-> first and convert manually:
-> ```bash
-> apptainer pull docker://vllm/vllm-openai:latest
-> ```
+Downloads the Qwen3.5-35B-A3B weights using `huggingface-cli` **inside the
+container**. Stored at `~/models/Qwen3.5-35B-A3B`. Takes 5-30 minutes
+depending on bandwidth.
 
 ### Step 5: Start the Server
 
-**Interactive (foreground):**
+**Interactive (foreground) — recommended with tmux:**
 ```bash
+tmux new -s llm
 bash 03_start_server.sh
+# Ctrl+B, then D to detach
 ```
 
-**Background (recommended for production):**
+**Background with logging:**
 ```bash
 bash 04_start_server_background.sh
 ```
 
-The server takes 2-5 minutes to load the model into GPU memory. Monitor with:
-```bash
-tail -f logs/vllm_server_*.log
-```
-
-Look for the line:
+The model takes 2-5 minutes to load into GPU memory. It's ready when you see:
 ```
-INFO:     Uvicorn running on http://0.0.0.0:8000
+INFO:     Uvicorn running on http://0.0.0.0:7080
 ```
 
 ### Step 6: Test the Server
 
 From another terminal on the server:
 ```bash
 # Quick health check
 curl http://localhost:7080/v1/models
+```
 
-# Full test
-pip install openai
-python test_server.py
+Or run the full test (uses the `openai` SDK inside the container):
+```bash
+apptainer exec --writable-tmpfs vllm_qwen.sif python3 test_server.py
 ```
 
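The same health check can be scripted from Python with nothing but the standard library (a sketch; hostname and port taken from the connection details above, adjust to your setup):

```python
import json
import urllib.request
import urllib.error


def check_server(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if an OpenAI-compatible /v1/models endpoint answers."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            data = json.load(resp)
            # vLLM lists the served models under the "data" key.
            return bool(data.get("data"))
    except (urllib.error.URLError, OSError, ValueError):
        return False


if __name__ == "__main__":
    print("server up:", check_server("http://localhost:7080"))
```

Useful as a pre-flight check in student scripts before firing off a long generation request.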
 ### Step 7: Share with Students
 
-Distribute the `STUDENT_GUIDE.md` file or share the connection details:
-- **27B Base URL**: `http://silicon.fhgr.ch:7080/v1` — model name: `qwen3.5-27b`
-- **35B Base URL**: `http://silicon.fhgr.ch:7081/v1` — model name: `qwen3.5-35b-a3b`
+Distribute `STUDENT_GUIDE.md` with connection details:
+- **Base URL**: `http://silicon.fhgr.ch:7080/v1`
+- **Model name**: `qwen3.5-35b-a3b`
 
 ---
 
 ## Configuration
 
-All configuration is via environment variables in `03_start_server.sh`:
+All configuration is via environment variables passed to `03_start_server.sh`:
 
-| Variable          | Default                      | Description                         |
-|-------------------|------------------------------|-------------------------------------|
-| `MODEL_DIR`       | `~/models/Qwen3.5-27B`       | Path to model weights               |
-| `PORT`            | `7080`                       | HTTP port                           |
-| `MAX_MODEL_LEN`   | `32768`                      | Max context length (tokens)         |
-| `GPU_MEM_UTIL`    | `0.92`                       | Fraction of GPU memory to use       |
-| `API_KEY`         | *(empty = no auth)*          | API key for authentication          |
-| `TENSOR_PARALLEL` | `1`                          | Number of GPUs                      |
+| Variable          | Default                          | Description                    |
+|-------------------|----------------------------------|--------------------------------|
+| `MODEL_DIR`       | `~/models/Qwen3.5-35B-A3B`       | Path to model weights          |
+| `PORT`            | `7080`                           | HTTP port                      |
+| `MAX_MODEL_LEN`   | `32768`                          | Max context length (tokens)    |
+| `GPU_MEM_UTIL`    | `0.92`                           | Fraction of GPU memory to use  |
+| `API_KEY`         | *(empty = no auth)*              | API key for authentication    |
+| `TENSOR_PARALLEL` | `2`                              | Number of GPUs                 |
 
-### Context Length Tuning
+### Examples
 
-The default `MAX_MODEL_LEN=32768` is conservative and ensures stable operation for 15
-concurrent users. If you have plenty of VRAM headroom:
-
-```bash
-MAX_MODEL_LEN=65536 bash 03_start_server.sh
-```
-
-Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require
-significantly more GPU memory for KV cache.
-
-### Adding Authentication
-
-```bash
-API_KEY="your-secret-key-here" bash 03_start_server.sh
-```
-
-Students then use this key in their `api_key` parameter.
-
-### Multi-GPU Setup
-
-If you have multiple GPUs:
-
-```bash
-TENSOR_PARALLEL=2 bash 03_start_server.sh
+```bash
+# Increase context length
+MAX_MODEL_LEN=65536 bash 03_start_server.sh
+
+# Add API key authentication
+API_KEY="your-secret-key" bash 03_start_server.sh
+
+# Use all 4 GPUs (more KV cache headroom)
+TENSOR_PARALLEL=4 bash 03_start_server.sh
 ```
 
 ---
 
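The defaults in the table suggest the usual shell defaulting idiom; a sketch of how `03_start_server.sh` presumably reads these variables (the actual script is not shown here, so treat this as illustrative, with names and defaults taken from the table):

```shell
#!/usr/bin/env bash
# Environment-variable defaulting pattern (values from the configuration table).
# ${VAR:-default} keeps an exported value if present, else uses the default.
MODEL_DIR="${MODEL_DIR:-$HOME/models/Qwen3.5-35B-A3B}"
PORT="${PORT:-7080}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-32768}"
GPU_MEM_UTIL="${GPU_MEM_UTIL:-0.92}"
TENSOR_PARALLEL="${TENSOR_PARALLEL:-2}"

echo "port=$PORT tp=$TENSOR_PARALLEL max_len=$MAX_MODEL_LEN"
```

This is why `MAX_MODEL_LEN=65536 bash 03_start_server.sh` works: the exported value wins over the default.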
@@ -195,7 +170,7 @@ TENSOR_PARALLEL=2 bash 03_start_server.sh
 bash 04_start_server_background.sh
 
 # Check if running
-curl -s http://localhost:7080/v1/models | python -m json.tool
+curl -s http://localhost:7080/v1/models | python3 -m json.tool
 
 # View logs
 tail -f logs/vllm_server_*.log
@@ -205,61 +180,53 @@ bash 05_stop_server.sh
 
 # Monitor GPU usage
 watch -n 2 nvidia-smi
 ```
 
 ### Running Persistently with tmux
 
-For a robust setup that survives SSH disconnects:
-
 ```bash
-ssh herzogfloria@silicon.fhgr.ch
-tmux new -s llm_server
-bash 03_start_server.sh
-# Press Ctrl+B, then D to detach
-
-# Reconnect later:
-tmux attach -t llm_server
+# Reconnect to tmux session
+tmux attach -t llm
 ```
 
 ---
 
 ## Files Overview
 
-| File                            | Purpose                                    |
-|---------------------------------|--------------------------------------------|
-| `vllm_qwen.def`                 | Apptainer container definition             |
-| `01_download_model.sh`          | Downloads model weights from Hugging Face  |
-| `02_build_container.sh`         | Builds the Apptainer .sif image            |
-| `03_start_server.sh`            | Starts vLLM server (foreground)            |
-| `04_start_server_background.sh` | Starts server in background with logging   |
-| `05_stop_server.sh`             | Stops the background server                |
-| `test_server.py`                | Tests the running server                   |
-| `STUDENT_GUIDE.md`              | Instructions for students                  |
+| File                            | Purpose                                              |
+|---------------------------------|------------------------------------------------------|
+| `vllm_qwen.def`                 | Apptainer container definition (vLLM nightly + deps) |
+| `01_build_container.sh`         | Builds the Apptainer `.sif` image                    |
+| `02_download_model.sh`          | Downloads model weights (runs inside container)      |
+| `03_start_server.sh`            | Starts vLLM server (foreground)                      |
+| `04_start_server_background.sh` | Starts server in background with logging             |
+| `05_stop_server.sh`             | Stops the background server                          |
+| `test_server.py`                | Tests the running server                             |
+| `STUDENT_GUIDE.md`              | Instructions for students                            |
 
 ---
 
 ## Troubleshooting
 
 ### "CUDA out of memory"
-- Reduce `MAX_MODEL_LEN` (e.g., 16384)
-- Reduce `GPU_MEM_UTIL` (e.g., 0.85)
-- Use a quantized model variant
+- Reduce `MAX_MODEL_LEN` (e.g., `16384`)
+- Reduce `GPU_MEM_UTIL` (e.g., `0.85`)
 
 ### Container build fails
-- Ensure you have internet access and sufficient disk space (~20 GB for build cache)
-- Try: `apptainer pull docker://vllm/vllm-openai:latest` first
+- Ensure internet access and sufficient disk space (~20 GB for build cache)
+- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:nightly`
 
 ### "No NVIDIA GPU detected"
-- Check that `nvidia-smi` works outside the container
-- Ensure `--nv` flag is passed (already in scripts)
-- Verify nvidia-container-cli: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
+- Verify `nvidia-smi` works on the host
+- Ensure `--nv` flag is present (already in scripts)
+- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
 
-### Server starts but students can't connect
-- Check firewall: `sudo ufw allow 7080:7090/tcp` or equivalent
+### "Model type qwen3_5_moe not recognized"
+- The container needs vLLM nightly and latest transformers
+- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh`
+
+### Students can't connect
+- Check firewall: ports 7080-7090 must be open
 - Verify the server binds to `0.0.0.0` (not just localhost)
 - Students must use the server's hostname/IP, not `localhost`
+- Students must be on the university network or VPN
 
 ### Slow generation with many users
-- This is expected — vLLM batches requests but throughput is finite
-- Consider reducing `max_tokens` in student requests
-- Monitor with: `curl http://localhost:7080/metrics`
+- Expected — vLLM batches requests but throughput is finite
+- The MoE architecture (3B active) helps with per-token speed
+- Monitor: `curl http://localhost:7080/metrics`
 
STUDENT_GUIDE.md

@@ -107,6 +107,55 @@ curl http://silicon.fhgr.ch:7080/v1/chat/completions \
 
 ---
 
+## Streamlit Chat & File Editor App
+
+A simple web UI is included for chatting with the model and editing files.
+
+### Setup
+
+```bash
+pip install streamlit openai
+```
+
+### Run
+
+```bash
+streamlit run app.py
+```
+
+This opens a browser with two tabs:
+
+- **Chat** — Conversational interface with streaming responses. You can save
+  the model's last response directly to a file.
+- **File Editor** — Create and edit `.py`, `.tex`, `.html`, or any text file.
+  Use the "Generate with LLM" button to have the model modify your file based
+  on an instruction (e.g. "add error handling" or "fix the LaTeX formatting").
+
+Files are stored in a `workspace/` folder next to `app.py`.
+
+> **Tip**: The app runs on your local machine and connects to the server — you
+> don't need to install anything on the GPU server.
+
+---
+
+## Thinking Mode
+
+By default, the model "thinks" before answering (internal chain-of-thought).
+This is great for complex reasoning but adds latency for simple questions.
+
+To disable thinking and get faster direct responses, add this to your API call:
+
+```python
+response = client.chat.completions.create(
+    model="qwen3.5-35b-a3b",
+    messages=[...],
+    max_tokens=1024,
+    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
+)
+```
+
+---
+
 ## Troubleshooting
 
 | Issue | Solution |
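The streaming responses mentioned above arrive as chunks whose `delta.content` may be None; assembling the full reply is a small fold over the stream. A server-free sketch with stubbed chunk objects (the stub classes are illustrative stand-ins, not the real OpenAI SDK types):

```python
from dataclasses import dataclass
from typing import List, Optional


# Minimal stand-ins for the OpenAI SDK's streaming chunk objects.
@dataclass
class Delta:
    content: Optional[str]


@dataclass
class Choice:
    delta: Delta


@dataclass
class Chunk:
    choices: List[Choice]


def assemble(stream) -> str:
    """Concatenate streamed deltas, treating None content as empty."""
    full = ""
    for chunk in stream:
        full += chunk.choices[0].delta.content or ""
    return full


fake_stream = [Chunk([Choice(Delta(part))]) for part in ["Hel", "lo", None, "!"]]
print(assemble(fake_stream))  # → Hello!
```

The `or ""` guard matters: the final chunk of a real stream often carries `content=None`, and naive `+=` would raise a TypeError.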
app.py (new file): 181 lines

@@ -0,0 +1,181 @@
"""
Streamlit Chat & File Editor for Qwen3.5-35B-A3B

A minimal interface to:
1. Chat with the local LLM (OpenAI-compatible API)
2. Edit, save, and generate code / LaTeX files

Usage:
    pip install streamlit openai
    streamlit run app.py
"""

import re

import streamlit as st
from openai import OpenAI
from pathlib import Path

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
API_BASE = st.sidebar.text_input("API Base URL", "http://silicon.fhgr.ch:7080/v1")
API_KEY = st.sidebar.text_input("API Key", "EMPTY", type="password")
MODEL = "qwen3.5-35b-a3b"
WORKSPACE = Path("workspace")
WORKSPACE.mkdir(exist_ok=True)

client = OpenAI(base_url=API_BASE, api_key=API_KEY)

LANG_MAP = {
    ".py": "python", ".tex": "latex", ".js": "javascript",
    ".html": "html", ".css": "css", ".sh": "bash",
    ".json": "json", ".yaml": "yaml", ".yml": "yaml",
}


def extract_code(text: str, lang: str = "") -> str:
    """Extract the first fenced code block from markdown text.

    Falls back to the full text if no code block is found."""
    pattern = r"```(?:\w*)\n(.*?)```"
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return text.strip()


# ---------------------------------------------------------------------------
# Sidebar — File Manager
# ---------------------------------------------------------------------------
st.sidebar.markdown("---")
st.sidebar.header("File Manager")

new_filename = st.sidebar.text_input("New file name", placeholder="main.tex")
if st.sidebar.button("Create File") and new_filename:
    (WORKSPACE / new_filename).touch()
    st.sidebar.success(f"Created {new_filename}")
    st.rerun()

files = sorted(WORKSPACE.iterdir()) if WORKSPACE.exists() else []
file_names = [f.name for f in files if f.is_file()]
selected_file = st.sidebar.selectbox("Open file", file_names if file_names else ["(no files)"])

# ---------------------------------------------------------------------------
# Main Layout — Two Tabs
# ---------------------------------------------------------------------------
tab_chat, tab_editor = st.tabs(["Chat", "File Editor"])

# ---------------------------------------------------------------------------
# Tab 1: Chat
# ---------------------------------------------------------------------------
with tab_chat:
    st.header("Chat with Qwen3.5")

    if "messages" not in st.session_state:
        st.session_state.messages = []

    for msg in st.session_state.messages:
        with st.chat_message(msg["role"]):
            st.markdown(msg["content"])

    if prompt := st.chat_input("Ask anything..."):
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)

        with st.chat_message("assistant"):
            placeholder = st.empty()
            full_response = ""

            stream = client.chat.completions.create(
                model=MODEL,
                messages=st.session_state.messages,
                max_tokens=8092,
                temperature=0.2,
                stream=True,
                extra_body={"chat_template_kwargs": {"enable_thinking": True}},
            )
            for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                full_response += delta
                placeholder.markdown(full_response + "▌")
            placeholder.markdown(full_response)

        st.session_state.messages.append({"role": "assistant", "content": full_response})

    if st.session_state.messages:
        col_clear, col_save = st.columns([1, 3])
        with col_clear:
            if st.button("Clear Chat"):
                st.session_state.messages = []
                st.rerun()
        with col_save:
            if selected_file and selected_file != "(no files)":
                if st.button(f"Save code → {selected_file}"):
                    last = st.session_state.messages[-1]["content"]
                    suffix = Path(selected_file).suffix
                    lang = LANG_MAP.get(suffix, "")
                    code = extract_code(last, lang)
                    (WORKSPACE / selected_file).write_text(code)
                    st.success(f"Extracted code saved to workspace/{selected_file}")

# ---------------------------------------------------------------------------
# Tab 2: File Editor
# ---------------------------------------------------------------------------
with tab_editor:
    st.header("File Editor")

    if selected_file and selected_file != "(no files)":
        file_path = WORKSPACE / selected_file
        content = file_path.read_text() if file_path.exists() else ""
        suffix = file_path.suffix
        lang = LANG_MAP.get(suffix, "text")

        st.code(content, language=lang if lang != "text" else None, line_numbers=True)

        edited = st.text_area(
            "Edit below:",
            value=content,
            height=400,
            key=f"editor_{selected_file}_{hash(content)}",
        )

        col_save, col_gen = st.columns(2)

        with col_save:
            if st.button("Save File"):
                file_path.write_text(edited)
                st.success(f"Saved {selected_file}")
                st.rerun()

        with col_gen:
            gen_prompt = st.text_input(
                "Generation instruction",
                placeholder="e.g. Add error handling / Fix the LaTeX formatting",
                key="gen_prompt",
            )
            if st.button("Generate with LLM") and gen_prompt:
                with st.spinner("Generating..."):
                    response = client.chat.completions.create(
                        model=MODEL,
                        messages=[
                            {"role": "system", "content": (
                                f"You are a coding assistant. The user has a {lang} file. "
                                "Return ONLY the raw file content inside a single code block. "
                                "No explanations, no comments about changes."
                            )},
                            {"role": "user", "content": (
                                f"Here is my {lang} file:\n\n```\n{edited}\n```\n\n"
                                f"Instruction: {gen_prompt}"
                            )},
                        ],
                        max_tokens=16384,
                        temperature=0.6,
                        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
                    )
                    result = response.choices[0].message.content
                    code = extract_code(result, lang)
                    file_path.write_text(code)
                    st.success("File updated by LLM")
                    st.rerun()
    else:
        st.info("Create a file in the sidebar to start editing.")
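The `extract_code` helper in app.py above is a pure function and easy to sanity-check outside Streamlit; the regex takes the first fenced block non-greedily and falls back to the whole text:

```python
import re


def extract_code(text: str, lang: str = "") -> str:
    """Copy of app.py's helper: first fenced code block, else the full text."""
    match = re.search(r"```(?:\w*)\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()


reply = "Here is the fix:\n\n```python\nprint('hi')\n```\nLet me know!"
print(extract_code(reply))           # → print('hi')
print(extract_code("plain answer"))  # → plain answer
```

Note the fallback: if the model ignores the "single code block" instruction and replies in plain prose, the whole reply gets written to the file, prose and all.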
requirements.txt (new file): 2 lines

@@ -0,0 +1,2 @@
streamlit
openai
test_server.py

@@ -38,9 +38,9 @@ def main():
     response = client.chat.completions.create(
         model=model,
         messages=[
-            {"role": "user", "content": "What is 2 + 2? Answer in one sentence."}
+            {"role": "user", "content": "Create a LaTeX document that derives and explains principal component analysis (PCA). Make a self-contained document with introduction, derivation, and examples of applications. This is for a computer science undergraduate class."}
         ],
-        max_tokens=256,
+        max_tokens=16384,
         temperature=0.7,
     )
     print(f"   Response: {response.choices[0].message.content}")
@@ -53,7 +53,7 @@ def main():
         messages=[
             {"role": "user", "content": "Count from 1 to 5."}
         ],
-        max_tokens=128,
+        max_tokens=16384,
         temperature=0.7,
         stream=True,
     )
vllm_qwen.def

@@ -1,10 +1,10 @@
 Bootstrap: docker
-From: vllm/vllm-openai:latest
+From: vllm/vllm-openai:nightly
 
 %labels
     Author herzogfloria
     Description vLLM nightly inference server for Qwen3.5-35B-A3B
-    Version 2.0
+    Version 3.0
 
 %environment
     export HF_HOME=/tmp/hf_cache

@@ -12,7 +12,6 @@ From: vllm/vllm-openai:nightly
 
 %post
     apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*
-    pip install --no-cache-dir vllm --extra-index-url https://wheels.vllm.ai/nightly
    pip install --no-cache-dir "transformers @ git+https://github.com/huggingface/transformers.git@main"
    pip install --no-cache-dir huggingface_hub[cli]