diff --git a/.gitignore b/.gitignore index 0868882..1cab20f 100644 --- a/.gitignore +++ b/.gitignore @@ -10,5 +10,11 @@ models/ # HuggingFace cache .cache/ +# Python venv +.venv/ + +# Streamlit workspace files +workspace/ + # macOS .DS_Store diff --git a/README.md b/README.md index b835a07..75d452d 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,8 @@ -# LLM Local — Qwen3.5-27B Inference Server +# LLM Inferenz Server — Qwen3.5-35B-A3B -Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-27B**, -served via **vLLM** inside an **Apptainer** container on a GPU server. +Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-35B-A3B** +(MoE, 35B total / 3B active per token), served via **vLLM** inside an +**Apptainer** container on a GPU server. ## Architecture @@ -9,40 +10,44 @@ served via **vLLM** inside an **Apptainer** container on a GPU server. Students (OpenAI SDK / curl) │ ▼ - ┌─────────────────────────┐ - │ silicon.fhgr.ch:7080 │ - │ OpenAI-compatible API │ - ├─────────────────────────┤ - │ vLLM Server │ - │ (Apptainer container) │ - ├─────────────────────────┤ - │ Qwen3.5-27B weights │ - │ (bind-mounted) │ - ├─────────────────────────┤ - │ NVIDIA GPU │ - └─────────────────────────┘ + ┌──────────────────────────────┐ + │ silicon.fhgr.ch:7080 │ + │ OpenAI-compatible API │ + ├──────────────────────────────┤ + │ vLLM Server (nightly) │ + │ Apptainer container (.sif) │ + ├──────────────────────────────┤ + │ Qwen3.5-35B-A3B weights │ + │ (bind-mounted from host) │ + ├──────────────────────────────┤ + │ 2× NVIDIA L40S (46 GB ea.) │ + │ Tensor Parallel = 2 │ + └──────────────────────────────┘ ``` +## Hardware + +The server `silicon.fhgr.ch` has **4× NVIDIA L40S** GPUs (46 GB VRAM each). +The inference server uses **2 GPUs** with tensor parallelism, leaving 2 GPUs free. 
+ +| Component | Value | +|-----------|-------| +| GPUs used | 2× NVIDIA L40S | +| VRAM used | ~92 GB total | +| Model size (BF16) | ~67 GB | +| Active params/token | 3B (MoE) | +| Context length | 32,768 tokens | +| Port | 7080 | + ## Prerequisites -- **GPU**: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended). - Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead. -- **Apptainer** (formerly Singularity) installed on the server. -- **NVIDIA drivers** + **nvidia-container-cli** for GPU passthrough. -- **~60 GB disk space** for model weights + ~15 GB for the container image. -- **Network**: Students must be on the university network or VPN. +- **Apptainer** (formerly Singularity) installed on the server +- **NVIDIA drivers** with GPU passthrough support (`--nv` flag) +- **~80 GB disk** for model weights + ~8 GB for the container image +- **Network access** to Hugging Face (for model download) and Docker Hub (for container build) -## Hardware Sizing - -| Component | Minimum | Recommended | -|-----------|----------------|-----------------| -| GPU VRAM | 80 GB (1× A100)| 80 GB (1× H100) | -| RAM | 64 GB | 128 GB | -| Disk | 100 GB free | 200 GB free | - -> **If your GPU has less than 80 GB VRAM**, you have two options: -> 1. Use a **quantized** version (e.g., AWQ/GPTQ 4-bit — ~16 GB VRAM) -> 2. Use **tensor parallelism** across multiple GPUs (set `TENSOR_PARALLEL=2`) +> **Note**: No `pip` or `python` is needed on the host — everything runs inside +> the Apptainer container. 
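The sizing table above can be sanity-checked from the parameter count alone: BF16 stores two bytes per parameter, so 35B parameters come to roughly 65 GiB of weights. A back-of-envelope sketch (the 0.92 factor mirrors the `GPU_MEM_UTIL` default described below; all figures are estimates, not measurements):

```python
# Back-of-envelope VRAM check for the sizing table above.
# Assumption: BF16 stores each parameter in 2 bytes; KV-cache headroom
# is whatever remains after the weights are loaded.
GiB = 1024**3

total_params = 35e9                     # 35B total (MoE)
weights_gib = total_params * 2 / GiB    # BF16 = 2 bytes/param

gpus = 2
vram_per_gpu_gib = 46
usable = gpus * vram_per_gpu_gib * 0.92  # GPU_MEM_UTIL=0.92

print(f"weights ≈ {weights_gib:.0f} GiB")               # ≈ 65 GiB
print(f"usable VRAM ≈ {usable:.0f} GiB")                # ≈ 85 GiB
print(f"KV-cache headroom ≈ {usable - weights_gib:.0f} GiB")
```

The remaining ~19 GiB is what vLLM can dedicate to KV cache, which is why 2 GPUs are enough here while a single 46 GB card is not.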
---

@@ -54,11 +59,10 @@ Students (OpenAI SDK / curl)
 ssh herzogfloria@silicon.fhgr.ch
 ```
 
-### Step 1: Clone This Repository
+### Step 1: Clone the Repository
 
 ```bash
-# Or copy the files to the server
-git clone ~/LLM_local
+git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git ~/LLM_local
 cd ~/LLM_local
 chmod +x *.sh
 ```
 
@@ -66,124 +70,95 @@ chmod +x *.sh
 ### Step 2: Check GPU and Environment
 
 ```bash
-# Verify GPU is visible
 nvidia-smi
-
-# Verify Apptainer is installed
 apptainer --version
-
-# Check available disk space
 df -h ~
 ```
 
-### Step 3: Download the Model (~60 GB)
+### Step 3: Build the Apptainer Container
 
 ```bash
-# Install huggingface-cli if not available
-pip install --user huggingface_hub[cli]
-
-# Download Qwen3.5-27B
-bash 01_download_model.sh
-# Default target: ~/models/Qwen3.5-27B
+bash 01_build_container.sh
 ```
 
-This downloads the full BF16 weights. Takes 20-60 minutes depending on bandwidth.
+Pulls the `vllm/vllm-openai:nightly` Docker image (required for Qwen3.5
+support), installs the latest `transformers` from source, and packages
+everything into `vllm_qwen.sif` (~8 GB). Takes 15-20 minutes.
 
-### Step 4: Build the Apptainer Container
+### Step 4: Download the Model (~67 GB)
 
 ```bash
-bash 02_build_container.sh
+bash 02_download_model.sh
 ```
 
-This pulls the `vllm/vllm-openai:latest` Docker image and converts it to a `.sif` file.
-Takes 10-20 minutes. The resulting `vllm_qwen.sif` is ~12-15 GB.
-
-> **Tip**: If building fails due to network/proxy issues, you can pull the Docker image
-> first and convert manually:
-> ```bash
-> apptainer pull docker://vllm/vllm-openai:latest
-> ```
+Downloads Qwen3.5-35B-A3B weights using `huggingface-cli` **inside the
+container**. Stored at `~/models/Qwen3.5-35B-A3B`. Takes 5-30 minutes
+depending on bandwidth.
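Interrupted Hugging Face downloads are easy to miss, so it can be worth verifying that Step 4 actually completed before starting the server. A minimal sketch (the path matches the default above; the 60 GiB threshold is a rough heuristic, not an exact figure):

```python
# Quick completeness check for a downloaded Hugging Face model directory.
# Run after 02_download_model.sh; the exact shard count varies by model.
from pathlib import Path

def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of problems found; an empty list means the
    directory looks like a complete checkout."""
    d = Path(model_dir).expanduser()
    if not d.is_dir():
        return [f"{d} does not exist"]
    problems = []
    if not (d / "config.json").exists():
        problems.append("missing config.json")
    shards = list(d.glob("*.safetensors"))
    if not shards:
        problems.append("no *.safetensors weight shards found")
    total_gib = sum(f.stat().st_size for f in shards) / 1024**3
    if shards and total_gib < 60:  # BF16 weights should be ~65 GiB
        problems.append(f"weights only {total_gib:.1f} GiB; download may be partial")
    return problems

if __name__ == "__main__":
    issues = check_model_dir("~/models/Qwen3.5-35B-A3B")
    print("OK" if not issues else "\n".join(issues))
```

If this reports a partial download, rerunning `02_download_model.sh` should resume it, since `huggingface-cli` skips files that are already complete.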
### Step 5: Start the Server -**Interactive (foreground):** +**Interactive (foreground) — recommended with tmux:** ```bash +tmux new -s llm bash 03_start_server.sh +# Ctrl+B, then D to detach ``` -**Background (recommended for production):** +**Background with logging:** ```bash bash 04_start_server_background.sh -``` - -The server takes 2-5 minutes to load the model into GPU memory. Monitor with: -```bash tail -f logs/vllm_server_*.log ``` -Look for the line: +The model takes 2-5 minutes to load into GPU memory. It's ready when you see: ``` -INFO: Uvicorn running on http://0.0.0.0:8000 +INFO: Uvicorn running on http://0.0.0.0:7080 ``` ### Step 6: Test the Server +From another terminal on the server: ```bash -# Quick health check curl http://localhost:7080/v1/models +``` -# Full test -pip install openai -python test_server.py +Or run the full test (uses `openai` SDK inside the container): +```bash +apptainer exec --writable-tmpfs vllm_qwen.sif python3 test_server.py ``` ### Step 7: Share with Students -Distribute the `STUDENT_GUIDE.md` file or share the connection details: -- **27B Base URL**: `http://silicon.fhgr.ch:7080/v1` — model name: `qwen3.5-27b` -- **35B Base URL**: `http://silicon.fhgr.ch:7081/v1` — model name: `qwen3.5-35b-a3b` +Distribute `STUDENT_GUIDE.md` with connection details: +- **Base URL**: `http://silicon.fhgr.ch:7080/v1` +- **Model name**: `qwen3.5-35b-a3b` --- ## Configuration -All configuration is via environment variables in `03_start_server.sh`: +All configuration is via environment variables passed to `03_start_server.sh`: -| Variable | Default | Description | -|-------------------|------------------------------|-------------------------------------| -| `MODEL_DIR` | `~/models/Qwen3.5-27B` | Path to model weights | -| `PORT` | `7080` | HTTP port | -| `MAX_MODEL_LEN` | `32768` | Max context length (tokens) | -| `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use | -| `API_KEY` | *(empty = no auth)* | API key for authentication | -| 
`TENSOR_PARALLEL` | `1` | Number of GPUs | +| Variable | Default | Description | +|-------------------|----------------------------------|--------------------------------| +| `MODEL_DIR` | `~/models/Qwen3.5-35B-A3B` | Path to model weights | +| `PORT` | `7080` | HTTP port | +| `MAX_MODEL_LEN` | `32768` | Max context length (tokens) | +| `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use | +| `API_KEY` | *(empty = no auth)* | API key for authentication | +| `TENSOR_PARALLEL` | `2` | Number of GPUs | -### Context Length Tuning - -The default `MAX_MODEL_LEN=32768` is conservative and ensures stable operation for 15 -concurrent users. If you have plenty of VRAM headroom: +### Examples ```bash +# Increase context length MAX_MODEL_LEN=65536 bash 03_start_server.sh -``` -Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require -significantly more GPU memory for KV cache. +# Add API key authentication +API_KEY="your-secret-key" bash 03_start_server.sh -### Adding Authentication - -```bash -API_KEY="your-secret-key-here" bash 03_start_server.sh -``` - -Students then use this key in their `api_key` parameter. 
- -### Multi-GPU Setup - -If you have multiple GPUs: - -```bash -TENSOR_PARALLEL=2 bash 03_start_server.sh +# Use all 4 GPUs (more KV cache headroom) +TENSOR_PARALLEL=4 bash 03_start_server.sh ``` --- @@ -195,7 +170,7 @@ TENSOR_PARALLEL=2 bash 03_start_server.sh bash 04_start_server_background.sh # Check if running -curl -s http://localhost:7080/v1/models | python -m json.tool +curl -s http://localhost:7080/v1/models | python3 -m json.tool # View logs tail -f logs/vllm_server_*.log @@ -205,61 +180,53 @@ bash 05_stop_server.sh # Monitor GPU usage watch -n 2 nvidia-smi -``` -### Running Persistently with tmux - -For a robust setup that survives SSH disconnects: - -```bash -ssh herzogfloria@silicon.fhgr.ch -tmux new -s llm_server -bash 03_start_server.sh -# Press Ctrl+B, then D to detach - -# Reconnect later: -tmux attach -t llm_server +# Reconnect to tmux session +tmux attach -t llm ``` --- ## Files Overview -| File | Purpose | -|------------------------------|------------------------------------------- | -| `vllm_qwen.def` | Apptainer container definition | -| `01_download_model.sh` | Downloads model weights from Hugging Face | -| `02_build_container.sh` | Builds the Apptainer .sif image | -| `03_start_server.sh` | Starts vLLM server (foreground) | -| `04_start_server_background.sh` | Starts server in background with logging| -| `05_stop_server.sh` | Stops the background server | -| `test_server.py` | Tests the running server | -| `STUDENT_GUIDE.md` | Instructions for students | +| File | Purpose | +|----------------------------------|------------------------------------------------------| +| `vllm_qwen.def` | Apptainer container definition (vLLM nightly + deps) | +| `01_build_container.sh` | Builds the Apptainer `.sif` image | +| `02_download_model.sh` | Downloads model weights (runs inside container) | +| `03_start_server.sh` | Starts vLLM server (foreground) | +| `04_start_server_background.sh` | Starts server in background with logging | +| `05_stop_server.sh` | 
Stops the background server | +| `test_server.py` | Tests the running server | +| `STUDENT_GUIDE.md` | Instructions for students | --- ## Troubleshooting ### "CUDA out of memory" -- Reduce `MAX_MODEL_LEN` (e.g., 16384) -- Reduce `GPU_MEM_UTIL` (e.g., 0.85) -- Use a quantized model variant +- Reduce `MAX_MODEL_LEN` (e.g., `16384`) +- Reduce `GPU_MEM_UTIL` (e.g., `0.85`) ### Container build fails -- Ensure you have internet access and sufficient disk space (~20 GB for build cache) -- Try: `apptainer pull docker://vllm/vllm-openai:latest` first +- Ensure internet access and sufficient disk space (~20 GB for build cache) +- Try pulling manually first: `apptainer pull docker://vllm/vllm-openai:latest` ### "No NVIDIA GPU detected" -- Check that `nvidia-smi` works outside the container -- Ensure `--nv` flag is passed (already in scripts) -- Verify nvidia-container-cli: `apptainer exec --nv vllm_qwen.sif nvidia-smi` +- Verify `nvidia-smi` works on the host +- Ensure `--nv` flag is present (already in scripts) +- Test: `apptainer exec --nv vllm_qwen.sif nvidia-smi` -### Server starts but students can't connect -- Check firewall: `sudo ufw allow 7080:7090/tcp` or equivalent +### "Model type qwen3_5_moe not recognized" +- The container needs vLLM nightly and latest transformers +- Rebuild the container: `rm vllm_qwen.sif && bash 01_build_container.sh` + +### Students can't connect +- Check firewall: ports 7080-7090 must be open - Verify the server binds to `0.0.0.0` (not just localhost) -- Students must use the server's hostname/IP, not `localhost` +- Students must be on the university network or VPN ### Slow generation with many users -- This is expected — vLLM batches requests but throughput is finite -- Consider reducing `max_tokens` in student requests -- Monitor with: `curl http://localhost:7080/metrics` +- Expected — vLLM batches requests but throughput is finite +- The MoE architecture (3B active) helps with per-token speed +- Monitor: `curl 
http://localhost:7080/metrics` diff --git a/STUDENT_GUIDE.md b/STUDENT_GUIDE.md index 2f333e4..5de5480 100644 --- a/STUDENT_GUIDE.md +++ b/STUDENT_GUIDE.md @@ -107,6 +107,55 @@ curl http://silicon.fhgr.ch:7080/v1/chat/completions \ --- +## Streamlit Chat & File Editor App + +A simple web UI is included for chatting with the model and editing files. + +### Setup + +```bash +pip install streamlit openai +``` + +### Run + +```bash +streamlit run app.py +``` + +This opens a browser with two tabs: + +- **Chat** — Conversational interface with streaming responses. You can save + the model's last response directly to a file. +- **File Editor** — Create and edit `.py`, `.tex`, `.html`, or any text file. + Use the "Generate with LLM" button to have the model modify your file based + on an instruction (e.g. "add error handling" or "fix the LaTeX formatting"). + +Files are stored in a `workspace/` folder next to `app.py`. + +> **Tip**: The app runs on your local machine and connects to the server — you +> don't need to install anything on the GPU server. + +--- + +## Thinking Mode + +By default, the model "thinks" before answering (internal chain-of-thought). +This is great for complex reasoning but adds latency for simple questions. + +To disable thinking and get faster direct responses, add this to your API call: + +```python +response = client.chat.completions.create( + model="qwen3.5-35b-a3b", + messages=[...], + max_tokens=1024, + extra_body={"chat_template_kwargs": {"enable_thinking": False}}, +) +``` + +--- + ## Troubleshooting | Issue | Solution | diff --git a/app.py b/app.py new file mode 100644 index 0000000..9e0ea98 --- /dev/null +++ b/app.py @@ -0,0 +1,181 @@ +""" +Streamlit Chat & File Editor for Qwen3.5-35B-A3B + +A minimal interface to: + 1. Chat with the local LLM (OpenAI-compatible API) + 2. 
Edit, save, and generate code / LaTeX files + +Usage: + pip install streamlit openai + streamlit run app.py +""" + +import re +import streamlit as st +from openai import OpenAI +from pathlib import Path + +# --------------------------------------------------------------------------- +# Configuration +# --------------------------------------------------------------------------- +API_BASE = st.sidebar.text_input("API Base URL", "http://silicon.fhgr.ch:7080/v1") +API_KEY = st.sidebar.text_input("API Key", "EMPTY", type="password") +MODEL = "qwen3.5-35b-a3b" +WORKSPACE = Path("workspace") +WORKSPACE.mkdir(exist_ok=True) + +client = OpenAI(base_url=API_BASE, api_key=API_KEY) + +LANG_MAP = { + ".py": "python", ".tex": "latex", ".js": "javascript", + ".html": "html", ".css": "css", ".sh": "bash", + ".json": "json", ".yaml": "yaml", ".yml": "yaml", +} + + +def extract_code(text: str, lang: str = "") -> str: + """Extract the first fenced code block from markdown text. + Falls back to the full text if no code block is found.""" + pattern = r"```(?:\w*)\n(.*?)```" + match = re.search(pattern, text, re.DOTALL) + if match: + return match.group(1).strip() + return text.strip() + + +# --------------------------------------------------------------------------- +# Sidebar — File Manager +# --------------------------------------------------------------------------- +st.sidebar.markdown("---") +st.sidebar.header("File Manager") + +new_filename = st.sidebar.text_input("New file name", placeholder="main.tex") +if st.sidebar.button("Create File") and new_filename: + (WORKSPACE / new_filename).touch() + st.sidebar.success(f"Created {new_filename}") + st.rerun() + +files = sorted(WORKSPACE.iterdir()) if WORKSPACE.exists() else [] +file_names = [f.name for f in files if f.is_file()] +selected_file = st.sidebar.selectbox("Open file", file_names if file_names else ["(no files)"]) + +# --------------------------------------------------------------------------- +# Main Layout — Two Tabs +# 
--------------------------------------------------------------------------- +tab_chat, tab_editor = st.tabs(["Chat", "File Editor"]) + +# --------------------------------------------------------------------------- +# Tab 1: Chat +# --------------------------------------------------------------------------- +with tab_chat: + st.header("Chat with Qwen3.5") + + if "messages" not in st.session_state: + st.session_state.messages = [] + + for msg in st.session_state.messages: + with st.chat_message(msg["role"]): + st.markdown(msg["content"]) + + if prompt := st.chat_input("Ask anything..."): + st.session_state.messages.append({"role": "user", "content": prompt}) + with st.chat_message("user"): + st.markdown(prompt) + + with st.chat_message("assistant"): + placeholder = st.empty() + full_response = "" + + stream = client.chat.completions.create( + model=MODEL, + messages=st.session_state.messages, + max_tokens=8092, + temperature=0.2, + stream=True, + extra_body={"chat_template_kwargs": {"enable_thinking": True}}, + ) + for chunk in stream: + delta = chunk.choices[0].delta.content or "" + full_response += delta + placeholder.markdown(full_response + "▌") + placeholder.markdown(full_response) + + st.session_state.messages.append({"role": "assistant", "content": full_response}) + + if st.session_state.messages: + col_clear, col_save = st.columns([1, 3]) + with col_clear: + if st.button("Clear Chat"): + st.session_state.messages = [] + st.rerun() + with col_save: + if selected_file and selected_file != "(no files)": + if st.button(f"Save code → {selected_file}"): + last = st.session_state.messages[-1]["content"] + suffix = Path(selected_file).suffix + lang = LANG_MAP.get(suffix, "") + code = extract_code(last, lang) + (WORKSPACE / selected_file).write_text(code) + st.success(f"Extracted code saved to workspace/{selected_file}") + +# --------------------------------------------------------------------------- +# Tab 2: File Editor +# 
--------------------------------------------------------------------------- +with tab_editor: + st.header("File Editor") + + if selected_file and selected_file != "(no files)": + file_path = WORKSPACE / selected_file + content = file_path.read_text() if file_path.exists() else "" + suffix = file_path.suffix + lang = LANG_MAP.get(suffix, "text") + + st.code(content, language=lang if lang != "text" else None, line_numbers=True) + + edited = st.text_area( + "Edit below:", + value=content, + height=400, + key=f"editor_{selected_file}_{hash(content)}", + ) + + col_save, col_gen = st.columns(2) + + with col_save: + if st.button("Save File"): + file_path.write_text(edited) + st.success(f"Saved {selected_file}") + st.rerun() + + with col_gen: + gen_prompt = st.text_input( + "Generation instruction", + placeholder="e.g. Add error handling / Fix the LaTeX formatting", + key="gen_prompt", + ) + if st.button("Generate with LLM") and gen_prompt: + with st.spinner("Generating..."): + response = client.chat.completions.create( + model=MODEL, + messages=[ + {"role": "system", "content": ( + f"You are a coding assistant. The user has a {lang} file. " + "Return ONLY the raw file content inside a single code block. " + "No explanations, no comments about changes." 
+                            )},
+                            {"role": "user", "content": (
+                                f"Here is my {lang} file:\n\n```\n{edited}\n```\n\n"
+                                f"Instruction: {gen_prompt}"
+                            )},
+                        ],
+                        max_tokens=16384,
+                        temperature=0.6,
+                        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
+                    )
+                    result = response.choices[0].message.content
+                    code = extract_code(result, lang)
+                    file_path.write_text(code)
+                    st.success("File updated by LLM")
+                    st.rerun()
+    else:
+        st.info("Create a file in the sidebar to start editing.")
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..d218a70
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,2 @@
+streamlit
+openai
diff --git a/test_server.py b/test_server.py
index 8429080..ff88635 100644
--- a/test_server.py
+++ b/test_server.py
@@ -38,9 +38,9 @@ def main():
     response = client.chat.completions.create(
         model=model,
         messages=[
-            {"role": "user", "content": "What is 2 + 2? Answer in one sentence."}
+            {"role": "user", "content": "Create a LaTeX document that derives and explains principal component analysis (PCA). Make it a self-contained document with an introduction, the derivation, and examples of applications. This is for a computer science undergraduate class."}
         ],
-        max_tokens=256,
+        max_tokens=16384,
         temperature=0.7,
     )
     print(f"   Response: {response.choices[0].message.content}")
@@ -53,7 +53,7 @@ def main():
         messages=[
             {"role": "user", "content": "Count from 1 to 5."}
         ],
-        max_tokens=128,
+        max_tokens=16384,
         temperature=0.7,
         stream=True,
     )
diff --git a/vllm_qwen.def b/vllm_qwen.def
index 92f9777..4e97b76 100644
--- a/vllm_qwen.def
+++ b/vllm_qwen.def
@@ -1,10 +1,10 @@
 Bootstrap: docker
-From: vllm/vllm-openai:latest
+From: vllm/vllm-openai:nightly
 
 %labels
     Author herzogfloria
     Description vLLM nightly inference server for Qwen3.5-35B-A3B
-    Version 2.0
+    Version 3.0
 
 %environment
     export HF_HOME=/tmp/hf_cache
 
@@ -12,7 +12,6 @@ From: vllm/vllm-openai:latest
 %post
     apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*
-    pip install --no-cache-dir vllm --extra-index-url https://wheels.vllm.ai/nightly
     pip install --no-cache-dir "transformers @ git+https://github.com/huggingface/transformers.git@main"
     pip install --no-cache-dir huggingface_hub[cli]