# Student Guide — Qwen3.5 Inference Server
## Overview
A **Qwen3.5** large language model is running on our GPU server. Two models
may be available at different times (your instructor will let you know which
one is active):

| Model | Params | Best for |
|-------|--------|----------|
| `qwen3.5-35b-a3b` | 35B (3B active) | Fast responses, everyday tasks |
| `qwen3.5-122b-a10b-fp8` | 122B (10B active) | Complex reasoning, coding, research |
There are **three ways** to interact with the model:
1. **Open WebUI** — ChatGPT-like interface in your browser (easiest)
2. **Streamlit App** — Local app with chat, file editor, and code execution
3. **Python SDK / curl** — Programmatic access via the OpenAI-compatible API
> **Note**: You must be on the fhgr network or VPN to reach the server.
## Connection Details
| Parameter | Value |
|------------------|---------------------------------------------|
| **Open WebUI** | `http://silicon.fhgr.ch:7081` |
| **API Base URL** | `http://silicon.fhgr.ch:7080/v1` |
| **Model** | *(check Open WebUI model selector or ask your instructor)* |
| **API Key** | *(ask your instructor — may be `EMPTY`)* |
> **Tip**: In Open WebUI, the model dropdown at the top automatically shows
> whichever model is currently running. For the API, use
> `curl http://silicon.fhgr.ch:7080/v1/models` to check.
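
If you are scripting against the API, the same model check can be done in Python. The sketch below is illustrative (the helper names are not part of any official tooling here); it parses the standard OpenAI-style `/v1/models` response and returns the first listed model ID:

```python
import json
from urllib.request import urlopen

def active_model_id(models_response: dict) -> str:
    """Return the ID of the first (and normally only) listed model."""
    return models_response["data"][0]["id"]

def fetch_active_model(base_url: str = "http://silicon.fhgr.ch:7080/v1") -> str:
    """Query the server for the currently loaded model (needs network/VPN access)."""
    with urlopen(base_url.rstrip("/") + "/models") as resp:
        return active_model_id(json.load(resp))
```

Pass the returned ID as the `model` parameter in your API calls so your scripts keep working when the instructor switches models.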
---
## Option 1: Open WebUI (Recommended)
The easiest way to chat with the model — no installation required.
### Getting Started
1. Make sure you are connected to the **university network** (or VPN).
2. Open your browser and go to **http://silicon.fhgr.ch:7081**
3. Click **"Sign Up"** to create a new account:
   - Enter your **name** (e.g. your first and last name)
   - Enter your **email** (use your university email)
   - Choose a **password**
   - Click **"Create Account"**
4. After signing up you are logged in automatically.
5. Select the currently active model (e.g. **qwen3.5-35b-a3b** or **qwen3.5-122b-a10b-fp8**) from the model dropdown at the top.
6. Type a message and press Enter — you're chatting with the LLM.
### Returning Later
- Go to **http://silicon.fhgr.ch:7081** and click **"Sign In"**.
- Enter the email and password you used during sign-up.
- All your previous chats are still there.
### Features
- **Chat history** — all conversations are saved on the server and persist across sessions
- **Markdown rendering** with syntax-highlighted code blocks
- **Model selector** — auto-discovers available models from the server
- **Conversation branching** — edit previous messages and explore alternative responses
- **File upload** — attach files to your messages for the model to analyze
- **Search** — search across all your past conversations
### Tips
- Your account and chat history are stored on the server. You can log in
from any device on the university network.
- If you forget your password, ask your instructor to reset it via the
Admin Panel.
- The model works best when you provide clear, specific instructions.
- For code tasks, mention the programming language explicitly (e.g.
"Write a Python function that...").
- Long conversations use more context. Start a **New Chat** (top-left
button) when switching topics to get faster, more focused responses.
---
## Option 2: Streamlit App (Chat + File Editor)
A local app with chat, file editing, and Python/LaTeX execution.
See the [Streamlit section below](#streamlit-chat--file-editor-app) for setup.
---
## Option 3: Python SDK / curl
For programmatic access and scripting.
### Quick Start with Python
#### 1. Install the OpenAI SDK
```bash
pip install openai
```
#### 2. Simple Chat
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
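
Because the server is shared, a request can occasionally fail or time out at peak times. A small retry wrapper keeps scripts robust; this is a sketch of my own (the helper name and backoff values are not part of the guide's tooling):

```python
import time

def with_retry(call, attempts=3, delay=2.0):
    """Run `call()` and retry on any exception, waiting `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the real error
            time.sleep(delay)

# Usage with the client from above:
# answer = with_retry(lambda: client.chat.completions.create(...))
```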
#### 3. Streaming Responses
```python
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```
---
### Quick Start with curl
```bash
curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```
---
## Recommended Parameters
| Parameter | Recommended | Notes |
|-----------------|-------------|----------------------------------------------|
| `temperature` | 0.7 | Lower = more deterministic, higher = creative |
| `max_tokens` | 1024–4096 | Increase for long-form output |
| `top_p` | 0.95 | Nucleus sampling |
| `stream` | `true` | Better UX for interactive use |
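
One way to apply these recommendations consistently is to keep them in a single dict and unpack it into every call. This is a convenience sketch, not an official part of the API:

```python
# Recommended defaults from the table above.
DEFAULTS = {
    "max_tokens": 1024,   # raise toward 4096 for long-form output
    "temperature": 0.7,
    "top_p": 0.95,
}

# Usage with the client from the Python examples:
# response = client.chat.completions.create(
#     model="qwen3.5-35b-a3b",
#     messages=[{"role": "user", "content": "Hello!"}],
#     stream=True,  # recommended for interactive use
#     **DEFAULTS,
# )
```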
---
## Tips & Etiquette
- **Be mindful of context length**: Avoid excessively long prompts (>8K tokens) unless necessary.
- **Use streaming**: Makes responses feel faster and reduces perceived latency.
- **Don't spam requests**: The server is shared among ~15 students.
- **Check the model name**: Query `http://silicon.fhgr.ch:7080/v1/models` to see which model is currently active, and pass that exact ID as the `model` parameter.
---
## Streamlit Chat & File Editor App
A web UI is included for chatting with the model and editing files. It runs
on your own machine and connects to the GPU server.
### Setup
```bash
# Clone the repository
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git
cd LLM_Inferenz_Server_1
# Create a virtual environment and install dependencies
python3 -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows
pip install -r requirements.txt
```
### Run
```bash
streamlit run app.py
```
Opens at `http://localhost:8501` in your browser.
### Features
**Chat Tab**
- Conversational interface with streaming responses
- "Save code" button extracts code from the LLM response and saves it to a
workspace file (strips markdown formatting automatically)
**File Editor Tab**
- Create and edit `.py`, `.tex`, `.html`, or any text file
- Syntax-highlighted preview of file content
- "Generate with LLM" button: describe a change in natural language and the
model rewrites the file (e.g. "add error handling", "fix the LaTeX formatting",
"translate comments to German")
**Sidebar Controls**
- **Connection**: API Base URL and API Key
- **LLM Parameters**: Adjustable for each request
| Parameter | Default | What it does |
|-----------|---------|--------------|
| Thinking Mode | Off | Toggle chain-of-thought reasoning (better for complex tasks, slower) |
| Temperature | 0.7 | Lower = predictable, higher = creative |
| Max Tokens | 4096 | Maximum response length |
| Top P | 0.95 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | Encourage diverse topics |
- **File Manager**: Create new files and switch between them
All generated files are stored in a `workspace/` folder next to `app.py`.
> **Tip**: The app runs entirely on your local machine. Only the LLM requests
> go to the server — your files stay local.
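
The "Save code" behavior described above, extracting code from a fenced markdown block, can be sketched roughly like this (the actual `app.py` implementation may differ; this is an illustration):

```python
import re

def strip_code_fences(text: str) -> str:
    """Return the contents of the first ```-fenced block, or the text unchanged."""
    match = re.search(r"```[\w+-]*\n(.*?)```", text, re.DOTALL)
    return match.group(1).rstrip("\n") if match else text
```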
---
## Thinking Mode
By default, the model "thinks" before answering (internal chain-of-thought).
This is great for complex reasoning but adds latency for simple questions.
To disable thinking and get faster direct responses, add this to your API call:
```python
response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[...],
    max_tokens=1024,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```
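If you are not using the OpenAI SDK, the same flag goes at the top level of the JSON request body. This mirrors `extra_body` above and assumes a vLLM-style server that honors `chat_template_kwargs`:

```python
import json

# Raw payload for POST /v1/chat/completions (send with requests.post or curl -d).
payload = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "What is the capital of Switzerland?"}],
    "max_tokens": 256,
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload)
```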
---
## Troubleshooting
| Issue | Solution |
|-----------------------------|-----------------------------------------------------|
| Connection refused | Check you're on the university network / VPN |
| Model not found | Check `/v1/models` for the active model ID and use it exactly |
| Slow responses | The model is shared — peak times may be slower |
| `401 Unauthorized` | Ask your instructor for the API key |
| Response cut off | Increase `max_tokens` in your request |
| Open WebUI login fails | Make sure you created an account first (Sign Up) |
| Open WebUI shows no models | The vLLM server may still be loading — wait a few minutes |
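
When something in the table above fails, a quick self-check helps narrow it down. This is a diagnostic sketch (the helper names and the 5-second timeout are my own choices):

```python
from urllib.error import URLError
from urllib.request import urlopen

def models_url(base_url: str) -> str:
    """Build the /models endpoint from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"

def check_server(base_url: str = "http://silicon.fhgr.ch:7080/v1", timeout: float = 5.0) -> str:
    """Return a short status string instead of raising on failure."""
    try:
        with urlopen(models_url(base_url), timeout=timeout) as resp:
            return f"OK (HTTP {resp.status})"
    except URLError as exc:
        return f"UNREACHABLE: {exc.reason} (are you on the fhgr network / VPN?)"

# print(check_server())
```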