# Student Guide — Qwen3.5 Inference Server

## Overview

A **Qwen3.5** large language model is running on our GPU server. Two models
may be available at different times (your instructor will let you know which
one is active):

| Model | Params | Best for |
|-------|--------|----------|
| `qwen3.5-35b-a3b` | 35B (3B active) | Fast responses, everyday tasks |
| `qwen3.5-122b-a10b-fp8` | 122B (10B active) | Complex reasoning, coding, research |

There are **three ways** to interact with the model:

1. **Open WebUI** — ChatGPT-like interface in your browser (easiest)
2. **Streamlit App** — Local app with chat, file editor, and code execution
3. **Python SDK / curl** — Programmatic access via the OpenAI-compatible API

> **Note**: You must be on the FHGR network or VPN to reach the server.

## Connection Details

| Parameter | Value |
|------------------|---------------------------------------------|
| **Open WebUI** | `http://silicon.fhgr.ch:7081` |
| **API Base URL** | `http://silicon.fhgr.ch:7080/v1` |
| **Model** | *(check the Open WebUI model selector or ask your instructor)* |
| **API Key** | *(ask your instructor — may be `EMPTY`)* |

> **Tip**: In Open WebUI, the model dropdown at the top automatically shows
> whichever model is currently running. For the API, use
> `curl http://silicon.fhgr.ch:7080/v1/models` to check.
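
That check can also be scripted. The sketch below uses only the Python standard library; the `model_ids` and `list_models` helper names and the sample payload are illustrative, not part of the server or any SDK:

```python
import json
import urllib.request

def model_ids(payload):
    """Extract model names from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

def list_models(base_url="http://silicon.fhgr.ch:7080/v1"):
    """Ask the server which model(s) are currently loaded."""
    with urllib.request.urlopen(base_url + "/models") as resp:
        return model_ids(json.load(resp))

# The /v1/models endpoint returns a payload shaped roughly like this:
sample = {"object": "list", "data": [{"id": "qwen3.5-35b-a3b", "object": "model"}]}
print(model_ids(sample))  # ['qwen3.5-35b-a3b']
```

Calling `list_models()` from a machine on the university network returns the name to pass as the `model` parameter in API requests.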
---

## Option 1: Open WebUI (Recommended)

The easiest way to chat with the model — no installation required.

### Getting Started

1. Make sure you are connected to the **university network** (or VPN).
2. Open your browser and go to **http://silicon.fhgr.ch:7081**.
3. Click **"Sign Up"** to create a new account:
   - Enter your **name** (e.g. your first and last name)
   - Enter your **email** (use your university email)
   - Choose a **password**
   - Click **"Create Account"**
4. After signing up you are logged in automatically.
5. Select the currently active model (e.g. **qwen3.5-35b-a3b**) from the model dropdown at the top.
6. Type a message and press Enter — you're chatting with the LLM.
### Returning Later

- Go to **http://silicon.fhgr.ch:7081** and click **"Sign In"**.
- Enter the email and password you used during sign-up.
- All your previous chats are still there.
### Features

- **Chat history** — all conversations are saved on the server and persist across sessions
- **Markdown rendering** with syntax-highlighted code blocks
- **Model selector** — auto-discovers available models from the server
- **Conversation branching** — edit previous messages and explore alternative responses
- **File upload** — attach files to your messages for the model to analyze
- **Search** — search across all your past conversations
### Tips

- Your account and chat history are stored on the server, so you can log in
  from any device on the university network.
- If you forget your password, ask your instructor to reset it via the
  Admin Panel.
- The model works best when you provide clear, specific instructions.
- For code tasks, mention the programming language explicitly (e.g.
  "Write a Python function that...").
- Long conversations use more context. Start a **New Chat** (top-left
  button) when switching topics to get faster, more focused responses.

---

## Option 2: Streamlit App (Chat + File Editor)

A local app with chat, file editing, and Python/LaTeX execution.
See the [Streamlit section below](#streamlit-chat--file-editor-app) for setup.

---

## Option 3: Python SDK / curl

For programmatic access and scripting.

### Quick Start with Python

#### 1. Install the OpenAI SDK

```bash
pip install openai
```

#### 2. Simple Chat

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)
```

#### 3. Streaming Responses

```python
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```
---

### Quick Start with curl

```bash
curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```
---

## Recommended Parameters

| Parameter | Recommended | Notes |
|-----------------|-------------|-----------------------------------------------|
| `temperature` | 0.7 | Lower = more deterministic, higher = more creative |
| `max_tokens` | 1024–4096 | Increase for long-form output |
| `top_p` | 0.95 | Nucleus sampling threshold |
| `stream` | `true` | Better UX for interactive use |
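
As a sketch, these defaults can live in a small dict that gets merged into each request. The `RECOMMENDED` and `request_kwargs` names are illustrative, not part of the OpenAI SDK:

```python
# Defaults mirroring the recommended values above; pass the result as
# **kwargs to client.chat.completions.create(...).
RECOMMENDED = {
    "temperature": 0.7,  # lower = more deterministic
    "max_tokens": 1024,  # raise toward 4096 for long-form output
    "top_p": 0.95,       # nucleus sampling threshold
}

def request_kwargs(**overrides):
    """Merge per-call overrides onto the recommended defaults."""
    return {**RECOMMENDED, **overrides}

print(request_kwargs(temperature=0.2))
```

Usage would look like `client.chat.completions.create(model=..., messages=..., **request_kwargs(stream=True))`.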
---

## Tips & Etiquette

- **Be mindful of context length**: Avoid excessively long prompts (>8K tokens) unless necessary.
- **Use streaming**: Responses feel faster because tokens appear as they are generated.
- **Don't spam requests**: The server is shared among ~15 students.
- **Check the model name**: Use the exact name of the currently active model (e.g. `qwen3.5-35b-a3b`); run `curl http://silicon.fhgr.ch:7080/v1/models` to check.
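
One way to be polite to the shared server is to back off on failure instead of retrying immediately. A minimal sketch, where `with_retry` is an illustrative helper and not part of any SDK:

```python
import random
import time

def with_retry(call, attempts=3, base_delay=1.0):
    """Call `call()`, retrying on error with exponential backoff plus a
    little jitter so ~15 students don't all retry at the same instant."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt + random.random() * 0.1)

# Usage: with_retry(lambda: client.chat.completions.create(...))
```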
---

## Streamlit Chat & File Editor App

A web UI is included for chatting with the model and editing files. It runs
on your own machine and connects to the GPU server.

### Setup

```bash
# Clone the repository
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git
cd LLM_Inferenz_Server_1

# Create a virtual environment and install dependencies
python3 -m venv .venv
source .venv/bin/activate   # macOS / Linux
# .venv\Scripts\activate    # Windows
pip install -r requirements.txt
```

### Run

```bash
streamlit run app.py
```

Opens at `http://localhost:8501` in your browser.

### Features

**Chat Tab**
- Conversational interface with streaming responses
- "Save code" button extracts code from the LLM response and saves it to a
  workspace file (strips Markdown formatting automatically)
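
That kind of extraction can be sketched in a few lines. The `extract_code` function below is an illustrative stand-in, not the app's actual implementation:

```python
import re

FENCE = "`" * 3  # triple backtick, built here to avoid nesting fences in this guide

def extract_code(reply):
    """Return the body of the first fenced code block in an LLM reply,
    dropping the fences and the optional language tag; if there is no
    fenced block, return the reply unchanged."""
    pattern = FENCE + r"[\w+-]*\n(.*?)" + FENCE
    match = re.search(pattern, reply, re.DOTALL)
    return match.group(1).rstrip() if match else reply.strip()

reply = "Here you go:\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nDone."
print(extract_code(reply))  # print('hi')
```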

**File Editor Tab**
- Create and edit `.py`, `.tex`, `.html`, or any text file
- Syntax-highlighted preview of file content
- "Generate with LLM" button: describe a change in natural language and the
  model rewrites the file (e.g. "add error handling", "fix the LaTeX formatting",
  "translate comments to German")

**Sidebar Controls**
- **Connection**: API Base URL and API Key
- **LLM Parameters**: adjustable for each request

| Parameter | Default | What it does |
|-----------|---------|--------------|
| Thinking Mode | Off | Toggle chain-of-thought reasoning (better for complex tasks, slower) |
| Temperature | 0.7 | Lower = predictable, higher = creative |
| Max Tokens | 4096 | Maximum response length |
| Top P | 0.95 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | Encourages diverse topics |

- **File Manager**: create new files and switch between them

All generated files are stored in a `workspace/` folder next to `app.py`.

> **Tip**: The app runs entirely on your local machine. Only the LLM requests
> go to the server — your files stay local.
---

## Thinking Mode

By default, the model "thinks" before answering (internal chain-of-thought).
This is great for complex reasoning but adds latency for simple questions.

To disable thinking and get faster direct responses, add this to your API call:

```python
response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[...],
    max_tokens=1024,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```
---

## Troubleshooting

| Issue | Solution |
|-----------------------------|------------------------------------------------------------|
| Connection refused | Check you're on the university network / VPN |
| Model not found | Use the active model's exact name (e.g. `qwen3.5-35b-a3b`); check `/v1/models` |
| Slow responses | The model is shared — peak times may be slower |
| `401 Unauthorized` | Ask your instructor for the API key |
| Response cut off | Increase `max_tokens` in your request |
| Open WebUI login fails | Make sure you created an account first (Sign Up) |
| Open WebUI shows no models | The vLLM server may still be loading — wait a few minutes |
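
A cut-off response can also be detected programmatically: OpenAI-compatible servers report `finish_reason == "length"` when the reply hit the `max_tokens` limit. A small illustrative helper (the `was_truncated` name is not from any SDK):

```python
def was_truncated(finish_reason):
    """True if generation stopped because max_tokens ran out ("length")
    rather than finishing naturally ("stop")."""
    return finish_reason == "length"

# After a request with the OpenAI SDK:
#   if was_truncated(response.choices[0].finish_reason):
#       ...retry with a larger max_tokens...
print(was_truncated("length"), was_truncated("stop"))  # True False
```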