# Student Guide — Qwen3.5-35B-A3B Inference Server

## Overview

A **Qwen3.5-35B-A3B** language model is running on our GPU server. It is a
Mixture-of-Experts model: of its 35B total parameters, only about 3B are active
per token, so it answers quickly while drawing on the capacity of a larger model.

There are **three ways** to interact with the model:

1. **Open WebUI** — ChatGPT-like interface in your browser (easiest)
2. **Streamlit App** — Local app with chat, file editor, and code execution
3. **Python SDK / curl** — Programmatic access via the OpenAI-compatible API

> **Note**: You must be on the university network or VPN to reach the server.

## Connection Details

| Parameter        | Value                                       |
|------------------|---------------------------------------------|
| **Open WebUI**   | `http://silicon.fhgr.ch:7081`               |
| **API Base URL** | `http://silicon.fhgr.ch:7080/v1`            |
| **Model**        | `qwen3.5-35b-a3b`                           |
| **API Key**      | *(ask your instructor — may be `EMPTY`)*    |

---

## Option 1: Open WebUI (Recommended)

The easiest way to chat with the model — no installation required.

### Getting Started

1. Make sure you are connected to the **university network** (or VPN).
2. Open your browser and go to **http://silicon.fhgr.ch:7081**
3. Click **"Sign Up"** to create a new account:
   - Enter your **name** (e.g. your first and last name)
   - Enter your **email** (use your university email)
   - Choose a **password**
   - Click **"Create Account"**
4. After signing up you are logged in automatically.
5. Select the model **qwen3.5-35b-a3b** from the model dropdown at the top.
6. Type a message and press Enter — you're chatting with the LLM.

### Returning Later

- Go to **http://silicon.fhgr.ch:7081** and click **"Sign In"**.
- Enter the email and password you used during sign-up.
- All your previous chats are still there.

### Features

- **Chat history** — all conversations are saved on the server and persist across sessions
- **Markdown rendering** with syntax-highlighted code blocks
- **Model selector** — auto-discovers available models from the server
- **Conversation branching** — edit previous messages and explore alternative responses
- **File upload** — attach files to your messages for the model to analyze
- **Search** — search across all your past conversations

### Tips

- Your account and chat history are stored on the server. You can log in
  from any device on the university network.
- If you forget your password, ask your instructor to reset it via the
  Admin Panel.
- The model works best when you provide clear, specific instructions.
- For code tasks, mention the programming language explicitly (e.g.
  "Write a Python function that...").
- Long conversations use more context. Start a **New Chat** (top-left
  button) when switching topics to get faster, more focused responses.

---

## Option 2: Streamlit App (Chat + File Editor)

A local app with chat, file editing, and Python/LaTeX execution.
See the [Streamlit section below](#streamlit-chat--file-editor-app) for setup.

---

## Option 3: Python SDK / curl

For programmatic access and scripting.

### Quick Start with Python

#### 1. Install the OpenAI SDK

```bash
pip install openai
```

#### 2. Simple Chat

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)
```

#### 3. Streaming Responses

```python
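# Reuses the `client` created in the Simple Chat example above.
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

# Print each text fragment as soon as it arrives.
for chunk in stream:
    # Some chunks carry no choices or no text (e.g. role or usage chunks).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()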
```
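
If you need the complete reply as a string after streaming (for example to save it to a file), the fragments can be accumulated while they are printed. The following is a minimal sketch: `collect_stream` is a hypothetical helper name, not part of the OpenAI SDK, and it assumes chunks shaped like those in the loop above.

```python
def collect_stream(stream, echo=True):
    """Print streamed fragments as they arrive and return the full text.

    `collect_stream` is an illustrative helper, not an SDK function.
    `stream` is any iterable of chat-completion chunks.
    """
    parts = []
    for chunk in stream:
        # Skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
            if echo:
                print(text, end="", flush=True)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(parts)
```

Pass it the `stream` object returned by `client.chat.completions.create(..., stream=True)`, for example `full_text = collect_stream(stream)`.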

---

### Quick Start with curl

```bash
# Replace EMPTY if your instructor set an API key.
curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```
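
The same request can also be sent from Python without installing the SDK. The sketch below uses only the standard library and mirrors the curl call. `build_chat_request` and `send` are hypothetical helper names, and the `Authorization` header only matters if your instructor set an API key.

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"


def build_chat_request(prompt, api_key="EMPTY", max_tokens=256, temperature=0.7):
    """Build (but do not send) a POST request equivalent to the curl call."""
    payload = {
        "model": "qwen3.5-35b-a3b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Usage: `print(send(build_chat_request("What is the capital of Switzerland?")))`. Building and sending are kept separate so the payload can be inspected before anything goes over the network.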

---

## Recommended Parameters

| Parameter       | Recommended | Notes                                        |
|-----------------|-------------|----------------------------------------------|
| `temperature`   | 0.7         | Lower = more deterministic, higher = more creative |
| `max_tokens`    | 1024–4096   | Increase for long-form output                |
| `top_p`         | 0.95        | Nucleus sampling                             |
| `stream`        | `true`      | Better UX for interactive use                |
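
One way to keep these settings consistent across scripts is to collect them in a small helper. This is only a sketch: `RECOMMENDED` and `chat_kwargs` are hypothetical names, and the values simply restate the table above (with `max_tokens` at the lower end of the range).

```python
# Defaults taken from the table above; names are illustrative.
RECOMMENDED = {
    "temperature": 0.7,
    "max_tokens": 1024,
    "top_p": 0.95,
    "stream": True,
}


def chat_kwargs(prompt, **overrides):
    """Return keyword arguments for client.chat.completions.create()."""
    # Reject misspelled parameter names early instead of sending them.
    unknown = set(overrides) - set(RECOMMENDED)
    if unknown:
        raise ValueError(f"unsupported parameter(s): {sorted(unknown)}")
    return {
        "model": "qwen3.5-35b-a3b",
        "messages": [{"role": "user", "content": prompt}],
        **RECOMMENDED,
        **overrides,  # explicit arguments win over the defaults
    }
```

Usage: `client.chat.completions.create(**chat_kwargs("Summarize this text ...", max_tokens=4096))`. Because `stream` defaults to `True` here, iterate over the result as in the streaming example, or pass `stream=False` to get a single response object.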

---

## Tips & Etiquette

- **Be mindful of context length**: Avoid excessively long prompts (>8K tokens) unless necessary.
- **Use streaming**: Text appears as it is generated, which reduces perceived latency.
- **Don't spam requests**: The server is shared among ~15 students.
- **Check the model name**: Always use `qwen3.5-35b-a3b` as the model parameter.
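
The first and third tips can be approximated in code. The sketch below is an illustration under stated assumptions: the 4-characters-per-token ratio is a rough rule of thumb rather than the model's real tokenizer, and `estimate_tokens`, `is_too_long`, and `call_with_backoff` are hypothetical helper names.

```python
import time

CHARS_PER_TOKEN = 4  # rough rule of thumb, NOT the model's real tokenizer


def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def is_too_long(text, limit=8000):
    """True if the prompt is probably above the suggested ~8K-token limit."""
    return estimate_tokens(text) > limit


def call_with_backoff(func, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call func(); after a failure wait 2 s, 4 s, 8 s, ... before retrying."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            # Give up after the last allowed attempt.
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Usage: wrap a request in a lambda, for example `call_with_backoff(lambda: client.chat.completions.create(...))`. Waiting longer after each failure keeps a busy shared server from being hit with immediate repeats. In a real script you would likely catch only connection or timeout errors instead of every exception.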

---

## Streamlit Chat & File Editor App

A web UI is included for chatting with the model and editing files. It runs
on your own machine and connects to the GPU server.

### Setup

```bash
# Clone the repository
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git
cd LLM_Inferenz_Server_1

# Create a virtual environment and install dependencies
python3 -m venv .venv
source .venv/bin/activate        # macOS / Linux
# .venv\Scripts\activate         # Windows
pip install -r requirements.txt
```

### Run

```bash
streamlit run app.py
```

The app opens at `http://localhost:8501` in your browser.

### Features

**Chat Tab**
- Conversational interface with streaming responses
- "Save code" button extracts code from the LLM response and saves it to a
  workspace file (strips markdown formatting automatically)

**File Editor Tab**
- Create and edit `.py`, `.tex`, `.html`, or any text file
- Syntax-highlighted preview of file content
- "Generate with LLM" button: describe a change in natural language and the
  model rewrites the file (e.g. "add error handling", "fix the LaTeX formatting",
  "translate comments to German")

**Sidebar Controls**
- **Connection**: API Base URL and API Key
- **LLM Parameters**: Adjustable for each request

| Parameter | Default | What it does |
|-----------|---------|--------------|
| Thinking Mode | Off | Toggle chain-of-thought reasoning (better for complex tasks, slower) |
| Temperature | 0.7 | Lower = predictable, higher = creative |
| Max Tokens | 4096 | Maximum response length |
| Top P | 0.95 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | Encourage diverse topics |

- **File Manager**: Create new files and switch between them

All generated files are stored in a `workspace/` folder next to `app.py`.

> **Tip**: The app runs entirely on your local machine. Only the LLM requests
> go to the server — your files stay local.

---

## Thinking Mode

When you call the API directly, the model "thinks" before answering (internal
chain-of-thought) unless you turn it off; the Streamlit app ships with the
toggle off. Thinking helps with complex reasoning but adds latency to simple
questions.

To disable thinking and get faster direct responses, add this to your API call:

```python
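# Reuses the `client` created in the Simple Chat example.
response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],  # example prompt
    max_tokens=1024,
    # Passed through to the model's chat template to switch thinking off.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)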
```
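
When thinking is left on, the reasoning may show up inside the reply text. How it is delivered depends on the server configuration, which this guide does not specify. The sketch below assumes the reasoning arrives inline, wrapped in `<think>...</think>` tags. `split_thinking` is a hypothetical helper name. If the server already separates the reasoning for you, this step is unnecessary.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think> inside the reply.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) from a reply that may contain <think> tags."""
    reasoning = "\n".join(m.strip() for m in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return reasoning, answer
```

Usage: `reasoning, answer = split_thinking(response.choices[0].message.content)`. A reply without any tags comes back unchanged as the answer, with an empty reasoning string.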

---

## Troubleshooting

| Issue                       | Solution                                            |
|-----------------------------|-----------------------------------------------------|
| Connection refused          | Check you're on the university network / VPN        |
| Model not found             | Use model name `qwen3.5-35b-a3b` exactly            |
| Slow responses              | The model is shared — peak times may be slower      |
| `401 Unauthorized`          | Ask your instructor for the API key                 |
| Response cut off            | Increase `max_tokens` in your request               |
| Open WebUI login fails      | Make sure you created an account first (Sign Up)    |
| Open WebUI shows no models  | The vLLM server may still be loading — wait a few minutes |
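
Several rows of this table can be checked with one request to the API's model list. The following is a sketch using only the standard library. `check_server` is a hypothetical helper name, and it assumes the server exposes the usual OpenAI-compatible `GET /v1/models` endpoint.

```python
import json
import urllib.error
import urllib.request


def check_server(base_url="http://silicon.fhgr.ch:7080/v1", api_key="EMPTY",
                 timeout=5, opener=urllib.request.urlopen):
    """Query GET {base_url}/models and return a short diagnosis string."""
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with opener(request, timeout=timeout) as resp:
            models = [m["id"] for m in json.load(resp)["data"]]
    except urllib.error.HTTPError as err:
        # The server was reached but rejected the request.
        if err.code == 401:
            return "401 Unauthorized: ask your instructor for the API key"
        return f"server answered with HTTP {err.code}"
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return "cannot reach server: check university network / VPN"
    if "qwen3.5-35b-a3b" not in models:
        return f"model not listed (available: {models})"
    return "ok"
```

Usage: `print(check_server())`. The `opener` parameter exists only so the function can be exercised without network access.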