# Student Guide — Qwen3.5-35B-A3B Inference Server
## Overview
A **Qwen3.5-35B-A3B** language model is running on our GPU server. It's a
Mixture-of-Experts model (35B total parameters, 3B active per token), providing
fast and high-quality responses. You can interact with it using the
**OpenAI-compatible API**.
## Connection Details
| Parameter | Value |
|------------- |---------------------------------------------|
| **Base URL** | `http://silicon.fhgr.ch:7080/v1` |
| **Model** | `qwen3.5-35b-a3b` |
| **API Key** | *(ask your instructor — may be `EMPTY`)* |
> **Note**: You must be on the university network or VPN to reach the server.
---
## Quick Start with Python
### 1. Install the OpenAI SDK
```bash
pip install openai
```
### 2. Simple Chat
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)
```
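If the request fails, a quick way to confirm that the endpoint is reachable and that you have the exact model name is to list the models the server exposes (vLLM serves the standard `/v1/models` endpoint):
```python
# Sanity check: list the models served at this endpoint.
for model in client.models.list().data:
    print(model.id)  # expected: qwen3.5-35b-a3b
```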
### 3. Streaming Responses
```python
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```
---
## Quick Start with curl
```bash
curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```
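The server returns a standard OpenAI-style JSON response; the generated text is in the `choices[0].message.content` field.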
---
## Recommended Parameters
| Parameter | Recommended | Notes |
|-----------------|-------------|----------------------------------------------|
| `temperature` | 0.7 | Lower = more deterministic, higher = creative |
| `max_tokens`    | 1024–4096   | Increase for long-form output                  |
| `top_p` | 0.95 | Nucleus sampling |
| `stream` | `true` | Better UX for interactive use |
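
Putting the recommendations together, a typical request might look like the sketch below. Treat the values as starting points rather than hard rules:
```python
# Illustrative request combining the recommended defaults.
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[{"role": "user", "content": "Summarize the history of neural networks."}],
    temperature=0.7,   # balanced between deterministic and creative
    top_p=0.95,        # nucleus sampling
    max_tokens=2048,   # within the recommended 1024–4096 range
    stream=True,       # stream tokens as they arrive
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```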
---
## Tips & Etiquette
- **Be mindful of context length**: Avoid excessively long prompts (>8K tokens) unless necessary.
- **Use streaming**: Makes responses feel faster and reduces perceived latency.
- **Don't spam requests**: The server is shared among ~15 students.
- **Check the model name**: Always use `qwen3.5-35b-a3b` as the model parameter.
---
## Streamlit Chat & File Editor App
A simple web UI is included for chatting with the model and editing files.
### Setup
```bash
pip install streamlit openai
```
### Run
```bash
streamlit run app.py
```
This opens a browser with two tabs:
- **Chat** — Conversational interface with streaming responses. You can save
the model's last response directly to a file.
- **File Editor** — Create and edit `.py`, `.tex`, `.html`, or any text file.
Use the "Generate with LLM" button to have the model modify your file based
on an instruction (e.g. "add error handling" or "fix the LaTeX formatting").
Files are stored in a `workspace/` folder next to `app.py`.
> **Tip**: The app runs on your local machine and connects to the server — you
> don't need to install anything on the GPU server.
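
Under the hood, the Chat tab does little more than forward your conversation to the same OpenAI-compatible endpoint shown above. The bundled `app.py` handles this for you, but if you want to build your own variant, the sketch below shows a minimal chat tab. It is illustrative only, assumes the default endpoint and an `EMPTY` API key, and is not the actual app code:
```python
# Minimal Streamlit chat sketch (illustrative, not the bundled app.py).
import streamlit as st
from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

st.title("Qwen Chat")

# Keep the conversation in Streamlit's session state across reruns.
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

# Read new user input and stream the model's reply.
if prompt := st.chat_input("Ask something..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    with st.chat_message("assistant"):
        stream = client.chat.completions.create(
            model="qwen3.5-35b-a3b",
            messages=st.session_state.messages,
            stream=True,
        )
        # st.write_stream renders the tokens live and returns the full text.
        reply = st.write_stream(
            chunk.choices[0].delta.content or "" for chunk in stream
        )
    st.session_state.messages.append({"role": "assistant", "content": reply})
```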
---
## Thinking Mode
By default, the model "thinks" before answering (internal chain-of-thought).
This is great for complex reasoning but adds latency for simple questions.
To disable thinking and get faster direct responses, add this to your API call:
```python
response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[...],
    max_tokens=1024,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```
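`extra_body` simply merges extra fields into the JSON request body; vLLM passes `chat_template_kwargs` on to the model's chat template, which is where the thinking toggle takes effect. You can set it per request, so you can keep thinking enabled for hard reasoning tasks and disable it for quick lookups.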
---
## Troubleshooting
| Issue | Solution |
|-----------------------------|-----------------------------------------------------|
| Connection refused | Check you're on the university network / VPN |
| Model not found | Use model name `qwen3.5-35b-a3b` exactly |
| Slow responses | The model is shared — peak times may be slower |
| `401 Unauthorized` | Ask your instructor for the API key |
| Response cut off | Increase `max_tokens` in your request |