
# Student Guide — Qwen3.5-35B-A3B Inference Server

## Overview

A **Qwen3.5-35B-A3B** language model runs on our GPU server. It is a
Mixture-of-Experts model (35B total parameters, 3B active per token), so it
responds quickly for its size while keeping answer quality high. You can
interact with it through an **OpenAI-compatible API**.

## Connection Details

| Parameter    | Value                                       |
|------------- |---------------------------------------------|
| **Base URL** | `http://silicon.fhgr.ch:7080/v1`            |
| **Model**    | `qwen3.5-35b-a3b`                           |
| **API Key**  | *(ask your instructor — may be `EMPTY`)*    |

> **Note**: You must be on the university network or VPN to reach the server.
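
Before writing any chat code, it can help to confirm that the server is reachable. The sketch below uses only the Python standard library and assumes the server exposes the usual OpenAI-style `/models` endpoint under the base URL, which this guide does not confirm. The helper names `build_models_request` and `list_model_ids` are made up for illustration.

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"  # from the table above


def build_models_request(base_url=BASE_URL, api_key="EMPTY"):
    """Build (but do not send) a GET request for the model list.

    Assumption: the server offers an OpenAI-style `/models` endpoint.
    """
    url = base_url.rstrip("/") + "/models"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )


def list_model_ids(base_url=BASE_URL, api_key="EMPTY", timeout=5):
    """Send the request and return the model ids the server reports."""
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        payload = json.load(reply)
    return [entry["id"] for entry in payload.get("data", [])]
```

Calling `list_model_ids()` from the university network or VPN should return a list that includes `qwen3.5-35b-a3b`. A timeout or connection error at this step usually points to a network problem rather than a problem in your code.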

---

## Quick Start with Python

### 1. Install the OpenAI SDK

```bash
pip install openai
```

### 2. Simple Chat

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)
```

### 3. Streaming Responses

```python
# Reuses the `client` object created in step 2.
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

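for chunk in stream:
    # Some servers send chunks without choices (e.g. a final usage chunk); skip them.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)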
print()
```

---

## Quick Start with curl

```bash
# Replace EMPTY with the API key if your instructor set one.
curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```

---

## Recommended Parameters

| Parameter       | Recommended | Notes                                              |
|-----------------|-------------|----------------------------------------------------|
| `temperature`   | 0.7         | Lower = more deterministic, higher = more creative |
| `max_tokens`    | 1024–4096   | Increase for long-form output                      |
| `top_p`         | 0.95        | Nucleus sampling                                   |
| `stream`        | `true`      | Better UX for interactive use                      |
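
One way to avoid retyping these values is to keep them in a single place. The sketch below is only a convenience pattern, not part of the server setup; `RECOMMENDED` and `build_request` are hypothetical names, and `max_tokens` is set to the lower end of the recommended range.

```python
# Defaults taken from the table above.
RECOMMENDED = {
    "model": "qwen3.5-35b-a3b",
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 1024,
    "stream": True,
}


def build_request(prompt, **overrides):
    """Return keyword arguments for `client.chat.completions.create`.

    Any keyword passed in `overrides` replaces the recommended default.
    """
    params = {**RECOMMENDED, **overrides}
    params["messages"] = [{"role": "user", "content": prompt}]
    return params
```

With the `client` from the Quick Start, a non-streaming call with a larger output budget would look like `client.chat.completions.create(**build_request("Summarize this text.", stream=False, max_tokens=2048))`.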

---

## Tips & Etiquette

- **Be mindful of context length**: Avoid excessively long prompts (>8K tokens) unless necessary.
- **Use streaming**: Tokens appear as they are generated, which reduces perceived latency.
- **Don't spam requests**: The server is shared among ~15 students.
- **Check the model name**: Always use `qwen3.5-35b-a3b` as the model parameter.
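
For longer conversations, the context-length tip can be approximated in code by dropping the oldest turns once the prompt grows too large. The sketch below is a rough illustration: the four-characters-per-token ratio is a common rule of thumb, not the model's real tokenizer, and `estimate_tokens` and `trim_history` are hypothetical helper names.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough size estimate; the real tokenizer will count differently."""
    return max(1, len(text) // chars_per_token)


def trim_history(messages, budget=8000):
    """Drop the oldest non-system messages until the estimate fits the budget.

    System messages are kept and moved to the front; the most recent
    message is never dropped.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(estimate_tokens(m["content"]) for m in msgs)

    while len(rest) > 1 and total(system + rest) > budget:
        rest.pop(0)  # discard the oldest turn first
    return system + rest
```

Passing `trim_history(messages)` instead of `messages` to the API keeps the estimated prompt size near the 8K guideline without changing the rest of your code.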

---

## Troubleshooting

| Issue                       | Solution                                            |
|-----------------------------|-----------------------------------------------------|
| Connection refused          | Check you're on the university network / VPN        |
| Model not found             | Use model name `qwen3.5-35b-a3b` exactly            |
| Slow responses              | The model is shared — peak times may be slower      |
| `401 Unauthorized`          | Ask your instructor for the API key                 |
| Response cut off            | Increase `max_tokens` in your request               |
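
The "Response cut off" case can be detected in code: in the OpenAI chat completions format, a choice that stopped at the token limit reports `finish_reason == "length"`. The sketch below retries once with a doubled `max_tokens`, capped at the upper end of the recommended range. `was_cut_off` and `chat_with_headroom` are hypothetical helpers, and the single retry is a deliberate choice to keep load on the shared server low.

```python
MODEL = "qwen3.5-35b-a3b"


def was_cut_off(response):
    """True if the first choice stopped because it reached max_tokens."""
    return response.choices[0].finish_reason == "length"


def chat_with_headroom(client, messages, max_tokens=1024, limit=4096):
    """Send a chat request; retry once with more tokens if it was truncated.

    `client` is expected to be an `openai.OpenAI` instance (or any object
    with the same `chat.completions.create` interface).
    """
    response = client.chat.completions.create(
        model=MODEL, messages=messages, max_tokens=max_tokens
    )
    if was_cut_off(response) and max_tokens < limit:
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            max_tokens=min(max_tokens * 2, limit),
        )
    return response
```

If the reply is still truncated after the retry, shortening the request or asking for a more concise answer is usually a better fix than raising the limit further.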