# Student Guide: Qwen3.5-35B-A3B Inference Server

## Overview

A **Qwen3.5-35B-A3B** language model is running on our GPU server. It's a
Mixture-of-Experts model (35B total parameters, 3B active per token), providing
fast and high-quality responses. You can interact with it using the
**OpenAI-compatible API**.

## Connection Details

| Parameter    | Value                                   |
|--------------|-----------------------------------------|
| **Base URL** | `http://silicon.fhgr.ch:7080/v1`        |
| **Model**    | `qwen3.5-35b-a3b`                       |
| **API Key**  | *(ask your instructor; may be `EMPTY`)* |

> **Note**: You must be on the university network or VPN to reach the server.

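
Before writing any chat code, it can help to confirm that the server is reachable. The sketch below uses only the Python standard library. It assumes the server exposes the `/models` listing route that OpenAI-compatible servers commonly provide; this guide does not confirm that route, and the helper names (`models_url`, `model_ids`, `list_models`) are illustrative. The network call is left commented out so the block can be read and run offline.

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"  # from the Connection Details table


def models_url(base_url: str) -> str:
    """Build the model-listing URL (assumed route: <base_url>/models)."""
    return base_url.rstrip("/") + "/models"


def model_ids(payload: dict) -> list[str]:
    """Extract model names from an OpenAI-style listing: {"data": [{"id": ...}]}."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = BASE_URL, api_key: str = "EMPTY", timeout: float = 10.0) -> list[str]:
    """Ask the server which models it serves (needs university network or VPN)."""
    request = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return model_ids(json.load(response))


# Example call (performs a real network request):
# print(list_models())
```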
---

## Quick Start with Python

### 1. Install the OpenAI SDK

```bash
pip install openai
```

### 2. Simple Chat

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)
```

### 3. Streaming Responses

```python
# Reuses the `client` created in step 2.
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    # Skip chunks that carry no choices or no text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

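
If you need the complete text after streaming (for example, to save it to a file), the pieces can be accumulated as they arrive. This is a minimal sketch: `collect_stream` is a hypothetical helper, not part of the OpenAI SDK, and the stand-in chunks only mimic the `choices[0].delta.content` layout used in the streaming loop so that the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks, echo=True):
    """Optionally print streamed text as it arrives and return the full response."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        text = chunk.choices[0].delta.content
        if text:  # content can be None or empty
            if echo:
                print(text, end="", flush=True)
            parts.append(text)
    if echo:
        print()
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the same attribute layout as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hello"), fake_chunk(None), fake_chunk(", world")]
full_text = collect_stream(demo, echo=False)

# With the real API: full_text = collect_stream(stream)
```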
---

## Quick Start with curl

```bash
curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```

---

## Recommended Parameters

| Parameter     | Recommended | Notes                                         |
|---------------|-------------|-----------------------------------------------|
| `temperature` | 0.7         | Lower = more deterministic, higher = creative |
| `max_tokens`  | 1024–4096   | Increase for long-form output                 |
| `top_p`       | 0.95        | Nucleus sampling                              |
| `stream`      | `true`      | Better UX for interactive use                 |

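
One way to apply these values consistently is to keep them in a single dictionary and override only what a given request needs. This is a sketch under that assumption: `RECOMMENDED` and `request_kwargs` are illustrative names, and `max_tokens=1024` is simply the lower end of the recommended range.

```python
# Recommended values from the table in this section.
RECOMMENDED = {
    "model": "qwen3.5-35b-a3b",
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 1024,  # raise towards 4096 for long-form output
    "stream": True,
}


def request_kwargs(messages, **overrides):
    """Merge the recommended defaults with per-request overrides."""
    return {**RECOMMENDED, "messages": messages, **overrides}


kwargs = request_kwargs(
    [{"role": "user", "content": "Summarise backpropagation."}],
    max_tokens=4096,
)

# With a client from the Quick Start:
# stream = client.chat.completions.create(**kwargs)
```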
---

## Tips & Etiquette

- **Be mindful of context length**: Avoid excessively long prompts (>8K tokens) unless necessary.
- **Use streaming**: Output appears as it is generated, which reduces perceived latency.
- **Don't spam requests**: The server is shared among ~15 students.
- **Check the model name**: Always use `qwen3.5-35b-a3b` as the model parameter.

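
To get a feel for whether a prompt is near the 8K-token mark, a character count gives a rough estimate. The sketch below assumes about four characters per token, which is a common rule of thumb for English text and not a property of this model's tokenizer; real counts can differ noticeably, especially for code or non-English text. `estimate_tokens` and `is_long_prompt` are hypothetical helpers.

```python
CHARS_PER_TOKEN = 4        # assumption: rough average for English text
LONG_PROMPT_TOKENS = 8000  # the ">8K tokens" guideline from the tips


def estimate_tokens(messages):
    """Roughly estimate the token count of a chat message list."""
    total_chars = sum(len(m["content"]) for m in messages)
    return total_chars // CHARS_PER_TOKEN


def is_long_prompt(messages, limit=LONG_PROMPT_TOKENS):
    """True when the rough estimate exceeds the guideline."""
    return estimate_tokens(messages) > limit


short_prompt = [{"role": "user", "content": "What is overfitting?"}]
long_prompt = [{"role": "user", "content": "x" * 40_000}]

print(estimate_tokens(short_prompt), is_long_prompt(short_prompt))
print(estimate_tokens(long_prompt), is_long_prompt(long_prompt))
```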
---

## Streamlit Chat & File Editor App

A simple web UI is included for chatting with the model and editing files.

### Setup

```bash
pip install streamlit openai
```

### Run

```bash
streamlit run app.py
```

This opens the app in your browser. It has two tabs:

- **Chat**: Conversational interface with streaming responses. You can save
  the model's last response directly to a file.
- **File Editor**: Create and edit `.py`, `.tex`, `.html`, or any text file.
  Use the "Generate with LLM" button to have the model modify your file based
  on an instruction (e.g. "add error handling" or "fix the LaTeX formatting").

Files are stored in a `workspace/` folder next to `app.py`.

> **Tip**: The app runs on your local machine and connects to the server, so you
> don't need to install anything on the GPU server.

---

## Thinking Mode

By default, the model "thinks" before answering (internal chain-of-thought).
This is great for complex reasoning but adds latency for simple questions.

To disable thinking and get faster, direct responses, pass the `extra_body`
argument shown here in your API call:

```python
response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[...],
    max_tokens=1024,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```

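
If you switch between the two modes often, the toggle can be wrapped in a small helper. This is only a sketch: `thinking_kwargs` is a hypothetical name, and it assumes that leaving out `extra_body` keeps the server default (thinking enabled) described in this section.

```python
def thinking_kwargs(enable_thinking: bool) -> dict:
    """Return extra keyword arguments for chat.completions.create()."""
    if enable_thinking:
        return {}  # assumed server default: the model thinks before answering
    return {"extra_body": {"chat_template_kwargs": {"enable_thinking": False}}}


fast = thinking_kwargs(False)  # direct answers, lower latency
deep = thinking_kwargs(True)   # default behaviour, better for complex reasoning

# With a client from the Quick Start:
# client.chat.completions.create(
#     model="qwen3.5-35b-a3b", messages=your_messages, max_tokens=1024, **fast
# )
```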
---

## Troubleshooting

| Issue              | Solution                                       |
|--------------------|------------------------------------------------|
| Connection refused | Check you're on the university network / VPN   |
| Model not found    | Use model name `qwen3.5-35b-a3b` exactly       |
| Slow responses     | The model is shared; peak times may be slower  |
| `401 Unauthorized` | Ask your instructor for the API key            |
| Response cut off   | Increase `max_tokens` in your request          |

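
Some of these issues can be detected in code. In the OpenAI chat format, a response that stopped at the token limit reports `finish_reason == "length"`, and HTTP status errors raised by the SDK carry a `status_code`. The sketch below uses stand-in objects so it runs without a server. `was_truncated` and `hint_for_status` are hypothetical helpers, the hint texts restate the table, and the 404 mapping assumes the server reports an unknown model name with that status.

```python
from types import SimpleNamespace

STATUS_HINTS = {
    401: "Ask your instructor for the API key.",
    # Assumption: an unknown model name is reported as HTTP 404.
    404: "Use the model name `qwen3.5-35b-a3b` exactly.",
}


def was_truncated(response) -> bool:
    """True if generation stopped because max_tokens was reached."""
    return response.choices[0].finish_reason == "length"


def hint_for_status(status_code) -> str:
    """Map an HTTP status code to the matching hint from the table."""
    return STATUS_HINTS.get(status_code, "No specific hint; see the table.")


# Stand-ins with the same layout as a chat completion response.
cut_off = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
complete = SimpleNamespace(choices=[SimpleNamespace(finish_reason="stop")])

if was_truncated(cut_off):
    print("Response cut off: increase max_tokens and retry.")
print(hint_for_status(401))
```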