LLM_Inferenz_Server_1/STUDENT_GUIDE.md
herzogflorian 9e1e0c0751 Add Streamlit chat app, update container to vLLM nightly
- Add app.py: Streamlit UI with chat and file editor tabs
- Add requirements.txt: streamlit + openai dependencies
- Update vllm_qwen.def: use nightly image for Qwen3.5 support
- Update README.md: reflect 35B-A3B model, correct script names
- Update STUDENT_GUIDE.md: add app usage and thinking mode docs
- Update .gitignore: exclude .venv/ and workspace/

Made-with: Cursor
2026-03-02 16:30:04 +01:00


Student Guide — Qwen3.5-35B-A3B Inference Server

Overview

A Qwen3.5-35B-A3B language model is running on our GPU server. It is a Mixture-of-Experts model with 35B total parameters, of which about 3B are active per token, so it generates text faster than a dense model of the same total size. You can interact with it through the OpenAI-compatible API.

Connection Details

| Parameter | Value |
|-----------|-------|
| Base URL | http://silicon.fhgr.ch:7080/v1 |
| Model | qwen3.5-35b-a3b |
| API Key | (ask your instructor; may be EMPTY) |

Note: You must be on the university network or VPN to reach the server.
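
Before sending chat requests, it can help to confirm that the server is reachable and that the model name matches. The sketch below uses only the Python standard library and assumes the server exposes the usual OpenAI-style `GET /v1/models` endpoint, returning JSON with a `data` list whose entries carry an `id`. The helper names `models_url`, `parse_model_ids`, and `list_models` are illustrative and not part of the course repository.

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"


def models_url(base_url: str) -> str:
    """Return the model-listing endpoint for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: str) -> list[str]:
    """Extract model ids from an OpenAI-style /models JSON response body."""
    data = json.loads(payload).get("data", [])
    return [entry["id"] for entry in data if "id" in entry]


def list_models(base_url: str = BASE_URL, api_key: str = "EMPTY",
                timeout: float = 10.0) -> list[str]:
    """Query the server (requires university network or VPN) and return model ids."""
    request = urllib.request.Request(
        models_url(base_url),
        # Assumption: a Bearer token is accepted; "EMPTY" is a placeholder.
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_model_ids(response.read().decode("utf-8"))
```

From the university network, `list_models()` is expected to return a list that includes `qwen3.5-35b-a3b`. A connection error usually means the VPN is not active.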


Quick Start with Python

1. Install the OpenAI SDK

pip install openai

2. Simple Chat

from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)

3. Streaming Responses

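# Reuses the `client` created in the Simple Chat example above.
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    # Skip chunks that carry no choices or no text (e.g. role-only or final chunks).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()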
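
If you also need the complete text after streaming (for example to save it), the chunks can be collected while they arrive. The following is a minimal sketch: `collect_stream` and `fake_chunk` are hypothetical helper names, and the stand-in objects only imitate the `choices[0].delta.content` shape used in the loop above so the example runs without a server.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text pieces of streamed chat chunks into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices at all
            continue
        text = chunk.choices[0].delta.content
        if text:  # content can be None, e.g. on role-only chunks
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (offline demo only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [
    fake_chunk("Hello"),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk(", world"),
]
print(collect_stream(demo))  # prints "Hello, world"
```

With the real API, passing the `stream` object from the example above to `collect_stream(stream)` would return the full reply, at the cost of not printing tokens as they arrive.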

Quick Start with curl

curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
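
For comparison, the same request can be built in Python without the OpenAI SDK. This is a sketch using only the standard library. `build_chat_request` and `send` are illustrative names, and the `Authorization: Bearer` header is an assumption based on the usual OpenAI-style scheme (the curl example above sends no key, which works only if none is configured).

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="EMPTY",
                       max_tokens=256, temperature=0.7):
    """Build a POST request equivalent to the curl example above."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: Bearer-token auth; "EMPTY" is a placeholder key.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request (needs network access) and return the reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    return reply["choices"][0]["message"]["content"]
```

A call such as `send(build_chat_request("http://silicon.fhgr.ch:7080/v1", "qwen3.5-35b-a3b", "What is the capital of Switzerland?"))` mirrors the curl command.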

| Parameter | Recommended | Notes |
|-----------|-------------|-------|
| temperature | 0.7 | Lower = more deterministic, higher = creative |
| max_tokens | 1024–4096 | Increase for long-form output |
| top_p | 0.95 | Nucleus sampling |
| stream | true | Better UX for interactive use |
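
One way to apply these recommendations consistently is to keep them in a single place and override them per request. The sketch below is an assumption about how you might organize this; `RECOMMENDED` and `request_params` are hypothetical names, and the range checks follow the commonly documented OpenAI-style limits rather than anything confirmed for this server.

```python
# Defaults taken from the table above (max_tokens uses the lower bound).
RECOMMENDED = {
    "temperature": 0.7,
    "max_tokens": 1024,
    "top_p": 0.95,
    "stream": True,
}


def request_params(**overrides):
    """Return the recommended parameters, updated with any overrides."""
    params = {**RECOMMENDED, **overrides}
    # Conservative sanity checks (assumed limits, not server-verified).
    if not 0.0 <= params["temperature"] <= 2.0:
        raise ValueError("temperature should be between 0 and 2")
    if not 0.0 < params["top_p"] <= 1.0:
        raise ValueError("top_p should be in the range (0, 1]")
    if params["max_tokens"] < 1:
        raise ValueError("max_tokens should be at least 1")
    return params
```

The result can be unpacked into a call, for example `client.chat.completions.create(model="qwen3.5-35b-a3b", messages=messages, **request_params(max_tokens=4096))`.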

Tips & Etiquette

  • Be mindful of context length: Avoid excessively long prompts (>8K tokens) unless necessary.
  • Use streaming: Tokens appear as they are generated, which reduces perceived latency.
  • Don't spam requests: The server is shared among ~15 students.
  • Check the model name: Always use qwen3.5-35b-a3b as the model parameter.
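
To stay within the suggested prompt size in longer conversations, older turns can be dropped before each request. The sketch below is illustrative only: `estimate_tokens` uses a rough rule of thumb of about 4 characters per token, which is an assumption and not the model's real tokenizer, and `trim_history` is a hypothetical helper name.

```python
def estimate_tokens(text: str) -> int:
    """Very rough size estimate: about 4 characters per token (heuristic)."""
    return (len(text) + 3) // 4


def trim_history(messages, budget=8000):
    """Keep system messages plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    # Walk backwards so the newest turns are kept first.
    for message in reversed(others):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + list(reversed(kept))
```

Calling `trim_history(messages)` before each request keeps the estimated prompt size under roughly 8K tokens. Because the estimate is coarse, leaving some headroom in `budget` is advisable.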

Streamlit Chat & File Editor App

A simple web UI is included for chatting with the model and editing files.

Setup

pip install streamlit openai

Run

streamlit run app.py

This opens a browser with two tabs:

  • Chat — Conversational interface with streaming responses. You can save the model's last response directly to a file.
  • File Editor — Create and edit .py, .tex, .html, or any text file. Use the "Generate with LLM" button to have the model modify your file based on an instruction (e.g. "add error handling" or "fix the LaTeX formatting").

Files are stored in a workspace/ folder next to app.py.
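
If you want to write model output into the same folder from your own scripts, something along these lines could work. This is a hypothetical sketch and not the code used in app.py: `save_to_workspace` is an invented name, and the check that rejects paths outside the workspace is an added safety assumption.

```python
from pathlib import Path


def save_to_workspace(workspace: Path, filename: str, text: str) -> Path:
    """Write text to a file inside the workspace folder and return its path."""
    workspace = workspace.resolve()
    target = (workspace / filename).resolve()
    # Refuse names such as "../x" that would land outside the workspace.
    if workspace not in target.parents:
        raise ValueError(f"{filename!r} is outside the workspace")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

For example, `save_to_workspace(Path("workspace"), "answer.txt", reply_text)` would create the folder if needed and write the file, where `reply_text` stands for whatever string you want to keep.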

Tip: The app runs on your local machine and connects to the server, so you don't need to install anything on the GPU server.


Thinking Mode

By default, the model "thinks" before answering (internal chain-of-thought). This is great for complex reasoning but adds latency for simple questions.

To disable thinking and get faster direct responses, add this to your API call:

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[...],
    max_tokens=1024,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
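
To switch thinking on or off per request without repeating the nested dictionary, a small helper can build the extra arguments. The sketch below is a suggestion only: `thinking_options` and `strip_thinking` are hypothetical names, and `strip_thinking` assumes that reasoning, when present in the reply text, is wrapped in `<think>...</think>` tags. Whether it appears inline at all depends on how the server is configured.

```python
import re

# Assumption: inline reasoning, if any, is wrapped in <think>...</think>.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def thinking_options(enable: bool) -> dict:
    """Build keyword arguments that toggle thinking via the chat template."""
    return {"extra_body": {"chat_template_kwargs": {"enable_thinking": enable}}}


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks from a reply, if any are present."""
    return _THINK_BLOCK.sub("", text).strip()
```

The options can be unpacked into a request, for example `client.chat.completions.create(model="qwen3.5-35b-a3b", messages=messages, max_tokens=1024, **thinking_options(False))`.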

Troubleshooting

| Issue | Solution |
|-------|----------|
| Connection refused | Check you're on the university network / VPN |
| Model not found | Use model name qwen3.5-35b-a3b exactly |
| Slow responses | The model is shared; peak times may be slower |
| 401 Unauthorized | Ask your instructor for the API key |
| Response cut off | Increase max_tokens in your request |
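
Some of these cases can be detected in code. The sketch below maps HTTP status codes to the hints from the table and checks whether a reply was cut off. `hint_for_status` and `was_truncated` are hypothetical names, and the assumption that an unknown model name yields status 404 should be verified against the actual server. In the OpenAI-style response format, a `finish_reason` of `"length"` indicates that generation stopped at the `max_tokens` limit.

```python
# Hints taken from the troubleshooting table; the 404 mapping is an assumption.
HINTS = {
    401: "Ask your instructor for the API key.",
    404: "Use the model name qwen3.5-35b-a3b exactly.",
}


def hint_for_status(status: int) -> str:
    """Return a troubleshooting hint for an HTTP status code."""
    return HINTS.get(status, "No specific hint; check the error message.")


def was_truncated(finish_reason) -> bool:
    """True if the reply stopped because it reached the max_tokens limit."""
    return finish_reason == "length"
```

With the SDK, the value to check is `response.choices[0].finish_reason`. If `was_truncated` returns `True`, retrying with a larger `max_tokens` is the fix suggested in the table.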