
Student Guide — Qwen3.5 Inference Server

Overview

A Qwen3.5 large language model is running on our GPU server. Two models may be available at different times (your instructor will let you know which one is active):

Model Params Best for
qwen3.5-35b-a3b 35B (3B active) Fast responses, everyday tasks
qwen3.5-122b-a10b-fp8 122B (10B active) Complex reasoning, coding, research

There are three ways to interact with the model:

  1. Open WebUI — ChatGPT-like interface in your browser (easiest)
  2. Streamlit App — Local app with chat, file editor, and code execution
  3. Python SDK / curl — Programmatic access via the OpenAI-compatible API

Note: You must be on the fhgr network or VPN to reach the server.

Connection Details

Parameter Value
Open WebUI http://silicon.fhgr.ch:7081
API Base URL http://silicon.fhgr.ch:7080/v1
Model (check Open WebUI model selector or ask your instructor)
API Key (ask your instructor — may be EMPTY)

Tip: In Open WebUI, the model dropdown at the top automatically shows whichever model is currently running. For the API, use curl http://silicon.fhgr.ch:7080/v1/models to check.
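
If you prefer checking from Python, the same /v1/models endpoint can be queried with the SDK. This is a sketch; the pick_model helper and its preferred default are illustrative, not part of the server:

```python
def pick_model(available: list[str], preferred: str = "qwen3.5-35b-a3b") -> str:
    """Return the preferred model if the server offers it, else the first listed."""
    return preferred if preferred in available else available[0]

if __name__ == "__main__":
    from openai import OpenAI  # pip install openai

    # Requires the university network / VPN, like all requests to the server.
    client = OpenAI(base_url="http://silicon.fhgr.ch:7080/v1", api_key="EMPTY")
    ids = [m.id for m in client.models.list().data]
    print("Active model:", pick_model(ids))
```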


Option 1: Open WebUI

The easiest way to chat with the model — no installation required.

Getting Started

  1. Make sure you are connected to the university network (or VPN).
  2. Open your browser and go to http://silicon.fhgr.ch:7081
  3. Click "Sign Up" to create a new account:
    • Enter your name (e.g. your first and last name)
    • Enter your email (use your university email)
    • Choose a password
    • Click "Create Account"
  4. After signing up you are logged in automatically.
  5. Select the model qwen3.5-35b-a3b from the model dropdown at the top.
  6. Type a message and press Enter — you're chatting with the LLM.

Returning Later

  • Go to http://silicon.fhgr.ch:7081 and click "Sign In".
  • Enter the email and password you used during sign-up.
  • All your previous chats are still there.

Features

  • Chat history — all conversations are saved on the server and persist across sessions
  • Markdown rendering with syntax-highlighted code blocks
  • Model selector — auto-discovers available models from the server
  • Conversation branching — edit previous messages and explore alternative responses
  • File upload — attach files to your messages for the model to analyze
  • Search — search across all your past conversations

Tips

  • Your account and chat history are stored on the server. You can log in from any device on the university network.
  • If you forget your password, ask your instructor to reset it via the Admin Panel.
  • The model works best when you provide clear, specific instructions.
  • For code tasks, mention the programming language explicitly (e.g. "Write a Python function that...").
  • Long conversations use more context. Start a New Chat (top-left button) when switching topics to get faster, more focused responses.

Option 2: Streamlit App (Chat + File Editor)

A local app with chat, file editing, and Python/LaTeX execution. See the Streamlit section below for setup.


Option 3: Python SDK / curl

For programmatic access and scripting.

Quick Start with Python

1. Install the OpenAI SDK

pip install openai

2. Simple Chat

from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)

3. Streaming Responses

stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Quick Start with curl

curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

Parameter Recommended Notes
temperature 0.7 Lower = more deterministic, higher = creative
max_tokens 1024–4096 Increase for long-form output
top_p 0.95 Nucleus sampling
stream true Better UX for interactive use
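
The recommendations above can be bundled into a small helper so every request starts from the same defaults. The build_request function below is a hypothetical convenience, not part of the SDK:

```python
def build_request(prompt: str, model: str = "qwen3.5-35b-a3b", **overrides) -> dict:
    """Chat-completions payload pre-filled with the recommended defaults."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,   # lower = more deterministic, higher = creative
        "max_tokens": 1024,   # raise toward 4096 for long-form output
        "top_p": 0.95,        # nucleus sampling
        "stream": True,       # better UX for interactive use
    }
    payload.update(overrides)  # e.g. build_request("...", temperature=0.2)
    return payload
```

Pass it straight to the SDK with client.chat.completions.create(**build_request("Explain top-p sampling")).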

Tips & Etiquette

  • Be mindful of context length: Avoid excessively long prompts (>8K tokens) unless necessary.
  • Use streaming: Makes responses feel faster and reduces perceived latency.
  • Don't spam requests: The server is shared among ~15 students.
  • Check the model name: Use the exact id returned by /v1/models (e.g. qwen3.5-35b-a3b).
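
Since the server is shared, transient failures (overload, brief restarts) are best handled with a short exponential backoff rather than immediate re-requests. A minimal sketch; the with_retries wrapper is illustrative, not an SDK feature:

```python
import time

def with_retries(call, retries: int = 3, base_delay: float = 1.0,
                 exceptions: tuple = (Exception,)):
    """Run call() and retry on failure, sleeping 1s, 2s, 4s between attempts."""
    for attempt in range(retries):
        try:
            return call()
        except exceptions:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Usage (hypothetical): wrap any SDK call in a zero-argument lambda.
# reply = with_retries(lambda: client.chat.completions.create(**request))
```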

Streamlit Chat & File Editor App

A web UI is included for chatting with the model and editing files. It runs on your own machine and connects to the GPU server.

Setup

# Clone the repository
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git
cd LLM_Inferenz_Server_1

# Create a virtual environment and install dependencies
python3 -m venv .venv
source .venv/bin/activate        # macOS / Linux
# .venv\Scripts\activate         # Windows
pip install -r requirements.txt

Run

streamlit run app.py

Opens at http://localhost:8501 in your browser.

Features

Chat Tab

  • Conversational interface with streaming responses
  • "Save code" button extracts code from the LLM response and saves it to a workspace file (strips markdown formatting automatically)
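
The fence-stripping behaviour of the "Save code" button can be approximated in a few lines. This is a guess at the logic, not the app's actual implementation:

```python
import re

def extract_code(reply: str) -> str:
    """Return the contents of the first markdown code fence, or the reply unchanged."""
    match = re.search(r"```[\w+-]*\n(.*?)```", reply, re.DOTALL)
    return match.group(1) if match else reply
```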

File Editor Tab

  • Create and edit .py, .tex, .html, or any text file
  • Syntax-highlighted preview of file content
  • "Generate with LLM" button: describe a change in natural language and the model rewrites the file (e.g. "add error handling", "fix the LaTeX formatting", "translate comments to German")
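
One plausible shape for the "Generate with LLM" flow, sketched as a function. The prompt wording and parameters here are assumptions; the real app may differ:

```python
def rewrite_text(client, source: str, instruction: str,
                 model: str = "qwen3.5-35b-a3b") -> str:
    """Send file content plus a natural-language instruction; return the rewritten file."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Rewrite the given file as instructed. Output only the new file content."},
            {"role": "user", "content": f"Instruction: {instruction}\n\n{source}"},
        ],
        max_tokens=4096,
        temperature=0.2,  # low temperature: we want a faithful edit, not creativity
    )
    return resp.choices[0].message.content
```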

Sidebar Controls

  • Connection: API Base URL and API Key
  • LLM Parameters: Adjustable for each request
Parameter Default What it does
Thinking Mode Off Toggle chain-of-thought reasoning (better for complex tasks, slower)
Temperature 0.7 Lower = predictable, higher = creative
Max Tokens 4096 Maximum response length
Top P 0.95 Nucleus sampling threshold
Presence Penalty 0.0 Encourage diverse topics
  • File Manager: Create new files and switch between them

All generated files are stored in a workspace/ folder next to app.py.

Tip: The app runs entirely on your local machine. Only the LLM requests go to the server — your files stay local.


Thinking Mode

By default, the model "thinks" before answering (internal chain-of-thought). This is great for complex reasoning but adds latency for simple questions.

To disable thinking and get faster direct responses, add this to your API call:

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[...],
    max_tokens=1024,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

Troubleshooting

Issue Solution
Connection refused Check you're on the university network / VPN
Model not found Use the exact model id from /v1/models (e.g. qwen3.5-35b-a3b)
Slow responses The model is shared — peak times may be slower
401 Unauthorized Ask your instructor for the API key
Response cut off Increase max_tokens in your request
Open WebUI login fails Make sure you created an account first (Sign Up)
Open WebUI shows no models The vLLM server may still be loading — wait a few minutes
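
For "Connection refused" and "shows no models", a stdlib-only check (no SDK needed) tells you whether the server is reachable and what it currently serves. The model_ids helper below is a sketch:

```python
import json
import urllib.request

def model_ids(payload: dict) -> list:
    """Extract model ids from a /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]

if __name__ == "__main__":
    # Raises URLError if you are not on the university network / VPN.
    url = "http://silicon.fhgr.ch:7080/v1/models"
    with urllib.request.urlopen(url, timeout=5) as resp:
        print(model_ids(json.load(resp)))
```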