Student Guide — Qwen3.5-35B-A3B Inference Server
Overview
A Qwen3.5-35B-A3B language model is running on our GPU server. It's a Mixture-of-Experts model (35B total parameters, 3B active per token), providing fast and high-quality responses.
There are three ways to interact with the model:
- Open WebUI — ChatGPT-like interface in your browser (easiest)
- Streamlit App — Local app with chat, file editor, and code execution
- Python SDK / curl — Programmatic access via the OpenAI-compatible API
Note: You must be on the university network or VPN to reach the server.
Connection Details
| Parameter | Value |
|---|---|
| Open WebUI | http://silicon.fhgr.ch:7081 |
| API Base URL | http://silicon.fhgr.ch:7080/v1 |
| Model | qwen3.5-35b-a3b |
| API Key | (ask your instructor — may be EMPTY) |
Option 1: Open WebUI (Recommended)
The easiest way to chat with the model — no installation required.
Getting Started
- Make sure you are connected to the university network (or VPN).
- Open your browser and go to http://silicon.fhgr.ch:7081
- Click "Sign Up" to create a new account:
- Enter your name (e.g. your first and last name)
- Enter your email (use your university email)
- Choose a password
- Click "Create Account"
- After signing up you are logged in automatically.
- Select the model qwen3.5-35b-a3b from the model dropdown at the top.
- Type a message and press Enter — you're chatting with the LLM.
Returning Later
- Go to http://silicon.fhgr.ch:7081 and click "Sign In".
- Enter the email and password you used during sign-up.
- All your previous chats are still there.
Features
- Chat history — all conversations are saved on the server and persist across sessions
- Markdown rendering with syntax-highlighted code blocks
- Model selector — auto-discovers available models from the server
- Conversation branching — edit previous messages and explore alternative responses
- File upload — attach files to your messages for the model to analyze
- Search — search across all your past conversations
Tips
- Your account and chat history are stored on the server. You can log in from any device on the university network.
- If you forget your password, ask your instructor to reset it via the Admin Panel.
- The model works best when you provide clear, specific instructions.
- For code tasks, mention the programming language explicitly (e.g. "Write a Python function that...").
- Long conversations use more context. Start a New Chat (top-left button) when switching topics to get faster, more focused responses.
Option 2: Streamlit App (Chat + File Editor)
A local app with chat, file editing, and Python/LaTeX execution. See the Streamlit section below for setup.
Option 3: Python SDK / curl
For programmatic access and scripting.
Quick Start with Python
1. Install the OpenAI SDK
pip install openai
2. Simple Chat
from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)
3. Streaming Responses
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
Quick Start with curl
curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
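The JSON reply follows the OpenAI chat-completions schema: the assistant's text sits at choices[0].message.content, with token usage alongside. A minimal parsing sketch (the payload below is illustrative, not a real server response):

```python
import json

# Illustrative payload in the OpenAI chat-completions shape -- an actual
# server reply will have different ids, content, and token counts.
raw = """
{
  "id": "chatcmpl-123",
  "model": "qwen3.5-35b-a3b",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Bern is the capital of Switzerland."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21}
}
"""

reply = json.loads(raw)
answer = reply["choices"][0]["message"]["content"]
print(answer)
```

The same path works on the Python SDK's response object via attribute access (`response.choices[0].message.content`).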
Recommended Parameters
| Parameter | Recommended | Notes |
|---|---|---|
| temperature | 0.7 | Lower = more deterministic, higher = creative |
| max_tokens | 1024–4096 | Increase for long-form output |
| top_p | 0.95 | Nucleus sampling |
| stream | true | Better UX for interactive use |
Tips & Etiquette
- Be mindful of context length: Avoid excessively long prompts (>8K tokens) unless necessary.
- Use streaming: Makes responses feel faster and reduces perceived latency.
- Don't spam requests: The server is shared among ~15 students.
- Check the model name: Always use qwen3.5-35b-a3b as the model parameter.
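To stay under the ~8K-token guideline without installing a tokenizer, a rough rule of thumb is about 4 characters per token for English text. A minimal sketch (the heuristic is an approximation, not the model's actual tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the common ~4 characters/token heuristic.

    This is NOT the model's real tokenizer -- it only gives a ballpark
    figure to help you stay under the context limit.
    """
    return max(1, len(text) // 4)

prompt = "Explain gradient descent in simple terms. " * 50
if estimate_tokens(prompt) > 8000:
    print("Prompt is likely over 8K tokens -- consider trimming it.")
else:
    print(f"Estimated prompt size: ~{estimate_tokens(prompt)} tokens")
```

Code and non-English text tokenize less predictably, so leave extra headroom for those.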
Streamlit Chat & File Editor App
A web UI is included for chatting with the model and editing files. It runs on your own machine and connects to the GPU server.
Setup
# Clone the repository
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git
cd LLM_Inferenz_Server_1
# Create a virtual environment and install dependencies
python3 -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows
pip install -r requirements.txt
Run
streamlit run app.py
Opens at http://localhost:8501 in your browser.
Features
Chat Tab
- Conversational interface with streaming responses
- "Save code" button extracts code from the LLM response and saves it to a workspace file (strips markdown formatting automatically)
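The markdown-stripping step can be sketched as a regex that pulls the first fenced block out of a response. Function name and behavior here are illustrative, not the app's actual implementation:

```python
import re

FENCE = "`" * 3  # a markdown code fence (three backticks)

def extract_code(llm_response: str) -> str:
    """Return the contents of the first fenced code block, or the raw
    text when no fence is present. Illustrative sketch, not the app's code."""
    pattern = FENCE + r"[\w+-]*\n(.*?)" + FENCE
    match = re.search(pattern, llm_response, re.DOTALL)
    return match.group(1).rstrip("\n") if match else llm_response.strip()

reply = "Here you go:\n" + FENCE + "python\nprint('hello')\n" + FENCE
print(extract_code(reply))  # prints the bare code with fences and language tag removed
```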
File Editor Tab
- Create and edit .py, .tex, .html, or any text file
- Syntax-highlighted preview of file content
- "Generate with LLM" button: describe a change in natural language and the model rewrites the file (e.g. "add error handling", "fix the LaTeX formatting", "translate comments to German")
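The "Generate with LLM" step boils down to a single chat call that feeds the current file plus your instruction and asks for the full rewritten file back. A sketch of composing that request (the prompt wording is illustrative, not the app's actual prompt):

```python
def build_rewrite_messages(filename: str, content: str, instruction: str) -> list:
    """Compose a chat request asking the model to rewrite a file.

    Prompt wording is an illustrative sketch, not the app's actual prompt.
    """
    return [
        {"role": "system",
         "content": "You rewrite files. Reply with the complete updated file only."},
        {"role": "user",
         "content": f"File: {filename}\n---\n{content}\n---\nInstruction: {instruction}"},
    ]

messages = build_rewrite_messages("script.py", "print('hi')", "add error handling")
print(messages[1]["content"].splitlines()[0])  # -> File: script.py
```

Asking for "the complete updated file only" makes the reply easy to save back to the workspace without further parsing.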
Sidebar Controls
- Connection: API Base URL and API Key
- LLM Parameters: Adjustable for each request
| Parameter | Default | What it does |
|---|---|---|
| Thinking Mode | Off | Toggle chain-of-thought reasoning (better for complex tasks, slower) |
| Temperature | 0.7 | Lower = predictable, higher = creative |
| Max Tokens | 4096 | Maximum response length |
| Top P | 0.95 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | Encourage diverse topics |
- File Manager: Create new files and switch between them
All generated files are stored in a workspace/ folder next to app.py.
Tip: The app runs entirely on your local machine. Only the LLM requests go to the server — your files stay local.
Thinking Mode
By default, the model "thinks" before answering (internal chain-of-thought). This is great for complex reasoning but adds latency for simple questions.
To disable thinking and get faster direct responses, add this to your API call:
response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[...],
    max_tokens=1024,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
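Qwen-family models typically emit their reasoning inside <think>...</think> tags when thinking is enabled. If you want to keep thinking on but hide the reasoning client-side, a sketch (the tag format is an assumption about this deployment — verify against your server's actual output):

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks from a model response.

    Assumes the Qwen-style <think> tag format; check your server's real
    output before relying on this.
    """
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()

raw = "<think>The user wants a short answer.</think>Bern."
print(strip_thinking(raw))  # -> Bern.
```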
Troubleshooting
| Issue | Solution |
|---|---|
| Connection refused | Check you're on the university network / VPN |
| Model not found | Use model name qwen3.5-35b-a3b exactly |
| Slow responses | The model is shared — peak times may be slower |
| 401 Unauthorized | Ask your instructor for the API key |
| Response cut off | Increase max_tokens in your request |
| Open WebUI login fails | Make sure you created an account first (Sign Up) |
| Open WebUI shows no models | The vLLM server may still be loading — wait a few minutes |
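For "Connection refused", a quick way to check whether the server ports are reachable at all before digging further (host and ports taken from the Connection Details table above):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refusal, or timeout
        return False

for port in (7080, 7081):  # vLLM API and Open WebUI ports
    status = "reachable" if can_reach("silicon.fhgr.ch", port) else "NOT reachable"
    print(f"silicon.fhgr.ch:{port} is {status}")
```

If both ports report NOT reachable, the problem is almost certainly the network/VPN, not the model server.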