Student Guide — Qwen3.5-35B-A3B Inference Server
Overview
A Qwen3.5-35B-A3B language model is running on our GPU server. It's a Mixture-of-Experts model (35B total parameters, 3B active per token), providing fast and high-quality responses.
There are three ways to interact with the model:
- Open WebUI — ChatGPT-like interface in your browser (easiest)
- Streamlit App — Local app with chat, file editor, and code execution
- Python SDK / curl — Programmatic access via the OpenAI-compatible API
Note: You must be on the university network or VPN to reach the server.
Connection Details
| Parameter | Value |
|---|---|
| Open WebUI | http://silicon.fhgr.ch:7081 |
| API Base URL | http://silicon.fhgr.ch:7080/v1 |
| Model | qwen3.5-35b-a3b |
| API Key | (ask your instructor — may be EMPTY) |
Option 1: Open WebUI (Recommended)
The easiest way to chat with the model — no installation required.
Getting Started
- Make sure you are connected to the university network (or VPN).
- Open your browser and go to http://silicon.fhgr.ch:7081
- Click "Sign Up" to create a new account:
- Enter your name (e.g. your first and last name)
- Enter your email (use your university email)
- Choose a password
- Click "Create Account"
- After signing up you are logged in automatically.
- Select the model qwen3.5-35b-a3b from the model dropdown at the top.
- Type a message and press Enter — you're chatting with the LLM.
Returning Later
- Go to http://silicon.fhgr.ch:7081 and click "Sign In".
- Enter the email and password you used during sign-up.
- All your previous chats are still there.
Features
- Chat history — all conversations are saved on the server and persist across sessions
- Markdown rendering with syntax-highlighted code blocks
- Model selector — auto-discovers available models from the server
- Conversation branching — edit previous messages and explore alternative responses
- File upload — attach files to your messages for the model to analyze
- Search — search across all your past conversations
Tips
- Your account and chat history are stored on the server. You can log in from any device on the university network.
- If you forget your password, ask your instructor to reset it via the Admin Panel.
- The model works best when you provide clear, specific instructions.
- For code tasks, mention the programming language explicitly (e.g. "Write a Python function that...").
- Long conversations use more context. Start a New Chat (top-left button) when switching topics to get faster, more focused responses.
Option 2: Streamlit App (Chat + File Editor)
A local app with chat, file editing, and Python/LaTeX execution. See the Streamlit section below for setup.
Option 3: Python SDK / curl
For programmatic access and scripting.
Quick Start with Python
1. Install the OpenAI SDK
pip install openai
2. Simple Chat
from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)
3. Streaming Responses
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
Quick Start with curl
curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
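The JSON reply follows the OpenAI chat-completions schema: the assistant's text sits at choices[0].message.content, with token usage alongside. A minimal parsing sketch (the payload below is illustrative, not a real server response):

```python
import json

# Illustrative payload in the OpenAI chat-completions shape -- an actual
# server reply will have different ids, content, and token counts.
raw = """
{
  "id": "chatcmpl-123",
  "model": "qwen3.5-35b-a3b",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Bern is the capital of Switzerland."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21}
}
"""

reply = json.loads(raw)
answer = reply["choices"][0]["message"]["content"]
print(answer)
```

The same path works on the Python SDK's response object via attribute access (`response.choices[0].message.content`).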
Recommended Parameters
| Parameter | Recommended | Notes |
|---|---|---|
| temperature | 0.7 | Lower = more deterministic, higher = creative |
| max_tokens | 1024–4096 | Increase for long-form output |
| top_p | 0.95 | Nucleus sampling |
| stream | true | Better UX for interactive use |
Tips & Etiquette
- Be mindful of context length: Avoid excessively long prompts (>8K tokens) unless necessary.
- Use streaming: Makes responses feel faster and reduces perceived latency.
- Don't spam requests: The server is shared among ~15 students.
- Check the model name: Always use qwen3.5-35b-a3b as the model parameter.
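To stay under the ~8K-token guideline without installing a tokenizer, a rough rule of thumb is about 4 characters per token for English text. A minimal sketch (the heuristic is an approximation, not the model's actual tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the common ~4 characters/token heuristic.

    This is NOT the model's real tokenizer -- it only gives a ballpark
    figure to help you stay under the context limit.
    """
    return max(1, len(text) // 4)

prompt = "Explain gradient descent in simple terms. " * 50
if estimate_tokens(prompt) > 8000:
    print("Prompt is likely over 8K tokens -- consider trimming it.")
else:
    print(f"Estimated prompt size: ~{estimate_tokens(prompt)} tokens")
```

Code and non-English text tokenize less predictably, so leave extra headroom for those.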
Streamlit Chat & File Editor App
A web UI is included for chatting with the model and editing files. It runs on your own machine and connects to the GPU server.
Setup
# Clone the repository
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git
cd LLM_Inferenz_Server_1
# Create a virtual environment and install dependencies
python3 -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows
pip install -r requirements.txt
Run
streamlit run app.py
Opens at http://localhost:8501 in your browser.
Features
Chat Tab
- Conversational interface with streaming responses
- "Save code" button extracts code from the LLM response and saves it to a workspace file (strips markdown formatting automatically)
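The markdown-stripping step can be sketched as a regex that pulls the first fenced block out of a response. Function name and behavior here are illustrative, not the app's actual implementation:

```python
import re

FENCE = "`" * 3  # a markdown code fence (three backticks)

def extract_code(llm_response: str) -> str:
    """Return the contents of the first fenced code block, or the raw
    text when no fence is present. Illustrative sketch, not the app's code."""
    pattern = FENCE + r"[\w+-]*\n(.*?)" + FENCE
    match = re.search(pattern, llm_response, re.DOTALL)
    return match.group(1).rstrip("\n") if match else llm_response.strip()

reply = "Here you go:\n" + FENCE + "python\nprint('hello')\n" + FENCE
print(extract_code(reply))  # prints the bare code with fences and language tag removed
```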
File Editor Tab
- Create and edit .py, .tex, .html, or any text file
- Syntax-highlighted preview of file content
- "Generate with LLM" button: describe a change in natural language and the model rewrites the file (e.g. "add error handling", "fix the LaTeX formatting", "translate comments to German")
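The "Generate with LLM" step boils down to a single chat call that feeds the current file plus your instruction and asks for the full rewritten file back. A sketch of composing that request (the prompt wording is illustrative, not the app's actual prompt):

```python
def build_rewrite_messages(filename: str, content: str, instruction: str) -> list:
    """Compose a chat request asking the model to rewrite a file.

    Prompt wording is an illustrative sketch, not the app's actual prompt.
    """
    return [
        {"role": "system",
         "content": "You rewrite files. Reply with the complete updated file only."},
        {"role": "user",
         "content": f"File: {filename}\n---\n{content}\n---\nInstruction: {instruction}"},
    ]

messages = build_rewrite_messages("script.py", "print('hi')", "add error handling")
print(messages[1]["content"].splitlines()[0])  # -> File: script.py
```

Asking for "the complete updated file only" makes the reply easy to save back to the workspace without further parsing.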
Sidebar Controls
- Connection: API Base URL and API Key
- LLM Parameters: Adjustable for each request
| Parameter | Default | What it does |
|---|---|---|
| Thinking Mode | Off | Toggle chain-of-thought reasoning (better for complex tasks, slower) |
| Temperature | 0.7 | Lower = predictable, higher = creative |
| Max Tokens | 4096 | Maximum response length |
| Top P | 0.95 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | Encourage diverse topics |
- File Manager: Create new files and switch between them
All generated files are stored in a workspace/ folder next to app.py.
Tip: The app runs entirely on your local machine. Only the LLM requests go to the server — your files stay local.
Thinking Mode
By default, the model "thinks" before answering (internal chain-of-thought). This is great for complex reasoning but adds latency for simple questions.
To disable thinking and get faster direct responses, add this to your API call:
response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[...],
    max_tokens=1024,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
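Qwen-family models typically emit their reasoning inside <think>...</think> tags when thinking is enabled. If you want to keep thinking on but hide the reasoning client-side, a sketch (the tag format is an assumption about this deployment — verify against your server's actual output):

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks from a model response.

    Assumes the Qwen-style <think> tag format; check your server's real
    output before relying on this.
    """
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()

raw = "<think>The user wants a short answer.</think>Bern."
print(strip_thinking(raw))  # -> Bern.
```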
Troubleshooting
| Issue | Solution |
|---|---|
| Connection refused | Check you're on the university network / VPN |
| Model not found | Use model name qwen3.5-35b-a3b exactly |
| Slow responses | The model is shared — peak times may be slower |
| 401 Unauthorized | Ask your instructor for the API key |
| Response cut off | Increase max_tokens in your request |
| Open WebUI login fails | Make sure you created an account first (Sign Up) |
| Open WebUI shows no models | The vLLM server may still be loading — wait a few minutes |
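For "Connection refused", a quick way to check whether the server ports are reachable at all before digging further (host and ports taken from the Connection Details table above):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refusal, or timeout
        return False

for port in (7080, 7081):  # vLLM API and Open WebUI ports
    status = "reachable" if can_reach("silicon.fhgr.ch", port) else "NOT reachable"
    print(f"silicon.fhgr.ch:{port} is {status}")
```

If both ports report NOT reachable, the problem is almost certainly the network/VPN, not the model server.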