Student Guide — Qwen3.5-35B-A3B Inference Server
Overview
A Qwen3.5-35B-A3B language model is running on our GPU server. It's a Mixture-of-Experts model (35B total parameters, 3B active per token), providing fast and high-quality responses. You can interact with it using the OpenAI-compatible API.
Connection Details
| Parameter | Value |
|---|---|
| Base URL | http://silicon.fhgr.ch:7080/v1 |
| Model | qwen3.5-35b-a3b |
| API Key | (ask your instructor — may be EMPTY) |
Note: You must be on the university network or VPN to reach the server.
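To tell a network problem apart from an API problem, it can help to test whether the server's TCP port is reachable before sending any requests. The following is a minimal sketch using only the standard library; the helper names `host_and_port` and `can_reach` are illustrative and not part of the course repository.

```python
import socket
from urllib.parse import urlparse

BASE_URL = "http://silicon.fhgr.ch:7080/v1"  # from the Connection Details table


def host_and_port(base_url: str) -> tuple[str, int]:
    """Split a base URL into (host, port), falling back to the scheme's default port."""
    parsed = urlparse(base_url)
    default_port = 443 if parsed.scheme == "https" else 80
    return parsed.hostname, parsed.port or default_port


def can_reach(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the server's host and port succeeds."""
    host, port = host_and_port(base_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed DNS lookups.
        return False
```

If `can_reach(BASE_URL)` returns `False`, the most likely cause is that you are not on the university network or VPN.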
Quick Start with Python
1. Install the OpenAI SDK
pip install openai
2. Simple Chat
from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)
3. Streaming Responses
# Reuses the `client` object created in step 2.
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
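If you also need the complete reply afterwards (for example, to append it to the conversation history), the streamed pieces can be collected while they are printed. The sketch below assumes chunks shaped like those of the OpenAI SDK; `collect_stream` and `fake_chunk` are hypothetical helper names, and `fake_chunk` exists only so the logic can be tried without a server. It also skips chunks that carry no choices, which some OpenAI-compatible servers may send.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Print streamed text as it arrives and return the full reply."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # defensive: skip chunks without any choices
            continue
        text = chunk.choices[0].delta.content
        if text:  # content can be None or "" on role-only and final chunks
            print(text, end="", flush=True)
            parts.append(text)
    print()
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like a streaming chunk, for offline testing only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

With a real request, you would pass the object returned by `client.chat.completions.create(..., stream=True)` to `collect_stream`.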
Quick Start with curl
curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
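The same request can be built in Python without the OpenAI SDK, which shows exactly what goes over the wire. This is a sketch under assumptions: the function names are illustrative, and the `Authorization: Bearer ...` header follows the usual OpenAI convention, which this guide does not confirm for the server.

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"


def build_chat_request(prompt: str, api_key: str = "EMPTY") -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": "qwen3.5-35b-a3b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.7,
    }
    headers = {
        "Content-Type": "application/json",
        # Assumption: the server accepts an OpenAI-style bearer token.
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send_chat(prompt: str, api_key: str = "EMPTY") -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, api_key)
    with urllib.request.urlopen(request, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `send_chat("What is the capital of Switzerland?")` requires the university network or VPN, just like the curl command.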
Recommended Parameters
| Parameter | Recommended | Notes |
|---|---|---|
| temperature | 0.7 | Lower = more deterministic, higher = more creative |
| max_tokens | 1024–4096 | Increase for long-form output |
| top_p | 0.95 | Nucleus sampling |
| stream | true | Better UX for interactive use |
Tips & Etiquette
- Be mindful of context length: Avoid excessively long prompts (>8K tokens) unless necessary.
- Use streaming: Makes responses feel faster and reduces perceived latency.
- Don't spam requests: The server is shared among ~15 students.
- Check the model name: Always use qwen3.5-35b-a3b as the model parameter.
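Two of the tips above can be checked in code before a request is sent. The sketch below is illustrative only: the four-characters-per-token ratio is a rough rule of thumb and not the model's real tokenizer, and the one-second minimum pause is an arbitrary choice rather than a server rule.

```python
import time

CHARS_PER_TOKEN = 4         # assumption: rough rule of thumb, not the real tokenizer
PROMPT_TOKEN_LIMIT = 8_000  # the ">8K tokens" guideline from the tips


def estimate_tokens(messages: list[dict]) -> int:
    """Roughly estimate the prompt size from the character count of all messages."""
    total_chars = sum(len(m["content"]) for m in messages)
    return total_chars // CHARS_PER_TOKEN


def is_prompt_too_long(messages: list[dict]) -> bool:
    """Return True if the rough estimate exceeds the guideline."""
    return estimate_tokens(messages) > PROMPT_TOKEN_LIMIT


class Throttle:
    """Enforce a minimum pause between consecutive requests."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, if any

    def wait(self) -> None:
        """Sleep just long enough that calls are at least min_interval apart."""
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

A script that sends many prompts in a loop could call `throttle.wait()` before each request so that it does not crowd out other students.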
Streamlit Chat & File Editor App
A web UI is included for chatting with the model and editing files. It runs on your own machine and connects to the GPU server.
Setup
# Clone the repository
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git
cd LLM_Inferenz_Server_1
# Create a virtual environment and install dependencies
python3 -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows
pip install -r requirements.txt
Run
streamlit run app.py
Opens at http://localhost:8501 in your browser.
Features
Chat Tab
- Conversational interface with streaming responses
- "Save code" button extracts code from the LLM response and saves it to a workspace file (strips markdown formatting automatically)
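To illustrate what stripping markdown formatting from a reply can look like, here is a minimal sketch based on a regular expression. It is not taken from app.py, whose actual implementation may differ; `extract_code` is a hypothetical name.

```python
import re

# Matches a fenced block: three backticks, an optional language tag, then the body.
FENCE = re.compile(r"`{3}[^\n`]*\n(.*?)`{3}", re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the contents of all fenced blocks, or the whole reply if there are none."""
    blocks = [body.rstrip("\n") for body in FENCE.findall(reply)]
    if not blocks:
        return reply.strip()
    # Join multiple blocks with a blank line and end the file with a newline.
    return "\n\n".join(blocks) + "\n"
```

Falling back to the whole reply is a design choice here: a model that answers with bare code and no fence still produces a usable file.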
File Editor Tab
- Create and edit .py, .tex, .html, or any text file
- Syntax-highlighted preview of file content
- "Generate with LLM" button: describe a change in natural language and the model rewrites the file (e.g. "add error handling", "fix the LaTeX formatting", "translate comments to German")
Sidebar Controls
- Connection: API Base URL and API Key
- LLM Parameters: Adjustable for each request
| Parameter | Default | What it does |
|---|---|---|
| Thinking Mode | Off | Toggle chain-of-thought reasoning (better for complex tasks, slower) |
| Temperature | 0.7 | Lower = predictable, higher = creative |
| Max Tokens | 4096 | Maximum response length |
| Top P | 0.95 | Nucleus sampling threshold |
| Presence Penalty | 0.0 | Encourage diverse topics |
- File Manager: Create new files and switch between them
All generated files are stored in a workspace/ folder next to app.py.
Tip: The app runs entirely on your local machine. Only the LLM requests go to the server; your files stay local.
Thinking Mode
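When called directly through the API, the model "thinks" before answering by default (internal chain-of-thought); in the Streamlit app, the Thinking Mode toggle starts switched off. Thinking is great for complex reasoning but adds latency for simple questions.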
To disable thinking and get faster direct responses, add this to your API call:
response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[...],
    max_tokens=1024,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
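If you switch between both modes often, the toggle can be wrapped in a small helper. This is only a sketch: `chat_kwargs` is a hypothetical name, and it simply reuses the `extra_body` setting shown above, leaving it out entirely when thinking should stay on.

```python
def chat_kwargs(prompt: str, thinking: bool = True, max_tokens: int = 1024) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    kwargs = {
        "model": "qwen3.5-35b-a3b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if not thinking:
        # Same switch as in the API call above; omitted when thinking stays on.
        kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": False}}
    return kwargs
```

Usage would then look like `client.chat.completions.create(**chat_kwargs("What is 2 + 2?", thinking=False))`.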
Troubleshooting
| Issue | Solution |
|---|---|
| Connection refused | Check you're on the university network / VPN |
| Model not found | Use model name qwen3.5-35b-a3b exactly |
| Slow responses | The model is shared — peak times may be slower |
| 401 Unauthorized | Ask your instructor for the API key |
| Response cut off | Increase max_tokens in your request |
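A cut-off response can also be detected in code. In the OpenAI API, a choice whose `finish_reason` is `"length"` stopped because it hit the token limit; the sketch below assumes this server follows the same convention. `was_truncated` and `fake_response` are hypothetical names, and `fake_response` exists only for trying the check offline.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """Return True if generation stopped because the max_tokens limit was reached."""
    return response.choices[0].finish_reason == "length"


def fake_response(finish_reason: str):
    """Build an object shaped like a chat completion, for offline testing only."""
    return SimpleNamespace(choices=[SimpleNamespace(finish_reason=finish_reason)])
```

With a real `response` object, `was_truncated(response)` returning `True` is a signal to retry with a larger max_tokens value.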