Scripts to build container, download model, and serve Qwen3.5-35B-A3B via vLLM with OpenAI-compatible API on port 7080. Configured for 2x NVIDIA L40S GPUs with tensor parallelism, supporting ~15 concurrent students. Made-with: Cursor
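For reference, a launch command along these lines is roughly what such a setup looks like. This is only a sketch, not the repo's actual script; the Hugging Face model path and the exact flags are assumptions:

```bash
# Sketch: serve the model with vLLM across 2 GPUs (tensor parallelism),
# exposing an OpenAI-compatible API on port 7080.
vllm serve Qwen/Qwen3.5-35B-A3B \
  --served-model-name qwen3.5-35b-a3b \
  --tensor-parallel-size 2 \
  --port 7080
```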
# Student Guide — Qwen3.5-35B-A3B Inference Server

## Overview

A Qwen3.5-35B-A3B language model is running on our GPU server. It's a Mixture-of-Experts model (35B total parameters, 3B active per token), providing fast and high-quality responses. You can interact with it using the OpenAI-compatible API.
## Connection Details

| Parameter | Value |
|---|---|
| Base URL | `http://silicon.fhgr.ch:7080/v1` |
| Model | `qwen3.5-35b-a3b` |
| API Key | (ask your instructor — may be `EMPTY`) |
**Note**: You must be on the university network or VPN to reach the server.
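A quick way to confirm you can reach the server is to list the models it serves. The sketch below assumes the standard OpenAI-compatible `/v1/models` endpoint is exposed (vLLM serves it by default); include the authorization header only if your instructor set a key:

```bash
# List served models; the output should include qwen3.5-35b-a3b
curl http://silicon.fhgr.ch:7080/v1/models \
  -H "Authorization: Bearer EMPTY"
```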
## Quick Start with Python

### 1. Install the OpenAI SDK

```bash
pip install openai
```
### 2. Simple Chat

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)
```
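If you want to see how many tokens a request consumed (useful for staying within the context-length guidance further down), the response object returned by the SDK carries a `usage` field, which the server populates for non-streaming requests:

```python
# Inspect token accounting for the previous request
print(response.usage.prompt_tokens, "prompt tokens")
print(response.usage.completion_tokens, "completion tokens")
print(response.usage.total_tokens, "total tokens")
```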
### 3. Streaming Responses

```python
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```
## Quick Start with curl

```bash
curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```
## Recommended Parameters

| Parameter | Recommended | Notes |
|---|---|---|
| `temperature` | 0.7 | Lower = more deterministic, higher = more creative |
| `max_tokens` | 1024–4096 | Increase for long-form output |
| `top_p` | 0.95 | Nucleus sampling |
| `stream` | `true` | Better UX for interactive use |
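Putting those defaults together in one request (this reuses the `client` from the Python quick start; the prompt is just a placeholder):

```python
# One streamed request using the recommended defaults from the table above
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[{"role": "user", "content": "Summarize the idea behind Mixture-of-Experts models."}],
    temperature=0.7,
    top_p=0.95,
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```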
Tips & Etiquette
- Be mindful of context length: Avoid excessively long prompts (>8K tokens) unless necessary.
- Use streaming: Makes responses feel faster and reduces perceived latency.
- Don't spam requests: The server is shared among ~15 students.
- Check the model name: Always use
qwen3.5-35b-a3bas the model parameter.
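If you do need to send a batch of requests (for example, evaluating many prompts), a simple client-side pause plus retry-with-backoff keeps the shared server responsive for everyone. This is only a sketch that reuses the `client` from the Python quick start; the prompts and delay values are placeholders:

```python
import time

prompts = ["First question ...", "Second question ..."]  # your own prompts

for prompt in prompts:
    for attempt in range(3):
        try:
            r = client.chat.completions.create(
                model="qwen3.5-35b-a3b",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=512,
            )
            print(r.choices[0].message.content)
            break
        except Exception:
            # Back off and retry if the server is busy or the request failed
            time.sleep(2 ** attempt)
    time.sleep(1)  # small pause between requests to be polite
```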
## Troubleshooting

| Issue | Solution |
|---|---|
| Connection refused | Check you're on the university network / VPN |
| Model not found | Use the model name `qwen3.5-35b-a3b` exactly |
| Slow responses | The model is shared — peak times may be slower |
| 401 Unauthorized | Ask your instructor for the API key |
| Response cut off | Increase `max_tokens` in your request |
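For the last issue, you can check programmatically whether a reply was cut off: the OpenAI-compatible response reports `finish_reason == "length"` when generation stopped because `max_tokens` was reached. A small check, assuming the `response` object from the Python example above:

```python
choice = response.choices[0]
if choice.finish_reason == "length":
    print("Output was truncated; retry with a larger max_tokens.")
else:
    print("Output finished normally, finish_reason =", choice.finish_reason)
```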