# Student Guide: Qwen3.5-35B-A3B Inference Server

## Overview

A **Qwen3.5-35B-A3B** language model is running on our GPU server. It's a Mixture-of-Experts model (35B total parameters, 3B active per token), providing fast and high-quality responses.

You can interact with it using the **OpenAI-compatible API**.

## Connection Details

| Parameter    | Value                                    |
|--------------|------------------------------------------|
| **Base URL** | `http://silicon.fhgr.ch:7080/v1`         |
| **Model**    | `qwen3.5-35b-a3b`                        |
| **API Key**  | *(ask your instructor; may be `EMPTY`)*  |

> **Note**: You must be on the university network or VPN to reach the server.

---

## Quick Start with Python

### 1. Install the OpenAI SDK

```bash
pip install openai
```

### 2. Simple Chat

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)
```

### 3. Streaming Responses

```python
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

---

## Quick Start with curl

```bash
curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```

---

## Recommended Parameters

| Parameter     | Recommended | Notes                                              |
|---------------|-------------|----------------------------------------------------|
| `temperature` | 0.7         | Lower = more deterministic, higher = more creative |
| `max_tokens`  | 1024–4096   | Increase for long-form output                      |
| `top_p`       | 0.95        | Nucleus sampling                                   |
| `stream`      | `true`      | Better UX for interactive use                      |

---

## Tips & Etiquette

- **Be mindful of context length**: Avoid excessively long prompts (>8K tokens) unless necessary.
- **Use streaming**: Makes responses feel faster and reduces perceived latency.
- **Don't spam requests**: The server is shared among ~15 students.
- **Check the model name**: Always use `qwen3.5-35b-a3b` as the model parameter.

---

## Troubleshooting

| Issue              | Solution                                              |
|--------------------|-------------------------------------------------------|
| Connection refused | Check you're on the university network / VPN          |
| Model not found    | Use model name `qwen3.5-35b-a3b` exactly              |
| Slow responses     | The model is shared, so peak times may be slower      |
| `401 Unauthorized` | Ask your instructor for the API key                   |
| Response cut off   | Increase `max_tokens` in your request                 |
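
---

## Example: Reusing the Recommended Parameters

To avoid retyping the model name and sampling settings in every call, the values from the Recommended Parameters table can be collected in one place. The sketch below is only an illustration: `build_chat_request` and `MODEL_NAME` are hypothetical names introduced here, not part of the OpenAI SDK or the server, and rejecting `max_tokens` below 1 is an assumption of this sketch rather than a documented server rule.

```python
from typing import Any, Optional

# Model name from the Connection Details table.
MODEL_NAME = "qwen3.5-35b-a3b"


def build_chat_request(
    prompt: str,
    system: Optional[str] = None,
    max_tokens: int = 1024,
    stream: bool = True,
) -> dict[str, Any]:
    """Hypothetical helper: build chat-completion keyword arguments
    using the values from the Recommended Parameters table."""
    if max_tokens < 1:
        # Assumption of this sketch: reject obviously invalid limits early.
        raise ValueError("max_tokens must be at least 1")

    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})

    return {
        "model": MODEL_NAME,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": 0.7,  # recommended default
        "top_p": 0.95,       # nucleus sampling
        "stream": stream,
    }
```

With the `client` from the Quick Start, a non-streaming call would then look like `client.chat.completions.create(**build_chat_request("Explain overfitting.", stream=False))`. Keeping the defaults in a single function also makes it easy to raise `max_tokens` for one request when a response gets cut off.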