# Student Guide — Qwen3.5-35B-A3B Inference Server

## Overview

A **Qwen3.5-35B-A3B** language model is running on our GPU server. It's a Mixture-of-Experts model (35B total parameters, 3B active per token), providing fast and high-quality responses.

There are **three ways** to interact with the model:

1. **Open WebUI** — ChatGPT-like interface in your browser (easiest)
2. **Streamlit App** — Local app with chat, file editor, and code execution
3. **Python SDK / curl** — Programmatic access via the OpenAI-compatible API

> **Note**: You must be on the university network or VPN to reach the server.

## Connection Details

| Parameter        | Value                                    |
|------------------|------------------------------------------|
| **Open WebUI**   | `http://silicon.fhgr.ch:7081`            |
| **API Base URL** | `http://silicon.fhgr.ch:7080/v1`         |
| **Model**        | `qwen3.5-35b-a3b`                        |
| **API Key**      | *(ask your instructor — may be `EMPTY`)* |

---

## Option 1: Open WebUI (Recommended)

The easiest way to chat with the model — no installation required.

### Getting Started

1. Make sure you are connected to the **university network** (or VPN).
2. Open your browser and go to **http://silicon.fhgr.ch:7081**
3. Click **"Sign Up"** to create a new account:
   - Enter your **name** (e.g. your first and last name)
   - Enter your **email** (use your university email)
   - Choose a **password**
   - Click **"Create Account"**
4. After signing up you are logged in automatically.
5. Select the model **qwen3.5-35b-a3b** from the model dropdown at the top.
6. Type a message and press Enter — you're chatting with the LLM.

### Returning Later

- Go to **http://silicon.fhgr.ch:7081** and click **"Sign In"**.
- Enter the email and password you used during sign-up.
- All your previous chats are still there.
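
Before setting up a full client, it can help to confirm the server is reachable from your machine. The sketch below uses only the Python standard library to query the OpenAI-compatible `/v1/models` endpoint from the table above; the function name is our own, not part of any SDK:

```python
import json
import urllib.request
import urllib.error

def list_models(base_url: str, api_key: str = "EMPTY"):
    """Return the model IDs served at base_url, or None if unreachable."""
    req = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            data = json.load(resp)
        return [m["id"] for m in data.get("data", [])]
    except (urllib.error.URLError, OSError):
        return None

# Usage (on the university network):
#   print(list_models("http://silicon.fhgr.ch:7080/v1"))
```

If this returns `None`, check your VPN connection before debugging anything else.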
### Features

- **Chat history** — all conversations are saved on the server and persist across sessions
- **Markdown rendering** with syntax-highlighted code blocks
- **Model selector** — auto-discovers available models from the server
- **Conversation branching** — edit previous messages and explore alternative responses
- **File upload** — attach files to your messages for the model to analyze
- **Search** — search across all your past conversations

### Tips

- Your account and chat history are stored on the server. You can log in from any device on the university network.
- If you forget your password, ask your instructor to reset it via the Admin Panel.
- The model works best when you provide clear, specific instructions.
- For code tasks, mention the programming language explicitly (e.g. "Write a Python function that...").
- Long conversations use more context. Start a **New Chat** (top-left button) when switching topics to get faster, more focused responses.

---

## Option 2: Streamlit App (Chat + File Editor)

A local app with chat, file editing, and Python/LaTeX execution. See the [Streamlit section below](#streamlit-chat--file-editor-app) for setup.

---

## Option 3: Python SDK / curl

For programmatic access and scripting.

### Quick Start with Python

#### 1. Install the OpenAI SDK

```bash
pip install openai
```

#### 2. Simple Chat

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)
```

#### 3. Streaming Responses

```python
stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

---

### Quick Start with curl

```bash
curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```

---

## Recommended Parameters

| Parameter     | Recommended | Notes                                         |
|---------------|-------------|-----------------------------------------------|
| `temperature` | 0.7         | Lower = more deterministic, higher = creative |
| `max_tokens`  | 1024–4096   | Increase for long-form output                 |
| `top_p`       | 0.95        | Nucleus sampling                              |
| `stream`      | `true`      | Better UX for interactive use                 |

---

## Tips & Etiquette

- **Be mindful of context length**: Avoid excessively long prompts (>8K tokens) unless necessary.
- **Use streaming**: Makes responses feel faster and reduces perceived latency.
- **Don't spam requests**: The server is shared among ~15 students.
- **Check the model name**: Always use `qwen3.5-35b-a3b` as the model parameter.

---

## Streamlit Chat & File Editor App

A web UI is included for chatting with the model and editing files. It runs on your own machine and connects to the GPU server.

### Setup

```bash
# Clone the repository
git clone https://gitea.fhgr.ch/herzogfloria/LLM_Inferenz_Server_1.git
cd LLM_Inferenz_Server_1

# Create a virtual environment and install dependencies
python3 -m venv .venv
source .venv/bin/activate   # macOS / Linux
# .venv\Scripts\activate    # Windows

pip install -r requirements.txt
```

### Run

```bash
streamlit run app.py
```

Opens at `http://localhost:8501` in your browser.
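
The app's "Save code" button (described under Features below) relies on stripping the markdown fence around the model's reply before writing it to a workspace file. Here is a minimal sketch of that kind of extraction; the function name and regex are illustrative, not the app's actual code:

```python
import re

def extract_code(response_text: str) -> str:
    """Return the contents of the first fenced code block in an LLM reply,
    or the raw text if the reply contains no fences."""
    match = re.search(r"```[\w+-]*\n(.*?)```", response_text, re.DOTALL)
    if match:
        return match.group(1).rstrip() + "\n"
    return response_text.strip() + "\n"

reply = "Here you go:\n```python\nprint('hello')\n```\nEnjoy!"
print(extract_code(reply))  # → print('hello')
```

Extracting only the first fenced block is a deliberate simplification: replies often contain explanatory prose around the code, and the fence is the most reliable boundary.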
### Features

**Chat Tab**

- Conversational interface with streaming responses
- "Save code" button extracts code from the LLM response and saves it to a workspace file (strips markdown formatting automatically)

**File Editor Tab**

- Create and edit `.py`, `.tex`, `.html`, or any text file
- Syntax-highlighted preview of file content
- "Generate with LLM" button: describe a change in natural language and the model rewrites the file (e.g. "add error handling", "fix the LaTeX formatting", "translate comments to German")

**Sidebar Controls**

- **Connection**: API Base URL and API Key
- **LLM Parameters**: Adjustable for each request

| Parameter        | Default | What it does                                                         |
|------------------|---------|----------------------------------------------------------------------|
| Thinking Mode    | Off     | Toggle chain-of-thought reasoning (better for complex tasks, slower) |
| Temperature      | 0.7     | Lower = predictable, higher = creative                               |
| Max Tokens       | 4096    | Maximum response length                                              |
| Top P            | 0.95    | Nucleus sampling threshold                                           |
| Presence Penalty | 0.0     | Encourage diverse topics                                             |

- **File Manager**: Create new files and switch between them

All generated files are stored in a `workspace/` folder next to `app.py`.

> **Tip**: The app runs entirely on your local machine. Only the LLM requests
> go to the server — your files stay local.

---

## Thinking Mode

By default, the model "thinks" before answering (internal chain-of-thought). This is great for complex reasoning but adds latency for simple questions.
To disable thinking and get faster direct responses, add this to your API call:

```python
response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[...],
    max_tokens=1024,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```

---

## Troubleshooting

| Issue                      | Solution                                                  |
|----------------------------|-----------------------------------------------------------|
| Connection refused         | Check you're on the university network / VPN              |
| Model not found            | Use model name `qwen3.5-35b-a3b` exactly                  |
| Slow responses             | The model is shared — peak times may be slower            |
| `401 Unauthorized`         | Ask your instructor for the API key                       |
| Response cut off           | Increase `max_tokens` in your request                     |
| Open WebUI login fails     | Make sure you created an account first (Sign Up)          |
| Open WebUI shows no models | The vLLM server may still be loading — wait a few minutes |
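
Several of the issues above (connection refused, slow responses at peak times, `401 Unauthorized`) can also be handled in your own scripts. The following standard-library sketch retries transient failures with a simple backoff but gives up immediately on a bad API key; the function name and defaults are our own, not part of the OpenAI SDK:

```python
import time
import urllib.request
import urllib.error

def post_with_retry(url: str, payload: bytes, headers: dict,
                    attempts: int = 3, backoff: float = 2.0) -> bytes:
    """POST with simple backoff; returns the response body or raises
    the last error. Sketch for scripting against the shared server."""
    for attempt in range(attempts):
        try:
            req = urllib.request.Request(url, data=payload, headers=headers)
            with urllib.request.urlopen(req, timeout=120) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code == 401:   # wrong or missing API key: retrying won't help
                raise
            last_err = err        # e.g. 5xx while the server is overloaded
        except urllib.error.URLError as err:
            last_err = err        # connection refused: likely not on the VPN
        time.sleep(backoff * (attempt + 1))
    raise last_err

# Usage (on the university network):
#   body = post_with_retry(
#       "http://silicon.fhgr.ch:7080/v1/chat/completions",
#       json.dumps({"model": "qwen3.5-35b-a3b", "messages": [...]}).encode(),
#       {"Content-Type": "application/json", "Authorization": "Bearer EMPTY"},
#   )
```

Keep the retry count low: the server is shared, and hammering it with automatic retries makes peak-time slowness worse for everyone.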