Student Guide — Qwen3.5-35B-A3B Inference Server

Overview

A Qwen3.5-35B-A3B language model is running on our GPU server. It's a Mixture-of-Experts model (35B total parameters, 3B active per token), providing fast and high-quality responses. You can interact with it using the OpenAI-compatible API.

Connection Details

| Parameter | Value |
|-----------|-------|
| Base URL  | http://silicon.fhgr.ch:7080/v1 |
| Model     | qwen3.5-35b-a3b |
| API Key   | (ask your instructor — may be EMPTY) |

Note: You must be on the university network or VPN to reach the server.
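
Before writing any application code, you can verify that the server is reachable by asking it which models it serves. This is a minimal sketch using the same base URL and key as in the table above; you should see qwen3.5-35b-a3b in the output.

from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

# List the models exposed by the server.
for model in client.models.list():
    print(model.id)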


Quick Start with Python

1. Install the OpenAI SDK

pip install openai

2. Simple Chat

from openai import OpenAI

client = OpenAI(
    base_url="http://silicon.fhgr.ch:7080/v1",
    api_key="EMPTY",  # replace if your instructor set a key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)
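
Because the server is shared, it is worth keeping an eye on how many tokens each call consumes. The response object carries the standard OpenAI usage counts, which you can print after the call above:

# Token accounting for the request above.
usage = response.usage
print(f"prompt: {usage.prompt_tokens}, completion: {usage.completion_tokens}, total: {usage.total_tokens}")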

3. Streaming Responses

stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Write a haiku about machine learning."},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Quick Start with curl

curl http://silicon.fhgr.ch:7080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [
      {"role": "user", "content": "What is the capital of Switzerland?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

Recommended Parameters

| Parameter   | Recommended | Notes |
|-------------|-------------|-------|
| temperature | 0.7         | Lower = more deterministic, higher = creative |
| max_tokens  | 1024–4096   | Increase for long-form output |
| top_p       | 0.95        | Nucleus sampling |
| stream      | true        | Better UX for interactive use |
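
As a concrete reference, here is a single request that combines all of the recommended values above (a minimal sketch reusing the client and model name from the Quick Start):

stream = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[{"role": "user", "content": "Summarize overfitting in two sentences."}],
    max_tokens=1024,
    temperature=0.7,
    top_p=0.95,
    stream=True,  # print tokens as they arrive
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()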

Tips & Etiquette

  • Be mindful of context length: Avoid excessively long prompts (>8K tokens) unless necessary.
  • Use streaming: Tokens appear as they are generated, which reduces perceived latency for interactive use.
  • Don't spam requests: The server is shared among ~15 students; if a request fails under load, back off and retry (see the sketch after this list).
  • Check the model name: Always use qwen3.5-35b-a3b as the model parameter.
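
When many students use the server at once, individual requests can occasionally fail or time out. A simple pattern is to retry with exponential backoff. This is only a sketch: the exception types are the ones exported by the openai package, and the retry count and delays are arbitrary choices.

import time

from openai import OpenAI, APIConnectionError, APITimeoutError, RateLimitError

client = OpenAI(base_url="http://silicon.fhgr.ch:7080/v1", api_key="EMPTY")

def chat_with_retry(messages, retries=3):
    """Send a chat request, retrying with exponential backoff on transient errors."""
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="qwen3.5-35b-a3b",
                messages=messages,
                max_tokens=1024,
                temperature=0.7,
            )
        except (APIConnectionError, APITimeoutError, RateLimitError):
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts

response = chat_with_retry([{"role": "user", "content": "Hello!"}])
print(response.choices[0].message.content)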

Troubleshooting

| Issue | Solution |
|-------|----------|
| Connection refused | Check you're on the university network / VPN |
| Model not found | Use the model name qwen3.5-35b-a3b exactly |
| Slow responses | The model is shared — peak times may be slower |
| 401 Unauthorized | Ask your instructor for the API key |
| Response cut off | Increase max_tokens in your request |
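
If a response looks cut off, you can confirm it from the response itself: each choice reports a finish_reason, and "length" means generation stopped because it hit your max_tokens limit. A small sketch reusing the client from the Quick Start:

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[{"role": "user", "content": "Explain backpropagation step by step."}],
    max_tokens=64,  # deliberately small to demonstrate truncation
)

if response.choices[0].finish_reason == "length":
    print("Response was truncated; retry with a larger max_tokens.")
else:
    print(response.choices[0].message.content)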