Scripts to build the container, download the model, and serve Qwen3.5-35B-A3B via vLLM with an OpenAI-compatible API on port 7080. Configured for 2x NVIDIA L40S GPUs with tensor parallelism, supporting ~15 concurrent students. Made-with: Cursor
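The description above can be sketched as a build-download-serve workflow. This is a minimal sketch, not the repo's actual scripts: the `.def`/`.sif` file names and the Hugging Face repo id `Qwen/Qwen3.5-35B-A3B` are assumptions, while `apptainer build`, `apptainer run --nv`, `huggingface-cli download`, and the vLLM flags `--model`, `--port`, and `--tensor-parallel-size` are real commands and options.

```shell
# Build the image from the definition file below
# (file names here are assumptions).
apptainer build vllm-qwen.sif vllm-qwen.def

# Pre-download the weights into the HF cache used by the container
# (the exact Hugging Face repo id is an assumption).
HF_HOME=/tmp/hf_cache huggingface-cli download Qwen/Qwen3.5-35B-A3B

# Serve on port 7080 across both L40S GPUs with tensor parallelism.
apptainer run --nv vllm-qwen.sif \
    --model Qwen/Qwen3.5-35B-A3B \
    --port 7080 \
    --tensor-parallel-size 2
```

`--nv` is what exposes the host NVIDIA GPUs inside the Apptainer container; without it, vLLM will fail to find any CUDA devices.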
24 lines
750 B
Apptainer definition
Bootstrap: docker
From: vllm/vllm-openai:latest

%labels
    Author herzogfloria
    Description vLLM nightly inference server for Qwen3.5-35B-A3B
    Version 2.0

%environment
    export HF_HOME=/tmp/hf_cache
    export VLLM_USAGE_SOURCE=production

%post
    apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*
    pip install --no-cache-dir vllm --extra-index-url https://wheels.vllm.ai/nightly
    pip install --no-cache-dir "transformers @ git+https://github.com/huggingface/transformers.git@main"
    pip install --no-cache-dir "huggingface_hub[cli]"

%runscript
    exec python3 -m vllm.entrypoints.openai.api_server "$@"

%help
    Apptainer container for serving Qwen3.5-35B-A3B via vLLM (nightly).
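Once the `%runscript` above is serving, students can talk to it through the standard OpenAI-compatible `/v1/chat/completions` endpoint. A minimal stdlib-only client sketch, assuming the server runs on `localhost:7080` and that the served model name is `Qwen3.5-35B-A3B` (both are assumptions; check `GET /v1/models` for the actual name):

```python
import json
import urllib.request

def build_chat_request(prompt, model="Qwen3.5-35B-A3B",
                       host="http://localhost:7080"):
    """Hypothetical helper: assemble an OpenAI-style chat request.

    Returns the endpoint URL and the JSON-encoded request body.
    """
    url = f"{host}/v1/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return url, json.dumps(body).encode("utf-8")

def ask(prompt):
    """Send the request to the running vLLM server and return the reply text."""
    url, data = build_chat_request(prompt)
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Explain tensor parallelism in one sentence."))
```

The same endpoint also works with the official `openai` Python client by pointing `base_url` at `http://localhost:7080/v1` with any placeholder API key.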