The single TP=4 server on 4x L40S (no NVLink) pays a per-layer
all-reduce tax over PCIe. Since the A10B MoE fits in 2 cards at FP8,
run two TP=2 replicas (GPUs 0,1 / 2,3) behind a streaming load
balancer on the public port 7080 for better concurrent throughput.
- 14_start_replica_122b.sh: one TP=2 replica pinned to a GPU pair
- 15_start_replicas_122b.sh: launch both replicas + load balancer
- 16_start_loadbalancer.sh + lb_proxy.py: least-in-flight streaming
reverse proxy on 7080 -> replicas on 7091/7092 (clear of Open WebUI
on 7081)
- 17_stop_replicas_122b.sh: stop LB + both replicas
- 11_start_server_122b.sh: add --kv-cache-dtype fp8 (~2x more 128k KV
slots), --max-num-seqs 16, chunked prefill, and GPU memory utilization 0.95
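
A minimal sketch of the least-in-flight selection described for lb_proxy.py,
not the actual proxy code. The class and method names are hypothetical, the
backend URLs assume the replicas listen on localhost, and the counter is not
thread-safe (it assumes a single-threaded event loop).

```python
from contextlib import contextmanager


class LeastInFlightBalancer:
    """Track in-flight requests per backend and pick the least busy one."""

    def __init__(self, backends):
        # One counter per backend, all starting at zero.
        self.in_flight = {backend: 0 for backend in backends}

    def acquire(self):
        # min() returns the first backend with the lowest count,
        # so ties resolve in the order the backends were given.
        backend = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[backend] += 1
        return backend

    def release(self, backend):
        self.in_flight[backend] -= 1

    @contextmanager
    def route(self):
        # Hold the slot for the whole request, including the streamed body.
        backend = self.acquire()
        try:
            yield backend
        finally:
            self.release(backend)


# Ports 7091/7092 come from the commit message; the host is an assumption.
balancer = LeastInFlightBalancer(
    ["http://127.0.0.1:7091", "http://127.0.0.1:7092"]
)
```

For streaming responses the slot should be released only after the last
chunk has been forwarded, otherwise long generations look idle to the
balancer.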
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- 11_start_server_122b.sh: raise MAX_MODEL_LEN default to 131072 (128k),
make max-num-seqs overridable via MAX_NUM_SEQS (default 4 for low
concurrency / large KV cache). Echo concurrency in startup banner.
- 13_check_server.sh: new health check that polls /v1/models until ready,
then sends a properly-sized test prompt and reports OK/failure.
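
The polling step of the health check can be sketched in Python as follows.
This is an illustration, not the contents of 13_check_server.sh: the function
names, timeout, and interval are assumptions, and the response shape assumed
is the OpenAI-compatible model list (a "data" array of objects with an "id").

```python
import json
import time
import urllib.request


def _http_get_json(url):
    # Plain stdlib GET; raises URLError (an OSError) while the server is down.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_until_ready(base_url, timeout_s=600.0, interval_s=5.0,
                     fetch=_http_get_json, sleep=time.sleep):
    """Poll base_url/v1/models until it lists a model; return the model ids."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            payload = fetch(base_url + "/v1/models")
            ids = [model["id"] for model in payload.get("data", [])]
            if ids:
                return ids
        except (OSError, ValueError):
            pass  # connection refused or invalid JSON: server not ready yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{base_url} not ready after {timeout_s}s")
        sleep(interval_s)
```

The returned ids can then be used for the follow-up test prompt, so the model
name does not need to be hardcoded in the check.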
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Auto-detect available models from the vLLM API instead of hardcoding.
Extract code blocks by matching on language tag and picking the largest
block, avoiding false matches on short pip/run commands.
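
A sketch of the extraction rule described above, under the assumption that
model replies use Markdown-style fences. The function name and regular
expression are illustrative, not the code from this commit.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

# Opening fence, optional language tag, body (lazy), closing fence.
FENCE_RE = re.compile(
    FENCE + r"([\w+-]*)[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_code(text, language):
    """Return the largest fenced block tagged `language`, or None."""
    blocks = [
        body
        for tag, body in FENCE_RE.findall(text)
        if tag.lower() == language.lower()
    ]
    if not blocks:
        return None
    # Picking the longest block skips short snippets with the same tag.
    return max(blocks, key=len).strip()
```

Matching on the tag filters out shell blocks such as pip or run commands, and
taking the longest match handles replies that also include a short snippet in
the target language.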
Made-with: Cursor
- Add download script (10), start script (11), and background launcher (12)
for the 122B FP8 model using all 4 GPUs with TP=4
- Both models share port 7080; only one runs at a time
- Update README with dual-model hardware table, switching workflow, and
updated file overview
- Update STUDENT_GUIDE with both model names and discovery instructions
Made-with: Cursor
- Add Open WebUI scripts (06-09) for a server-hosted ChatGPT-like interface
on port 7081, connected to the vLLM backend
- Add context window management to chat (auto-trim, token counter, progress bar)
- Add terminal output panel to file editor for running Python/LaTeX files
- Update README with Open WebUI setup, architecture diagram, and troubleshooting
- Update STUDENT_GUIDE with step-by-step Open WebUI login instructions
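
One way the auto-trim and progress bar value could work, as a rough sketch.
The function names are hypothetical, and the token estimate (about 4
characters per token) is an assumed heuristic, not the tokenizer used by the
app.

```python
def estimate_tokens(text):
    # Assumption: roughly 4 characters per token; swap in a real tokenizer.
    return max(1, len(text) // 4)


def total_tokens(messages, count=estimate_tokens):
    return sum(count(message["content"]) for message in messages)


def trim_history(messages, max_tokens, count=estimate_tokens):
    """Drop the oldest non-system messages until the history fits."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Always keep the most recent message, even if it alone is over budget.
    while len(rest) > 1 and total_tokens(system + rest, count) > max_tokens:
        rest.pop(0)
    return system + rest


def usage_fraction(messages, max_tokens, count=estimate_tokens):
    """Value in [0, 1] suitable for a progress bar."""
    return min(1.0, total_tokens(messages, count) / max_tokens)
```

This version returns system messages first and leaves the input list
unmodified, so the full conversation can still be displayed while only the
trimmed copy is sent to the model.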
Made-with: Cursor
Add Streamlit app section with setup, usage, and sidebar controls.
Document nightly Docker image requirement, scp workflow for server
sync, and practical troubleshooting tips from setup experience.
Made-with: Cursor
Add a thinking mode toggle and temperature, max tokens, top_p, and
presence penalty sliders to the Streamlit sidebar. The parameters apply
to both chat and file editor generation.
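
A sketch of how the sidebar values might map onto a chat completion request
body. The helper name and default values are hypothetical. The sampling
fields are standard OpenAI-compatible parameters; passing the thinking toggle
as enable_thinking inside chat_template_kwargs is an assumption about how the
Qwen chat template is driven through vLLM.

```python
def build_chat_payload(model, messages, *, thinking=True, temperature=0.7,
                       max_tokens=1024, top_p=0.9, presence_penalty=0.0):
    """Assemble a request body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "top_p": top_p,
        "presence_penalty": presence_penalty,
        # Assumption: the chat template reads this flag to enable reasoning.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
```

Building the body in one place lets chat and file editor generation share the
same settings.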
Made-with: Cursor
Add scripts to build the container, download the model, and serve
Qwen3.5-35B-A3B via vLLM with an OpenAI-compatible API on port 7080.
Configured for 2x NVIDIA L40S GPUs with tensor parallelism, supporting
~15 concurrent students.
Made-with: Cursor