LLM_Inferenz_Server_1

Author	SHA1	Message	Date
herzogflorian	51726f9351	Add 2-replica TP=2 serving for 122B + FP8 KV cache throughput tuning The single TP=4 server on 4x L40S (no NVLink) pays a per-layer all-reduce tax over PCIe. Since the A10B MoE fits in 2 cards at FP8, run two TP=2 replicas (GPUs 0,1 / 2,3) behind a streaming load balancer on the public port 7080 for better concurrent throughput. - 14_start_replica_122b.sh: one TP=2 replica pinned to a GPU pair - 15_start_replicas_122b.sh: launch both replicas + load balancer - 16_start_loadbalancer.sh + lb_proxy.py: least-in-flight streaming reverse proxy on 7080 -> replicas on 7091/7092 (clear of Open WebUI on 7081) - 17_stop_replicas_122b.sh: stop LB + both replicas - 11_start_server_122b.sh: add --kv-cache-dtype fp8 (~2x more 128k KV slots), --max-num-seqs 16, chunked prefill, gpu-util 0.95 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 15:54:38 +02:00
herzogflorian	b9eaf2df18	Default 122B server to 128k context, add health-check script - 11_start_server_122b.sh: raise MAX_MODEL_LEN default to 131072 (128k), make max-num-seqs overridable via MAX_NUM_SEQS (default 4 for low concurrency / large KV cache). Echo concurrency in startup banner. - 13_check_server.sh: new health check that polls /v1/models until ready, then sends a properly-sized test prompt and reports OK/failure. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 10:29:13 +02:00
herzogflorian	030d9f7935	Replace hardcoded username with placeholder in README Made-with: Cursor	2026-03-02 20:59:27 +01:00
herzogflorian	a5657c3c1f	Add dynamic model discovery and improve code extraction in app Auto-detect available models from the vLLM API instead of hardcoding. Extract code blocks by matching on language tag and picking the largest block, avoiding false matches on short pip/run commands. Made-with: Cursor	2026-03-02 20:03:45 +01:00
herzogflorian	a9ed1060cc	Fix 122B model download to use Python API instead of huggingface-cli Made-with: Cursor	2026-03-02 20:03:39 +01:00
herzogflorian	eff76401ee	Add Qwen3.5-122B-A10B-FP8 model support - Add download script (10), start script (11), and background launcher (12) for the 122B FP8 model using all 4 GPUs with TP=4 - Both models share port 7080; only one runs at a time - Update README with dual-model hardware table, switching workflow, and updated file overview - Update STUDENT_GUIDE with both model names and discovery instructions Made-with: Cursor	2026-03-02 19:00:32 +01:00
herzogflorian	f4fdaab732	Add Open WebUI integration and enhance Streamlit app - Add Open WebUI scripts (06-09) for server-hosted ChatGPT-like interface connected to the vLLM backend on port 7081 - Add context window management to chat (auto-trim, token counter, progress bar) - Add terminal output panel to file editor for running Python/LaTeX files - Update README with Open WebUI setup, architecture diagram, and troubleshooting - Update STUDENT_GUIDE with step-by-step Open WebUI login instructions Made-with: Cursor	2026-03-02 18:48:51 +01:00
herzogflorian	d59285fe69	Update student guide with full app.py documentation Add clone/venv setup instructions, feature descriptions for both tabs, sidebar parameter table, and clarify that files stay local. Made-with: Cursor	2026-03-02 16:43:21 +01:00
herzogflorian	deee5038d1	Update README to reflect current project state Add Streamlit app section with setup, usage, and sidebar controls. Document nightly Docker image requirement, scp workflow for server sync, and practical troubleshooting tips from setup experience. Made-with: Cursor	2026-03-02 16:42:33 +01:00
herzogflorian	12f9e3ac9b	Add LLM parameter controls to sidebar Thinking mode toggle, temperature, max tokens, top_p, and presence penalty sliders in the Streamlit sidebar. Parameters apply to both chat and file editor generation. Made-with: Cursor	2026-03-02 16:41:05 +01:00
herzogflorian	9e1e0c0751	Add Streamlit chat app, update container to vLLM nightly - Add app.py: Streamlit UI with chat and file editor tabs - Add requirements.txt: streamlit + openai dependencies - Update vllm_qwen.def: use nightly image for Qwen3.5 support - Update README.md: reflect 35B-A3B model, correct script names - Update STUDENT_GUIDE.md: add app usage and thinking mode docs - Update .gitignore: exclude .venv/ and workspace/ Made-with: Cursor	2026-03-02 16:30:04 +01:00
herzogflorian	076001b07f	Add vLLM inference setup for Qwen3.5-35B-A3B on Apptainer Scripts to build container, download model, and serve Qwen3.5-35B-A3B via vLLM with OpenAI-compatible API on port 7080. Configured for 2x NVIDIA L40S GPUs with tensor parallelism, supporting ~15 concurrent students. Made-with: Cursor	2026-03-02 14:43:39 +01:00
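
Commit a5657c3c1f describes extracting code by matching on the language tag and picking the largest block, so that short pip/run commands are not mistaken for the generated file. The following is a minimal sketch of that idea, not the code from app.py: the function name `extract_largest_block`, the regular expression, and the default tag `python` are all assumptions made for illustration.

```python
import re

# Build the fence marker programmatically so this example can itself
# live inside a fenced block.
FENCE = "`" * 3

# Hypothetical pattern: opening fence, optional language tag, body, closing fence.
FENCE_RE = re.compile(FENCE + r"([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_largest_block(text, language="python"):
    """Return the longest fenced block tagged `language`, or None if there is none."""
    blocks = [
        body
        for tag, body in FENCE_RE.findall(text)
        if tag.lower() == language.lower()
    ]
    if not blocks:
        return None
    # The largest block is most likely the full file; short blocks are
    # usually install or run commands.
    return max(blocks, key=len).strip()
```

Filtering on the tag first drops shell snippets; taking the longest remaining block handles replies that contain both a short usage example and the full program.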

12 Commits
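
Commit 51726f9351 mentions a least-in-flight reverse proxy on port 7080 in front of two replicas on 7091/7092. The sketch below shows only the backend-selection policy under stated assumptions; it is not the contents of lb_proxy.py. The class name `LeastInFlightBalancer`, its methods, and the `127.0.0.1` host used in the usage note are hypothetical.

```python
import threading
from contextlib import contextmanager


class LeastInFlightBalancer:
    """Pick the backend with the fewest requests currently in progress."""

    def __init__(self, backends):
        # One in-flight counter per backend, in the order given.
        self._in_flight = {backend: 0 for backend in backends}
        self._lock = threading.Lock()

    @contextmanager
    def acquire(self):
        """Reserve a backend for one request and release it when the block exits."""
        with self._lock:
            # Ties go to the backend listed first (dict insertion order).
            backend = min(self._in_flight, key=self._in_flight.get)
            self._in_flight[backend] += 1
        try:
            yield backend
        finally:
            # Runs on normal exit and on errors, so a failed or aborted
            # stream does not leave the counter stuck.
            with self._lock:
                self._in_flight[backend] -= 1

    def snapshot(self):
        """Return a copy of the current in-flight counts."""
        with self._lock:
            return dict(self._in_flight)
```

A proxy handler would wrap the whole streamed response in `with balancer.acquire() as backend:` so that a long generation keeps counting against its replica until the last chunk is sent, for example with `LeastInFlightBalancer(["http://127.0.0.1:7091", "http://127.0.0.1:7092"])`. Counting in-flight requests rather than rotating round-robin matters for LLM serving because request durations vary widely with output length.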