LLM_Inferenz_Server_1

herzogfloria/LLM_Inferenz_Server_1

Commit Graph

Author	SHA1	Message	Date

herzogflorian

51726f9351

Add 2-replica TP=2 serving for 122B + FP8 KV cache throughput tuning

The single TP=4 server on 4x L40S (no NVLink) pays a per-layer
all-reduce tax over PCIe. Since the A10B MoE fits in 2 cards at FP8,
run two TP=2 replicas (GPUs 0,1 / 2,3) behind a streaming load
balancer on the public port 7080 for better concurrent throughput.
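
The replica layout described above can be sketched in Python. This is an illustrative sketch only, not the contents of the commit's scripts: the names `Replica` and `plan_replicas` are hypothetical, and it assumes GPU pinning is done through the `CUDA_VISIBLE_DEVICES` environment variable. The GPU pairs (0,1 / 2,3) and the replica ports 7091/7092 are taken from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """One tensor-parallel replica pinned to a fixed group of GPUs (hypothetical type)."""
    name: str
    gpus: tuple[int, ...]
    port: int

    def env(self) -> dict[str, str]:
        # Assumption: the replica process is restricted to its GPUs
        # via CUDA_VISIBLE_DEVICES.
        return {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in self.gpus)}


def plan_replicas(gpu_ids: list[int], tp_size: int, base_port: int) -> list[Replica]:
    """Split gpu_ids into consecutive groups of tp_size, one replica per group."""
    if tp_size <= 0 or len(gpu_ids) % tp_size != 0:
        raise ValueError("number of GPUs must be a positive multiple of tp_size")
    groups = [tuple(gpu_ids[i:i + tp_size]) for i in range(0, len(gpu_ids), tp_size)]
    return [
        Replica(name=f"replica{idx}", gpus=group, port=base_port + idx)
        for idx, group in enumerate(groups)
    ]


if __name__ == "__main__":
    # 4 GPUs with TP=2 give two replicas, matching the layout in the commit message.
    for replica in plan_replicas([0, 1, 2, 3], tp_size=2, base_port=7091):
        print(replica.name, replica.env(), replica.port)
```

With TP=2 each all-reduce spans only two cards instead of four, which is the motivation the message gives for trading one TP=4 server for two replicas.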

- 14_start_replica_122b.sh: one TP=2 replica pinned to a GPU pair
- 15_start_replicas_122b.sh: launch both replicas + load balancer
- 16_start_loadbalancer.sh + lb_proxy.py: least-in-flight streaming
  reverse proxy on 7080 -> replicas on 7091/7092 (clear of Open WebUI
  on 7081)
- 17_stop_replicas_122b.sh: stop LB + both replicas
- 11_start_server_122b.sh: add --kv-cache-dtype fp8 (~2x more 128k KV
  slots), --max-num-seqs 16, chunked prefill, gpu-util 0.95
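
The least-in-flight policy named for lb_proxy.py can be illustrated with a minimal sketch. The actual lb_proxy.py is not shown in this commit view, so the class `LeastInFlightBalancer` and its methods are hypothetical, and the loopback backend URLs are an assumption built from the replica ports 7091/7092. The idea is to route each request to the replica with the fewest open requests and to hold the slot until the streamed response has finished.

```python
import threading
from contextlib import contextmanager


class LeastInFlightBalancer:
    """Pick the backend with the fewest requests currently in flight (hypothetical sketch)."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._in_flight = {backend: 0 for backend in backends}
        self._lock = threading.Lock()

    @contextmanager
    def acquire(self):
        # Choose and reserve a backend atomically so that concurrent
        # requests do not all pick the same replica.
        with self._lock:
            backend = min(self._in_flight, key=self._in_flight.get)
            self._in_flight[backend] += 1
        try:
            yield backend
        finally:
            # Release only when the caller leaves the block, i.e. after
            # the whole streamed response has been relayed.
            with self._lock:
                self._in_flight[backend] -= 1

    def snapshot(self):
        """Return a copy of the current in-flight counts."""
        with self._lock:
            return dict(self._in_flight)


if __name__ == "__main__":
    # Assumed backend addresses; only the ports come from the commit message.
    lb = LeastInFlightBalancer(["http://127.0.0.1:7091", "http://127.0.0.1:7092"])
    with lb.acquire() as first, lb.acquire() as second:
        print(first, second, lb.snapshot())
```

Counting in-flight requests rather than rotating round-robin matters for LLM serving because generation times vary widely: one long 128k-context request should not keep receiving new neighbours while the other replica sits idle.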

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-09 15:54:38 +02:00

1 Commit