herzogflorian 076001b07f Add vLLM inference setup for Qwen3.5-35B-A3B on Apptainer
Scripts to build container, download model, and serve Qwen3.5-35B-A3B
via vLLM with OpenAI-compatible API on port 7080. Configured for 2x
NVIDIA L40S GPUs with tensor parallelism, supporting ~15 concurrent
students.

Made-with: Cursor
2026-03-02 14:43:39 +01:00


# LLM Local — Qwen3.5-27B Inference Server
Self-hosted LLM inference for ~15 concurrent students using **Qwen3.5-27B**,
served via **vLLM** inside an **Apptainer** container on a GPU server.
## Architecture
```
Students (OpenAI SDK / curl)
            │
            ▼
┌─────────────────────────┐
│  silicon.fhgr.ch:7080   │
│  OpenAI-compatible API  │
├─────────────────────────┤
│       vLLM Server       │
│  (Apptainer container)  │
├─────────────────────────┤
│   Qwen3.5-27B weights   │
│     (bind-mounted)      │
├─────────────────────────┤
│       NVIDIA GPU        │
└─────────────────────────┘
```
## Prerequisites
- **GPU**: NVIDIA GPU with >=80 GB VRAM (A100-80GB or H100 recommended).
Qwen3.5-27B in BF16 requires ~56 GB VRAM plus KV cache overhead.
- **Apptainer** (formerly Singularity) installed on the server.
- **NVIDIA drivers** + **nvidia-container-cli** for GPU passthrough.
- **~60 GB disk space** for model weights + ~15 GB for the container image.
- **Network**: Students must be on the university network or VPN.
## Hardware Sizing
| Component | Minimum | Recommended |
|-----------|----------------|-----------------|
| GPU VRAM | 80 GB (1× A100)| 80 GB (1× H100) |
| RAM | 64 GB | 128 GB |
| Disk | 100 GB free | 200 GB free |
> **If your GPU has less than 80 GB VRAM**, you have two options:
> 1. Use a **quantized** version (e.g., AWQ/GPTQ 4-bit — ~16 GB VRAM)
> 2. Use **tensor parallelism** across multiple GPUs (set `TENSOR_PARALLEL=2`)
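The VRAM figures above follow from simple arithmetic. This sketch (assuming "27B" means roughly 27 billion parameters; real checkpoints vary slightly) shows where the ~56 GB BF16 and ~16 GB 4-bit estimates come from:

```python
# Back-of-the-envelope VRAM estimate for model weights.
# Assumption: "27B" ≈ 27e9 parameters; actual checkpoint sizes vary.

def weights_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

bf16 = weights_gb(27e9, 2.0)   # BF16: 2 bytes per parameter
awq4 = weights_gb(27e9, 0.5)   # 4-bit AWQ/GPTQ: 0.5 bytes per parameter

print(f"BF16 weights : ~{bf16:.0f} GB")   # ~54 GB, plus KV cache -> ~56+ GB total
print(f"4-bit weights: ~{awq4:.0f} GB")   # ~14 GB, plus overhead -> ~16 GB total
```

The gap between weight memory and total VRAM is what vLLM uses for the KV cache, which is why a GPU with exactly the weight size is not enough.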
---
## Step-by-Step Setup
### Step 0: SSH into the Server
```bash
ssh herzogfloria@silicon.fhgr.ch
```
### Step 1: Clone This Repository
```bash
# Or copy the files to the server
git clone <your-repo-url> ~/LLM_local
cd ~/LLM_local
chmod +x *.sh
```
### Step 2: Check GPU and Environment
```bash
# Verify GPU is visible
nvidia-smi
# Verify Apptainer is installed
apptainer --version
# Check available disk space
df -h ~
```
### Step 3: Download the Model (~60 GB)
```bash
# Install huggingface-cli if not available
pip install --user "huggingface_hub[cli]"
# Download Qwen3.5-27B
bash 01_download_model.sh
# Default target: ~/models/Qwen3.5-27B
```
This downloads the full BF16 weights. Takes 20-60 minutes depending on bandwidth.
### Step 4: Build the Apptainer Container
```bash
bash 02_build_container.sh
```
This pulls the `vllm/vllm-openai:latest` Docker image and converts it to a `.sif` file.
Takes 10-20 minutes. The resulting `vllm_qwen.sif` is ~12-15 GB.
> **Tip**: If building fails due to network/proxy issues, you can pull the Docker image
> first and convert manually:
> ```bash
> apptainer pull docker://vllm/vllm-openai:latest
> ```
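The repository ships the definition file `vllm_qwen.def`. For a Docker-based build like this one, such a file is typically little more than the following sketch (illustrative — check the actual file in the repo):

```
Bootstrap: docker
From: vllm/vllm-openai:latest

%labels
    Purpose vLLM inference server for Qwen3.5-27B
```

Given a definition file, the equivalent manual build is `apptainer build vllm_qwen.sif vllm_qwen.def`.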
### Step 5: Start the Server
**Interactive (foreground):**
```bash
bash 03_start_server.sh
```
**Background (recommended for production):**
```bash
bash 04_start_server_background.sh
```
The server takes 2-5 minutes to load the model into GPU memory. Monitor with:
```bash
tail -f logs/vllm_server_*.log
```
Look for a line like this (Apptainer does not remap ports, so the port matches your `PORT` setting — 7080 by default):
```
INFO: Uvicorn running on http://0.0.0.0:7080
```
### Step 6: Test the Server
```bash
# Quick health check
curl http://localhost:7080/v1/models
# Full test
pip install openai
python test_server.py
```
### Step 7: Share with Students
Distribute the `STUDENT_GUIDE.md` file or share the connection details:
- **27B Base URL**: `http://silicon.fhgr.ch:7080/v1` — model name: `qwen3.5-27b`
- **35B Base URL**: `http://silicon.fhgr.ch:7081/v1` — model name: `qwen3.5-35b-a3b`
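Students who want to verify connectivity before installing the OpenAI SDK can send a request with nothing but the Python standard library. This is a sketch against the 27B endpoint above, using the standard OpenAI chat-completions payload format:

```python
import json
import urllib.request

BASE_URL = "http://silicon.fhgr.ch:7080/v1"  # 27B endpoint from above

def build_request(prompt: str, model: str = "qwen3.5-27b") -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request (not yet sent)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        # If the server was started with API_KEY, also add:
        # "Authorization": "Bearer <your-key>"
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (only works from the university network/VPN while the server is up):
# print(ask("Say hello."))
```

The same request works through the official `openai` package by pointing its `base_url` at the URL above; `test_server.py` demonstrates that route.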
---
## Configuration
All configuration is via environment variables in `03_start_server.sh`:
| Variable | Default | Description |
|-------------------|------------------------------|-------------------------------------|
| `MODEL_DIR` | `~/models/Qwen3.5-27B` | Path to model weights |
| `PORT` | `7080` | HTTP port |
| `MAX_MODEL_LEN` | `32768` | Max context length (tokens) |
| `GPU_MEM_UTIL` | `0.92` | Fraction of GPU memory to use |
| `API_KEY` | *(empty = no auth)* | API key for authentication |
| `TENSOR_PARALLEL` | `1` | Number of GPUs |
### Context Length Tuning
The default `MAX_MODEL_LEN=32768` is conservative and ensures stable operation for 15
concurrent users. If you have plenty of VRAM headroom:
```bash
MAX_MODEL_LEN=65536 bash 03_start_server.sh
```
Qwen3.5-27B natively supports up to 262,144 tokens, but longer contexts require
significantly more GPU memory for KV cache.
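The KV-cache cost behind this trade-off is easy to estimate. The architecture numbers below (layer count, KV heads, head dimension) are illustrative assumptions, not the published Qwen3.5-27B config, but they show why doubling the context roughly doubles cache memory per sequence:

```python
# Per-token KV cache: 2 tensors (K and V) per layer, each holding
# kv_heads * head_dim values. All architecture numbers used below are
# illustrative assumptions, not the real model config.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes per token; dtype_bytes=2 assumes BF16."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128)
print(per_token)  # 262144 bytes = 256 KiB per token

for ctx in (16384, 32768, 65536):
    gib = per_token * ctx / 2**30
    print(f"{ctx:>6} tokens -> {gib:.0f} GiB KV cache per full-length sequence")
```

Under these assumptions a single 32k-token sequence costs 8 GiB of cache, so the headroom left after loading the weights directly limits both context length and the number of sequences vLLM can batch.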
### Adding Authentication
```bash
API_KEY="your-secret-key-here" bash 03_start_server.sh
```
Students then use this key in their `api_key` parameter.
### Multi-GPU Setup
If you have multiple GPUs:
```bash
TENSOR_PARALLEL=2 bash 03_start_server.sh
```
---
## Server Management
```bash
# Start in background
bash 04_start_server_background.sh
# Check if running
curl -s http://localhost:7080/v1/models | python -m json.tool
# View logs
tail -f logs/vllm_server_*.log
# Stop
bash 05_stop_server.sh
# Monitor GPU usage
watch -n 2 nvidia-smi
```
### Running Persistently with tmux
For a robust setup that survives SSH disconnects:
```bash
ssh herzogfloria@silicon.fhgr.ch
tmux new -s llm_server
bash 03_start_server.sh
# Press Ctrl+B, then D to detach
# Reconnect later:
tmux attach -t llm_server
```
---
## Files Overview
| File                            | Purpose                                      |
|---------------------------------|----------------------------------------------|
| `vllm_qwen.def`                 | Apptainer container definition               |
| `01_download_model.sh`          | Downloads model weights from Hugging Face    |
| `02_build_container.sh`         | Builds the Apptainer `.sif` image            |
| `03_start_server.sh`            | Starts the vLLM server (foreground)          |
| `04_start_server_background.sh` | Starts the server in background with logging |
| `05_stop_server.sh`             | Stops the background server                  |
| `test_server.py`                | Tests the running server                     |
| `STUDENT_GUIDE.md`              | Instructions for students                    |
---
## Troubleshooting
### "CUDA out of memory"
- Reduce `MAX_MODEL_LEN` (e.g., 16384)
- Reduce `GPU_MEM_UTIL` (e.g., 0.85)
- Use a quantized model variant
### Container build fails
- Ensure you have internet access and sufficient disk space (~20 GB for build cache)
- Try: `apptainer pull docker://vllm/vllm-openai:latest` first
### "No NVIDIA GPU detected"
- Check that `nvidia-smi` works outside the container
- Ensure `--nv` flag is passed (already in scripts)
- Verify nvidia-container-cli: `apptainer exec --nv vllm_qwen.sif nvidia-smi`
### Server starts but students can't connect
- Check firewall: `sudo ufw allow 7080:7090/tcp` or equivalent
- Verify the server binds to `0.0.0.0` (not just localhost)
- Students must use the server's hostname/IP, not `localhost`
### Slow generation with many users
- This is expected — vLLM batches requests but throughput is finite
- Consider reducing `max_tokens` in student requests
- Monitor with: `curl http://localhost:7080/metrics`