# Code Embeddings — Hands-On Examples
**AISE501 AI in Software Engineering I**
Fachhochschule Graubünden — Spring Semester 2026
## Overview
Seven self-contained Python programs that demonstrate how embedding
models work. Each script loads a pre-trained model, embeds text or code
snippets, and explores a different capability of embeddings.
| # | Script | What it demonstrates |
|---|--------|---------------------|
| 0 | `00_tokens_and_embeddings_intro.py` | Tokenization basics and general text embeddings (German) |
| 1 | `01_basic_embeddings.py` | Compute code embeddings and pairwise cosine similarity |
| 2 | `02_text_to_code_search.py` | Semantic search: find code from natural language queries |
| 3 | `03_cross_language.py` | Same algorithm in 4 languages → similar embeddings |
| 4 | `04_clone_detection.py` | Detect duplicate/similar code in a simulated codebase |
| 5 | `05_visualize_embeddings.py` | PCA and t-SNE plots of the embedding space |
| 6 | `06_pca_denoising.py` | PCA denoising: fewer dimensions can improve similarity |
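Nearly all of these scripts reduce to one core operation: cosine similarity between embedding vectors. A minimal pure-Python sketch of that operation (the toy 4-dimensional vectors below stand in for the model's 768-dimensional embeddings):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real code embeddings
v1 = [0.1, 0.3, 0.5, 0.7]
v2 = [0.1, 0.3, 0.5, 0.8]    # points in nearly the same direction as v1
v3 = [0.9, -0.2, 0.1, -0.5]  # points in a very different direction

print(round(cosine_similarity(v1, v2), 3))  # close to 1.0
print(round(cosine_similarity(v1, v3), 3))  # much lower
```

Because the model normalizes semantically similar inputs toward the same direction in embedding space, a similarity near 1.0 indicates near-duplicate meaning, which is exactly what scripts 01 and 04 exploit.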
## Setup
### 1. Create a virtual environment (recommended)
```bash
python -m venv venv
# macOS / Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
**PyTorch GPU support:**
- **Apple Silicon Mac (M1/M2/M3/M4):** MPS acceleration works
out of the box with the standard PyTorch install. No extra steps needed.
- **NVIDIA GPU (Windows/Linux):** Install the CUDA version of PyTorch.
See https://pytorch.org/get-started/locally/ for the correct command
for your CUDA version.
- **CPU only:** Everything works on CPU too, just a bit slower.
### 3. Run any example
```bash
python 00_tokens_and_embeddings_intro.py
python 01_basic_embeddings.py
python 02_text_to_code_search.py
python 03_cross_language.py
python 04_clone_detection.py
python 05_visualize_embeddings.py
python 06_pca_denoising.py
```
The first run will download the model (~300 MB). Subsequent runs
use the cached model.
## Model
All code embedding examples (01-06) use **st-codesearch-distilroberta-base**
(82M parameters), a DistilRoBERTa model fine-tuned on 1.38 million
code-comment pairs from CodeSearchNet using contrastive learning
(MultipleNegativesRankingLoss). It produces 768-dimensional embedding
vectors optimized for matching natural language descriptions to code,
making it ideal for semantic code search and similarity tasks.
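The text-to-code search pattern (script 02) can be sketched without loading the model: embed the query and every code snippet, then rank snippets by cosine similarity to the query. The snippets and 3-dimensional vectors below are invented for illustration; in the real scripts the model's `encode()` call produces the 768-dimensional vectors.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical precomputed embeddings; real ones come from the model
corpus = {
    "def add(a, b): return a + b":                    [0.90, 0.10, 0.05],
    "def read_file(path): return open(path).read()":  [0.10, 0.85, 0.20],
    "def reverse(s): return s[::-1]":                 [0.20, 0.15, 0.90],
}
query_vec = [0.85, 0.15, 0.10]  # pretend embedding of "sum two numbers"

# Rank snippets by similarity to the query, best match first
ranked = sorted(corpus, key=lambda code: cosine(query_vec, corpus[code]),
                reverse=True)
print(ranked[0])  # the addition snippet ranks first
```

This is the essential trick of contrastively trained code-search models: because queries and matching code were pulled together during training, a plain nearest-neighbor lookup in embedding space doubles as a search engine.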
The introductory example (00) uses **paraphrase-multilingual-mpnet-base-v2**
for demonstrating general language embeddings with German text.
## Hardware Requirements
- **RAM:** 1 GB free (for the model)
- **Disk:** ~500 MB (for the downloaded model, cached in `~/.cache/huggingface/`)
- **GPU:** Optional — all scripts auto-detect and use:
- CUDA (NVIDIA GPUs)
- MPS (Apple Silicon)
- CPU (fallback)
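The device-selection order above can be sketched as a pure function (factored out here for illustration; at runtime the scripts would feed it the results of PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()` checks):

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the device string, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

print(pick_device(False, True))  # "mps" on an Apple Silicon Mac
```

The chosen string can then be passed straight to the model loader, e.g. `SentenceTransformer(name, device=...)`.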
## Expected Output
Each script prints structured output with explanations. Example 5
saves two PNG images (`code_embeddings_pca.png` and
`code_embeddings_tsne.png`) showing the embedding space. Example 6
saves `pca_denoising_analysis.png` with three sub-plots analyzing
optimal embedding dimensions.