# Code Embeddings — Hands-On Examples
**AISE501 AI in Software Engineering I**
Fachhochschule Graubünden — Spring Semester 2026
## Overview
Seven self-contained Python programs that demonstrate how embedding
models work. Each script loads a pre-trained model, embeds text or code
snippets, and explores a different capability of embeddings.
| # | Script | What it demonstrates |
|---|--------|---------------------|
| 0 | `00_tokens_and_embeddings_intro.py` | Tokenization basics and general text embeddings (German) |
| 1 | `01_basic_embeddings.py` | Compute code embeddings and pairwise cosine similarity |
| 2 | `02_text_to_code_search.py` | Semantic search: find code from natural language queries |
| 3 | `03_cross_language.py` | Same algorithm in 4 languages → similar embeddings |
| 4 | `04_clone_detection.py` | Detect duplicate/similar code in a simulated codebase |
| 5 | `05_visualize_embeddings.py` | PCA and t-SNE plots of the embedding space |
| 6 | `06_pca_denoising.py` | PCA denoising: fewer dimensions can improve similarity |
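Nearly all of these scripts reduce to one core operation: cosine similarity between embedding vectors. A minimal pure-Python sketch of that operation (the toy 4-dimensional vectors below stand in for the model's 768-dimensional embeddings):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real code embeddings
v1 = [0.1, 0.3, 0.5, 0.7]
v2 = [0.1, 0.3, 0.5, 0.8]    # points in nearly the same direction as v1
v3 = [0.9, -0.2, 0.1, -0.5]  # points in a very different direction

print(round(cosine_similarity(v1, v2), 3))  # close to 1.0
print(round(cosine_similarity(v1, v3), 3))  # much lower
```

Because the model normalizes semantically similar inputs toward the same direction in embedding space, a similarity near 1.0 indicates near-duplicate meaning, which is exactly what scripts 01 and 04 exploit.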
## Setup
### 1. Create a virtual environment (recommended)
```bash
python -m venv venv
# macOS / Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
**PyTorch GPU support:**
- **Apple Silicon Mac (M1/M2/M3/M4):** MPS acceleration works
out of the box with the standard PyTorch install. No extra steps needed.
- **NVIDIA GPU (Windows/Linux):** Install the CUDA version of PyTorch.
See https://pytorch.org/get-started/locally/ for the correct command
for your CUDA version.
- **CPU only:** Everything works on CPU too, just a bit slower.
### 3. Run any example
```bash
python 00_tokens_and_embeddings_intro.py
python 01_basic_embeddings.py
python 02_text_to_code_search.py
python 03_cross_language.py
python 04_clone_detection.py
python 05_visualize_embeddings.py
python 06_pca_denoising.py
```
The first run will download the model (~300 MB). Subsequent runs
use the cached model.
## Model
All code embedding examples (01-06) use **st-codesearch-distilroberta-base**
(82M parameters), a DistilRoBERTa model fine-tuned on 1.38 million
code-comment pairs from CodeSearchNet using contrastive learning
(MultipleNegativesRankingLoss). It produces 768-dimensional embedding
vectors optimized for matching natural language descriptions to code,
making it ideal for semantic code search and similarity tasks.
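The text-to-code search pattern (script 02) can be sketched without loading the model: embed the query and every code snippet, then rank snippets by cosine similarity to the query. The snippets and 3-dimensional vectors below are invented for illustration; in the real scripts the model's `encode()` call produces the 768-dimensional vectors.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical precomputed embeddings; real ones come from the model
corpus = {
    "def add(a, b): return a + b":                    [0.90, 0.10, 0.05],
    "def read_file(path): return open(path).read()":  [0.10, 0.85, 0.20],
    "def reverse(s): return s[::-1]":                 [0.20, 0.15, 0.90],
}
query_vec = [0.85, 0.15, 0.10]  # pretend embedding of "sum two numbers"

# Rank snippets by similarity to the query, best match first
ranked = sorted(corpus, key=lambda code: cosine(query_vec, corpus[code]),
                reverse=True)
print(ranked[0])  # the addition snippet ranks first
```

This is the essential trick of contrastively trained code-search models: because queries and matching code were pulled together during training, a plain nearest-neighbor lookup in embedding space doubles as a search engine.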
The introductory example (00) uses **paraphrase-multilingual-mpnet-base-v2**
for demonstrating general language embeddings with German text.
## Hardware Requirements
- **RAM:** 1 GB free (for the model)
- **Disk:** ~500 MB (for the downloaded model, cached in `~/.cache/huggingface/`)
- **GPU:** Optional — all scripts auto-detect and use:
- CUDA (NVIDIA GPUs)
- MPS (Apple Silicon)
- CPU (fallback)
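The device-selection order above can be sketched as a pure function (factored out here for illustration; at runtime the scripts would feed it the results of PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()` checks):

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the device string, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

print(pick_device(False, True))  # "mps" on an Apple Silicon Mac
```

The chosen string can then be passed straight to the model loader, e.g. `SentenceTransformer(name, device=...)`.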
## Expected Output
Each script prints structured output with explanations. Example 5
saves two PNG images (`code_embeddings_pca.png` and
`code_embeddings_tsne.png`) showing the embedding space. Example 6
saves `pca_denoising_analysis.png` with three sub-plots analyzing
optimal embedding dimensions.