# Code Embeddings — Hands-On Examples

**AISE501 – AI in Software Engineering I**

Fachhochschule Graubünden — Spring Semester 2026
## Overview

Seven self-contained Python programs that demonstrate how embedding
models work. Each script loads a pre-trained model, embeds text or code
snippets, and explores a different capability of embeddings.
| # | Script | What it demonstrates |
|---|--------|---------------------|
| 0 | `00_tokens_and_embeddings_intro.py` | Tokenization basics and general text embeddings (German) |
| 1 | `01_basic_embeddings.py` | Compute code embeddings and pairwise cosine similarity |
| 2 | `02_text_to_code_search.py` | Semantic search: find code from natural language queries |
| 3 | `03_cross_language.py` | Same algorithm in 4 languages → similar embeddings |
| 4 | `04_clone_detection.py` | Detect duplicate/similar code in a simulated codebase |
| 5 | `05_visualize_embeddings.py` | PCA and t-SNE plots of the embedding space |
| 6 | `06_pca_denoising.py` | PCA denoising: fewer dimensions can improve similarity |
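Most of these scripts compare embeddings with cosine similarity, which measures the angle between vectors rather than their magnitude. A minimal NumPy sketch of the idea (the helper name `cosine_similarity` is illustrative, not taken from the scripts; real embeddings here are 768-dimensional, the toy vectors are 3-dimensional):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product of unit vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" standing in for the 768-d model outputs.
a = np.array([1.0, 0.0, 1.0])
b = np.array([2.0, 0.0, 2.0])   # same direction as a -> similarity 1.0
c = np.array([0.0, 1.0, 0.0])   # orthogonal to a    -> similarity 0.0

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```

Because the measure ignores vector length, two embeddings pointing the same way score 1.0 regardless of scale.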
## Setup

### 1. Create a virtual environment (recommended)

```bash
python -m venv venv

# macOS / Linux
source venv/bin/activate

# Windows
venv\Scripts\activate
```
### 2. Install dependencies

```bash
pip install -r requirements.txt
```
**PyTorch GPU support:**

- **Apple Silicon Mac (M1/M2/M3/M4):** MPS acceleration works
  out of the box with the standard PyTorch install. No extra steps needed.
- **NVIDIA GPU (Windows/Linux):** Install the CUDA version of PyTorch.
  See https://pytorch.org/get-started/locally/ for the correct command
  for your CUDA version.
- **CPU only:** Everything works on CPU too, just a bit slower.
### 3. Run any example

```bash
python 00_tokens_and_embeddings_intro.py
python 01_basic_embeddings.py
python 02_text_to_code_search.py
python 03_cross_language.py
python 04_clone_detection.py
python 05_visualize_embeddings.py
python 06_pca_denoising.py
```

The first run will download the model (~300 MB). Subsequent runs
use the cached model.
## Model

All code embedding examples (01–06) use **st-codesearch-distilroberta-base**
(82M parameters), a DistilRoBERTa model fine-tuned on 1.38 million
code-comment pairs from CodeSearchNet using contrastive learning
(MultipleNegativesRankingLoss). It produces 768-dimensional embedding
vectors optimized for matching natural language descriptions to code,
making it well suited to semantic code search and similarity tasks.

The introductory example (00) uses **paraphrase-multilingual-mpnet-base-v2**
to demonstrate general language embeddings with German text.
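With embeddings from a model like this, text-to-code search (example 02) reduces to ranking corpus vectors by cosine similarity against the query vector. A sketch with toy precomputed vectors, so no model download is needed (the vectors and variable names are illustrative):

```python
import numpy as np

# Toy stand-ins for 768-d embeddings: one query vector, three corpus vectors.
query = np.array([1.0, 0.0, 0.0])
corpus = np.array([
    [0.9, 0.1, 0.0],   # snippet 0: closest in direction to the query
    [0.0, 1.0, 0.0],   # snippet 1: orthogonal to the query
    [0.5, 0.5, 0.0],   # snippet 2: in between
])

# After L2-normalization, cosine similarity is a plain matrix-vector product.
q = query / np.linalg.norm(query)
C = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
scores = C @ q
ranking = np.argsort(-scores)   # indices sorted best match first

print(ranking)                  # [0 2 1]
```

In the real scripts the corpus rows come from embedding code snippets and the query row from embedding a natural-language description; the ranking step is the same.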
## Hardware Requirements

- **RAM:** 1 GB free (for the model)
- **Disk:** ~500 MB (for the downloaded model, cached in `~/.cache/huggingface/`)
- **GPU:** Optional — all scripts auto-detect and use:
  - CUDA (NVIDIA GPUs)
  - MPS (Apple Silicon)
  - CPU (fallback)
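The auto-detection described above can be sketched as follows. This is a hedged example of the common PyTorch pattern, not the scripts' actual code; `pick_device` is an illustrative name, and the fallback also covers environments without PyTorch installed:

```python
def pick_device() -> str:
    """Return the best available PyTorch device string, falling back to CPU."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"              # NVIDIA GPU
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"               # Apple Silicon GPU
    except ImportError:
        pass                           # PyTorch missing: CPU-only code path
    return "cpu"

print(pick_device())
```

A string like `"cuda"` or `"mps"` can then be passed to `SentenceTransformer(..., device=...)` or `tensor.to(...)`.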
## Expected Output

Each script prints structured output with explanations. Example 5
saves two PNG images (`code_embeddings_pca.png` and
`code_embeddings_tsne.png`) showing the embedding space. Example 6
saves `pca_denoising_analysis.png` with three sub-plots analyzing
optimal embedding dimensions.