
Code Embeddings — Hands-On Examples

AISE501 AI in Software Engineering I Fachhochschule Graubünden — Spring Semester 2026

Overview

Seven self-contained Python programs that demonstrate how embedding models work. Each script loads a pre-trained model, embeds text or code snippets, and explores a different capability of embeddings.

#  Script                              What it demonstrates
0  00_tokens_and_embeddings_intro.py   Tokenization basics and general text embeddings (German)
1  01_basic_embeddings.py              Compute code embeddings and pairwise cosine similarity
2  02_text_to_code_search.py           Semantic search: find code from natural language queries
3  03_cross_language.py                Same algorithm in 4 languages → similar embeddings
4  04_clone_detection.py               Detect duplicate/similar code in a simulated codebase
5  05_visualize_embeddings.py          PCA and t-SNE plots of the embedding space
6  06_pca_denoising.py                 PCA denoising: fewer dimensions can improve similarity

Setup

1. Create and activate a virtual environment

python -m venv venv

# macOS / Linux
source venv/bin/activate

# Windows
venv\Scripts\activate

2. Install dependencies

pip install -r requirements.txt

PyTorch GPU support:

  • Apple Silicon Mac (M1/M2/M3/M4): MPS acceleration works out of the box with the standard PyTorch install. No extra steps needed.
  • NVIDIA GPU (Windows/Linux): Install the CUDA version of PyTorch. See https://pytorch.org/get-started/locally/ for the correct command for your CUDA version.
  • CPU only: Everything works on CPU too, just a bit slower.
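The device auto-detection the scripts perform can be sketched roughly as follows. This is a minimal sketch, not code from the repository; the helper name pick_device is ours, and the import guard simply makes the snippet degrade to "cpu" when PyTorch is not installed:

```python
import importlib.util

def pick_device() -> str:
    """Return the best available torch device name, falling back to CPU."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch not installed at all
    import torch
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"   # Apple Silicon
    return "cpu"

print(pick_device())
```

Passing the returned string as the `device` argument when loading a model sends all encoding work to that device.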

3. Run any example

python 00_tokens_and_embeddings_intro.py
python 01_basic_embeddings.py
python 02_text_to_code_search.py
python 03_cross_language.py
python 04_clone_detection.py
python 05_visualize_embeddings.py
python 06_pca_denoising.py

The first run will download the model (~300 MB). Subsequent runs use the cached model.

Model

All code embedding examples (01-06) use st-codesearch-distilroberta-base (82M parameters), a DistilRoBERTa model fine-tuned on 1.38 million code-comment pairs from CodeSearchNet using contrastive learning (MultipleNegativesRankingLoss). It produces 768-dimensional embedding vectors optimized for matching natural-language descriptions to code, which makes it well suited to semantic code search and similarity tasks.
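Similarity between two such 768-dimensional vectors is measured with cosine similarity, the comparison the similarity and search examples rely on. A minimal NumPy sketch with toy 3-dimensional vectors standing in for real model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for 768-dimensional embeddings
a = np.array([1.0, 0.0, 0.0])
b = np.array([1.0, 0.0, 0.0])   # same direction as a
c = np.array([0.0, 1.0, 0.0])   # orthogonal to a

print(cosine_similarity(a, b))  # 1.0 (identical meaning)
print(cosine_similarity(a, c))  # 0.0 (unrelated)
```

For semantic code search, one embeds the query and every code snippet, then ranks snippets by this score against the query vector.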

The introductory example (00) uses paraphrase-multilingual-mpnet-base-v2 for demonstrating general language embeddings with German text.

Hardware Requirements

  • RAM: 1 GB free (for the model)
  • Disk: ~500 MB (for the downloaded model, cached in ~/.cache/huggingface/)
  • GPU: Optional — all scripts auto-detect and use:
    • CUDA (NVIDIA GPUs)
    • MPS (Apple Silicon)
    • CPU (fallback)

Expected Output

Each script prints structured output with explanations. Example 5 saves two PNG images (code_embeddings_pca.png and code_embeddings_tsne.png) showing the embedding space. Example 6 saves pca_denoising_analysis.png with three sub-plots analyzing optimal embedding dimensions.
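The PCA projection behind examples 5 and 6 can be sketched with plain NumPy via the SVD of the centered data. This uses random placeholder data in place of real embeddings, so the output numbers carry no meaning; only the shapes and the mechanics are the point:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 768))   # stand-in for 20 embedding vectors

Xc = X - X.mean(axis=0)          # center each dimension
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                            # keep the top-2 principal components
X_reduced = Xc @ Vt[:k].T        # project onto those components

print(X_reduced.shape)           # (20, 2) — ready for a 2-D scatter plot
```

For the denoising variant, one keeps more than two components (choosing k by explained variance) and compares cosine similarities before and after the projection.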