
Code Embeddings — Hands-On Examples

AISE501 AI in Software Engineering I Fachhochschule Graubünden — Spring Semester 2026

Overview

Seven self-contained Python programs that demonstrate how embedding models work. Each script loads a pre-trained model, embeds text or code snippets, and explores a different capability of embeddings.

#  Script                              What it demonstrates
0  00_tokens_and_embeddings_intro.py   Tokenization basics and general text embeddings (German)
1  01_basic_embeddings.py              Compute code embeddings and pairwise cosine similarity
2  02_text_to_code_search.py           Semantic search: find code from natural language queries
3  03_cross_language.py                Same algorithm in 4 languages → similar embeddings
4  04_clone_detection.py               Detect duplicate/similar code in a simulated codebase
5  05_visualize_embeddings.py          PCA and t-SNE plots of the embedding space
6  06_pca_denoising.py                 PCA denoising: fewer dimensions can improve similarity

Setup

1. Create and activate a virtual environment

python -m venv venv

# macOS / Linux
source venv/bin/activate

# Windows
venv\Scripts\activate

2. Install dependencies

pip install -r requirements.txt

PyTorch GPU support:

  • Apple Silicon Mac (M1/M2/M3/M4): MPS acceleration works out of the box with the standard PyTorch install. No extra steps needed.
  • NVIDIA GPU (Windows/Linux): Install the CUDA version of PyTorch. See https://pytorch.org/get-started/locally/ for the correct command for your CUDA version.
  • CPU only: Everything works on CPU too, just a bit slower.
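The device auto-detection the scripts perform can be sketched roughly as follows. This is a minimal sketch, not code from the repository; the helper name pick_device is ours, and the import guard simply makes the snippet degrade to "cpu" when PyTorch is not installed:

```python
import importlib.util

def pick_device() -> str:
    """Return the best available torch device name, falling back to CPU."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch not installed at all
    import torch
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"   # Apple Silicon
    return "cpu"

print(pick_device())
```

Passing the returned string as the `device` argument when loading a model sends all encoding work to that device.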

3. Run any example

python 00_tokens_and_embeddings_intro.py
python 01_basic_embeddings.py
python 02_text_to_code_search.py
python 03_cross_language.py
python 04_clone_detection.py
python 05_visualize_embeddings.py
python 06_pca_denoising.py

The first run will download the model (~300 MB). Subsequent runs use the cached model.

Model

All code embedding examples (01-06) use st-codesearch-distilroberta-base (82M parameters), a DistilRoBERTa model fine-tuned on 1.38 million code-comment pairs from CodeSearchNet using contrastive learning (MultipleNegativesRankingLoss). It produces 768-dimensional embedding vectors optimized for matching natural-language descriptions to code, which makes it well suited to semantic code search and similarity tasks.
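Similarity between two such 768-dimensional vectors is measured with cosine similarity, the comparison the similarity and search examples rely on. A minimal NumPy sketch with toy 3-dimensional vectors standing in for real model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for 768-dimensional embeddings
a = np.array([1.0, 0.0, 0.0])
b = np.array([1.0, 0.0, 0.0])   # same direction as a
c = np.array([0.0, 1.0, 0.0])   # orthogonal to a

print(cosine_similarity(a, b))  # 1.0 (identical meaning)
print(cosine_similarity(a, c))  # 0.0 (unrelated)
```

For semantic code search, one embeds the query and every code snippet, then ranks snippets by this score against the query vector.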

The introductory example (00) uses paraphrase-multilingual-mpnet-base-v2 for demonstrating general language embeddings with German text.

Hardware Requirements

  • RAM: 1 GB free (for the model)
  • Disk: ~500 MB (for the downloaded model, cached in ~/.cache/huggingface/)
  • GPU: Optional — all scripts auto-detect and use:
    • CUDA (NVIDIA GPUs)
    • MPS (Apple Silicon)
    • CPU (fallback)

Expected Output

Each script prints structured output with explanations. Example 5 saves two PNG images (code_embeddings_pca.png and code_embeddings_tsne.png) showing the embedding space. Example 6 saves pca_denoising_analysis.png with three sub-plots analyzing optimal embedding dimensions.
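The PCA projection behind examples 5 and 6 can be sketched with plain NumPy via the SVD of the centered data. This uses random placeholder data in place of real embeddings, so the output numbers carry no meaning; only the shapes and the mechanics are the point:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 768))   # stand-in for 20 embedding vectors

Xc = X - X.mean(axis=0)          # center each dimension
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                            # keep the top-2 principal components
X_reduced = Xc @ Vt[:k].T        # project onto those components

print(X_reduced.shape)           # (20, 2) — ready for a 2-D scatter plot
```

For the denoising variant, one keeps more than two components (choosing k by explained variance) and compares cosine similarities before and after the projection.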