# Code Embeddings — Hands-On Examples

AISE501 – AI in Software Engineering I, Fachhochschule Graubünden, Spring Semester 2026
## Overview

Seven self-contained Python programs that demonstrate how embedding models work. Each script loads a pre-trained model, embeds text or code snippets, and explores a different capability of embeddings.
| # | Script | What it demonstrates |
|---|--------|----------------------|
| 0 | `00_tokens_and_embeddings_intro.py` | Tokenization basics and general text embeddings (German) |
| 1 | `01_basic_embeddings.py` | Compute code embeddings and pairwise cosine similarity |
| 2 | `02_text_to_code_search.py` | Semantic search: find code from natural language queries |
| 3 | `03_cross_language.py` | Same algorithm in 4 languages → similar embeddings |
| 4 | `04_clone_detection.py` | Detect duplicate/similar code in a simulated codebase |
| 5 | `05_visualize_embeddings.py` | PCA and t-SNE plots of the embedding space |
| 6 | `06_pca_denoising.py` | PCA denoising: fewer dimensions can improve similarity |
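At the core of scripts 01–04 is the same operation: comparing two embedding vectors by cosine similarity. A minimal numpy sketch of that comparison, using tiny toy vectors as stand-ins for real 768-dimensional embeddings (the vector values below are invented for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: ~1.0 = same direction, ~0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for real 768-dimensional code embeddings
sort_py = np.array([0.9, 0.1, 0.2])  # e.g. a Python sort function
sort_js = np.array([0.8, 0.2, 0.1])  # similar code → nearby vector
parse_c = np.array([0.1, 0.9, 0.3])  # unrelated code → distant vector

print(cosine_similarity(sort_py, sort_js))  # high (close to 1)
print(cosine_similarity(sort_py, parse_c))  # noticeably lower
```

Because cosine similarity ignores vector magnitude, only the *direction* of the embedding matters, which is why it is the standard choice for comparing embeddings.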
## Setup

1. Create a virtual environment (recommended):

   ```bash
   python -m venv venv

   # macOS / Linux
   source venv/bin/activate

   # Windows
   venv\Scripts\activate
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
PyTorch GPU support:
- Apple Silicon Mac (M1/M2/M3/M4): MPS acceleration works out of the box with the standard PyTorch install. No extra steps needed.
- NVIDIA GPU (Windows/Linux): Install the CUDA version of PyTorch. See https://pytorch.org/get-started/locally/ for the correct command for your CUDA version.
- CPU only: Everything works on CPU too, just a bit slower.
3. Run any example:

   ```bash
   python 00_tokens_and_embeddings_intro.py
   python 01_basic_embeddings.py
   python 02_text_to_code_search.py
   python 03_cross_language.py
   python 04_clone_detection.py
   python 05_visualize_embeddings.py
   python 06_pca_denoising.py
   ```
The first run will download the model (~300 MB). Subsequent runs use the cached model.
## Model

All code embedding examples (01–06) use `st-codesearch-distilroberta-base` (82M parameters), a DistilRoBERTa model fine-tuned on 1.38 million code–comment pairs from CodeSearchNet using contrastive learning (MultipleNegativesRankingLoss). It produces 768-dimensional embedding vectors optimized for matching natural-language descriptions to code, making it well suited to semantic code search and similarity tasks.

The introductory example (00) uses `paraphrase-multilingual-mpnet-base-v2` to demonstrate general language embeddings with German text.
## Hardware Requirements

- RAM: 1 GB free (for the model)
- Disk: ~500 MB (for the downloaded model, cached in `~/.cache/huggingface/`)
- GPU: Optional — all scripts auto-detect and use:
  - CUDA (NVIDIA GPUs)
  - MPS (Apple Silicon)
  - CPU (fallback)
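The auto-detection described above can be sketched as follows. This is a minimal sketch assuming PyTorch is installed; `pick_device` is a hypothetical helper name, not taken from the scripts themselves:

```python
import torch

def pick_device() -> str:
    # Hypothetical helper: prefer CUDA (NVIDIA), then MPS (Apple Silicon),
    # then fall back to CPU — the same priority order the scripts use.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```

SentenceTransformer models accept the resulting string via their `device` argument, so the same script runs unchanged on any of the three backends.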
## Expected Output

Each script prints structured output with explanations. Example 5 saves two PNG images (`code_embeddings_pca.png` and `code_embeddings_tsne.png`) showing the embedding space. Example 6 saves `pca_denoising_analysis.png` with three sub-plots analyzing optimal embedding dimensions.
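The idea behind example 6 — projecting embeddings onto their top principal components and comparing in the reduced space — can be sketched with plain numpy. This is a toy sketch on random data, not the script's actual implementation; the choice of `k = 16` components is arbitrary and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 768))   # 50 toy "embeddings" of dimension 768
Xc = X - X.mean(axis=0)          # center the data before PCA

# PCA via SVD: rows of Vt are principal directions, sorted by variance explained
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 16                           # arbitrary illustrative cut-off
X_reduced = Xc @ Vt[:k].T        # keep only the top-k components

print(X_reduced.shape)           # (50, 16)
```

Cosine similarities computed on `X_reduced` instead of `X` discard the low-variance directions, which is the "denoising" effect the script analyzes.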