# Code Embeddings — Hands-On Examples

**AISE501 – AI in Software Engineering I**
Fachhochschule Graubünden — Spring Semester 2026

## Overview

Seven self-contained Python programs that demonstrate how embedding models work. Each script loads a pre-trained model, embeds text or code snippets, and explores a different capability of embeddings.

| # | Script | What it demonstrates |
|---|--------|---------------------|
| 0 | `00_tokens_and_embeddings_intro.py` | Tokenization basics and general text embeddings (German) |
| 1 | `01_basic_embeddings.py` | Compute code embeddings and pairwise cosine similarity |
| 2 | `02_text_to_code_search.py` | Semantic search: find code from natural language queries |
| 3 | `03_cross_language.py` | Same algorithm in 4 languages → similar embeddings |
| 4 | `04_clone_detection.py` | Detect duplicate/similar code in a simulated codebase |
| 5 | `05_visualize_embeddings.py` | PCA and t-SNE plots of the embedding space |
| 6 | `06_pca_denoising.py` | PCA denoising: fewer dimensions can improve similarity |

## Setup

### 1. Create a virtual environment (recommended)

```bash
python -m venv venv

# macOS / Linux
source venv/bin/activate

# Windows
venv\Scripts\activate
```

### 2. Install dependencies

```bash
pip install -r requirements.txt
```

**PyTorch GPU support:**

- **Apple Silicon Mac (M1/M2/M3/M4):** MPS acceleration works out of the box with the standard PyTorch install. No extra steps needed.
- **NVIDIA GPU (Windows/Linux):** Install the CUDA version of PyTorch. See https://pytorch.org/get-started/locally/ for the correct command for your CUDA version.
- **CPU only:** Everything works on CPU too, just a bit slower.

### 3. Run any example

```bash
python 00_tokens_and_embeddings_intro.py
python 01_basic_embeddings.py
python 02_text_to_code_search.py
python 03_cross_language.py
python 04_clone_detection.py
python 05_visualize_embeddings.py
python 06_pca_denoising.py
```

The first run will download the model (~300 MB).
Subsequent runs use the cached model.

## Model

All code embedding examples (01–06) use **st-codesearch-distilroberta-base** (82M parameters), a DistilRoBERTa model fine-tuned on 1.38 million code–comment pairs from CodeSearchNet using contrastive learning (MultipleNegativesRankingLoss). It produces 768-dimensional embedding vectors optimized for matching natural-language descriptions to code, making it well suited to semantic code search and similarity tasks.

The introductory example (00) uses **paraphrase-multilingual-mpnet-base-v2** to demonstrate general language embeddings with German text.

## Hardware Requirements

- **RAM:** 1 GB free (for the model)
- **Disk:** ~500 MB (for the downloaded model, cached in `~/.cache/huggingface/`)
- **GPU:** Optional — all scripts auto-detect and use:
  - CUDA (NVIDIA GPUs)
  - MPS (Apple Silicon)
  - CPU (fallback)

## Expected Output

Each script prints structured output with explanations. Example 5 saves two PNG images (`code_embeddings_pca.png` and `code_embeddings_tsne.png`) showing the embedding space. Example 6 saves `pca_denoising_analysis.png` with three sub-plots analyzing optimal embedding dimensions.
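
## How the similarity ranking works

The scripts above all reduce to the same core operation: embed query and candidates, then rank by cosine similarity. The sketch below illustrates that ranking step with made-up 4-dimensional vectors (the real scripts use the model's 768-dimensional outputs); the snippets and their vectors are purely illustrative, not values produced by the actual model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of the two vectors, L2-normalized."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend embeddings: one query vector and three candidate code snippets.
# (Illustrative numbers only — a real run would call model.encode(...).)
query = np.array([0.9, 0.1, 0.0, 0.1])
snippets = {
    "def add(a, b): return a + b":      np.array([0.8, 0.2, 0.1, 0.0]),
    "def read_file(p): ...":            np.array([0.1, 0.9, 0.0, 0.2]),
    "def sum_list(xs): return sum(xs)": np.array([0.7, 0.1, 0.2, 0.1]),
}

# Rank candidates by similarity to the query, highest first.
ranked = sorted(snippets,
                key=lambda s: cosine_similarity(query, snippets[s]),
                reverse=True)
for s in ranked:
    print(f"{cosine_similarity(query, snippets[s]):.3f}  {s}")
```

Because the embedding model places natural-language descriptions near the code they describe, this same ranking loop powers text-to-code search (example 02) and clone detection (example 04) — only the candidate set changes.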