{ "cells": [ { "cell_type": "markdown", "id": "0e332a07", "metadata": {}, "source": [ "# Part 0 — Conceptual Foundation\n", "\n", "*Last updated:* 2026-01-08\n", "\n", "This section builds the mental model you need **before touching code**. It is written for **wet-lab researchers** and **bioinformaticians** who want reliable, reproducible identifier conversion across Ensembl releases and external databases.\n", "\n", "**Learning objectives**\n", "- Understand what *identifier drift* is and why it breaks analyses.\n", "- Understand the two axes: **time** (Ensembl release) and **space** (namespace/database).\n", "- Understand what a **snapshot boundary** is and why it makes results reproducible.\n", "- Know how to interpret IDTrack outcomes: **1→0**, **1→1**, **1→n**.\n", "\n", "> **Tip:** If you remember one sentence: *IDTrack does time travel on the Ensembl backbone, then optional hops into external namespaces, all inside a bounded snapshot so results are auditable and reproducible.*\n" ] }, { "cell_type": "markdown", "id": "fd6d7f6e", "metadata": {}, "source": [ "## 0.1 — What is IDTrack?\n", "\n", "In modern biology you rarely analyze a single dataset in isolation. 
You compare:\n", "- older vs newer studies (different Ensembl releases)\n", "- different technologies (bulk vs scRNA-seq)\n", "- different annotation sources (Ensembl IDs vs HGNC symbols vs RefSeq vs UniProt)\n", "\n", "The challenge: **identifiers change** (this is called *identifier drift*).\n", "- Ensembl stable IDs can be *retired*, *merged*, *split*, or *re-assigned across releases*.\n", "- Gene symbols are convenient but can be **ambiguous** (one symbol can refer to multiple genes) and can change over time.\n", "- External databases provide cross-references, but they overlap and can introduce many-to-many ambiguity.\n", "\n", "IDTrack solves this by building a **time-aware graph** of identifier relationships and answering:\n", "\"Given this ID, what does it correspond to at release X (and optionally in database Y), inside a reproducible snapshot?\"\n", "\n", "### When should you use IDTrack (vs alternatives)?\n", "\n", "Use IDTrack when you need:\n", "- **Reproducibility across time**: \"map everything to Ensembl release X and keep that fixed for the project\".\n", "- **Auditability**: you want to inspect *why* a mapping happened (or why it didn’t).\n", "- **Mixed namespaces**: your inputs are a mix of Ensembl IDs, symbols, UniProt/RefSeq/Entrez, etc.\n", "- **Honest ambiguity handling**: you prefer explicit 1→n results to silent coercion.\n", "\n", "You might *not* need IDTrack if you only need:\n", "- a quick \"latest release only\" mapping for a handful of genes and you do not care about snapshot reproducibility.\n" ] }, { "cell_type": "markdown", "id": "c3658551", "metadata": {}, "source": [ "## 0.2 — Architecture & Workflow Overview\n", "\n", "IDTrack is easiest to understand with two axes:\n", "\n", "1. **Time axis** → *Ensembl releases* (e.g. 90, 100, 110, 115)\n", "2. 
**Space axis** → *identifier namespaces* (Ensembl gene IDs, gene symbols, UniProt accessions, RefSeq IDs, …)\n", "\n", "Most real tasks are:\n", "\n", "- **Time travel on the backbone** (move an Ensembl identifier from one release to another)\n", "- then optionally **switch namespaces** (e.g. Ensembl → HGNC Symbol, or Ensembl → UniProt)\n", "\n", "A useful picture (the conversion pipeline):\n", "\n", "```text\n", "Your ID (some database, some year)\n", " |\n", " v\n", " [Normalize + match to graph node]\n", " |\n", " v\n", " [Time travel across Ensembl releases]\n", " |\n", " +--> (optional) [external hop(s) if needed]\n", " |\n", " v\n", " [Arrive at target release]\n", " |\n", " v\n", " (optional) [Convert to requested external database]\n", "```\n", "\n", "> **Why this matters:** If you say \"convert to release 115\", you are choosing a point on the time axis. If you say \"give me HGNC symbols\", you are choosing a coordinate on the space axis. The rest is controlled path-finding inside a reproducible snapshot.\n" ] }, { "cell_type": "markdown", "id": "ae02e33c", "metadata": {}, "source": [ "### 0.2.1 — Graph Snapshots and the Snapshot Boundary\n", "\n", "IDTrack builds a **graph** where:\n", "- nodes are identifiers (Ensembl IDs, base IDs, external IDs)\n", "- edges encode relationships (release-to-release history, gene↔transcript↔protein links, external cross-references)\n", "\n", "A key feature is the **snapshot boundary** (also called the *snapshot release*):\n", "- you choose a *maximum Ensembl release* (e.g. 115)\n", "- IDTrack ignores anything newer\n", "\n", "This matters because it makes results **reproducible**:\n", "- same snapshot boundary + same external YAML → same graph → same conversions\n", "\n", "A practical metaphor: you are doing \"time travel\" in a museum that you *freeze in time*. 
The outside world (newer releases) keeps changing, but your museum exhibits do not.\n" ] }, { "cell_type": "markdown", "id": "abb08433", "metadata": {}, "source": [ "### 0.2.2 — Backbone vs External Namespaces (and the External YAML)\n", "\n", "IDTrack does *not* automatically include every external database Ensembl knows about.\n", "Instead, you explicitly opt in via a small YAML file (the **external YAML**).\n", "\n", "Why? Because including everything would:\n", "- make the graph huge\n", "- slow down path-finding\n", "- increase ambiguity (many-to-many relationships)\n", "\n", "So the workflow is:\n", "1. generate a template YAML for an organism\n", "2. enable a curated set of external databases (set `Include: true`)\n", "3. build the graph snapshot\n", "\n", "> **Tip:** Think of Ensembl IDs as the \"backbone\" (the timeline). External databases are \"bridges\" you can enable when you need them.\n" ] }, { "cell_type": "markdown", "id": "5efc4855", "metadata": {}, "source": [ "### 0.2.3 — Understanding Mapping Outcomes (1→0, 1→1, 1→n)\n", "\n", "When you convert one identifier, three outcomes are common:\n", "\n", "- **1→0**: nothing matches (unknown ID, or no path exists in the snapshot)\n", "- **1→1**: clean conversion\n", "- **1→n**: ambiguous conversion (splits, merged history, promiscuous external IDs, symbols, …)\n", "\n", "IDTrack will *tell you which case you are in*. 
This is a feature, not a failure: it prevents silent mistakes.\n" ] }, { "cell_type": "markdown", "id": "6ee2f722", "metadata": {}, "source": [ "### 0.2.4 — Genome Assemblies Are Part of the Mapping\n", "\n", "A **genome assembly** (also called a “build”) is the reference coordinate system used to define genes and transcripts.\n", "Common examples:\n", "\n", "- human: **GRCh38** and **GRCh37**\n", "- mouse: **GRCm39** and **GRCm38**\n", "- pig: **Sscrofa11.1** and **Sscrofa10.2**\n", "\n", "Why this matters in practice: two datasets can use the “same kind of identifiers” but still be anchored to different builds\n", "because they were annotated with different GTFs or reference packages. This is a common situation when building atlases.\n", "\n", "IDTrack treats assemblies as a first-class part of the graph. A snapshot is not only “release-bounded in time”; it can also be\n", "**multi-assembly**:\n", "\n", "- nodes/edges carry assembly context where needed\n", "- the path-finder can move across **releases** and across **assemblies** to reach a unified target space\n", "- some external databases are only present (or only reliable) on specific assembly/release combinations, so having multiple assemblies\n", " available can increase connectivity\n", "\n", "In most projects you choose a single **target** (a snapshot boundary + a primary assembly for output), and use IDTrack to bring mixed\n", "inputs into that target in a reproducible way.\n", "\n", "Human is the main case where two assemblies are actively supported in parallel (GRCh38 + GRCh37). Mouse and pig are generally *clean-handoff* species (one maintained assembly per release), but older assemblies still matter for legacy datasets." ] }, { "cell_type": "markdown", "id": "b8664a4f", "metadata": {}, "source": [ "### 0.2.5 — Key Abstractions You Will Use\n", "\n", "You do not need to be a developer to use these, but knowing the names helps you navigate the tutorials:\n", "\n", "1. 
**`idtrack.API`** — the high-level entry point\n", " - resolves organism names\n", " - builds/loads the graph snapshot\n", " - provides `convert_identifier(...)` and batch helpers\n", "\n", "2. **`idtrack.DatabaseManager`** — data access + caching\n", " - downloads tables from Ensembl (live MySQL when reachable; otherwise HTTPS/FTP MySQL dumps)\n", " - manages your external YAML (`*_externals_modified.yml`)\n", "\n", "3. **`idtrack.Track`** — the conversion engine\n", " - performs path-finding and scoring in the graph\n", " - you usually access it as `api.track`\n", "\n", "4. **`idtrack.HarmonizeFeatures`** — multi-dataset harmonization\n", " - converts gene identifiers across multiple `.h5ad` datasets\n", " - helps build a unified integrated dataset\n", "\n", "5. **`idtrack._external_mappers` (optional)** — orthologs and external mapping services\n", " - advanced features that require extra dependencies\n" ] }, { "cell_type": "markdown", "id": "92d22b52", "metadata": {}, "source": [ "## 0.3 — Assumptions & Limitations\n", "\n", "IDTrack is deliberately opinionated about **reproducibility** and **transparency**. 
That comes with assumptions and limits.\n", "\n", "IDTrack assumes you have:\n", "- a writable **local repository folder** (cache + graphs + YAML configs)\n", "- network access the first time you build a graph (REST + HTTPS/FTP dumps; later runs reuse the cache)\n", "- enough disk space (graphs and cached tables can be large)\n", "\n", "IDTrack can:\n", "- time-travel identifiers across Ensembl releases inside a chosen snapshot boundary\n", "- optionally hop into external namespaces you explicitly enabled\n", "- surface ambiguity (1→n) instead of hiding it\n", "\n", "IDTrack cannot magically fix upstream ambiguity:\n", "- if Ensembl history says a gene split into multiple descendants, the correct answer may be **1→n**\n", "- if a symbol is reused across multiple genes, symbols can stay ambiguous\n", "- mapping quality is bounded by upstream metadata quality\n", "\n", "> **Why this matters:** IDTrack prefers to be \"honestly ambiguous\" rather than silently wrong. If you need a single answer from a 1→n case, you should make that choice explicitly (and record how you did it).\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "b5e06b65", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "IDTrack version: 0.0.5\n", "Local repository: /Users/kemalinecik/git_nosync/master_idtrack/idtrack/docs/_notebooks/idtrack_cache\n" ] } ], "source": [ "# 1) Minimal setup cell (safe to run in any notebook)\n", "from __future__ import annotations\n", "\n", "import os\n", "from pathlib import Path\n", "\n", "import idtrack\n", "\n", "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n", "LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)\n", "\n", "api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n", "api.configure_logger()\n", "\n", "print('IDTrack version:', idtrack.__version__)\n", "print('Local repository:', LOCAL_REPOSITORY)\n" ] }, { "cell_type": "markdown", "id": "ba7b94a6", "metadata": 
{}, "source": [ "The cell above does three things:\n", "1. chooses a cache directory (your *local repository*)\n", "2. creates the high-level `idtrack.API` object\n", "3. enables logging so you can see progress (downloads, caching, graph build)\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "f911dcbf", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2026-01-09 21:31:06 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.\n", "2026-01-09 21:31:07 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "'human' -> 'homo_sapiens' (latest Ensembl release: 115)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2026-01-09 21:31:07 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "'mus musculus' -> 'mus_musculus' (latest Ensembl release: 115)\n", "'sus scrofa' -> 'sus_scrofa' (latest Ensembl release: 115)\n" ] } ], "source": [ "# 2) Resolve organism names the way IDTrack expects\n", "# You can use common names ('human'), scientific names, taxon IDs, or Ensembl-style names.\n", "\n", "for query in ['human', 'mus musculus', 'sus scrofa']:\n", " formal_name, latest_release = api.resolve_organism(query)\n", " print(f'{query!r} -> {formal_name!r} (latest Ensembl release: {latest_release})')\n" ] }, { "cell_type": "markdown", "id": "cd6e998b", "metadata": {}, "source": [ "You will see outputs like:\n", "- `'human' -> 'homo_sapiens'`\n", "- `'mus musculus' -> 'mus_musculus'`\n", "- `'sus scrofa' -> 'sus_scrofa'`\n", "\n", "From now on, the tutorials use the **formal Ensembl names** (snake_case).\n" ] }, { "cell_type": "markdown", "id": "2aa221c1", "metadata": {}, "source": [ "## 0.4 — Tutorial Roadmap (Recommended Order)\n", "\n", "1. 
**Install + verify** (Part 1)\n", "2. **Prepare external YAMLs** (Part 2)\n", "3. **Build graph snapshots** (Part 3)\n", "4. **Run self-tests / sanity checks** (initialization tests)\n", "5. **Human API deep dive** (Part 4)\n", "6. **Harmonization tutorial (HLCA-style)** (Part 5)\n", "7. **HLCA experiments case study** (Part 5, advanced)\n", "8. **Cross-species humanization** (Part 6, advanced)\n", "9. **Advanced topics** (Part 7)\n" ] }, { "cell_type": "markdown", "id": "e25b05fd", "metadata": {}, "source": [ "## 0.5 — Quick Troubleshooting (Fast Wins)\n", "\n", "If something fails, check these first:\n", "\n", "1. **Permissions**: is your local repository writable?\n", "2. **Network**: first-time runs need to reach Ensembl services (REST + HTTPS/FTP dumps; MySQL is optional).\n", "3. **Disk space**: building graphs can use multiple GB.\n", "4. **Snapshot boundary + assemblies**: the snapshot release must exist for the organism and the chosen primary assembly. Older assemblies have\n", " different release coverage; a multi-assembly snapshot can still include them when they exist within the release window.\n", "5. **External YAML**: for mouse/pig you must create a `*_externals_modified.yml` in your local repository.\n", "\n", "> **Tip:** For a deeper checklist (including diagnostics helpers), see Part 7.3 in `07_advanced_topics.ipynb`.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }