{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0e332a07",
   "metadata": {},
   "source": [
    "# Part 0 — Conceptual Foundation\n",
    "\n",
    "*Last updated:* 2026-01-08\n",
    "\n",
    "This section builds the mental model you need **before touching code**. It is written for **wet-lab researchers** and **bioinformaticians** who want reliable, reproducible identifier conversion across Ensembl releases and external databases.\n",
    "\n",
    "**Learning objectives**\n",
    "- Understand what *identifier drift* is and why it breaks analyses.\n",
    "- Understand the two axes: **time** (Ensembl release) and **space** (namespace/database).\n",
    "- Understand what a **snapshot boundary** is and why it makes results reproducible.\n",
    "- Know how to interpret IDTrack outcomes: **1→0**, **1→1**, **1→n**.\n",
    "\n",
    "> **Tip:** If you remember one sentence: *IDTrack does time travel on the Ensembl backbone, then optional hops into external namespaces, all inside a bounded snapshot so results are auditable and reproducible.*\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd6d7f6e",
   "metadata": {},
   "source": [
    "## 0.1 — What is IDTrack?\n",
    "\n",
    "In modern biology you rarely analyze a single dataset in isolation. You compare:\n",
    "- older vs newer studies (different Ensembl releases)\n",
    "- different technologies (bulk vs scRNA-seq)\n",
    "- different annotation sources (Ensembl IDs vs HGNC symbols vs RefSeq vs UniProt)\n",
    "\n",
    "The challenge: **identifiers change** (this is called *identifier drift*).\n",
    "- Ensembl stable IDs can be *retired*, *merged*, *split*, or *re-assigned* across releases.\n",
    "- Gene symbols are convenient but can be **ambiguous** (one symbol can refer to multiple genes) and can change over time.\n",
    "- External databases provide cross-references, but they overlap and can introduce many-to-many ambiguity.\n",
    "\n",
    "IDTrack solves this by building a **time-aware graph** of identifier relationships and answering:\n",
    "\"Given this ID, what does it correspond to at release X (and optionally in database Y), inside a reproducible snapshot?\"\n",
    "\n",
    "### When should you use IDTrack (vs alternatives)?\n",
    "\n",
    "Use IDTrack when you need:\n",
    "- **Reproducibility across time**: \"map everything to Ensembl release X and keep that fixed for the project\".\n",
    "- **Auditability**: you want to inspect *why* a mapping happened (or why it didn’t).\n",
    "- **Mixed namespaces**: your inputs are a mix of Ensembl IDs, symbols, UniProt/RefSeq/Entrez, etc.\n",
    "- **Honest ambiguity handling**: you prefer explicit 1→n results to silent coercion.\n",
    "\n",
    "You might *not* need IDTrack if you only need a quick \"latest release only\" mapping for a handful of genes and do not care about snapshot reproducibility.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3658551",
   "metadata": {},
   "source": [
    "## 0.2 — Architecture & Workflow Overview\n",
    "\n",
    "IDTrack is easiest to understand with two axes:\n",
    "\n",
    "1. **Time axis** → *Ensembl releases* (e.g. 90, 100, 110, 115)\n",
    "2. **Space axis** → *identifier namespaces* (Ensembl gene IDs, gene symbols, UniProt accessions, RefSeq IDs, …)\n",
    "\n",
    "Most real tasks are:\n",
    "\n",
    "- **Time travel on the backbone** (move an Ensembl identifier from one release to another)\n",
    "- then optionally **switch namespaces** (e.g. Ensembl → HGNC Symbol, or Ensembl → UniProt)\n",
    "\n",
    "A useful picture (the conversion pipeline):\n",
    "\n",
    "```text\n",
    "Your ID (some database, some year)\n",
    "        |\n",
    "        v\n",
    "  [Normalize + match to graph node]\n",
    "        |\n",
    "        v\n",
    "  [Time travel across Ensembl releases]\n",
    "        |\n",
    "        +--> (optional) [external hop(s) if needed]\n",
    "        |\n",
    "        v\n",
    "  [Arrive at target release]\n",
    "        |\n",
    "        v\n",
    "  (optional) [Convert to requested external database]\n",
    "```\n",
    "\n",
    "> **Why this matters:** If you say \"convert to release 115\", you are choosing a point on the time axis. If you say \"give me HGNC symbols\", you are choosing a coordinate on the space axis. The rest is controlled path-finding inside a reproducible snapshot.\n"
   ]
  },
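  {
   "cell_type": "markdown",
   "id": "7a1c4e90",
   "metadata": {},
   "source": [
    "The two-step pattern described above (time travel on the backbone, then an optional namespace switch) can be sketched with plain dictionaries. This is a toy model under stated assumptions, not IDTrack's implementation: the identifiers, the `HISTORY` and `XREFS` tables, and the helper names are all invented for illustration, and the walk only moves forward in time.\n",
    "\n",
    "```python\n",
    "# Toy data only: invented identifiers, not IDTrack's internal structures.\n",
    "\n",
    "# Time axis: successor links between consecutive releases.\n",
    "# Key: (identifier, release); value: what it becomes in the next release.\n",
    "HISTORY = {\n",
    "    ('GENE_OLD', 1): ['GENE_OLD'],  # unchanged from release 1 to 2\n",
    "    ('GENE_OLD', 2): ['GENE_NEW'],  # re-assigned from release 2 to 3\n",
    "}\n",
    "\n",
    "# Space axis: cross-references valid at a given release.\n",
    "XREFS = {\n",
    "    ('GENE_NEW', 3): {'symbol': ['SYM1']},\n",
    "}\n",
    "\n",
    "\n",
    "def time_travel(identifier, from_release, to_release):\n",
    "    # Walk forward one release at a time, following successor links.\n",
    "    current = {identifier}\n",
    "    for release in range(from_release, to_release):\n",
    "        following = set()\n",
    "        for ident in current:\n",
    "            following.update(HISTORY.get((ident, release), []))\n",
    "        current = following\n",
    "    return sorted(current)\n",
    "\n",
    "\n",
    "def convert(identifier, from_release, to_release, namespace=None):\n",
    "    # Step 1: arrive at the target release on the backbone.\n",
    "    arrived = time_travel(identifier, from_release, to_release)\n",
    "    if namespace is None:\n",
    "        return arrived\n",
    "    # Step 2 (optional): hop into the requested namespace at that release.\n",
    "    hits = set()\n",
    "    for ident in arrived:\n",
    "        hits.update(XREFS.get((ident, to_release), {}).get(namespace, []))\n",
    "    return sorted(hits)\n",
    "\n",
    "\n",
    "backbone_result = convert('GENE_OLD', 1, 3)\n",
    "symbol_result = convert('GENE_OLD', 1, 3, namespace='symbol')\n",
    "print(backbone_result, symbol_result)\n",
    "```\n",
    "\n",
    "The first call stays on the backbone; the second adds a namespace hop at the target release. An empty list from either step corresponds to a 1→0 outcome in this toy model.\n"
   ]
  },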
  {
   "cell_type": "markdown",
   "id": "ae02e33c",
   "metadata": {},
   "source": [
    "### 0.2.1 — Graph Snapshots and the Snapshot Boundary\n",
    "\n",
    "IDTrack builds a **graph** where:\n",
    "- nodes are identifiers (Ensembl IDs, base IDs, external IDs)\n",
    "- edges encode relationships (release-to-release history, gene↔transcript↔protein links, external cross-references)\n",
    "\n",
    "A key feature is the **snapshot boundary** (also called the *snapshot release*):\n",
    "- you choose a *maximum Ensembl release* (e.g. 115)\n",
    "- IDTrack ignores anything newer\n",
    "\n",
    "This matters because it makes results **reproducible**:\n",
    "- same snapshot boundary + same external YAML → same graph → same conversions\n",
    "\n",
    "A practical metaphor: you are doing \"time travel\" in a museum that you *freeze in time*. The outside world (newer releases) keeps changing, but your museum exhibits do not.\n"
   ]
  },
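  {
   "cell_type": "markdown",
   "id": "3d9b2f61",
   "metadata": {},
   "source": [
    "To make the snapshot boundary concrete, here is a minimal sketch that filters a toy edge list by release and then asks what is reachable. The edge tuples, node names, and both helper functions are assumptions made for this example; IDTrack's real graph is richer and is built for you by the library.\n",
    "\n",
    "```python\n",
    "# Toy edges: (source_id, target_id, release_in_which_the_link_appears).\n",
    "EDGES = [\n",
    "    ('A', 'B', 100),\n",
    "    ('B', 'C', 110),\n",
    "    ('C', 'D', 116),  # newer than the boundary used below\n",
    "]\n",
    "\n",
    "\n",
    "def build_snapshot(edges, snapshot_boundary):\n",
    "    # Keep only edges whose release is at or below the boundary.\n",
    "    return [edge for edge in edges if edge[2] <= snapshot_boundary]\n",
    "\n",
    "\n",
    "def reachable(edges, start):\n",
    "    # Collect every node reachable from `start` by following edges forward.\n",
    "    adjacency = {}\n",
    "    for source, target, _release in edges:\n",
    "        adjacency.setdefault(source, []).append(target)\n",
    "    seen, stack = set(), [start]\n",
    "    while stack:\n",
    "        node = stack.pop()\n",
    "        for neighbour in adjacency.get(node, []):\n",
    "            if neighbour not in seen:\n",
    "                seen.add(neighbour)\n",
    "                stack.append(neighbour)\n",
    "    return sorted(seen)\n",
    "\n",
    "\n",
    "snapshot = build_snapshot(EDGES, snapshot_boundary=115)\n",
    "inside_boundary = reachable(snapshot, 'A')\n",
    "print(inside_boundary)\n",
    "```\n",
    "\n",
    "Node `D` is never reached because its only incoming edge belongs to release 116. Rebuilding with the same edge list and the same boundary always yields the same snapshot, which is the property the museum metaphor describes.\n"
   ]
  },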
  {
   "cell_type": "markdown",
   "id": "abb08433",
   "metadata": {},
   "source": [
    "### 0.2.2 — Backbone vs External Namespaces (and the External YAML)\n",
    "\n",
    "IDTrack does *not* automatically include every external database Ensembl knows about.\n",
    "Instead, you explicitly opt in via a small YAML file (the **external YAML**).\n",
    "\n",
    "Why? Because including everything would:\n",
    "- make the graph huge\n",
    "- slow down path-finding\n",
    "- increase ambiguity (many-to-many relationships)\n",
    "\n",
    "So the workflow is:\n",
    "1. generate a template YAML for an organism\n",
    "2. enable a curated set of external databases (set `Include: true`)\n",
    "3. build the graph snapshot\n",
    "\n",
    "> **Tip:** Think of Ensembl IDs as the \"backbone\" (the timeline). External databases are \"bridges\" you can enable when you need them.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5efc4855",
   "metadata": {},
   "source": [
    "### 0.2.3 — Understanding Mapping Outcomes (1→0, 1→1, 1→n)\n",
    "\n",
    "When you convert one identifier, three outcomes are common:\n",
    "\n",
    "- **1→0**: nothing matches (unknown ID, or no path exists in the snapshot)\n",
    "- **1→1**: clean conversion\n",
    "- **1→n**: ambiguous conversion (gene splits, merged history, promiscuous external IDs, reused symbols, …)\n",
    "\n",
    "IDTrack will *tell you which case you are in*. This is a feature, not a failure: it prevents silent mistakes.\n"
   ]
  },
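  {
   "cell_type": "markdown",
   "id": "c84e05a7",
   "metadata": {},
   "source": [
    "If you collect conversion results yourself, the three outcome classes are easy to tabulate. The sketch below assumes a plain dictionary from source ID to a list of target IDs; that layout, the identifiers, and `classify_outcome` are illustrative and are not IDTrack's return format.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "# Toy conversion results: source ID -> list of target IDs (all invented).\n",
    "RESULTS = {\n",
    "    'ID_A': [],\n",
    "    'ID_B': ['T1'],\n",
    "    'ID_C': ['T2', 'T3'],\n",
    "    'ID_D': ['T4', 'T4'],  # duplicates collapse to a single target\n",
    "}\n",
    "\n",
    "\n",
    "def classify_outcome(targets):\n",
    "    # Count distinct targets so repeated hits do not look like ambiguity.\n",
    "    n_unique = len(set(targets))\n",
    "    if n_unique == 0:\n",
    "        return '1→0'\n",
    "    if n_unique == 1:\n",
    "        return '1→1'\n",
    "    return '1→n'\n",
    "\n",
    "\n",
    "labels = {source: classify_outcome(targets) for source, targets in RESULTS.items()}\n",
    "summary = Counter(labels.values())\n",
    "print(labels)\n",
    "print(dict(summary))\n",
    "```\n",
    "\n",
    "Reporting the counts per class alongside your results makes it obvious how much of a gene list converted cleanly and how much needs a closer look.\n"
   ]
  },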
  {
   "cell_type": "markdown",
   "id": "6ee2f722",
   "metadata": {},
   "source": [
    "### 0.2.4 — Genome assemblies are part of the mapping\n",
    "\n",
    "A **genome assembly** (also called a “build”) is the reference coordinate system used to define genes and transcripts.\n",
    "Common examples:\n",
    "\n",
    "- human: **GRCh38** and **GRCh37**\n",
    "- mouse: **GRCm39** and **GRCm38**\n",
    "- pig: **Sscrofa11.1** and **Sscrofa10.2**\n",
    "\n",
    "Why this matters in practice: two datasets can use the “same kind of identifiers” but still be anchored to different builds\n",
    "because they were annotated with different GTFs or reference packages. That is a very common atlas-building situation.\n",
    "\n",
    "IDTrack treats assemblies as a first-class part of the graph. A snapshot is not only “release-bounded in time” — it can also be\n",
    "**multi-assembly**:\n",
    "\n",
    "- nodes/edges carry assembly context where needed\n",
    "- the path-finder can move across **releases** and across **assemblies** to reach a unified target space\n",
    "- some external databases are only present (or only reliable) on specific assembly/release combinations, so having multiple assemblies\n",
    "  available can increase connectivity\n",
    "\n",
    "In most projects you choose a single **target** (a snapshot boundary + a primary assembly for output), and use IDTrack to bring mixed\n",
    "inputs into that target in a reproducible way.\n",
    "\n",
    "Human is the main case where two assemblies are actively supported in parallel (GRCh38 + GRCh37). Mouse and pig are generally *clean-handoff* species (one maintained assembly per release), but older assemblies still matter for legacy datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8664a4f",
   "metadata": {},
   "source": [
    "### 0.2.5 — Key Abstractions You Will Use\n",
    "\n",
    "You do not need to be a developer to use these, but knowing the names helps you navigate the tutorials:\n",
    "\n",
    "1. **`idtrack.API`** — the high-level entry point\n",
    "   - resolves organism names\n",
    "   - builds/loads the graph snapshot\n",
    "   - provides `convert_identifier(...)` and batch helpers\n",
    "\n",
    "2. **`idtrack.DatabaseManager`** — data access + caching\n",
    "   - downloads tables from Ensembl (live MySQL when reachable; otherwise HTTPS/FTP MySQL dumps)\n",
    "   - manages your external YAML (`*_externals_modified.yml`)\n",
    "\n",
    "3. **`idtrack.Track`** — the conversion engine\n",
    "   - performs path-finding and scoring in the graph\n",
    "   - you usually access it as `api.track`\n",
    "\n",
    "4. **`idtrack.HarmonizeFeatures`** — multi-dataset harmonization\n",
    "   - converts gene identifiers across multiple `.h5ad` datasets\n",
    "   - helps build a unified integrated dataset\n",
    "\n",
    "5. **`idtrack._external_mappers` (optional)** — orthologs and external mapping services\n",
    "   - advanced features that require extra dependencies\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92d22b52",
   "metadata": {},
   "source": [
    "## 0.3 — Assumptions & Limitations\n",
    "\n",
    "IDTrack is deliberately opinionated about **reproducibility** and **transparency**. That comes with assumptions and limits.\n",
    "\n",
    "IDTrack assumes you have:\n",
    "- a writable **local repository folder** (cache + graphs + YAML configs)\n",
    "- network access the first time you build a graph (REST + HTTPS/FTP dumps; later runs reuse the cache)\n",
    "- enough disk space (graphs and cached tables can be large)\n",
    "\n",
    "IDTrack can:\n",
    "- time-travel identifiers across Ensembl releases inside a chosen snapshot boundary\n",
    "- optionally hop into external namespaces you explicitly enabled\n",
    "- surface ambiguity (1→n) instead of hiding it\n",
    "\n",
    "IDTrack cannot magically fix upstream ambiguity:\n",
    "- if Ensembl history says a gene split into multiple descendants, the correct answer may be **1→n**\n",
    "- if a symbol is reused across multiple genes, symbols can stay ambiguous\n",
    "- mapping quality is bounded by upstream metadata quality\n",
    "\n",
    "> **Why this matters:** IDTrack prefers to be \"honestly ambiguous\" rather than silently wrong. If you need a single answer from a 1→n case, you should make that choice explicitly (and record how you did it).\n"
   ]
  },
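  {
   "cell_type": "markdown",
   "id": "51f6d3be",
   "metadata": {},
   "source": [
    "One way to make a 1→n choice explicit and recorded is to keep the decision, the candidates, and the rule name together. The helper below is a hypothetical sketch, not part of IDTrack, and the rule shown (shortest ID, then alphabetical) is only a placeholder; a real project would substitute a biologically motivated criterion.\n",
    "\n",
    "```python\n",
    "def resolve_one_to_many(source_id, candidates, rule_name, key):\n",
    "    # Sort first so ties under `key` break the same way on every run.\n",
    "    ordered = sorted(candidates)\n",
    "    chosen = min(ordered, key=key)\n",
    "    # Keep the full candidate list and the rule name next to the decision.\n",
    "    return {\n",
    "        'source': source_id,\n",
    "        'candidates': ordered,\n",
    "        'chosen': chosen,\n",
    "        'rule': rule_name,\n",
    "    }\n",
    "\n",
    "\n",
    "# Hypothetical 1→n result; the identifiers are made up.\n",
    "decision = resolve_one_to_many(\n",
    "    source_id='ID_C',\n",
    "    candidates=['T3', 'T20', 'T2'],\n",
    "    rule_name='shortest_id_then_alphabetical',\n",
    "    key=len,\n",
    ")\n",
    "print(decision)\n",
    "```\n",
    "\n",
    "Storing these records next to your converted data lets a reader see which rows were forced to a single answer and by which rule.\n"
   ]
  },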
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b5e06b65",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "IDTrack version: 0.0.5\n",
      "Local repository: /Users/kemalinecik/git_nosync/master_idtrack/idtrack/docs/_notebooks/idtrack_cache\n"
     ]
    }
   ],
   "source": [
    "# 1) Minimal setup cell (safe to run in any notebook)\n",
    "from __future__ import annotations\n",
    "\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "import idtrack\n",
    "\n",
    "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n",
    "LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n",
    "api.configure_logger()\n",
    "\n",
    "print('IDTrack version:', idtrack.__version__)\n",
    "print('Local repository:', LOCAL_REPOSITORY)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba7b94a6",
   "metadata": {},
   "source": [
    "The cell above does three things:\n",
    "1. chooses a cache directory (your *local repository*)\n",
    "2. creates the high-level `idtrack.API` object\n",
    "3. enables logging so you can see progress (downloads, caching, graph build)\n"
   ]
  },
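  {
   "cell_type": "markdown",
   "id": "e2b79a48",
   "metadata": {},
   "source": [
    "Two of the assumptions from section 0.3 (a writable local repository and enough disk space) can be checked up front with the standard library. This is a sketch: `preflight` is a hypothetical helper, not an IDTrack function, and the 5 GB default is an arbitrary placeholder rather than a documented requirement.\n",
    "\n",
    "```python\n",
    "import os\n",
    "import shutil\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def preflight(repository, min_free_gb=5.0):\n",
    "    # Report whether `repository` looks usable as a cache directory.\n",
    "    repository = Path(repository)\n",
    "    repository.mkdir(parents=True, exist_ok=True)\n",
    "    free_gb = shutil.disk_usage(repository).free / 1024**3\n",
    "    return {\n",
    "        'path': str(repository.resolve()),\n",
    "        'writable': os.access(repository, os.W_OK),\n",
    "        'free_gb': round(free_gb, 1),\n",
    "        'enough_space': free_gb >= min_free_gb,\n",
    "    }\n",
    "\n",
    "\n",
    "# Demonstrated on a throwaway directory; pass your own repository path instead.\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    report = preflight(Path(tmp) / 'idtrack_cache', min_free_gb=0.0)\n",
    "    print(report)\n",
    "```\n",
    "\n",
    "Running a check like this before the first graph build surfaces permission and disk problems early, before any download starts.\n"
   ]
  },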
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f911dcbf",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2026-01-09 21:31:06 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.\n",
      "2026-01-09 21:31:07 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "'human' -> 'homo_sapiens' (latest Ensembl release: 115)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2026-01-09 21:31:07 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "'mus musculus' -> 'mus_musculus' (latest Ensembl release: 115)\n",
      "'sus scrofa' -> 'sus_scrofa' (latest Ensembl release: 115)\n"
     ]
    }
   ],
   "source": [
    "# 2) Resolve organism names the way IDTrack expects\n",
    "# You can use common names ('human'), scientific names, taxon IDs, or Ensembl-style names.\n",
    "\n",
    "for query in ['human', 'mus musculus', 'sus scrofa']:\n",
    "    formal_name, latest_release = api.resolve_organism(query)\n",
    "    print(f'{query!r} -> {formal_name!r} (latest Ensembl release: {latest_release})')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd6e998b",
   "metadata": {},
   "source": [
    "You will see output like the following (each line also reports the latest Ensembl release for that organism):\n",
    "- `'human' -> 'homo_sapiens'`\n",
    "- `'mus musculus' -> 'mus_musculus'`\n",
    "- `'sus scrofa' -> 'sus_scrofa'`\n",
    "\n",
    "From now on, the tutorials use the **formal Ensembl names** (snake_case).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2aa221c1",
   "metadata": {},
   "source": [
    "## 0.4 — Tutorial Roadmap (Recommended Order)\n",
    "\n",
    "1. **Install + verify** (Part 1)\n",
    "2. **Prepare external YAMLs** (Part 2)\n",
    "3. **Build graph snapshots** (Part 3)\n",
    "4. **Run self-tests / sanity checks** (initialization tests)\n",
    "5. **Human API deep dive** (Part 4)\n",
    "6. **Harmonization tutorial (HLCA-style)** (Part 5)\n",
    "7. **HLCA experiments case study** (Part 5, advanced)\n",
    "8. **Cross-species humanization** (Part 6, advanced)\n",
    "9. **Advanced topics** (Part 7)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e25b05fd",
   "metadata": {},
   "source": [
    "## 0.5 — Quick Troubleshooting (Fast Wins)\n",
    "\n",
    "If something fails, check these first:\n",
    "\n",
    "1. **Permissions**: is your local repository writable?\n",
    "2. **Network**: first-time runs need to reach Ensembl services (REST + HTTPS/FTP dumps; MySQL is optional).\n",
    "3. **Disk space**: building graphs can use multiple GB.\n",
    "4. **Snapshot boundary + assemblies**: the snapshot release must exist for the organism and the chosen primary assembly. Older assemblies have\n",
    "   different release coverage; a multi-assembly snapshot can still include them when they exist within the release window.\n",
    "5. **External YAML**: for mouse/pig you must create a `*_externals_modified.yml` in your local repository.\n",
    "\n",
    "> **Tip:** For a deeper checklist (including diagnostics helpers), see Part 7.3 in `07_advanced_topics.ipynb`.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}