{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "23c813fb",
   "metadata": {},
   "source": [
    "# Part 4 — Core API Deep-Dive (Human Example)\n",
    "\n",
    "*Last updated:* 2026-01-08\n",
    "\n",
    "This notebook is a **hands-on tutorial** of the public IDTrack API, using **human** as the example organism.\n",
    "\n",
    "**Learning objectives**\n",
    "- Create the `idtrack.API` façade and understand what it wraps.\n",
    "- Convert single identifiers (time travel + optional external outputs).\n",
    "- Convert batches and summarize outcomes (1→0 / 1→1 / 1→n).\n",
    "- Request explanation payloads for audit trails.\n",
    "- Learn advanced knobs (external bridging, ambiguity strategy, assembly awareness).\n",
    "- Learn introspection helpers (available databases, assemblies, releases, active ranges).\n",
    "\n",
    "> **Prerequisite:** `03_initialization_graph.ipynb` (Part 3) is recommended so the graph loads from cache.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa04831d",
   "metadata": {},
   "source": [
    "## 4.1 — The API Facade\n",
    "\n",
    "`idtrack.API` is the user-facing entry point. It handles:\n",
    "- organism resolution (human/mouse/pig names and synonyms)\n",
    "- building or loading a graph snapshot (the reproducible snapshot boundary)\n",
    "- conversion helpers like `convert_identifier(...)` and `convert_identifier_multiple(...)`\n",
    "\n",
    "In this notebook we build (or load) the **human** snapshot, then use it for the rest of the examples.\n",
    "\n",
    "> **Expected result:** after the setup cell runs, `api.track` exists and conversions become available.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e64623d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "import idtrack\n",
    "\n",
    "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n",
    "LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n",
    "api.configure_logger()\n",
    "\n",
    "organism, latest_release = api.resolve_organism('human')\n",
    "SNAPSHOT_RELEASE = latest_release\n",
    "\n",
    "api.build_graph(organism_name=organism, snapshot_release=SNAPSHOT_RELEASE, calculate_caches=True)\n",
    "print('Ready:', organism, 'snapshot', SNAPSHOT_RELEASE)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adc0ede4",
   "metadata": {},
   "source": [
    "## 4.2 — Single Identifier Conversion\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dbedd707",
   "metadata": {},
   "outputs": [],
   "source": [
    "api.convert_identifier('TP53', to_release=SNAPSHOT_RELEASE)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "567b689b",
   "metadata": {},
   "source": [
    "If you see `no_corresponding=True`, the input did not match any node in the graph.\n",
    "Try a different spelling or casing, or pass an Ensembl ID directly.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d832c626",
   "metadata": {},
   "source": [
    "### 4.2.1 Example: time travel (convert to an older release)\n",
    "\n",
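    "Why this matters: published datasets are often annotated against older Ensembl releases, so comparing against them means mapping identifiers back to that release.\n"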
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f251b006",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose an older release to demonstrate time travel\n",
    "older_release = SNAPSHOT_RELEASE - 10\n",
    "api.convert_identifier('TP53', to_release=older_release)\n"
   ]
  },
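  {
   "cell_type": "markdown",
   "id": "b7d2e5a1",
   "metadata": {},
   "source": [
    "The fixed offset above assumes the graph actually contains a release that far back. A minimal sketch of a guard, assuming the releases are available as plain integers (for example from `api.list_ensembl_releases()`, used in section 4.6); the helper name `pick_older_release` is hypothetical and not part of IDTrack:\n",
    "\n",
    "```python\n",
    "def pick_older_release(available, snapshot, offset=10):\n",
    "    # Hypothetical helper: pick the newest available release that is\n",
    "    # at most `snapshot - offset`, else fall back to the oldest one.\n",
    "    releases = sorted(available)\n",
    "    if not releases:\n",
    "        raise ValueError('no releases available')\n",
    "    wanted = snapshot - offset\n",
    "    candidates = [r for r in releases if r <= wanted]\n",
    "    return candidates[-1] if candidates else releases[0]\n",
    "\n",
    "\n",
    "# Illustrative release numbers only, not real IDTrack output.\n",
    "print(pick_older_release(range(100, 111), snapshot=110, offset=4))   # 106\n",
    "print(pick_older_release(range(105, 111), snapshot=110, offset=10))  # 105\n",
    "```\n",
    "\n",
    "With a live graph the call might look like `pick_older_release(api.list_ensembl_releases(), SNAPSHOT_RELEASE)`, assuming that method returns integer release numbers.\n"
   ]
  },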
  {
   "cell_type": "markdown",
   "id": "1f4beed4",
   "metadata": {},
   "source": [
    "### 4.2.2 Convert into an external database (HGNC)\n",
    "\n",
    "To convert into a specific external database, pass `final_database=...`.\n",
    "Database names are the same names you see in your external YAML.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c9c3dd54",
   "metadata": {},
   "outputs": [],
   "source": [
    "api.convert_identifier('ENSG00000141510', to_release=SNAPSHOT_RELEASE, final_database='HGNC Symbol')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b50aa467",
   "metadata": {},
   "source": [
    "### 4.2.3 How do I know which external databases are available?\n",
    "\n",
    "Use the graph itself to list databases currently represented.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67598312",
   "metadata": {},
   "outputs": [],
   "source": [
    "g = api.track.graph\n",
    "sorted(g.available_external_databases)[:50]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10f77c9b",
   "metadata": {},
   "source": [
    "### 4.2.4 Understanding the result dictionary\n",
    "\n",
    "Key fields:\n",
    "- `query_id`: exactly what you typed\n",
    "- `graph_id`: what IDTrack matched internally (normalization step)\n",
    "- `target_id`: list of outputs (can be 0, 1, or many)\n",
    "- `no_corresponding`: input didn’t match any node\n",
    "- `no_conversion`: input matched, but no path to target release / database\n",
    "- `no_target`: reached an Ensembl target, but requested external DB had no synonym\n",
    "\n",
    "Important: `target_id` is a list because one input can map to zero, one, or several targets; ambiguous (1→n) mappings are common.\n"
   ]
  },
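  {
   "cell_type": "markdown",
   "id": "e93f1c46",
   "metadata": {},
   "source": [
    "The flags above can be folded into a single status label when post-processing results. A minimal sketch, assuming a result behaves like a plain dictionary with the fields listed above; the helper name `conversion_status` and the sample dictionaries are made up for illustration and are not IDTrack output:\n",
    "\n",
    "```python\n",
    "def conversion_status(result):\n",
    "    # Check the failure flags first, then fall back to the target count.\n",
    "    if result.get('no_corresponding'):\n",
    "        return 'unmatched input'\n",
    "    if result.get('no_conversion'):\n",
    "        return 'no path to target'\n",
    "    if result.get('no_target'):\n",
    "        return 'no external synonym'\n",
    "    n = len(result.get('target_id') or [])\n",
    "    return {0: 'empty', 1: 'unique (1->1)'}.get(n, 'ambiguous (1->n)')\n",
    "\n",
    "\n",
    "# Hand-written examples that only mimic the documented fields.\n",
    "ok = {'query_id': 'TP53', 'target_id': ['ENSG00000141510'],\n",
    "      'no_corresponding': False, 'no_conversion': False, 'no_target': False}\n",
    "missing = {'query_id': 'NOT_A_REAL_GENE', 'target_id': [],\n",
    "           'no_corresponding': True, 'no_conversion': False, 'no_target': False}\n",
    "\n",
    "print(conversion_status(ok))       # unique (1->1)\n",
    "print(conversion_status(missing))  # unmatched input\n",
    "```\n"
   ]
  },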
  {
   "cell_type": "markdown",
   "id": "ac47627b",
   "metadata": {},
   "source": [
    "### 4.2.5 Ambiguity control: strategy='best' vs strategy='all'\n",
    "\n",
    "- `strategy='best'` (default): returns a single best target when possible\n",
    "- `strategy='all'`: returns *all* candidates IDTrack found\n",
    "\n",
    "Use `'all'` when you are doing QC or want to inspect ambiguous mappings.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0a3f563",
   "metadata": {},
   "outputs": [],
   "source": [
    "api.convert_identifier('TP53', to_release=SNAPSHOT_RELEASE, strategy='all')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ab242d3",
   "metadata": {},
   "source": [
    "## 4.3 — Batch Conversion\n",
    "\n",
    "Most workflows start from a list of identifiers (genes in a count matrix, markers, hits, etc.).\n",
    "IDTrack provides two helpers:\n",
    "- `convert_identifier_multiple(...)` to convert a list of identifiers in one call\n",
    "- `classify_multiple_conversion(...)` to summarize outcomes\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca00f64f",
   "metadata": {},
   "outputs": [],
   "source": [
    "genes = ['TP53', 'BRCA1', 'BRCA2', 'BRAF', 'KRAS', 'NOT_A_REAL_GENE']\n",
    "results = api.convert_identifier_multiple(genes, to_release=SNAPSHOT_RELEASE, final_database='HGNC Symbol')\n",
    "results[:2]  # show first two\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49300b27",
   "metadata": {},
   "outputs": [],
   "source": [
    "summary = api.classify_multiple_conversion(results)\n",
    "# Each bin is a list of per-gene dictionaries\n",
    "{k: len(v) for k, v in summary.items()}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24116800",
   "metadata": {},
   "source": [
    "If you want a human-readable report, you can print the summary bins:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9931ed6",
   "metadata": {},
   "outputs": [],
   "source": [
    "api.print_binned_conversion(summary)\n"
   ]
  },
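  {
   "cell_type": "markdown",
   "id": "5ad80b27",
   "metadata": {},
   "source": [
    "The bin names returned by `classify_multiple_conversion(...)` are not spelled out here, so as a cross-check the same 1→0 / 1→1 / 1→n tally can be derived from the length of each `target_id` list. A minimal sketch, assuming each result behaves like a dictionary with a `target_id` list; `tally_cardinality` and the mock records are hypothetical:\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "\n",
    "def tally_cardinality(results):\n",
    "    # Count 1->0, 1->1 and 1->n outcomes from the length of `target_id`.\n",
    "    counts = Counter()\n",
    "    for res in results:\n",
    "        n = len(res.get('target_id') or [])\n",
    "        counts['1->0' if n == 0 else '1->1' if n == 1 else '1->n'] += 1\n",
    "    return dict(counts)\n",
    "\n",
    "\n",
    "# Mock records, not real conversion output.\n",
    "mock_results = [\n",
    "    {'query_id': 'A', 'target_id': ['X']},\n",
    "    {'query_id': 'B', 'target_id': ['Y', 'Z']},\n",
    "    {'query_id': 'C', 'target_id': []},\n",
    "    {'query_id': 'D', 'target_id': ['W']},\n",
    "]\n",
    "print(tally_cardinality(mock_results))  # {'1->1': 2, '1->n': 1, '1->0': 1}\n",
    "```\n"
   ]
  },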
  {
   "cell_type": "markdown",
   "id": "739b6db6",
   "metadata": {},
   "source": [
    "## 4.4 — Explainability & Auditability\n",
    "\n",
    "When you set `explain=True`, the result includes a `the_path` field describing the graph edges that were followed.\n",
    "Use it to audit how a target was reached or to debug an unexpected mapping.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a51a1c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "explained = api.convert_identifier('TP53', to_release=SNAPSHOT_RELEASE, final_database='HGNC Symbol', explain=True)\n",
    "list(explained.keys())\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "102844e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# The path dictionary keys are (target_id, ensembl_gene_id) pairs\n",
    "list(explained['the_path'].keys())[:3]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa85d082",
   "metadata": {},
   "source": [
    "`the_path` is intentionally detailed. For most users, the summary flags and `target_id` are enough.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af01ac49",
   "metadata": {},
   "source": [
    "## 4.5 — Advanced Conversion Options\n",
    "\n",
    "`api.convert_identifier(...)` is a convenience wrapper around `api.track.convert(...)`.\n",
    "\n",
    "The high-level API is enough for most work.\n",
    "If you need full control (search settings, whether external bridging is allowed, deeper path diagnostics), call `api.track.convert(...)` directly.\n",
    "\n",
    "### 4.5.1 Best vs all (selection strategy)\n",
    "\n",
    "- `strategy='best'` returns a single globally best target.\n",
    "- `strategy='all'` returns all scored targets (useful for ambiguity-aware pipelines).\n",
    "\n",
    "### 4.5.2 Controlling external bridging\n",
    "\n",
    "External bridging helps reconnect broken Ensembl histories using external IDs, but it can also increase the search space.\n",
    "Toggle it with the `go_external` argument of `api.track.convert(...)`.\n",
    "\n",
    "### 4.5.3 Hyperconnected nodes\n",
    "\n",
    "Some external identifiers connect to *many* entities (e.g. generic accessions). IDTrack detects these and limits their use to keep searches fast.\n",
    "\n",
    "### 4.5.4 Assembly-aware conversions\n",
    "\n",
    "In IDTrack, genome assemblies are part of the graph. This is crucial when you integrate datasets that were annotated with different references\n",
    "(for example, a GRCh37-based GTF and a GRCh38-based GTF).\n",
    "\n",
    "When you build a snapshot you choose a **primary assembly** (default: the newest/highest-priority assembly for that organism).\n",
    "The snapshot can still include other assemblies that Ensembl exposes within the snapshot window, and the path-finder can traverse between\n",
    "assemblies when it improves connectivity.\n",
    "\n",
    "Practical consequences:\n",
    "- You can feed identifiers originating from older builds and still harmonize them into one target space (your snapshot release + primary assembly).\n",
    "- External databases can be assembly-scoped; keeping assembly blocks enabled in your external YAML increases the set of bridges available for mapping.\n",
    "\n",
    "If you truly need outputs anchored to a different primary assembly (for example, a GRCh37-only downstream reference), rebuild with\n",
    "`genome_assembly=37`. Note that the cached graph filename does not include the assembly; use a separate local repository if you want to keep\n",
    "multiple primary-assembly snapshots side-by-side.\n",
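    "\n",
    "One way to keep such snapshots apart is to derive the repository path from the assembly. A minimal sketch using only `pathlib`; the helper name `repository_for_assembly` and the directory naming scheme are assumptions, not IDTrack conventions:\n",
    "\n",
    "```python\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def repository_for_assembly(base, assembly):\n",
    "    # Hypothetical layout: one sub-directory per primary assembly.\n",
    "    repo = Path(base) / f'assembly_{assembly}'\n",
    "    repo.mkdir(parents=True, exist_ok=True)\n",
    "    return repo\n",
    "\n",
    "\n",
    "repo_38 = repository_for_assembly('./idtrack_cache', 38)\n",
    "repo_37 = repository_for_assembly('./idtrack_cache', 37)\n",
    "print(repo_38.name, repo_37.name)  # assembly_38 assembly_37\n",
    "```\n",
    "\n",
    "Each path can then be passed as `local_repository=str(...)` when creating a separate `idtrack.API` instance.\n",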
    "\n",
    "Example (advanced):\n",
    "```python\n",
    "api.track.convert(\n",
    "    from_id='TP53',\n",
    "    from_release=None,\n",
    "    to_release=SNAPSHOT_RELEASE,\n",
    "    final_database='HGNC Symbol',\n",
    "    go_external=True,\n",
    "    return_path=True,\n",
    ")\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53eeb29f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Advanced demo: inspect hyperconnected nodes and compare `go_external` behavior.\n",
    "# Safe: does not modify your cache; it only runs conversions.\n",
    "\n",
    "# 1) Hyperconnected nodes (performance/ambiguity concept)\n",
    "g = api.track.graph\n",
    "hc = getattr(g, 'hyperconnective_nodes', {})\n",
    "print('Hyperconnected external nodes:', len(hc))\n",
    "if hc:\n",
    "    top = sorted(hc.items(), key=lambda kv: kv[1], reverse=True)[:10]\n",
    "    print('Top 10 by out-degree:')\n",
    "    for node, deg in top:\n",
    "        print(' ', deg, '-', node)\n",
    "\n",
    "# 2) External bridging toggle (often matters when backbone history is disconnected)\n",
    "# For many well-behaved genes, both calls will succeed; the point is the *option* exists.\n",
    "res_no_external = api.track.convert(\n",
    "    from_id='TP53',\n",
    "    from_release=None,\n",
    "    to_release=SNAPSHOT_RELEASE,\n",
    "    final_database=None,\n",
    "    go_external=False,\n",
    "    prioritize_to_one_filter=True,\n",
    "    return_path=False,\n",
    ")\n",
    "res_with_external = api.track.convert(\n",
    "    from_id='TP53',\n",
    "    from_release=None,\n",
    "    to_release=SNAPSHOT_RELEASE,\n",
    "    final_database=None,\n",
    "    go_external=True,\n",
    "    prioritize_to_one_filter=True,\n",
    "    return_path=False,\n",
    ")\n",
    "\n",
    "print('go_external=False ->', 'OK' if res_no_external else 'no result')\n",
    "print('go_external=True  ->', 'OK' if res_with_external else 'no result')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7b55b58",
   "metadata": {},
   "source": [
    "## 4.6 — Introspection & Discovery\n",
    "\n",
    "These helpers answer practical questions like:\n",
    "- *Which external databases are available in my current graph?*\n",
    "- *Which genome assemblies are represented?*\n",
    "- *What Ensembl release range does my snapshot cover?*\n",
    "- *When was a given identifier active across releases?*\n",
    "\n",
    "The next cell demonstrates the most useful introspection calls.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed563412",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Introspection demo\n",
    "\n",
    "print('Assemblies in this graph:', sorted(api.list_genome_assemblies()))\n",
    "\n",
    "ext_dbs = sorted(api.list_external_databases())\n",
    "print('External DBs enabled (count):', len(ext_dbs))\n",
    "print('External DBs (first 25):', ext_dbs[:25])\n",
    "\n",
    "forms = api.external_database_forms()\n",
    "print()\n",
    "print('External DB connection forms (sample):')\n",
    "for name in ext_dbs[:10]:\n",
    "    print(' ', name, '→', forms.get(name))\n",
    "\n",
    "rels = api.list_ensembl_releases()\n",
    "print()\n",
    "print('Ensembl releases in snapshot window:', (min(rels), max(rels)) if rels else None)\n",
    "\n",
    "# Active ranges: when was an ID \"alive\" across releases?\n",
    "# (Useful for provenance documentation.)\n",
    "g = api.track.graph\n",
    "example_gene = 'ENSG00000141510'  # TP53\n",
    "if example_gene in g.nodes:\n",
    "    print()\n",
    "    print('Active ranges (main assembly) for', example_gene, ':', g.get_active_ranges_of_id.get(example_gene))\n",
    "    try:\n",
    "        print('Active ranges (all assemblies) for', example_gene, ':', g.get_active_ranges_of_id_ensembl_all_inclusive(example_gene))\n",
    "    except Exception as e:\n",
    "        print('All-assemblies active range failed ->', repr(e))\n",
    "else:\n",
    "    print('Example gene not found in graph (unexpected).')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e306c2e",
   "metadata": {},
   "source": [
    "## 4.7 — Practical advice (the kind that saves you a week)\n",
    "\n",
    "1. Always record your **snapshot boundary** (release) in your analysis notes.\n",
    "2. If you share results, share the **external YAML** too.\n",
    "3. When mapping is ambiguous, do not hide it — decide how your pipeline should handle 1→n mappings.\n",
    "4. For scRNA-seq harmonization, prefer stable namespaces (Ensembl IDs) before switching to symbols.\n",
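    "\n",
    "Points 1 and 2 can be automated by writing a small provenance file next to your results. A minimal sketch using only the standard library; the function name, the field names, and the file paths are hypothetical, and the YAML is fingerprinted by content rather than parsed:\n",
    "\n",
    "```python\n",
    "import hashlib\n",
    "import json\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def write_provenance(out_path, organism, snapshot_release, yaml_path):\n",
    "    # Store the snapshot boundary plus a SHA-256 of the external YAML.\n",
    "    digest = hashlib.sha256(Path(yaml_path).read_bytes()).hexdigest()\n",
    "    record = {\n",
    "        'organism': organism,\n",
    "        'snapshot_release': snapshot_release,\n",
    "        'external_yaml_sha256': digest,\n",
    "    }\n",
    "    Path(out_path).write_text(json.dumps(record, indent=2))\n",
    "    return record\n",
    "\n",
    "\n",
    "# Demo with a throwaway YAML file; 110 is a placeholder release number.\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    yaml_file = Path(tmp) / 'external.yml'\n",
    "    yaml_file.write_text('placeholder: true')\n",
    "    record = write_provenance(Path(tmp) / 'provenance.json', 'human', 110, yaml_file)\n",
    "    print(sorted(record))\n",
    "```\n",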
    "\n",
    "> **Tip:** If you need troubleshooting checklists and diagnostics helpers, see `07_advanced_topics.ipynb` (Part 7.3).\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}