{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3af32b3c",
   "metadata": {},
   "source": [
    "# Part 5 — Real-World Experiments: Harmonization\n",
    "\n",
    "*Last updated:* 2026-01-08\n",
    "\n",
    "This tutorial shows how to use IDTrack for **real-world dataset harmonization**.\n",
    "\n",
    "You will learn:\n",
    "- how to harmonize feature identifiers across multiple `.h5ad` datasets (HLCA-style)\n",
    "- how to interpret harmonization diagnostics (what changed, what failed, what is ambiguous)\n",
    "- how to choose between **union** vs **intersection** feature spaces\n",
    "- how to approach **legacy data rescue** (older identifiers, mixed namespaces)\n",
    "\n",
    "> **Tip:** Start with the toy demo first. The exact same logic scales to large datasets.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f06d6790",
   "metadata": {},
   "source": [
    "## 5.0 — Why harmonization matters (plain language)\n",
    "\n",
    "When you merge datasets, you implicitly assume that feature `X` in dataset A is the same biological entity\n",
    "as feature `X` in dataset B.\n",
    "\n",
    "This breaks when:\n",
    "- datasets use different Ensembl releases (IDs changed)\n",
    "- one dataset uses HGNC symbols and the other uses Ensembl IDs\n",
    "- some features map 1→n (ambiguity) or 1→0 (no match)\n",
    "\n",
    "IDTrack makes these cases explicit and gives you reproducible conversions anchored to a graph snapshot.\n"
   ]
  },
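  {
   "cell_type": "markdown",
   "id": "a7c1e0d2",
   "metadata": {},
   "source": [
    "To make the failure mode concrete, the sketch below compares the feature names of two made-up datasets with plain Python sets. The gene lists and the hand-written `symbol_to_ensembl` lookup are illustrative assumptions, not IDTrack output; in a real run the mapping comes from the graph snapshot.\n",
    "\n",
    "```python\n",
    "# Hypothetical feature names: A uses HGNC symbols, B uses Ensembl gene IDs.\n",
    "features_a = {'TP53', 'BRCA1', 'BRCA2'}\n",
    "features_b = {'ENSG00000141510', 'ENSG00000012048', 'ENSG00000139618'}\n",
    "\n",
    "# Naive merge: no feature name matches across the two datasets.\n",
    "naive_shared = features_a & features_b\n",
    "print('Shared before harmonization:', len(naive_shared))\n",
    "\n",
    "# Hand-written lookup for illustration only (assumption, not IDTrack output).\n",
    "symbol_to_ensembl = {\n",
    "    'TP53': 'ENSG00000141510',\n",
    "    'BRCA1': 'ENSG00000012048',\n",
    "    'BRCA2': 'ENSG00000139618',\n",
    "}\n",
    "\n",
    "# After mapping A into the same namespace as B, the overlap becomes visible.\n",
    "harmonized_a = {symbol_to_ensembl[symbol] for symbol in features_a}\n",
    "harmonized_shared = harmonized_a & features_b\n",
    "print('Shared after harmonization:', len(harmonized_shared))\n",
    "```\n",
    "\n",
    "A naive merge treats these datasets as having no genes in common, although all three genes are shared.\n"
   ]
  },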
  {
   "cell_type": "markdown",
   "id": "484981c1",
   "metadata": {},
   "source": [
    "### 5.0.1 Prerequisites\n",
    "\n",
    "- You have run `03_initialization_graph.ipynb` for human, so a graph snapshot exists in your local repository.\n",
    "- You have `anndata` installed (it is an IDTrack dependency).\n",
    "- For the HLCA section, you need access to the HLCA `.h5ad` files (not bundled here).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25cd6840",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup: imports and the local IDTrack repository\n",
    "from __future__ import annotations\n",
    "\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "import anndata as ad\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from scipy import sparse\n",
    "\n",
    "import idtrack\n",
    "\n",
    "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n",
    "LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "print('IDTrack local repository:', LOCAL_REPOSITORY)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2b5a81a",
   "metadata": {},
   "source": [
    "## 5.1 — HLCA (Human Lung Cell Atlas) Harmonization (start small, then scale)\n",
    "\n",
    "We will create two small `AnnData` objects with overlapping genes but different identifier styles.\n",
    "This shows the workflow without requiring large data downloads.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b6272bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create toy datasets\n",
    "\n",
    "toy_dir = LOCAL_REPOSITORY / 'toy_harmonization_demo'\n",
    "toy_dir.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "# Dataset A: HGNC symbols (common in many wet-lab exports)\n",
    "genes_a = ['TP53', 'BRCA1', 'BRCA2', 'BRAF', 'KRAS']\n",
    "X_a = sparse.random(50, len(genes_a), density=0.2, format='csr', random_state=0)\n",
    "adata_a = ad.AnnData(X=X_a, obs=pd.DataFrame(index=[f'cellA_{i}' for i in range(X_a.shape[0])]), var=pd.DataFrame(index=genes_a))\n",
    "\n",
    "# Dataset B: HGNC symbols mixed with an Ensembl stable ID (ENSG00000141510 is TP53) and one invalid name\n",
    "genes_b = ['TP53', 'ENSG00000141510', 'BRCA1', 'NOT_A_REAL_GENE']\n",
    "X_b = sparse.random(60, len(genes_b), density=0.2, format='csr', random_state=1)\n",
    "adata_b = ad.AnnData(X=X_b, obs=pd.DataFrame(index=[f'cellB_{i}' for i in range(X_b.shape[0])]), var=pd.DataFrame(index=genes_b))\n",
    "\n",
    "path_a = toy_dir / 'toy_A_symbols.h5ad'\n",
    "path_b = toy_dir / 'toy_B_mixed.h5ad'\n",
    "adata_a.write_h5ad(path_a)\n",
    "adata_b.write_h5ad(path_b)\n",
    "\n",
    "path_a, path_b\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5c339e9",
   "metadata": {},
   "source": [
    "### 5.1.1 Run the harmonizer\n",
    "\n",
    "IDTrack provides `idtrack.HarmonizeFeatures`.\n",
    "\n",
    "Key parameters (interpretation):\n",
    "- `idtrack_local_repository`: where your built graph snapshot lives\n",
    "- `graph_last_ensembl_release`: which snapshot release the graph contains (must match your build)\n",
    "- `target_ensembl_release`: the release you want to harmonize *to*\n",
    "- `final_database`: what you want to keep as feature IDs (HGNC, Ensembl, …)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37affa4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Harmonize the toy datasets\n",
    "\n",
    "data_h5ad_dict = {\n",
    "    'toy_A': str(path_a),\n",
    "    'toy_B': str(path_b),\n",
    "}\n",
    "\n",
    "project_out = toy_dir / 'outputs'\n",
    "project_out.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "organism_name, latest_release = idtrack.API(str(LOCAL_REPOSITORY)).resolve_organism('human')\n",
    "\n",
    "harmonizer = idtrack.HarmonizeFeatures(\n",
    "    project_name='toy_demo',\n",
    "    data_h5ad_dict=data_h5ad_dict,\n",
    "    project_local_repository=str(project_out),\n",
    "    idtrack_local_repository=str(LOCAL_REPOSITORY),\n",
    "    target_ensembl_release=latest_release,\n",
    "    final_database='HGNC Symbol',\n",
    "    organism_name=organism_name,\n",
    "    graph_last_ensembl_release=latest_release,\n",
    "    verbose_level=2,\n",
    ")\n",
    "\n",
    "harmonizer\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3711c22e",
   "metadata": {},
   "source": [
    "### 5.1.2 Inspect what happened\n",
    "\n",
    "Useful things to look at:\n",
    "- which identifiers failed conversion\n",
    "- which identifiers were ambiguous\n",
    "- per-dataset result pickle files written under `project_local_repository`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a87f98d",
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Conversion failed (any dataset):', sorted(harmonizer.conversion_failed_identifiers)[:20])\n",
    "print('Conversion failed but consistent:', sorted(harmonizer.conversion_failed_but_consistent_identifiers)[:20])\n",
    "print('Converted IDs with multiple Ensembl possibilities (collapsed):', list(harmonizer.multiple_ensembl_dict)[:10])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1bb06dbe",
   "metadata": {},
   "source": [
    "### 5.1.3 Produce a unified AnnData (union or intersection)\n",
    "\n",
    "After harmonization, you often want a single merged dataset for downstream analysis.\n",
    "\n",
    "IDTrack’s harmonizer can merge datasets in two modes:\n",
    "- `mode='union'` (default): keep the union of all features (missing genes become zeros)\n",
    "- `mode='intersect'`: keep only features present in every dataset\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "258450c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Merge the toy datasets (this uses AnnData.concat under the hood)\n",
    "unified_union = harmonizer.unify_multiple_anndatas(mode='union')\n",
    "unified_intersect = harmonizer.unify_multiple_anndatas(mode='intersect')\n",
    "\n",
    "print('Union shape:', unified_union.shape)\n",
    "print('Intersect shape:', unified_intersect.shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5111dc60",
   "metadata": {},
   "source": [
    "At HLCA scale, `union` can create a very large feature matrix.\n",
    "If you plan to run memory-heavy models, consider starting with `intersect` or filtering genes first.\n"
   ]
  },
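  {
   "cell_type": "markdown",
   "id": "b3d9f4a6",
   "metadata": {},
   "source": [
    "The difference between the two modes can be sketched with plain `pandas`, independent of IDTrack. The two small count tables are made up, and `pd.concat` only stands in for the merge step here; this is not a description of how `unify_multiple_anndatas` is implemented.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Two hypothetical count tables (cells x genes) with partly overlapping genes.\n",
    "counts_a = pd.DataFrame(\n",
    "    [[1, 0, 2], [0, 3, 0], [4, 0, 0]],\n",
    "    index=['cellA_0', 'cellA_1', 'cellA_2'],\n",
    "    columns=['TP53', 'BRCA1', 'KRAS'],\n",
    ")\n",
    "counts_b = pd.DataFrame(\n",
    "    [[5, 1, 0], [0, 0, 2]],\n",
    "    index=['cellB_0', 'cellB_1'],\n",
    "    columns=['TP53', 'BRCA1', 'BRAF'],\n",
    ")\n",
    "\n",
    "# Union: keep every gene; genes missing from a dataset are filled with zeros.\n",
    "union = pd.concat([counts_a, counts_b], join='outer').fillna(0).astype(int)\n",
    "\n",
    "# Intersection: keep only genes present in both datasets.\n",
    "intersect = pd.concat([counts_a, counts_b], join='inner')\n",
    "\n",
    "print('Union shape:', union.shape)\n",
    "print('Intersect shape:', intersect.shape)\n",
    "```\n",
    "\n",
    "A zero introduced by the union stands for a gene that was not measured, not for an observed absence, which is one reason to report the mode you chose.\n"
   ]
  },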
  {
   "cell_type": "markdown",
   "id": "549c8c6b",
   "metadata": {},
   "source": [
    "## 5.2 — Multi-Dataset Integration (best practices)\n",
    "\n",
    "Once you can harmonize identifiers, you can integrate datasets *without accidentally mixing incompatible feature definitions*.\n",
    "\n",
    "Practical best practices:\n",
    "\n",
    "- **Start by harmonizing into a stable backbone** (usually Ensembl gene IDs). Convert to symbols only for reporting.\n",
    "- **Treat 1→n mappings as real biology/data ambiguity**, not as a nuisance. Decide a policy:\n",
    "  - drop ambiguous features\n",
    "  - keep all candidates (inflates feature space)\n",
    "  - collapse to a single representative (requires an explicit rule)\n",
    "- **Pilot first**: run harmonization on a small subset and inspect diagnostics before scaling up.\n",
    "- **Union vs intersection**:\n",
    "  - `union` keeps more features but can produce a very wide matrix.\n",
    "  - `intersect` is stricter and often improves comparability, but may drop biologically important genes.\n",
    "\n",
    "> **Expected result:** you should be able to justify (and report) your integration choice, not just run a tool.\n"
   ]
  },
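  {
   "cell_type": "markdown",
   "id": "c5e2a8b1",
   "metadata": {},
   "source": [
    "The three 1→n policies listed above can be written down as a small, explicit function. This is a minimal sketch: `apply_ambiguity_policy`, the policy names, and the placeholder identifiers are hypothetical and not part of IDTrack, and the representative rule (alphabetically first target) is just one possible choice.\n",
    "\n",
    "```python\n",
    "from __future__ import annotations\n",
    "\n",
    "POLICIES = ('drop', 'keep_all', 'first_sorted')\n",
    "\n",
    "\n",
    "def apply_ambiguity_policy(mapping: dict[str, list[str]], policy: str) -> dict[str, list[str]]:\n",
    "    # Resolve 1->n mappings according to an explicit, named policy.\n",
    "    if policy not in POLICIES:\n",
    "        raise ValueError(f'Unknown policy: {policy!r}')\n",
    "    resolved = {}\n",
    "    for source_id, targets in mapping.items():\n",
    "        if len(targets) <= 1 or policy == 'keep_all':\n",
    "            # Unambiguous (or unmatched) entries pass through unchanged.\n",
    "            resolved[source_id] = list(targets)\n",
    "        elif policy == 'drop':\n",
    "            resolved[source_id] = []\n",
    "        else:\n",
    "            # 'first_sorted': reproducible rule, the alphabetically first target.\n",
    "            resolved[source_id] = [sorted(targets)[0]]\n",
    "    return resolved\n",
    "\n",
    "\n",
    "# Placeholder mapping: GENE_X is ambiguous, GENE_Z has no match.\n",
    "mapping = {'GENE_X': ['ID_2', 'ID_1'], 'GENE_Y': ['ID_3'], 'GENE_Z': []}\n",
    "for policy in POLICIES:\n",
    "    print(policy, apply_ambiguity_policy(mapping, policy))\n",
    "```\n",
    "\n",
    "Writing the rule as code makes the policy something you can cite in a methods section and rerun unchanged.\n"
   ]
  },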
  {
   "cell_type": "markdown",
   "id": "9bea2eee",
   "metadata": {},
   "source": [
    "### 5.2.1 Scaling up to HLCA (the real use case)\n",
    "\n",
    "HLCA-scale harmonization is the same logic, but you need to manage:\n",
    "- many datasets\n",
    "- large gene lists\n",
    "- storage for intermediate results\n",
    "\n",
    "### 5.2.2 Point IDTrack to your HLCA data\n",
    "\n",
    "Set `HLCA_BASE_PATH` to a folder containing `.h5ad` files. Example layout:\n",
    "\n",
    "```text\n",
    "HLCA_BASE_PATH/\n",
    "  Dataset1.h5ad\n",
    "  Dataset2.h5ad\n",
    "  ...\n",
    "```\n"
   ]
  },
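  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d578806",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Path('') silently becomes Path('.'), so treat an unset variable as None.\n",
    "_hlca_env = os.environ.get('HLCA_BASE_PATH', '').strip()\n",
    "HLCA_BASE_PATH = Path(_hlca_env) if _hlca_env else None\n",
    "print('HLCA_BASE_PATH:', HLCA_BASE_PATH)\n"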
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53fcfafe",
   "metadata": {},
   "source": [
    "If that does not show your HLCA folder, set `HLCA_BASE_PATH` in your environment or replace it manually in the cell.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92997437",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build a dataset dictionary (dataset name -> .h5ad path)\n",
    "# This scans only the top level. Adjust if your HLCA layout is nested.\n",
    "\n",
    "if HLCA_BASE_PATH and HLCA_BASE_PATH.exists():\n",
    "    hlca_files = sorted(HLCA_BASE_PATH.glob('*.h5ad'))\n",
    "    data_h5ad_dict_hlca = {p.stem: str(p) for p in hlca_files}\n",
    "    print('Found .h5ad files:', len(data_h5ad_dict_hlca))\n",
    "    print('Example keys:', list(data_h5ad_dict_hlca)[:10])\n",
    "else:\n",
    "    data_h5ad_dict_hlca = {}\n",
    "    print('HLCA_BASE_PATH not set or does not exist; skipping HLCA run.')\n"
   ]
  },
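  {
   "cell_type": "markdown",
   "id": "d8f0b7c3",
   "metadata": {},
   "source": [
    "If your files sit in nested folders, a recursive scan is one way to adjust the cell above. The helper below is a sketch under assumptions: `build_h5ad_dict` and the `__` separator in dataset names are hypothetical choices, not IDTrack conventions. Names are derived from the relative path so that two files with the same stem in different folders do not overwrite each other.\n",
    "\n",
    "```python\n",
    "from __future__ import annotations\n",
    "\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def build_h5ad_dict(base_path: Path, recursive: bool = False) -> dict[str, str]:\n",
    "    # Map dataset name -> .h5ad path, optionally descending into subfolders.\n",
    "    pattern = '**/*.h5ad' if recursive else '*.h5ad'\n",
    "    result = {}\n",
    "    for path in sorted(base_path.glob(pattern)):\n",
    "        relative = path.relative_to(base_path).with_suffix('')\n",
    "        result['__'.join(relative.parts)] = str(path)\n",
    "    return result\n",
    "\n",
    "\n",
    "# Demo on an empty placeholder layout (the files contain no data).\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    base = Path(tmp)\n",
    "    (base / 'study1').mkdir()\n",
    "    (base / 'top.h5ad').touch()\n",
    "    (base / 'study1' / 'lung.h5ad').touch()\n",
    "    print(sorted(build_h5ad_dict(base, recursive=True)))\n",
    "```\n",
    "\n",
    "With real data you would call `build_h5ad_dict(HLCA_BASE_PATH, recursive=True)` and pass the result as `data_h5ad_dict`.\n"
   ]
  },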
  {
   "cell_type": "markdown",
   "id": "ca90336b",
   "metadata": {},
   "source": [
    "### 5.2.3 Run harmonization on HLCA\n",
    "\n",
    "This can take time. IDTrack will:\n",
    "- load (or build) the graph snapshot from your local repository\n",
    "- run identifier conversion for each dataset\n",
    "- write per-dataset mapping results as pickles under the project output directory\n",
    "\n",
    "> **Tip:** start with a **small subset** (e.g. 2–3 datasets) to validate your setup.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6bcdc04",
   "metadata": {},
   "outputs": [],
   "source": [
    "# (OPTIONAL) HLCA run: uncomment to execute\n",
    "#\n",
    "# if data_h5ad_dict_hlca:\n",
    "#     project_out_hlca = LOCAL_REPOSITORY / 'hlca_harmonization_outputs'\n",
    "#     project_out_hlca.mkdir(parents=True, exist_ok=True)\n",
    "#\n",
    "#     organism_name, latest_release = idtrack.API(str(LOCAL_REPOSITORY)).resolve_organism('human')\n",
    "#\n",
    "#     harmonizer_hlca = idtrack.HarmonizeFeatures(\n",
    "#         project_name='hlca',\n",
    "#         data_h5ad_dict=data_h5ad_dict_hlca,\n",
    "#         project_local_repository=str(project_out_hlca),\n",
    "#         idtrack_local_repository=str(LOCAL_REPOSITORY),\n",
    "#         target_ensembl_release=latest_release,\n",
    "#         final_database='HGNC Symbol',\n",
    "#         organism_name=organism_name,\n",
    "#         graph_last_ensembl_release=latest_release,\n",
    "#         verbose_level=2,\n",
    "#     )\n",
    "#\n",
    "#     harmonizer_hlca\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19928eb0",
   "metadata": {},
   "source": [
    "### 5.2.4 Advanced notes (recommended once you’re comfortable)\n",
    "\n",
    "1. **Choice of `final_database`**\n",
    "   - For integration, Ensembl IDs are often the most stable.\n",
    "   - For interpretability, HGNC symbols are convenient but can be ambiguous.\n",
    "\n",
    "2. **Handling 1→n mappings**\n",
    "   - Decide whether you keep all targets, pick a single best target, or drop ambiguous genes.\n",
    "   - The right choice depends on your downstream analysis.\n",
    "\n",
    "3. **Auditability**\n",
    "   - IDTrack writes intermediate results (pickles).\n",
    "   - Treat those as provenance artifacts: keep them with your analysis.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dba14136",
   "metadata": {},
   "source": [
    "### 5.2.5 HLCA experiments (extended): large-scale harmonization + QA\n",
    "\n",
    "The minimal HLCA snippet above is intentionally conservative: it shows how to *wire* IDTrack into an HLCA-style run.\n",
    "This extended section turns it into a robust, paper-friendly workflow:\n",
    "\n",
    "- verify your inputs and environment variables\n",
    "- pin a snapshot release (for reproducibility)\n",
    "- run a small pilot first (2–3 datasets)\n",
    "- read diagnostics (1→0 and 1→n signals)\n",
    "- only then scale to the full collection\n",
    "\n",
    "> **Why this matters:** large runs are slow and generate many artifacts. A pilot run gives you early feedback and a clean audit trail.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1398910",
   "metadata": {},
   "source": [
    "#### 5.2.5.1 Verify HLCA discovery (safe to run)\n",
    "\n",
    "This cell does *not* download anything. It only checks whether your `HLCA_BASE_PATH` points to real `.h5ad` files.\n",
    "\n",
    "**Expected output:** a boolean, a dataset count, and up to ten example dataset names.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb986f71",
   "metadata": {},
   "outputs": [],
   "source": [
    "# HLCA presence check (uses the dataset dictionary built earlier)\n",
    "HAS_HLCA = bool(data_h5ad_dict_hlca)\n",
    "print('HLCA detected:', HAS_HLCA)\n",
    "print('Datasets found:', len(data_h5ad_dict_hlca))\n",
    "print('Example dataset names:', list(data_h5ad_dict_hlca)[:10])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47f2cb8b",
   "metadata": {},
   "source": [
    "#### 5.2.5.2 Choose your snapshot release (reproducibility knob)\n",
    "\n",
    "For exploratory work, using the latest release is fine. For a manuscript or production pipeline, pin the release explicitly.\n",
    "\n",
    "> **Tip:** keep the chosen release in your analysis metadata (or config) so you can reproduce results later.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a04067d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an API handle (reuses the same local repository)\n",
    "api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n",
    "\n",
    "organism, latest_release = api.resolve_organism(\"human\")\n",
    "\n",
    "# Option A (default): always use latest available release\n",
    "TARGET_RELEASE = latest_release\n",
    "\n",
    "# Option B (recommended for papers): pin to an explicit release\n",
    "# TARGET_RELEASE = 110\n",
    "\n",
    "organism, latest_release, TARGET_RELEASE\n"
   ]
  },
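  {
   "cell_type": "markdown",
   "id": "e4a6c9d0",
   "metadata": {},
   "source": [
    "One lightweight way to keep the chosen release with your analysis is a small JSON file written next to the outputs. The sketch below is an assumption about how you might organize this, not an IDTrack feature: the file name `harmonization_config.json`, the keys, and the example values (including the organism string) are placeholders.\n",
    "\n",
    "```python\n",
    "import json\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "# Placeholder run configuration; 110 mirrors the pinned-release example above.\n",
    "run_config = {\n",
    "    'organism': 'homo_sapiens',\n",
    "    'target_ensembl_release': 110,\n",
    "    'final_database': 'HGNC Symbol',\n",
    "}\n",
    "\n",
    "# A temporary directory keeps the demo side-effect free; use your project folder in practice.\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    config_path = Path(tmp) / 'harmonization_config.json'\n",
    "    # sort_keys gives stable diffs when the file is kept under version control.\n",
    "    config_path.write_text(json.dumps(run_config, indent=2, sort_keys=True))\n",
    "    reloaded = json.loads(config_path.read_text())\n",
    "\n",
    "print('Pinned release:', reloaded['target_ensembl_release'])\n",
    "```\n",
    "\n",
    "Reading the release back from this file at the start of a rerun avoids silently drifting to a newer snapshot.\n"
   ]
  },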
  {
   "cell_type": "markdown",
   "id": "df65f117",
   "metadata": {},
   "source": [
    "#### 5.2.5.3 Ensure the human graph snapshot is available\n",
    "\n",
    "If you already ran Part 3 for human, this should load quickly from cache.\n",
    "If not, this will download reference tables and build the snapshot, a one-time step that takes minutes.\n",
    "\n",
    "> **Warning:** building a snapshot requires network access the first time (unless your cache is already populated).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3be4c692",
   "metadata": {},
   "outputs": [],
   "source": [
    "if HAS_HLCA:\n",
    "    api.build_graph(organism_name=organism, snapshot_release=TARGET_RELEASE, calculate_caches=True)\n",
    "\n",
    "    g = api.track.graph\n",
    "    print('Graph organism:', g.graph.get('organism'))\n",
    "    print('Graph snapshot release:', g.graph.get('ensembl_release'))\n",
    "    print('Graph nodes:', g.number_of_nodes())\n",
    "    print('Graph edges:', g.number_of_edges())\n",
    "else:\n",
    "    print('No HLCA datasets detected; skipping graph build.')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "696f8bf5",
   "metadata": {},
   "source": [
    "#### 5.2.5.4 Build a pilot dataset dictionary\n",
    "\n",
    "Start with a tiny pilot (2–3 datasets). The pilot run answers:\n",
    "\n",
    "- Do my files load?\n",
    "- Does conversion work for my typical identifiers?\n",
    "- Are there many 1→0 failures or 1→n ambiguities?\n",
    "\n",
    "Only after the pilot looks sensible should you scale to the full HLCA collection.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "062426e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def _pick_first_items(d: dict[str, str], n: int) -> dict[str, str]:\n",
    "    keys = list(d)[:n]\n",
    "    return {k: d[k] for k in keys}\n",
    "\n",
    "data_h5ad_dict_hlca_all = data_h5ad_dict_hlca\n",
    "data_h5ad_dict_hlca_pilot = _pick_first_items(data_h5ad_dict_hlca_all, n=3)\n",
    "\n",
    "print('All HLCA datasets:', len(data_h5ad_dict_hlca_all))\n",
    "print('Pilot datasets:', len(data_h5ad_dict_hlca_pilot))\n",
    "list(data_h5ad_dict_hlca_pilot)[:10]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69882339",
   "metadata": {},
   "source": [
    "#### 5.2.5.5 Pilot harmonization run\n",
    "\n",
    "This constructs a harmonizer object. It will run conversions and write intermediate artifacts under a dedicated project folder.\n",
    "\n",
    "**Expected output:** a `HarmonizeFeatures` object (or a helpful message if HLCA data was not detected).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6655069",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep experiment outputs separate from the global cache\n",
    "project_out_hlca = (LOCAL_REPOSITORY / 'hlca_experiments').resolve()\n",
    "project_out_hlca.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "# Choose your final namespace\n",
    "FINAL_DATABASE = 'HGNC Symbol'  # try also: 'ensembl_gene'\n",
    "\n",
    "if not data_h5ad_dict_hlca_pilot:\n",
    "    print('No pilot datasets found; set HLCA_BASE_PATH and re-run from section 5.2.2.')\n",
    "    harmonizer_hlca_pilot = None\n",
    "else:\n",
    "    harmonizer_hlca_pilot = idtrack.HarmonizeFeatures(\n",
    "        project_name='hlca_pilot',\n",
    "        data_h5ad_dict=data_h5ad_dict_hlca_pilot,\n",
    "        project_local_repository=str(project_out_hlca),\n",
    "        idtrack_local_repository=str(LOCAL_REPOSITORY),\n",
    "        target_ensembl_release=TARGET_RELEASE,\n",
    "        final_database=FINAL_DATABASE,\n",
    "        organism_name=organism,\n",
    "        graph_last_ensembl_release=TARGET_RELEASE,\n",
    "        verbose_level=2,\n",
    "    )\n",
    "\n",
    "harmonizer_hlca_pilot\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51852cbd",
   "metadata": {},
   "source": [
    "#### 5.2.5.6 Read the diagnostics (this is the most important part)\n",
    "\n",
    "Treat diagnostics as first-class results. They tell you whether harmonization is trustworthy for your data.\n",
    "\n",
    "- **Conversion failures (1→0)**: identifiers that could not be placed on the graph\n",
    "- **Ambiguity (1→n)**: identifiers that map to multiple plausible targets\n",
    "\n",
    "> **Tip:** store these diagnostics alongside your paper or pipeline outputs. They explain why features were dropped or changed.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b62328c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "if harmonizer_hlca_pilot is None:\n",
    "    print('Pilot harmonizer not created; skipping diagnostics.')\n",
    "else:\n",
    "    print('Datasets in pilot:', len(harmonizer_hlca_pilot.data_h5ad_dict))\n",
    "\n",
    "    print('\\n--- Conversion failures (1→0) ---')\n",
    "    print('Count:', len(harmonizer_hlca_pilot.conversion_failed_identifiers))\n",
    "    print('Example (first 25):', sorted(harmonizer_hlca_pilot.conversion_failed_identifiers)[:25])\n",
    "\n",
    "    print('\\n--- Failed but consistent (kept) ---')\n",
    "    print('Count:', len(harmonizer_hlca_pilot.conversion_failed_but_consistent_identifiers))\n",
    "    print('Example (first 25):', sorted(harmonizer_hlca_pilot.conversion_failed_but_consistent_identifiers)[:25])\n",
    "\n",
    "    print('\\n--- Collapsed ambiguous mappings (1→n; multiple Ensembl candidates) ---')\n",
    "    print('Count:', len(harmonizer_hlca_pilot.multiple_ensembl_dict))\n",
    "    some_keys = list(harmonizer_hlca_pilot.multiple_ensembl_dict)[:10]\n",
    "    print('Example keys:', some_keys)\n"
   ]
  },
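  {
   "cell_type": "markdown",
   "id": "f2b7d1e9",
   "metadata": {},
   "source": [
    "Raw counts are easier to compare across runs once they are turned into rates. The sketch below shows one way to do that; `summarize_diagnostics` and the placeholder identifier sets are hypothetical and stand in for the harmonizer attributes printed above.\n",
    "\n",
    "```python\n",
    "from __future__ import annotations\n",
    "\n",
    "\n",
    "def summarize_diagnostics(all_ids: set[str], failed: set[str], ambiguous: set[str]) -> dict[str, float]:\n",
    "    # Rates are fractions of the distinct input identifiers.\n",
    "    total = len(all_ids)\n",
    "    if total == 0:\n",
    "        return {'n_total': 0, 'failure_rate': 0.0, 'ambiguity_rate': 0.0}\n",
    "    return {\n",
    "        'n_total': total,\n",
    "        'failure_rate': len(failed & all_ids) / total,\n",
    "        'ambiguity_rate': len(ambiguous & all_ids) / total,\n",
    "    }\n",
    "\n",
    "\n",
    "# Placeholder identifiers, not real diagnostics.\n",
    "all_ids = {'GENE_A', 'GENE_B', 'GENE_C', 'GENE_D'}\n",
    "failed = {'GENE_D'}\n",
    "ambiguous = {'GENE_B'}\n",
    "\n",
    "summary = summarize_diagnostics(all_ids, failed, ambiguous)\n",
    "print(summary)\n",
    "```\n",
    "\n",
    "With a real pilot, `failed` could be `harmonizer_hlca_pilot.conversion_failed_identifiers` and `ambiguous` the keys of `multiple_ensembl_dict`; `all_ids` would have to be collected from the `var_names` of the input datasets.\n"
   ]
  },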
  {
   "cell_type": "markdown",
   "id": "e2f1db0b",
   "metadata": {},
   "source": [
    "#### 5.2.5.7 “Union vs intersect” experiment (feature retention)\n",
    "\n",
    "This is a quick way to understand how much “feature width” you are adding when you merge many datasets.\n",
    "Start with the pilot: it is fast and gives you the same qualitative signal.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bdbee3ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "if harmonizer_hlca_pilot is None:\n",
    "    print('Pilot harmonizer not created; skipping merge step.')\n",
    "else:\n",
    "    hlca_pilot_union = harmonizer_hlca_pilot.unify_multiple_anndatas(mode=\"union\")\n",
    "    hlca_pilot_intersect = harmonizer_hlca_pilot.unify_multiple_anndatas(mode=\"intersect\")\n",
    "\n",
    "    print('Union shape (cells x genes):', hlca_pilot_union.shape)\n",
    "    print('Intersect shape (cells x genes):', hlca_pilot_intersect.shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c890de54",
   "metadata": {},
   "source": [
    "#### 5.2.5.8 Scale up to full HLCA (template)\n",
    "\n",
    "Once the pilot looks good, you can scale up by switching the dataset dictionary.\n",
    "The code below is commented out by default because a full HLCA run can take a long time and use significant disk space.\n",
    "\n",
    "> **Warning:** keep the output folder and intermediate artifacts. They are your provenance record.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "498cb548",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Full HLCA run template (commented out by default)\n",
    "#\n",
    "# if data_h5ad_dict_hlca_all:\n",
    "#     project_out_full = (LOCAL_REPOSITORY / 'hlca_full_run').resolve()\n",
    "#     project_out_full.mkdir(parents=True, exist_ok=True)\n",
    "#\n",
    "#     harmonizer_full = idtrack.HarmonizeFeatures(\n",
    "#         project_name='hlca_full',\n",
    "#         data_h5ad_dict=data_h5ad_dict_hlca_all,\n",
    "#         project_local_repository=str(project_out_full),\n",
    "#         idtrack_local_repository=str(LOCAL_REPOSITORY),\n",
    "#         target_ensembl_release=TARGET_RELEASE,\n",
    "#         final_database=FINAL_DATABASE,\n",
    "#         organism_name=organism,\n",
    "#         graph_last_ensembl_release=TARGET_RELEASE,\n",
    "#         verbose_level=2,\n",
    "#     )\n",
    "#\n",
    "#     hlca_union = harmonizer_full.unify_multiple_anndatas(mode=\"union\")\n",
    "#     hlca_union.write_h5ad(project_out_full / 'hlca_union_harmonized.h5ad')\n",
    "#\n",
    "#     # Optional: also keep an intersect version\n",
    "#     # hlca_intersect = harmonizer_full.unify_multiple_anndatas(mode=\"intersect\")\n",
    "#     # hlca_intersect.write_h5ad(project_out_full / 'hlca_intersect_harmonized.h5ad')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd17a0e2",
   "metadata": {},
   "source": [
    "## 5.3 — Legacy Data Rescue\n",
    "\n",
    "Legacy datasets often contain a mix of:\n",
    "- older Ensembl IDs (from older releases)\n",
    "- gene symbols (which may have changed)\n",
    "- project-specific aliases\n",
    "\n",
    "A safe, reproducible rescue workflow is:\n",
    "\n",
    "1. Pick a **snapshot boundary** (the newest release you allow).\n",
    "2. Convert into a stable namespace (usually Ensembl gene IDs) at that snapshot.\n",
    "3. Inspect failure + ambiguity rates.\n",
    "4. Only then convert into presentation-friendly labels (HGNC symbols) if needed.\n",
    "\n",
    "The next cell demonstrates a small, realistic \"mixed identifier\" rescue using the human API.\n",
    "\n",
    "> **Tip:** Legacy data rescue often involves older releases *and* older assemblies. The default human snapshot is multi-assembly and can map\n",
    "> GRCh37-derived identifiers into your chosen snapshot/primary assembly. Only rebuild with `genome_assembly=37` if your downstream\n",
    "> reference is GRCh37 and you want outputs anchored to that build (see Part 3).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77c4e75b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Legacy rescue demo (human)\n",
    "\n",
    "api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n",
    "api.configure_logger()\n",
    "\n",
    "organism, latest_release = api.resolve_organism('human')\n",
    "api.build_graph(organism_name=organism, snapshot_release=latest_release, calculate_caches=False)\n",
    "\n",
    "legacy_ids = [\n",
    "    'TP53',                # HGNC symbol (common)\n",
    "    'P53',                 # older/alias-like symbol (may or may not resolve)\n",
    "    'ENSG00000141510',     # Ensembl gene ID\n",
    "    'ENSG00000139618',     # BRCA2\n",
    "    'BRCA1',               # symbol\n",
    "    'NOT_A_REAL_GENE',     # should become a clean 1→0 example\n",
    "]\n",
    "\n",
    "# Convert into Ensembl gene IDs at the snapshot boundary (stable backbone)\n",
    "results = api.convert_identifier_multiple(legacy_ids, to_release=latest_release, final_database=None, strategy='all', verbose=False)\n",
    "summary = api.classify_multiple_conversion(results)\n",
    "api.print_binned_conversion(summary)\n",
    "\n",
    "# Show a couple of raw results for inspection\n",
    "results[:3]\n"
   ]
  },
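  {
   "cell_type": "markdown",
   "id": "a9c3e5f7",
   "metadata": {},
   "source": [
    "Before converting a legacy list, it can help to see how mixed it actually is. The sketch below is a rough pre-check, not part of IDTrack: `guess_namespace` and its labels are hypothetical, and the pattern only recognizes the shape of human Ensembl gene IDs. The version suffix in the example list is made up to show that versioned IDs are matched too.\n",
    "\n",
    "```python\n",
    "import re\n",
    "from collections import Counter\n",
    "\n",
    "# Human Ensembl gene IDs: 'ENSG' plus 11 digits, optionally a '.version' suffix.\n",
    "ENSEMBL_GENE = re.compile(r'^ENSG[0-9]{11}([.][0-9]+)?$')\n",
    "\n",
    "\n",
    "def guess_namespace(identifier: str) -> str:\n",
    "    # Heuristic only: anything that is not Ensembl-shaped is labeled 'other'.\n",
    "    return 'ensembl_like' if ENSEMBL_GENE.match(identifier) else 'other'\n",
    "\n",
    "\n",
    "legacy_ids = ['TP53', 'P53', 'ENSG00000141510', 'ENSG00000139618.12', 'BRCA1', 'NOT_A_REAL_GENE']\n",
    "namespace_counts = Counter(guess_namespace(i) for i in legacy_ids)\n",
    "print(namespace_counts)\n",
    "```\n",
    "\n",
    "A shape check like this cannot tell a valid symbol from a typo; that distinction only shows up in the conversion diagnostics.\n"
   ]
  },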
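  {
   "cell_type": "markdown",
   "id": "dda9c7d7",
   "metadata": {},
   "source": [
    "## 5.4 Summary\n",
    "\n",
    "- Harmonize identifiers before merging, so that a shared feature name refers to the same biological entity in every dataset.\n",
    "- Read the diagnostics (1→0 failures, 1→n ambiguities) on a small pilot before scaling to the full collection.\n",
    "- Choose `union` or `intersect` deliberately, and report the choice.\n",
    "- Pin the snapshot release and keep the intermediate artifacts as your provenance record.\n"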
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}