{ "cells": [ { "cell_type": "markdown", "id": "3af32b3c", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "# Part 5 \u2014 Real-World Experiments: Harmonization\n", "\n", "*Last updated:* 2026-01-08\n", "\n", "This tutorial shows how to use IDTrack for **real-world dataset harmonization**.\n", "\n", "You will learn:\n", "- how to harmonize feature identifiers across multiple `.h5ad` datasets (HLCA-style)\n", "- how to interpret harmonization diagnostics (what changed, what failed, what is ambiguous)\n", "- how to choose between **union** vs **intersection** feature spaces\n", "- how to approach **legacy data rescue** (older identifiers, mixed namespaces)\n", "\n", "> **Tip:** Start with the toy demo first. The exact same logic scales to large datasets.\n", "\n" ] }, { "cell_type": "markdown", "id": "f06d6790", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "## 5.0 \u2014 Why harmonization matters (plain language)\n", "\n", "When you merge datasets, you implicitly assume that feature `X` in dataset A is the same biological entity\n", "as feature `X` in dataset B.\n", "\n", "This breaks when:\n", "- datasets use different Ensembl releases (IDs changed)\n", "- one dataset uses HGNC symbols and the other uses Ensembl IDs\n", "- some features map 1\u2192n (ambiguity) or 1\u21920 (no match)\n", "\n", "IDTrack makes these cases explicit and gives you reproducible conversions anchored to a graph snapshot.\n" ] }, { "cell_type": "markdown", "id": "484981c1", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "## 5.0.1 Pre-requisites\n", "\n", "- You can run `03_initialization_graph.ipynb` for human (graph snapshot exists in your local repository).\n", "- You have `anndata` installed (it is an IDTrack dependency).\n", "- For the HLCA section, you need access to the HLCA `.h5ad` files (not bundled here).\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "e1c0a9a6", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [], "source": [ "# Load notebook utilities (collapsible output magic for tutorials)\n", "%load_ext _notebook_utils\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "25cd6840", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "IDTrack local repository: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache\n" ] } ], "source": [ "# 1) Setup\n", "from __future__ import annotations\n", "\n", "import os\n", "from pathlib import Path\n", "\n", "import anndata as ad\n", "import numpy as np\n", "import pandas as pd\n", "from scipy import sparse\n", "\n", "import idtrack\n", "\n", "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n", "LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)\n", "\n", "print('IDTrack local repository:', LOCAL_REPOSITORY)\n" ] }, { "cell_type": "markdown", "id": "e2b5a81a", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "## 5.1 \u2014 Toy Harmonization (start small, then scale)\n", "\n", "We will create two small `AnnData` objects with overlapping genes but different identifier styles.\n", "This shows the workflow without requiring large data downloads.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "8b6272bd", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [ { "data": { "text/plain": [ "(PosixPath('/ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/toy_harmonization_demo/toy_A_symbols.h5ad'),\n", " PosixPath('/ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/toy_harmonization_demo/toy_B_mixed.h5ad'))" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 2.1 Create toy datasets\n", "\n", "toy_dir = LOCAL_REPOSITORY / 'toy_harmonization_demo'\n", "toy_dir.mkdir(parents=True, exist_ok=True)\n", "\n", "# Dataset A: HGNC symbols (common in many wet-lab exports)\n", "genes_a = ['TP53', 'BRCA1', 'BRCA2', 'BRAF', 'KRAS']\n", "X_a = sparse.random(50, len(genes_a), density=0.2, format='csr', random_state=0)\n", "adata_a = ad.AnnData(X=X_a, obs=pd.DataFrame(index=[f'cellA_{i}' for i in range(X_a.shape[0])]), var=pd.DataFrame(index=genes_a))\n", "\n", "# Dataset B: mix of HGNC + an Ensembl stable ID (realistic messy scenario)\n", "genes_b = ['TP53', 'ENSG00000141510', 'BRCA1', 'NOT_A_REAL_GENE']\n", "X_b = sparse.random(60, len(genes_b), density=0.2, format='csr', random_state=1)\n", "adata_b = ad.AnnData(X=X_b, obs=pd.DataFrame(index=[f'cellB_{i}' for i in range(X_b.shape[0])]), var=pd.DataFrame(index=genes_b))\n", "\n", "path_a = toy_dir / 'toy_A_symbols.h5ad'\n", "path_b = toy_dir / 'toy_B_mixed.h5ad'\n", "adata_a.write_h5ad(path_a)\n", "adata_b.write_h5ad(path_b)\n", "\n", "path_a, path_b\n" ] }, { "cell_type": "markdown", "id": "c5c339e9", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "### 5.1.1 Run the harmonizer\n", "\n", "IDTrack provides `idtrack.HarmonizeFeatures`.\n", "\n", "Key parameters (interpretation):\n", "- `idtrack_local_repository`: where your built graph snapshot lives\n", "- `graph_last_ensembl_release`: which snapshot release the graph contains (must match your build)\n", "- `target_ensembl_release`: the release you want to harmonize *to*\n", "- `final_database`: what you want to keep as feature IDs (HGNC, Ensembl, \u2026)\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "37affa4f", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [ { "data": { "text/html": [ "
\n", "Click to show conversion logs\n", "
2026-01-17 19:15:44 ERROR:ipykernel.comm: No such comm target registered: jupyter.widget.control
\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "
\n", "Click to show conversion logs\n", "
(no output)
\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%collapse Click to show conversion logs\n", "# Included for tutorial purposes only.\n", "\n", "# 2.2 Harmonize the toy datasets\n", "\n", "data_h5ad_dict = {\n", " 'toy_A': str(path_a),\n", " 'toy_B': str(path_b),\n", "}\n", "\n", "project_out = toy_dir / 'outputs'\n", "project_out.mkdir(parents=True, exist_ok=True)\n", "\n", "organism_name, latest_release = idtrack.API(str(LOCAL_REPOSITORY)).resolve_organism('human')\n", "\n", "harmonizer = idtrack.HarmonizeFeatures(\n", " project_name='toy_demo',\n", " data_h5ad_dict=data_h5ad_dict,\n", " project_local_repository=str(project_out),\n", " idtrack_local_repository=str(LOCAL_REPOSITORY),\n", " target_ensembl_release=latest_release,\n", " final_database='HGNC Symbol',\n", " organism_name=organism_name,\n", " graph_last_ensembl_release=latest_release,\n", " verbose_level=1,\n", ")\n", "\n", "harmonizer\n" ] }, { "cell_type": "markdown", "id": "3711c22e", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "### 5.1.2 Inspect what happened\n", "\n", "Useful things to look at:\n", "- which identifiers failed conversion\n", "- which identifiers were ambiguous\n", "- per-dataset result pickle files written under `project_local_repository`\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "3a87f98d", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Conversion failed (any dataset): ['ENSG00000141510', 'NOT_A_REAL_GENE']\n", "Conversion failed but consistent: []\n", "Converted IDs with multiple Ensembl possibilities (collapsed): ['TP53', 'BRCA1', 'BRCA2', 'BRAF', 'KRAS']\n" ] } ], "source": [ "print('Conversion failed (any dataset):', sorted(list(harmonizer.conversion_failed_identifiers))[:20])\n", "print('Conversion failed but consistent:', sorted(list(harmonizer.conversion_failed_but_consistent_identifiers))[:20])\n", "print('Converted IDs with multiple Ensembl possibilities (collapsed):', list(harmonizer.multiple_ensembl_dict)[:10])\n" ] }, { "cell_type": "markdown", "id": "1bb06dbe", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "### 5.1.3 Produce a unified AnnData (union or intersection)\n", "\n", "After harmonization, you often want a single merged dataset for downstream analysis.\n", "\n", "IDTrack\u2019s harmonizer can merge datasets in two modes:\n", "- `mode='union'` (default): keep the union of all features (missing genes become zeros)\n", "- `mode='intersect'`: keep only features present in every dataset\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "258450c2", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 2/2 [00:01<00:00, 1.73it/s, dataset=toy_B, study_var=2, union_var=5, dbh=2]\n", "100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 2/2 [00:00<00:00, 4.91it/s, dataset=toy_B, study_var=2, union_var=5, dbh=2]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Union shape: (110, 5)\n", "Intersect shape: (110, 2)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Merge the toy datasets (this uses AnnData.concat under the hood)\n", "unified_union = harmonizer.unify_multiple_anndatas(mode='union')\n", "unified_intersect = harmonizer.unify_multiple_anndatas(mode='intersect')\n", "\n", "print('Union shape:', unified_union.shape)\n", "print('Intersect shape:', unified_intersect.shape)\n" ] }, { "cell_type": "markdown", "id": "5111dc60", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "At HLCA scale, `union` can create a very large feature matrix.\n", "If you plan to run memory-heavy models, consider starting with `intersect` or filtering genes first.\n" ] }, { "cell_type": "markdown", "id": "549c8c6b", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "## 5.2 \u2014 Multi-Dataset Integration (best practices)\n", "\n", "Once you can harmonize identifiers, you can integrate datasets *without accidentally mixing incompatible feature definitions*.\n", "\n", "Practical best practices:\n", "\n", "- **Start by harmonizing into a stable backbone** (usually Ensembl gene IDs). Convert to symbols only for reporting.\n", "- **Treat 1\u2192n mappings as real biology/data ambiguity**, not as a nuisance. Decide a policy:\n", " - drop ambiguous features\n", " - keep all candidates (inflates feature space)\n", " - collapse to a single representative (requires an explicit rule)\n", "- **Pilot first**: run harmonization on a small subset and inspect diagnostics before scaling up.\n", "- **Union vs intersection**:\n", " - `union` keeps more features but can produce a very wide matrix.\n", " - `intersect` is stricter and often improves comparability, but may drop biologically important genes.\n", "\n", "> **Expected result:** you should be able to justify (and report) your integration choice, not just run a tool.\n", "\n" ] }, { "cell_type": "markdown", "id": "9bea2eee", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "### 5.2.1 Scaling up to HLCA (the real use case)\n", "\n", "HLCA-scale harmonization is the same logic, but you need to manage:\n", "- many datasets (often *multiple files per study*)\n", "- large gene lists (tens of thousands of identifiers)\n", "- choosing a snapshot release (what you harmonize *to*)\n", "- disk for intermediate artifacts (pickles/logs)\n", "\n", "### 5.2.2 HLCA datasets used in this tutorial\n", "\n", "HLCA `.h5ad` files are not shipped with IDTrack. In this tutorial we simply *point to existing files* (no downloads).\n", "\n", "The next cell defines a curated HLCA list (study \u2192 list of `.h5ad` paths).\n", "If you're running this notebook on a different system, edit `base_path` in the next cell to match where the HLCA data lives.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "4d578806", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "HLCA data root: /lustre/groups/ml01/projects/2023_HLCA_LSikkema/HLCA_reproducibility/data\n", "Studies in curated list: 27\n" ] } ], "source": [ "# 5.2.2 HLCA dataset paths used in this tutorial\n", "#\n", "# This notebook is meant to be easy to follow: edit *one* string if your HLCA data lives elsewhere.\n", "\n", "base_path = '/lustre/groups/ml01/projects/2023_HLCA_LSikkema/HLCA_reproducibility/data'\n", "dset0_dir = os.path.join(base_path, 'HLCA_extended/extension_datasets/ready/full')\n", "dset1_dir = os.path.join(base_path, 'HLCA_extended/extension_datasets/raw')\n", "\n", "# Curated HLCA AnnData list (study \u2192 list of .h5ad files)\n", "hlca_adata_dict = {\n", " 'Kaminski_2020': [f'{dset0_dir}/adams.h5ad'],\n", " 'Meyer_2021': [f'{dset0_dir}/meyer_2021.h5ad'],\n", " 'MeyerNikolic_unpubl': [f'{dset0_dir}/meyer_nikolic_unpubl.h5ad'],\n", " 'Barbry_unpubl': [f'{dset0_dir}/barbry.h5ad'],\n", " 'Regev_2021': [\n", " f'{dset0_dir}/delorey_cryo.h5ad',\n", " f'{dset0_dir}/delorey_fresh.h5ad',\n", " f'{dset0_dir}/delorey_nuclei.h5ad',\n", " ],\n", " 'Thienpont_2018': [f'{dset1_dir}/Lambrechts/lambrechts.h5ad'],\n", " 'Budinger_2020': [f'{dset0_dir}/bharat.h5ad'],\n", " 'Banovich_Kropski_2020': [f'{dset0_dir}/haberman.h5ad'],\n", " 'Sheppard_2020': [f'{dset0_dir}/tsukui.h5ad'],\n", " 'Wunderink_2021': [f'{dset0_dir}/grant_cryo.h5ad', f'{dset0_dir}/grant_fresh.h5ad'],\n", " 'Lambrechts_2021': [f'{dset0_dir}/wouters.h5ad'],\n", " 'Zhang_2021': [f'{dset1_dir}/Liao/covid_for_publish.h5ad'],\n", " 'Duong_lungMAP_unpubl': [f'{dset0_dir}/duong.h5ad'],\n", " 'Janssen_2020': [f'{dset0_dir}/mould.h5ad'],\n", " 'Sun_2020': [\n", " f'{dset0_dir}/wang_sub_batch1.h5ad',\n", " f'{dset0_dir}/wang_sub_batch2.h5ad',\n", " f'{dset0_dir}/wang_sub_batch3.h5ad',\n", " f'{dset0_dir}/wang_sub_batch4.h5ad',\n", " ],\n", " 'Gomperts_2021': [\n", " f'{dset0_dir}/carraro_ucla.h5ad',\n", " f'{dset0_dir}/carraro_cff.h5ad',\n", " f'{dset0_dir}/carraro_csmc.h5ad',\n", " ],\n", " 'Eils_2020': [f'{dset0_dir}/lukassen.h5ad'],\n", " 'Schiller_2020': [f'{dset0_dir}/mayr.h5ad'],\n", " 'Misharin_Budinger_2018': [f'{dset0_dir}/reyfman_disease.h5ad'],\n", " 'Shalek_2018': [f'{dset0_dir}/ordovasmontanes.h5ad'],\n", " 'Schiller_2021': [f'{dset0_dir}/schiller_discovair.h5ad'],\n", " 'Peer_Massague_2020': [f'{dset0_dir}/laughney.h5ad'],\n", " 'Lafyatis_2019': [f'{dset0_dir}/valenzi.h5ad'],\n", " 'Tata_unpubl': [f'{dset0_dir}/tata_unpubl.h5ad'],\n", " 'Xu_2020': [f'{dset0_dir}/guo.h5ad'],\n", " 'Sims_2019': [f'{dset0_dir}/szabo.h5ad'],\n", " 'Schultze_unpubl': [f'{dset0_dir}/schultze_unpubl.h5ad'],\n", "}\n", "\n", "print('HLCA data root:', base_path)\n", "print('Studies in curated list:', len(hlca_adata_dict))\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "hlca_validate_paths", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "HLCA files in curated list: 35\n", "Missing files: 0\n" ] } ], "source": [ "rows = []\n", "for study, paths in hlca_adata_dict.items():\n", " for p in paths:\n", " rows.append({'study': study, 'file': os.path.basename(p), 'path': p, 'exists': os.path.exists(p)})\n", "\n", "hlca_path_status = pd.DataFrame(rows)\n", "missing = hlca_path_status[~hlca_path_status['exists']].copy()\n", "\n", "print(f\"HLCA files in curated list: {len(hlca_path_status)}\")\n", "print(f\"Missing files: {len(missing)}\")\n", "\n", "if not missing.empty:\n", " missing.sort_values(['study', 'file'])\n" ] }, { "cell_type": "markdown", "id": "0c1f2a54", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "### 5.2.3 Tutorial subset (4 studies; easy to inspect)\n", "\n", "The full `hlca_adata_dict` contains many studies (and some have multiple `.h5ad` files).\n", "For a *tutorial notebook*, we intentionally keep the example small so it is easy to read, debug, and reproduce.\n", "\n", "We will only use these 4 single-file studies:\n", "- `Kaminski_2020`\n", "- `Meyer_2021`\n", "- `MeyerNikolic_unpubl`\n", "- `Barbry_unpubl`\n", "\n", "Important: `idtrack.HarmonizeFeatures` expects **`dict[str, str]`** (`dataset_alias \u2192 single .h5ad path`).\n", "So we convert the curated `hlca_adata_dict` (study \u2192 list[path]) into a small tutorial dictionary.\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "c9b6c6b2", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tutorial datasets found: 4 / 4\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pathexistsnote
study
Kaminski_2020/lustre/groups/ml01/projects/2023_HLCA_LSikkem...True
Meyer_2021/lustre/groups/ml01/projects/2023_HLCA_LSikkem...True
MeyerNikolic_unpubl/lustre/groups/ml01/projects/2023_HLCA_LSikkem...True
Barbry_unpubl/lustre/groups/ml01/projects/2023_HLCA_LSikkem...True
\n", "
" ], "text/plain": [ " path \\\n", "study \n", "Kaminski_2020 /lustre/groups/ml01/projects/2023_HLCA_LSikkem... \n", "Meyer_2021 /lustre/groups/ml01/projects/2023_HLCA_LSikkem... \n", "MeyerNikolic_unpubl /lustre/groups/ml01/projects/2023_HLCA_LSikkem... \n", "Barbry_unpubl /lustre/groups/ml01/projects/2023_HLCA_LSikkem... \n", "\n", " exists note \n", "study \n", "Kaminski_2020 True \n", "Meyer_2021 True \n", "MeyerNikolic_unpubl True \n", "Barbry_unpubl True " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HLCA_TUTORIAL_STUDIES = [\n", " 'Kaminski_2020',\n", " 'Meyer_2021',\n", " 'MeyerNikolic_unpubl',\n", " 'Barbry_unpubl',\n", "]\n", "\n", "rows = []\n", "data_h5ad_dict_hlca_tutorial: dict[str, str] = {}\n", "\n", "for study in HLCA_TUTORIAL_STUDIES:\n", " paths = hlca_adata_dict.get(study, [])\n", " if len(paths) != 1:\n", " # The tutorial subset is intentionally restricted to single-file studies.\n", " # If this triggers, update HLCA_TUTORIAL_STUDIES to pick a different study.\n", " rows.append((study, None, False, f'Expected 1 file, got {len(paths)}'))\n", " continue\n", "\n", " p = paths[0]\n", " exists = os.path.exists(p)\n", " rows.append((study, p, exists, ''))\n", " if exists:\n", " data_h5ad_dict_hlca_tutorial[study] = p\n", "\n", "hlca_tutorial_status = pd.DataFrame(rows, columns=['study', 'path', 'exists', 'note']).set_index('study')\n", "print('Tutorial datasets found:', len(data_h5ad_dict_hlca_tutorial), '/', len(HLCA_TUTORIAL_STUDIES))\n", "\n", "if len(data_h5ad_dict_hlca_tutorial) == 0:\n", " print('\\nNo tutorial HLCA files found. Edit `base_path` in section 5.2.2 (or update the paths) and re-run.')\n", "\n", "hlca_tutorial_status\n" ] }, { "cell_type": "markdown", "id": "ca90336b", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "### 5.2.4 Run harmonization (HGNC target; tutorial subset)\n", "\n", "This can take time (and is best run on a workstation/server). IDTrack will:\n", "- load/build the human graph snapshot (cached in your local repository)\n", "- run identifier conversion for each dataset\n", "- write per-dataset mapping results as pickles under the project output directory\n", "\n", "This cell is wrapped in a collapsible block because conversion logs can be long.\n", "\n", "Note on outputs: when you build a merged AnnData (section 5.2.6), `.var_names` are **Ensembl gene IDs**.\n", "Mapped HGNC symbols are stored in `.var['converted_id']` for readable reporting.\n", "\n", "> **Reproducibility tip:** pin the release (set `TARGET_RELEASE` explicitly) for manuscripts and pipelines.\n", "\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "b6bcdc04", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "Click to show conversion logs\n", "
(no output)
\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%collapse Click to show conversion logs\n", "# Included for tutorial purposes only.\n", "\n", "# HLCA tutorial harmonization (4 datasets)\n", "#\n", "# Note: this is best run on a workstation/server because it can download/build caches on first run.\n", "\n", "if not data_h5ad_dict_hlca_tutorial:\n", " print('No tutorial datasets found; skipping harmonization. Edit `base_path` in section 5.2.2 (or update the paths) and re-run.')\n", " harmonizer_hlca = None\n", "else:\n", " project_out_hlca = (LOCAL_REPOSITORY / 'hlca_tutorial_outputs').resolve()\n", " project_out_hlca.mkdir(parents=True, exist_ok=True)\n", "\n", " # Resolve the organism and pick a snapshot boundary\n", " api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n", " organism_name, latest_release = api.resolve_organism('human')\n", "\n", " # Option A (default): always use the latest release available in your graph\n", " TARGET_RELEASE = latest_release\n", "\n", " # Option B (recommended for pipelines/papers): pin the release explicitly\n", " # TARGET_RELEASE = 110\n", "\n", " harmonizer_hlca = idtrack.HarmonizeFeatures(\n", " project_name='hlca_tutorial',\n", " data_h5ad_dict=data_h5ad_dict_hlca_tutorial,\n", " project_local_repository=str(project_out_hlca),\n", " idtrack_local_repository=str(LOCAL_REPOSITORY),\n", " target_ensembl_release=TARGET_RELEASE,\n", " final_database='HGNC Symbol',\n", " organism_name=organism_name,\n", " graph_last_ensembl_release=TARGET_RELEASE,\n", " verbose_level=1,\n", " )\n" ] }, { "cell_type": "markdown", "id": "19928eb0", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "### 5.2.5 Inspect what happened (simple tables)\n", "\n", "After the harmonizer is created, it exposes diagnostics that tell you whether harmonization is trustworthy.\n", "\n", "In this tutorial we focus on a few practical checks:\n", "- how many identifiers mapped cleanly (1\u21921)\n", "- how many failed (1\u21920)\n", "- how many were ambiguous (1\u2192n)\n", "- how many HGNC targets were actually **fallbacks** (no symbol available at the snapshot)\n", "\n", "We keep the output as small tables so it is easy to read in a tutorial notebook.\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "hlca_summary_tables", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
input_idsone_to_one_targetfallback_1_to_1one_to_many_targetfallback_1_to_none_to_nonesuccess_ratefailure_rate
dataset
Barbry_unpubl16859157211095412270.9983980.001602
Kaminski_20204594734808110021919990.9978450.002155
MeyerNikolic_unpubl3358227722568922231260.9962480.003752
Meyer_2021209222059131210180.9991400.000860
\n", "
" ], "text/plain": [ " input_ids one_to_one_target fallback_1_to_1 \\\n", "dataset \n", "Barbry_unpubl 16859 15721 1095 \n", "Kaminski_2020 45947 34808 11002 \n", "MeyerNikolic_unpubl 33582 27722 5689 \n", "Meyer_2021 20922 20591 312 \n", "\n", " one_to_many_target fallback_1_to_n one_to_none \\\n", "dataset \n", "Barbry_unpubl 4 12 27 \n", "Kaminski_2020 19 19 99 \n", "MeyerNikolic_unpubl 22 23 126 \n", "Meyer_2021 1 0 18 \n", "\n", " success_rate failure_rate \n", "dataset \n", "Barbry_unpubl 0.998398 0.001602 \n", "Kaminski_2020 0.997845 0.002155 \n", "MeyerNikolic_unpubl 0.996248 0.003752 \n", "Meyer_2021 0.999140 0.000860 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Per-dataset conversion summary (HGNC target)\n", "\n", "if 'harmonizer_hlca' not in globals() or harmonizer_hlca is None:\n", " print('HLCA harmonizer not created; run section 5.2.4 first.')\n", "else:\n", " matchings_by_dataset = harmonizer_hlca.get_idtrack_matchings_for_all_datasets()\n", "\n", " def _summarize_matchings(matchings: list[dict]) -> dict[str, int]:\n", " c = {\n", " 'input_ids': 0,\n", " 'one_to_one': 0,\n", " 'one_to_many': 0,\n", " 'one_to_none': 0,\n", " 'fallback_1_to_1': 0,\n", " 'fallback_1_to_n': 0,\n", " }\n", "\n", " for m in matchings:\n", " c['input_ids'] += 1\n", "\n", " if m.get('no_corresponding') or m.get('no_conversion'):\n", " c['one_to_none'] += 1\n", " continue\n", "\n", " target_ids = m.get('target_id', [])\n", " n_targets = len(target_ids)\n", "\n", " if n_targets == 1:\n", " c['one_to_one'] += 1\n", " if m.get('no_target'):\n", " c['fallback_1_to_1'] += 1\n", " elif n_targets > 1:\n", " c['one_to_many'] += 1\n", " if m.get('no_target'):\n", " c['fallback_1_to_n'] += 1\n", " else:\n", " # Defensive fallback: treat empty target list as 1\u21920\n", " c['one_to_none'] += 1\n", "\n", " return c\n", "\n", " rows = []\n", " for dataset, matchings in matchings_by_dataset.items():\n", " s = _summarize_matchings(matchings)\n", " rows.append({'dataset': dataset, **s})\n", "\n", " df = pd.DataFrame(rows).set_index('dataset').sort_index()\n", " df['one_to_one_target'] = df['one_to_one'] - df['fallback_1_to_1']\n", " df['one_to_many_target'] = df['one_to_many'] - df['fallback_1_to_n']\n", " df['success_rate'] = (df['one_to_one'] + df['one_to_many']) / df['input_ids']\n", " df['failure_rate'] = df['one_to_none'] / df['input_ids']\n", "\n", " # Keep the tutorial table compact\n", " df = df[[\n", " 'input_ids',\n", " 'one_to_one_target',\n", " 'fallback_1_to_1',\n", " 'one_to_many_target',\n", " 'fallback_1_to_n',\n", " 'one_to_none',\n", " 'success_rate',\n", " 'failure_rate',\n", " ]]\n", " display(df)\n" ] }, { "cell_type": "markdown", "id": "hlca_merge_union_intersect_md", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "### 5.2.6 Build harmonized AnnData (union vs intersect)\n", "\n", "Now that identifiers are harmonized, we can actually merge the datasets into a single AnnData object.\n", "\n", "IDTrack provides two practical merge modes:\n", "- `mode='union'`: keep the union of genes (missing genes become zeros; wider matrix)\n", "- `mode='intersect'`: keep only genes shared by *all* datasets (stricter; narrower matrix)\n", "\n", "Feature naming reminder: `.var_names` are Ensembl gene IDs and `.var['converted_id']` holds the HGNC symbol.\n", "\n", "This step loads the datasets into memory, so run it on a workstation/server.\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "hlca_merge_union_intersect_code", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
n_cellsn_genesgenes_retained_vs_union
union774178471881.000000
intersect774178134050.284076
\n", "
" ], "text/plain": [ " n_cells n_genes genes_retained_vs_union\n", "union 774178 47188 1.000000\n", "intersect 774178 13405 0.284076" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "Click to show merge logs\n", "
Genes present in all datasets (from union .var['intersection']): 13405\n",
       "\r\n",
       "  0%|                                                                                                                 | 0/4 [00:00<?, ?it/s]\r\n",
       "  0%|                                                  | 0/4 [00:10<?, ?it/s, dataset=Kaminski_2020, study_var=43939, union_var=0, dbh=2008]\r\n",
       "  0%|                                                  | 0/4 [00:10<?, ?it/s, dataset=Kaminski_2020, study_var=43939, union_var=0, dbh=2008]\r\n",
       " 25%|##########5                               | 1/4 [00:12<00:38, 12.91s/it, dataset=Kaminski_2020, study_var=43939, union_var=0, dbh=2008]\r\n",
       " 25%|##########5                               | 1/4 [00:16<00:38, 12.91s/it, dataset=Meyer_2021, study_var=20517, union_var=43939, dbh=405]\r\n",
       " 25%|##########5                               | 1/4 [00:16<00:38, 12.91s/it, dataset=Meyer_2021, study_var=20517, union_var=43939, dbh=405]\r\n",
       " 50%|#####################                     | 2/4 [00:30<00:31, 15.58s/it, dataset=Meyer_2021, study_var=20517, union_var=43939, dbh=405]\r\n",
       " 50%|################                | 2/4 [00:38<00:31, 15.58s/it, dataset=MeyerNikolic_unpubl, study_var=30727, union_var=44641, dbh=2855]\r\n",
       " 50%|################                | 2/4 [00:38<00:31, 15.58s/it, dataset=MeyerNikolic_unpubl, study_var=30727, union_var=44641, dbh=2855]\r\n",
       " 75%|########################        | 3/4 [01:00<00:22, 22.21s/it, dataset=MeyerNikolic_unpubl, study_var=30727, union_var=44641, dbh=2855]\r\n",
       " 75%|#############################2         | 3/4 [01:08<00:22, 22.21s/it, dataset=Barbry_unpubl, study_var=16251, union_var=46761, dbh=608]\r\n",
       " 75%|#############################2         | 3/4 [01:08<00:22, 22.21s/it, dataset=Barbry_unpubl, study_var=16251, union_var=46761, dbh=608]\r\n",
       "100%|#######################################| 4/4 [01:39<00:00, 28.79s/it, dataset=Barbry_unpubl, study_var=16251, union_var=46761, dbh=608]\r\n",
       "100%|#######################################| 4/4 [01:39<00:00, 24.83s/it, dataset=Barbry_unpubl, study_var=16251, union_var=46761, dbh=608]\n",
       "\r\n",
       "  0%|                                                                                                                 | 0/4 [00:00<?, ?it/s]\r\n",
       "  0%|                                                  | 0/4 [00:09<?, ?it/s, dataset=Kaminski_2020, study_var=43939, union_var=0, dbh=2008]\r\n",
       "  0%|                                                  | 0/4 [00:09<?, ?it/s, dataset=Kaminski_2020, study_var=43939, union_var=0, dbh=2008]\r\n",
       " 25%|##########5                               | 1/4 [00:12<00:36, 12.08s/it, dataset=Kaminski_2020, study_var=43939, union_var=0, dbh=2008]\r\n",
       " 25%|##########5                               | 1/4 [00:14<00:36, 12.08s/it, dataset=Meyer_2021, study_var=20517, union_var=43939, dbh=405]\r\n",
       " 25%|##########5                               | 1/4 [00:14<00:36, 12.08s/it, dataset=Meyer_2021, study_var=20517, union_var=43939, dbh=405]\r\n",
       " 50%|#####################                     | 2/4 [00:27<00:28, 14.12s/it, dataset=Meyer_2021, study_var=20517, union_var=43939, dbh=405]\r\n",
       " 50%|################                | 2/4 [00:35<00:28, 14.12s/it, dataset=MeyerNikolic_unpubl, study_var=30727, union_var=19815, dbh=2855]\r\n",
       " 50%|################                | 2/4 [00:35<00:28, 14.12s/it, dataset=MeyerNikolic_unpubl, study_var=30727, union_var=19815, dbh=2855]\r\n",
       " 75%|########################        | 3/4 [00:45<00:16, 16.00s/it, dataset=MeyerNikolic_unpubl, study_var=30727, union_var=19815, dbh=2855]\r\n",
       " 75%|#############################2         | 3/4 [00:52<00:16, 16.00s/it, dataset=Barbry_unpubl, study_var=16251, union_var=19815, dbh=608]\r\n",
       " 75%|#############################2         | 3/4 [00:52<00:16, 16.00s/it, dataset=Barbry_unpubl, study_var=16251, union_var=19815, dbh=608]\r\n",
       "100%|#######################################| 4/4 [01:18<00:00, 22.73s/it, dataset=Barbry_unpubl, study_var=16251, union_var=19815, dbh=608]\r\n",
       "100%|#######################################| 4/4 [01:18<00:00, 19.73s/it, dataset=Barbry_unpubl, study_var=16251, union_var=19815, dbh=608]
\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%collapse Click to show merge logs\n", "# Included for tutorial purposes only.\n", "\n", "if 'harmonizer_hlca' not in globals() or harmonizer_hlca is None:\n", " print('HLCA harmonizer not created; run section 5.2.4 first.')\n", " hlca_union = None\n", " hlca_intersect = None\n", "else:\n", " hlca_union = harmonizer_hlca.unify_multiple_anndatas(mode='union')\n", " hlca_intersect = harmonizer_hlca.unify_multiple_anndatas(mode='intersect')\n", "\n", " comparison = pd.DataFrame(\n", " {\n", " 'n_cells': [hlca_union.n_obs, hlca_intersect.n_obs],\n", " 'n_genes': [hlca_union.n_vars, hlca_intersect.n_vars],\n", " },\n", " index=['union', 'intersect'],\n", " )\n", " if hlca_union.n_vars:\n", " comparison['genes_retained_vs_union'] = [1.0, hlca_intersect.n_vars / hlca_union.n_vars]\n", "\n", " # Quick sanity checks\n", " if 'intersection' in hlca_union.var.columns:\n", " print(\"Genes present in all datasets (from union .var['intersection']):\", int(hlca_union.var['intersection'].sum()))\n", "\n", " # You now have two harmonized AnnDatas to use downstream:\n", " # - hlca_union\n", " # - hlca_intersect\n", "\n", " # Optional: save to disk (can be large)\n", " # hlca_union.write_h5ad(project_out_hlca / 'hlca_union_harmonized.h5ad')\n", " # hlca_intersect.write_h5ad(project_out_hlca / 'hlca_intersect_harmonized.h5ad')\n", "\n", " display(comparison)\n" ] }, { "cell_type": "code", "execution_count": 14, "id": "a14ac396-c346-4b49-a295-54c73bd32c4e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs \u00d7 n_vars = 774178 \u00d7 47188\n", " obs: 'handle_anndata'\n", " var: 'converted_id_Kaminski_2020', 'converted_id_Meyer_2021', 'converted_id_MeyerNikolic_unpubl', 'converted_id_Barbry_unpubl', 'converted_id', 'intersection'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "AnnData object with n_obs \u00d7 n_vars = 774178 \u00d7 13405\n", " obs: 'handle_anndata'\n", " var: 'converted_id'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(hlca_union)\n", "display(hlca_intersect)" ] }, { "cell_type": "markdown", "id": "dd17a0e2", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "## 5.3 \u2014 Legacy Data Rescue\n", "\n", "Legacy datasets often contain a mix of:\n", "- older Ensembl IDs (from older releases)\n", "- gene symbols (which may have changed)\n", "- project-specific aliases\n", "\n", "A safe, reproducible rescue workflow is:\n", "\n", "1. Pick a **snapshot boundary** (the newest release you allow).\n", "2. Convert into a stable namespace (usually Ensembl gene IDs) at that snapshot.\n", "3. Inspect failure + ambiguity rates.\n", "4. Only then convert into presentation-friendly labels (HGNC symbols) if needed.\n", "\n", "The next cell demonstrates a small, realistic \"mixed identifier\" rescue using the human API.\n", "\n", "**Tip:** Legacy data rescue often involves older releases *and* older assemblies. The default human snapshot is multi-assembly and can map GRCh37-derived identifiers into your chosen snapshot/primary assembly. Only rebuild with `genome_assembly=37` if your downstream reference is GRCh37 and you want outputs anchored to that build (see Part 3).\n", "\n" ] }, { "cell_type": "code", "execution_count": 13, "id": "77c4e75b", "metadata": { "deletable": true, "editable": true, "frozen": false }, "outputs": [ { "data": { "text/plain": [ "[{'target_id': ['ENSG00000141510.20'],\n", " 'last_node': [('ENSG00000141510.20', 'ENSG00000141510.20')],\n", " 'final_database': 'ensembl_gene',\n", " 'graph_id': 'TP53',\n", " 'query_id': 'TP53',\n", " 'no_corresponding': False,\n", " 'no_conversion': False,\n", " 'no_target': False},\n", " {'target_id': ['LRG_321.1'],\n", " 'last_node': [('LRG_321.1', 'LRG_321.1')],\n", " 'final_database': 'ensembl_gene',\n", " 'graph_id': 'P53',\n", " 'query_id': 'P53',\n", " 'no_corresponding': False,\n", " 'no_conversion': False,\n", " 'no_target': False},\n", " {'target_id': ['ENSG00000141510.20'],\n", " 'last_node': [('ENSG00000141510.20', 'ENSG00000141510.20')],\n", " 'final_database': 'ensembl_gene',\n", " 'graph_id': 'ENSG00000141510',\n", " 'query_id': 'ENSG00000141510',\n", " 'no_corresponding': False,\n", " 'no_conversion': False,\n", " 'no_target': False}]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Legacy rescue demo (human)\n", "\n", "api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n", "api.configure_logger()\n", "\n", "organism, latest_release = api.resolve_organism('human')\n", "api.build_graph(organism_name=organism, snapshot_release=latest_release, calculate_caches=False)\n", "\n", "legacy_ids = [\n", " 'TP53', # HGNC symbol (common)\n", " 'P53', # older/alias-like symbol (may or may not resolve)\n", " 'ENSG00000141510', # Ensembl gene ID\n", " 'ENSG00000139618', # BRCA2\n", " 'BRCA1', # symbol\n", " 'NOT_A_REAL_GENE', # should become a clean 1\u21920 example\n", "]\n", "\n", "# Convert into Ensembl gene IDs at the snapshot boundary (stable backbone)\n", "results = api.convert_identifier_multiple(legacy_ids, to_release=latest_release, final_database=None, strategy='all', verbose=False)\n", "summary = api.classify_multiple_conversion(results)\n", "api.print_binned_conversion(summary)\n", "\n", "# Show a couple of raw results for inspection\n", "results[:3]\n" ] }, { "cell_type": "markdown", "id": "dda9c7d7", "metadata": { "deletable": true, "editable": true, "frozen": false }, "source": [ "## 5.4 Summary\n", "\n", "You now have a tutorial workflow for *feature identifier harmonization* across multiple `.h5ad` datasets.\n", "\n", "**Core ideas to keep:**\n", "- Harmonize identifiers before integration; otherwise you risk mixing incompatible feature definitions.\n", "- Use Ensembl gene IDs as the stable integration backbone; keep HGNC symbols (or other externals) for readable reporting.\n", "- Treat 1\u21920 (no match) and 1\u2192n (ambiguity) as diagnostics: inspect and decide a policy.\n", "- Create both `union` and `intersect` merged AnnDatas early; they answer different downstream questions.\n", "\n", "**Practical next step:**\n", "Use `hlca_union` or `hlca_intersect` as input to your integration model of choice.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.12" } }, "nbformat": 4, "nbformat_minor": 5 }