{ "cells": [ { "cell_type": "markdown", "id": "9e76fd7d", "metadata": {}, "source": [ "# Part 6 — Cross-Species Workflows: Humanization\n", "\n", "*Last updated:* 2026-01-08\n", "\n", "This tutorial shows a practical **humanization** workflow: mapping mouse/pig genes into a human gene space so you can run\n", "human-centric downstream analyses (pathways, marker lists, integration, annotation).\n", "\n", "**Learning objectives**\n", "- Understand what humanization is (and when it is appropriate).\n", "- Run a step-by-step mouse → human and pig → human mapping.\n", "- Validate results and handle 1→n orthology ambiguity explicitly.\n", "- Prepare outputs in a tidy, analysis-friendly format for comparative workflows.\n", "\n", "> **Warning:** Orthology is not always one-to-one. This notebook focuses on making ambiguity visible and manageable.\n" ] }, { "cell_type": "markdown", "id": "77955920", "metadata": {}, "source": [ "## 6.1 — Humanization workflow (what it is, and when to use it)\n", "\n", "This notebook shows a **reproducible, step-by-step pipeline**:\n", "\n", "1. **Within-species cleanup (IDTrack)**\n", " - Map mouse/pig identifiers to consistent Ensembl gene IDs within that species.\n", "\n", "2. **Cross-species mapping (orthologs)**\n", " - Map mouse/pig Ensembl gene IDs to human ortholog Ensembl gene IDs.\n", "\n", "3. 
**Human naming (optional, IDTrack)**\n", " - Convert human Ensembl gene IDs into HGNC symbols (or other human namespaces).\n", "\n", "This notebook does **not** decide the ‘correct’ ortholog for you in complex families — it shows how to\n", "surface the candidates and (optionally) score them with sequence-based heuristics.\n" ] }, { "cell_type": "markdown", "id": "7d88b0c2", "metadata": {}, "source": [ "### 6.1.1 Pre-requisites\n", "\n", "### 6.1.1.1 IDTrack graphs\n", "You should have built graphs for the organisms you use:\n", "- `mus_musculus` and/or `sus_scrofa`\n", "- `homo_sapiens`\n", "\n", "### 6.1.1.2 Optional dependencies for ortholog utilities\n", "Ortholog utilities require optional packages. Install one of:\n", "- `pip install idtrack[ortholog]`\n", "- or `pip install idtrack[all-external]`\n", "\n", "If you don’t install these extras, the ortholog steps will raise a helpful error.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0dbad394", "metadata": {}, "outputs": [], "source": [ "# Check optional dependency status (ortholog utilities require gget + biopython)\n", "from __future__ import annotations\n", "\n", "from idtrack import _external_mappers\n", "\n", "dep_status = _external_mappers.check_optional_dependencies(warn=True)\n", "ORTHOLOG_OK = dep_status.get('gget', False) and dep_status.get('biopython', False)\n", "\n", "print('Optional dependency status:', dep_status)\n", "print('Ortholog utilities available:', ORTHOLOG_OK)\n" ] }, { "cell_type": "markdown", "id": "47417495", "metadata": {}, "source": [ "### 6.1.2 Step-by-step: mouse → human\n", "\n", "### 6.1.2.1 Convert a mouse identifier to a mouse Ensembl gene ID (IDTrack)\n", "\n", "Start with whatever you have (often an MGI symbol). 
Convert to a **base Ensembl gene ID** in mouse.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8ba01c6b", "metadata": {}, "outputs": [], "source": [ "import os\n", "from pathlib import Path\n", "import idtrack\n", "\n", "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n", "LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)\n", "\n", "api_mouse = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n", "api_mouse.configure_logger()\n", "\n", "mouse_name, mouse_latest = api_mouse.resolve_organism('mouse')\n", "api_mouse.build_graph(organism_name=mouse_name, snapshot_release=mouse_latest)\n", "\n", "mouse_query = 'Trp53' # example MGI symbol; replace with your gene\n", "mouse_to_ensembl = api_mouse.convert_identifier(\n", "    mouse_query,\n", "    to_release=mouse_latest,\n", "    final_database='base_ensembl_gene',\n", ")\n", "mouse_to_ensembl\n" ] }, { "cell_type": "markdown", "id": "dd07348f", "metadata": {}, "source": [ "Take one of the returned `target_id` entries as your mouse Ensembl gene ID.\n", "If you get multiple candidates, you are in a 1→n case: decide explicitly how to handle it (drop, keep all, or choose one) and record the rule you used.\n" ] }, { "cell_type": "markdown", "id": "1b1842d9", "metadata": {}, "source": [ "### 6.1.2.2 Find human ortholog(s) (ortholog utilities)\n", "\n", "We use Bgee orthologs via `gget` (optional dependency).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c1155eab", "metadata": {}, "outputs": [], "source": [ "human_ensembl_gene_ids = []\n", "\n", "if not mouse_to_ensembl.get('target_id'):\n", " print('No mouse Ensembl gene ID found; check your input ID and mouse graph/YAML.')\n", "elif not ORTHOLOG_OK:\n", " print('Ortholog utilities are not available (install extras: `pip install idtrack[ortholog]`).')\n", "else:\n", " from idtrack._external_mappers import get_ortholog_table, get_ortholog_ids_for_species\n", "\n", " mouse_ensembl_gene_id = mouse_to_ensembl['target_id'][0] # takes the first candidate; record this choice if there were several\n", " ortholog_df = 
get_ortholog_table(mouse_ensembl_gene_id, verbose=True)\n", "\n", " human_ensembl_gene_ids = sorted(get_ortholog_ids_for_species(ortholog_df, target_species='human'))\n", "\n", "human_ensembl_gene_ids\n" ] }, { "cell_type": "markdown", "id": "9f4d2952", "metadata": {}, "source": [ "### 6.1.2.3 Convert human Ensembl IDs into HGNC symbols (IDTrack)\n", "\n", "Now we switch to the human graph and convert the human Ensembl IDs into HGNC symbols (optional but common).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8939e41c", "metadata": {}, "outputs": [], "source": [ "human_results = None  # stays None if the conversion step is skipped\n", "\n", "if not human_ensembl_gene_ids:\n", "    print('No human ortholog IDs available; skipping human conversion step.')\n", "else:\n", "    api_human = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n", "    api_human.configure_logger()\n", "\n", "    human_name, human_latest = api_human.resolve_organism('human')\n", "    api_human.build_graph(organism_name=human_name, snapshot_release=human_latest)\n", "\n", "    # Convert all candidate orthologs (a 1→n mapping shows up here as several candidates)\n", "    human_results = api_human.convert_identifier_multiple(\n", "        list(human_ensembl_gene_ids),\n", "        to_release=human_latest,\n", "        final_database='HGNC Symbol',\n", "    )\n", "\n", "# Display at top level: Jupyter does not echo an expression nested inside an if/else block\n", "human_results\n" ] }, { "cell_type": "markdown", "id": "6dd04685", "metadata": {}, "source": [ "### 6.1.3 Step-by-step: pig → human (same pattern)\n", "\n", "Replace `api_mouse` with a pig API instance and start from your pig identifiers.\n", "Many pig pipelines start from Ensembl IDs or Entrez IDs; pass whichever you have as the query and keep `final_database='base_ensembl_gene'` so the ortholog step receives a pig Ensembl gene ID.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1b639fa5", "metadata": {}, "outputs": [], "source": [ "api_pig = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n", "api_pig.configure_logger()\n", "\n", "pig_name, pig_latest = api_pig.resolve_organism('pig')\n", "api_pig.build_graph(organism_name=pig_name, snapshot_release=pig_latest)\n", "\n", "pig_query = 'TP53' # example; replace with your pig gene 
symbol or ID\n", "pig_to_ensembl = api_pig.convert_identifier(\n", " pig_query,\n", " to_release=pig_latest,\n", " final_database='base_ensembl_gene',\n", ")\n", "pig_to_ensembl\n" ] }, { "cell_type": "markdown", "id": "6f5fe9ab", "metadata": {}, "source": [ "From here, reuse the same ortholog + human conversion steps as in the mouse section.\n" ] }, { "cell_type": "markdown", "id": "8a3b39de", "metadata": {}, "source": [ "### 6.1.4 Advanced: choose among multiple ortholog candidates\n", "\n", "When you have multiple orthologs, IDTrack can optionally compute additional features for ranking.\n", "This is for advanced use and requires extra dependencies.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0facb2bb", "metadata": {}, "outputs": [], "source": [ "# Example (advanced): compute sequence-based alignment features for each ortholog candidate\n", "# from idtrack._external_mappers import align_ortholog_pair_with_features\n", "#\n", "# features = align_ortholog_pair_with_features(mouse_ensembl_gene_id, target_species='human', verbose=True)\n", "# features\n" ] }, { "cell_type": "markdown", "id": "91aa07b2", "metadata": {}, "source": [ "### 6.1.5 Practical cautions (please read)\n", "\n", "1. **Orthology is context-dependent**: paralogs, gene family expansions, and annotation differences matter.\n", "2. **Do not silently pick one** in many-to-many cases without recording the rule you used.\n", "3. **Record provenance**: snapshot releases + YAML configs + ortholog source/version.\n" ] }, { "cell_type": "markdown", "id": "5844b4e5", "metadata": {}, "source": [ "## 6.2 — Comparative Analysis Preparation\n", "\n", "Once you have humanized identifiers, you can prepare a comparative workflow that is **reproducible** and **auditable**.\n", "\n", "Recommended preparation steps:\n", "\n", "1. **Within each species:** harmonize identifiers into stable Ensembl gene IDs at a fixed snapshot boundary.\n", "2. 
**Across species:** map to human orthologs, but **keep ambiguity visible** (store 1→n mappings as lists).\n", "3. **Record provenance:** snapshot boundaries, assemblies, and the orthology source/method.\n", "4. **Define a policy for ambiguous cases:** drop / keep-all / choose-best (and justify it).\n", "\n", "A practical output format is a tidy table with one row per input gene and explicit columns for provenance.\n", "The next cell shows a minimal schema you can reuse.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "66fe47ab", "metadata": {}, "outputs": [], "source": [ "# Example: a tidy, audit-friendly mapping table schema\n", "\n", "import pandas as pd\n", "\n", "mapping_table = pd.DataFrame(\n", " [\n", " {\n", " 'source_species': 'mus_musculus',\n", " 'source_namespace': 'MGI Symbol',\n", " 'source_id': 'Trp53',\n", " 'human_ensembl_gene_id': 'ENSG00000141510',\n", " 'human_hgnc_symbol': 'TP53',\n", " 'orthology_candidates': ['ENSG00000141510'],\n", " 'snapshot_release_human': None,\n", " 'snapshot_release_source': None,\n", " 'notes': 'Example row; fill with your real results.'\n", " },\n", " {\n", " 'source_species': 'sus_scrofa',\n", " 'source_namespace': 'Ensembl Gene ID',\n", " 'source_id': 'ENSSSCG00000000001',\n", " 'human_ensembl_gene_id': None,\n", " 'human_hgnc_symbol': None,\n", " 'orthology_candidates': [],\n", " 'snapshot_release_human': None,\n", " 'snapshot_release_source': None,\n", " 'notes': 'Example 1→0 / not found; keep these rows for reporting.'\n", " },\n", " ]\n", ")\n", "\n", "mapping_table\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }