{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9e76fd7d",
   "metadata": {},
   "source": [
    "# Part 6 — Cross-Species Workflows: Humanization\n",
    "\n",
    "*Last updated:* 2026-01-08\n",
    "\n",
    "This tutorial shows a practical **humanization** workflow: mapping mouse/pig genes into a human gene space so you can run\n",
    "human-centric downstream analyses (pathways, marker lists, integration, annotation).\n",
    "\n",
    "**Learning objectives**\n",
    "- Understand what humanization is (and when it is appropriate).\n",
    "- Run a step-by-step mouse → human and pig → human mapping.\n",
    "- Validate results and handle 1→n orthology ambiguity explicitly.\n",
    "- Prepare outputs in a tidy, analysis-friendly format for comparative workflows.\n",
    "\n",
    "> **Warning:** Orthology is not always one-to-one. This notebook focuses on making ambiguity visible and manageable.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77955920",
   "metadata": {},
   "source": [
    "## 6.1 — Humanization workflow (what it is, and when to use it)\n",
    "\n",
    "This notebook shows a **reproducible, step-by-step pipeline**:\n",
    "\n",
    "1. **Within-species cleanup (IDTrack)**\n",
    "   - Map mouse/pig identifiers to consistent Ensembl gene IDs within that species.\n",
    "\n",
    "2. **Cross-species mapping (orthologs)**\n",
    "   - Map mouse/pig Ensembl gene IDs to human ortholog Ensembl gene IDs.\n",
    "\n",
    "3. **Human naming (optional, IDTrack)**\n",
    "   - Convert human Ensembl gene IDs into HGNC symbols (or other human namespaces).\n",
    "\n",
    "This notebook does **not** decide the ‘correct’ ortholog for you in complex families — it shows how to\n",
    "surface the candidates and (optionally) score them with sequence-based heuristics.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d88b0c2",
   "metadata": {},
   "source": [
    "### 6.1.1 Pre-requisites\n",
    "\n",
    "#### 6.1.1.1 IDTrack graphs\n",
    "You should have built graphs for the organisms you use:\n",
    "- `mus_musculus` and/or `sus_scrofa`\n",
    "- `homo_sapiens`\n",
    "\n",
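    "#### 6.1.1.2 Optional dependencies for ortholog utilities\n",
    "The ortholog utilities depend on optional packages (`gget` and `biopython`). Install one of the extras\n",
    "(the quotes keep shells such as zsh from interpreting the square brackets):\n",
    "- `pip install 'idtrack[ortholog]'`\n",
    "- or `pip install 'idtrack[all-external]'`\n",
    "\n",
    "Without these extras the ortholog utilities are unavailable; the cells below check for them and print an install hint instead of running the ortholog steps.\n"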
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0dbad394",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check optional dependency status (ortholog utilities require gget + biopython)\n",
    "from __future__ import annotations\n",
    "\n",
    "from idtrack import _external_mappers\n",
    "\n",
    "dep_status = _external_mappers.check_optional_dependencies(warn=True)\n",
    "ORTHOLOG_OK = bool(dep_status.get('gget') and dep_status.get('biopython'))  # always a plain True/False\n",
    "\n",
    "print('Optional dependency status:', dep_status)\n",
    "print('Ortholog utilities available:', ORTHOLOG_OK)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47417495",
   "metadata": {},
   "source": [
    "### 6.1.2 Step-by-step: mouse → human\n",
    "\n",
    "#### 6.1.2.1 Convert a mouse identifier to a mouse Ensembl gene ID (IDTrack)\n",
    "\n",
    "Start with whatever you have (often an MGI symbol). Convert to a **base Ensembl gene ID** in mouse.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ba01c6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "import idtrack\n",
    "\n",
    "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n",
    "LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "api_mouse = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n",
    "api_mouse.configure_logger()\n",
    "\n",
    "mouse_name, mouse_latest = api_mouse.resolve_organism('mouse')\n",
    "api_mouse.build_graph(organism_name=mouse_name, snapshot_release=mouse_latest)\n",
    "\n",
    "mouse_query = 'Trp53'  # example MGI symbol; replace with your gene\n",
    "mouse_to_ensembl = api_mouse.convert_identifier(\n",
    "    mouse_query,\n",
    "    to_release=mouse_latest,\n",
    "    final_database='base_ensembl_gene',\n",
    ")\n",
    "mouse_to_ensembl\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd07348f",
   "metadata": {},
   "source": [
    "Take one of the returned `target_id` entries as your mouse Ensembl gene ID.\n",
    "If more than one candidate is returned, you are in a 1→n case: choose a rule for handling it (drop, keep all, or pick one) and record that rule alongside the result.\n"
   ]
  },
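  {
   "cell_type": "markdown",
   "id": "b3f0a1c4",
   "metadata": {},
   "source": [
    "One way to keep that decision explicit is a small helper that refuses to choose silently.\n",
    "The sketch below is not part of IDTrack: `select_target_id` and its `policy` argument are hypothetical names,\n",
    "and it assumes only that a conversion result behaves like a dict with a `target_id` list, as in the cell above.\n",
    "The identifiers are placeholders, not real Ensembl IDs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e52d9a77",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper (not an IDTrack function): pick a target ID under an explicit policy.\n",
    "def select_target_id(result, policy='error'):\n",
    "    # Assumes `result` is dict-like with an optional 'target_id' list.\n",
    "    candidates = list(result.get('target_id') or [])\n",
    "    if not candidates:\n",
    "        return None, candidates  # 1→0: nothing to choose\n",
    "    if len(candidates) == 1 or policy == 'first':\n",
    "        return candidates[0], candidates\n",
    "    if policy == 'error':\n",
    "        raise ValueError(f'Ambiguous 1→n mapping, candidates: {candidates}')\n",
    "    raise ValueError(f'Unknown policy: {policy!r}')\n",
    "\n",
    "example_result = {'target_id': ['MOUSE_GENE_A', 'MOUSE_GENE_B']}  # placeholder stand-in\n",
    "chosen_id, all_candidates = select_target_id(example_result, policy='first')\n",
    "print('chosen:', chosen_id, '| candidates kept for the record:', all_candidates)\n"
   ]
  },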
  {
   "cell_type": "markdown",
   "id": "1b1842d9",
   "metadata": {},
   "source": [
    "#### 6.1.2.2 Find human ortholog(s) (ortholog utilities)\n",
    "\n",
    "We use Bgee orthologs via `gget` (optional dependency).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1155eab",
   "metadata": {},
   "outputs": [],
   "source": [
    "human_ensembl_gene_ids = []\n",
    "\n",
    "if not mouse_to_ensembl.get('target_id'):\n",
    "    print('No mouse Ensembl gene ID found; check your input ID and mouse graph/YAML.')\n",
    "elif not ORTHOLOG_OK:\n",
    "    print('Ortholog utilities are not available (install extras: `pip install idtrack[ortholog]`).')\n",
    "else:\n",
    "    from idtrack._external_mappers import get_ortholog_table, get_ortholog_ids_for_species\n",
    "\n",
    "    mouse_ensembl_gene_id = mouse_to_ensembl['target_id'][0]  # first candidate; record this rule if several were returned\n",
    "    ortholog_df = get_ortholog_table(mouse_ensembl_gene_id, verbose=True)\n",
    "\n",
    "    human_ensembl_gene_ids = sorted(get_ortholog_ids_for_species(ortholog_df, target_species='human'))\n",
    "\n",
    "human_ensembl_gene_ids\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f4d2952",
   "metadata": {},
   "source": [
    "#### 6.1.2.3 Convert human Ensembl IDs into HGNC symbols (IDTrack)\n",
    "\n",
    "Now we switch to the human graph and convert the human Ensembl IDs into HGNC symbols (optional but common).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8939e41c",
   "metadata": {},
   "outputs": [],
   "source": [
    "human_results = None  # stays None when the conversion is skipped\n",
    "\n",
    "if not human_ensembl_gene_ids:\n",
    "    print('No human ortholog IDs available; skipping human conversion step.')\n",
    "else:\n",
    "    api_human = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n",
    "    api_human.configure_logger()\n",
    "\n",
    "    human_name, human_latest = api_human.resolve_organism('human')\n",
    "    api_human.build_graph(organism_name=human_name, snapshot_release=human_latest)\n",
    "\n",
    "    # Convert all candidate orthologs (if many-to-many, you will see it here)\n",
    "    human_results = api_human.convert_identifier_multiple(\n",
    "        list(human_ensembl_gene_ids),\n",
    "        to_release=human_latest,\n",
    "        final_database='HGNC Symbol',\n",
    "    )\n",
    "\n",
    "# Display at top level: a bare expression inside the if/else block is not rendered by the notebook.\n",
    "human_results\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6dd04685",
   "metadata": {},
   "source": [
    "### 6.1.3 Step-by-step: pig → human (same pattern)\n",
    "\n",
    "Replace `api_mouse` with a pig API instance and start from your pig identifiers.\n",
    "Many pig pipelines start from Ensembl IDs or Entrez IDs; adjust `final_database` accordingly.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b639fa5",
   "metadata": {},
   "outputs": [],
   "source": [
    "api_pig = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n",
    "api_pig.configure_logger()\n",
    "\n",
    "pig_name, pig_latest = api_pig.resolve_organism('pig')\n",
    "api_pig.build_graph(organism_name=pig_name, snapshot_release=pig_latest)\n",
    "\n",
    "pig_query = 'TP53'  # example; replace with your pig gene symbol or ID\n",
    "pig_to_ensembl = api_pig.convert_identifier(\n",
    "    pig_query,\n",
    "    to_release=pig_latest,\n",
    "    final_database='base_ensembl_gene',\n",
    ")\n",
    "pig_to_ensembl\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f5fe9ab",
   "metadata": {},
   "source": [
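    "From here, reuse the ortholog lookup (6.1.2.2) and the human conversion (6.1.2.3) from the mouse section,\n",
    "passing a `target_id` from `pig_to_ensembl` in place of the mouse Ensembl gene ID.\n"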
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a3b39de",
   "metadata": {},
   "source": [
    "### 6.1.4 Advanced: choose among multiple ortholog candidates\n",
    "\n",
    "When you have multiple orthologs, IDTrack can optionally compute additional features for ranking.\n",
    "This is for advanced use and requires extra dependencies.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0facb2bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example (advanced): compute sequence-based alignment features for each ortholog candidate\n",
    "# from idtrack._external_mappers import align_ortholog_pair_with_features\n",
    "#\n",
    "# features = align_ortholog_pair_with_features(mouse_ensembl_gene_id, target_species='human', verbose=True)\n",
    "# features\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91aa07b2",
   "metadata": {},
   "source": [
    "### 6.1.5 Practical cautions (please read)\n",
    "\n",
    "1. **Orthology is context-dependent**: paralogs, gene family expansions, and annotation differences matter.\n",
    "2. **Do not silently pick one** in many-to-many cases without recording the rule you used.\n",
    "3. **Record provenance**: snapshot releases + YAML configs + ortholog source/version.\n"
   ]
  },
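  {
   "cell_type": "markdown",
   "id": "c7a9d2e1",
   "metadata": {},
   "source": [
    "The provenance point can be captured as a small record saved next to your results.\n",
    "This is a minimal sketch using only the standard library; the field names and the `package_version` helper are illustrative assumptions, not an IDTrack schema.\n",
    "Fill the `None` placeholders with the values from your own run (for example `mouse_latest` and `human_latest`).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f18b6c30",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from datetime import datetime, timezone\n",
    "from importlib import metadata\n",
    "\n",
    "def package_version(name):\n",
    "    # Return the installed version string, or None if the package is absent.\n",
    "    try:\n",
    "        return metadata.version(name)\n",
    "    except metadata.PackageNotFoundError:\n",
    "        return None\n",
    "\n",
    "# Illustrative field names; adapt them to your project.\n",
    "provenance = {\n",
    "    'created_utc': datetime.now(timezone.utc).isoformat(),\n",
    "    'snapshot_release_source': None,  # e.g. mouse_latest or pig_latest\n",
    "    'snapshot_release_human': None,   # e.g. human_latest\n",
    "    'ortholog_source': 'Bgee via gget',\n",
    "    'ambiguity_rule': None,           # e.g. 'first candidate', 'drop', 'keep-all'\n",
    "    'package_versions': {name: package_version(name) for name in ('idtrack', 'gget', 'biopython')},\n",
    "}\n",
    "\n",
    "print(json.dumps(provenance, indent=2))\n"
   ]
  },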
  {
   "cell_type": "markdown",
   "id": "5844b4e5",
   "metadata": {},
   "source": [
    "## 6.2 — Comparative Analysis Preparation\n",
    "\n",
    "Once you have humanized identifiers, you can prepare a comparative workflow that is **reproducible** and **auditable**.\n",
    "\n",
    "Recommended preparation steps:\n",
    "\n",
    "1. **Within each species:** harmonize identifiers into stable Ensembl gene IDs at a fixed snapshot boundary.\n",
    "2. **Across species:** map to human orthologs, but **keep ambiguity visible** (store 1→n mappings as lists).\n",
    "3. **Record provenance:** snapshot boundaries, assemblies, and the orthology source/method.\n",
    "4. **Define a policy for ambiguous cases:** drop / keep-all / choose-best (and justify it).\n",
    "\n",
    "A practical output format is a tidy table with one row per input gene and explicit columns for provenance.\n",
    "The next cell shows a minimal schema you can reuse.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66fe47ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example: a tidy, audit-friendly mapping table schema\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "mapping_table = pd.DataFrame(\n",
    "    [\n",
    "        {\n",
    "            'source_species': 'mus_musculus',\n",
    "            'source_namespace': 'MGI Symbol',\n",
    "            'source_id': 'Trp53',\n",
    "            'human_ensembl_gene_id': 'ENSG00000141510',\n",
    "            'human_hgnc_symbol': 'TP53',\n",
    "            'orthology_candidates': ['ENSG00000141510'],\n",
    "            'snapshot_release_human': None,\n",
    "            'snapshot_release_source': None,\n",
    "            'notes': 'Example row; fill with your real results.'\n",
    "        },\n",
    "        {\n",
    "            'source_species': 'sus_scrofa',\n",
    "            'source_namespace': 'Ensembl Gene ID',\n",
    "            'source_id': 'ENSSSCG00000000001',\n",
    "            'human_ensembl_gene_id': None,\n",
    "            'human_hgnc_symbol': None,\n",
    "            'orthology_candidates': [],\n",
    "            'snapshot_release_human': None,\n",
    "            'snapshot_release_source': None,\n",
    "            'notes': 'Example 1→0 / not found; keep these rows for reporting.'\n",
    "        },\n",
    "    ]\n",
    ")\n",
    "\n",
    "mapping_table\n"
   ]
  }
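  ,
  {
   "cell_type": "markdown",
   "id": "d0e4b8f6",
   "metadata": {},
   "source": [
    "The ambiguity policy from step 4 can be applied to such a table in a few lines of pandas.\n",
    "The sketch below assumes candidates are stored as Python lists in an `orthology_candidates` column; `apply_ambiguity_policy` and the demo IDs are hypothetical.\n",
    "It covers only `drop` and `keep-all`, since a choose-best rule needs a scoring method of your own, and it keeps 1→0 rows in both cases so they remain available for reporting.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a96f3d5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "def apply_ambiguity_policy(table, policy):\n",
    "    # Hypothetical helper: handle 1→n rows and record which policy was used.\n",
    "    n_candidates = table['orthology_candidates'].apply(len)\n",
    "    if policy == 'drop':\n",
    "        out = table[n_candidates <= 1].copy()  # remove only the ambiguous rows\n",
    "    elif policy == 'keep-all':\n",
    "        # One row per candidate; empty lists become a single row with a missing value.\n",
    "        out = table.explode('orthology_candidates').reset_index(drop=True)\n",
    "    else:\n",
    "        raise ValueError(f'Unknown policy: {policy!r}')\n",
    "    out['ambiguity_policy'] = policy\n",
    "    return out\n",
    "\n",
    "# Placeholder identifiers for illustration only.\n",
    "demo_table = pd.DataFrame({\n",
    "    'source_id': ['gene_a', 'gene_b', 'gene_c'],\n",
    "    'orthology_candidates': [['HUMAN_1'], ['HUMAN_2', 'HUMAN_3'], []],\n",
    "})\n",
    "\n",
    "dropped = apply_ambiguity_policy(demo_table, 'drop')\n",
    "kept_all = apply_ambiguity_policy(demo_table, 'keep-all')\n",
    "kept_all\n"
   ]
  }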
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}