Part 6 — Cross-Species Workflows: Humanization

Last updated: 2026-01-08

This tutorial shows a practical humanization workflow: mapping mouse/pig genes into a human gene space so you can run human-centric downstream analyses (pathways, marker lists, integration, annotation).

Learning objectives

Understand what humanization is (and when it is appropriate).
Run a step-by-step mouse → human and pig → human mapping.
Validate results and handle 1→n orthology ambiguity explicitly.
Prepare outputs in a tidy, analysis-friendly format for comparative workflows.

Warning: Orthology is not always one-to-one. This notebook focuses on making ambiguity visible and manageable.

6.1 — Humanization workflow (what it is, and when to use it)

This notebook shows a reproducible, step-by-step pipeline:

1. Within-species cleanup (IDTrack)
   - Map mouse/pig identifiers to consistent Ensembl gene IDs within that species.
2. Cross-species mapping (orthologs)
   - Map mouse/pig Ensembl gene IDs to human ortholog Ensembl gene IDs.
3. Human naming (optional, IDTrack)
   - Convert human Ensembl gene IDs into HGNC symbols (or other human namespaces).

This notebook does not decide the ‘correct’ ortholog for you in complex families — it shows how to surface the candidates and (optionally) score them with sequence-based heuristics.

6.1.1 Pre-requisites

6.1.1.1 IDTrack graphs

You should have built graphs for the organisms you use:

mus_musculus and/or sus_scrofa
homo_sapiens

6.1.1.2 Optional dependencies for ortholog utilities

Ortholog utilities require optional packages. Install one of:

pip install idtrack[ortholog]
or pip install idtrack[all-external]

If you don’t install these extras, the ortholog steps will raise a helpful error.

# Check optional dependency status (ortholog utilities require gget + biopython)
from __future__ import annotations

from idtrack import _external_mappers

dep_status = _external_mappers.check_optional_dependencies(warn=True)
ORTHOLOG_OK = dep_status.get('gget', False) and dep_status.get('biopython', False)

print('Optional dependency status:', dep_status)
print('Ortholog utilities available:', ORTHOLOG_OK)

Optional dependency status: {'gget': True, 'mygene': True, 'pybiomart': True, 'gprofiler-official': True, 'biopython': True}
Ortholog utilities available: True

6.1.2 Step-by-step: mouse → human

6.1.2.1 Convert a mouse identifier to a mouse Ensembl gene ID (IDTrack)

Start with whatever you have (often an MGI symbol). Convert to a base Ensembl gene ID in mouse.

import os
from pathlib import Path
import idtrack

LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)

api_mouse = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
api_mouse.configure_logger()

mouse_name, mouse_latest = api_mouse.resolve_organism('mouse')
api_mouse.build_graph(organism_name=mouse_name, snapshot_release=mouse_latest)

mouse_query = 'Trp53'  # example MGI symbol; replace with your gene
mouse_to_ensembl = api_mouse.convert_identifier(
    mouse_query,
    to_release=mouse_latest,
    final_database='base_ensembl_gene',
)
mouse_to_ensembl

2026-01-17 14:57:40 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.
2026-01-17 14:58:05 INFO:database_manager: Using assembly-specific release range for mus_musculus assembly 39: releases 103-115 (from config [103, None])
2026-01-17 14:58:55 INFO:graph_maker: The graph is being read: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/graph_mus_musculus_min48_max115_narrow.pickle
2026-01-17 15:01:06 INFO:the_graph: Cached properties being calculated: available_genome_assemblies
2026-01-17 15:01:06 INFO:the_graph: Cached properties being calculated: combined_edges
2026-01-17 15:02:36 INFO:the_graph: Cached properties being calculated: combined_edges_genes
2026-01-17 15:04:36 INFO:the_graph: Cached properties being calculated: combined_edges_assembly_specific_genes
2026-01-17 15:04:44 INFO:the_graph: Cached properties being calculated: lower_chars_graph
2026-01-17 15:04:47 INFO:the_graph: Cached properties being calculated: get_active_ranges_of_id
2026-01-17 15:06:36 INFO:the_graph: Cached properties being calculated: available_external_databases
2026-01-17 15:06:41 INFO:the_graph: Cached properties being calculated: available_external_databases_assembly
2026-01-17 15:06:46 INFO:the_graph: Cached properties being calculated: node_trios
2026-01-17 15:08:37 INFO:the_graph: Cached properties being calculated: hyperconnective_nodes

{'target_id': ['ENSMUSG00000059552'],
 'last_node': [('ENSMUSG00000059552.14', 'ENSMUSG00000059552')],
 'final_database': 'base_ensembl_gene',
 'graph_id': 'TRP53',
 'query_id': 'Trp53',
 'no_corresponding': False,
 'no_conversion': False,
 'no_target': False}

Take the returned target_id entry as your mouse Ensembl gene ID. If target_id contains more than one entry, you are in a 1→n case: keep all candidates and apply an explicit policy (drop, keep all, or choose best; see 6.2) instead of silently taking the first one.
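The result dictionary shown above can also be checked programmatically before moving on. The following is a minimal sketch, assuming the result has a list-valued target_id key as in the printed output; the helper name classify_conversion is hypothetical and not part of IDTrack.

```python
def classify_conversion(result: dict) -> str:
    """Label a conversion result as '1→0', '1→1', or '1→n'.

    Assumes the dictionary layout printed above, where 'target_id'
    holds a list of target identifiers (possibly empty or missing).
    """
    targets = result.get('target_id') or []
    n_unique = len(set(targets))
    if n_unique == 0:
        return '1→0'
    if n_unique == 1:
        return '1→1'
    return '1→n'


# The mouse result from the cell above, reduced to the keys used here.
example = {
    'target_id': ['ENSMUSG00000059552'],
    'query_id': 'Trp53',
}
label = classify_conversion(example)
print(example['query_id'], label)
```

Storing this label next to each input gene makes it easy to count how many genes fall into each category before you choose an ambiguity policy.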

6.1.2.2 Find human ortholog(s) (ortholog utilities)

We use Bgee orthologs via gget (optional dependency).

human_ensembl_gene_ids = []

if not mouse_to_ensembl.get('target_id'):
    print('No mouse Ensembl gene ID found; check your input ID and mouse graph/YAML.')
elif not ORTHOLOG_OK:
    print('Ortholog utilities are not available (install extras: `pip install idtrack[ortholog]`).')
else:
    from idtrack._external_mappers import get_ortholog_table, get_ortholog_ids_for_species

    mouse_ensembl_gene_id = mouse_to_ensembl['target_id'][0]  # takes the first candidate; if there are several, record the rule you used
    ortholog_df = get_ortholog_table(mouse_ensembl_gene_id, verbose=True)

    human_ensembl_gene_ids = sorted(get_ortholog_ids_for_species(ortholog_df, target_species='human'))

human_ensembl_gene_ids

15:10:53 - INFO - Getting species ID for gene ENSMUSG00000059552 from Bgee
2026-01-17 15:10:53 INFO:gget.utils: Getting species ID for gene ENSMUSG00000059552 from Bgee
15:10:53 - INFO - Getting orthologs for gene ENSMUSG00000059552 from Bgee
2026-01-17 15:10:53 INFO:gget.utils: Getting orthologs for gene ENSMUSG00000059552 from Bgee

['ENSG00000141510']

6.1.2.3 Convert human Ensembl IDs into HGNC symbols (IDTrack)

Now we switch to the human graph and convert the human Ensembl IDs into HGNC symbols (optional but common).

human_results = None

if not human_ensembl_gene_ids:
    print('No human ortholog IDs available; skipping human conversion step.')
else:
    api_human = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
    api_human.configure_logger()

    human_name, human_latest = api_human.resolve_organism('human')
    api_human.build_graph(organism_name=human_name, snapshot_release=human_latest)

    # Convert all candidate orthologs (if many-to-many, you will see it here)
    human_results = api_human.convert_identifier_multiple(
        list(human_ensembl_gene_ids),
        to_release=human_latest,
        final_database='HGNC Symbol',
    )

# A bare expression inside if/else is not displayed by the notebook, so show it at top level
human_results

2026-01-17 15:10:54 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.
2026-01-17 15:11:18 INFO:database_manager: Using assembly-specific release range for homo_sapiens assembly 38: releases 76-115 (from config [76, None])
2026-01-17 15:12:13 INFO:graph_maker: The graph is being read: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/graph_homo_sapiens_min48_max115_narrow.pickle
2026-01-17 15:16:21 INFO:the_graph: Cached properties being calculated: available_genome_assemblies
2026-01-17 15:16:21 INFO:the_graph: Cached properties being calculated: combined_edges
2026-01-17 15:17:43 INFO:the_graph: Cached properties being calculated: combined_edges_genes
2026-01-17 15:20:47 INFO:the_graph: Cached properties being calculated: combined_edges_assembly_specific_genes
2026-01-17 15:20:52 INFO:the_graph: Cached properties being calculated: lower_chars_graph
2026-01-17 15:20:55 INFO:the_graph: Cached properties being calculated: get_active_ranges_of_id
2026-01-17 15:22:21 INFO:the_graph: Cached properties being calculated: available_external_databases
2026-01-17 15:22:26 INFO:the_graph: Cached properties being calculated: available_external_databases_assembly
2026-01-17 15:22:30 INFO:the_graph: Cached properties being calculated: node_trios
2026-01-17 15:23:35 INFO:the_graph: Cached properties being calculated: hyperconnective_nodes
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 10.34it/s, ID:ENSG00000141510]

6.1.3 Step-by-step: pig → human (same pattern)

Replace api_mouse with a pig API instance and start from your pig identifiers. Many pig pipelines start from Ensembl IDs or Entrez IDs; adjust final_database accordingly.

api_pig = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
api_pig.configure_logger()

pig_name, pig_latest = api_pig.resolve_organism('sus_scrofa')
api_pig.build_graph(organism_name=pig_name, snapshot_release=pig_latest)

pig_query = 'TP53'  # example; replace with your pig gene symbol or ID
pig_to_ensembl = api_pig.convert_identifier(
    pig_query,
    to_release=pig_latest,
    final_database='base_ensembl_gene',
)
pig_to_ensembl

2026-01-17 15:26:45 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.
2026-01-17 15:27:10 INFO:database_manager: Using assembly-specific release range for sus_scrofa assembly 111: releases 90-115 (from config [90, None])
2026-01-17 15:28:02 INFO:graph_maker: The graph is being read: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/graph_sus_scrofa_min48_max115_narrow.pickle
2026-01-17 15:28:18 INFO:the_graph: Cached properties being calculated: available_genome_assemblies
2026-01-17 15:28:18 INFO:the_graph: Cached properties being calculated: combined_edges
2026-01-17 15:28:33 INFO:the_graph: Cached properties being calculated: combined_edges_genes
2026-01-17 15:31:27 INFO:the_graph: Cached properties being calculated: combined_edges_assembly_specific_genes
2026-01-17 15:31:28 INFO:the_graph: Cached properties being calculated: lower_chars_graph
2026-01-17 15:31:29 INFO:the_graph: Cached properties being calculated: get_active_ranges_of_id
2026-01-17 15:31:51 INFO:the_graph: Cached properties being calculated: available_external_databases
2026-01-17 15:31:52 INFO:the_graph: Cached properties being calculated: available_external_databases_assembly
2026-01-17 15:31:53 INFO:the_graph: Cached properties being calculated: node_trios
2026-01-17 15:32:08 INFO:the_graph: Cached properties being calculated: hyperconnective_nodes

{'target_id': ['ENSSSCG00000017950'],
 'last_node': [('ENSSSCG00000017950.5', 'ENSSSCG00000017950')],
 'final_database': 'base_ensembl_gene',
 'graph_id': 'TP53',
 'query_id': 'TP53',
 'no_corresponding': False,
 'no_conversion': False,
 'no_target': False}

From here, reuse the same ortholog + human conversion steps as in the mouse section.
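One way to reuse the steps across species is to wrap them in a function that takes the lookups as arguments. The sketch below is an illustration under stated assumptions: humanize_gene and the two toy lookup tables are hypothetical, and in a real run you would pass small wrappers around convert_identifier and the ortholog utilities instead. The toy tables contain only IDs printed in this notebook; the pig ortholog lookup was not run here, so the pig entry has no candidates.

```python
from typing import Callable, Dict, List


def humanize_gene(
    query: str,
    to_source_ensembl: Callable[[str], List[str]],
    to_human_orthologs: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Return all human ortholog candidates for each source Ensembl gene ID.

    Nothing is dropped or chosen here: every candidate stays visible.
    """
    candidates: Dict[str, List[str]] = {}
    for source_id in to_source_ensembl(query):
        candidates[source_id] = sorted(set(to_human_orthologs(source_id)))
    return candidates


# Toy stand-ins for the real lookups (hypothetical; not query results).
WITHIN_SPECIES = {
    ('mus_musculus', 'Trp53'): ['ENSMUSG00000059552'],
    ('sus_scrofa', 'TP53'): ['ENSSSCG00000017950'],
}
ORTHOLOGS = {'ENSMUSG00000059552': ['ENSG00000141510']}

results = {}
for species, query in [('mus_musculus', 'Trp53'), ('sus_scrofa', 'TP53')]:
    results[(species, query)] = humanize_gene(
        query,
        # Bind the current species as a default argument of the lambda.
        to_source_ensembl=lambda q, s=species: WITHIN_SPECIES.get((s, q), []),
        to_human_orthologs=lambda gene_id: ORTHOLOGS.get(gene_id, []),
    )

print(results)
```

Passing the lookups as callables keeps the function independent of any one species and makes it easy to test without network access.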

6.1.4 Advanced: choose among multiple ortholog candidates

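When a gene has multiple ortholog candidates, IDTrack can optionally compute sequence-based alignment features to help rank them. This is an advanced step: it needs the optional ortholog dependencies (gget and biopython) and network access to fetch the protein sequences.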

# Example (advanced): compute sequence-based alignment features for each ortholog candidate
from idtrack._external_mappers import align_ortholog_pair_with_features

features = align_ortholog_pair_with_features(mouse_ensembl_gene_id, target_species='human', verbose=True)
features

15:32:11 - INFO - Getting species ID for gene ENSMUSG00000059552 from Bgee
2026-01-17 15:32:11 INFO:gget.utils: Getting species ID for gene ENSMUSG00000059552 from Bgee
15:32:11 - INFO - Getting orthologs for gene ENSMUSG00000059552 from Bgee
2026-01-17 15:32:11 INFO:gget.utils: Getting orthologs for gene ENSMUSG00000059552 from Bgee

[INFO] Found 1 ortholog(s) for 'ENSMUSG00000059552' in target species 'human' (canonical='hsapiens', genus='Homo', species='sapiens').

15:32:13 - INFO - Requesting amino acid sequence of the canonical transcript ENSMUST00000108658 of gene ENSMUSG00000059552 from UniProt.
2026-01-17 15:32:13 INFO:gget.utils: Requesting amino acid sequence of the canonical transcript ENSMUST00000108658 of gene ENSMUSG00000059552 from UniProt.

[INFO] Aligning 'ENSMUSG00000059552' -> 'ENSG00000141510' ('human')

15:32:15 - INFO - Requesting amino acid sequence of the canonical transcript ENST00000269305 of gene ENSG00000141510 from UniProt.
2026-01-17 15:32:15 INFO:gget.utils: Requesting amino acid sequence of the canonical transcript ENST00000269305 of gene ENSG00000141510 from UniProt.
15:32:15 - INFO - MUSCLE compiled.
2026-01-17 15:32:15 INFO:gget.utils: MUSCLE compiled.
15:32:15 - INFO - MUSCLE aligning...
2026-01-17 15:32:15 INFO:gget.utils: MUSCLE aligning...

muscle 5.2.linux64 [00617b]  1056Gb RAM, 96 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 392, max 393

00:00 7.8Mb  CPU has 96 cores, defaulting to 20 threads

WARNING: Max OMP threads 2

00:00 16Mb    100.0% Calc posteriors
00:00 16Mb    100.0% UPGMA5
15:32:15 - INFO - MUSCLE alignment complete. Alignment time: 0.1 seconds
2026-01-17 15:32:15 INFO:gget.utils: MUSCLE alignment complete. Alignment time: 0.1 seconds



ENSMUSG00000059552
         MTAMEESQSDISLELPLSQETFSGLWKLLPPEDIL-PSP-HCMDDLLL-PQDVEEFFE---GPSEALRVSGAPAAQDPVT
ENSG00000141510
         ---MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAP


ENSMUSG00000059552
         ETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRV
ENSG00000141510
         AAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRV


ENSMUSG00000059552
         RAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYM
ENSG00000141510
         RAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYM


ENSMUSG00000059552
         CNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSASPP
ENSG00000141510
         CNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQ


ENSMUSG00000059552
         QKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSRAHSSYLKTKKGQSTSRHKKTMVKKVGPDSD
ENSG00000141510
         PKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
[WARN] MUSCLE failed for ENSG00000141510: 'NoneType' object has no attribute 'splitlines'

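{}

Note: in this recorded run the alignment was printed, but the step then failed with the [WARN] shown above, so the returned feature dictionary is empty ({}). Treat an empty result as "no features available", not as evidence against the ortholog.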
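To make the idea of a sequence-based heuristic concrete, here is a minimal sketch that scores already-aligned sequence pairs by percent identity and ranks candidates by that score. It is not the computation performed by align_ortholog_pair_with_features; the function names, candidate labels, and toy sequences are hypothetical. Identity is computed over columns where neither sequence has a gap, which is one of several common conventions.

```python
from typing import Dict, List, Tuple


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free alignment columns where both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError('aligned sequences must have equal length')
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == '-' or b == '-':
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def rank_candidates(alignments: Dict[str, Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Sort candidates by identity (best first), breaking ties by ID."""
    scored = [(cid, percent_identity(a, b)) for cid, (a, b) in alignments.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy aligned pairs (source sequence, candidate sequence); not real data.
toy_alignments = {
    'CANDIDATE_A': ('MEEPQ-SD', 'MEESQSSD'),
    'CANDIDATE_B': ('MEEPQ-SD', 'MAAPQ-TD'),
}
ranking = rank_candidates(toy_alignments)
for candidate_id, identity in ranking:
    print(candidate_id, round(identity, 1))
```

A single score like this is only a tie-breaker. Record it alongside the full candidate list rather than replacing the list with the winner.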

6.1.5 Practical cautions (please read)

Orthology is context-dependent: paralogs, gene family expansions, and annotation differences matter.
In one-to-many or many-to-many cases, do not pick a single ortholog without recording the rule you used.
Record provenance: snapshot releases + YAML configs + ortholog source/version.
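Provenance is easiest to keep if it is stored as structured data next to the results. The following is a minimal sketch using only the standard library; the class name MappingProvenance and its field names are assumptions chosen for illustration, and the None values are placeholders to fill in from your own run (for example, the release values returned by resolve_organism).

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class MappingProvenance:
    """Hypothetical record of how a humanized mapping was produced."""

    source_species: str
    snapshot_release_source: Optional[int]
    snapshot_release_human: Optional[int]
    ortholog_source: str
    ortholog_source_version: Optional[str]
    ambiguity_policy: str


provenance = MappingProvenance(
    source_species='mus_musculus',
    snapshot_release_source=None,   # placeholder: fill in from your run
    snapshot_release_human=None,    # placeholder: fill in from your run
    ortholog_source='Bgee via gget',
    ortholog_source_version=None,   # placeholder: not recorded in this notebook
    ambiguity_policy='keep-all',
)

# Serialize with sorted keys so the file diffs cleanly between runs.
provenance_json = json.dumps(asdict(provenance), indent=2, sort_keys=True)
print(provenance_json)
```

Writing this JSON file into the same directory as the mapping table keeps the two from drifting apart.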

6.2 — Comparative Analysis Preparation

Once you have humanized identifiers, you can prepare a comparative workflow that is reproducible and auditable.

Recommended preparation steps:

Within each species: harmonize identifiers into stable Ensembl gene IDs at a fixed snapshot boundary.
Across species: map to human orthologs, but keep ambiguity visible (store 1→n mappings as lists).
Record provenance: snapshot boundaries, assemblies, and the orthology source/method.
Define a policy for ambiguous cases: drop / keep-all / choose-best (and justify it).
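The three policies named in the last step can be written down as one small function, so the rule is explicit in code and not applied by hand. This is a sketch under stated assumptions: resolve_candidates, the placeholder gene IDs, and the toy scores are hypothetical, and the score function for choose-best is whatever ranking heuristic you decide to justify.

```python
from typing import Callable, List, Optional


def resolve_candidates(
    candidates: List[str],
    policy: str,
    score: Optional[Callable[[str], float]] = None,
) -> List[str]:
    """Apply an explicit ambiguity policy to a list of ortholog candidates."""
    unique = sorted(set(candidates))
    if policy == 'keep-all':
        return unique
    if policy == 'drop':
        # Keep only unambiguous mappings; ambiguous or empty ones are dropped.
        return unique if len(unique) == 1 else []
    if policy == 'choose-best':
        if not unique:
            return []
        if score is None:
            raise ValueError("policy 'choose-best' needs a score function")
        # max() returns the first maximum in sorted order, so ties are deterministic.
        return [max(unique, key=score)]
    raise ValueError(f'unknown policy: {policy!r}')


# Placeholder IDs and scores, for illustration only.
toy_candidates = ['HUMAN_GENE_B', 'HUMAN_GENE_A']
toy_scores = {'HUMAN_GENE_A': 0.4, 'HUMAN_GENE_B': 0.9}

decisions = {
    policy: resolve_candidates(toy_candidates, policy, score=toy_scores.get)
    for policy in ('keep-all', 'drop', 'choose-best')
}
print(decisions)
```

Whichever policy you use, store its name with the output so the choice can be audited later.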

A practical output format is a tidy table with one row per input gene and explicit columns for provenance. The next cell shows a minimal schema you can reuse.

# Example: a tidy, audit-friendly mapping table schema

import pandas as pd

mapping_table = pd.DataFrame(
    [
        {
            'source_species': 'mus_musculus',
            'source_namespace': 'MGI Symbol',
            'source_id': 'Trp53',
            'human_ensembl_gene_id': 'ENSG00000141510',
            'human_hgnc_symbol': 'TP53',
            'orthology_candidates': ['ENSG00000141510'],
            'snapshot_release_human': None,
            'snapshot_release_source': None,
            'notes': 'Example row; fill with your real results.'
        },
        {
            'source_species': 'sus_scrofa',
            'source_namespace': 'Ensembl Gene ID',
            'source_id': 'ENSSSCG00000000001',
            'human_ensembl_gene_id': None,
            'human_hgnc_symbol': None,
            'orthology_candidates': [],
            'snapshot_release_human': None,
            'snapshot_release_source': None,
            'notes': 'Example 1→0 / not found; keep these rows for reporting.'
        },
    ]
)

mapping_table

	source_species	source_namespace	source_id	human_ensembl_gene_id	human_hgnc_symbol	orthology_candidates	snapshot_release_human	snapshot_release_source	notes
0	mus_musculus	MGI Symbol	Trp53	ENSG00000141510	TP53	[ENSG00000141510]	None	None	Example row; fill with your real results.
1	sus_scrofa	Ensembl Gene ID	ENSSSCG00000000001	None	None	[]	None	None	Example 1→0 / not found; keep these rows for r...
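For some downstream steps a long layout, with one row per (input gene, candidate) pair, is more convenient than a list-valued column. The sketch below does this with plain dictionaries so it runs without extra packages; to_long_format and the output column names are hypothetical, and the third input row uses placeholder IDs to show a 1→n case.

```python
from typing import Dict, List


def to_long_format(rows: List[Dict]) -> List[Dict]:
    """Expand 'orthology_candidates' into one row per candidate.

    Genes without candidates are kept as a single row with None,
    so 1→0 cases remain visible in reports.
    """
    long_rows = []
    for row in rows:
        base = {k: v for k, v in row.items() if k != 'orthology_candidates'}
        candidates = row.get('orthology_candidates') or []
        for candidate in candidates or [None]:
            long_rows.append(
                {
                    **base,
                    'candidate_human_ensembl_gene_id': candidate,
                    'n_candidates': len(candidates),
                }
            )
    return long_rows


rows = [
    {'source_species': 'mus_musculus', 'source_id': 'Trp53',
     'orthology_candidates': ['ENSG00000141510']},
    {'source_species': 'sus_scrofa', 'source_id': 'ENSSSCG00000000001',
     'orthology_candidates': []},
    # Placeholder row illustrating a 1→n case (not real identifiers).
    {'source_species': 'mus_musculus', 'source_id': 'HYPOTHETICAL_GENE',
     'orthology_candidates': ['HUMAN_GENE_A', 'HUMAN_GENE_B']},
]

long_rows = to_long_format(rows)
for long_row in long_rows:
    print(long_row)
```

If the table is already a pandas DataFrame, mapping_table.explode('orthology_candidates') gives a similar one-row-per-candidate layout, with empty lists becoming missing values.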