Part 5 — Real-World Experiments: Harmonization
Last updated: 2026-01-08
This tutorial shows how to use IDTrack for real-world dataset harmonization.
You will learn:
how to harmonize feature identifiers across multiple .h5ad datasets (HLCA-style)
how to interpret harmonization diagnostics (what changed, what failed, what is ambiguous)
how to choose between union vs intersection feature spaces
how to approach legacy data rescue (older identifiers, mixed namespaces)
Tip: Start with the toy demo first. The exact same logic scales to large datasets.
5.0 — Why harmonization matters (plain language)
When you merge datasets, you implicitly assume that feature X in dataset A is the same biological entity as feature X in dataset B.
This breaks when:
datasets use different Ensembl releases (IDs changed)
one dataset uses HGNC symbols and the other uses Ensembl IDs
some features map 1→n (ambiguity) or 1→0 (no match)
IDTrack makes these cases explicit and gives you reproducible conversions anchored to a graph snapshot.
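To make the three outcomes concrete, here is a minimal sketch in plain Python. It does not call IDTrack: the `classify_mapping` helper and the `toy_mapping` table (including its `GENE_*` and `TARGET_*` placeholders) are hypothetical and exist only to illustrate the 1→1, 1→n, and 1→0 cases.

```python
# Hypothetical illustration (not part of the IDTrack API): label each source
# identifier by how many distinct target identifiers it maps to.

def classify_mapping(mapping: dict[str, list[str]]) -> dict[str, str]:
    """Return '1→1', '1→n', or '1→0' for every source identifier."""
    labels = {}
    for source_id, targets in mapping.items():
        n_targets = len(set(targets))  # duplicates do not count as ambiguity
        if n_targets == 0:
            labels[source_id] = '1→0'
        elif n_targets == 1:
            labels[source_id] = '1→1'
        else:
            labels[source_id] = '1→n'
    return labels


# Made-up mapping table; the identifiers are placeholders, not real genes.
toy_mapping = {
    'GENE_A': ['TARGET_1'],
    'GENE_B': ['TARGET_2', 'TARGET_3'],
    'NOT_A_REAL_GENE': [],
}
labels = classify_mapping(toy_mapping)
print(labels)
```

Only the 1→1 features can be merged without a decision; the other two categories need an explicit policy, which the later sections discuss.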
5.0.1 Pre-requisites
You can run 03_initialization_graph.ipynb for human (graph snapshot exists in your local repository).
You have anndata installed (it is an IDTrack dependency).
For the HLCA section, you need access to the HLCA .h5ad files (not bundled here).
# Load notebook utilities (collapsible output magic for tutorials)
%load_ext _notebook_utils
# 1) Setup
from __future__ import annotations
import os
from pathlib import Path
import anndata as ad
import numpy as np
import pandas as pd
from scipy import sparse
import idtrack
LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)
print('IDTrack local repository:', LOCAL_REPOSITORY)
IDTrack local repository: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache
5.1 — Toy Harmonization (start small, then scale)
We will create two small AnnData objects with overlapping genes but different identifier styles. This shows the workflow without requiring large data downloads.
# 2.1 Create toy datasets
toy_dir = LOCAL_REPOSITORY / 'toy_harmonization_demo'
toy_dir.mkdir(parents=True, exist_ok=True)
# Dataset A: HGNC symbols (common in many wet-lab exports)
genes_a = ['TP53', 'BRCA1', 'BRCA2', 'BRAF', 'KRAS']
X_a = sparse.random(50, len(genes_a), density=0.2, format='csr', random_state=0)
adata_a = ad.AnnData(X=X_a, obs=pd.DataFrame(index=[f'cellA_{i}' for i in range(X_a.shape[0])]), var=pd.DataFrame(index=genes_a))
# Dataset B: mix of HGNC + an Ensembl stable ID (realistic messy scenario)
genes_b = ['TP53', 'ENSG00000141510', 'BRCA1', 'NOT_A_REAL_GENE']
X_b = sparse.random(60, len(genes_b), density=0.2, format='csr', random_state=1)
adata_b = ad.AnnData(X=X_b, obs=pd.DataFrame(index=[f'cellB_{i}' for i in range(X_b.shape[0])]), var=pd.DataFrame(index=genes_b))
path_a = toy_dir / 'toy_A_symbols.h5ad'
path_b = toy_dir / 'toy_B_mixed.h5ad'
adata_a.write_h5ad(path_a)
adata_b.write_h5ad(path_b)
path_a, path_b
(PosixPath('/ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/toy_harmonization_demo/toy_A_symbols.h5ad'),
PosixPath('/ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/toy_harmonization_demo/toy_B_mixed.h5ad'))
5.1.1 Run the harmonizer
IDTrack provides idtrack.HarmonizeFeatures.
Key parameters (interpretation):
idtrack_local_repository: where your built graph snapshot lives
graph_last_ensembl_release: which snapshot release the graph contains (must match your build)
target_ensembl_release: the release you want to harmonize to
final_database: what you want to keep as feature IDs (HGNC, Ensembl, …)
%%collapse Click to show conversion logs
# Included for tutorial purposes only.
# 2.2 Harmonize the toy datasets
data_h5ad_dict = {
    'toy_A': str(path_a),
    'toy_B': str(path_b),
}
project_out = toy_dir / 'outputs'
project_out.mkdir(parents=True, exist_ok=True)
organism_name, latest_release = idtrack.API(str(LOCAL_REPOSITORY)).resolve_organism('human')
harmonizer = idtrack.HarmonizeFeatures(
    project_name='toy_demo',
    data_h5ad_dict=data_h5ad_dict,
    project_local_repository=str(project_out),
    idtrack_local_repository=str(LOCAL_REPOSITORY),
    target_ensembl_release=latest_release,
    final_database='HGNC Symbol',
    organism_name=organism_name,
    graph_last_ensembl_release=latest_release,
    verbose_level=1,
)
harmonizer
<idtrack._harmonize_features.HarmonizeFeatures at 0x7f326447b190>
5.1.2 Inspect what happened
Useful things to look at:
which identifiers failed conversion
which identifiers were ambiguous
per-dataset result pickle files written under project_local_repository
print('Conversion failed (any dataset):', sorted(list(harmonizer.conversion_failed_identifiers))[:20])
print('Conversion failed but consistent:', sorted(list(harmonizer.conversion_failed_but_consistent_identifiers))[:20])
print('Converted IDs with multiple Ensembl possibilities (collapsed):', list(harmonizer.multiple_ensembl_dict)[:10])
Conversion failed (any dataset): ['ENSG00000141510', 'NOT_A_REAL_GENE']
Conversion failed but consistent: []
Converted IDs with multiple Ensembl possibilities (collapsed): ['TP53', 'BRCA1', 'BRCA2', 'BRAF', 'KRAS']
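The failure list above is pooled across datasets. As a minimal sketch, the snippet below attributes each failed identifier back to the dataset it came from, using only the toy gene lists and the printed failure list. `failed_by_dataset` is a hypothetical helper written for this illustration, not an IDTrack method.

```python
# Hypothetical helper (not an IDTrack method): attribute pooled conversion
# failures back to the datasets that contain them.

def failed_by_dataset(features_by_dataset, failed_ids):
    """Map dataset name -> failed identifiers, in the dataset's own order."""
    failed = set(failed_ids)
    return {
        name: [feature for feature in features if feature in failed]
        for name, features in features_by_dataset.items()
    }


# Gene lists copied from the toy cell; failure list copied from the output above.
toy_features = {
    'toy_A': ['TP53', 'BRCA1', 'BRCA2', 'BRAF', 'KRAS'],
    'toy_B': ['TP53', 'ENSG00000141510', 'BRCA1', 'NOT_A_REAL_GENE'],
}
toy_failed = ['ENSG00000141510', 'NOT_A_REAL_GENE']

report = failed_by_dataset(toy_features, toy_failed)
for name, ids in report.items():
    print(f'{name}: {len(ids)} failed -> {ids}')
```

In a real run you would pass each dataset's `adata.var_names` and `harmonizer.conversion_failed_identifiers` instead of the hard-coded lists.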
5.1.3 Produce a unified AnnData (union or intersection)
After harmonization, you often want a single merged dataset for downstream analysis.
IDTrack’s harmonizer can merge datasets in two modes:
mode='union' (default): keep the union of all features (missing genes become zeros)
mode='intersect': keep only features present in every dataset
# Merge the toy datasets (this uses AnnData.concat under the hood)
unified_union = harmonizer.unify_multiple_anndatas(mode='union')
unified_intersect = harmonizer.unify_multiple_anndatas(mode='intersect')
print('Union shape:', unified_union.shape)
print('Intersect shape:', unified_intersect.shape)
100%|█████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.73it/s, dataset=toy_B, study_var=2, union_var=5, dbh=2]
100%|█████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 4.91it/s, dataset=toy_B, study_var=2, union_var=5, dbh=2]
Union shape: (110, 5)
Intersect shape: (110, 2)
At HLCA scale, union can create a very large feature matrix. If you plan to run memory-heavy models, consider starting with intersect or filtering genes first.
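The difference between the two modes can be sketched with plain Python sets. The feature lists below are an assumption: they use gene symbols as readable labels and assume that toy_B keeps two features after harmonization, which is consistent with `study_var=2` in the log and with the printed shapes. The count values are made up.

```python
# Illustration only: union vs intersection feature spaces with plain sets.
# Assumed post-harmonization feature lists for the toy datasets.
from functools import reduce

harmonized_features = {
    'toy_A': ['TP53', 'BRCA1', 'BRCA2', 'BRAF', 'KRAS'],
    'toy_B': ['TP53', 'BRCA1'],
}
feature_sets = [set(features) for features in harmonized_features.values()]

union_features = sorted(reduce(set.union, feature_sets))
intersect_features = sorted(reduce(set.intersection, feature_sets))

# In union mode, a dataset that lacks a feature contributes zeros for it.
toy_b_counts = {'TP53': 3.0, 'BRCA1': 1.0}  # made-up values for one cell
toy_b_row_union = [toy_b_counts.get(feature, 0.0) for feature in union_features]

print('union:', union_features)
print('intersect:', intersect_features)
print('toy_B cell in union space:', toy_b_row_union)
```

The zero-filled entries are the trade-off to keep in mind: in union mode a zero can mean "not measured" rather than "not expressed".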
5.2 — Multi-Dataset Integration (best practices)
Once you can harmonize identifiers, you can integrate datasets without accidentally mixing incompatible feature definitions.
Practical best practices:
Start by harmonizing into a stable backbone (usually Ensembl gene IDs). Convert to symbols only for reporting.
Treat 1→n mappings as real biology/data ambiguity, not as a nuisance. Decide a policy:
drop ambiguous features
keep all candidates (inflates feature space)
collapse to a single representative (requires an explicit rule)
Pilot first: run harmonization on a small subset and inspect diagnostics before scaling up.
Union vs intersection:
union keeps more features but can produce a very wide matrix.
intersect is stricter and often improves comparability, but may drop biologically important genes.
Expected result: you should be able to justify (and report) your integration choice, not just run a tool.
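The three 1→n policies listed above can be written down as an explicit, testable rule. This is a minimal sketch, not IDTrack behavior: `apply_ambiguity_policy`, the policy names, and the `SYMBOL_*` / `ID_*` placeholders are hypothetical, and `first_sorted` stands in for whatever representative-selection rule you can actually justify.

```python
# Hypothetical sketch of an explicit 1→n policy (not IDTrack's own logic).

def apply_ambiguity_policy(mapping, policy):
    """Resolve ambiguous mappings under 'drop', 'keep_all', or 'first_sorted'."""
    if policy not in {'drop', 'keep_all', 'first_sorted'}:
        raise ValueError(f'Unknown policy: {policy!r}')
    resolved = {}
    for source_id, candidates in mapping.items():
        unique = sorted(set(candidates))
        if len(unique) <= 1:
            # 1→1 and 1→0 are unaffected by the ambiguity policy.
            resolved[source_id] = unique
        elif policy == 'drop':
            resolved[source_id] = []
        elif policy == 'keep_all':
            resolved[source_id] = unique
        else:  # 'first_sorted': placeholder for a documented selection rule
            resolved[source_id] = unique[:1]
    return resolved


toy = {'SYMBOL_X': ['ID_2', 'ID_1'], 'SYMBOL_Y': ['ID_3'], 'SYMBOL_Z': []}
for policy in ('drop', 'keep_all', 'first_sorted'):
    print(policy, apply_ambiguity_policy(toy, policy))
```

Whatever rule you pick, writing it as a named policy makes it easy to report alongside the results.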
5.2.1 Scaling up to HLCA (the real use case)
HLCA-scale harmonization is the same logic, but you need to manage:
many datasets (often multiple files per study)
large gene lists (tens of thousands of identifiers)
choosing a snapshot release (what you harmonize to)
disk for intermediate artifacts (pickles/logs)
5.2.2 HLCA datasets used in this tutorial
HLCA .h5ad files are not shipped with IDTrack. In this tutorial we simply point to existing files (no downloads).
The next cell defines a curated HLCA list (study → list of .h5ad paths). If you’re running this notebook on a different system, edit base_path in the next cell to match where the HLCA data lives.
# 5.2.2 HLCA dataset paths used in this tutorial
#
# This notebook is meant to be easy to follow: edit *one* string if your HLCA data lives elsewhere.
base_path = '/lustre/groups/ml01/projects/2023_HLCA_LSikkema/HLCA_reproducibility/data'
dset0_dir = os.path.join(base_path, 'HLCA_extended/extension_datasets/ready/full')
dset1_dir = os.path.join(base_path, 'HLCA_extended/extension_datasets/raw')
# Curated HLCA AnnData list (study → list of .h5ad files)
hlca_adata_dict = {
    'Kaminski_2020': [f'{dset0_dir}/adams.h5ad'],
    'Meyer_2021': [f'{dset0_dir}/meyer_2021.h5ad'],
    'MeyerNikolic_unpubl': [f'{dset0_dir}/meyer_nikolic_unpubl.h5ad'],
    'Barbry_unpubl': [f'{dset0_dir}/barbry.h5ad'],
    'Regev_2021': [
        f'{dset0_dir}/delorey_cryo.h5ad',
        f'{dset0_dir}/delorey_fresh.h5ad',
        f'{dset0_dir}/delorey_nuclei.h5ad',
    ],
    'Thienpont_2018': [f'{dset1_dir}/Lambrechts/lambrechts.h5ad'],
    'Budinger_2020': [f'{dset0_dir}/bharat.h5ad'],
    'Banovich_Kropski_2020': [f'{dset0_dir}/haberman.h5ad'],
    'Sheppard_2020': [f'{dset0_dir}/tsukui.h5ad'],
    'Wunderink_2021': [f'{dset0_dir}/grant_cryo.h5ad', f'{dset0_dir}/grant_fresh.h5ad'],
    'Lambrechts_2021': [f'{dset0_dir}/wouters.h5ad'],
    'Zhang_2021': [f'{dset1_dir}/Liao/covid_for_publish.h5ad'],
    'Duong_lungMAP_unpubl': [f'{dset0_dir}/duong.h5ad'],
    'Janssen_2020': [f'{dset0_dir}/mould.h5ad'],
    'Sun_2020': [
        f'{dset0_dir}/wang_sub_batch1.h5ad',
        f'{dset0_dir}/wang_sub_batch2.h5ad',
        f'{dset0_dir}/wang_sub_batch3.h5ad',
        f'{dset0_dir}/wang_sub_batch4.h5ad',
    ],
    'Gomperts_2021': [
        f'{dset0_dir}/carraro_ucla.h5ad',
        f'{dset0_dir}/carraro_cff.h5ad',
        f'{dset0_dir}/carraro_csmc.h5ad',
    ],
    'Eils_2020': [f'{dset0_dir}/lukassen.h5ad'],
    'Schiller_2020': [f'{dset0_dir}/mayr.h5ad'],
    'Misharin_Budinger_2018': [f'{dset0_dir}/reyfman_disease.h5ad'],
    'Shalek_2018': [f'{dset0_dir}/ordovasmontanes.h5ad'],
    'Schiller_2021': [f'{dset0_dir}/schiller_discovair.h5ad'],
    'Peer_Massague_2020': [f'{dset0_dir}/laughney.h5ad'],
    'Lafyatis_2019': [f'{dset0_dir}/valenzi.h5ad'],
    'Tata_unpubl': [f'{dset0_dir}/tata_unpubl.h5ad'],
    'Xu_2020': [f'{dset0_dir}/guo.h5ad'],
    'Sims_2019': [f'{dset0_dir}/szabo.h5ad'],
    'Schultze_unpubl': [f'{dset0_dir}/schultze_unpubl.h5ad'],
}
print('HLCA data root:', base_path)
print('Studies in curated list:', len(hlca_adata_dict))
HLCA data root: /lustre/groups/ml01/projects/2023_HLCA_LSikkema/HLCA_reproducibility/data
Studies in curated list: 27
# Check that every curated HLCA file exists on this system
rows = []
for study, paths in hlca_adata_dict.items():
    for p in paths:
        rows.append({'study': study, 'file': os.path.basename(p), 'path': p, 'exists': os.path.exists(p)})
hlca_path_status = pd.DataFrame(rows)
missing = hlca_path_status[~hlca_path_status['exists']].copy()
print(f"HLCA files in curated list: {len(hlca_path_status)}")
print(f"Missing files: {len(missing)}")
if not missing.empty:
    # A bare expression inside an `if` block is not rendered by the notebook, so display it explicitly
    display(missing.sort_values(['study', 'file']))
HLCA files in curated list: 35
Missing files: 0
5.2.3 Tutorial subset (4 studies; easy to inspect)
The full hlca_adata_dict contains many studies (and some have multiple .h5ad files). For a tutorial notebook, we intentionally keep the example small so it is easy to read, debug, and reproduce.
We will only use these 4 single-file studies:
Kaminski_2020
Meyer_2021
MeyerNikolic_unpubl
Barbry_unpubl
Important: idtrack.HarmonizeFeatures expects dict[str, str] (dataset_alias → single .h5ad path). So we convert the curated hlca_adata_dict (study → list[path]) into a small tutorial dictionary.
HLCA_TUTORIAL_STUDIES = [
    'Kaminski_2020',
    'Meyer_2021',
    'MeyerNikolic_unpubl',
    'Barbry_unpubl',
]
rows = []
data_h5ad_dict_hlca_tutorial: dict[str, str] = {}
for study in HLCA_TUTORIAL_STUDIES:
    paths = hlca_adata_dict.get(study, [])
    if len(paths) != 1:
        # The tutorial subset is intentionally restricted to single-file studies.
        # If this triggers, update HLCA_TUTORIAL_STUDIES to pick a different study.
        rows.append((study, None, False, f'Expected 1 file, got {len(paths)}'))
        continue
    p = paths[0]
    exists = os.path.exists(p)
    rows.append((study, p, exists, ''))
    if exists:
        data_h5ad_dict_hlca_tutorial[study] = p
hlca_tutorial_status = pd.DataFrame(rows, columns=['study', 'path', 'exists', 'note']).set_index('study')
print('Tutorial datasets found:', len(data_h5ad_dict_hlca_tutorial), '/', len(HLCA_TUTORIAL_STUDIES))
if len(data_h5ad_dict_hlca_tutorial) == 0:
    print('\nNo tutorial HLCA files found. Edit `base_path` in section 5.2.2 (or update the paths) and re-run.')
hlca_tutorial_status
Tutorial datasets found: 4 / 4
| study | path | exists | note |
|---|---|---|---|
| Kaminski_2020 | /lustre/groups/ml01/projects/2023_HLCA_LSikkem... | True | |
| Meyer_2021 | /lustre/groups/ml01/projects/2023_HLCA_LSikkem... | True | |
| MeyerNikolic_unpubl | /lustre/groups/ml01/projects/2023_HLCA_LSikkem... | True | |
| Barbry_unpubl | /lustre/groups/ml01/projects/2023_HLCA_LSikkem... | True | |
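The tutorial sidesteps multi-file studies by skipping them. If you need them, one possible approach is to give every file its own alias so that the result is again alias → single path. The sketch below is an assumption about how you might prepare such a dictionary: `flatten_study_files`, the `study__file` alias scheme, and the `/data/...` paths are hypothetical, and whether per-file aliases suit your analysis is for you to decide.

```python
# Hypothetical helper: turn study -> list[path] into alias -> single path.
import os


def flatten_study_files(study_to_paths):
    """Give each file of a multi-file study its own alias (study__filestem)."""
    flat = {}
    for study, paths in study_to_paths.items():
        if len(paths) == 1:
            flat[study] = paths[0]
            continue
        for path in paths:
            stem = os.path.splitext(os.path.basename(path))[0]
            alias = f'{study}__{stem}'
            if alias in flat:
                raise ValueError(f'Duplicate alias: {alias}')
            flat[alias] = path
    return flat


# Placeholder paths for illustration only.
example = {
    'Meyer_2021': ['/data/meyer_2021.h5ad'],
    'Wunderink_2021': ['/data/grant_cryo.h5ad', '/data/grant_fresh.h5ad'],
}
flat = flatten_study_files(example)
for alias, path in flat.items():
    print(alias, '->', path)
```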
5.2.4 Run harmonization (HGNC target; tutorial subset)
This can take time (and is best run on a workstation/server). IDTrack will:
load/build the human graph snapshot (cached in your local repository)
run identifier conversion for each dataset
write per-dataset mapping results as pickles under the project output directory
This cell is wrapped in a collapsible block because conversion logs can be long.
Note on outputs: when you build a merged AnnData (section 5.2.6), .var_names are Ensembl gene IDs. Mapped HGNC symbols are stored in .var['converted_id'] for readable reporting.
Reproducibility tip: pin the release (set TARGET_RELEASE explicitly) for manuscripts and pipelines.
%%collapse Click to show conversion logs
# Included for tutorial purposes only.
# HLCA tutorial harmonization (4 datasets)
#
# Note: this is best run on a workstation/server because it can download/build caches on first run.
if not data_h5ad_dict_hlca_tutorial:
    print('No tutorial datasets found; skipping harmonization. Edit `base_path` in section 5.2.2 (or update the paths) and re-run.')
    harmonizer_hlca = None
else:
    project_out_hlca = (LOCAL_REPOSITORY / 'hlca_tutorial_outputs').resolve()
    project_out_hlca.mkdir(parents=True, exist_ok=True)
    # Resolve the organism and pick a snapshot boundary
    api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
    organism_name, latest_release = api.resolve_organism('human')
    # Option A (default): always use the latest release available in your graph
    TARGET_RELEASE = latest_release
    # Option B (recommended for pipelines/papers): pin the release explicitly
    # TARGET_RELEASE = 110
    # Reminder (see 5.1.1): graph_last_ensembl_release must match the graph snapshot you built.
    # If you pin TARGET_RELEASE, make sure your local graph was built for that release.
    harmonizer_hlca = idtrack.HarmonizeFeatures(
        project_name='hlca_tutorial',
        data_h5ad_dict=data_h5ad_dict_hlca_tutorial,
        project_local_repository=str(project_out_hlca),
        idtrack_local_repository=str(LOCAL_REPOSITORY),
        target_ensembl_release=TARGET_RELEASE,
        final_database='HGNC Symbol',
        organism_name=organism_name,
        graph_last_ensembl_release=TARGET_RELEASE,
        verbose_level=1,
    )
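Pinning a release is most useful when the pinned value is also written down next to the outputs. As a minimal sketch, the snippet below serializes the run parameters to JSON with the standard library. `build_run_record` and its field names are hypothetical (IDTrack does not require or produce this record), the release value 110 is only the example from the commented-out Option B, and the dataset path is a placeholder.

```python
# Hypothetical provenance record for a harmonization run (stdlib only).
import json


def build_run_record(project_name, target_release, final_database, dataset_paths):
    """Collect the parameters needed to repeat a harmonization run."""
    return {
        'project_name': project_name,
        'target_ensembl_release': int(target_release),
        'final_database': final_database,
        'datasets': dict(sorted(dataset_paths.items())),
    }


record = build_run_record(
    project_name='hlca_tutorial',
    target_release=110,  # example value only
    final_database='HGNC Symbol',
    dataset_paths={'Meyer_2021': '/path/to/meyer_2021.h5ad'},  # placeholder path
)
record_json = json.dumps(record, indent=2, sort_keys=True)
print(record_json)
```

You could write `record_json` to a file inside the project output directory so that the pickles and the parameters that produced them stay together.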
5.2.5 Inspect what happened (simple tables)
After the harmonizer is created, it exposes diagnostics that tell you whether harmonization is trustworthy.
In this tutorial we focus on a few practical checks:
how many identifiers mapped cleanly (1→1)
how many failed (1→0)
how many were ambiguous (1→n)
how many HGNC targets were actually fallbacks (no symbol available at the snapshot)
We keep the output as small tables so it is easy to read in a tutorial notebook.
# Per-dataset conversion summary (HGNC target)
if 'harmonizer_hlca' not in globals() or harmonizer_hlca is None:
    print('HLCA harmonizer not created; run section 5.2.4 first.')
else:
    matchings_by_dataset = harmonizer_hlca.get_idtrack_matchings_for_all_datasets()

    def _summarize_matchings(matchings: list[dict]) -> dict[str, int]:
        c = {
            'input_ids': 0,
            'one_to_one': 0,
            'one_to_many': 0,
            'one_to_none': 0,
            'fallback_1_to_1': 0,
            'fallback_1_to_n': 0,
        }
        for m in matchings:
            c['input_ids'] += 1
            if m.get('no_corresponding') or m.get('no_conversion'):
                c['one_to_none'] += 1
                continue
            target_ids = m.get('target_id', [])
            n_targets = len(target_ids)
            if n_targets == 1:
                c['one_to_one'] += 1
                if m.get('no_target'):
                    c['fallback_1_to_1'] += 1
            elif n_targets > 1:
                c['one_to_many'] += 1
                if m.get('no_target'):
                    c['fallback_1_to_n'] += 1
            else:
                # Defensive fallback: treat empty target list as 1→0
                c['one_to_none'] += 1
        return c

    rows = []
    for dataset, matchings in matchings_by_dataset.items():
        s = _summarize_matchings(matchings)
        rows.append({'dataset': dataset, **s})
    df = pd.DataFrame(rows).set_index('dataset').sort_index()
    df['one_to_one_target'] = df['one_to_one'] - df['fallback_1_to_1']
    df['one_to_many_target'] = df['one_to_many'] - df['fallback_1_to_n']
    df['success_rate'] = (df['one_to_one'] + df['one_to_many']) / df['input_ids']
    df['failure_rate'] = df['one_to_none'] / df['input_ids']
    # Keep the tutorial table compact
    df = df[[
        'input_ids',
        'one_to_one_target',
        'fallback_1_to_1',
        'one_to_many_target',
        'fallback_1_to_n',
        'one_to_none',
        'success_rate',
        'failure_rate',
    ]]
    display(df)
| dataset | input_ids | one_to_one_target | fallback_1_to_1 | one_to_many_target | fallback_1_to_n | one_to_none | success_rate | failure_rate |
|---|---|---|---|---|---|---|---|---|
| Barbry_unpubl | 16859 | 15721 | 1095 | 4 | 12 | 27 | 0.998398 | 0.001602 |
| Kaminski_2020 | 45947 | 34808 | 11002 | 19 | 19 | 99 | 0.997845 | 0.002155 |
| MeyerNikolic_unpubl | 33582 | 27722 | 5689 | 22 | 23 | 126 | 0.996248 | 0.003752 |
| Meyer_2021 | 20922 | 20591 | 312 | 1 | 0 | 18 | 0.999140 | 0.000860 |
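As a worked check of how the columns relate, the snippet below recomputes the rates for the Barbry_unpubl row from its counts. It also derives a fallback share, which is not a column in the table but is a useful extra number: a high success rate can still hide many identifiers that mapped without a real HGNC symbol.

```python
# Counts copied from the Barbry_unpubl row of the table above.
row = {
    'input_ids': 16859,
    'one_to_one_target': 15721,
    'fallback_1_to_1': 1095,
    'one_to_many_target': 4,
    'fallback_1_to_n': 12,
    'one_to_none': 27,
}

# Every input identifier falls into exactly one of the five outcome columns.
mapped = (
    row['one_to_one_target']
    + row['fallback_1_to_1']
    + row['one_to_many_target']
    + row['fallback_1_to_n']
)
assert mapped + row['one_to_none'] == row['input_ids']

success_rate = mapped / row['input_ids']
failure_rate = row['one_to_none'] / row['input_ids']
# Extra diagnostic (not in the table): share of mapped IDs that are fallbacks.
fallback_share = (row['fallback_1_to_1'] + row['fallback_1_to_n']) / mapped

print(f'success_rate={success_rate:.6f}')
print(f'failure_rate={failure_rate:.6f}')
print(f'fallback_share={fallback_share:.4f}')
```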
5.2.6 Build harmonized AnnData (union vs intersect)
Now that identifiers are harmonized, we can actually merge the datasets into a single AnnData object.
IDTrack provides two practical merge modes:
mode='union': keep the union of genes (missing genes become zeros; wider matrix)
mode='intersect': keep only genes shared by all datasets (stricter; narrower matrix)
Feature naming reminder: .var_names are Ensembl gene IDs and .var['converted_id'] holds the HGNC symbol.
This step loads the datasets into memory, so run it on a workstation/server.
%%collapse Click to show merge logs
# Included for tutorial purposes only.
if 'harmonizer_hlca' not in globals() or harmonizer_hlca is None:
    print('HLCA harmonizer not created; run section 5.2.4 first.')
    hlca_union = None
    hlca_intersect = None
else:
    hlca_union = harmonizer_hlca.unify_multiple_anndatas(mode='union')
    hlca_intersect = harmonizer_hlca.unify_multiple_anndatas(mode='intersect')
    comparison = pd.DataFrame(
        {
            'n_cells': [hlca_union.n_obs, hlca_intersect.n_obs],
            'n_genes': [hlca_union.n_vars, hlca_intersect.n_vars],
        },
        index=['union', 'intersect'],
    )
    if hlca_union.n_vars:
        comparison['genes_retained_vs_union'] = [1.0, hlca_intersect.n_vars / hlca_union.n_vars]
    # Quick sanity checks
    if 'intersection' in hlca_union.var.columns:
        print("Genes present in all datasets (from union .var['intersection']):", int(hlca_union.var['intersection'].sum()))
    # You now have two harmonized AnnDatas to use downstream:
    # - hlca_union
    # - hlca_intersect
    # Optional: save to disk (can be large)
    # hlca_union.write_h5ad(project_out_hlca / 'hlca_union_harmonized.h5ad')
    # hlca_intersect.write_h5ad(project_out_hlca / 'hlca_intersect_harmonized.h5ad')
    display(comparison)
| | n_cells | n_genes | genes_retained_vs_union |
|---|---|---|---|
| union | 774178 | 47188 | 1.000000 |
| intersect | 774178 | 13405 | 0.284076 |
Genes present in all datasets (from union .var['intersection']): 13405
100%|#######################################| 4/4 [01:39<00:00, 24.83s/it, dataset=Barbry_unpubl, study_var=16251, union_var=46761, dbh=608]
100%|#######################################| 4/4 [01:18<00:00, 19.73s/it, dataset=Barbry_unpubl, study_var=16251, union_var=19815, dbh=608]
display(hlca_union)
display(hlca_intersect)
AnnData object with n_obs × n_vars = 774178 × 47188
obs: 'handle_anndata'
var: 'converted_id_Kaminski_2020', 'converted_id_Meyer_2021', 'converted_id_MeyerNikolic_unpubl', 'converted_id_Barbry_unpubl', 'converted_id', 'intersection'
AnnData object with n_obs × n_vars = 774178 × 13405
obs: 'handle_anndata'
var: 'converted_id'
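To get a feel for what the two shapes mean for memory-heavy models, here is a rough back-of-the-envelope estimate. It assumes the matrix is densified as float32 (4 bytes per value), which is an assumption about your downstream tool; the stored matrices may well be sparse and much smaller. `dense_gib` is a hypothetical helper.

```python
# Rough upper bound: size of a fully dense matrix (float32 assumed by default).

def dense_gib(n_obs, n_vars, bytes_per_value=4):
    """Size in GiB of a dense n_obs × n_vars matrix."""
    return n_obs * n_vars * bytes_per_value / 2**30


# Shapes copied from the AnnData summaries above.
union_gib = dense_gib(774178, 47188)
intersect_gib = dense_gib(774178, 13405)

print(f'union:     {union_gib:.1f} GiB if densified')
print(f'intersect: {intersect_gib:.1f} GiB if densified')
```

Because the number of cells is identical in both modes, the ratio between the two estimates equals the genes_retained_vs_union value in the comparison table.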
5.3 — Legacy Data Rescue
Legacy datasets often contain a mix of:
older Ensembl IDs (from older releases)
gene symbols (which may have changed)
project-specific aliases
A safe, reproducible rescue workflow is:
Pick a snapshot boundary (the newest release you allow).
Convert into a stable namespace (usually Ensembl gene IDs) at that snapshot.
Inspect failure + ambiguity rates.
Only then convert into presentation-friendly labels (HGNC symbols) if needed.
The next cell demonstrates a small, realistic “mixed identifier” rescue using the human API.
Tip: Legacy data rescue often involves older releases and older assemblies. The default human snapshot is multi-assembly and can map GRCh37-derived identifiers into your chosen snapshot/primary assembly. Only rebuild with genome_assembly=37 if your downstream reference is GRCh37 and you want outputs anchored to that build (see Part 3).
# Legacy rescue demo (human)
api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
api.configure_logger()
organism, latest_release = api.resolve_organism('human')
api.build_graph(organism_name=organism, snapshot_release=latest_release, calculate_caches=False)
legacy_ids = [
    'TP53',             # HGNC symbol (common)
    'P53',              # older/alias-like symbol (may or may not resolve)
    'ENSG00000141510',  # Ensembl gene ID
    'ENSG00000139618',  # BRCA2
    'BRCA1',            # symbol
    'NOT_A_REAL_GENE',  # should become a clean 1→0 example
]
# Convert into Ensembl gene IDs at the snapshot boundary (stable backbone)
results = api.convert_identifier_multiple(legacy_ids, to_release=latest_release, final_database=None, strategy='all', verbose=False)
summary = api.classify_multiple_conversion(results)
api.print_binned_conversion(summary)
# Show a couple of raw results for inspection
results[:3]
[{'target_id': ['ENSG00000141510.20'],
'last_node': [('ENSG00000141510.20', 'ENSG00000141510.20')],
'final_database': 'ensembl_gene',
'graph_id': 'TP53',
'query_id': 'TP53',
'no_corresponding': False,
'no_conversion': False,
'no_target': False},
{'target_id': ['LRG_321.1'],
'last_node': [('LRG_321.1', 'LRG_321.1')],
'final_database': 'ensembl_gene',
'graph_id': 'P53',
'query_id': 'P53',
'no_corresponding': False,
'no_conversion': False,
'no_target': False},
{'target_id': ['ENSG00000141510.20'],
'last_node': [('ENSG00000141510.20', 'ENSG00000141510.20')],
'final_database': 'ensembl_gene',
'graph_id': 'ENSG00000141510',
'query_id': 'ENSG00000141510',
'no_corresponding': False,
'no_conversion': False,
'no_target': False}]
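Two details in the raw results are worth handling explicitly: target IDs carry a version suffix (for example `.20`), and the alias-like `P53` resolved to an LRG identifier rather than an ENSG gene ID. The sketch below separates the two kinds of target and strips the version from ENSG IDs. `split_targets` and the regular expression are hypothetical helpers; the pattern assumes human Ensembl gene IDs of the form ENSG plus 11 digits.

```python
# Hypothetical post-processing of conversion results (not an IDTrack function).
import re

# Assumption: human Ensembl gene IDs look like ENSG + 11 digits, optional ".version".
ENSG_VERSIONED = re.compile(r'^(ENSG\d{11})(?:\.\d+)?$')


def split_targets(result):
    """Return (unversioned ENSG targets, all other targets) for one result dict."""
    ensembl, other = [], []
    for target in result.get('target_id', []):
        match = ENSG_VERSIONED.match(target)
        if match:
            ensembl.append(match.group(1))
        else:
            other.append(target)
    return ensembl, other


# Trimmed copies of the first two results printed above.
example_results = [
    {'query_id': 'TP53', 'target_id': ['ENSG00000141510.20']},
    {'query_id': 'P53', 'target_id': ['LRG_321.1']},
]
for res in example_results:
    ensembl, other = split_targets(res)
    print(res['query_id'], '| ENSG:', ensembl, '| other:', other)
```

Non-ENSG targets are not failures, but they deserve a manual look before you treat them as part of an Ensembl-gene backbone.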
5.4 Summary
You now have a tutorial workflow for feature identifier harmonization across multiple .h5ad datasets.
Core ideas to keep:
Harmonize identifiers before integration; otherwise you risk mixing incompatible feature definitions.
Use Ensembl gene IDs as the stable integration backbone; keep HGNC symbols (or other externals) for readable reporting.
Treat 1→0 (no match) and 1→n (ambiguity) as diagnostics: inspect and decide a policy.
Create both union and intersect merged AnnDatas early; they answer different downstream questions.
Practical next step: Use hlca_union or hlca_intersect as input to your integration model of choice.