Part 5 — Real-World Experiments: Harmonization

Last updated: 2026-01-08

This tutorial shows how to use IDTrack for real-world dataset harmonization.

You will learn:

  • how to harmonize feature identifiers across multiple .h5ad datasets (HLCA-style)

  • how to interpret harmonization diagnostics (what changed, what failed, what is ambiguous)

  • how to choose between union vs intersection feature spaces

  • how to approach legacy data rescue (older identifiers, mixed namespaces)

Tip: Start with the toy demo. The same logic scales to large datasets.

5.0 — Why harmonization matters (plain language)

When you merge datasets, you implicitly assume that feature X in dataset A is the same biological entity as feature X in dataset B.

This breaks when:

  • datasets use different Ensembl releases (IDs changed)

  • one dataset uses HGNC symbols and the other uses Ensembl IDs

  • some features map 1→n (ambiguity) or 1→0 (no match)

IDTrack makes these cases explicit and gives you reproducible conversions anchored to a graph snapshot.
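To make the failure mode concrete, here is a minimal sketch in plain Python (no IDTrack calls). The two feature lists and the `symbol_to_ensembl` lookup are illustrative assumptions that stand in for a real mapping; in practice the mapping would come from IDTrack's graph.

```python
# Illustrative only: this small lookup table stands in for a real identifier mapping.
features_a = ["TP53", "BRCA1", "BRCA2"]  # dataset A: HGNC symbols
features_b = ["ENSG00000141510", "ENSG00000012048", "ENSG00000139618"]  # dataset B: Ensembl gene IDs

# Naive merge: compare the raw strings. Nothing matches, although the genes are the same.
naive_shared = set(features_a) & set(features_b)

# Harmonized merge: bring dataset A into dataset B's namespace first.
symbol_to_ensembl = {
    "TP53": "ENSG00000141510",
    "BRCA1": "ENSG00000012048",
    "BRCA2": "ENSG00000139618",
}
harmonized_a = [symbol_to_ensembl[s] for s in features_a]
harmonized_shared = set(harmonized_a) & set(features_b)

print("shared features (naive):", len(naive_shared))
print("shared features (harmonized):", len(harmonized_shared))
```

The naive comparison finds no shared features, so an intersection-style merge would silently discard every gene, while a union-style merge would store each gene twice under two different names.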

5.0.1 Pre-requisites

  • You have run 03_initialization_graph.ipynb for human, so a graph snapshot exists in your local repository.

  • You have anndata installed (it is an IDTrack dependency).

  • For the HLCA section, you need access to the HLCA .h5ad files (not bundled here).

1
# Load notebook utilities (collapsible output magic for tutorials)
%load_ext _notebook_utils

2
# 1) Setup
from __future__ import annotations

import os
from pathlib import Path

import anndata as ad
import numpy as np
import pandas as pd
from scipy import sparse

import idtrack

LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)

print('IDTrack local repository:', LOCAL_REPOSITORY)

IDTrack local repository: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache

5.1 — Toy Harmonization (start small, then scale)

We will create two small AnnData objects with overlapping genes but different identifier styles. This shows the workflow without requiring large data downloads.

3
# 5.1 Create toy datasets

toy_dir = LOCAL_REPOSITORY / 'toy_harmonization_demo'
toy_dir.mkdir(parents=True, exist_ok=True)

# Dataset A: HGNC symbols (common in many wet-lab exports)
genes_a = ['TP53', 'BRCA1', 'BRCA2', 'BRAF', 'KRAS']
X_a = sparse.random(50, len(genes_a), density=0.2, format='csr', random_state=0)
adata_a = ad.AnnData(X=X_a, obs=pd.DataFrame(index=[f'cellA_{i}' for i in range(X_a.shape[0])]), var=pd.DataFrame(index=genes_a))

# Dataset B: mix of HGNC + an Ensembl stable ID (realistic messy scenario)
genes_b = ['TP53', 'ENSG00000141510', 'BRCA1', 'NOT_A_REAL_GENE']
X_b = sparse.random(60, len(genes_b), density=0.2, format='csr', random_state=1)
adata_b = ad.AnnData(X=X_b, obs=pd.DataFrame(index=[f'cellB_{i}' for i in range(X_b.shape[0])]), var=pd.DataFrame(index=genes_b))

path_a = toy_dir / 'toy_A_symbols.h5ad'
path_b = toy_dir / 'toy_B_mixed.h5ad'
adata_a.write_h5ad(path_a)
adata_b.write_h5ad(path_b)

path_a, path_b

3
(PosixPath('/ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/toy_harmonization_demo/toy_A_symbols.h5ad'),
 PosixPath('/ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/toy_harmonization_demo/toy_B_mixed.h5ad'))

5.1.1 Run the harmonizer

IDTrack provides idtrack.HarmonizeFeatures.

Key parameters (interpretation):

  • idtrack_local_repository: where your built graph snapshot lives

  • graph_last_ensembl_release: which snapshot release the graph contains (must match your build)

  • target_ensembl_release: the release you want to harmonize to

  • final_database: what you want to keep as feature IDs (HGNC, Ensembl, …)

4
%%collapse Click to show conversion logs
# Included for tutorial purposes only.

# 5.1.1 Harmonize the toy datasets

data_h5ad_dict = {
    'toy_A': str(path_a),
    'toy_B': str(path_b),
}

project_out = toy_dir / 'outputs'
project_out.mkdir(parents=True, exist_ok=True)

organism_name, latest_release = idtrack.API(str(LOCAL_REPOSITORY)).resolve_organism('human')

harmonizer = idtrack.HarmonizeFeatures(
    project_name='toy_demo',
    data_h5ad_dict=data_h5ad_dict,
    project_local_repository=str(project_out),
    idtrack_local_repository=str(LOCAL_REPOSITORY),
    target_ensembl_release=latest_release,
    final_database='HGNC Symbol',
    organism_name=organism_name,
    graph_last_ensembl_release=latest_release,
    verbose_level=1,
)

harmonizer

4
<idtrack._harmonize_features.HarmonizeFeatures at 0x7f326447b190>

5.1.2 Inspect what happened

Useful things to look at:

  • which identifiers failed conversion

  • which identifiers were ambiguous

  • per-dataset result pickle files written under project_local_repository

5
print('Conversion failed (any dataset):', sorted(list(harmonizer.conversion_failed_identifiers))[:20])
print('Conversion failed but consistent:', sorted(list(harmonizer.conversion_failed_but_consistent_identifiers))[:20])
print('Converted IDs with multiple Ensembl possibilities (collapsed):', list(harmonizer.multiple_ensembl_dict)[:10])

Conversion failed (any dataset): ['ENSG00000141510', 'NOT_A_REAL_GENE']
Conversion failed but consistent: []
Converted IDs with multiple Ensembl possibilities (collapsed): ['TP53', 'BRCA1', 'BRCA2', 'BRAF', 'KRAS']
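A failed identifier that looks like an Ensembl gene ID usually deserves a different follow-up than a junk string, so it can help to bin failures by apparent namespace before investigating them. The sketch below is one rough way to do that; `guess_namespace` and the regular expression are hypothetical helpers written for this tutorial, not part of IDTrack.

```python
import re

# Human Ensembl gene IDs: "ENSG" + 11 digits, with an optional ".version" suffix.
ENSEMBL_GENE_RE = re.compile(r"^ENSG\d{11}(\.\d+)?$")


def guess_namespace(identifier: str) -> str:
    """Hypothetical helper: a rough pattern-based guess, not an IDTrack API."""
    if ENSEMBL_GENE_RE.match(identifier):
        return "ensembl_gene_like"
    return "other"


# Copied from the "Conversion failed (any dataset)" output above.
failed = ["ENSG00000141510", "NOT_A_REAL_GENE"]

by_namespace = {}
for identifier in failed:
    by_namespace.setdefault(guess_namespace(identifier), []).append(identifier)

for namespace, ids in sorted(by_namespace.items()):
    print(namespace, ids)
```

Entries in the `ensembl_gene_like` bin are worth re-checking individually (for example with a single-identifier conversion), whereas entries in `other` are more likely typos or placeholders.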

5.1.3 Produce a unified AnnData (union or intersection)

After harmonization, you often want a single merged dataset for downstream analysis.

IDTrack’s harmonizer can merge datasets in two modes:

  • mode='union' (default): keep the union of all features (missing genes become zeros)

  • mode='intersect': keep only features present in every dataset

6
# Merge the toy datasets (this uses anndata.concat under the hood)
unified_union = harmonizer.unify_multiple_anndatas(mode='union')
unified_intersect = harmonizer.unify_multiple_anndatas(mode='intersect')

print('Union shape:', unified_union.shape)
print('Intersect shape:', unified_intersect.shape)

100%|█████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.73it/s, dataset=toy_B, study_var=2, union_var=5, dbh=2]
100%|█████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.91it/s, dataset=toy_B, study_var=2, union_var=5, dbh=2]
Union shape: (110, 5)
Intersect shape: (110, 2)
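The two shapes can be reasoned about with plain set arithmetic. The sketch below assumes that identifiers which failed conversion are dropped before merging (the progress bar's study_var=2 for toy_B suggests this, but it is an assumption here), and it uses gene symbols as labels purely for readability.

```python
# Feature sets per dataset after harmonization (assumption: failed IDs are dropped).
harmonized = {
    "toy_A": {"TP53", "BRCA1", "BRCA2", "BRAF", "KRAS"},
    "toy_B": {"TP53", "BRCA1"},
}
n_cells = {"toy_A": 50, "toy_B": 60}

# Union keeps every feature seen in any dataset; intersection keeps only shared ones.
union_features = set.union(*harmonized.values())
intersect_features = set.intersection(*harmonized.values())

# Cells are stacked in both modes, so only the feature axis differs.
total_cells = sum(n_cells.values())

print("Union shape:", (total_cells, len(union_features)))
print("Intersect shape:", (total_cells, len(intersect_features)))
```

Both modes keep all 110 cells; only the number of features changes.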

At HLCA scale, union can create a very large feature matrix. If you plan to run memory-heavy models, consider starting with intersect or filtering genes first.
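A quick back-of-the-envelope calculation shows why this matters. The sketch below estimates the size of a dense float32 matrix for the union and intersect shapes reported in section 5.2.6 (4-study tutorial subset). `dense_gib` is a hypothetical helper, and float32 storage is an assumption; sparse storage is much smaller, but any step that densifies the matrix pays this cost.

```python
def dense_gib(n_cells: int, n_genes: int, bytes_per_value: int = 4) -> float:
    """Size in GiB of a dense matrix (default: 4 bytes per value, i.e. float32)."""
    return n_cells * n_genes * bytes_per_value / 2**30


# Shapes taken from the merged AnnData objects in section 5.2.6.
shapes = {
    "union": (774178, 47188),
    "intersect": (774178, 13405),
}

for mode, (cells, genes) in shapes.items():
    print(f"{mode}: {cells} x {genes} -> about {dense_gib(cells, genes):.1f} GiB dense")
```

The same helper can be used to check whether a gene filter (for example, highly variable genes only) brings the matrix within the memory budget of the model you plan to run.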

5.2 — Multi-Dataset Integration (best practices)

Once you can harmonize identifiers, you can integrate datasets without accidentally mixing incompatible feature definitions.

Practical best practices:

  • Start by harmonizing into a stable backbone (usually Ensembl gene IDs). Convert to symbols only for reporting.

  • Treat 1→n mappings as real biology/data ambiguity, not as a nuisance. Decide a policy:

    • drop ambiguous features

    • keep all candidates (inflates feature space)

    • collapse to a single representative (requires an explicit rule)

  • Pilot first: run harmonization on a small subset and inspect diagnostics before scaling up.

  • Union vs intersection:

    • union keeps more features but can produce a very wide matrix.

    • intersect is stricter and often improves comparability, but may drop biologically important genes.

Expected result: you should be able to justify (and report) your integration choice, not just run a tool.
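The three 1→n policies listed above can be written down as one small function, which makes the chosen policy explicit and easy to report. This is a sketch only: `apply_ambiguity_policy` is a hypothetical helper (not an IDTrack API), the identifiers are placeholders, and the default collapse rule (`min`, the lexicographically smallest candidate) is an arbitrary but deterministic choice.

```python
VALID_POLICIES = ("drop", "keep_all", "collapse")


def apply_ambiguity_policy(mapping, policy="drop", choose=min):
    """Hypothetical helper: resolve a source -> [candidate targets] mapping.

    policy:  'drop' removes ambiguous sources, 'keep_all' keeps every candidate,
             'collapse' keeps one candidate picked by `choose`.
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")

    resolved = {}
    for source_id, targets in mapping.items():
        if not targets:
            continue  # 1→0: nothing to keep under any policy
        if len(targets) == 1 or policy == "keep_all":
            resolved[source_id] = list(targets)
        elif policy == "collapse":
            resolved[source_id] = [choose(targets)]
        # policy == 'drop': ambiguous sources are skipped
    return resolved


# Placeholder identifiers, for illustration only.
mapping = {
    "GENE_A": ["ID_1"],          # 1→1
    "GENE_B": ["ID_2", "ID_3"],  # 1→n
    "GENE_C": [],                # 1→0
}

for policy in VALID_POLICIES:
    print(policy, apply_ambiguity_policy(mapping, policy))
```

Whatever rule you pass as `choose`, record it alongside your results so that the collapse step can be reproduced.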

5.2.1 Scaling up to HLCA (the real use case)

HLCA-scale harmonization is the same logic, but you need to manage:

  • many datasets (often multiple files per study)

  • large gene lists (tens of thousands of identifiers)

  • choosing a snapshot release (what you harmonize to)

  • disk for intermediate artifacts (pickles/logs)

5.2.2 HLCA datasets used in this tutorial

HLCA .h5ad files are not shipped with IDTrack. In this tutorial we simply point to existing files (no downloads).

The next cell defines a curated HLCA list (study → list of .h5ad paths). If you’re running this notebook on a different system, edit base_path in the next cell to match where the HLCA data lives.

7
# 5.2.2 HLCA dataset paths used in this tutorial
#
# This notebook is meant to be easy to follow: edit *one* string if your HLCA data lives elsewhere.

base_path = '/lustre/groups/ml01/projects/2023_HLCA_LSikkema/HLCA_reproducibility/data'
dset0_dir = os.path.join(base_path, 'HLCA_extended/extension_datasets/ready/full')
dset1_dir = os.path.join(base_path, 'HLCA_extended/extension_datasets/raw')

# Curated HLCA AnnData list (study → list of .h5ad files)
hlca_adata_dict = {
    'Kaminski_2020': [f'{dset0_dir}/adams.h5ad'],
    'Meyer_2021': [f'{dset0_dir}/meyer_2021.h5ad'],
    'MeyerNikolic_unpubl': [f'{dset0_dir}/meyer_nikolic_unpubl.h5ad'],
    'Barbry_unpubl': [f'{dset0_dir}/barbry.h5ad'],
    'Regev_2021': [
        f'{dset0_dir}/delorey_cryo.h5ad',
        f'{dset0_dir}/delorey_fresh.h5ad',
        f'{dset0_dir}/delorey_nuclei.h5ad',
    ],
    'Thienpont_2018': [f'{dset1_dir}/Lambrechts/lambrechts.h5ad'],
    'Budinger_2020': [f'{dset0_dir}/bharat.h5ad'],
    'Banovich_Kropski_2020': [f'{dset0_dir}/haberman.h5ad'],
    'Sheppard_2020': [f'{dset0_dir}/tsukui.h5ad'],
    'Wunderink_2021': [f'{dset0_dir}/grant_cryo.h5ad', f'{dset0_dir}/grant_fresh.h5ad'],
    'Lambrechts_2021': [f'{dset0_dir}/wouters.h5ad'],
    'Zhang_2021': [f'{dset1_dir}/Liao/covid_for_publish.h5ad'],
    'Duong_lungMAP_unpubl': [f'{dset0_dir}/duong.h5ad'],
    'Janssen_2020': [f'{dset0_dir}/mould.h5ad'],
    'Sun_2020': [
        f'{dset0_dir}/wang_sub_batch1.h5ad',
        f'{dset0_dir}/wang_sub_batch2.h5ad',
        f'{dset0_dir}/wang_sub_batch3.h5ad',
        f'{dset0_dir}/wang_sub_batch4.h5ad',
    ],
    'Gomperts_2021': [
        f'{dset0_dir}/carraro_ucla.h5ad',
        f'{dset0_dir}/carraro_cff.h5ad',
        f'{dset0_dir}/carraro_csmc.h5ad',
    ],
    'Eils_2020': [f'{dset0_dir}/lukassen.h5ad'],
    'Schiller_2020': [f'{dset0_dir}/mayr.h5ad'],
    'Misharin_Budinger_2018': [f'{dset0_dir}/reyfman_disease.h5ad'],
    'Shalek_2018': [f'{dset0_dir}/ordovasmontanes.h5ad'],
    'Schiller_2021': [f'{dset0_dir}/schiller_discovair.h5ad'],
    'Peer_Massague_2020': [f'{dset0_dir}/laughney.h5ad'],
    'Lafyatis_2019': [f'{dset0_dir}/valenzi.h5ad'],
    'Tata_unpubl': [f'{dset0_dir}/tata_unpubl.h5ad'],
    'Xu_2020': [f'{dset0_dir}/guo.h5ad'],
    'Sims_2019': [f'{dset0_dir}/szabo.h5ad'],
    'Schultze_unpubl': [f'{dset0_dir}/schultze_unpubl.h5ad'],
}

print('HLCA data root:', base_path)
print('Studies in curated list:', len(hlca_adata_dict))

HLCA data root: /lustre/groups/ml01/projects/2023_HLCA_LSikkema/HLCA_reproducibility/data
Studies in curated list: 27
8
rows = []
for study, paths in hlca_adata_dict.items():
    for p in paths:
        rows.append({'study': study, 'file': os.path.basename(p), 'path': p, 'exists': os.path.exists(p)})

hlca_path_status = pd.DataFrame(rows)
missing = hlca_path_status[~hlca_path_status['exists']].copy()

print(f"HLCA files in curated list: {len(hlca_path_status)}")
print(f"Missing files: {len(missing)}")

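if not missing.empty:
    # A bare expression inside an `if` block is not rendered by Jupyter, so display it explicitly.
    display(missing.sort_values(['study', 'file']))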

HLCA files in curated list: 35
Missing files: 0

5.2.3 Tutorial subset (4 studies; easy to inspect)

The full hlca_adata_dict contains many studies (and some have multiple .h5ad files). For a tutorial notebook, we intentionally keep the example small so it is easy to read, debug, and reproduce.

We will only use these 4 single-file studies:

  • Kaminski_2020

  • Meyer_2021

  • MeyerNikolic_unpubl

  • Barbry_unpubl

Important: idtrack.HarmonizeFeatures expects dict[str, str] (dataset_alias → single .h5ad path). So we convert the curated hlca_adata_dict (study → list[path]) into a small tutorial dictionary.
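If you later want to include multi-file studies, one possible approach is to give every file its own alias. The sketch below is an assumption about how you might do that, not something IDTrack prescribes: `flatten_study_dict` is a hypothetical helper, the `study__filestem` alias scheme is arbitrary, and the `/data/...` paths are placeholders.

```python
from pathlib import Path


def flatten_study_dict(study_to_paths):
    """Hypothetical helper: turn study -> [paths] into alias -> path."""
    flat = {}
    for study, paths in study_to_paths.items():
        for p in paths:
            # Single-file studies keep the study name; multi-file studies get
            # the file stem appended so that every alias stays unique.
            alias = study if len(paths) == 1 else f"{study}__{Path(p).stem}"
            if alias in flat:
                raise ValueError(f"duplicate alias: {alias}")
            flat[alias] = str(p)
    return flat


# Placeholder paths, for illustration only.
example = {
    "Meyer_2021": ["/data/meyer_2021.h5ad"],
    "Wunderink_2021": ["/data/grant_cryo.h5ad", "/data/grant_fresh.h5ad"],
}

for alias, path in flatten_study_dict(example).items():
    print(alias, "->", path)
```

Keeping the study name as a prefix means the original study can still be recovered from the alias after merging.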

9
HLCA_TUTORIAL_STUDIES = [
    'Kaminski_2020',
    'Meyer_2021',
    'MeyerNikolic_unpubl',
    'Barbry_unpubl',
]

rows = []
data_h5ad_dict_hlca_tutorial: dict[str, str] = {}

for study in HLCA_TUTORIAL_STUDIES:
    paths = hlca_adata_dict.get(study, [])
    if len(paths) != 1:
        # The tutorial subset is intentionally restricted to single-file studies.
        # If this triggers, update HLCA_TUTORIAL_STUDIES to pick a different study.
        rows.append((study, None, False, f'Expected 1 file, got {len(paths)}'))
        continue

    p = paths[0]
    exists = os.path.exists(p)
    rows.append((study, p, exists, ''))
    if exists:
        data_h5ad_dict_hlca_tutorial[study] = p

hlca_tutorial_status = pd.DataFrame(rows, columns=['study', 'path', 'exists', 'note']).set_index('study')
print('Tutorial datasets found:', len(data_h5ad_dict_hlca_tutorial), '/', len(HLCA_TUTORIAL_STUDIES))

if len(data_h5ad_dict_hlca_tutorial) == 0:
    print('\nNo tutorial HLCA files found. Edit `base_path` in section 5.2.2 (or update the paths) and re-run.')

hlca_tutorial_status

Tutorial datasets found: 4 / 4
9
                     path                                               exists  note
study
Kaminski_2020        /lustre/groups/ml01/projects/2023_HLCA_LSikkem...  True
Meyer_2021           /lustre/groups/ml01/projects/2023_HLCA_LSikkem...  True
MeyerNikolic_unpubl  /lustre/groups/ml01/projects/2023_HLCA_LSikkem...  True
Barbry_unpubl        /lustre/groups/ml01/projects/2023_HLCA_LSikkem...  True

5.2.4 Run harmonization (HGNC target; tutorial subset)

This can take time (and is best run on a workstation/server). IDTrack will:

  • load/build the human graph snapshot (cached in your local repository)

  • run identifier conversion for each dataset

  • write per-dataset mapping results as pickles under the project output directory

This cell is wrapped in a collapsible block because conversion logs can be long.

Note on outputs: when you build a merged AnnData (section 5.2.6), .var_names are Ensembl gene IDs. Mapped HGNC symbols are stored in .var['converted_id'] for readable reporting.

Reproducibility tip: pin the release (set TARGET_RELEASE explicitly) for manuscripts and pipelines.
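One lightweight way to make the pinned settings traceable is to write them to a small JSON file next to the harmonization outputs. The sketch below uses only the standard library; `write_manifest`, the file name, and the field names are hypothetical choices for this tutorial, and the example values (release 110, organism label "human") are placeholders.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_manifest(out_dir, *, target_release, final_database, organism, datasets):
    """Hypothetical helper: store the harmonization settings as JSON."""
    manifest = {
        "created_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "organism": organism,
        "target_ensembl_release": int(target_release),
        "final_database": final_database,
        "datasets": dict(sorted(datasets.items())),
    }
    path = Path(out_dir) / "harmonization_manifest.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2))
    return path


# Demo in a temporary directory; in the notebook you would pass the project output directory.
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = write_manifest(
        tmp,
        target_release=110,
        final_database="HGNC Symbol",
        organism="human",
        datasets={"toy_A": "toy_A_symbols.h5ad"},
    )
    print(manifest_path.read_text())
```

Committing this file together with the analysis code lets a reader see which release and target namespace produced a given result.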

10
%%collapse Click to show conversion logs
# Included for tutorial purposes only.

# HLCA tutorial harmonization (4 datasets)
#
# Note: this is best run on a workstation/server because it can download/build caches on first run.

if not data_h5ad_dict_hlca_tutorial:
    print('No tutorial datasets found; skipping harmonization. Edit `base_path` in section 5.2.2 (or update the paths) and re-run.')
    harmonizer_hlca = None
else:
    project_out_hlca = (LOCAL_REPOSITORY / 'hlca_tutorial_outputs').resolve()
    project_out_hlca.mkdir(parents=True, exist_ok=True)

    # Resolve the organism and pick a snapshot boundary
    api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
    organism_name, latest_release = api.resolve_organism('human')

    # Option A (default): always use the latest release available in your graph
    TARGET_RELEASE = latest_release

    # Option B (recommended for pipelines/papers): pin the release explicitly
    # TARGET_RELEASE = 110

    harmonizer_hlca = idtrack.HarmonizeFeatures(
        project_name='hlca_tutorial',
        data_h5ad_dict=data_h5ad_dict_hlca_tutorial,
        project_local_repository=str(project_out_hlca),
        idtrack_local_repository=str(LOCAL_REPOSITORY),
        target_ensembl_release=TARGET_RELEASE,
        final_database='HGNC Symbol',
        organism_name=organism_name,
        graph_last_ensembl_release=TARGET_RELEASE,
        verbose_level=1,
    )


5.2.5 Inspect what happened (simple tables)

After the harmonizer is created, it exposes diagnostics that tell you whether harmonization is trustworthy.

In this tutorial we focus on a few practical checks:

  • how many identifiers mapped cleanly (1→1)

  • how many failed (1→0)

  • how many were ambiguous (1→n)

  • how many HGNC targets were actually fallbacks (no symbol available at the snapshot)

We keep the output as small tables so it is easy to read in a tutorial notebook.

11
# Per-dataset conversion summary (HGNC target)

if 'harmonizer_hlca' not in globals() or harmonizer_hlca is None:
    print('HLCA harmonizer not created; run section 5.2.4 first.')
else:
    matchings_by_dataset = harmonizer_hlca.get_idtrack_matchings_for_all_datasets()

    def _summarize_matchings(matchings: list[dict]) -> dict[str, int]:
        c = {
            'input_ids': 0,
            'one_to_one': 0,
            'one_to_many': 0,
            'one_to_none': 0,
            'fallback_1_to_1': 0,
            'fallback_1_to_n': 0,
        }

        for m in matchings:
            c['input_ids'] += 1

            if m.get('no_corresponding') or m.get('no_conversion'):
                c['one_to_none'] += 1
                continue

            target_ids = m.get('target_id', [])
            n_targets = len(target_ids)

            if n_targets == 1:
                c['one_to_one'] += 1
                if m.get('no_target'):
                    c['fallback_1_to_1'] += 1
            elif n_targets > 1:
                c['one_to_many'] += 1
                if m.get('no_target'):
                    c['fallback_1_to_n'] += 1
            else:
                # Defensive fallback: treat empty target list as 1→0
                c['one_to_none'] += 1

        return c

    rows = []
    for dataset, matchings in matchings_by_dataset.items():
        s = _summarize_matchings(matchings)
        rows.append({'dataset': dataset, **s})

    df = pd.DataFrame(rows).set_index('dataset').sort_index()
    df['one_to_one_target'] = df['one_to_one'] - df['fallback_1_to_1']
    df['one_to_many_target'] = df['one_to_many'] - df['fallback_1_to_n']
    df['success_rate'] = (df['one_to_one'] + df['one_to_many']) / df['input_ids']
    df['failure_rate'] = df['one_to_none'] / df['input_ids']

    # Keep the tutorial table compact
    df = df[[
        'input_ids',
        'one_to_one_target',
        'fallback_1_to_1',
        'one_to_many_target',
        'fallback_1_to_n',
        'one_to_none',
        'success_rate',
        'failure_rate',
    ]]
    display(df)

                     input_ids  one_to_one_target  fallback_1_to_1  one_to_many_target  fallback_1_to_n  one_to_none  success_rate  failure_rate
dataset
Barbry_unpubl            16859              15721             1095                   4               12           27      0.998398      0.001602
Kaminski_2020            45947              34808            11002                  19               19           99      0.997845      0.002155
MeyerNikolic_unpubl      33582              27722             5689                  22               23          126      0.996248      0.003752
Meyer_2021               20922              20591              312                   1                0           18      0.999140      0.000860
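A high success rate can hide a large share of fallbacks, so it is worth computing that share explicitly. The sketch below re-uses the numbers from the table above, checks that the buckets add up to the number of input identifiers, and reports the fraction of inputs whose target was a fallback (no HGNC symbol available at the snapshot). The variable names are local to this example.

```python
# Rows copied from the table above, in this column order:
# input_ids, one_to_one_target, fallback_1_to_1, one_to_many_target, fallback_1_to_n, one_to_none
table_rows = {
    "Barbry_unpubl": (16859, 15721, 1095, 4, 12, 27),
    "Kaminski_2020": (45947, 34808, 11002, 19, 19, 99),
    "MeyerNikolic_unpubl": (33582, 27722, 5689, 22, 23, 126),
    "Meyer_2021": (20922, 20591, 312, 1, 0, 18),
}

fallback_share = {}
for dataset, (n_in, oto, fb_1, otm, fb_n, n_none) in table_rows.items():
    # Every input identifier should land in exactly one bucket.
    assert oto + fb_1 + otm + fb_n + n_none == n_in, dataset
    fallback_share[dataset] = (fb_1 + fb_n) / n_in
    print(f"{dataset}: fallback share = {fallback_share[dataset]:.1%}")
```

For Kaminski_2020, for example, roughly 24% of the input identifiers end up as fallbacks even though the success rate is above 99%, which is one reason to use Ensembl gene IDs rather than symbols as the integration backbone.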

5.2.6 Build harmonized AnnData (union vs intersect)

Now that identifiers are harmonized, we can actually merge the datasets into a single AnnData object.

IDTrack provides two practical merge modes:

  • mode='union': keep the union of genes (missing genes become zeros; wider matrix)

  • mode='intersect': keep only genes shared by all datasets (stricter; narrower matrix)

Feature naming reminder: .var_names are Ensembl gene IDs and .var['converted_id'] holds the HGNC symbol.

This step loads the datasets into memory, so run it on a workstation/server.

12
%%collapse Click to show merge logs
# Included for tutorial purposes only.

if 'harmonizer_hlca' not in globals() or harmonizer_hlca is None:
    print('HLCA harmonizer not created; run section 5.2.4 first.')
    hlca_union = None
    hlca_intersect = None
else:
    hlca_union = harmonizer_hlca.unify_multiple_anndatas(mode='union')
    hlca_intersect = harmonizer_hlca.unify_multiple_anndatas(mode='intersect')

    comparison = pd.DataFrame(
        {
            'n_cells': [hlca_union.n_obs, hlca_intersect.n_obs],
            'n_genes': [hlca_union.n_vars, hlca_intersect.n_vars],
        },
        index=['union', 'intersect'],
    )
    if hlca_union.n_vars:
        comparison['genes_retained_vs_union'] = [1.0, hlca_intersect.n_vars / hlca_union.n_vars]

    # Quick sanity checks
    if 'intersection' in hlca_union.var.columns:
        print("Genes present in all datasets (from union .var['intersection']):", int(hlca_union.var['intersection'].sum()))

    # You now have two harmonized AnnDatas to use downstream:
    #   - hlca_union
    #   - hlca_intersect

    # Optional: save to disk (can be large)
    # hlca_union.write_h5ad(project_out_hlca / 'hlca_union_harmonized.h5ad')
    # hlca_intersect.write_h5ad(project_out_hlca / 'hlca_intersect_harmonized.h5ad')

    display(comparison)

           n_cells  n_genes  genes_retained_vs_union
union       774178    47188                 1.000000
intersect   774178    13405                 0.284076
Genes present in all datasets (from union .var['intersection']): 13405

 25%|##########5                               | 1/4 [00:12<00:38, 12.91s/it, dataset=Kaminski_2020, study_var=43939, union_var=0, dbh=2008]
 50%|#####################                     | 2/4 [00:30<00:31, 15.58s/it, dataset=Meyer_2021, study_var=20517, union_var=43939, dbh=405]
 75%|########################        | 3/4 [01:00<00:22, 22.21s/it, dataset=MeyerNikolic_unpubl, study_var=30727, union_var=44641, dbh=2855]
100%|#######################################| 4/4 [01:39<00:00, 24.83s/it, dataset=Barbry_unpubl, study_var=16251, union_var=46761, dbh=608]

 25%|##########5                               | 1/4 [00:12<00:36, 12.08s/it, dataset=Kaminski_2020, study_var=43939, union_var=0, dbh=2008]
 50%|#####################                     | 2/4 [00:27<00:28, 14.12s/it, dataset=Meyer_2021, study_var=20517, union_var=43939, dbh=405]
 75%|########################        | 3/4 [00:45<00:16, 16.00s/it, dataset=MeyerNikolic_unpubl, study_var=30727, union_var=19815, dbh=2855]
100%|#######################################| 4/4 [01:18<00:00, 19.73s/it, dataset=Barbry_unpubl, study_var=16251, union_var=19815, dbh=608]
14
display(hlca_union)
display(hlca_intersect)
AnnData object with n_obs × n_vars = 774178 × 47188
    obs: 'handle_anndata'
    var: 'converted_id_Kaminski_2020', 'converted_id_Meyer_2021', 'converted_id_MeyerNikolic_unpubl', 'converted_id_Barbry_unpubl', 'converted_id', 'intersection'
AnnData object with n_obs × n_vars = 774178 × 13405
    obs: 'handle_anndata'
    var: 'converted_id'

5.3 — Legacy Data Rescue

Legacy datasets often contain a mix of:

  • older Ensembl IDs (from older releases)

  • gene symbols (which may have changed)

  • project-specific aliases

A safe, reproducible rescue workflow is:

  1. Pick a snapshot boundary (the newest release you allow).

  2. Convert into a stable namespace (usually Ensembl gene IDs) at that snapshot.

  3. Inspect failure + ambiguity rates.

  4. Only then convert into presentation-friendly labels (HGNC symbols) if needed.

The next cell demonstrates a small, realistic “mixed identifier” rescue using the human API.

Tip: Legacy data rescue often involves older releases and older assemblies. The default human snapshot is multi-assembly and can map GRCh37-derived identifiers into your chosen snapshot/primary assembly. Only rebuild with genome_assembly=37 if your downstream reference is GRCh37 and you want outputs anchored to that build (see Part 3).

13
# Legacy rescue demo (human)

api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
api.configure_logger()

organism, latest_release = api.resolve_organism('human')
api.build_graph(organism_name=organism, snapshot_release=latest_release, calculate_caches=False)

legacy_ids = [
    'TP53',                # HGNC symbol (common)
    'P53',                 # older/alias-like symbol (may or may not resolve)
    'ENSG00000141510',     # Ensembl gene ID
    'ENSG00000139618',     # BRCA2
    'BRCA1',               # symbol
    'NOT_A_REAL_GENE',     # should become a clean 1→0 example
]

# Convert into Ensembl gene IDs at the snapshot boundary (stable backbone)
results = api.convert_identifier_multiple(legacy_ids, to_release=latest_release, final_database=None, strategy='all', verbose=False)
summary = api.classify_multiple_conversion(results)
api.print_binned_conversion(summary)

# Show a couple of raw results for inspection
results[:3]

13
[{'target_id': ['ENSG00000141510.20'],
  'last_node': [('ENSG00000141510.20', 'ENSG00000141510.20')],
  'final_database': 'ensembl_gene',
  'graph_id': 'TP53',
  'query_id': 'TP53',
  'no_corresponding': False,
  'no_conversion': False,
  'no_target': False},
 {'target_id': ['LRG_321.1'],
  'last_node': [('LRG_321.1', 'LRG_321.1')],
  'final_database': 'ensembl_gene',
  'graph_id': 'P53',
  'query_id': 'P53',
  'no_corresponding': False,
  'no_conversion': False,
  'no_target': False},
 {'target_id': ['ENSG00000141510.20'],
  'last_node': [('ENSG00000141510.20', 'ENSG00000141510.20')],
  'final_database': 'ensembl_gene',
  'graph_id': 'ENSG00000141510',
  'query_id': 'ENSG00000141510',
  'no_corresponding': False,
  'no_conversion': False,
  'no_target': False}]
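In the output above, the alias-like query P53 resolves to LRG_321.1 rather than to an ENSG-style identifier. Whether such targets belong in your feature space is a policy decision, so it can help to separate them before going further. The sketch below works on a trimmed copy of the result dictionaries shown above; `split_by_target_kind` is a hypothetical helper, not an IDTrack API.

```python
def split_by_target_kind(results):
    """Hypothetical helper: separate ENSG-style targets from all other targets."""
    ensg, other = {}, {}
    for r in results:
        for target in r.get("target_id", []):
            if target.startswith("ENSG"):
                # Drop the version suffix (e.g. ".20") to get the unversioned stable ID.
                ensg.setdefault(r["query_id"], []).append(target.split(".", 1)[0])
            else:
                other.setdefault(r["query_id"], []).append(target)
    return ensg, other


# Trimmed copy of the first three results printed above.
results_excerpt = [
    {"query_id": "TP53", "target_id": ["ENSG00000141510.20"]},
    {"query_id": "P53", "target_id": ["LRG_321.1"]},
    {"query_id": "ENSG00000141510", "target_id": ["ENSG00000141510.20"]},
]

ensg_targets, other_targets = split_by_target_kind(results_excerpt)
print("ENSG targets:", ensg_targets)
print("Other targets:", other_targets)
```

Queries that end up in the second dictionary are good candidates for manual review before you convert anything into HGNC symbols.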

5.4 Summary

You now have a tutorial workflow for feature identifier harmonization across multiple .h5ad datasets.

Core ideas to keep:

  • Harmonize identifiers before integration; otherwise you risk mixing incompatible feature definitions.

  • Use Ensembl gene IDs as the stable integration backbone; keep HGNC symbols (or other externals) for readable reporting.

  • Treat 1→0 (no match) and 1→n (ambiguity) as diagnostics: inspect and decide a policy.

  • Create both union and intersect merged AnnDatas early; they answer different downstream questions.

Practical next step: Use hlca_union or hlca_intersect as input to your integration model of choice.