Part 5 — Real-World Experiments: Harmonization

Last updated: 2026-01-08

This tutorial shows how to use IDTrack for real-world dataset harmonization.

You will learn:

how to harmonize feature identifiers across multiple .h5ad datasets (HLCA-style)
how to interpret harmonization diagnostics (what changed, what failed, what is ambiguous)
how to choose between union vs intersection feature spaces
how to approach legacy data rescue (older identifiers, mixed namespaces)

Tip: Start with the toy demo first. The exact same logic scales to large datasets.

5.0 — Why harmonization matters (plain language)

When you merge datasets, you implicitly assume that feature X in dataset A is the same biological entity as feature X in dataset B.

This breaks when:

datasets use different Ensembl releases (IDs changed)
one dataset uses HGNC symbols and the other uses Ensembl IDs
some features map 1→n (ambiguity) or 1→0 (no match)

IDTrack makes these cases explicit and gives you reproducible conversions anchored to a graph snapshot.

5.0.1 Pre-requisites

You can run 03_initialization_graph.ipynb for human (graph snapshot exists in your local repository).
You have anndata installed (it is an IDTrack dependency).
For the HLCA section, you need access to the HLCA .h5ad files (not bundled here).

# 1) Setup
from __future__ import annotations

import os
from pathlib import Path

import anndata as ad
import numpy as np
import pandas as pd
from scipy import sparse

import idtrack

LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)

print('IDTrack local repository:', LOCAL_REPOSITORY)

5.1 — HLCA (Human Lung Cell Atlas) Harmonization (start small, then scale)

We will create two small AnnData objects with overlapping genes but different identifier styles. This shows the workflow without requiring large data downloads.

# 2.1 Create toy datasets

toy_dir = LOCAL_REPOSITORY / 'toy_harmonization_demo'
toy_dir.mkdir(parents=True, exist_ok=True)

# Dataset A: HGNC symbols (common in many wet-lab exports)
genes_a = ['TP53', 'BRCA1', 'BRCA2', 'BRAF', 'KRAS']
X_a = sparse.random(50, len(genes_a), density=0.2, format='csr', random_state=0)
adata_a = ad.AnnData(X=X_a, obs=pd.DataFrame(index=[f'cellA_{i}' for i in range(X_a.shape[0])]), var=pd.DataFrame(index=genes_a))

# Dataset B: mix of HGNC + an Ensembl stable ID (realistic messy scenario)
genes_b = ['TP53', 'ENSG00000141510', 'BRCA1', 'NOT_A_REAL_GENE']
X_b = sparse.random(60, len(genes_b), density=0.2, format='csr', random_state=1)
adata_b = ad.AnnData(X=X_b, obs=pd.DataFrame(index=[f'cellB_{i}' for i in range(X_b.shape[0])]), var=pd.DataFrame(index=genes_b))

path_a = toy_dir / 'toy_A_symbols.h5ad'
path_b = toy_dir / 'toy_B_mixed.h5ad'
adata_a.write_h5ad(path_a)
adata_b.write_h5ad(path_b)

path_a, path_b

5.1.1 Run the harmonizer

IDTrack provides idtrack.HarmonizeFeatures.

Key parameters (interpretation):

idtrack_local_repository: where your built graph snapshot lives
graph_last_ensembl_release: which snapshot release the graph contains (must match your build)
target_ensembl_release: the release you want to harmonize to
final_database: what you want to keep as feature IDs (HGNC, Ensembl, …)

# 2.2 Harmonize the toy datasets

data_h5ad_dict = {
    'toy_A': str(path_a),
    'toy_B': str(path_b),
}

project_out = toy_dir / 'outputs'
project_out.mkdir(parents=True, exist_ok=True)

organism_name, latest_release = idtrack.API(str(LOCAL_REPOSITORY)).resolve_organism('human')

harmonizer = idtrack.HarmonizeFeatures(
    project_name='toy_demo',
    data_h5ad_dict=data_h5ad_dict,
    project_local_repository=str(project_out),
    idtrack_local_repository=str(LOCAL_REPOSITORY),
    target_ensembl_release=latest_release,
    final_database='HGNC Symbol',
    organism_name=organism_name,
    graph_last_ensembl_release=latest_release,
    verbose_level=2,
)

harmonizer

5.1.2 Inspect what happened

Useful things to look at:

which identifiers failed conversion
which identifiers were ambiguous
per-dataset result pickle files written under project_local_repository

print('Conversion failed (any dataset):', sorted(list(harmonizer.conversion_failed_identifiers))[:20])
print('Conversion failed but consistent:', sorted(list(harmonizer.conversion_failed_but_consistent_identifiers))[:20])
print('Converted IDs with multiple Ensembl possibilities (collapsed):', list(harmonizer.multiple_ensembl_dict)[:10])

5.1.3 Produce a unified AnnData (union or intersection)

After harmonization, you often want a single merged dataset for downstream analysis.

IDTrack’s harmonizer can merge datasets in two modes:

mode='union' (default): keep the union of all features (missing genes become zeros)
mode='intersect': keep only features present in every dataset

# Merge the toy datasets (this uses AnnData.concat under the hood)
unified_union = harmonizer.unify_multiple_anndatas(mode='union')
unified_intersect = harmonizer.unify_multiple_anndatas(mode='intersect')

print('Union shape:', unified_union.shape)
print('Intersect shape:', unified_intersect.shape)

At HLCA scale, union can create a very large feature matrix. If you plan to run memory-heavy models, consider starting with intersect or filtering genes first.

5.2 — Multi-Dataset Integration (best practices)

Once you can harmonize identifiers, you can integrate datasets without accidentally mixing incompatible feature definitions.

Practical best practices:

Start by harmonizing into a stable backbone (usually Ensembl gene IDs). Convert to symbols only for reporting.
Treat 1→n mappings as real biology/data ambiguity, not as a nuisance. Decide a policy:
- drop ambiguous features
- keep all candidates (inflates feature space)
- collapse to a single representative (requires an explicit rule)
Pilot first: run harmonization on a small subset and inspect diagnostics before scaling up.
Union vs intersection:
- union keeps more features but can produce a very wide matrix.
- intersect is stricter and often improves comparability, but may drop biologically important genes.

Expected result: you should be able to justify (and report) your integration choice, not just run a tool.

5.2.1 Scaling up to HLCA (the real use case)

HLCA-scale harmonization is the same logic, but you need to manage:

many datasets
large gene lists
storage for intermediate results

5.2.2 Point IDTrack to your HLCA data

Set HLCA_BASE_PATH to a folder containing .h5ad files. Example layout:

HLCA_BASE_PATH/
  Dataset1.h5ad
  Dataset2.h5ad
  ...

HLCA_BASE_PATH = Path(os.environ.get('HLCA_BASE_PATH', ''))
HLCA_BASE_PATH

If that prints an empty path, set it in your environment or replace it manually in the cell.

# 3.2 Build a dataset dictionary (dataset name -> .h5ad path)
# This scans only the top level. Adjust if your HLCA layout is nested.

if HLCA_BASE_PATH and HLCA_BASE_PATH.exists():
    hlca_files = sorted(HLCA_BASE_PATH.glob('*.h5ad'))
    data_h5ad_dict_hlca = {p.stem: str(p) for p in hlca_files}
    print('Found .h5ad files:', len(data_h5ad_dict_hlca))
    print('Example keys:', list(data_h5ad_dict_hlca)[:10])
else:
    data_h5ad_dict_hlca = {}
    print('HLCA_BASE_PATH not set or does not exist; skipping HLCA run.')

5.2.3 Run harmonization on HLCA

This can take time. IDTrack will:

load/build the graph snapshot (from your local repository)
run identifier conversion for each dataset
write per-dataset mapping results as pickles under the project output directory

Tip: start with a small subset (e.g. 2–3 datasets) to validate your setup.

# 3.3 (OPTIONAL) HLCA run — uncomment to execute
#
# if data_h5ad_dict_hlca:
#     project_out_hlca = LOCAL_REPOSITORY / 'hlca_harmonization_outputs'
#     project_out_hlca.mkdir(parents=True, exist_ok=True)
#
#     organism_name, latest_release = idtrack.API(str(LOCAL_REPOSITORY)).resolve_organism('human')
#
#     harmonizer_hlca = idtrack.HarmonizeFeatures(
#         project_name='hlca',
#         data_h5ad_dict=data_h5ad_dict_hlca,
#         project_local_repository=str(project_out_hlca),
#         idtrack_local_repository=str(LOCAL_REPOSITORY),
#         target_ensembl_release=latest_release,
#         final_database='HGNC Symbol',
#         organism_name=organism_name,
#         graph_last_ensembl_release=latest_release,
#         verbose_level=2,
#     )
#
#     harmonizer_hlca

5.2.4 Advanced notes (recommended once you’re comfortable)

Choice of ``final_database``
- For integration, Ensembl IDs are often the most stable.
- For interpretability, HGNC symbols are convenient but can be ambiguous.
Handling 1→n mappings
- Decide whether you keep all targets, pick best, or drop ambiguous genes.
- The right choice depends on your downstream analysis.
Auditability
- IDTrack writes intermediate results (pickles).
- Treat those as provenance artifacts: keep them with your analysis.

5.2.5 HLCA experiments (extended): large-scale harmonization + QA

The minimal HLCA snippet above is intentionally conservative: it shows how to wire IDTrack into an HLCA-style run. This extended section turns it into a robust, paper-friendly workflow:

verify your inputs and environment variables
pin a snapshot release (for reproducibility)
run a small pilot first (2–3 datasets)
read diagnostics (1→0 and 1→n signals)
only then scale to the full collection

Why this matters: large runs are slow and generate many artifacts. A pilot run gives you early feedback and a clean audit trail.

5.2.5.1 Verify HLCA discovery (safe to run)

This cell does not download anything. It only checks whether your HLCA_BASE_PATH pointed to real .h5ad files.

Expected output: a boolean and (if available) a dataset count.

# HLCA presence check (uses the dataset dictionary built earlier)
HAS_HLCA = bool(data_h5ad_dict_hlca)
print('HLCA detected:', HAS_HLCA)
print('Datasets found:', len(data_h5ad_dict_hlca))
print('Example dataset names:', list(data_h5ad_dict_hlca)[:10])

5.2.5.2 Choose your snapshot release (reproducibility knob)

For exploratory work, using the latest release is fine. For a manuscript or production pipeline, pin the release explicitly.

Tip: keep the chosen release in your analysis metadata (or config) so you can reproduce results later.

# Create an API handle (reuses the same local repository)
api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))

organism, latest_release = api.resolve_organism("human")

# Option A (default): always use latest available release
TARGET_RELEASE = latest_release

# Option B (recommended for papers): pin to an explicit release
# TARGET_RELEASE = 110

organism, latest_release, TARGET_RELEASE

5.2.5.3 Ensure the human graph snapshot is available

If you already ran Part 3 for human, this should load quickly from cache. If not, this will download reference tables and build the snapshot (minutes, one-time).

Warning: building a snapshot requires network access the first time (unless your cache is already populated).

if HAS_HLCA:
    api.build_graph(organism_name=organism, snapshot_release=TARGET_RELEASE, calculate_caches=True)

    g = api.track.graph
    print('Graph organism:', g.graph.get('organism'))
    print('Graph snapshot release:', g.graph.get('ensembl_release'))
    print('Graph nodes:', g.number_of_nodes())
    print('Graph edges:', g.number_of_edges())
else:
    print('No HLCA datasets detected; skipping graph build.')

5.2.5.4 Build a pilot dataset dictionary

Start with a tiny pilot (2–3 datasets). The pilot run answers:

Do my files load?
Does conversion work for my typical identifiers?
Are there many 1→0 failures or 1→n ambiguities?

Only after the pilot looks sensible should you scale to the full HLCA collection.

def _pick_first_items(d: dict[str, str], n: int) -> dict[str, str]:
    keys = list(d)[:n]
    return {k: d[k] for k in keys}

data_h5ad_dict_hlca_all = data_h5ad_dict_hlca
data_h5ad_dict_hlca_pilot = _pick_first_items(data_h5ad_dict_hlca_all, n=3)

print('All HLCA datasets:', len(data_h5ad_dict_hlca_all))
print('Pilot datasets:', len(data_h5ad_dict_hlca_pilot))
list(data_h5ad_dict_hlca_pilot)[:10]

5.2.5.5 Pilot harmonization run

This constructs a harmonizer object. It will run conversions and write intermediate artifacts under a dedicated project folder.

Expected output: a HarmonizeFeatures object (or a helpful message if HLCA data was not detected).

# Keep experiment outputs separate from the global cache
project_out_hlca = (LOCAL_REPOSITORY / 'hlca_experiments').resolve()
project_out_hlca.mkdir(parents=True, exist_ok=True)

# Choose your final namespace
FINAL_DATABASE = 'HGNC Symbol'  # try also: 'ensembl_gene'

if not data_h5ad_dict_hlca_pilot:
    print('No pilot datasets found; set HLCA_BASE_PATH and re-run from section 5.2.2.')
    harmonizer_hlca_pilot = None
else:
    harmonizer_hlca_pilot = idtrack.HarmonizeFeatures(
        project_name='hlca_pilot',
        data_h5ad_dict=data_h5ad_dict_hlca_pilot,
        project_local_repository=str(project_out_hlca),
        idtrack_local_repository=str(LOCAL_REPOSITORY),
        target_ensembl_release=TARGET_RELEASE,
        final_database=FINAL_DATABASE,
        organism_name=organism,
        graph_last_ensembl_release=TARGET_RELEASE,
        verbose_level=2,
    )

harmonizer_hlca_pilot

5.2.5.6 Read the diagnostics (this is the most important part)

Treat diagnostics as first-class results. They tell you whether harmonization is trustworthy for your data.

Conversion failures (1→0): identifiers that could not be placed on the graph
Ambiguity (1→n): identifiers that map to multiple plausible targets

Tip: store these diagnostics alongside your paper or pipeline outputs. They explain why features were dropped or changed.

if harmonizer_hlca_pilot is None:
    print('Pilot harmonizer not created; skipping diagnostics.')
else:
    print('Datasets in pilot:', len(harmonizer_hlca_pilot.data_h5ad_dict))

    print('\n--- Conversion failures (1→0) ---')
    print('Count:', len(harmonizer_hlca_pilot.conversion_failed_identifiers))
    print('Example (first 25):', sorted(list(harmonizer_hlca_pilot.conversion_failed_identifiers))[:25])

    print('\n--- Failed but consistent (kept) ---')
    print('Count:', len(harmonizer_hlca_pilot.conversion_failed_but_consistent_identifiers))
    print('Example (first 25):', sorted(list(harmonizer_hlca_pilot.conversion_failed_but_consistent_identifiers))[:25])

    print('\n--- Collapsed ambiguous mappings (1→n; multiple Ensembl candidates) ---')
    print('Count:', len(harmonizer_hlca_pilot.multiple_ensembl_dict))
    some_keys = list(harmonizer_hlca_pilot.multiple_ensembl_dict)[:10]
    print('Example keys:', some_keys)

5.2.5.7 “Union vs intersect” experiment (feature retention)

This is a quick way to understand how much “feature width” you are adding when you merge many datasets. Start with the pilot: it is fast and gives you the same qualitative signal.

if harmonizer_hlca_pilot is None:
    print('Pilot harmonizer not created; skipping merge step.')
else:
    hlca_pilot_union = harmonizer_hlca_pilot.unify_multiple_anndatas(mode="union")
    hlca_pilot_intersect = harmonizer_hlca_pilot.unify_multiple_anndatas(mode="intersect")

    print('Union shape (cells x genes):', hlca_pilot_union.shape)
    print('Intersect shape (cells x genes):', hlca_pilot_intersect.shape)

5.2.5.8 Scale up to full HLCA (template)

Once the pilot looks good, you can scale up by switching the dataset dictionary. The code below is commented out by default because a full HLCA run can take a long time and use significant disk.

Warning: keep the output folder and intermediate artifacts. They are your provenance record.

# Full HLCA run template (commented out by default)
#
# if data_h5ad_dict_hlca_all:
#     project_out_full = (LOCAL_REPOSITORY / 'hlca_full_run').resolve()
#     project_out_full.mkdir(parents=True, exist_ok=True)
#
#     harmonizer_full = idtrack.HarmonizeFeatures(
#         project_name='hlca_full',
#         data_h5ad_dict=data_h5ad_dict_hlca_all,
#         project_local_repository=str(project_out_full),
#         idtrack_local_repository=str(LOCAL_REPOSITORY),
#         target_ensembl_release=TARGET_RELEASE,
#         final_database=FINAL_DATABASE,
#         organism_name=organism,
#         graph_last_ensembl_release=TARGET_RELEASE,
#         verbose_level=2,
#     )
#
#     hlca_union = harmonizer_full.unify_multiple_anndatas(mode="union")
#     hlca_union.write_h5ad(project_out_full / 'hlca_union_harmonized.h5ad')
#
#     # Optional: also keep an intersect version
#     # hlca_intersect = harmonizer_full.unify_multiple_anndatas(mode="intersect")
#     # hlca_intersect.write_h5ad(project_out_full / 'hlca_intersect_harmonized.h5ad')

5.3 — Legacy Data Rescue

Legacy datasets often contain a mix of:

older Ensembl IDs (from older releases)
gene symbols (which may have changed)
project-specific aliases

A safe, reproducible rescue workflow is:

Pick a snapshot boundary (the newest release you allow).
Convert into a stable namespace (usually Ensembl gene IDs) at that snapshot.
Inspect failure + ambiguity rates.
Only then convert into presentation-friendly labels (HGNC symbols) if needed.

The next cell demonstrates a small, realistic “mixed identifier” rescue using the human API.

Tip: Legacy data rescue often involves older releases and older assemblies. The default human snapshot is multi-assembly and can map GRCh37-derived identifiers into your chosen snapshot/primary assembly. Only rebuild with genome_assembly=37 if your downstream reference is GRCh37 and you want outputs anchored to that build (see Part 3).

# Legacy rescue demo (human)

api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
api.configure_logger()

organism, latest_release = api.resolve_organism('human')
api.build_graph(organism_name=organism, snapshot_release=latest_release, calculate_caches=False)

legacy_ids = [
    'TP53',                # HGNC symbol (common)
    'P53',                 # older/alias-like symbol (may or may not resolve)
    'ENSG00000141510',     # Ensembl gene ID
    'ENSG00000139618',     # BRCA2
    'BRCA1',               # symbol
    'NOT_A_REAL_GENE',     # should become a clean 1→0 example
]

# Convert into Ensembl gene IDs at the snapshot boundary (stable backbone)
results = api.convert_identifier_multiple(legacy_ids, to_release=latest_release, final_database=None, strategy='all', verbose=False)
summary = api.classify_multiple_conversion(results)
api.print_binned_conversion(summary)

# Show a couple of raw results for inspection
results[:3]