Part 0 — Conceptual Foundation

Last updated: 2026-01-08

This section builds the mental model you need before touching code. It is written for wet-lab researchers and bioinformaticians who want reliable, reproducible identifier conversion across Ensembl releases and external databases.

Learning objectives

  • Understand what identifier drift is and why it breaks analyses.

  • Understand the two axes: time (Ensembl release) and space (namespace/database).

  • Understand what a snapshot boundary is and why it makes results reproducible.

  • Know how to interpret IDTrack outcomes: 1→0, 1→1, 1→n.

Tip: If you remember only one sentence, make it this one: IDTrack does time travel on the Ensembl backbone, then optional hops into external namespaces, all inside a bounded snapshot so results are auditable and reproducible.

0.1 — What is IDTrack?

In modern biology you rarely analyze a single dataset in isolation. You compare:

  • older vs newer studies (different Ensembl releases)

  • different technologies (bulk vs scRNA-seq)

  • different annotation sources (Ensembl IDs vs HGNC symbols vs RefSeq vs UniProt)

The challenge: identifiers change (this is called identifier drift).

  • Ensembl stable IDs can be retired, merged, split, or re-assigned across releases.

  • Gene symbols are convenient but can be ambiguous (one symbol can refer to multiple genes) and can change over time.

  • External databases provide cross-references, but they overlap and can introduce many-to-many ambiguity.

IDTrack solves this by building a time-aware graph of identifier relationships and answering: “Given this ID, what does it correspond to at release X (and optionally in database Y), inside a reproducible snapshot?”
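To make identifier drift concrete, here is a minimal sketch that does not use IDTrack at all. The gene IDs are invented placeholders, and the "history" in the comments is hypothetical. It only shows what happens when two annotations are matched by exact ID string.

```python
# Toy illustration of identifier drift; every ID below is made up.
# Hypothetical history between the two annotations:
#   GENE_B was retired, GENE_C was re-assigned to GENE_C2,
#   GENE_D was split into GENE_E and GENE_F.
old_release_ids = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}   # older study
new_release_ids = {"GENE_A", "GENE_C2", "GENE_E", "GENE_F"}  # newer study

# Naive matching by exact ID string silently drops anything that drifted.
shared = old_release_ids & new_release_ids
lost = old_release_ids - new_release_ids

print("matched by name:", sorted(shared))
print("silently lost:  ", sorted(lost))
```

Only one of the four genes survives the naive join, even though three of them still have descendants in the newer annotation. Recovering those descendants requires the release-to-release history, which is the information a time-aware graph carries.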

When should you use IDTrack (vs alternatives)?

Use IDTrack when you need:

  • Reproducibility across time: “map everything to Ensembl release X and keep that fixed for the project”.

  • Auditability: you want to inspect why a mapping happened (or why it didn’t).

  • Mixed namespaces: your inputs are a mix of Ensembl IDs, symbols, UniProt/RefSeq/Entrez, etc.

  • Honest ambiguity handling: you prefer explicit 1→n results to silent coercion.

You might not need IDTrack if you only need:

  • a quick “latest release only” mapping for a handful of genes, with no requirement for snapshot reproducibility.

0.2 — Architecture & Workflow Overview

IDTrack is easiest to understand with two axes:

  1. Time axis: Ensembl releases (e.g. 90, 100, 110, 115)

  2. Space axis: identifier namespaces (Ensembl gene IDs, gene symbols, UniProt accessions, RefSeq IDs, …)

Most real tasks are:

  • Time travel on the backbone (move an Ensembl identifier from one release to another)

  • then optionally switch namespaces (e.g. Ensembl → HGNC Symbol, or Ensembl → UniProt)

A useful picture (the conversion pipeline):

Your ID (some database, some year)
        |
        v
  [Normalize + match to graph node]
        |
        v
  [Time travel across Ensembl releases]
        |
        +--> (optional) [external hop(s) if needed]
        |
        v
  [Arrive at target release]
        |
        v
  (optional) [Convert to requested external database]

Why this matters: If you say “convert to release 115”, you are choosing a point on the time axis. If you say “give me HGNC symbols”, you are choosing a coordinate on the space axis. The rest is controlled path-finding inside a reproducible snapshot.
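The two axes can be sketched with plain Python dictionaries. This is a toy model, not IDTrack's implementation: the IDs, releases, and symbols are invented, the names `HISTORY`, `SYMBOLS_AT_110`, `time_travel`, and `to_symbols` are hypothetical, and the walk only moves forward in time without any path scoring.

```python
# Toy backbone; every ID, release, and symbol below is made up for illustration.
RELEASES = [90, 100, 110]

# Time axis: (identifier, release) -> identifiers it becomes in the next release
HISTORY = {
    ("G1", 90): ["G1"],
    ("G1", 100): ["G1a", "G1b"],  # split
    ("G2", 90): ["G2"],
    ("G2", 100): ["G2"],
    ("G3", 90): [],               # retired
}

# Space axis at the target release: backbone ID -> symbols
SYMBOLS_AT_110 = {"G1a": ["SYM1"], "G1b": ["SYM1L"], "G2": ["SYM2"]}


def time_travel(identifier, source, target):
    """Walk forward along the time axis, one release step at a time."""
    current = {identifier}
    start, stop = RELEASES.index(source), RELEASES.index(target)
    for release in RELEASES[start:stop]:
        current = {nxt for ident in current for nxt in HISTORY.get((ident, release), [])}
    return sorted(current)


def to_symbols(identifiers):
    """Switch namespace at the target release (the optional final hop)."""
    return sorted({sym for ident in identifiers for sym in SYMBOLS_AT_110.get(ident, [])})


for gene in ["G1", "G2", "G3"]:
    arrived = time_travel(gene, 90, 110)
    print(gene, "->", arrived, "->", to_symbols(arrived))
```

The order matters in this sketch: the identifier first reaches the target release on the backbone, and only then is it translated into the requested namespace.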

0.2.1 — Graph Snapshots and the Snapshot Boundary

IDTrack builds a graph where:

  • nodes are identifiers (Ensembl IDs, base IDs, external IDs)

  • edges encode relationships (release-to-release history, gene↔transcript↔protein links, external cross-references)

A key feature is the snapshot boundary (also called the snapshot release):

  • you choose a maximum Ensembl release (e.g. 115)

  • IDTrack ignores anything newer

This matters because it makes results reproducible:

  • same snapshot boundary + same external YAML → same graph → same conversions

A practical metaphor: you are doing “time travel” in a museum that you freeze in time. The outside world (newer releases) keeps changing, but your museum exhibits do not.
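The museum metaphor can be sketched as a simple filter. This is an illustrative toy, assuming each edge is a tuple of (source ID, target ID, release in which the relationship appears). The function `build_snapshot` and all edges are hypothetical and do not reflect how IDTrack stores its graph.

```python
def build_snapshot(edges, boundary):
    """Keep only edges whose release is at or below the snapshot boundary."""
    return sorted(edge for edge in edges if edge[2] <= boundary)


# What upstream looks like today (made-up edges).
edges_today = [("G1", "G1", 100), ("G1", "G1a", 110), ("G1", "G1b", 110)]

# What upstream might look like after a newer release is published.
edges_next_year = edges_today + [("G1a", "G1c", 120)]

snap_a = build_snapshot(edges_today, boundary=110)
snap_b = build_snapshot(edges_next_year, boundary=110)

# Same boundary gives the same graph, regardless of newer upstream changes.
print(snap_a == snap_b)
```

Raising the boundary to 120 in the second call would let the new edge in, which is exactly why the boundary should be recorded alongside the results.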

0.2.2 — Backbone vs External Namespaces (and the External YAML)

IDTrack does not automatically include every external database Ensembl knows about. Instead, you explicitly opt in via a small YAML file (the external YAML).

Why? Because including everything would:

  • make the graph huge

  • slow down path-finding

  • increase ambiguity (many-to-many relationships)

So the workflow is:

  1. generate a template YAML for an organism

  2. enable a curated set of external databases (set Include: true)

  3. build the graph snapshot

Tip: Think of Ensembl IDs as the “backbone” (the timeline). External databases are “bridges” you can enable when you need them.
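The opt-in idea can be sketched with a plain dictionary standing in for the YAML file. The layout below is an assumption for illustration only: the real template generated by IDTrack may be nested differently, and the database names are placeholders. Only the `Include` flag is taken from the workflow described above.

```python
# Hypothetical stand-in for a generated template: everything starts disabled.
template = {
    "HGNC Symbol": {"Include": False},
    "UniProtKB/Swiss-Prot": {"Include": False},
    "RefSeq_mRNA": {"Include": False},
}

# The curated set of bridges this project actually needs.
wanted = {"HGNC Symbol", "UniProtKB/Swiss-Prot"}

# Opt in explicitly; anything not listed stays out of the graph.
modified = {name: {"Include": name in wanted} for name in template}

enabled = sorted(name for name, options in modified.items() if options["Include"])
print("enabled external databases:", enabled)
```

Keeping the enabled set small is a deliberate trade-off: fewer bridges mean a smaller graph and fewer many-to-many paths to reason about.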

0.2.3 — Understanding Mapping Outcomes (1→0, 1→1, 1→n)

When you convert one identifier, three outcomes are common:

  • 1→0: nothing matches (unknown ID, or no path exists in the snapshot)

  • 1→1: clean conversion

  • 1→n: ambiguous conversion (splits, merged history, promiscuous external IDs, symbols, …)

IDTrack will tell you which case you are in. This is a feature, not a failure: it prevents silent mistakes.
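The three outcomes are easy to tally once each conversion is viewed as a list of targets. The sketch below is independent of IDTrack's actual return types: `classify_outcome` is a hypothetical helper, and the toy results are invented.

```python
from collections import Counter


def classify_outcome(targets):
    """Label a conversion by how many distinct targets it produced."""
    unique = set(targets)
    if not unique:
        return "1→0"
    return "1→1" if len(unique) == 1 else "1→n"


# Made-up conversion results: source ID -> list of target IDs.
toy_results = {"G1": ["G1a", "G1b"], "G2": ["G2"], "G3": []}

labels = {source: classify_outcome(targets) for source, targets in toy_results.items()}
summary = Counter(labels.values())

print(labels)
print(dict(summary))
```

A summary like this is a useful first sanity check on a batch: a large share of 1→0 or 1→n results usually deserves a closer look before downstream analysis.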

0.2.4 — Genome assemblies are part of the mapping

A genome assembly (also called a “build”) is the reference coordinate system used to define genes and transcripts. Common examples:

  • human: GRCh38 and GRCh37

  • mouse: GRCm39 and GRCm38

  • pig: Sscrofa11.1 and Sscrofa10.2

Why this matters in practice: two datasets can use the “same kind of identifiers” but still be anchored to different builds because they were annotated with different GTFs or reference packages. That is a very common atlas-building situation.

IDTrack treats assemblies as a first-class part of the graph. A snapshot is not only “release-bounded in time” — it can also be multi-assembly:

  • nodes/edges carry assembly context where needed

  • the path-finder can move across releases and across assemblies to reach a unified target space

  • some external databases are only present (or only reliable) on specific assembly/release combinations, so having multiple assemblies available can increase connectivity

In most projects you choose a single target (a snapshot boundary + a primary assembly for output), and use IDTrack to bring mixed inputs into that target in a reproducible way.

Human is the main case where two assemblies are actively supported in parallel (GRCh38 + GRCh37). Mouse and pig are generally clean-handoff species (one maintained assembly per release), but older assemblies still matter for legacy datasets.

0.2.5 — Key Abstractions You Will Use

You do not need to be a developer to use these, but knowing the names helps you navigate the tutorials:

  1. ``idtrack.API`` — the high-level entry point

    • resolves organism names

    • builds/loads the graph snapshot

    • provides convert_identifier(...) and batch helpers

  2. ``idtrack.DatabaseManager`` — data access + caching

    • downloads tables from Ensembl (live MySQL when reachable; otherwise HTTPS/FTP MySQL dumps)

    • manages your external YAML (*_externals_modified.yml)

  3. ``idtrack.Track`` — the conversion engine

    • performs path-finding and scoring in the graph

    • you usually access it as api.track

  4. ``idtrack.HarmonizeFeatures`` — multi-dataset harmonization

    • converts gene identifiers across multiple .h5ad datasets

    • helps build a unified integrated dataset

  5. ``idtrack._external_mappers`` (optional) — orthologs and external mapping services

    • advanced features that require extra dependencies

0.3 — Assumptions & Limitations

IDTrack is deliberately opinionated about reproducibility and transparency. That comes with assumptions and limits.

IDTrack assumes you have:

  • a writable local repository folder (cache + graphs + YAML configs)

  • network access the first time you build a graph (REST + HTTPS/FTP dumps; later runs reuse the cache)

  • enough disk space (graphs and cached tables can be large)

IDTrack can:

  • time-travel identifiers across Ensembl releases inside a chosen snapshot boundary

  • optionally hop into external namespaces you explicitly enabled

  • surface ambiguity (1→n) instead of hiding it

IDTrack cannot magically fix upstream ambiguity:

  • if Ensembl history says a gene split into multiple descendants, the correct answer may be 1→n

  • if a symbol is reused across multiple genes, symbols can stay ambiguous

  • mapping quality is bounded by upstream metadata quality

Why this matters: IDTrack prefers to be “honestly ambiguous” rather than silently wrong. If you need a single answer from a 1→n case, you should make that choice explicitly (and record how you did it).
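One way to make that choice explicit is to keep an audit record next to every resolved mapping. The sketch below is not part of IDTrack: the `resolve` helper, the rule names, and the IDs are all hypothetical, and the rules themselves are arbitrary examples rather than recommendations.

```python
def resolve(source_id, candidates, rule="first_alphabetical"):
    """Pick at most one target from a conversion result and record how."""
    unique = sorted(set(candidates))
    if rule == "first_alphabetical":
        chosen = unique[0] if unique else None
    elif rule == "drop_ambiguous":
        chosen = unique[0] if len(unique) == 1 else None
    else:
        raise ValueError(f"unknown rule: {rule!r}")
    # The record keeps all candidates, so the decision can be audited later.
    return {"source": source_id, "candidates": unique, "rule": rule, "chosen": chosen}


record_a = resolve("G1", ["G1b", "G1a"])
record_b = resolve("G1", ["G1b", "G1a"], rule="drop_ambiguous")

print(record_a)
print(record_b)
```

Storing the rule and the full candidate list means a later reader can see both what was picked and what was discarded.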

# 1) Minimal setup cell (safe to run in any notebook)
from __future__ import annotations

import os
from pathlib import Path

import idtrack

LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)

api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
api.configure_logger()

print('IDTrack version:', idtrack.__version__)
print('Local repository:', LOCAL_REPOSITORY)

IDTrack version: 0.0.5
Local repository: /Users/kemalinecik/git_nosync/master_idtrack/idtrack/docs/_notebooks/idtrack_cache

The cell above does three things:

  1. chooses a cache directory (your local repository)

  2. creates the high-level idtrack.API object

  3. enables logging so you can see progress (downloads, caching, graph build)

# 2) Resolve organism names the way IDTrack expects
# You can use common names ('human'), scientific names, taxon IDs, or Ensembl-style names.

for query in ['human', 'mus musculus', 'sus scrofa']:
    formal_name, latest_release = api.resolve_organism(query)
    print(f'{query!r} -> {formal_name!r} (latest Ensembl release: {latest_release})')

2026-01-09 21:31:06 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.
2026-01-09 21:31:07 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.
'human' -> 'homo_sapiens' (latest Ensembl release: 115)
2026-01-09 21:31:07 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.
'mus musculus' -> 'mus_musculus' (latest Ensembl release: 115)
'sus scrofa' -> 'sus_scrofa' (latest Ensembl release: 115)

You will see outputs like:

  • 'human' -> 'homo_sapiens'

  • 'mus musculus' -> 'mus_musculus'

  • 'sus scrofa' -> 'sus_scrofa'

From now on, the tutorials use the formal Ensembl names (snake_case).

0.4 — Tutorial Roadmap (Recommended Order)

  1. Install + verify (Part 1)

  2. Prepare external YAMLs (Part 2)

  3. Build graph snapshots (Part 3)

  4. Run self-tests / sanity checks (initialization tests)

  5. Human API deep dive (Part 4)

  6. Harmonization tutorial (HLCA-style) (Part 5)

  7. HLCA experiments case study (Part 5, advanced)

  8. Cross-species humanization (Part 6, advanced)

  9. Advanced topics (Part 7)

0.5 — Quick Troubleshooting (Fast Wins)

If something fails, check these first:

  1. Permissions: is your local repository writable?

  2. Network: first-time runs need to reach Ensembl services (REST + HTTPS/FTP dumps; MySQL is optional).

  3. Disk space: building graphs can use multiple GB.

  4. Snapshot boundary + assemblies: the snapshot release must exist for the organism and the chosen primary assembly. Older assemblies have different release coverage; a multi-assembly snapshot can still include them when they exist within the release window.

  5. External YAML: for mouse/pig you must create a *_externals_modified.yml in your local repository.
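Some of these checks can be scripted with the standard library alone. The sketch below covers permissions, disk space, and the presence of an external YAML. The `preflight` helper is hypothetical, the 5 GB threshold is an assumed placeholder, and the demo runs against a throwaway temporary directory rather than a real local repository.

```python
import shutil
import tempfile
from pathlib import Path


def preflight(repo, min_free_gb=5.0):
    """Check that a local repository is writable and has free disk space."""
    repo = Path(repo)
    repo.mkdir(parents=True, exist_ok=True)
    try:
        # Creating a temporary file proves write permission without leaving debris.
        with tempfile.TemporaryFile(dir=repo):
            writable = True
    except OSError:
        writable = False
    free_gb = shutil.disk_usage(repo).free / 1e9
    return {
        "writable": writable,
        "free_gb": round(free_gb, 1),
        "enough_space": free_gb >= min_free_gb,  # threshold is an assumption
        "has_external_yaml": any(repo.glob("*_externals_modified.yml")),
    }


# Demo on a temporary directory; point this at your own local repository instead.
report = preflight(tempfile.mkdtemp())
print(report)
```

Network reachability and snapshot/assembly coverage are not checked here, because they depend on Ensembl services and on the organism you are building for.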

Tip: For a deeper checklist (including diagnostics helpers), see Part 7.3 in 07_advanced_topics.ipynb.