Part 0 — Conceptual Foundation
Last updated: 2026-01-08
This section builds the mental model you need before touching code. It is written for wet-lab researchers and bioinformaticians who want reliable, reproducible identifier conversion across Ensembl releases and external databases.
Learning objectives
Understand what identifier drift is and why it breaks analyses.
Understand the two axes: time (Ensembl release) and space (namespace/database).
Understand what a snapshot boundary is and why it makes results reproducible.
Know how to interpret IDTrack outcomes: 1→0, 1→1, 1→n.
Tip: If you remember one sentence: IDTrack does time travel on the Ensembl backbone, then optional hops into external namespaces, all inside a bounded snapshot so results are auditable and reproducible.
0.1 — What is IDTrack?
In modern biology you rarely analyze a single dataset in isolation. You compare:
older vs newer studies (different Ensembl releases)
different technologies (bulk vs scRNA-seq)
different annotation sources (Ensembl IDs vs HGNC symbols vs RefSeq vs UniProt)
The challenge: identifiers change (this is called identifier drift).
Ensembl stable IDs can be retired, merged, split, or re-assigned across releases.
Gene symbols are convenient but can be ambiguous (one symbol can refer to multiple genes) and can change over time.
External databases provide cross-references, but they overlap and can introduce many-to-many ambiguity.
IDTrack solves this by building a time-aware graph of identifier relationships and answering: “Given this ID, what does it correspond to at release X (and optionally in database Y), inside a reproducible snapshot?”
When should you use IDTrack (vs alternatives)?
Use IDTrack when you need:
Reproducibility across time: “map everything to Ensembl release X and keep that fixed for the project”.
Auditability: you want to inspect why a mapping happened (or why it didn’t).
Mixed namespaces: your inputs are a mix of Ensembl IDs, symbols, UniProt/RefSeq/Entrez, etc.
Honest ambiguity handling: you prefer explicit 1→n results to silent coercion.
You might not need IDTrack if you only need:
a quick “latest release only” mapping for a handful of genes and you do not care about snapshot reproducibility.
0.2 — Architecture & Workflow Overview
IDTrack is easiest to understand with two axes:
Time axis → Ensembl releases (e.g. 90, 100, 110, 115)
Space axis → identifier namespaces (Ensembl gene IDs, gene symbols, UniProt accessions, RefSeq IDs, …)
Most real tasks are:
Time travel on the backbone (move an Ensembl identifier from one release to another)
then optionally switch namespaces (e.g. Ensembl → HGNC Symbol, or Ensembl → UniProt)
A useful picture (the conversion pipeline):
Your ID (some database, some year)
|
v
[Normalize + match to graph node]
|
v
[Time travel across Ensembl releases]
|
+--> (optional) [external hop(s) if needed]
|
v
[Arrive at target release]
|
v
(optional) [Convert to requested external database]
Why this matters: If you say “convert to release 115”, you are choosing a point on the time axis. If you say “give me HGNC symbols”, you are choosing a coordinate on the space axis. The rest is controlled path-finding inside a reproducible snapshot.
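The two-step idea (time travel first, namespace switch second) can be sketched with plain dictionaries. This is a toy model, not IDTrack's implementation: the identifiers, releases, links, and the names `HISTORY`, `XREFS`, `time_travel`, and `convert` are all invented for illustration, and the sketch only travels forward in time.

```python
# Toy data only: identifiers, releases and links are invented for illustration.
# Backbone history: (id, release) -> successors in the next available release.
HISTORY = {
    ("GENE_A", 100): [("GENE_A", 110)],
    ("GENE_A", 110): [("GENE_B", 115), ("GENE_C", 115)],  # a split
    ("GENE_X", 100): [("GENE_X", 110)],
    ("GENE_X", 110): [("GENE_X", 115)],
}
# Cross-references at the target release: backbone id -> external ids.
XREFS = {"GENE_B": ["SYM1"], "GENE_C": ["SYM2"], "GENE_X": ["SYMX"]}


def time_travel(identifier, from_release, to_release):
    """Follow backbone edges forward until the target release is reached."""
    frontier = {(identifier, from_release)}
    arrived = set()
    while frontier:
        node = frontier.pop()
        if node[1] == to_release:
            arrived.add(node[0])
            continue
        # Nodes without successors are dead ends (e.g. retired identifiers).
        frontier.update(HISTORY.get(node, []))
    return sorted(arrived)


def convert(identifier, from_release, to_release, external=False):
    """Time axis first, then (optionally) the space axis."""
    targets = time_travel(identifier, from_release, to_release)
    if not external:
        return targets
    return sorted({x for t in targets for x in XREFS.get(t, [])})


print(convert("GENE_X", 100, 115))                 # stays on the backbone
print(convert("GENE_A", 100, 115, external=True))  # split, then namespace hop
```

Keeping the two steps separate makes it easy to see where ambiguity enters: here it comes from the split on the time axis, not from the cross-references.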
0.2.1 — Graph Snapshots and the Snapshot Boundary
IDTrack builds a graph where:
nodes are identifiers (Ensembl IDs, base IDs, external IDs)
edges encode relationships (release-to-release history, gene↔transcript↔protein links, external cross-references)
A key feature is the snapshot boundary (also called the snapshot release):
you choose a maximum Ensembl release (e.g. 115)
IDTrack ignores anything newer
This matters because it makes results reproducible:
same snapshot boundary + same external YAML → same graph → same conversions
A practical metaphor: you are doing “time travel” in a museum that you freeze in time. The outside world (newer releases) keeps changing, but your museum exhibits do not.
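A minimal sketch of why a boundary gives reproducibility, assuming a toy edge list in which every relationship is tagged with the release that recorded it. The edge data and the `build_snapshot` helper are hypothetical and do not reflect how IDTrack stores its graph.

```python
# Toy edges: (source_id, target_id, release in which the link was recorded).
EDGES = [
    ("GENE_A", "GENE_A", 100),
    ("GENE_A", "GENE_B", 110),
    ("GENE_B", "GENE_D", 116),  # newer than the boundary chosen below
]


def build_snapshot(edges, boundary):
    """Keep only edges recorded at or before the snapshot boundary."""
    return sorted(edge for edge in edges if edge[2] <= boundary)


snapshot_115 = build_snapshot(EDGES, boundary=115)
print(snapshot_115)
```

Because the filter depends only on the inputs and the boundary, rebuilding later (even after release 116 data becomes available) yields the same frozen graph.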
0.2.2 — Backbone vs External Namespaces (and the External YAML)
IDTrack does not automatically include every external database Ensembl knows about. Instead, you explicitly opt in via a small YAML file (the external YAML).
Why? Because including everything would:
make the graph huge
slow down path-finding
increase ambiguity (many-to-many relationships)
So the workflow is:
generate a template YAML for an organism
enable a curated set of external databases (set ``Include: true``)
build the graph snapshot
Tip: Think of Ensembl IDs as the “backbone” (the timeline). External databases are “bridges” you can enable when you need them.
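The opt-in step can be pictured as flipping ``Include`` flags on a template. The sketch below uses an in-memory dictionary as a stand-in; the flat layout, the database labels, and the `enable` helper are assumptions made for illustration, and the real template's structure may differ. Only the ``Include`` key comes from the text above.

```python
# Hypothetical stand-in for a generated template; the real YAML layout may differ.
template = {
    "HGNC Symbol": {"Include": False},
    "UniProtKB/Swiss-Prot": {"Include": False},
    "RefSeq mRNA": {"Include": False},
    "Some Niche DB": {"Include": False},
}


def enable(config, wanted):
    """Return a copy with Include set to True only for the requested databases."""
    unknown = sorted(set(wanted) - set(config))
    if unknown:
        # Failing loudly avoids silently building a graph without a bridge.
        raise KeyError(f"not in template: {unknown}")
    return {name: {**opts, "Include": name in wanted} for name, opts in config.items()}


curated = enable(template, {"HGNC Symbol", "UniProtKB/Swiss-Prot"})
print([name for name, opts in curated.items() if opts["Include"]])
```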
0.2.3 — Understanding Mapping Outcomes (1→0, 1→1, 1→n)
When you convert one identifier, three outcomes are common:
1→0: nothing matches (unknown ID, or no path exists in the snapshot)
1→1: clean conversion
1→n: ambiguous conversion (splits, merged history, promiscuous external IDs, symbols, …)
IDTrack will tell you which case you are in. This is a feature, not a failure: it prevents silent mistakes.
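When you post-process conversion results yourself, the three cases reduce to counting distinct targets. A small sketch, assuming each result has already been reduced to a plain list of target identifiers (IDTrack's actual return objects are richer; `classify_outcome` and the toy results are hypothetical):

```python
from collections import Counter


def classify_outcome(targets):
    """Label a conversion by the number of distinct target identifiers."""
    n = len(set(targets))
    if n == 0:
        return "1→0"
    if n == 1:
        return "1→1"
    return "1→n"


# Toy results: source id -> list of target ids.
results = {
    "GENE_X": ["GENE_X"],
    "GENE_A": ["GENE_B", "GENE_C"],
    "GENE_Q": [],
}
summary = Counter(classify_outcome(targets) for targets in results.values())
print(dict(summary))
```

Tallying the cases per dataset is a quick sanity check: an unexpectedly high share of 1→0 often points to a wrong source release or namespace.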
0.2.4 — Genome assemblies are part of the mapping
A genome assembly (also called a “build”) is the reference coordinate system used to define genes and transcripts. Common examples:
human: GRCh38 and GRCh37
mouse: GRCm39 and GRCm38
pig: Sscrofa11.1 and Sscrofa10.2
Why this matters in practice: two datasets can use the “same kind of identifiers” but still be anchored to different builds because they were annotated with different GTFs or reference packages. That is a very common atlas-building situation.
IDTrack treats assemblies as a first-class part of the graph. A snapshot is not only “release-bounded in time” — it can also be multi-assembly:
nodes/edges carry assembly context where needed
the path-finder can move across releases and across assemblies to reach a unified target space
some external databases are only present (or only reliable) on specific assembly/release combinations, so having multiple assemblies available can increase connectivity
In most projects you choose a single target (a snapshot boundary + a primary assembly for output), and use IDTrack to bring mixed inputs into that target in a reproducible way.
Human is the main case where two assemblies are actively supported in parallel (GRCh38 + GRCh37). Mouse and pig are generally clean-handoff species (one maintained assembly per release), but older assemblies still matter for legacy datasets.
0.2.5 — Key Abstractions You Will Use
You do not need to be a developer to use these, but knowing the names helps you navigate the tutorials:
``idtrack.API`` — the high-level entry point
resolves organism names
builds/loads the graph snapshot
provides ``convert_identifier(...)`` and batch helpers
``idtrack.DatabaseManager`` — data access + caching
downloads tables from Ensembl (live MySQL when reachable; otherwise HTTPS/FTP MySQL dumps)
manages your external YAML (``*_externals_modified.yml``)
``idtrack.Track`` — the conversion engine
performs path-finding and scoring in the graph
you usually access it as ``api.track``
``idtrack.HarmonizeFeatures`` — multi-dataset harmonization
converts gene identifiers across multiple ``.h5ad`` datasets
helps build a unified integrated dataset
``idtrack._external_mappers`` (optional) — orthologs and external mapping services
advanced features that require extra dependencies
0.3 — Assumptions & Limitations
IDTrack is deliberately opinionated about reproducibility and transparency. That comes with assumptions and limits.
IDTrack assumes you have:
a writable local repository folder (cache + graphs + YAML configs)
network access the first time you build a graph (REST + HTTPS/FTP dumps; later runs reuse the cache)
enough disk space (graphs and cached tables can be large)
IDTrack can:
time-travel identifiers across Ensembl releases inside a chosen snapshot boundary
optionally hop into external namespaces you explicitly enabled
surface ambiguity (1→n) instead of hiding it
IDTrack cannot magically fix upstream ambiguity:
if Ensembl history says a gene split into multiple descendants, the correct answer may be 1→n
if a symbol is reused across multiple genes, symbols can stay ambiguous
mapping quality is bounded by upstream metadata quality
Why this matters: IDTrack prefers to be “honestly ambiguous” rather than silently wrong. If you need a single answer from a 1→n case, you should make that choice explicitly (and record how you did it).
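One way to make that choice explicit is to apply a named rule and keep a record of it. The sketch below is an illustration only: `resolve_one_to_n`, the "preferred list" rule, and the identifiers are hypothetical, and your project may need a different rule entirely.

```python
import json


def resolve_one_to_n(source_id, candidates, preferred):
    """Pick one candidate by an explicit rule and return an audit record."""
    ordered = sorted(set(candidates))
    chosen = next((c for c in ordered if c in preferred), None)
    record = {
        "source": source_id,
        "candidates": ordered,
        "chosen": chosen,
        "rule": (
            "first candidate (sorted) present in the preferred list"
            if chosen is not None
            else "no rule applied; left ambiguous"
        ),
    }
    return chosen, record


chosen, record = resolve_one_to_n("GENE_A", ["GENE_C", "GENE_B"], preferred={"GENE_C"})
print(json.dumps(record, indent=2))  # store this alongside your results
```

If no candidate matches the rule, the function returns `None` instead of guessing, so the ambiguity stays visible downstream.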
# 1) Minimal setup cell (safe to run in any notebook)
from __future__ import annotations
import os
from pathlib import Path
import idtrack
LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)
api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
api.configure_logger()
print('IDTrack version:', idtrack.__version__)
print('Local repository:', LOCAL_REPOSITORY)
IDTrack version: 0.0.5
Local repository: /Users/kemalinecik/git_nosync/master_idtrack/idtrack/docs/_notebooks/idtrack_cache
The cell above does three things:
chooses a cache directory (your local repository)
creates the high-level ``idtrack.API`` object
enables logging so you can see progress (downloads, caching, graph build)
# 2) Resolve organism names the way IDTrack expects
# You can use common names ('human'), scientific names, taxon IDs, or Ensembl-style names.
for query in ['human', 'mus musculus', 'sus scrofa']:
    formal_name, latest_release = api.resolve_organism(query)
    print(f'{query!r} -> {formal_name!r} (latest Ensembl release: {latest_release})')
2026-01-09 21:31:06 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.
2026-01-09 21:31:07 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.
'human' -> 'homo_sapiens' (latest Ensembl release: 115)
2026-01-09 21:31:07 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.
'mus musculus' -> 'mus_musculus' (latest Ensembl release: 115)
'sus scrofa' -> 'sus_scrofa' (latest Ensembl release: 115)
You will see outputs like:
'human' -> 'homo_sapiens'
'mus musculus' -> 'mus_musculus'
'sus scrofa' -> 'sus_scrofa'
From now on, the tutorials use the formal Ensembl names (snake_case).
0.4 — Tutorial Roadmap (Recommended Order)
Install + verify (Part 1)
Prepare external YAMLs (Part 2)
Build graph snapshots (Part 3)
Run self-tests / sanity checks (initialization tests)
Human API deep dive (Part 4)
Harmonization tutorial (HLCA-style) (Part 5)
HLCA experiments case study (Part 5, advanced)
Cross-species humanization (Part 6, advanced)
Advanced topics (Part 7)
0.5 — Quick Troubleshooting (Fast Wins)
If something fails, check these first:
Permissions: is your local repository writable?
Network: first-time runs need to reach Ensembl services (REST + HTTPS/FTP dumps; MySQL is optional).
Disk space: building graphs can use multiple GB.
Snapshot boundary + assemblies: the snapshot release must exist for the organism and the chosen primary assembly. Older assemblies have different release coverage; a multi-assembly snapshot can still include them when they exist within the release window.
External YAML: for mouse/pig you must create a ``*_externals_modified.yml`` in your local repository.
Tip: For a deeper checklist (including diagnostics helpers), see Part 7.3 in ``07_advanced_topics.ipynb``.
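The permissions and disk-space checks from the list above can be scripted with the standard library alone. This is a sketch, not an IDTrack helper: the `preflight` function and the 5 GB default threshold are assumptions chosen for illustration, so adjust the threshold to your organism and the databases you enable.

```python
import os
import shutil
from pathlib import Path


def preflight(local_repository, min_free_gb=5.0):
    """Report whether the repository folder is writable and has free space."""
    repo = Path(local_repository).resolve()
    repo.mkdir(parents=True, exist_ok=True)
    free_gb = shutil.disk_usage(repo).free / 1e9
    return {
        "path": str(repo),
        "writable": os.access(repo, os.W_OK),
        "free_gb": round(free_gb, 1),
        "enough_space": free_gb >= min_free_gb,  # threshold is an assumption
    }
```

Calling `preflight(LOCAL_REPOSITORY)` before the first graph build gives a quick report on the first and third items in the list; network reachability still has to be checked separately.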