Part 4 — Core API Deep-Dive (Human Example)

Last updated: 2026-01-08

This notebook is a hands-on tutorial of the public IDTrack API, using human as the example organism.

Learning objectives

Create the idtrack.API facade and understand what it wraps.
Convert single identifiers (time travel + optional external outputs).
Convert batches and summarize outcomes (1→0 / 1→1 / 1→n).
Request explanation payloads for audit trails.
Learn advanced knobs (external bridging, ambiguity strategy, assembly awareness).
Learn introspection helpers (available databases, assemblies, releases, active ranges).

Prerequisite: 03_initialization_graph.ipynb (Part 3) is recommended so the graph loads from cache.

4.1 — The API Facade

idtrack.API is the user-facing entry point. It handles:

organism resolution (human/mouse/pig names and synonyms)
building or loading a graph snapshot (the reproducible snapshot boundary)
conversion helpers like convert_identifier(...) and convert_identifier_multiple(...)

In this notebook we build (or load) the human snapshot, then use it for the rest of the examples.

Expected result: after the setup cell runs, api.track exists and conversions become available.

from __future__ import annotations

import os
from pathlib import Path

import idtrack

LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)

api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
api.configure_logger()

organism, latest_release = api.resolve_organism('human')
SNAPSHOT_RELEASE = latest_release

api.build_graph(organism_name=organism, snapshot_release=SNAPSHOT_RELEASE, calculate_caches=True)
print('Ready:', organism, 'snapshot', SNAPSHOT_RELEASE)

2026-01-17 14:53:38 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.
2026-01-17 14:54:03 INFO:database_manager: Using assembly-specific release range for homo_sapiens assembly 38: releases 76-115 (from config [76, None])
2026-01-17 14:54:57 INFO:graph_maker: The graph is being read: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/graph_homo_sapiens_min48_max115_narrow.pickle
2026-01-17 14:56:40 INFO:the_graph: Cached properties being calculated: available_genome_assemblies
2026-01-17 14:56:40 INFO:the_graph: Cached properties being calculated: combined_edges
2026-01-17 14:58:16 INFO:the_graph: Cached properties being calculated: combined_edges_genes
2026-01-17 15:00:04 INFO:the_graph: Cached properties being calculated: combined_edges_assembly_specific_genes
2026-01-17 15:00:09 INFO:the_graph: Cached properties being calculated: lower_chars_graph
2026-01-17 15:00:11 INFO:the_graph: Cached properties being calculated: get_active_ranges_of_id
2026-01-17 15:01:14 INFO:the_graph: Cached properties being calculated: available_external_databases
2026-01-17 15:01:19 INFO:the_graph: Cached properties being calculated: available_external_databases_assembly
2026-01-17 15:01:23 INFO:the_graph: Cached properties being calculated: node_trios
2026-01-17 15:02:16 INFO:the_graph: Cached properties being calculated: hyperconnective_nodes

Ready: homo_sapiens snapshot 115

4.2 — Single Identifier Conversion

api.convert_identifier('TP53', to_release=SNAPSHOT_RELEASE)

{'target_id': ['ENSG00000141510.20'],
 'last_node': [('ENSG00000141510.20', 'ENSG00000141510.20')],
 'final_database': 'ensembl_gene',
 'graph_id': 'TP53',
 'query_id': 'TP53',
 'no_corresponding': False,
 'no_conversion': False,
 'no_target': False}

If you see no_corresponding=True, it means the input could not be matched. Try a different spelling/casing, or use an Ensembl ID directly.
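
The retry advice above can be sketched as a small loop over spelling variants. The helpers spelling_variants and first_match are illustrative names, not part of IDTrack, and the stand-in converter exists only so the sketch runs without a graph. In a live session, convert could be something like lambda q: api.convert_identifier(q, to_release=SNAPSHOT_RELEASE). IDTrack may already normalize casing internally (see graph_id), so this is mostly useful for stray whitespace.

```python
from typing import Callable, Iterable, Optional


def spelling_variants(query: str) -> Iterable[str]:
    """Yield de-duplicated spelling variants of a query (hypothetical helper)."""
    seen = set()
    stripped = query.strip()
    for candidate in (query, stripped, stripped.upper(), stripped.lower()):
        if candidate and candidate not in seen:
            seen.add(candidate)
            yield candidate


def first_match(convert: Callable[[str], dict], query: str) -> Optional[dict]:
    """Return the first result whose no_corresponding flag is False, else None."""
    for candidate in spelling_variants(query):
        result = convert(candidate)
        if not result.get("no_corresponding", True):
            return result
    return None


def fake_convert(query: str) -> dict:
    """Stand-in converter (assumption): it only recognizes the exact string 'TP53'."""
    matched = query == "TP53"
    return {
        "query_id": query,
        "target_id": ["ENSG00000141510.20"] if matched else [],
        "no_corresponding": not matched,
    }


print(first_match(fake_convert, " tp53 "))
```

If every variant fails, the helper returns None, which is the point at which falling back to an Ensembl ID makes sense.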

4.2.1 Example: time travel (convert to an older release)

Why this matters: published datasets often use older releases.

# Choose an older release to demonstrate time travel
older_release = SNAPSHOT_RELEASE - 10
api.convert_identifier('TP53', to_release=older_release)

{'target_id': ['ENSG00000141510.18'],
 'last_node': [('ENSG00000141510.18', 'ENSG00000141510.18')],
 'final_database': 'ensembl_gene',
 'graph_id': 'TP53',
 'query_id': 'TP53',
 'no_corresponding': False,
 'no_conversion': False,
 'no_target': False}
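
The two outputs above differ only in the version suffix (.20 versus .18). A minimal sketch for comparing results across releases by stable ID follows; split_version is a hypothetical helper, and it assumes the usual Ensembl "stable_id.version" layout.

```python
from typing import Optional, Tuple


def split_version(ensembl_id: str) -> Tuple[str, Optional[int]]:
    """Split 'ENSG00000141510.20' into ('ENSG00000141510', 20).

    The version is None when the ID carries no numeric suffix.
    """
    stable_id, dot, version = ensembl_id.partition(".")
    return stable_id, (int(version) if dot and version.isdigit() else None)


new_id = "ENSG00000141510.20"  # target_id at the snapshot release
old_id = "ENSG00000141510.18"  # target_id ten releases earlier

same_gene = split_version(new_id)[0] == split_version(old_id)[0]
print(same_gene, split_version(new_id)[1], split_version(old_id)[1])
# prints: True 20 18
```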

4.2.2 Convert into an external database (HGNC)

To convert into a specific external database, pass its name as the final_database argument. Database names are the same names that appear in your external YAML.

api.convert_identifier('ENSG00000141510', to_release=SNAPSHOT_RELEASE, final_database='HGNC Symbol')

{'target_id': ['TP53'],
 'last_node': [('ENSG00000141510.20', 'TP53')],
 'final_database': 'HGNC Symbol',
 'graph_id': 'ENSG00000141510',
 'query_id': 'ENSG00000141510',
 'no_corresponding': False,
 'no_conversion': False,
 'no_target': False}

4.2.3 How do I know which external databases are available?

Use the graph itself to list the databases it currently represents.

g = api.track.graph
sorted(g.available_external_databases)[:50]

['CCDS',
 'Clone_based_ensembl_gene',
 'Clone_based_vega_gene',
 'EntrezGene',
 'HGNC Symbol',
 'Havana gene',
 'Havana transcript',
 'Havana translation',
 'NCBI gene',
 'NCBI gene (formerly Entrezgene)',
 'RFAM',
 'RefSeq_mRNA',
 'RefSeq_mRNA_predicted',
 'RefSeq_ncRNA',
 'RefSeq_ncRNA_predicted',
 'RefSeq_peptide',
 'RefSeq_peptide_predicted',
 'UniProtKB Gene Name',
 'Uniprot/SPTREMBL',
 'Uniprot/SWISSPROT',
 'Vega gene',
 'Vega_gene',
 'synonym_id::EntrezGene',
 'synonym_id::HGNC Symbol',
 'synonym_id::NCBI gene',
 'synonym_id::NCBI gene (formerly Entrezgene)',
 'synonym_id::UniProtKB Gene Name']
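
Database names are case- and spacing-sensitive strings, so a typo in final_database is easy to make. The sketch below checks a requested name against the available list and suggests close matches. suggest_database is a hypothetical helper built on the standard-library difflib module; it is not an IDTrack function.

```python
import difflib
from typing import Iterable, List


def suggest_database(name: str, available: Iterable[str], n: int = 3) -> List[str]:
    """Return [name] if it is available, otherwise up to n close matches.

    Matching is case-insensitive; the returned names keep their original casing.
    """
    available = list(available)
    if name in available:
        return [name]
    by_lower = {db.lower(): db for db in available}
    close = difflib.get_close_matches(name.lower(), list(by_lower), n=n, cutoff=0.6)
    return [by_lower[key] for key in close]


# A few names copied from the listing above; in a live session,
# pass g.available_external_databases instead.
available = ["EntrezGene", "HGNC Symbol", "RefSeq_mRNA", "Uniprot/SWISSPROT"]
print(suggest_database("HGNC symbl", available))
```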

4.2.4 Understanding the result dictionary

Key fields:

query_id: exactly what you typed
graph_id: what IDTrack matched internally (normalization step)
target_id: list of outputs (can be 0, 1, or many)
no_corresponding: input didn’t match any node
no_conversion: input matched, but no path to target release / database
no_target: reached an Ensembl target, but requested external DB had no synonym

Important: target_id is a list because ambiguity is real and common.
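
The flags and the length of target_id can be collapsed into a single label for logging or filtering. This is a minimal sketch: conversion_status and its label strings are illustrative choices, and the order of the checks is an assumption about which flag should take precedence.

```python
def conversion_status(result: dict) -> str:
    """Collapse the diagnostic flags and target count into one label."""
    if result["no_corresponding"]:
        return "unmatched_input"       # input did not match any node
    if result["no_conversion"]:
        return "no_path"               # matched, but no path to the target
    if result["no_target"]:
        return "no_external_synonym"   # Ensembl target reached, external DB empty
    n_targets = len(result["target_id"])
    if n_targets == 0:
        return "empty"
    return "unique" if n_targets == 1 else "ambiguous"


# Result dictionary copied from the TP53 example earlier in this section.
tp53 = {
    "target_id": ["ENSG00000141510.20"],
    "last_node": [("ENSG00000141510.20", "ENSG00000141510.20")],
    "final_database": "ensembl_gene",
    "graph_id": "TP53",
    "query_id": "TP53",
    "no_corresponding": False,
    "no_conversion": False,
    "no_target": False,
}
print(conversion_status(tp53))
# prints: unique
```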

4.2.5 Ambiguity control: strategy='best' vs strategy='all'

strategy='best' (default): returns a single best target when possible
strategy='all': returns all candidates IDTrack found

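Use 'all' when you are doing QC or want to inspect ambiguous mappings. For an unambiguous gene such as TP53, both strategies return the same single target, as the output below shows.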

api.convert_identifier('TP53', to_release=SNAPSHOT_RELEASE, strategy='all')

{'target_id': ['ENSG00000141510.20'],
 'last_node': [('ENSG00000141510.20', 'ENSG00000141510.20')],
 'final_database': 'ensembl_gene',
 'graph_id': 'TP53',
 'query_id': 'TP53',
 'no_corresponding': False,
 'no_conversion': False,
 'no_target': False}
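
When strategy='all' does return several candidates, the pipeline needs an explicit rule for them. The sketch below shows one way to make that rule visible in code. pick_target and its policy names are hypothetical, and the ambiguous example uses placeholder IDs rather than real conversion output.

```python
def pick_target(result: dict, on_ambiguous: str = "raise"):
    """Apply an explicit 1→n policy to a result dictionary.

    Returns a single ID, None (no target or dropped), or a list (policy 'keep').
    """
    targets = result["target_id"]
    if len(targets) <= 1:
        return targets[0] if targets else None
    if on_ambiguous == "keep":
        return list(targets)
    if on_ambiguous == "drop":
        return None
    raise ValueError(
        f"{result['query_id']!r} maps to {len(targets)} targets: {targets}"
    )


unique = {"query_id": "TP53", "target_id": ["ENSG00000141510.20"]}
# Placeholder IDs, invented purely to exercise the ambiguous branch.
ambiguous = {"query_id": "EXAMPLE", "target_id": ["TARGET_A", "TARGET_B"]}

print(pick_target(unique))
print(pick_target(ambiguous, on_ambiguous="keep"))
```

Defaulting to 'raise' is a deliberate choice in this sketch: an ambiguous mapping stops the run unless a policy was chosen on purpose.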

4.3 — Batch Conversion

Most workflows start from a list of identifiers (genes in a count matrix, markers, hits, etc.). IDTrack provides two helpers:

convert_identifier_multiple(...) to convert a list of identifiers in one call
classify_multiple_conversion(...) to summarize outcomes

genes = ['TP53', 'BRCA1', 'BRCA2', 'BRAF', 'KRAS', 'NOT_A_REAL_GENE']
results = api.convert_identifier_multiple(genes, to_release=SNAPSHOT_RELEASE, final_database='HGNC Symbol')
results[:2]  # show first two

100%|█████████████████████████████████████████████| 6/6 [00:00<00:00, 73.13it/s, ID:NOT_A_REAL_GENE]

[{'target_id': ['TP53'],
  'last_node': [('ENSG00000141510.20', 'TP53')],
  'final_database': 'HGNC Symbol',
  'graph_id': 'TP53',
  'query_id': 'TP53',
  'no_corresponding': False,
  'no_conversion': False,
  'no_target': False},
 {'target_id': ['BRCA1'],
  'last_node': [('ENSG00000012048.27', 'BRCA1')],
  'final_database': 'HGNC Symbol',
  'graph_id': 'BRCA1',
  'query_id': 'BRCA1',
  'no_corresponding': False,
  'no_conversion': False,
  'no_target': False}]
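
For downstream tools it is often handier to flatten the list of result dictionaries into one row per (query, target) pair. This is a sketch using only the standard library. results_to_rows is a hypothetical helper, and the third demo entry is an assumed shape for an unmatched input, since the notebook output shows only the first two results.

```python
import csv
import io


def results_to_rows(results):
    """Flatten result dictionaries into one row per (query, target) pair."""
    rows = []
    for res in results:
        # Keep 1→0 inputs visible as a row with an empty target.
        targets = res["target_id"] or [""]
        for target in targets:
            rows.append({
                "query_id": res["query_id"],
                "target_id": target,
                "final_database": res.get("final_database"),
                "n_targets": len(res["target_id"]),
            })
    return rows


demo_results = [
    {"query_id": "TP53", "target_id": ["TP53"], "final_database": "HGNC Symbol"},
    {"query_id": "BRCA1", "target_id": ["BRCA1"], "final_database": "HGNC Symbol"},
    # Assumed shape for an unmatched input (not shown in the output above).
    {"query_id": "NOT_A_REAL_GENE", "target_id": [], "final_database": "HGNC Symbol"},
]

rows = results_to_rows(demo_results)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```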

summary = api.classify_multiple_conversion(results)
# Each bin is a list of per-gene dictionaries
{k: len(v) for k, v in summary.items()}

{'changed_only_1_to_n': 0,
 'changed_only_1_to_1': 0,
 'alternative_target_1_to_1': 0,
 'alternative_target_1_to_n': 0,
 'matching_1_to_0': 1,
 'matching_1_to_1': 5,
 'matching_1_to_n': 0,
 'input_identifiers': 6}

If you want a human-readable report, you can print the summary bins:

api.print_binned_conversion(summary)

2026-01-17 15:02:27 INFO:api:
IDTrack conversion summary:
  Total processed: 6
  1→0: 1 (16.7%)
  1→1: 5 (83.3%)
    Changed only: 0 (0.0%)
    Alternative targets: 0 (0.0%)
    Rest: 5 (100.0%)
  1→n: 0 (0.0%)
    Changed only: 0 (0.0%)
    Alternative targets: 0 (0.0%)
  Diagnostics:
    no_corresponding: 1
    no_conversion:   0
    no_target:       0
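
The top-level 1→0 / 1→1 / 1→n percentages in the report above follow directly from the length of each target_id list. The sketch below reproduces them for a batch with the same shape (five single-target results, one empty). cardinality_counts is a hypothetical helper, and the demo dictionaries carry placeholder targets only.

```python
from collections import Counter


def cardinality_counts(results) -> Counter:
    """Count results by mapping cardinality, based on len(target_id)."""
    labels = {0: "1→0", 1: "1→1"}
    return Counter(labels.get(len(r["target_id"]), "1→n") for r in results)


# Same shape as the batch above: five inputs with one target, one with none.
demo = [{"target_id": ["X"]} for _ in range(5)] + [{"target_id": []}]

counts = cardinality_counts(demo)
total = sum(counts.values())
for label in ("1→0", "1→1", "1→n"):
    print(f"{label}: {counts[label]} ({100 * counts[label] / total:.1f}%)")
# prints:
# 1→0: 1 (16.7%)
# 1→1: 5 (83.3%)
# 1→n: 0 (0.0%)
```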

4.4 — Explainability & Auditability

With explain=True, the result gains one extra field, the_path, which describes the graph edges followed during the conversion. This is useful for advanced QC and debugging.

explained = api.convert_identifier('TP53', to_release=SNAPSHOT_RELEASE, final_database='HGNC Symbol', explain=True)
list(explained.keys())

['target_id',
 'last_node',
 'final_database',
 'graph_id',
 'query_id',
 'no_corresponding',
 'no_conversion',
 'no_target',
 'the_path']

# The path dictionary keys are (target_id, ensembl_gene_id) pairs
list(explained['the_path'].keys())[:3]

[('TP53', 'ENSG00000141510.20')]

the_path is intentionally detailed. For most users, the summary flags and target_id are enough.
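
For an audit trail, the keys of the_path are often the part worth persisting. Because they are tuples, they cannot be used directly as JSON object keys, so the sketch below converts them into small dictionaries first. path_pairs is a hypothetical helper. The path payload is replaced by None in the demo because its internal structure is not shown in this notebook.

```python
import json


def path_pairs(explained: dict) -> list:
    """List the (target_id, ensembl_gene_id) keys of the_path as JSON-friendly dicts."""
    return [
        {"target_id": target, "ensembl_gene_id": gene}
        for target, gene in explained.get("the_path", {})
    ]


# Key copied from the output above; the value is a placeholder.
explained_demo = {
    "query_id": "TP53",
    "the_path": {("TP53", "ENSG00000141510.20"): None},
}

audit_record = {
    "query_id": explained_demo["query_id"],
    "pairs": path_pairs(explained_demo),
}
print(json.dumps(audit_record, indent=2))
```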

4.5 — Advanced Conversion Options

api.convert_identifier(...) is a convenience wrapper around api.track.convert(...).

Use the high-level API most of the time. If you need full control (search settings, whether external bridging is allowed, deeper path diagnostics), call api.track.convert(...) directly.

4.5.1 Best vs all (selection strategy)

strategy='best' returns a single globally best target.
strategy='all' returns all scored targets (useful for ambiguity-aware pipelines).

4.5.2 Controlling external bridging

External bridging helps reconnect broken Ensembl histories using external IDs, but it can also increase the search space. Power users can toggle it via the go_external argument of api.track.convert(...).

4.5.3 Hyperconnected nodes

Some external identifiers connect to many entities (e.g. generic accessions or RNA family names such as Metazoa_SRP). IDTrack detects these and limits their use to keep searches fast.
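
To make the idea concrete, here is a toy sketch of flagging high-degree nodes in a plain adjacency mapping. It is not IDTrack's detection rule: the threshold, the node names, and the helper high_degree_nodes are all assumptions chosen for illustration.

```python
def high_degree_nodes(adjacency: dict, threshold: int) -> dict:
    """Return {node: degree} for nodes with more than `threshold` neighbours."""
    return {
        node: len(neighbours)
        for node, neighbours in adjacency.items()
        if len(neighbours) > threshold
    }


# Toy graph: one shared family-style identifier linked to many genes,
# and one specific identifier linked to a single gene.
adjacency = {
    "shared_family_id": {f"gene_{i}" for i in range(50)},
    "specific_symbol": {"gene_0"},
}

print(high_degree_nodes(adjacency, threshold=10))
# prints: {'shared_family_id': 50}
```

A path search that walks through shared_family_id could reach 50 unrelated genes in one hop, which is why limiting such nodes narrows both runtime and ambiguity.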

4.5.4 Assembly-aware conversions

In IDTrack, genome assemblies are part of the graph. This is crucial when you integrate datasets that were annotated with different references (for example, a GRCh37-based GTF and a GRCh38-based GTF).

When you build a snapshot you choose a primary assembly (default: the newest/highest-priority assembly for that organism). The snapshot can still include other assemblies that Ensembl exposes within the snapshot window, and the path-finder can traverse between assemblies when it improves connectivity.

Practical consequences:

You can feed identifiers originating from older builds and still harmonize them into one target space (your snapshot release + primary assembly).
External databases can be assembly-scoped; keeping assembly blocks enabled in your external YAML increases the set of bridges available for mapping.

If you truly need outputs anchored to a different primary assembly (for example, a GRCh37-only downstream reference), rebuild with genome_assembly=37. Note that the cached graph filename does not include the assembly; use a separate local repository if you want to keep multiple primary-assembly snapshots side-by-side.
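
Keeping one local repository per primary assembly can be as simple as deriving the directory name from the assembly number. A minimal sketch follows. repository_for_assembly and the assembly_<n> naming are hypothetical conventions, and the demo writes into a temporary directory rather than a real cache.

```python
import tempfile
from pathlib import Path


def repository_for_assembly(base, assembly: int) -> Path:
    """Return (and create) a per-assembly cache directory such as <base>/assembly_37."""
    repo = Path(base) / f"assembly_{assembly}"
    repo.mkdir(parents=True, exist_ok=True)
    return repo


with tempfile.TemporaryDirectory() as tmp:
    repo_37 = repository_for_assembly(tmp, 37)
    repo_38 = repository_for_assembly(tmp, 38)
    created = sorted(p.name for p in Path(tmp).iterdir())
    print(created)
    # prints: ['assembly_37', 'assembly_38']
```

In a live session, str(repo_37) would be passed as local_repository when constructing idtrack.API, as in the setup cell at the top of this notebook.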

Example (advanced):

api.track.convert(
    from_id='TP53',
    from_release=None,
    to_release=SNAPSHOT_RELEASE,
    final_database='HGNC Symbol',
    go_external=True,
    return_path=True,
)

# Advanced demo: inspect hyperconnected nodes and compare `go_external` behavior.
# Safe: does not modify your cache; it only runs conversions.

# 1) Hyperconnected nodes (performance/ambiguity concept)
g = api.track.graph
hc = getattr(g, 'hyperconnective_nodes', {})
print('Hyperconnected external nodes:', len(hc))
if hc:
    top = sorted(hc.items(), key=lambda kv: kv[1], reverse=True)[:10]
    print('Top 10 by out-degree:')
    for node, deg in top:
        print(' ', deg, '-', node)

# 2) External bridging toggle (often matters when backbone history is disconnected)
# For many well-behaved genes, both calls will succeed; the point is the *option* exists.
res_no_external = api.track.convert(
    from_id='TP53',
    from_release=None,
    to_release=SNAPSHOT_RELEASE,
    final_database=None,
    go_external=False,
    prioritize_to_one_filter=True,
    return_path=False,
)
res_with_external = api.track.convert(
    from_id='TP53',
    from_release=None,
    to_release=SNAPSHOT_RELEASE,
    final_database=None,
    go_external=True,
    prioritize_to_one_filter=True,
    return_path=False,
)

print('go_external=False ->', 'OK' if res_no_external else None)
print('go_external=True  ->', 'OK' if res_with_external else None)

Hyperconnected external nodes: 1611
Top 10 by out-degree:
  1911 - Metazoa_SRP
  1911 - RF00017
  1619 - RF00026
  1619 - U6
  1090 - RF00019
  1090 - Y_RNA
  633 - 5S_rRNA
  633 - RF00001
  490 - RF01210
  490 - snoU13
go_external=False -> OK
go_external=True  -> OK

4.6 — Introspection & Discovery

These helpers answer practical questions like:

Which external databases are available in my current graph?
Which genome assemblies are represented?
What Ensembl release range does my snapshot cover?
When was a given identifier active across releases?

The next cell demonstrates the most useful introspection calls.

# Introspection demo

print('Assemblies in this graph:', sorted(api.list_genome_assemblies()))

ext_dbs = sorted(api.list_external_databases())
print('External DBs enabled (count):', len(ext_dbs))
print('External DBs (first 25):', ext_dbs[:25])

forms = api.external_database_forms()
print()
print('External DB connection forms (sample):')
for name in ext_dbs[:10]:
    print(' ', name, '→', forms.get(name))

rels = api.list_ensembl_releases()
print()
print('Ensembl releases in snapshot window:', (min(rels), max(rels)) if rels else None)

# Active ranges: when was an ID "alive" across releases?
# (Useful for provenance documentation.)
g = api.track.graph
example_gene = 'ENSG00000141510'  # TP53
if example_gene in g.nodes:
    print()
    print('Active ranges (main assembly) for', example_gene, ':', g.get_active_ranges_of_id.get(example_gene))
    try:
        print('Active ranges (all assemblies) for', example_gene, ':', g.get_active_ranges_of_id_ensembl_all_inclusive(example_gene))
    except Exception as e:
        print('All-assemblies active range failed ->', repr(e))
else:
    print('Example gene not found in graph (unexpected).')

2026-01-17 15:02:28 INFO:the_graph: Cached properties being calculated (for tests): external_database_connection_form

Assemblies in this graph: [36, 37, 38]
External DBs enabled (count): 27
External DBs (first 25): ['CCDS', 'Clone_based_ensembl_gene', 'Clone_based_vega_gene', 'EntrezGene', 'HGNC Symbol', 'Havana gene', 'Havana transcript', 'Havana translation', 'NCBI gene', 'NCBI gene (formerly Entrezgene)', 'RFAM', 'RefSeq_mRNA', 'RefSeq_mRNA_predicted', 'RefSeq_ncRNA', 'RefSeq_ncRNA_predicted', 'RefSeq_peptide', 'RefSeq_peptide_predicted', 'UniProtKB Gene Name', 'Uniprot/SPTREMBL', 'Uniprot/SWISSPROT', 'Vega gene', 'Vega_gene', 'synonym_id::EntrezGene', 'synonym_id::HGNC Symbol', 'synonym_id::NCBI gene']

External DB connection forms (sample):
  CCDS → transcript
  Clone_based_ensembl_gene → gene
  Clone_based_vega_gene → gene
  EntrezGene → gene
  HGNC Symbol → gene
  Havana gene → gene
  Havana transcript → transcript
  Havana translation → translation
  NCBI gene → gene
  NCBI gene (formerly Entrezgene) → gene

Ensembl releases in snapshot window: (76, 115)

Active ranges (main assembly) for ENSG00000141510 : [[48, 115]]
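All-assemblies active range failed -> ValueError("Cannot get active ranges for 'ENSG00000141510': node type 'base_ensembl_gene' is not a gene type. Expected 'ensembl_gene' or an assembly-specific gene type.")

Note: this ValueError is expected for an unversioned ID. According to the message, ENSG00000141510 is stored as a base_ensembl_gene node, while the all-assemblies helper accepts only ensembl_gene or assembly-specific gene nodes. Versioned IDs such as ENSG00000141510.20 appear under ensembl_gene in the conversion results earlier in this notebook, so passing a versioned ID is the likely fix.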
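
Active ranges are returned as a list of [start, end] pairs, so checking whether an ID existed at a given release is a short membership test. The sketch assumes both ends of each range are inclusive; is_active is a hypothetical helper.

```python
def is_active(ranges, release: int) -> bool:
    """True if `release` falls inside any [start, end] range (ends assumed inclusive)."""
    return any(start <= release <= end for start, end in ranges)


# Main-assembly ranges for TP53, copied from the output above.
tp53_ranges = [[48, 115]]

print(is_active(tp53_ranges, 105), is_active(tp53_ranges, 40))
# prints: True False
```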

4.7 — Practical advice (the kind that saves you a week)

Always record your snapshot boundary (release) in your analysis notes.
If you share results, share the external YAML too.
When mapping is ambiguous, do not hide it — decide how your pipeline should handle 1→n mappings.
For scRNA-seq harmonization, prefer stable namespaces (Ensembl IDs) before switching to symbols.
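
The first two points can be automated with a small provenance record saved next to your results. This is a sketch under stated assumptions: provenance_record and its field names are hypothetical, and hashing the external YAML with SHA-256 is one possible way to pin its content.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def provenance_record(organism: str, snapshot_release: int,
                      external_yaml: Optional[str] = None) -> dict:
    """Build a small, JSON-serializable record of the snapshot boundary."""
    record = {"organism": organism, "snapshot_release": snapshot_release}
    if external_yaml is not None:
        # Hash the YAML content so a later reader can verify they have the same file.
        data = Path(external_yaml).read_bytes()
        record["external_yaml_sha256"] = hashlib.sha256(data).hexdigest()
    return record


# Values taken from the setup cell output ("Ready: homo_sapiens snapshot 115").
record = provenance_record("homo_sapiens", 115)
print(json.dumps(record, indent=2))
```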

Tip: If you need troubleshooting checklists and diagnostics helpers, see 07_advanced_topics.ipynb (Part 7.3).