External Mappers

External mappers are optional helpers for cross-database mapping and orthology-related utilities. They are not the core “time-aware graph snapshot” engine (that lives in the backbone graph), but they can be useful for:

fetching or aligning ortholog tables for cross-species workflows
using external services to map between common identifier types when you need a convenience layer

Note

Some external mapper backends require optional dependencies (or network access). The Part 6 tutorial shows how to check availability before you rely on them.

Package API

External ID mapping backends for idtrack.

This module provides interfaces to external ID mapping services:

g:Profiler (gprofiler-official)
MyGene.info (mygene)
Ensembl BioMart (pybiomart)
gget (Ensembl REST API)

Ortholog utilities are also available; they require gget and biopython.

Note: This module requires optional dependencies that are not installed with the core idtrack package. Install them with:

pip install gget mygene pybiomart gprofiler-official biopython

Or install only the backends you need.

check_optional_dependencies(warn=True)[source]

Check which optional dependencies are installed.

Parameters: warn (bool) – When True, emit a warning summarizing missing packages.
Returns: Mapping from dependency key to availability.
Return type: dict[str, bool]
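
As a rough illustration of this kind of availability check, the sketch below probes for the optional packages with the standard library only. The function name and the key-to-module table are assumptions made for this example and do not reproduce idtrack's own OPTIONAL_DEPENDENCIES table.

```python
import importlib.util
import warnings

# Hypothetical mapping from dependency key to importable module name.
# (gprofiler-official is imported as "gprofiler", biopython as "Bio".)
MODULES = {
    "gget": "gget",
    "mygene": "mygene",
    "pybiomart": "pybiomart",
    "gprofiler": "gprofiler",
    "biopython": "Bio",
}


def check_deps_sketch(warn: bool = True) -> dict:
    """Return {dependency key: is it importable?} without importing anything."""
    status = {
        key: importlib.util.find_spec(module) is not None
        for key, module in MODULES.items()
    }
    missing = sorted(key for key, ok in status.items() if not ok)
    if warn and missing:
        warnings.warn("Missing optional dependencies: " + ", ".join(missing))
    return status
```

Checking with find_spec avoids the import cost of heavy packages when only availability matters.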

Conversion helpers

convert_ids(ids, input_db, output_db, method, species, drop_metadata_json_column=True, chunk_size=1000, pause=0.2, max_retries=3, strip_versions=True, release_for_pybiomart=None, strict_input_db_gprofiler=True, suppress_method_verbosity=True, verbose=2)[source]

Convert identifiers using an external mapper backend.

Parameters:

ids (Iterable[str]) – Input identifiers to map.
input_db (str) – Source database type.
output_db (str) – Target database type.
method (str) – Backend method name (one of SUPPORTED_METHODS).
species (str) – Species code (e.g. "hsapiens").
drop_metadata_json_column (bool) – If True, drop the metadata_json column from the returned DataFrame.
chunk_size (int) – Number of IDs per API request.
pause (float) – Pause in seconds between requests.
max_retries (int) – Maximum retry attempts per chunk on failure (for backends that support it).
strip_versions (bool) – Strip version suffixes from Ensembl/RefSeq IDs.
release_for_pybiomart (str | int | None) – Ensembl release/key for the pybiomart backend. Must be None unless method="pybiomart".
strict_input_db_gprofiler (bool) – If True, enforce strict input-db filtering in the gprofiler backend.
suppress_method_verbosity (bool) – Suppress stdout/stderr from the underlying backend library.
verbose (int | str | bool) – Verbosity level (1/2/3) or string alias ("error", "warning", "info", "debug").

Returns:

Standardized mapping DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If method/verbose is invalid, or if release_for_pybiomart is used with a non-pybiomart backend.
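
The chunk_size, pause, and max_retries parameters describe a common request pattern. The following is a minimal sketch of that pattern, not the library's implementation: fetch is a hypothetical stand-in for a backend call, and max_retries is assumed here to mean the total number of attempts per chunk.

```python
import time
from typing import Callable, Iterable, Iterator


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def map_in_chunks(
    ids: Iterable[str],
    fetch: Callable[[list], dict],  # hypothetical: chunk -> {input_id: output_id}
    chunk_size: int = 1000,
    pause: float = 0.2,
    max_retries: int = 3,
) -> dict:
    """Send `ids` to `fetch` chunk by chunk, retrying each failed chunk."""
    results: dict = {}
    for chunk in chunked(list(ids), chunk_size):
        for attempt in range(1, max_retries + 1):
            try:
                results.update(fetch(chunk))
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the last attempt
                time.sleep(pause)  # wait before retrying the same chunk
        time.sleep(pause)  # be polite to the remote service between chunks
    return results
```

A smaller chunk_size means less work is repeated when a chunk fails, but more requests overall.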

Orthology helpers

class AlignmentScores(alignment_length, identity_fraction, positive_fraction, very_negative_fraction, gap_fraction_query, gap_fraction_target, gap_openings_query, gap_openings_target, seq1_coverage, seq2_coverage, blosum62_sum, blosum62_mean, composition_l2_distance)[source]

Bases: object

Alignment-derived scalar metrics for a pairwise protein alignment.

Parameters:

alignment_length (int)
identity_fraction (float)
positive_fraction (float)
very_negative_fraction (float)
gap_fraction_query (float)
gap_fraction_target (float)
gap_openings_query (int)
gap_openings_target (int)
seq1_coverage (float)
seq2_coverage (float)
blosum62_sum (float)
blosum62_mean (float)
composition_l2_distance (float)

class EmbeddingFeatures(model_name, dim, cosine_similarity, euclidean_distance, diff_embedding)[source]

Bases: object

Embedding-derived similarity features for two protein sequences.

Parameters:

model_name (str)
dim (int)
cosine_similarity (float)
euclidean_distance (float)
diff_embedding (ndarray)

_canonical_from_alias(name)[source]

Return canonical short-code for a species alias.

Maps strings like "human", "Homo sapiens", "sus_scrofa" to canonical short codes (hsapiens, mmusculus, sscrofa). Unknown inputs are returned unchanged (lowercased/stripped).

Parameters: name (str) – Species name or alias.
Returns: Canonical short code (or a normalized fallback if unknown).
Return type: str
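
The lookup-with-fallback behaviour described above can be sketched as follows. The alias table is a small illustrative subset written for this example; the contents of the real _SPECIES_ALIASES are not shown in this reference.

```python
# Illustrative alias table (assumption); the real table may contain more entries.
ALIASES = {
    "human": "hsapiens",
    "homo sapiens": "hsapiens",
    "homo_sapiens": "hsapiens",
    "mouse": "mmusculus",
    "mus musculus": "mmusculus",
    "pig": "sscrofa",
    "sus scrofa": "sscrofa",
    "sus_scrofa": "sscrofa",
}


def canonical_from_alias_sketch(name: str) -> str:
    """Look up a species alias; unknown names fall back to the normalized input."""
    key = name.strip().lower()
    return ALIASES.get(key, key)
```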

_get_blosum62()[source]

Lazily load the BLOSUM62 matrix.

_require_ortholog_deps()[source]

Lazily import gget and biopython; raise a helpful error if either is missing.

_species_to_genus_species(name)[source]

Convert a species string to (canonical_code, genus, species) for Bgee.

Inputs may be any key in _SPECIES_ALIASES or any canonical key in _SPECIES_CANONICAL_TO_BGEENAMES. Unknown canonical codes raise a ValueError.

Parameters: name (str) – Species name or alias.
Returns: (canonical_code, genus, species) for Bgee.
Return type: tuple[str, str, str]
Raises: ValueError – If the species cannot be resolved to a supported canonical code.

align_ortholog_pair_with_features(query_ensembl_id, target_species, *, use_super5=False, embedding_model_name='facebook/esm2_t12_35M_UR50D', embedding_device='cpu', embedding_revision='main', verbose=True)[source]

Compute ortholog alignment features for a query gene and target species.

Parameters:

query_ensembl_id (str) – Ensembl gene ID in the query organism.
target_species (str) – Target species alias, interpreted via _SPECIES_ALIASES.
use_super5 (bool) – Passed through to gget.muscle(super5=...).
embedding_model_name (str | None) – Hugging Face model name for embeddings. Set to None to disable embeddings.
embedding_device (str) – Device for the transformer model (e.g. "cpu", "cuda").
embedding_revision (str) – Model revision for from_pretrained (pin to a commit hash for reproducibility).
verbose (bool) – Print progress/warnings.

Returns:

Mapping from target ortholog Ensembl IDs to feature dictionaries.

Raises:

ValueError – If no orthologs are available for the query/target species pair.

Return type:

dict[str, dict[str, Any]]

compute_alignment_scores(seq1, seq2, aligned1, aligned2)[source]

Compute alignment-derived scores and the AA composition-difference vector.

Parameters:

seq1 (str)
seq2 (str)
aligned1 (str)
aligned2 (str)

Return type:

tuple[AlignmentScores, ndarray]
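
The reference does not spell out how each metric is defined, so the sketch below shows one plausible reading of a few of them: fractions are taken over the full alignment length, and a gap opening is counted wherever a run of "-" starts. Both definitions are assumptions for illustration, and the function name is hypothetical.

```python
def alignment_stats_sketch(aligned1: str, aligned2: str) -> dict:
    """Compute a few column-wise statistics for two gapped, equal-length sequences."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    n = len(aligned1)
    # A column is identical when both residues match and neither is a gap.
    identical = sum(a == b and a != "-" for a, b in zip(aligned1, aligned2))

    def gap_openings(seq: str) -> int:
        # A gap opens wherever "-" starts the sequence or follows a residue.
        return sum(
            1
            for i, ch in enumerate(seq)
            if ch == "-" and (i == 0 or seq[i - 1] != "-")
        )

    return {
        "alignment_length": n,
        "identity_fraction": identical / n if n else 0.0,
        "gap_fraction_query": aligned1.count("-") / n if n else 0.0,
        "gap_fraction_target": aligned2.count("-") / n if n else 0.0,
        "gap_openings_query": gap_openings(aligned1),
        "gap_openings_target": gap_openings(aligned2),
    }
```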

compute_embedding_features(seq1, seq2, *, model_name, device='cpu', revision='main')[source]

Compute embedding-based similarity features for two sequences.

Parameters:

seq1 (str) – First protein sequence.
seq2 (str) – Second protein sequence.
model_name (str) – Hugging Face model name for embeddings.
device (str) – Device for the transformer model (e.g. "cpu", "cuda").
revision (str) – Model revision for from_pretrained.

Returns:

Scalar similarities plus the difference vector.

Return type:

EmbeddingFeatures
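
For orientation, the scalar similarities and the difference vector can be computed from two pooled embeddings as in the sketch below. It uses plain Python lists instead of ndarray to stay dependency-free, and the sign convention of the difference (first minus second) is an assumption.

```python
import math


def embedding_similarity_sketch(vec1: list, vec2: list) -> dict:
    """Compare two equal-length embedding vectors."""
    if len(vec1) != len(vec2):
        raise ValueError("vectors must have the same dimension")
    diff = [a - b for a, b in zip(vec1, vec2)]
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return {
        "dim": len(vec1),
        # Cosine similarity is undefined for zero vectors; report 0.0 in that case.
        "cosine_similarity": dot / (norm1 * norm2) if norm1 and norm2 else 0.0,
        "euclidean_distance": math.sqrt(sum(d * d for d in diff)),
        "diff_embedding": diff,
    }
```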

fetch_aa_sequence(ensembl_id)[source]

Fetch amino-acid sequence for a gene using gget.seq(translate=True).

Parameters: ensembl_id (str)
Return type: str

get_ortholog_ids_for_species(ortholog_df, target_species)[source]

Return all ortholog Ensembl IDs in ortholog_df for a target species.

Parameters:

ortholog_df (DataFrame)
target_species (str)

Return type:

list[str]

get_ortholog_table(query_ensembl_id, *, verbose=True)[source]

Return Bgee ortholog table for a query Ensembl gene ID via gget.bgee.

Parameters:

query_ensembl_id (str)
verbose (bool)

Return type:

DataFrame

get_protein_embedding(sequence, *, model_name, device='cpu', revision='main', max_len=1022)[source]

Return a pooled protein embedding for a sequence via a transformer model.

The embedding is obtained by mean pooling the per-residue representations.

Parameters:

sequence (str) – Protein sequence (amino-acid string).
model_name (str) – Hugging Face model name for the embedding model.
device (str) – Device for the transformer model (e.g. "cpu", "cuda").
revision (str) – Model revision for from_pretrained.
max_len (int) – Maximum tokenized sequence length (including special tokens).

Returns:

1D embedding vector.

Return type:

np.ndarray
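
Mean pooling itself is simple: average the per-residue vectors component by component. The sketch below shows only that step on plain lists; running the transformer is out of scope here, and whether the real function excludes special tokens before pooling is not stated in this reference.

```python
def mean_pool_sketch(per_residue: list) -> list:
    """Average a list of equal-length per-residue vectors into one vector."""
    if not per_residue:
        raise ValueError("need at least one residue vector")
    dim = len(per_residue[0])
    count = len(per_residue)
    return [sum(vec[j] for vec in per_residue) / count for j in range(dim)]
```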

parse_clustal_alignment(clustal_text)[source]

Parse a minimal ClustalW/MUSCLE text alignment.

Parameters: clustal_text (str) – Raw ClustalW/MUSCLE text output.
Returns: Mapping of {sequence_name: aligned_sequence}.
Return type: dict[str, str]
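
A minimal parser for this kind of text might look like the sketch below. It assumes the usual Clustal layout (a header line, then blocks of "name fragment" lines, with conservation lines starting with whitespace) and is an independent illustration, not the library's parser.

```python
def parse_clustal_sketch(clustal_text: str) -> dict:
    """Concatenate the per-block fragments of each sequence in a Clustal-style alignment."""
    sequences: dict = {}
    for line in clustal_text.splitlines():
        # Skip blank lines and conservation lines (which start with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        # Skip the format header; note this also skips sequences named like the header.
        if line.upper().startswith(("CLUSTAL", "MUSCLE")):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, fragment = parts[0], parts[1]
        sequences[name] = sequences.get(name, "") + fragment
    return sequences
```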

pick_ortholog_for_species(ortholog_df, target_species)[source]

Backward-compatible helper that returns only the first ortholog ID.

Parameters:

ortholog_df (pd.DataFrame)
target_species (str)

Return type:

str | None

run_muscle_pairwise(seq1, seq2, *, name1, name2, super5=False)[source]

Align two AA sequences via gget.muscle.

Parameters:

seq1 (str) – Amino-acid sequence for the first protein.
seq2 (str) – Amino-acid sequence for the second protein.
name1 (str) – Sequence name for the first protein (used in FASTA headers).
name2 (str) – Sequence name for the second protein (used in FASTA headers).
super5 (bool) – Whether to enable MUSCLE super5 mode.

Returns:

(aligned_seq1, aligned_seq2, raw_clustal_text).

Return type:

tuple[str, str, str]

Raises:

RuntimeError – If MUSCLE output cannot be parsed or does not contain the expected sequence names.

Backend implementations

Ensembl BioMart backend for ID mapping.

This module provides the map_with_pybiomart() function for querying Ensembl BioMart to convert biological identifiers. Supports historical Ensembl releases via archive hosts.

_biomart_dataset_for_species(species, explicit=None)[source]

Return the Ensembl BioMart dataset name for the given species.

Parameters:

species (str)
explicit (str | None)

Return type:

str

_bm_list_attribute_names(ds)[source]

Return a list of attribute names for a pybiomart Dataset.

Return type: list[str]

_bm_list_filter_names(ds)[source]

Return a list of filter names for a pybiomart Dataset.

Return type: list[str]

_bm_pick_attribute(canonical_db_name, available_attrs)[source]

Choose a BioMart attribute name for a canonical DB.

The helper first tries explicit candidates from _BM_ATTR_CANDIDATES and falls back to fuzzy matching on common substrings.

Parameters:

canonical_db_name (str) – Canonical database key (see canonical_db()).
available_attrs (list[str]) – Attribute names provided by the BioMart dataset.

Returns:

Selected attribute name.

Return type:

str

Raises:

RuntimeError – If no compatible attribute is available on the dataset.

_bm_pick_filter(canonical_db_name, attr_name, available_filters)[source]

Choose a BioMart filter name.

The selection depends on the canonical database and chosen attribute.

Parameters:

canonical_db_name (str) – Canonical database key for the input IDs.
attr_name (str) – Attribute name chosen for the input IDs.
available_filters (list[str]) – Filter names provided by the BioMart dataset.

Returns:

Selected filter name.

Return type:

str

Raises:

RuntimeError – If no compatible filter is available on the dataset.

_ensembl_archive_host_for_release(release)[source]

Resolve an Ensembl release or key to an archive host.

Examples include an integer release (e.g. 104) or a special string key (e.g. "GRCh37") mapping to hosts like "may2021.archive.ensembl.org".

Parameters: release (int | str | None) – Ensembl release number or key (e.g. 104, "v104", "GRCh37"), or None.
Returns: Archive host for the requested release, or None if unknown.
Return type: str | None
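
The resolution step can be pictured as a normalized dictionary lookup. In the sketch below the table is a small illustrative subset, and the handling of the "v" prefix and of letter case is an assumption about how keys such as "v104" and "GRCh37" are matched.

```python
from typing import Optional, Union

# Illustrative subset only (assumption); the real table covers many more releases.
ARCHIVE_HOSTS = {
    "102": "nov2020.archive.ensembl.org",
    "104": "may2021.archive.ensembl.org",
    "grch37": "grch37.ensembl.org",
}


def archive_host_sketch(release: Union[int, str, None]) -> Optional[str]:
    """Resolve 104, "104", "v104" or "GRCh37" to a host; None if unknown."""
    if release is None:
        return None
    key = str(release).strip().lower()
    if key.startswith("v") and key[1:].isdigit():
        key = key[1:]  # accept "v104" as a synonym for 104
    return ARCHIVE_HOSTS.get(key)
```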

_normalize_biomart_host(host)[source]

Normalize an Ensembl BioMart host for pybiomart.

Examples of valid outputs:

"http://www.ensembl.org"
"http://nov2020.archive.ensembl.org"
"http://grch37.ensembl.org"

Parameters: host (str | None) – Hostname or URL (scheme optional). If None, defaults to "http://www.ensembl.org".
Returns: Normalized base URL suitable for pybiomart.
Return type: str
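
A sketch of such a normalization is shown below, assuming that a missing scheme is filled in with "http://", that an existing scheme is kept as given, and that trailing slashes are dropped. The function name is hypothetical.

```python
from typing import Optional

DEFAULT_HOST = "http://www.ensembl.org"


def normalize_host_sketch(host: Optional[str]) -> str:
    """Return a base URL with a scheme and no trailing slash."""
    if host is None or not host.strip():
        return DEFAULT_HOST
    host = host.strip().rstrip("/")
    if "://" not in host:
        host = "http://" + host  # scheme was omitted
    return host
```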

map_with_pybiomart(ids, input_db, output_db, *, species='hsapiens', chunk_size=1000, pause=0.2, strip_versions=True, release=None, show_progress=True, suppress_method_verbosity=True)[source]

Map identifiers using Ensembl BioMart via pybiomart.

Note: BioMart can only filter by Ensembl IDs (gene, transcript, protein). Other ID types can be used as output_db but not input_db.

Parameters:

ids (_t.Iterable[str]) – Input Ensembl identifiers to map.
input_db (str) – Source database type. Must be one of "ensembl_gene", "ensembl_transcript", or "ensembl_protein".
output_db (str) – Target database type (e.g. "hgnc_symbol", "uniprot", "entrez_gene").
species (str) – Species code (e.g. "hsapiens", "mmusculus", "sscrofa").
chunk_size (int) – Number of IDs per BioMart query.
pause (float) – Pause in seconds between queries.
strip_versions (bool) – Strip version suffixes from Ensembl/RefSeq IDs.
release (str | int | None) – Ensembl release number (e.g. 104) or special key (e.g. "grch37"). If None, uses the current Ensembl release.
show_progress (bool) – Display progress bar.
suppress_method_verbosity (bool) – Suppress stdout/stderr from pybiomart.

Returns:

Standardized mapping DataFrame.

Return type:

pd.DataFrame

Raises:

RuntimeError – If the BioMart connection fails or required dataset metadata cannot be retrieved.
ValueError – If input_db is not an Ensembl type.

MyGene.info backend for ID mapping.

This module provides the map_with_mygene() function for querying the MyGene.info API to convert biological identifiers.

_mg_extract(rec, target)[source]

Extract target identifiers from a MyGene.info record.

Parameters:

rec (dict[str, Any]) – One record from the MyGene.info querymany response.
target (str) – Canonical target database name (e.g. "hgnc_symbol", "uniprot").

Returns:

Extracted target identifiers (may be empty).

Return type:

list[str]

map_with_mygene(ids, input_db, output_db, *, species='hsapiens', chunk_size=1000, pause=0.2, max_retries=3, strip_versions=True, show_progress=True, suppress_method_verbosity=True)[source]

Map identifiers using the MyGene.info API.

Parameters:

ids (Iterable[str]) – Input identifiers to map.
input_db (str) – Source database type (e.g. "ensembl_gene", "hgnc_symbol", "entrez_gene").
output_db (str) – Target database type (e.g. "uniprot", "hgnc_symbol", "entrez_gene").
species (str) – Species code (e.g. "hsapiens", "mmusculus", "sscrofa").
chunk_size (int) – Number of IDs per API request.
pause (float) – Pause in seconds between API requests.
max_retries (int) – Maximum retry attempts per chunk on failure.
strip_versions (bool) – Strip version suffixes from Ensembl/RefSeq IDs.
show_progress (bool) – Display progress bar.
suppress_method_verbosity (bool) – Suppress stdout/stderr from the mygene library.

Returns:

Standardized mapping DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If input_db is not supported by MyGene.info.

_build_metadata_column(df, extra_cols)[source]

Build a metadata_json column from extra columns.

Parameters:

df (DataFrame) – Input DataFrame containing columns referenced by extra_cols.
extra_cols (list[str]) – Column names to include in the JSON metadata per row.

Returns:

JSON-encoded metadata aligned with df.index.

Return type:

pd.Series

_extract_namespace_tokens(raw)[source]

Normalize the g:Profiler namespaces field into uppercase tokens.

Parameters: raw (Any)
Return type: set[str]

_gp_target_candidates(outp)[source]

Return g:Profiler target_namespace candidates for a canonical output database.

The returned list is ordered by preference (first hit wins).

Parameters: outp (str) – Canonical output database name.
Returns: Target namespace candidates in preference order.
Return type: list[str]

_process_gprofiler_response(df, namespace_filter)[source]

Process a g:Profiler convert() response into standardized format.

Parameters:

df (pd.DataFrame | None) – Raw g:Profiler response DataFrame (or None).
namespace_filter (_t.Callable[[_t.Any], bool] | None) – Optional predicate to filter the namespaces column when enforcing strict input-db behavior.

Returns:

(processed_df, has_non_null_outputs).

Return type:

tuple[pd.DataFrame, bool]

Raises:

RuntimeError – If namespace_filter is provided but the response lacks a namespaces column.

map_with_gprofiler(ids, input_db, output_db, *, species='hsapiens', chunk_size=1000, pause=0.2, max_retries=3, strip_versions=True, show_progress=True, suppress_method_verbosity=True, strict_input_db=False)[source]

Map IDs via g:Profiler (gprofiler-official).

Parameters:

ids (Iterable[str]) – Input identifiers to map.
input_db (str) – Database/namespace of input IDs (e.g. "ensembl_gene", "hgnc_symbol").
output_db (str) – Target database/namespace (e.g. "uniprot", "entrez_gene").
species (str) – Species code (e.g. "hsapiens", "mmusculus", "sscrofa").
chunk_size (int) – Number of IDs per API request.
pause (float) – Seconds to pause between requests.
max_retries (int) – Maximum retry attempts per chunk on failure.
strip_versions (bool) – Strip version suffixes from Ensembl/RefSeq IDs.
show_progress (bool) – Display progress bar.
suppress_method_verbosity (bool) – Suppress stdout/stderr from the gprofiler library.
strict_input_db (bool) – If True, filter results to only include mappings where the input namespace matches the expected input_db.

Returns:

Standardized mapping DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If strict_input_db is enabled for an unsupported input database.

gget backend for ID mapping.

This module provides the map_with_gget() function for querying the gget.info API (Ensembl REST-backed) to convert biological identifiers.

_gget_extract(df, outp)[source]

Normalize gget.info() output into standardized (input_id, output_id) format.

Parameters:

df (DataFrame) – Raw DataFrame from gget.info().
outp (str) – Target database type.

Returns:

DataFrame with input_id and output_id columns.

Return type:

DataFrame

map_with_gget(ids, input_db, output_db, *, species='hsapiens', chunk_size=1000, pause=0.2, max_retries=3, strip_versions=True, show_progress=True, suppress_method_verbosity=True)[source]

Map identifiers using gget.info (Ensembl REST API-backed).

Note: gget is Ensembl-centric, so input_db must be an Ensembl ID type. For non-Ensembl inputs, use the "mygene" or "gprofiler" method instead.

Parameters:

ids (Iterable[str]) – Input Ensembl identifiers to map.
input_db (str) – Source database type. Must be one of "ensembl_gene", "ensembl_transcript", or "ensembl_protein".
output_db (str) – Target database type (e.g. "hgnc_symbol", "uniprot", "entrez_gene").
species (str) – Species code (e.g. "hsapiens", "mmusculus", "sscrofa").
chunk_size (int) – Number of IDs per API request.
pause (float) – Pause in seconds between requests.
max_retries (int) – Maximum retry attempts per chunk on failure.
strip_versions (bool) – Strip version suffixes from Ensembl/RefSeq IDs.
show_progress (bool) – Display progress bar.
suppress_method_verbosity (bool) – Suppress stdout/stderr from gget.

Returns:

Standardized mapping DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If input_db is not an Ensembl type.

Shared utilities and constants

Utility functions for the _external_mappers module.

This module provides:

Database and species name canonicalization
ID version stripping for Ensembl and RefSeq identifiers
DataFrame utilities for standardizing output format
Helper functions for chunking, JSON serialization, etc.

_add_mapping_column(df)[source]

Add or recompute the mapping column on a standardized mapping DataFrame.

The mapping value is the per-input cardinality: 1:0 (no outputs), 1:1 (one unique output), 1:n (multiple outputs).

Parameters: df (DataFrame) – Standardized mapping DataFrame.
Returns: DataFrame with a (re)computed mapping column.
Return type: pd.DataFrame
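
The cardinality rule can be sketched without pandas by working on (input_id, output_id) pairs, as below. This pair-based representation is an assumption made to keep the example dependency-free; the real helper operates on a DataFrame.

```python
from collections import defaultdict


def mapping_labels_sketch(pairs: list) -> dict:
    """Label each input ID as "1:0", "1:1" or "1:n" from (input_id, output_id) pairs."""
    outputs = defaultdict(set)
    for input_id, output_id in pairs:
        outputs[input_id]  # register the input even when it has no output
        if output_id is not None:
            outputs[input_id].add(output_id)
    labels = {}
    for input_id, outs in outputs.items():
        if not outs:
            labels[input_id] = "1:0"
        elif len(outs) == 1:
            labels[input_id] = "1:1"
        else:
            labels[input_id] = "1:n"
    return labels
```

Because outputs are collected in a set, a repeated identical output still counts as one unique output.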

_as_list(v)[source]

Coerce a value to a list.

None returns []
list/tuple/set returns list(v)
scalar values return [v]

Parameters: v (Any) – Value to coerce.
Returns: v represented as a list.
Return type: list[Any]

_chunker(items, size)[source]

Yield successive chunks of a list.

Parameters:

items (list[Any])
size (int)

Return type:

Iterator[list[Any]]

_empty_result()[source]

Return an empty standardized mapping result DataFrame.

Return type: DataFrame

_ensure_all_inputs(df, original_inputs, inp, outp, method, release_used)[source]

Ensure each input appears at least once in the output.

Missing inputs are appended with output_id=None. Input order is preserved and the mapping column is (re)computed.

Parameters:

df (pd.DataFrame) – Partially populated standardized mapping DataFrame.
original_inputs (list[str]) – Original input identifiers (order is preserved).
inp (str) – Canonical input database key.
outp (str) – Canonical output database key.
method (str) – Backend method name.
release_used (str | None) – Backend-provided release/host label (if any).

Returns:

Standardized mapping DataFrame containing at least one row per input.

Return type:

pd.DataFrame
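
The back-filling behaviour can be illustrated on plain (input_id, output_id) tuples, as in the sketch below. It is a simplified stand-in for the DataFrame-based helper, and emitting each distinct input only once is an assumption.

```python
def ensure_all_inputs_sketch(rows: list, original_inputs: list) -> list:
    """Return rows in input order, adding (input_id, None) for unmapped inputs."""
    by_input: dict = {}
    for input_id, output_id in rows:
        by_input.setdefault(input_id, []).append(output_id)
    result = []
    seen = set()
    for input_id in original_inputs:
        if input_id in seen:
            continue  # emit each distinct input once
        seen.add(input_id)
        for output_id in by_input.get(input_id, [None]):
            result.append((input_id, output_id))
    return result
```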

_is_bare_numeric(s)[source]

Return True if a string consists entirely of digits.

Parameters: s (str)
Return type: bool

_json(obj)[source]

Serialize an object to a compact JSON string (unicode preserved).

Parameters: obj (Any)
Return type: str
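
With the standard json module, compact output that preserves unicode is typically obtained as in the sketch below. The exact options used by the library are not shown in this reference, so treat these as an assumption.

```python
import json


def compact_json_sketch(obj) -> str:
    """Serialize without extra whitespace and without escaping non-ASCII text."""
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)
```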

_species_for_mygene(species)[source]

Return the MyGene-compatible common name for a canonical species code.

Parameters: species (str | None)
Return type: str

_suppress_stdout_stderr(enabled)[source]

Context manager to squelch noisy stdout/stderr emissions.

Parameters: enabled (bool)
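
One common way to build such a context manager is with contextlib, as in the sketch below. It only captures Python-level writes to sys.stdout and sys.stderr; output written directly to the file descriptors by native code would not be affected. Whether the library handles that case is not stated here.

```python
import contextlib
import io


@contextlib.contextmanager
def suppress_output_sketch(enabled: bool):
    """Discard anything written to sys.stdout/sys.stderr while the block runs."""
    if not enabled:
        yield
        return
    sink = io.StringIO()
    with contextlib.redirect_stdout(sink), contextlib.redirect_stderr(sink):
        yield
```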

_unique_not_null(seq)[source]

Return unique non-null string values from a sequence, preserving order.

Filters out:

None values
empty / whitespace-only strings
stringified null values ("nan", "none", "null"; case-insensitive)

Parameters: seq (Iterable[Any]) – Sequence of values to normalize and filter.
Returns: Unique non-null string values in first-seen order.
Return type: list[str]
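
The filtering rules above translate fairly directly into code. The sketch below assumes that values are converted with str() and stripped before comparison, which this reference does not state explicitly.

```python
def unique_not_null_sketch(seq) -> list:
    """Keep first-seen, non-empty string values, dropping None and null-like strings."""
    null_like = {"nan", "none", "null"}
    seen = set()
    result = []
    for value in seq:
        if value is None:
            continue
        text = str(value).strip()
        if not text or text.lower() in null_like or text in seen:
            continue
        seen.add(text)
        result.append(text)
    return result
```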

canonical_db(db)[source]

Return canonical DB key given a user-friendly/alias string.

Parameters: db (str)
Return type: str

canonical_species(species)[source]

Return canonical organism code (g:Profiler / Ensembl style).

Supported out-of-the-box: human → hsapiens, mouse → mmusculus, pig → sscrofa.

Parameters: species (str | None) – Species code/alias. If None or empty, defaults to "hsapiens".
Returns: Canonical organism code.
Return type: str

raise_missing_dependency(dep_key, feature=None, original_error=None)[source]

Raise a detailed error for a missing optional dependency.

Parameters:

dep_key (str) – Key in OPTIONAL_DEPENDENCIES (e.g. "gget", "mygene").
feature (str | None) – Description of the feature that requires the dependency.
original_error (BaseException | None) – Optional original ImportError to chain.

Raises:

RuntimeError – Always, with detailed installation instructions.

Return type:

NoReturn

strip_version(ididid)[source]

Strip version suffixes from Ensembl and RefSeq identifiers.

Parameters: ididid (str) – Identifier to strip.
Returns: Identifier without a version suffix (or unchanged if none is present).
Return type: str
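
Version stripping is usually a small regular-expression job. The patterns in the sketch below are assumptions about what counts as an Ensembl or RefSeq identifier; the library's actual rules may be broader or narrower.

```python
import re

# Assumed patterns: Ensembl stable IDs ("ENS" + optional letters + digits) and
# RefSeq accessions (two letters, underscore, digits), each followed by ".<version>".
_VERSIONED = re.compile(r"^(ENS[A-Z]*\d+|[A-Z]{2}_\d+)\.\d+$")


def strip_version_sketch(identifier: str) -> str:
    """Drop a trailing ".<digits>" from Ensembl/RefSeq-style IDs; leave others unchanged."""
    match = _VERSIONED.match(identifier.strip())
    return match.group(1) if match else identifier
```

Restricting the pattern to known prefixes avoids truncating identifiers from other namespaces that legitimately contain a dot.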

Constants and configuration for the _external_mappers module.

This module provides:

Database alias mappings to canonical database names
Species aliases and mappings for various backends
Backend-specific configuration for MyGene, pybiomart, g:Profiler, and gget
Ensembl archive host mappings by release number

Ontology

This section defines the vocabulary used throughout IDTrack’s tutorials and API reference. The goal is to make results interpretable without requiring graph theory or database background.

Ensembl release: A numbered snapshot of Ensembl reference data (e.g., release 114). In IDTrack, releases define the time axis.
Snapshot release (snapshot boundary): The release you choose as the upper time boundary for a graph snapshot. It makes conversions reproducible: the same inputs and the same snapshot boundary should yield the same outputs.
Backbone namespace: The Ensembl “core” identifier spaces that carry historical relationships across releases (e.g., Ensembl gene IDs). Backbone edges enable time-travel mapping across releases.
External namespace: A non-Ensembl identifier system (HGNC, EntrezGene, UniProtKB, RefSeq, MGI, …). External edges connect backbone nodes to external identifiers when enabled by configuration.
External YAML: A user-editable configuration file that declares which external namespaces are allowed to participate in mapping for a given organism/snapshot. It is an explicit contract that improves reproducibility and reduces accidental ambiguity.
Assembly: The genome build context for an organism (e.g., human GRCh38 vs GRCh37). Assemblies can affect which releases are reachable and which identifiers are valid.
Graph snapshot: A precomputed mapping graph built for a specific organism and snapshot release (and often multiple assemblies). It is stored on disk under your local repository so it can be reused across sessions.
Identifier drift: The fact that identifiers change across releases (retirements, merges, splits, version changes) and differ across databases. Drift is the core reason conversions must be time-aware.
1→0 / 1→1 / 1→n outcomes: The three conversion outcome families: no mapping, unique mapping, or ambiguous mapping with multiple valid targets. IDTrack reports these explicitly rather than silently forcing a single answer.
Strategy (best vs all): A conversion policy. strategy='best' returns a single preferred target; strategy='all' returns all plausible targets so you can handle ambiguity explicitly.
Explainability payload: Optional structured information returned alongside a conversion result that helps you audit why a mapping happened (paths, intermediate nodes, decisions).
Hyperconnected node: An identifier that connects to many other identifiers (common in some external namespaces). These can explode the search space; IDTrack uses safeguards so “promiscuous” nodes do not dominate results.
Local repository (cache directory): A writable directory where IDTrack stores downloads, derived tables, and graph snapshots. You can set it via the IDTRACK_LOCAL_REPO environment variable.