External Mappers

External mappers are optional helpers for cross-database mapping and orthology-related utilities. They are not the core “time-aware graph snapshot” engine (that lives in the backbone graph), but they can be useful for:

  • fetching or aligning ortholog tables for cross-species workflows

  • using external services to map between common identifier types when you need a convenience layer

Note

Some external mapper backends require optional dependencies (or network access). The Part 6 tutorial shows how to check availability before you rely on them.

Package API

External ID mapping backends for idtrack.

This module provides interfaces to external ID mapping services:

  • g:Profiler (gprofiler-official)

  • MyGene.info (mygene)

  • Ensembl BioMart (pybiomart)

  • gget (Ensembl REST API)

Ortholog utilities are also available; they require gget and biopython.

Note: This module requires optional dependencies that are not installed with the core idtrack package. Install them with:

pip install gget mygene pybiomart gprofiler-official biopython

Or install only the backends you need.

check_optional_dependencies(warn=True)[source]

Check which optional dependencies are installed.

Parameters:

warn (bool) – When True, emit a warning summarizing missing packages.

Returns:

Mapping from dependency key to availability.

Return type:

dict[str, bool]
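
A check of this kind can be sketched with the standard library alone. The function name and the key-to-module table below are assumptions for illustration; the real OPTIONAL_DEPENDENCIES table in idtrack may be organized differently.

```python
import importlib.util
import warnings

# Hypothetical mapping from a dependency key to its importable module name.
# The keys mirror the packages named in the install command above.
ASSUMED_MODULES = {
    "gget": "gget",
    "mygene": "mygene",
    "pybiomart": "pybiomart",
    "gprofiler-official": "gprofiler",
    "biopython": "Bio",
}


def sketch_check_dependencies(modules=ASSUMED_MODULES, warn=True):
    """Return {key: bool} saying whether each module can be located."""
    status = {
        key: importlib.util.find_spec(module) is not None
        for key, module in modules.items()
    }
    missing = sorted(key for key, ok in status.items() if not ok)
    if warn and missing:
        warnings.warn("Missing optional packages: " + ", ".join(missing))
    return status
```

Using find_spec avoids importing heavy packages just to learn whether they are installed.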

convert_ids(ids, input_db, output_db, method, species, drop_metadata_json_column=True, chunk_size=1000, pause=0.2, max_retries=3, strip_versions=True, release_for_pybiomart=None, strict_input_db_gprofiler=True, suppress_method_verbosity=True, verbose=2)[source]

Convert identifiers using an external mapper backend.

Parameters:
  • ids (Iterable[str]) – Input identifiers to map.

  • input_db (str) – Source database type.

  • output_db (str) – Target database type.

  • method (str) – Backend method name (one of SUPPORTED_METHODS).

  • species (str) – Species code (e.g. "hsapiens").

  • drop_metadata_json_column (bool) – If True, drop the metadata_json column from the returned DataFrame.

  • chunk_size (int) – Number of IDs per API request.

  • pause (float) – Pause in seconds between requests.

  • max_retries (int) – Maximum retry attempts per chunk on failure (for backends that support it).

  • strip_versions (bool) – Strip version suffixes from Ensembl/RefSeq IDs.

  • release_for_pybiomart (str | int | None) – Ensembl release/key for the pybiomart backend. Must be None unless method="pybiomart".

  • strict_input_db_gprofiler (bool) – If True, enforce strict input-db filtering in the gprofiler backend.

  • suppress_method_verbosity (bool) – Suppress stdout/stderr from the underlying backend library.

  • verbose (int | str | bool) – Verbosity level (1/2/3) or string alias ("error", "warning", "info", "debug").

Returns:

Standardized mapping DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If method/verbose is invalid, or if release_for_pybiomart is used with a non-pybiomart backend.
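
The argument rules stated under Raises can be illustrated with a small validation sketch. The function name and the contents of the method tuple are assumptions inferred from the four documented backends; the real SUPPORTED_METHODS constant may differ.

```python
# Assumed contents of SUPPORTED_METHODS, inferred from the documented backends.
ASSUMED_SUPPORTED_METHODS = ("gprofiler", "mygene", "pybiomart", "gget")


def sketch_validate_arguments(method, release_for_pybiomart=None):
    """Raise ValueError for the argument combinations the reference rules out."""
    if method not in ASSUMED_SUPPORTED_METHODS:
        raise ValueError(
            f"Unknown method {method!r}; expected one of {ASSUMED_SUPPORTED_METHODS}"
        )
    if release_for_pybiomart is not None and method != "pybiomart":
        raise ValueError(
            "release_for_pybiomart is only valid with method='pybiomart'"
        )
    return method
```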

Orthology helpers

class AlignmentScores(alignment_length, identity_fraction, positive_fraction, very_negative_fraction, gap_fraction_query, gap_fraction_target, gap_openings_query, gap_openings_target, seq1_coverage, seq2_coverage, blosum62_sum, blosum62_mean, composition_l2_distance)[source]

Bases: object

Alignment-derived scalar metrics for a pairwise protein alignment.

Parameters:
  • alignment_length (int)

  • identity_fraction (float)

  • positive_fraction (float)

  • very_negative_fraction (float)

  • gap_fraction_query (float)

  • gap_fraction_target (float)

  • gap_openings_query (int)

  • gap_openings_target (int)

  • seq1_coverage (float)

  • seq2_coverage (float)

  • blosum62_sum (float)

  • blosum62_mean (float)

  • composition_l2_distance (float)

class EmbeddingFeatures(model_name, dim, cosine_similarity, euclidean_distance, diff_embedding)[source]

Bases: object

Embedding-derived similarity features for two protein sequences.

Parameters:
  • model_name (str)

  • dim (int)

  • cosine_similarity (float)

  • euclidean_distance (float)

  • diff_embedding (ndarray)

_canonical_from_alias(name)[source]

Return canonical short-code for a species alias.

Maps strings like "human", "Homo sapiens", "sus_scrofa" to canonical short codes (hsapiens, mmusculus, sscrofa). Unknown inputs are returned unchanged (lowercased/stripped).

Parameters:

name (str) – Species name or alias.

Returns:

Canonical short code (or a normalized fallback if unknown).

Return type:

str
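
The lookup-with-fallback behavior described above might look like the following sketch. The alias table is a small illustrative subset and the function name is hypothetical; the real _SPECIES_ALIASES mapping is not shown in this reference.

```python
# Illustrative alias table (assumption); the real mapping may be larger.
ASSUMED_ALIASES = {
    "human": "hsapiens",
    "homo sapiens": "hsapiens",
    "homo_sapiens": "hsapiens",
    "mouse": "mmusculus",
    "mus musculus": "mmusculus",
    "pig": "sscrofa",
    "sus_scrofa": "sscrofa",
}


def sketch_canonical_from_alias(name):
    """Map an alias to a short code; unknown names come back normalized."""
    key = name.strip().lower()
    return ASSUMED_ALIASES.get(key, key)
```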

_get_blosum62()[source]

Lazily load BLOSUM62 matrix.

_require_ortholog_deps()[source]

Lazily import gget and biopython; raise helpful error if missing.

_species_to_genus_species(name)[source]

Convert a species string to (canonical_code, genus, species) for Bgee.

Inputs may be any key in _SPECIES_ALIASES or any canonical key in _SPECIES_CANONICAL_TO_BGEENAMES. Unknown canonical codes raise a helpful error.

Parameters:

name (str) – Species name or alias.

Returns:

(canonical_code, genus, species) for Bgee.

Return type:

tuple[str, str, str]

Raises:

ValueError – If the species cannot be resolved to a supported canonical code.

align_ortholog_pair_with_features(query_ensembl_id, target_species, *, use_super5=False, embedding_model_name='facebook/esm2_t12_35M_UR50D', embedding_device='cpu', embedding_revision='main', verbose=True)[source]

Compute ortholog alignment features for a query gene and target species.

Parameters:
  • query_ensembl_id (str) – Ensembl gene ID in the query organism.

  • target_species (str) – Target species alias, interpreted via _SPECIES_ALIASES.

  • use_super5 (bool) – Passed through to gget.muscle(super5=...).

  • embedding_model_name (str | None) – Hugging Face model name for embeddings. Set to None to disable embeddings.

  • embedding_device (str) – Device for the transformer model (e.g. "cpu", "cuda").

  • embedding_revision (str) – Model revision for from_pretrained (pin to a commit hash for reproducibility).

  • verbose (bool) – Print progress/warnings.

Returns:

Mapping from target ortholog Ensembl IDs to feature dictionaries.

Raises:

ValueError – If no orthologs are available for the query/target species pair.

Return type:

dict[str, dict[str, Any]]

compute_alignment_scores(seq1, seq2, aligned1, aligned2)[source]

Compute alignment-derived scores and the AA composition-difference vector.

Parameters:
  • seq1 (str)

  • seq2 (str)

  • aligned1 (str)

  • aligned2 (str)

Return type:

tuple[AlignmentScores, ndarray]
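
To make a few of the AlignmentScores fields concrete, here is a minimal sketch that derives them from two gapped rows of equal length. The exact definitions used by idtrack are not given in this reference, so the formulas below (fractions taken over alignment length, a gap opening counted at the start of each gap run) are assumptions, and the function name is hypothetical.

```python
def sketch_alignment_metrics(aligned1, aligned2):
    """Compute simple column-wise metrics for a pairwise alignment."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    length = len(aligned1)

    # A column is identical when both rows carry the same residue (not a gap).
    identical = sum(a == b and a != "-" for a, b in zip(aligned1, aligned2))

    def gap_openings(row):
        # Count positions where a run of '-' characters starts.
        return sum(
            1
            for i, c in enumerate(row)
            if c == "-" and (i == 0 or row[i - 1] != "-")
        )

    def fraction(count):
        return count / length if length else 0.0

    return {
        "alignment_length": length,
        "identity_fraction": fraction(identical),
        "gap_fraction_query": fraction(aligned1.count("-")),
        "gap_fraction_target": fraction(aligned2.count("-")),
        "gap_openings_query": gap_openings(aligned1),
        "gap_openings_target": gap_openings(aligned2),
    }
```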

compute_embedding_features(seq1, seq2, *, model_name, device='cpu', revision='main')[source]

Compute embedding-based similarity features for two sequences.

Parameters:
  • seq1 (str) – First protein sequence.

  • seq2 (str) – Second protein sequence.

  • model_name (str) – Hugging Face model name for embeddings.

  • device (str) – Device for the transformer model (e.g. "cpu", "cuda").

  • revision (str) – Model revision for from_pretrained.

Returns:

Scalar similarities plus the difference vector.

Return type:

EmbeddingFeatures
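
The scalar fields of EmbeddingFeatures follow standard definitions, sketched here in pure Python for two already-computed embedding vectors. The function name is hypothetical, and the sign convention of the difference vector (first minus second) is an assumption.

```python
import math


def sketch_embedding_features(vec1, vec2):
    """Cosine similarity, Euclidean distance, and difference of two vectors."""
    if len(vec1) != len(vec2):
        raise ValueError("embeddings must have the same dimension")
    diff = [a - b for a, b in zip(vec1, vec2)]  # assumed direction: vec1 - vec2
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    cosine = dot / (norm1 * norm2) if norm1 and norm2 else 0.0
    return {
        "dim": len(vec1),
        "cosine_similarity": cosine,
        "euclidean_distance": math.sqrt(sum(d * d for d in diff)),
        "diff_embedding": diff,
    }
```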

fetch_aa_sequence(ensembl_id)[source]

Fetch amino-acid sequence for a gene using gget.seq(translate=True).

Parameters:

ensembl_id (str)

Return type:

str

get_ortholog_ids_for_species(ortholog_df, target_species)[source]

Return all ortholog Ensembl IDs in ortholog_df for a target species.

Parameters:
  • ortholog_df (DataFrame)

  • target_species (str)

Return type:

list[str]

get_ortholog_table(query_ensembl_id, *, verbose=True)[source]

Return Bgee ortholog table for a query Ensembl gene ID via gget.bgee.

Parameters:
  • query_ensembl_id (str)

  • verbose (bool)

Return type:

DataFrame

get_protein_embedding(sequence, *, model_name, device='cpu', revision='main', max_len=1022)[source]

Return a pooled protein embedding for a sequence via a transformer model.

The embedding is obtained by mean pooling the per-residue representations.

Parameters:
  • sequence (str) – Protein sequence (amino-acid string).

  • model_name (str) – Hugging Face model name for the embedding model.

  • device (str) – Device for the transformer model (e.g. "cpu", "cuda").

  • revision (str) – Model revision for from_pretrained.

  • max_len (int) – Maximum tokenized sequence length (including special tokens).

Returns:

1D embedding vector.

Return type:

np.ndarray
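
Mean pooling itself is a simple column-wise average over the per-residue vectors. The sketch below shows it on plain Python lists; the function name is hypothetical, and whether special tokens are excluded before pooling in idtrack is not stated here.

```python
def sketch_mean_pool(per_residue):
    """Average a list of equal-length per-residue vectors into one vector."""
    if not per_residue:
        raise ValueError("need at least one residue vector")
    count = len(per_residue)
    # zip(*rows) iterates over columns, i.e. one embedding dimension at a time.
    return [sum(column) / count for column in zip(*per_residue)]
```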

parse_clustal_alignment(clustal_text)[source]

Parse a minimal ClustalW/MUSCLE text alignment.

Parameters:

clustal_text (str) – Raw ClustalW/MUSCLE text output.

Returns:

Mapping of {sequence_name: aligned_sequence}.

Return type:

dict[str, str]
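
A minimal parser for this kind of output can be sketched as follows. It assumes the common Clustal layout (a header line, blocks of "name sequence" rows, and a consensus line that starts with whitespace); the function name is hypothetical and the real parser may handle more variants.

```python
EXAMPLE_CLUSTAL = """CLUSTAL W (1.83) multiple sequence alignment

seqA    MK-TAY
seqB    MKLT--
        ** *

seqA    GG
seqB    GA
        *
"""


def sketch_parse_clustal(clustal_text):
    """Concatenate the per-block chunks of each named sequence."""
    sequences = {}
    for line in clustal_text.splitlines():
        # Skip blank lines and consensus lines (which begin with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        # Skip the program header line.
        if line.upper().startswith(("CLUSTAL", "MUSCLE")):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, chunk = parts[0], parts[1]
        sequences[name] = sequences.get(name, "") + chunk
    return sequences
```

Because blocks are concatenated in order, both rows end up with the same total length, which is what downstream alignment metrics expect.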

pick_ortholog_for_species(ortholog_df, target_species)[source]

Backward-compatible helper that returns only the first ortholog ID.

Parameters:
  • ortholog_df (pd.DataFrame)

  • target_species (str)

Return type:

str | None

run_muscle_pairwise(seq1, seq2, *, name1, name2, super5=False)[source]

Align two AA sequences via gget.muscle.

Parameters:
  • seq1 (str) – Amino-acid sequence for the first protein.

  • seq2 (str) – Amino-acid sequence for the second protein.

  • name1 (str) – Sequence name for the first protein (used in FASTA headers).

  • name2 (str) – Sequence name for the second protein (used in FASTA headers).

  • super5 (bool) – Whether to enable MUSCLE super5 mode.

Returns:

(aligned_seq1, aligned_seq2, raw_clustal_text).

Return type:

tuple[str, str, str]

Raises:

RuntimeError – If MUSCLE output cannot be parsed or does not contain the expected sequence names.

Backend implementations

Ensembl BioMart backend for ID mapping.

This module provides the map_with_pybiomart() function for querying Ensembl BioMart to convert biological identifiers. Supports historical Ensembl releases via archive hosts.

_biomart_dataset_for_species(species, explicit=None)[source]

Return the Ensembl BioMart dataset name for the given species.

Parameters:
  • species (str)

  • explicit (str | None)

Return type:

str

_bm_list_attribute_names(ds)[source]

Return a list of attribute names for a pybiomart Dataset.

Return type:

list[str]

_bm_list_filter_names(ds)[source]

Return a list of filter names for a pybiomart Dataset.

Return type:

list[str]

_bm_pick_attribute(canonical_db_name, available_attrs)[source]

Choose a BioMart attribute name for a canonical DB.

The helper first tries explicit candidates from _BM_ATTR_CANDIDATES and falls back to fuzzy matching on common substrings.

Parameters:
  • canonical_db_name (str) – Canonical database key (see canonical_db()).

  • available_attrs (list[str]) – Attribute names provided by the BioMart dataset.

Returns:

Selected attribute name.

Return type:

str

Raises:

RuntimeError – If no compatible attribute is available on the dataset.

_bm_pick_filter(canonical_db_name, attr_name, available_filters)[source]

Choose a BioMart filter name.

The selection depends on the canonical database and chosen attribute.

Parameters:
  • canonical_db_name (str) – Canonical database key for the input IDs.

  • attr_name (str) – Attribute name chosen for the input IDs.

  • available_filters (list[str]) – Filter names provided by the BioMart dataset.

Returns:

Selected filter name.

Return type:

str

Raises:

RuntimeError – If no compatible filter is available on the dataset.

_ensembl_archive_host_for_release(release)[source]

Resolve an Ensembl release or key to an archive host.

Accepts an integer release (e.g. 104) or a special string key (e.g. "GRCh37") and maps it to an archive host such as "may2021.archive.ensembl.org".

Parameters:

release (int | str | None) – Ensembl release number or key (e.g. 104, "v104", "GRCh37"), or None.

Returns:

Archive host for the requested release, or None if unknown.

Return type:

str | None
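
The resolution logic can be pictured as a normalized dictionary lookup. The table below holds only two illustrative entries and the function name is hypothetical; the actual release-to-host table in idtrack is larger and is not reproduced in this reference.

```python
import re

# Illustrative subset only (assumption); not the real table.
ASSUMED_ARCHIVE_HOSTS = {
    "104": "may2021.archive.ensembl.org",
    "grch37": "grch37.ensembl.org",
}


def sketch_archive_host(release):
    """Resolve 104, "104", "v104", or "GRCh37" to a host, else None."""
    if release is None:
        return None
    key = str(release).strip().lower()
    match = re.fullmatch(r"v?(\d+)", key)
    if match:
        key = match.group(1)  # drop an optional leading "v"
    return ASSUMED_ARCHIVE_HOSTS.get(key)
```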

_normalize_biomart_host(host)[source]

Normalize an Ensembl BioMart host for pybiomart.

Examples of valid outputs:

"http://www.ensembl.org", "http://nov2020.archive.ensembl.org", "http://grch37.ensembl.org"

Parameters:

host (str | None) – Hostname or URL (scheme optional). If None, defaults to "http://www.ensembl.org".

Returns:

Normalized base URL suitable for pybiomart.

Return type:

str
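
A sketch of such a normalization, under the assumption that it only needs to supply a default, add a missing scheme, and drop a trailing slash. The function name is hypothetical, and an existing scheme is kept as given, which may differ from the real helper.

```python
def sketch_normalize_host(host=None):
    """Return a base URL with a scheme and no trailing slash."""
    if host is None or not host.strip():
        return "http://www.ensembl.org"
    host = host.strip().rstrip("/")
    if "://" not in host:
        host = "http://" + host
    return host
```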

map_with_pybiomart(ids, input_db, output_db, *, species='hsapiens', chunk_size=1000, pause=0.2, strip_versions=True, release=None, show_progress=True, suppress_method_verbosity=True)[source]

Map identifiers using Ensembl BioMart via pybiomart.

Note: BioMart can only filter by Ensembl IDs (gene, transcript, protein). Other ID types can be used as output_db but not input_db.

Parameters:
  • ids (Iterable[str]) – Input Ensembl identifiers to map.

  • input_db (str) – Source database type. Must be one of "ensembl_gene", "ensembl_transcript", or "ensembl_protein".

  • output_db (str) – Target database type (e.g. "hgnc_symbol", "uniprot", "entrez_gene").

  • species (str) – Species code (e.g. "hsapiens", "mmusculus", "sscrofa").

  • chunk_size (int) – Number of IDs per BioMart query.

  • pause (float) – Pause in seconds between queries.

  • strip_versions (bool) – Strip version suffixes from Ensembl/RefSeq IDs.

  • release (str | int | None) – Ensembl release number (e.g. 104) or special key (e.g. "grch37"). If None, uses the current Ensembl release.

  • show_progress (bool) – Display progress bar.

  • suppress_method_verbosity (bool) – Suppress stdout/stderr from pybiomart.

Returns:

Standardized mapping DataFrame.

Return type:

pd.DataFrame

Raises:
  • RuntimeError – If the BioMart connection fails or required dataset metadata cannot be retrieved.

  • ValueError – If input_db is not an Ensembl type.

MyGene.info backend for ID mapping.

This module provides the map_with_mygene() function for querying the MyGene.info API to convert biological identifiers.

_mg_extract(rec, target)[source]

Extract target identifiers from a MyGene.info record.

Parameters:
  • rec (dict[str, Any]) – One record from the MyGene.info querymany response.

  • target (str) – Canonical target database name (e.g. "hgnc_symbol", "uniprot").

Returns:

Extracted target identifiers (may be empty).

Return type:

list[str]

map_with_mygene(ids, input_db, output_db, *, species='hsapiens', chunk_size=1000, pause=0.2, max_retries=3, strip_versions=True, show_progress=True, suppress_method_verbosity=True)[source]

Map identifiers using the MyGene.info API.

Parameters:
  • ids (Iterable[str]) – Input identifiers to map.

  • input_db (str) – Source database type (e.g. "ensembl_gene", "hgnc_symbol", "entrez_gene").

  • output_db (str) – Target database type (e.g. "uniprot", "hgnc_symbol", "entrez_gene").

  • species (str) – Species code (e.g. "hsapiens", "mmusculus", "sscrofa").

  • chunk_size (int) – Number of IDs per API request.

  • pause (float) – Pause in seconds between API requests.

  • max_retries (int) – Maximum retry attempts per chunk on failure.

  • strip_versions (bool) – Strip version suffixes from Ensembl/RefSeq IDs.

  • show_progress (bool) – Display progress bar.

  • suppress_method_verbosity (bool) – Suppress stdout/stderr from the mygene library.

Returns:

Standardized mapping DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If input_db is not supported by MyGene.info.
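
The interplay of chunk_size, pause, and max_retries can be sketched with a generic loop around any per-chunk fetch function. Everything here is illustrative: fetch_chunk stands in for a backend call, the function name is hypothetical, and re-raising after the last attempt is an assumption about failure handling.

```python
import time


def sketch_map_in_chunks(ids, fetch_chunk, chunk_size=1000, pause=0.2, max_retries=3):
    """Call fetch_chunk on successive slices, retrying each slice on error."""
    ids = list(ids)
    results = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        for attempt in range(1, max_retries + 1):
            try:
                results.extend(fetch_chunk(chunk))
                break
            except Exception:
                if attempt == max_retries:
                    raise  # assumption: give up after the final attempt
                time.sleep(pause)
        time.sleep(pause)  # be polite to the remote service between chunks
    return results
```

Smaller chunks mean more requests but less work lost when a single request fails.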

_build_metadata_column(df, extra_cols)[source]

Build a metadata_json column from extra columns.

Parameters:
  • df (DataFrame) – Input DataFrame containing columns referenced by extra_cols.

  • extra_cols (list[str]) – Column names to include in the JSON metadata per row.

Returns:

JSON-encoded metadata aligned with df.index.

Return type:

pd.Series

_extract_namespace_tokens(raw)[source]

Normalize the g:Profiler namespaces field into uppercase tokens.

Parameters:

raw (Any)

Return type:

set[str]

_gp_target_candidates(outp)[source]

Return g:Profiler target_namespace candidates for a canonical output database.

The returned list is ordered by preference (first hit wins).

Parameters:

outp (str) – Canonical output database name.

Returns:

Target namespace candidates in preference order.

Return type:

list[str]

_process_gprofiler_response(df, namespace_filter)[source]

Process a g:Profiler convert() response into standardized format.

Parameters:
  • df (pd.DataFrame | None) – Raw g:Profiler response DataFrame (or None).

  • namespace_filter (Callable[[Any], bool] | None) – Optional predicate to filter the namespaces column when enforcing strict input-db behavior.

Returns:

(processed_df, has_non_null_outputs).

Return type:

tuple[pd.DataFrame, bool]

Raises:

RuntimeError – If namespace_filter is provided but the response lacks a namespaces column.

map_with_gprofiler(ids, input_db, output_db, *, species='hsapiens', chunk_size=1000, pause=0.2, max_retries=3, strip_versions=True, show_progress=True, suppress_method_verbosity=True, strict_input_db=False)[source]

Map IDs via g:Profiler (gprofiler-official).

Parameters:
  • ids (Iterable[str]) – Input identifiers to map.

  • input_db (str) – Database/namespace of input IDs (e.g. "ensembl_gene", "hgnc_symbol").

  • output_db (str) – Target database/namespace (e.g. "uniprot", "entrez_gene").

  • species (str) – Species code (e.g. "hsapiens", "mmusculus", "sscrofa").

  • chunk_size (int) – Number of IDs per API request.

  • pause (float) – Seconds to pause between requests.

  • max_retries (int) – Maximum retry attempts per chunk on failure.

  • strip_versions (bool) – Strip version suffixes from Ensembl/RefSeq IDs.

  • show_progress (bool) – Display progress bar.

  • suppress_method_verbosity (bool) – Suppress stdout/stderr from the gprofiler library.

  • strict_input_db (bool) – If True, filter results to only include mappings where the input namespace matches the expected input_db.

Returns:

Standardized mapping DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If strict_input_db is enabled for an unsupported input database.

gget backend for ID mapping.

This module provides the map_with_gget() function for querying the gget.info API (Ensembl REST-backed) to convert biological identifiers.

_gget_extract(df, outp)[source]

Normalize gget.info() output into standardized (input_id, output_id) format.

Parameters:
  • df (DataFrame) – Raw DataFrame from gget.info().

  • outp (str) – Target database type.

Returns:

DataFrame with input_id and output_id columns.

Return type:

DataFrame

map_with_gget(ids, input_db, output_db, *, species='hsapiens', chunk_size=1000, pause=0.2, max_retries=3, strip_versions=True, show_progress=True, suppress_method_verbosity=True)[source]

Map identifiers using gget.info (Ensembl REST API-backed).

Note: gget is Ensembl-centric, so input_db must be an Ensembl ID type. For non-Ensembl inputs, use the "mygene" or "gprofiler" method.

Parameters:
  • ids (Iterable[str]) – Input Ensembl identifiers to map.

  • input_db (str) – Source database type. Must be one of "ensembl_gene", "ensembl_transcript", or "ensembl_protein".

  • output_db (str) – Target database type (e.g. "hgnc_symbol", "uniprot", "entrez_gene").

  • species (str) – Species code (e.g. "hsapiens", "mmusculus", "sscrofa").

  • chunk_size (int) – Number of IDs per API request.

  • pause (float) – Pause in seconds between requests.

  • max_retries (int) – Maximum retry attempts per chunk on failure.

  • strip_versions (bool) – Strip version suffixes from Ensembl/RefSeq IDs.

  • show_progress (bool) – Display progress bar.

  • suppress_method_verbosity (bool) – Suppress stdout/stderr from gget.

Returns:

Standardized mapping DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If input_db is not an Ensembl type.

Shared utilities and constants

Utility functions for the _external_mappers module.

This module provides:

  • Database and species name canonicalization

  • ID version stripping for Ensembl and RefSeq identifiers

  • DataFrame utilities for standardizing output format

  • Helper functions for chunking, JSON serialization, etc.

_add_mapping_column(df)[source]

Add or recompute the mapping column on a standardized mapping DataFrame.

The mapping value is the per-input cardinality: 1:0 (no outputs), 1:1 (one unique output), 1:n (multiple outputs).

Parameters:

df (DataFrame) – Standardized mapping DataFrame.

Returns:

DataFrame with a (re)computed mapping column.

Return type:

pd.DataFrame
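
The cardinality rule can be illustrated without pandas by working on (input_id, output_id) pairs. This is a sketch of the labeling logic only; the function name is hypothetical and the real helper operates on a DataFrame.

```python
def sketch_mapping_labels(pairs):
    """Label each input ID with its cardinality: 1:0, 1:1, or 1:n.

    pairs is an iterable of (input_id, output_id) tuples, where output_id
    is None when nothing was found for that input.
    """
    outputs = {}
    for input_id, output_id in pairs:
        found = outputs.setdefault(input_id, set())
        if output_id is not None:
            found.add(output_id)

    labels = {}
    for input_id, found in outputs.items():
        if not found:
            labels[input_id] = "1:0"
        elif len(found) == 1:
            labels[input_id] = "1:1"
        else:
            labels[input_id] = "1:n"
    return labels
```

Counting unique outputs means a repeated identical row does not turn a 1:1 mapping into 1:n.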

_as_list(v)[source]

Coerce a value to a list.

  • None returns []

  • list/tuple/set returns list(v)

  • scalar values return [v]

Parameters:

v (Any) – Value to coerce.

Returns:

v represented as a list.

Return type:

list[Any]
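
The three rules translate almost directly into code. This sketch uses a hypothetical name and treats strings as scalars, which is what the rules imply because str is not among the listed container types.

```python
def sketch_as_list(v):
    """Coerce None, a container, or a scalar into a list."""
    if v is None:
        return []
    if isinstance(v, (list, tuple, set)):
        return list(v)
    return [v]
```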

_chunker(items, size)[source]

Yield successive chunks of a list.

Parameters:
  • items (list[Any])

  • size (int)

Return type:

Iterator[list[Any]]
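
A chunking generator of this shape is commonly written with slicing, as in the sketch below. The function name and the guard against non-positive sizes are assumptions.

```python
def sketch_chunker(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```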

_empty_result()[source]

Return an empty standardized mapping result DataFrame.

Return type:

DataFrame

_ensure_all_inputs(df, original_inputs, inp, outp, method, release_used)[source]

Ensure each input appears at least once in the output.

Missing inputs are appended with output_id=None. Input order is preserved and the mapping column is (re)computed.

Parameters:
  • df (pd.DataFrame) – Partially populated standardized mapping DataFrame.

  • original_inputs (list[str]) – Original input identifiers (order is preserved).

  • inp (str) – Canonical input database key.

  • outp (str) – Canonical output database key.

  • method (str) – Backend method name.

  • release_used (str | None) – Backend-provided release/host label (if any).

Returns:

Standardized mapping DataFrame containing at least one row per input.

Return type:

pd.DataFrame
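
The fill-in step can be sketched on plain tuples: group the rows by input, then walk the original inputs in order and emit a placeholder row for any input with no result. The function name is hypothetical, and emitting repeated inputs only once is an assumption.

```python
def sketch_ensure_all_inputs(rows, original_inputs):
    """Return (input_id, output_id) rows covering every original input.

    Rows follow the order of original_inputs; inputs without any result
    get a single row with output_id set to None.
    """
    by_input = {}
    for input_id, output_id in rows:
        by_input.setdefault(input_id, []).append(output_id)

    completed = []
    seen = set()
    for input_id in original_inputs:
        if input_id in seen:
            continue  # assumption: repeated inputs are emitted once
        seen.add(input_id)
        for output_id in by_input.get(input_id, [None]):
            completed.append((input_id, output_id))
    return completed
```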

_is_bare_numeric(s)[source]

Return True if a string consists entirely of digits.

Parameters:

s (str)

Return type:

bool

_json(obj)[source]

Serialize an object to a compact JSON string (unicode preserved).

Parameters:

obj (Any)

Return type:

str
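
With the standard json module, compact output that preserves non-ASCII characters is typically obtained as in this sketch. The function name is hypothetical, and the exact options used by idtrack are an assumption.

```python
import json


def sketch_compact_json(obj):
    """Serialize without extra whitespace and without escaping non-ASCII."""
    return json.dumps(obj, ensure_ascii=False, separators=(",", ":"))
```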

_species_for_mygene(species)[source]

Return the MyGene-compatible common name for a canonical species code.

Parameters:

species (str | None)

Return type:

str

_suppress_stdout_stderr(enabled)[source]

Context manager to squelch noisy stdout/stderr emissions.

Parameters:

enabled (bool)
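
One way to build such a context manager is with contextlib, as sketched below under a hypothetical name. This version silences only Python-level writes to sys.stdout and sys.stderr; output written directly to the underlying file descriptors by native code would need a different approach, and which one idtrack uses is not stated here.

```python
import contextlib
import os


@contextlib.contextmanager
def sketch_suppress_output(enabled):
    """Redirect stdout and stderr to the null device while enabled is True."""
    if not enabled:
        yield
        return
    with open(os.devnull, "w") as sink:
        with contextlib.redirect_stdout(sink), contextlib.redirect_stderr(sink):
            yield
```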

_unique_not_null(seq)[source]

Return unique non-null string values from a sequence, preserving order.

Filters out:

  • None values

  • empty / whitespace-only strings

  • stringified null values ("nan", "none", "null"; case-insensitive)

Parameters:

seq (Iterable[Any]) – Sequence of values to normalize and filter.

Returns:

Unique non-null string values in first-seen order.

Return type:

list[str]
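
The filtering rules can be sketched as a single pass with a seen-set. The function name is hypothetical, and returning the stripped string form of each value is an assumption.

```python
def sketch_unique_not_null(seq):
    """Return unique, non-null string values in first-seen order."""
    null_tokens = {"nan", "none", "null"}
    seen = set()
    result = []
    for value in seq:
        if value is None:
            continue
        text = str(value).strip()
        if not text or text.lower() in null_tokens:
            continue
        if text not in seen:
            seen.add(text)
            result.append(text)
    return result
```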

canonical_db(db)[source]

Return canonical DB key given a user-friendly/alias string.

Parameters:

db (str)

Return type:

str

canonical_species(species)[source]

Return canonical organism code (g:Profiler / Ensembl style).

Supported out-of-the-box: human → hsapiens, mouse → mmusculus, pig → sscrofa.

Parameters:

species (str | None) – Species code/alias. If None or empty, defaults to "hsapiens".

Returns:

Canonical organism code.

Return type:

str

raise_missing_dependency(dep_key, feature=None, original_error=None)[source]

Raise a detailed error for a missing optional dependency.

Parameters:
  • dep_key (str) – Key in OPTIONAL_DEPENDENCIES (e.g. "gget", "mygene").

  • feature (str | None) – Description of the feature that requires the dependency.

  • original_error (BaseException | None) – Optional original ImportError to chain.

Raises:

RuntimeError – Always, with detailed installation instructions.

Return type:

NoReturn

strip_version(ididid)[source]

Strip version suffixes from Ensembl and RefSeq identifiers.

Parameters:

ididid (str) – Identifier to strip.

Returns:

Identifier without a version suffix (or unchanged if none is present).

Return type:

str
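
Version stripping is usually a small regular expression. The pattern below is an assumption that covers typical Ensembl stable IDs (such as ENSG00000139618.15) and RefSeq accessions (such as NM_000059.4); the function name is hypothetical and the real helper may recognize more forms.

```python
import re

# Assumed pattern: an Ensembl-style or RefSeq-style accession followed by
# a dot and a numeric version.
_VERSION_SUFFIX = re.compile(r"^((?:ENS[A-Z]*\d+)|(?:[A-Z]{2}_\d+))\.\d+$")


def sketch_strip_version(identifier):
    """Drop a trailing .N version; return other identifiers unchanged."""
    match = _VERSION_SUFFIX.match(identifier)
    return match.group(1) if match else identifier
```

Anchoring the pattern to known accession shapes keeps identifiers that merely contain a dot from being truncated.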

Constants and configuration for the _external_mappers module.

This module provides:

  • Database alias mappings to canonical database names

  • Species aliases and mappings for various backends

  • Backend-specific configuration for MyGene, pybiomart, g:Profiler, and gget

  • Ensembl archive host mappings by release number

Ontology

This section defines the vocabulary used throughout IDTrack’s tutorials and API reference. The goal is to make results interpretable without requiring graph theory or database background.

Ensembl release

A numbered snapshot of Ensembl reference data (e.g., release 114). In IDTrack, releases define the time axis.

Snapshot release (snapshot boundary)

The release you choose as the upper time boundary for a graph snapshot. It makes conversions reproducible: the same inputs and the same snapshot boundary should yield the same outputs.

Backbone namespace

The Ensembl “core” identifier spaces that carry historical relationships across releases (e.g., Ensembl gene IDs). Backbone edges enable time-travel mapping across releases.

External namespace

A non-Ensembl identifier system (HGNC, EntrezGene, UniProtKB, RefSeq, MGI, …). External edges connect backbone nodes to external identifiers when enabled by configuration.

External YAML

A user-editable configuration file that declares which external namespaces are allowed to participate in mapping for a given organism/snapshot. It is an explicit contract that improves reproducibility and reduces accidental ambiguity.

Assembly

The genome build context for an organism (e.g., human GRCh38 vs GRCh37). Assemblies can affect which releases are reachable and which identifiers are valid.

Graph snapshot

A precomputed mapping graph built for a specific organism and snapshot release (and often multiple assemblies). It is stored on disk under your local repository so it can be reused across sessions.

Identifier drift

The fact that identifiers change across releases (retirements, merges, splits, version changes) and differ across databases. Drift is the core reason conversions must be time-aware.

1→0 / 1→1 / 1→n outcomes

The three conversion outcome families: no mapping, unique mapping, or ambiguous mapping with multiple valid targets. IDTrack reports these explicitly rather than silently forcing a single answer.

Strategy (best vs all)

A conversion policy. strategy='best' returns a single preferred target; strategy='all' returns all plausible targets so you can handle ambiguity explicitly.

Explainability payload

Optional structured information returned alongside a conversion result that helps you audit why a mapping happened (paths, intermediate nodes, decisions).

Hyperconnected node

An identifier that connects to many other identifiers (common in some external namespaces). These can explode the search space; IDTrack uses safeguards so “promiscuous” nodes do not dominate results.

Local repository (cache directory)

A writable directory where IDTrack stores downloads, derived tables, and graph snapshots. You can set it via the IDTRACK_LOCAL_REPO environment variable.