Reference

class API(local_repository)[source]

Bases: object

Provide a high-level façade for building graphs and converting biological identifiers with IDTrack.

This class centralises common workflows so users can quickly initialise the underlying graph (for a chosen organism and Ensembl release), configure logging, and run identifier-related operations. Internally it delegates to lower-level components such as idtrack.DatabaseManager for data access and idtrack.Track (or idtrack.TrackTests) for graph traversal and matching. It is intended as the primary entry point for day-to-day tasks like resolving an organism name, constructing the working graph snapshot, converting identifiers between releases or external databases, and inspecting available external data sources.

Bind the interface to a local repository used for data downloads and on-disk caches.

This initialiser wires up a dedicated logger for the API layer and records the path where IDTrack will keep its working files. The actual graph and tracking objects are created lazily (e.g. by idtrack.API.build_graph()) so that simply constructing idtrack.API is inexpensive.

Parameters:

local_repository (str) – Absolute (recommended) or relative path to a writable directory where the package may store downloaded resources and precomputed artefacts. The caller is responsible for ensuring the path exists and is accessible.

log

Logger named "api" for progress messages and diagnostics.

Type:

logging.Logger

logger_configured

False until idtrack.API.configure_logger() is called.

Type:

bool

local_repository

The given repository path.

Type:

str

track

Placeholder for the active tracker; populated after idtrack.API.build_graph() is invoked.

Type:

idtrack.Track | idtrack.TrackTests

_require_track()[source]

Return the active tracker or raise a clear error if the graph is not built yet.

Return type:

Track | TrackTests

build_graph(organism_name, snapshot_release, genome_assembly=None, return_test=False, calculate_caches=True)[source]

Build the bio-ID graph for an organism and prepare the path-finding engine.

This method wires together the high-level components used throughout IDTrack. It first creates an idtrack._database_manager.DatabaseManager that ignores releases newer than snapshot_release. It then instantiates idtrack._track.Track (or idtrack._track_tests.TrackTests when testing), which loads or builds the underlying idtrack._the_graph.TheGraph via idtrack._graph_maker.GraphMaker. Optionally, it primes all graph caches to improve query latency. The resulting resolver is stored on self.track for subsequent conversions and inspections.

Parameters:
  • organism_name (str) – Canonical Ensembl species name, typically the output of idtrack.API.resolve_organism().

  • snapshot_release (int) – Ensembl release anchoring this build. Data from later releases are ignored to ensure reproducible results.

  • genome_assembly (int | None) – Genome assembly code used in Ensembl core schema names (<organism>_core_<release>_<assembly>). This selects the primary assembly for the snapshot (default: highest-priority/newest configured for the organism). The snapshot graph can still include additional assemblies within the snapshot window; use idtrack.API.list_genome_assemblies() to inspect what is present.

  • return_test (bool) – If True, initialise idtrack._track_tests.TrackTests instead of the standard idtrack._track.Track to enable test and diagnostics helpers. Defaults to False.

  • calculate_caches (bool) – If True, eagerly compute the graph’s cached properties. When combined with return_test=True, test-only caches are included. Defaults to True.

Return type:

None

See also

idtrack.API.get_database_manager(), idtrack.API.calculate_graph_caches(), idtrack._track.Track, idtrack._track_tests.TrackTests
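
A minimal sketch of the usual call sequence, assuming api is an idtrack.API instance (or any object exposing the same documented methods). The helper name build_latest_graph is hypothetical and not part of IDTrack; only resolve_organism() and build_graph() as documented in this reference are called.

```python
from typing import Any, Tuple


def build_latest_graph(api: Any, tentative_name: str) -> Tuple[str, int]:
    """Resolve an organism name and build its graph at the newest release.

    `api` is expected to behave like idtrack.API (assumption: it exposes
    resolve_organism() and build_graph() as described in this reference).
    """
    # resolve_organism() returns the pair (formal_name, latest_release).
    formal_name, latest_release = api.resolve_organism(tentative_name)

    # Anchor the snapshot at that release; data from later releases are ignored.
    api.build_graph(organism_name=formal_name, snapshot_release=latest_release)
    return formal_name, latest_release
```

With a real installation this would be called as build_latest_graph(idtrack.API(repo_path), "human"), where repo_path is a writable directory of your choice; afterwards the tracker is available on api.track.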

calculate_graph_caches(for_test=False)[source]

Prime the working graph by eagerly computing all cached properties.

This helper reduces first-call latency and makes test runs deterministic by batch-computing every @cached_property exposed by idtrack._the_graph.TheGraph. Use it after idtrack.API.build_graph() has attached a populated idtrack._track.Track (or idtrack._track_tests.TrackTests) to self.track. Internally it forwards to idtrack._the_graph.TheGraph.calculate_caches().

Parameters:

for_test (bool) – If True, also compute heavyweight, test-only caches such as idtrack._the_graph.TheGraph.external_database_connection_form and idtrack._the_graph.TheGraph.available_releases_given_database_assembly. Defaults to False.

Return type:

None
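
The idea of priming cached properties can be sketched with the standard library alone. TinyGraph and prime_caches below are hypothetical stand-ins used for illustration; they are not the IDTrack implementation, which may select and order its caches differently.

```python
from functools import cached_property


class TinyGraph:
    """Hypothetical stand-in for a graph class that exposes cached properties."""

    def __init__(self, edges):
        self.edges = list(edges)

    @cached_property
    def nodes(self):
        # Set of all node labels that appear in any edge.
        return {node for edge in self.edges for node in edge}

    @cached_property
    def degree(self):
        # Number of edges touching each node.
        counts = {}
        for a, b in self.edges:
            counts[a] = counts.get(a, 0) + 1
            counts[b] = counts.get(b, 0) + 1
        return counts


def prime_caches(obj):
    """Access every cached_property defined on obj's class once.

    Only the class itself is inspected (not its base classes). Returns the
    names of the properties that were computed.
    """
    primed = []
    for name, attr in vars(type(obj)).items():
        if isinstance(attr, cached_property):
            getattr(obj, name)  # first access computes and stores the value
            primed.append(name)
    return primed
```

After priming, the computed values live in the instance's __dict__, so later reads do not recompute them.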

classify_multiple_conversion(matchings)[source]

Group batch-conversion results into semantic bins for downstream reporting.

This post-processing step takes the per-identifier results produced by idtrack.API.convert_identifier_multiple() (or a compatible list of idtrack.API.convert_identifier() payloads) and organises them into logically meaningful categories. The bins distinguish between “no match,” one-to-one vs. one-to-many mappings, whether the output differs from the input, and whether a reported target is an Ensembl fallback due to a missing external synonym.

Parameters:

matchings (list[dict[str, Any]]) – Collection of dictionaries returned by idtrack.API.convert_identifier_multiple(). Each element must contain, at minimum, the keys "query_id", "target_id", "no_corresponding", "no_conversion", and "no_target" as described in idtrack.API.convert_identifier().

Returns:

A dictionary of category → list-of-results. Categories are not mutually exclusive; an item can appear in multiple bins (e.g. a changed 1→1 mapping also appears in the general 1→1 bin). Keys are:

  • "input_identifiers": All input result objects, echoed unchanged (convenient for summary counts).

  • "matching_1_to_0": Queries that could not be mapped to any target (either no_corresponding=True or no_conversion=True). Indicates a 1→0 outcome.

  • "matching_1_to_1": Queries with exactly one target in "target_id". Includes both unchanged and changed outputs, and may overlap with "changed_only_1_to_1" or "alternative_target_1_to_1".

  • "matching_1_to_n": Queries with more than one target in "target_id" (n > 1). May overlap with "changed_only_1_to_n" or "alternative_target_1_to_n".

  • "changed_only_1_to_1": Strict subset of 1→1 where the single "target_id"[0] is different from "query_id" (i.e. the identifier changed across releases/databases).

  • "changed_only_1_to_n": Strict subset of 1→n where none of the "target_id" entries equal "query_id" (the original identifier is not present among the alternatives).

  • "alternative_target_1_to_1": Cases with exactly one "target_id" and no_target=True. This flags Ensembl fallbacks where the external database lacked a synonym; the single reported value is not a genuine external match.

  • "alternative_target_1_to_n": As above, but with multiple entries in "target_id" (n > 1) while no_target=True (typically multiple Ensembl-side candidates with no external synonym).

Return type:

dict[str, list[dict[str, Any]]]

Raises:

ValueError – If any element in matchings has an empty "target_id" list despite no_corresponding and no_conversion both being False (indicates an unexpected upstream state).

See also

idtrack.API.convert_identifier_multiple(), idtrack.API.print_binned_conversion(), idtrack.API.convert_identifier().

Notes

The function does not mutate input dictionaries. The binning logic is intentionally overlapping so that “summary” buckets (matching_*) can be used alongside “diagnostic” buckets (changed_only_*, alternative_target_*) without additional passes.
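
The overlapping bins described above can be illustrated with a small re-implementation. This is a sketch written from the documented rules only; the function name bin_results is hypothetical, and the real method may differ in details such as error messages.

```python
from typing import Any, Dict, List

BIN_NAMES = (
    "input_identifiers",
    "matching_1_to_0",
    "matching_1_to_1",
    "matching_1_to_n",
    "changed_only_1_to_1",
    "changed_only_1_to_n",
    "alternative_target_1_to_1",
    "alternative_target_1_to_n",
)


def bin_results(matchings: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group convert_identifier()-style payloads into overlapping bins."""
    bins: Dict[str, List[Dict[str, Any]]] = {name: [] for name in BIN_NAMES}
    for item in matchings:
        bins["input_identifiers"].append(item)  # echoed unchanged
        if item["no_corresponding"] or item["no_conversion"]:
            bins["matching_1_to_0"].append(item)
            continue
        targets = item["target_id"]
        if not targets:
            # Neither failure flag is set, yet there is no target.
            raise ValueError(f"Unexpected empty target list for {item['query_id']!r}")
        suffix = "1_to_1" if len(targets) == 1 else "1_to_n"
        bins[f"matching_{suffix}"].append(item)
        if item["query_id"] not in targets:
            bins[f"changed_only_{suffix}"].append(item)
        if item["no_target"]:
            bins[f"alternative_target_{suffix}"].append(item)
    return bins
```

Because the input dictionaries are appended rather than copied, the same object can be found in several bins, which matches the "not mutually exclusive" behaviour described above.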

configure_logger(level=None)[source]

Configure process-wide logging with a concise, time-stamped console format.

This method is idempotent per idtrack.API instance: the first call sets up a basic configuration for the Python logging system (time, level, logger name, and message). Subsequent calls on the same instance will not reconfigure logging and instead emit an informational message via idtrack.API.log.

Parameters:

level (int | str | None) – Desired logging level (e.g. logging.INFO, "INFO", logging.DEBUG). If None, defaults to logging.INFO.

Return type:

None

Notes

The configuration applies to the root logger and therefore affects logging for the entire Python process, not only this package. Call this early in your application if you want IDTrack’s log output formatted consistently with the rest of your program.
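
A minimal sketch of the "configure once per instance" behaviour, using only the standard logging module. The class name LoggerSetup and the format string are assumptions made for illustration; IDTrack's actual format may differ.

```python
import logging


class LoggerSetup:
    """Hypothetical helper mirroring a per-instance idempotent logger setup."""

    # Assumed layout: time, level, logger name, message.
    FORMAT = "%(asctime)s %(levelname)s:%(name)s: %(message)s"

    def __init__(self):
        self.log = logging.getLogger("api")
        self.logger_configured = False

    def configure_logger(self, level=None):
        if self.logger_configured:
            # Later calls on the same instance do not reconfigure anything.
            self.log.info("Logger is already configured; skipping.")
            return
        # basicConfig acts on the root logger, so the whole process is affected.
        # It does nothing if the root logger already has handlers.
        logging.basicConfig(
            level=logging.INFO if level is None else level,
            format=self.FORMAT,
        )
        self.logger_configured = True
```

Note that logging.basicConfig() silently does nothing when the root logger already has handlers, which is one reason to configure logging early in the application.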

convert_identifier(identifier, from_release=None, to_release=None, final_database=None, strategy='best', explain=False)[source]

Resolve a raw identifier and convert it to a target Ensembl release and (optionally) an external database.

This high-level helper wraps idtrack._track.Track.convert() and returns a compact, user-oriented summary of the result. It first normalises identifier to the canonical graph node label with idtrack._the_graph.TheGraph.node_name_alternatives(), then invokes the path-finding and final-conversion pipeline to reach the requested to_release and final_database. The output is designed for interactive use and downstream tooling: it reports whether the query is present in the graph, whether a conversion could be computed, and (if requested) the full path(s) followed through the Ensembl backbone and the external database hop.

Parameters:
  • identifier (str) – Query identifier to resolve. May be an Ensembl stable ID, gene symbol, or a known synonym; case and common punctuation variations are tolerated by the normaliser.

  • from_release (int | None) – Ensembl release the identifier originates from. If None, the direction of time travel is inferred automatically. Supplying a value constrains the search to a single direction of travel (forward or reverse) towards to_release.

  • to_release (int | None) – Target Ensembl release. If None, the newest release available in the graph is used.

  • final_database (str | None) – Name of the external database to convert into (e.g. "uniprot"). If None, the result remains on the Ensembl gene backbone (reported as idtrack._db.DB.nts_ensembl[idtrack._db.DB.backbone_form]).

  • strategy (Literal["all", "best"]) – Selection strategy applied after scoring all admissible targets. "best" keeps a single globally best target; "all" keeps all scored targets. Defaults to "best".

  • explain (bool) – If True, include the concatenated edge list(s) that show how each result was reached.

Returns:

Dictionary describing the conversion outcome with the following keys.

  • "target_id" (list[str]): Unique identifiers in the requested final_database. When strategy="best" and a target exists, this list contains exactly one element. If final_database is None, the list contains the Ensembl gene ID(s).

  • "last_node" (list[tuple[str, str]]): Pairs of (ensembl_gene_id, target_id) for every surviving candidate. The first element is the final Ensembl node reached by time travel; the second is the chosen target in final_database (or the Ensembl gene itself when staying on the backbone).

  • "final_database" (str | None): The database name the target_id values come from. None only when the query was not found at all; otherwise this is either final_database or the Ensembl backbone label idtrack._db.DB.nts_ensembl[idtrack._db.DB.backbone_form].

  • "graph_id" (str | None): Canonical node label used internally by the graph for identifier (e.g. "ACTB" for the symbol "actb"). None when the query has no corresponding graph node.

  • "query_id" (str): Echo of the original identifier argument for bookkeeping.

  • "no_corresponding" (bool): True if the query could not be matched to any graph node (nothing to convert). In this case "graph_id" is None and the other fields are empty or None.

  • "no_conversion" (bool): True if the query exists in the graph but no admissible path to to_release and/or final_database could be constructed (a 1→0 mapping).

  • "no_target" (bool): True if an Ensembl gene was reached but the requested final_database yielded no synonym. The result may fall back to returning the Ensembl gene itself; this flag lets callers distinguish that fallback from a genuine external match.

  • "the_path" (dict[tuple[str, str], tuple[tuple]]): Present only when explain is True. Maps each (target_id, ensembl_gene_id) pair to an ordered tuple of edges representing the full walk: first the Ensembl history segment that reaches the gene, then the final-conversion hop into the external database. Each edge is expressed in the internal format used by idtrack._track.Track and may include auxiliary fields (e.g. release markers).

Return type:

dict[str, Any]

Raises:

ValueError – If strategy is not "all" or "best".

Notes

  • Interactions between the boolean flags:

    • no_corresponding=True ⇒ no conversion is attempted; graph_id is None; target_id is [].

    • no_conversion=True ⇒ query exists but path scoring/selecting produced no admissible target.

    • no_target=True ⇒ Ensembl history succeeded but the external database lacked a synonym; callers may still receive an Ensembl fallback target.

  • When strategy="best", the scoring and tie-breakers are those implemented by idtrack._track.Track.calculate_score_and_select() and its callers. When "all", no global tie-break is applied and all scored targets are returned.
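
The three boolean flags can be folded into a single status label for reporting. The sketch below works on any dictionary that follows the documented result schema; the function name describe_outcome and the label strings are hypothetical.

```python
def describe_outcome(result):
    """Map the documented boolean flags of one result to a short label."""
    if result["no_corresponding"]:
        return "not_in_graph"      # the query matched no graph node
    if result["no_conversion"]:
        return "no_path"           # node exists, but no admissible path (1→0)
    if result["no_target"]:
        return "ensembl_fallback"  # Ensembl gene reached, no external synonym
    return "converted"             # genuine target(s) in the final database
```

The checks are ordered from the earliest possible failure to the latest, so the first flag that is set determines the label.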

convert_identifier_multiple(identifier_list, verbose=True, pbar_prefix='', **kwargs)[source]

Convert a batch of identifiers and aggregate per-query conversion metadata.

This is a thin, progress-enabled wrapper around idtrack.API.convert_identifier(). It iterates over identifier_list in order, forwards **kwargs to the single-item converter, and collects each per-identifier result. Use this helper for bulk operations where you want progress feedback and a uniform result structure that mirrors the single-call API.

Parameters:
  • identifier_list (list[str]) – Input identifiers to resolve and convert. Each element is passed to idtrack.API.convert_identifier() as its identifier argument, in the same order.

  • verbose (bool) – If True, display a tqdm progress bar (throttled to avoid excessive redraws). Set to False to disable the progress bar. Defaults to True.

  • pbar_prefix (str) – Optional label shown before the progress bar text (for distinguishing concurrent runs). Defaults to an empty string.

  • kwargs

    Keyword arguments forwarded verbatim to idtrack.API.convert_identifier(). Common options include:

    • from_release (int | None): Origin Ensembl release of the input identifier.

    • to_release (int | None): Target Ensembl release to which to time-travel.

    • final_database (str | None): Name of the external database to project into (e.g. "uniprot"). If None, results stay on the Ensembl backbone and are reported as idtrack._db.DB.nts_ensembl[idtrack._db.DB.backbone_form].

    • strategy (Literal["best", "all"]): Selection policy after scoring candidates.

    • explain (bool): If True, include full path details in the result (see "the_path" in idtrack.API.convert_identifier()).

Returns:

One element per input identifier, preserving input order. Each dictionary matches the schema returned by idtrack.API.convert_identifier().

Return type:

list[dict[str, Any]]

See also

idtrack.API.convert_identifier(), idtrack.API.classify_multiple_conversion(), idtrack.API.print_binned_conversion().

Notes

The output list preserves the order of identifier_list. Items are independent; failures for one query do not prevent processing of the others.
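
A short sketch of a batch run followed by classification, assuming api behaves like idtrack.API. The helper name convert_and_summarise is hypothetical; it only chains the two documented methods and counts the entries in each bin.

```python
def convert_and_summarise(api, identifiers, **kwargs):
    """Convert a batch and return the number of results in each bin.

    `api` is assumed to expose convert_identifier_multiple() and
    classify_multiple_conversion() as described in this reference.
    """
    # verbose=False disables the progress bar; kwargs go to convert_identifier().
    results = api.convert_identifier_multiple(identifiers, verbose=False, **kwargs)
    binned = api.classify_multiple_conversion(results)
    return {name: len(items) for name, items in binned.items()}
```

Since the bins overlap, the counts returned here do not sum to the number of input identifiers.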

external_database_forms()[source]

Return the Ensembl form each external database connects through.

Provides a compact view of how third-party databases attach to the Ensembl backbone ("gene", "transcript", or "translation") via idtrack._the_graph.TheGraph.external_database_connection_form().

Returns:

Mapping of external database name → Ensembl form (e.g., "gene").

Return type:

dict[str, str]

get_database_manager(organism_name, snapshot_release, genome_assembly=None, ignore_before=None)[source]

Create a database manager configured for an organism and a release-bounded snapshot.

Construct and return idtrack._database_manager.DatabaseManager bound to organism_name and configured to ignore data newer than snapshot_release. The manager centralizes all download, caching, and version logic for graph builds and identifier conversions. The biological form is initialised from idtrack._db.DB.backbone_form, and all artefacts are stored under idtrack.API.local_repository.

Parameters:
  • organism_name (str) – Canonical Ensembl species name (e.g. "homo_sapiens").

  • snapshot_release (int) – Most recent Ensembl release to include; later releases are ignored for reproducibility.

  • genome_assembly (int | None) – Genome assembly code used in Ensembl core schema names (<organism>_core_<release>_<assembly>). This selects the primary assembly for the snapshot (e.g. 38 = human GRCh38, 37 = human GRCh37, 39 = mouse GRCm39, 111 = pig Sscrofa11.1). If None (default), the highest-priority assembly configured for the organism is used. Note that the resulting snapshot graph can still include additional assemblies within the snapshot window; use idtrack.API.list_genome_assemblies() to inspect what is present.

  • ignore_before (int | None) – Earliest Ensembl release to include in the snapshot window. When None (default), use the earliest release supported by the public Ensembl MySQL/FTP dumps (see idtrack._db.DB.mysql_port_min_release). This default ensures multi-assembly history is retained for clean-handoff species (e.g. mouse) where older assemblies live entirely in earlier releases.

Returns:

A manager ready for use by graph-building and conversion routines.

Return type:

idtrack._database_manager.DatabaseManager

Notes

Any exceptions raised by idtrack._database_manager.DatabaseManager propagate unchanged.
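
The core schema naming pattern <organism>_core_<release>_<assembly> mentioned for genome_assembly can be made concrete with two small helpers. Both function names are hypothetical and exist only to illustrate the pattern; they are not part of IDTrack.

```python
def core_schema_name(organism_name, release, assembly):
    """Compose a core schema name from its three documented parts."""
    return f"{organism_name}_core_{release}_{assembly}"


def parse_core_schema_name(schema_name):
    """Split a core schema name back into (organism, release, assembly).

    Splitting from the right keeps underscores inside the organism name intact.
    """
    organism, marker, release, assembly = schema_name.rsplit("_", 3)
    if marker != "core":
        raise ValueError(f"Not a core schema name: {schema_name!r}")
    return organism, int(release), int(assembly)
```

For example, core_schema_name("homo_sapiens", 114, 38) yields "homo_sapiens_core_114_38".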

infer_identifier_source(id_list, mode='assembly_ensembl_release', report_only_winner=True)[source]

Infer the most likely source (database/assembly/release) for a heterogeneous identifier list.

This helper estimates which origin best explains the given IDs so users can pick a sensible graph configuration before running conversions at scale. Internally it resolves each input to a canonical node (where possible), consults idtrack._the_graph.TheGraph.node_trios to recover known origins, and tallies them via idtrack._track.Track.identify_source(). Under development: both the public signature and the scoring details may change in future releases.

Parameters:
  • id_list (list[str]) – Identifiers to analyse. Each item should be a string; non-existent IDs are safely ignored (and logged) during the tally.

  • report_only_winner (bool) – If True, return the single highest-count origin for the requested mode. If False, return all candidate origins ranked by descending count.

  • mode (str) –

    Granularity of the origin to infer.

    One of:

    • "complete" → return triples (database, assembly, release).

    • "ensembl_release" → return the Ensembl release only (int).

    • "assembly" → return the genome assembly only (int).

    • "assembly_ensembl_release" → return pairs (assembly, release).

Returns:

The inferred origin(s), depending on report_only_winner:

  • If report_only_winner is True:
    • mode == "complete": (database: str, assembly: int, release: int)

    • mode == "ensembl_release": release: int

    • mode == "assembly": assembly: int

    • mode == "assembly_ensembl_release": (assembly: int, release: int)

  • If report_only_winner is False:

    A list of (origin, count) pairs where origin has the corresponding shape above.

Return type:

tuple[str, int, int] | tuple[int, int] | int | list[tuple[object, int]]
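
How the four modes relate to (database, assembly, release) trios can be sketched with collections.Counter. This is an illustration under the assumption that each identifier contributes one or more such trios; the function name tally_origins is hypothetical, and the actual scoring in idtrack._track.Track.identify_source() may differ.

```python
from collections import Counter


def tally_origins(trios, mode="assembly_ensembl_release", report_only_winner=True):
    """Count origins at the requested granularity.

    `trios` is an iterable of (database, assembly, release) tuples.
    """
    projectors = {
        "complete": lambda t: t,
        "ensembl_release": lambda t: t[2],
        "assembly": lambda t: t[1],
        "assembly_ensembl_release": lambda t: (t[1], t[2]),
    }
    if mode not in projectors:
        raise ValueError(f"Unknown mode: {mode!r}")
    project = projectors[mode]
    counts = Counter(project(tuple(trio)) for trio in trios)
    ranked = counts.most_common()  # sorted by descending count
    if report_only_winner:
        return ranked[0][0] if ranked else None
    return ranked
```

Returning None for an empty input is a choice made for this sketch only; the reference above does not state what the library does in that case.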

list_ensembl_releases()[source]

List Ensembl releases reachable for the configured organism and assembly.

Wraps idtrack._database_manager.DatabaseManager.available_releases(), honoring any ignore window configured in the manager. The result is sorted in ascending order.

Returns:

Sorted release numbers that can be queried and cached locally.

Return type:

list[int]

list_external_databases()[source]

Return the set of third-party (non-Ensembl) databases represented in the current graph.

Returns:

Unique external database names discovered via idtrack._the_graph.TheGraph.available_external_databases().

Return type:

set[str]

list_external_databases_by_assembly()[source]

Map each genome assembly to the external databases present in that slice of the graph.

Delegates to idtrack._the_graph.TheGraph.available_external_databases_assembly() to reveal which third-party resources are available per assembly for the loaded organism/release window.

Returns:

Mapping of assembly → set of external database names.

Return type:

dict[int, set[str]]
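
The returned assembly → databases mapping is often easier to query once inverted. The sketch below operates on any dict[int, set[str]]; the function name assemblies_by_database is hypothetical, and the sample data in the usage note are made up.

```python
def assemblies_by_database(by_assembly):
    """Invert {assembly: {database, ...}} into {database: {assembly, ...}}."""
    inverted = {}
    for assembly, databases in by_assembly.items():
        for name in databases:
            inverted.setdefault(name, set()).add(assembly)
    return inverted
```

For instance, an input of {38: {"db_a", "db_b"}, 37: {"db_a"}} shows that "db_a" is available on both assemblies while "db_b" exists only on assembly 38.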

list_genome_assemblies()[source]

List genome assemblies represented in the currently loaded graph.

Exposes the assembly identifiers discovered when the graph was built. This is a thin wrapper over idtrack._the_graph.TheGraph.available_genome_assemblies() and requires that idtrack.API.build_graph() has been called.

Returns:

Unique genome assembly identifiers present in the graph (e.g., 38 for GRCh38).

Return type:

set[int]

print_binned_conversion(classified)[source]

Log a structured multi-line summary of binned conversion results with percentages and rest counts.

Parameters:

classified (dict[str, list[dict]]) – Output from idtrack.API.classify_multiple_conversion().

Return type:

None

resolve_organism(tentative_organism_name)[source]

Normalize a tentative organism name and fetch the latest supported Ensembl release.

This shields callers from Ensembl naming quirks by resolving a user-provided synonym (e.g. common name, shorthand, taxon ID) to the canonical Ensembl species identifier (e.g. "homo_sapiens") and to the newest Ensembl release that still hosts that species. The lookup delegates to idtrack._verify_organism.VerifyOrganism, ensuring subsequent graph construction and data access use a consistent, up-to-date pair.

Parameters:

tentative_organism_name (str) – Organism descriptor in any supported synonym form (e.g. "human", "hsapiens", "9606", or "homo_sapiens"). Matching is case-insensitive.

Returns:

(formal_name, latest_release) where formal_name is the canonical Ensembl species string and latest_release is the most recent Ensembl release number known for that species.

Return type:

tuple[str, int]

Process-scoped SOCKS bridging for restricted servers.

The primary entry point is idtrack.ConnectionBridge. It enables IDTrack to run on servers without direct internet access (e.g. HPC clusters) by routing the current Python process through a SOCKS5 proxy provided by an SSH reverse tunnel such as ssh -R 1080 user@server.

The bridge is intentionally lightweight and process-scoped:

  • It does not modify system-wide proxy configuration.

  • It only affects the current interpreter (one Python process / one Jupyter kernel).

  • It is reversible via idtrack.ConnectionBridge.stop() and also cleaned up best-effort at interpreter exit.

class ConnectionBridge(proxy_host='127.0.0.1', proxy_port=1080, *, set_env_proxy=True)[source]

Bases: object

Route this Python process’ outgoing TCP connections through an SSH-provided SOCKS proxy.

Many restricted environments block outbound internet access from compute nodes. IDTrack needs outbound access to Ensembl services (REST/HTTPS, FTP over HTTPS, and sometimes public MySQL). If you can SSH into the server from a machine with internet access, you can expose a SOCKS5 proxy on the server via OpenSSH remote dynamic forwarding:

ssh -R 1080 user@server

Then, inside Python on the server (or inside a Jupyter notebook kernel running on the server), enable the bridge:

import idtrack

b = idtrack.ConnectionBridge(proxy_port=1080)
b.start()  # applies process-scoped networking changes

# ... run IDTrack ...

b.stop()   # restores the previous networking configuration

Internals (for maintainers / power users)

start() monkeypatches socket.socket to socks.socksocket (PySocks) and optionally sets the environment variables ALL_PROXY and all_proxy so subprocesses spawned from this process inherit the proxy.

A private, process-wide _BridgeState singleton stores the original socket class, environment variables, and PySocks default proxy to ensure stop() can restore the prior state precisely. The singleton also implements a simple reference counter so multiple ConnectionBridge instances can share the same active bridge.
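
The save, share, and restore pattern can be sketched for the environment-variable part alone. EnvProxyState is a hypothetical, simplified analogue written for illustration; it does not patch sockets and is not the library's _BridgeState. The proxy URL scheme used in the usage note is likewise an assumption.

```python
import os


class EnvProxyState:
    """Reference-counted setter for proxy environment variables (sketch)."""

    KEYS = ("ALL_PROXY", "all_proxy")

    def __init__(self):
        self.ref_count = 0
        self._saved = {}

    def acquire(self, proxy_url):
        if self.ref_count == 0:
            # First user: remember the previous values, then apply the proxy.
            self._saved = {key: os.environ.get(key) for key in self.KEYS}
            for key in self.KEYS:
                os.environ[key] = proxy_url
        self.ref_count += 1

    def release(self):
        if self.ref_count == 0:
            return  # nothing active, nothing to restore
        self.ref_count -= 1
        if self.ref_count == 0:
            # Last user: put every variable back exactly as it was.
            for key, old in self._saved.items():
                if old is None:
                    os.environ.pop(key, None)
                else:
                    os.environ[key] = old
            self._saved = {}
```

Two callers that both acquire("socks5h://127.0.0.1:1080") share one active configuration, and the original environment returns only after the second release().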

Notes

  • The bridge affects only the current Python process (one Jupyter kernel). Closing the Python process/kernel automatically removes the monkeypatch.

  • To avoid surprises, call start() before the first network access in your program.

  • Status messages are emitted via the logger named "connection_bridge" and, when verbose=True, printed to stdout for immediate visibility in notebooks.

Parameters:
  • proxy_host (str) – SOCKS proxy host on the server. With ssh -R 1080 ... this is typically "127.0.0.1".

  • proxy_port (int) – SOCKS proxy port on the server. Must match the port used in the SSH command.

  • set_env_proxy (bool) – If True (default), set ALL_PROXY/all_proxy while active so subprocesses inherit the proxy configuration.

Create a new bridge controller without applying any network changes.


log

Logger named "connection_bridge" for structured diagnostics.

proxy_host

Effective proxy host for this instance.

proxy_port

Effective proxy port for this instance.

set_env_proxy

Whether this instance sets proxy environment variables when activating the bridge.

static _atexit_cleanup()[source]

Best-effort cleanup hook registered via atexit.

Return type:

None

_emit(message, *, verbose, level=20)[source]

Emit a status message via the instance logger and (optionally) stdout.

Parameters:
  • message (str)

  • verbose (bool)

  • level (int)

Return type:

None

classmethod _emit_global(message, *, verbose, level=20)[source]

Emit a message without requiring an instance (used by atexit cleanup).

Parameters:
  • message (str)

  • verbose (bool)

  • level (int)

Return type:

None

classmethod _force_disable_bridge(*, verbose)[source]

Disable the bridge regardless of which instance started it (best-effort).

This method is used by the atexit hook and by unit tests to ensure a clean process state. It intentionally bypasses instance-level bookkeeping (e.g. self._started flags).

Parameters:

verbose (bool) – If True, print a status message to stdout.

Return type:

None

static _format_proxy_url(host, port)[source]

Return a SOCKS proxy URL suitable for environment variables.

Parameters:
  • host (str)

  • port (int)

Return type:

str

static _require_pysocks()[source]

Import and return the PySocks module (import name: socks).

Returns:

Imported socks module.

Return type:

Any

Raises:

ImportError – If PySocks is not installed.

static _restore_socks_default_proxy(socks_module, original_proxy)[source]

Restore the PySocks default proxy configuration (best effort).

Parameters:
  • socks_module (Any)

  • original_proxy (Any)

Return type:

None

property is_active: bool

Return True if this instance currently holds an active bridge reference.

start(*, test=True, verbose=True)[source]

Enable the bridge for the current Python process.

The bridge is reference-counted across instances in the current interpreter. If another ConnectionBridge already enabled the bridge with the same proxy host/port, calling start() will simply increment the internal counter and return.

Parameters:
  • test (bool) – If True (default), run test_connection() after enabling the bridge. If the test fails, the bridge is automatically disabled again and the method returns False.

  • verbose (bool) – If True (default), print status messages to stdout.

Returns:

True if the bridge is enabled (and the optional test succeeds), otherwise False.

Return type:

bool

Raises:

RuntimeError – If a bridge is already active in this process but configured with a different proxy host/port.

stop(*, verbose=True)[source]

Disable the bridge and restore normal networking for this process.

If multiple ConnectionBridge instances are active, the bridge is only fully disabled once the last instance calls stop().

Parameters:

verbose (bool) – If True (default), print status messages to stdout.

Return type:

None
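
To make sure stop() always runs, the start()/stop() pair can be wrapped in a context manager. This is a sketch: the name bridged is hypothetical, and bridge is assumed to behave like idtrack.ConnectionBridge as documented here (start() returns a bool, stop() returns None).

```python
from contextlib import contextmanager


@contextmanager
def bridged(bridge, *, test=True, verbose=False):
    """Enable the bridge for the duration of a with-block."""
    # start() returns False when the optional connection test fails; in that
    # case the bridge has already been disabled again, so stop() is not needed.
    if not bridge.start(test=test, verbose=verbose):
        raise ConnectionError("The bridge could not be enabled.")
    try:
        yield bridge
    finally:
        # Runs even if the body raises, releasing this instance's reference.
        bridge.stop(verbose=verbose)
```

Used as "with bridged(idtrack.ConnectionBridge(proxy_port=1080)): ...", network calls inside the block go through the proxy and the previous configuration is restored afterwards.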

test_connection(*, verbose=True, timeout_s=15.0)[source]

Verify connectivity to Ensembl services through the active bridge.

The Ensembl REST ping is treated as the authoritative signal for success. MySQL connectivity checks are reported as warnings because IDTrack can fall back to HTTPS/FTP in some workflows.

Parameters:
  • verbose (bool) – If True (default), print status messages to stdout.

  • timeout_s (float) – Timeout (seconds) for the REST request.

Returns:

True if Ensembl REST is reachable, otherwise False.

Return type:

bool

Raises:

RuntimeError – If the bridge is not active in this process.

Parameters:
  • proxy_host (str)

  • proxy_port (int)

  • set_env_proxy (bool)

class HarmonizeFeatures(project_name, data_h5ad_dict, project_local_repository, idtrack_local_repository, target_ensembl_release, final_database='HGNC Symbol', organism_name='homo_sapiens', graph_last_ensembl_release=114, verbose_level=2, debugging_variables=False, converted_id_column='converted_id')[source]

Bases: object

Harmonize gene/feature identifiers across multiple single-cell expression datasets.

This manager streamlines the otherwise error-prone task of bringing heterogeneous gene identifiers (Ensembl IDs, gene symbols, etc.) into a single, version-controlled namespace before integrated downstream analysis. Under the hood it leverages idtrack.api.API to resolve identifier mappings through a pre-computed Ensembl graph, handles one-to-many and one-to-zero conversions, logs any ambiguous or inconsistent matches, and finally produces harmonised anndata.AnnData objects ready for comparative or joint analysis.

The public workflow is intentionally simple:

Instances keep several diagnostic attributes (e.g. removed_conversion_failed_identifiers, multiple_ensembl_dict) so that users can audit every decision that removed or altered a feature.

Parameters:
  • project_name (str) – Human-readable label used in log messages and derived output file names.

  • data_h5ad_dict (dict[str, str]) – Mapping dataset_alias → absolute .h5ad path of the source single-cell expression matrices.

  • project_local_repository (str) – Writable directory where harmonised outputs, logs, and temporary artefacts will be stored.

  • idtrack_local_repository (str) – Local clone or cache directory understood by idtrack.api.API; used to read the pre-built identifier graph.

  • target_ensembl_release (int) – Ensembl release that all identifiers will be converted to. Must be ≤ graph_last_ensembl_release.

  • final_database (str) – Canonical namespace kept after conversion (e.g. "HGNC Symbol"). Defaults to "HGNC Symbol".

  • organism_name (str) – Ensembl-style organism short name (e.g. "homo_sapiens"). Defaults to "homo_sapiens".

  • graph_last_ensembl_release (int) – Highest release present in the on-disk IDTrack graph. Defaults to 114.

  • verbose_level (Literal[0, 1, 2]) – Logging verbosity; 0 = errors only, 1 = warnings, 2 = info. Defaults to 2.

  • debugging_variables (bool) – Retain heavy intermediate structures for post-mortem inspection. Defaults to False.

  • converted_id_column (str) – Column name used to store converted identifiers inside the resulting AnnData.var DataFrame. Defaults to "converted_id".

idt

Lazily initialised IDTrack interface used for all identifier look-ups.

Type:

idtrack.api.API

multiple_ensembl_dict

Map of collapsed IDs to all Ensembl IDs that were originally associated with the same target identifier.

Type:

dict[str, list[str]]

removed_conversion_failed_identifiers

Features that failed conversion and were dropped from each dataset.

Type:

dict[str, set[str]]

kept_conversion_failed_identifiers

Non-convertible features kept because they were consistently non-convertible across all datasets.

Type:

dict[str, set[str]]

removed_inconsistent_identifier_matching

Features whose mappings disagreed between datasets and were therefore removed for consistency.

Type:

dict[str, set[str]]

Instantiate the harmoniser and perform lightweight validation.

The constructor merely prepares the harmonisation context: it validates input paths, configures logging, and primes IDTrack. Heavy work—graph initialisation, identifier matching, gene-symbol resolution—happens lazily when the first harmonisation method is called.

Raises:

ValueError – If verbose_level is not 0, 1, or 2.

_initialize()[source]

Populate diagnostic structures for failed or ambiguous identifier conversions.

Called once by HarmonizeFeatures.__init__(), this routine scans every input dataset and updates several reporting attributes (for example removed_conversion_failed_identifiers or removed_inconsistent_identifier_matching). It also derives multiple_ensembl_dict, a reverse map of ambiguous Ensembl ID → source identifiers, enabling downstream inspection of one-to-many relationships.

Returns None: All results are stored on self for later inspection.

Internally the method:

  1. Extracts raw feature identifiers from each anndata.AnnData file.

  2. Classifies identifiers into failure or inconsistency categories.

  3. Records per-dataset membership via reporter_dict_creator().

  4. Builds the multiple_ensembl_list used by HarmonizeFeatures.create_multiple_ensembl_dict().

  5. Accesses datataset_conversion_dataframe_issues so that the cached property is built eagerly.

Return type:

None

_initialize_idt()[source]

Instantiate the IDTrack interface on first use.

The public API defers expensive graph loading until it is actually required. This helper therefore checks whether idt is None and, if so, loads the on-disk identifier graph described by idtrack_local_repository and graph_last_ensembl_release, then configures release filters so that subsequent look-ups always target target_ensembl_release. Re-invocations are no-ops.

Returns None: The idt attribute is populated and ready for queries.

Return type:

None

property conversion_failed_but_consistent_identifiers: set[str]

Identify non-convertible identifiers that are consistently absent across all datasets.

An identifier that fails conversion in every dataset can be retained (or at least logged once) without jeopardising dataset comparability. This property computes the set intersection of conversion_failed_identifiers across datasets and makes the result available for selective retention or downstream visualisation.

Returns:

Identifiers that were never convertible but appeared in every dataset examined.

Return type:

set[str]
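The set intersection described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name consistent_failures, the per-dataset mapping failed_by_dataset, and the identifiers are all hypothetical.

```python
from functools import reduce


def consistent_failures(failed_by_dataset: dict[str, set[str]]) -> set[str]:
    """Return identifiers that failed conversion in every dataset."""
    if not failed_by_dataset:
        return set()
    # Intersect the per-dataset failure sets pairwise.
    return reduce(set.intersection, failed_by_dataset.values())


# Hypothetical example data; the identifiers are placeholders.
failed_by_dataset = {
    "study_a": {"GENE_X", "GENE_Y"},
    "study_b": {"GENE_X", "GENE_Z"},
}
result = consistent_failures(failed_by_dataset)
```

Only GENE_X fails in both toy datasets, so it is the only identifier that survives the intersection.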

property conversion_failed_identifiers: set[str]

Return identifiers that could not be converted in at least one dataset.

The property wraps dict_1_to_not_1() and filters its "1-to-0" category so that downstream code can quickly query irrecoverable failures without iterating over the entire diagnostic structure.

Returns:

Identifiers that failed conversion in at least one dataset or have inconsistent mappings (1-to-0, 1-to-n, or n-to-1 where not all datasets share the same mapping).

Return type:

set[str]

create_dataset_conversion_dataframe(gene_list, initialization_run)[source]

Build a three-column mapping table for a single dataset’s feature identifiers.

The routine transforms every source identifier in gene_list into the target namespace defined by self.final_database and Ensembl gene IDs. The resulting convertible subset is written into a new pandas.DataFrame with three columns—"ensembl_gene", self.final_database, and "Query ID"—while problematic identifiers are annotated or filtered according to the rules established during _initialize().

When called by _initialize() (initialization_run is True), the method writes provisional mappings without inspecting post-initialisation overrides. In subsequent calls (initialization_run is False) it resolves single-Ensembl ambiguities via self.datataset_conversion_dataframe_issues["final_database_chosen_single_ensembl_dict"] to guarantee a one-to-one relation between indices and feature rows.

Parameters:
  • gene_list (Union[list[str], pd.Index]) – Ordered collection of source identifiers to convert for the current dataset.

  • initialization_run (bool) – True if invoked from _initialize(); disables the single-Ensembl disambiguation step applied in later passes.

Returns:

Mapping table ready to become adata.var. Columns are "ensembl_gene", self.final_database, and the original "Query ID" for traceability.

Return type:

pandas.DataFrame

Raises:

AssertionError – If diagnostic sets such as conversion_failed_identifiers were not populated—indicating an incorrect call order—or if unexpected duplicate target IDs remain after processing.

create_intersection_column_values(adata_var)[source]

Flag features present in every study after harmonisation.

The merged .var table produced by unify_multiple_anndatas() contains one gene-symbol column per study, each named f"{self.converted_id_column}_{handle}" where handle is the dictionary key that identifies the originating dataset. A cell in one of those columns holds the gene symbol originally reported by the study, or idtrack._db.DB.placeholder_na if the gene was absent or could not be mapped to the target namespace.

This helper collapses the per-study presence/absence information into a single boolean intersection flag, later exposed to users as adata.var["intersection"]. A value of 1 indicates that the feature survived the intersect filter—i.e., it has a valid symbol in all studies—whereas 0 marks features missing from at least one dataset. The resulting NumPy vector is inserted by the caller; this routine is intentionally pure and side-effect free.

Parameters:

adata_var (pandas.DataFrame) – The .var table of the already concatenated anndata.AnnData object. It must contain one or more columns whose names start with f"{self.converted_id_column}_"; each such column is assumed to encode the gene symbol for a particular study.

Returns:

A 1-D array of int (values 0 or 1) with len(adata_var) elements. The i-th entry equals 1 if the i-th feature is present (not equal to idtrack._db.DB.placeholder_na) in every per-study symbol column; otherwise it is 0.

Return type:

numpy.ndarray
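The row-wise presence check can be illustrated with plain Python containers. This is a sketch under stated assumptions: the .var table is modelled as a dict of column lists rather than a pandas.DataFrame, PLACEHOLDER_NA is an assumed stand-in for idtrack._db.DB.placeholder_na, and intersection_flags is a hypothetical name.

```python
PLACEHOLDER_NA = "NA"  # assumed stand-in for idtrack._db.DB.placeholder_na


def intersection_flags(var_columns: dict[str, list], prefix: str) -> list[int]:
    """Return 1 where every per-study column holds a real symbol, else 0."""
    study_columns = [
        values for name, values in var_columns.items() if name.startswith(prefix)
    ]
    if not study_columns:
        raise ValueError(f"no columns start with {prefix!r}")
    # zip(*columns) walks the table row by row.
    return [
        int(all(cell != PLACEHOLDER_NA for cell in row))
        for row in zip(*study_columns)
    ]


# Toy table: two per-study symbol columns plus one unrelated column.
var_columns = {
    "converted_id_study_a": ["TP53", "BRCA1", "NA"],
    "converted_id_study_b": ["TP53", "NA", "NA"],
    "n_cells": [10, 4, 0],
}
flags = intersection_flags(var_columns, prefix="converted_id_")
```

The function only reads its input and returns a new list, mirroring the side-effect-free behaviour described above.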

create_multiple_ensembl_dict()[source]

Reverse map ambiguous Ensembl target IDs to their originating source identifiers.

During scanning, _initialize() collects every (source_id, target_ensembl_id) pair that falls outside the consistent one-to-one category into multiple_ensembl_list. This helper consolidates that list into a dictionary keyed by target_ensembl_id with a sorted list of associated source_id values, allowing auditors to quickly discover all inputs that collapsed onto the same Ensembl record.

Returns:

{target_ensembl_id: [source_id₁, source_id₂, …]} with duplicates removed and values sorted alphanumerically.

Return type:

dict[str, list[str]]
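A minimal sketch of this consolidation step, assuming the collected pairs are available as a list of (source_id, target_ensembl_id) tuples. The function name build_reverse_map and the identifiers are hypothetical, and sorted() gives plain lexicographic order.

```python
from collections import defaultdict


def build_reverse_map(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group source identifiers by the target they collapsed onto."""
    grouped: dict[str, set[str]] = defaultdict(set)
    for source_id, target_id in pairs:
        grouped[target_id].add(source_id)  # the set removes duplicates
    return {target_id: sorted(sources) for target_id, sources in grouped.items()}


# Placeholder identifiers, not real Ensembl records.
pairs = [
    ("SYMBOL_B", "ENSG_PLACEHOLDER_1"),
    ("SYMBOL_A", "ENSG_PLACEHOLDER_1"),
    ("SYMBOL_A", "ENSG_PLACEHOLDER_1"),  # duplicate pair
    ("SYMBOL_C", "ENSG_PLACEHOLDER_2"),
]
reverse_map = build_reverse_map(pairs)
```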

property datataset_conversion_dataframe_issues: DataFrame

Aggregate conversion failures and ambiguities into a tidy diagnostic table.

The cached DataFrame has one row per source identifier encountered across all datasets and the following columns:

  • dataset — Dataset alias that triggered the row (duplicates possible).

  • reason — Underscore-delimited label from reporter_dict_creator_helper_reason_finder().

  • target_identifier — The resolved identifier or NaN if conversion failed.

  • was_removed (bool) — Whether the feature was ultimately dropped from the dataset.

This compact view is ideal for spreadsheet export or in-notebook inspection because it condenses the richer nested structures stored on the class into a flat, analysis-friendly format.

Returns:

Combined diagnostic table sorted lexicographically by dataset and reason.

Return type:

pandas.DataFrame

property dict_1_to_not_1: dict[str, set[str]]

Collect identifiers involved in one-to-many or one-to-zero conversions.

This helper scans unified_matching_dict and extracts every source identifier whose conversion to the target namespace is not a strict one-to-one mapping. Two situations are considered problematic:

  • 1 → 0 (conversion failure) — no target identifier could be resolved.

  • 1 → n (ambiguous hit) — multiple targets share the best score, preventing an unambiguous choice.

The resulting dictionary is later consumed by reporter_dict_creator() to populate the diagnostic attributes exposed to users and by create_dataset_conversion_dataframe() to decide which features should be dropped or flagged in each anndata.AnnData object.

Returns:

{problem_class: {source_id₁, source_id₂, …}} where problem_class is either "1-to-0" or "1-to-n".

Return type:

dict[str, set[str]]
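The classification can be sketched as below, assuming the mapping has the {source_id: [target_id, …]} shape documented for unified_matching_dict. The function name classify_non_one_to_one and the sample identifiers are hypothetical.

```python
def classify_non_one_to_one(matching: dict[str, list[str]]) -> dict[str, set[str]]:
    """Split source identifiers into "1-to-0" and "1-to-n" problem classes."""
    problems: dict[str, set[str]] = {"1-to-0": set(), "1-to-n": set()}
    for source_id, targets in matching.items():
        if len(targets) == 0:
            problems["1-to-0"].add(source_id)  # conversion failure
        elif len(targets) > 1:
            problems["1-to-n"].add(source_id)  # ambiguous hit
        # Exactly one target is a clean one-to-one mapping and is skipped.
    return problems


matching = {
    "GENE_A": ["TARGET_1"],
    "GENE_B": [],
    "GENE_C": ["TARGET_2", "TARGET_3"],
}
problems = classify_non_one_to_one(matching)
```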

extract_source_identifiers_from_anndata(dataset_path)[source]

Load an .h5ad file and harvest the raw feature identifiers.

To prepare inputs for ID-Track, this routine opens the single-cell expression matrix at dataset_path, reads the .var DataFrame, and extracts either the "gene_id" field (if present) or the index itself as the source identifier. Identifiers are returned in file order so that downstream procedures can preserve the original feature ordering when reconstructing matrices.

Parameters:

dataset_path (str) – Absolute or project-relative path to an .h5ad file containing a valid anndata.AnnData object.

Returns:

Ordered list of identifier strings exactly as they appear in the source file.

Return type:

list[str]

feature_harmonizer(dataset_name)[source]

Convert one dataset’s feature space into the unified target namespace.

This convenience wrapper reads a single .h5ad file, removes identifiers deemed unusable during _initialize(), applies the conversion mapping from create_dataset_conversion_dataframe(), and returns a new anndata.AnnData object with harmonised features. The function is intentionally side-effect-free: it never alters the source file, and large temporary matrices are deleted immediately to minimise memory usage.

Parameters:

dataset_name (str) – Key from data_h5ad_dict identifying which dataset to load and harmonise.

Returns:

  • resulting_adata (anndata.AnnData) - Dataset whose var now contains "ensembl_gene" as index and self.final_database as a column.

  • t0 (int) - Number of features before filtering and harmonisation.

  • t1 (int) - Number of features after the procedure (i.e., retained in resulting_adata).

Return type:

tuple

Raises:

AssertionError – If duplicate Ensembl or target-database IDs slip past the conversion checks, which would break one-to-one mapping assumptions.

get_idtrack_matchings_for_all_datasets()[source]

Return raw ID-Track matchings for every dataset in the project.

This helper exposes the unfiltered mapping tables produced by ID-Track so that users can inspect exactly how each source identifier was converted (or failed to convert) in every individual dataset. Internally it triggers run_idtrack_for_single_dataset() for any dataset that has not yet been processed, caches the resulting tables in memory, and then assembles a {dataset_name: dataframe} dictionary whose keys align one-to-one with data_h5ad_dict.

Each returned pandas.DataFrame includes at least the following columns: source_id, target_id, conversion_status, reason, and any custom metadata injected by idtrack.api.API.

Returns:

Mapping of dataset alias to its full, row-level ID-Track matching table. The dictionary order follows the insertion order of data_h5ad_dict.

Return type:

dict[str, pandas.DataFrame]

n_to_1_within_individual_dataset(dataset_name, dataset_matching_list)[source]

Detect n-to-1 collapses inside one dataset and populate diagnostic caches.

In the ID-Track context n-to-1 means several source identifiers (query_id) converging on the same target identifier (matched_id). Such collapses are problematic because they merge distinct features when building the harmonised expression matrix. This helper inspects the raw matching rows for a single dataset, discovers all many-to-one events (including those that passed through the alternative target database), and records the results in a family of per-project dictionaries so that later stages—merging, filtering, and reporting—can make informed decisions.

The routine never returns a value; instead it mutates the following public attributes:

  • dict_n_to_1 - {matched_id: [dataset₁, dataset₂, …]} listing every dataset where the collapse occurred.

  • dict_n_to_1_with_query - {matched_id: {(query_id₁,…): [dataset]}} for cases where the matched_id also appears in the collapsing query set.

  • dict_n_to_1_with_query_reverse - {query_id: {matched_id: [dataset]}} for a query-centric view.

  • dict_n_to_1_without_query - collapses where the target never appears in its own query set.

Returns None: All information is stored on the instance for subsequent pipeline stages.

Parameters:
  • dataset_name (str) – Human-readable alias used throughout the project for this dataset.

  • dataset_matching_list (list[dict]) – Raw per-feature matchings returned by idtrack.api.API. Each dictionary must provide at least the keys "query_id", "last_node", and "final_database".

Return type:

None
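The core of n-to-1 detection is a group-by on the matched identifier. The sketch below uses simplified rows with hypothetical "query_id" and "matched_id" keys; the real rows carry "last_node" and "final_database" fields whose structure is not described here. Unlike the documented method, it returns its findings instead of mutating instance attributes.

```python
from collections import defaultdict


def find_n_to_1(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Return {matched_id: [query_id, ...]} for targets hit by several queries."""
    queries_by_target: dict[str, set[str]] = defaultdict(set)
    for row in rows:
        queries_by_target[row["matched_id"]].add(row["query_id"])
    return {
        target: sorted(queries)
        for target, queries in queries_by_target.items()
        if len(queries) > 1
    }


rows = [
    {"query_id": "OLD_NAME", "matched_id": "GENE_A"},
    {"query_id": "GENE_A", "matched_id": "GENE_A"},
    {"query_id": "ALIAS_1", "matched_id": "GENE_B"},
    {"query_id": "ALIAS_2", "matched_id": "GENE_B"},
    {"query_id": "GENE_C", "matched_id": "GENE_C"},
]
collapses = find_n_to_1(rows)

# Split by whether the target itself is among the collapsing queries.
with_query = {t: q for t, q in collapses.items() if t in q}
without_query = {t: q for t, q in collapses.items() if t not in q}
```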

reporter_dict_creator(the_dict, the_set, dataset_name)[source]

Update or create per-identifier diagnostic entries for a single dataset.

Each identifier in the_set is ensured to exist as a key inside the_dict. The entry’s "reason" field is generated exactly once using reporter_dict_creator_helper_reason_finder(); its "datasets_containing" list is then appended with dataset_name. This allows quick aggregation of “where did this problematic identifier occur?” across all datasets.

Returns None: the_dict is modified in-place.

Parameters:
  • the_dict (dict[str, dict]) – Target dictionary that stores diagnostic metadata. Keys are source identifiers; values have keys "reason" (str) and "datasets_containing" (list[str]).

  • the_set (set[str]) – Identifiers that belong to the diagnostic category represented by the_dict.

  • dataset_name (str) – Human-readable alias of the dataset currently being processed.

Return type:

None
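The update rule (create the entry once, then append the dataset alias) can be sketched as a standalone function. Here update_report and its reason_finder callback are hypothetical stand-ins for the method and for reporter_dict_creator_helper_reason_finder().

```python
from typing import Callable


def update_report(
    the_dict: dict[str, dict],
    the_set: set[str],
    dataset_name: str,
    reason_finder: Callable[[str], str],
) -> None:
    """Record that every identifier in the_set occurred in dataset_name."""
    for identifier in sorted(the_set):
        if identifier not in the_dict:
            # The reason is computed only when the entry is first created.
            the_dict[identifier] = {
                "reason": reason_finder(identifier),
                "datasets_containing": [],
            }
        the_dict[identifier]["datasets_containing"].append(dataset_name)


report: dict[str, dict] = {}
update_report(report, {"GENE_X"}, "study_a", lambda _id: "1-to-0")
update_report(report, {"GENE_X", "GENE_Y"}, "study_b", lambda _id: "1-to-n")
```

In the toy run, GENE_X keeps the reason assigned on first sight even though the second call supplies a different callback.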

reporter_dict_creator_helper_reason_finder(the_id)[source]

Infer why a particular identifier failed or produced a non-one-to-one conversion.

The algorithm inspects unified_matching_dict and assigns the_id to one or more of the following reasons, which are not mutually exclusive:

  • "n-to-1" — The identifier was part of an n → 1 collapse within at least one dataset.

  • "1-to-0" — No target identifier was returned (conversion failure).

  • "1-to-n" — The conversion yielded multiple targets (ambiguous mapping).

The final label is a single string where multiple reasons are concatenated with underscores, e.g. "1-to-0_1-to-n".

Parameters:

the_id (str) – Source identifier whose conversion outcome needs explanation.

Returns:

Underscore-delimited reason string describing the failure or ambiguity class.

Return type:

str
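Assembling the label from a set of reasons might look like the sketch below. The ordering of reasons within the label is an assumption (it follows the bullet list above and reproduces the "1-to-0_1-to-n" example), and reason_label is a hypothetical helper.

```python
REASON_ORDER = ("n-to-1", "1-to-0", "1-to-n")  # assumed ordering


def reason_label(reasons: set[str]) -> str:
    """Join the applicable reasons into one underscore-delimited string."""
    unknown = reasons - set(REASON_ORDER)
    if unknown:
        raise ValueError(f"unknown reasons: {sorted(unknown)}")
    return "_".join(reason for reason in REASON_ORDER if reason in reasons)


label = reason_label({"1-to-n", "1-to-0"})
```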

run_idtrack_for_single_dataset(dataset_name, dataset_path)[source]

Convert identifiers for one dataset and cache the raw ID-Track output.

Given a dataset alias and its on-disk location, this method:

  1. Calls extract_source_identifiers_from_anndata() to obtain the feature list.

  2. Feeds those identifiers to idtrack.api.API and collects the per-feature match results.

  3. Stores the resulting pandas.DataFrame inside the _idtrack_matchings_per_dataset cache so repeated calls are O(1).

  4. Updates unified_matching_dict so that cross-dataset diagnostics remain consistent.

Users rarely call this directly—get_idtrack_matchings_for_all_datasets() handles the orchestration—but it remains public for advanced, dataset-by-dataset debugging.

Parameters:
  • dataset_name (str) – Human-friendly alias used as the key inside diagnostic dictionaries.

  • dataset_path (str) – Absolute or project-relative .h5ad path passed straight to extract_source_identifiers_from_anndata().

Returns:

Full ID-Track matching table for dataset_name with columns source_id, target_id, conversion_status, and any extra metadata returned by the API.

Return type:

pandas.DataFrame

property unified_matching_dict

Expose the full source-to-target identifier mapping produced by IDTrack.

The dictionary is created during _initialize() when the IDTrack graph is first queried. Keys are source identifiers (as found in input files); values are all candidate target IDs returned by the graph query, ordered by decreasing score. A value may therefore be

  • a single-element list (unambiguous one-to-one),

  • a multi-element list (ambiguous one-to-n), or

  • an empty list (1-to-0 conversion failure).

Public access to this attribute enables advanced users to perform their own diagnostics or to reproduce the algorithm’s decisions outside the class.

Returns:

Mapping {source_id: [target_id₁, target_id₂, …]} in the order delivered by the IDTrack query.

Return type:

dict[str, list[str]]

unify_multiple_anndatas(mode='union', obs_columns_to_keep=None, numeric_var_columns=None, numeric_obs_columns=None, handle_anndata_key='handle_anndata')[source]

Merge several study-specific anndata.AnnData objects into a single, consolidated dataset.

This helper finalises the feature-harmonisation workflow. Earlier stages ensure that every source study expresses its features (e.g. genes or proteins) in a consistent identifier namespace and that per-cell metadata follow a shared schema. unify_multiple_anndatas takes those already normalised objects—stored in data_h5ad_dict—and fuses them into one coherent AnnData ready for joint analysis (dimensionality reduction, batch correction, integrated clustering, etc.).

Two strategies govern how the function reconciles mismatched feature sets:

  • "union" (default) preserves the superset of all identifiers. If a particular study lacks a feature, its expression values are imputed as exact zeros. This choice maximises information retention at the cost of a sparse matrix with assay-dependent missingness.

  • "intersect" retains only the identifiers present in every study, implicitly discarding features unique to a subset. This yields a denser matrix that is easier to factorise but sacrifices potentially informative study-specific biology.
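The difference between the two strategies can be shown on a toy example. This sketch is not the library's implementation: each study is reduced to a hypothetical {feature: value} dictionary instead of an AnnData object, and align_features is an illustrative name.

```python
def align_features(
    studies: dict[str, dict[str, float]], mode: str = "union"
) -> tuple[list[str], dict[str, list[float]]]:
    """Align per-study feature values on a shared, sorted feature list."""
    if mode not in {"union", "intersect"}:
        raise ValueError(f"unsupported mode: {mode!r}")
    feature_sets = [set(values) for values in studies.values()]
    if mode == "union":
        kept = set().union(*feature_sets)
    else:
        kept = set.intersection(*feature_sets) if feature_sets else set()
    features = sorted(kept)
    # Features absent from a study are padded with exact zeros.
    aligned = {
        name: [values.get(feature, 0.0) for feature in features]
        for name, values in studies.items()
    }
    return features, aligned


studies = {
    "study_a": {"GENE_1": 2.0, "GENE_2": 5.0},
    "study_b": {"GENE_2": 1.0, "GENE_3": 4.0},
}
union_features, union_values = align_features(studies, mode="union")
shared_features, shared_values = align_features(studies, mode="intersect")
```

With these inputs, "union" keeps three features and pads the missing ones with zeros, whereas "intersect" keeps only GENE_2.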

Beyond concatenating the main X matrices, the routine also harmonises associated annotations:

  • .var (feature annotations)

    All columns are outer-joined across studies. Non-shared categorical values are unioned; numeric columns specified in numeric_var_columns are cast to floating point and NaNs inserted where data are missing. In union mode an additional boolean "intersection" column flags whether a feature survived the intersect filter, enabling fast subsetting later.

  • .obs (cell annotations)

    Each original column is kept if its name appears in obs_columns_to_keep or if it exists in every study. Missing columns are created and populated with pandas.NA. Columns listed in numeric_obs_columns are coerced to float64. A new column named handle_anndata_key stores the handle (dictionary key) that identifies the originating study, making it trivial to stratify analyses.

  • .layers, .obsp, .varp, .uns

    Handling of these slots is delegated to anndata.AnnData.concat().

The implementation is mindful of scalability: concatenation leverages SciPy CSR/CSC sparse formats, avoiding densification, and streaming allocation prevents double memory use for extremely large datasets.

Parameters:
  • mode (Literal["union", "intersect"]) – Strategy for reconciling discordant feature sets. "union" keeps every identifier observed across studies (padding absent entries with zeros); "intersect" restricts the result to identifiers common to all studies. Defaults to "union".

  • obs_columns_to_keep (list[str] | None) – Names of per-cell metadata columns that must survive the merge even if they appear in only a subset of studies (e.g. cell_type, donor_age). When a column is missing from a particular study, it is inserted and filled with pandas.NA. Provide an empty list to allow the routine to decide purely by intersection; None means “no user preference”.

  • numeric_var_columns (set[str] | None) – Columns in .var that should retain numeric dtype. The function validates that each specified column can be losslessly converted to floating point; otherwise it raises ValueError. Non-listed columns default to category dtype to conserve memory. If None an empty set is assumed.

  • numeric_obs_columns (set[str] | None) – Analogous to numeric_var_columns but applied to .obs. Conversions are performed after the table has been unioned, ensuring consistent dtype across the final concatenated frame. If None an empty set is assumed.

  • handle_anndata_key (str) – Name of the column inserted into .obs that records the dictionary key of the source study. This provenance tag facilitates stratified visualisation (e.g. UMAP coloured by batch) and downstream batch-correction utilities that expect a “batch” column. Defaults to "handle_anndata".

Returns:

A fully merged expression matrix whose .X contains either the union or intersection of all study features. Index ordering follows the order in which studies were supplied, ensuring deterministic output for reproducible pipelines. The result inherits sparse/dense representation from the first study unless mode forces feature padding, in which case CSR/CSC is chosen automatically to keep memory use in check.

Return type:

anndata.AnnData

Raises:
  • ValueError – If mode is not "union" or "intersect"; if any column listed in numeric_var_columns or numeric_obs_columns fails numeric coercion; or if feature identifiers clash across studies after harmonisation (e.g. two studies mapping different genes to the same ID).

  • AssertionError – If duplicate cell or feature indices are detected post-merge, a condition that would break many Scanpy workflows and indicates upstream validation errors.

Notes

Performance considerations The operation is CPU-bound when aligning large sparse matrices. For datasets exceeding ~1 million cells, empirical benchmarks show that running on Python 3.11 with MKL yields a 2-3x speed-up over Python 3.8 due to better sparse BLAS threading. Provide pre-compressed datasets (hdf5, zarr) to further lower I/O overhead.

Thread safety The method is re-entrant but not thread-safe because it mutates the source AnnData objects in-place to reduce copying. Invoke one instance per process or deep-copy the inputs beforehand if concurrent harmonisation is required.

Extensibility Sub-classes may override private hooks _before_concat(), _after_concat(), and _merge_uns() to refine behaviour without re-implementing the full algorithm.

class DatabaseManager(organism, form, local_repository, ensembl_release=None, ignore_before=None, ignore_after=None, store_raw_always=True, genome_assembly=None)[source]

Bases: object

Manage retrieval, preprocessing, and storage of Ensembl Core and related external datasets.

The DatabaseManager centralizes all low-level operations required for ID-track analyses, including discovering which Ensembl releases are available for a given organism/assembly, downloading the corresponding MySQL tables, normalizing column names, persisting raw and processed files under a local cache directory, and orchestrating auxiliary look-ups to third-party resources via ExternalDatabases. By funnelling every data-access path through a single object the wider package gains:

  • Stable, reproducible builds - every graph, lookup table, or ID-history file is anchored to the exact Ensembl release, genome assembly, and form (gene, transcript, translation, …) with which the manager was configured.

  • Transparent caching - expensive downloads happen once; subsequent requests are served from disk, making large iterative analyses feasible on modest hardware.

  • Unified version logic - helper methods such as version_uniformize() and check_version_info() guarantee that cross-release identifier changes are captured and resolved consistently across the codebase.

Key public methods/attributes

  • available_releases() — list releases that can be queried and saved locally.

  • change_release() — switch the manager to another Ensembl release in-place.

  • download_table() — fetch a single MySQL table and write it to local_repository.

  • create_external_all() — pull every supported external resource (UniProt, RefSeq, …).

  • organism, form, ensembl_release, genome_assembly — core configuration knobs, surfaced for quick inspection.

The class is stateful: state-mutating helpers update internal cached properties so that the instance always reflects its current configuration. Use the built-in __str__() for a concise, human-readable dump of that state.

Initialize a DatabaseManager for a specific organism, release, and assembly.

Parameters:
  • organism (str) – Canonical species name in Ensembl schema (e.g. "homo_sapiens" or "mus_musculus"). Anything else raises NotImplementedError.

  • form (str) – Biological entity level of interest (one of "gene", "transcript", "translation", …), which governs which stable-ID columns will be expected downstream.

  • local_repository (str) – Absolute or relative path to a writable directory that will hold all downloaded MySQL dumps, intermediate parquet/Feather files, and ready-to-use artefacts. The directory must already exist and be both readable and writable.

  • ensembl_release (Optional[int]) – Target Ensembl release number. If None the most recent release available for genome_assembly is selected automatically.

  • ignore_before (Optional[int]) – Earliest release to include when building cross-release ID histories. Defaults to the minimum release supported by the selected assembly.

  • ignore_after (Optional[int | float]) – Latest release to include when building histories. np.inf (the default) disables the upper bound and includes all newer releases.

  • store_raw_always (bool) – When True raw MySQL tables are always copied to local_repository before conversion; when False they are kept only in memory.

  • genome_assembly (Optional[int]) – Genome assembly code used in Ensembl core schema names (<organism>_core_<release>_<assembly>). This selects the primary assembly used for data access (e.g. 38 = human GRCh38, 39 = mouse GRCm39, 111 = pig Sscrofa11.1). If omitted, the highest-priority assembly for organism is used. If ensembl_release is provided, the selection is restricted to assemblies that actually contain that release.

Raises:
  • ValueError – If form is not in the supported list or local_repository fails basic path/read/write checks.

  • RuntimeError – If internal port/release configuration is inconsistent.

  • NotImplementedError – If organism is not yet supported by the package.

_create_relation_helper(df)[source]

Convert an ID/version matrix into the canonical three-column relationship table.

The helper is shared by DatabaseManager.create_relation_current() and DatabaseManager.create_relation_archive() and is not intended for direct use. It validates the incoming frame, fixes inconsistent version numbers (via DatabaseManager.version_fix() and DatabaseManager.version_fix_incomplete()), converts missing translations to NaN-compatible floats, casts all stable-ID columns to string, and finally compresses each ID + version pair into the compact node label used throughout ID-track graphs.

Parameters:

df (pandas.DataFrame) – A six-column frame with exactly the following names (order irrelevant): gene_stable_id, gene_version, transcript_stable_id, transcript_version, translation_stable_id, translation_version.

Returns:

Three columns (gene, transcript, translation), deduplicated and index-reset, ready for graph construction.

Return type:

pandas.DataFrame

Raises:

ValueError – If df does not contain the required six columns or if version columns cannot be coerced to the expected numeric dtype.

static _determine_usecols_ids(form)[source]

Determine column subsets needed to fetch identifier tables for a given Ensembl molecular form.

The helper translates a user-facing form string ("gene", "transcript", or "translation") into three ordered lists that drive low-level SQL selects throughout ID-track. Splitting the information this way lets public routines such as DatabaseManager.create_ids() assemble the minimal column set required for each organism/release while still keeping associated keys available for later joins.

Parameters:

form (str) – Molecular form whose identifier columns are requested. Must be one of idtrack._db.DB.forms_in_order ("gene", "transcript", or "translation").

Returns:

  • stable_id_version - always ["stable_id", "version"]; the canonical ID and its version counter.

  • usecols_core - primary-key column for form plus stable_id_version.

  • usecols_asso - foreign-key columns linking form to upstream forms, enabling later joins (e.g., ["transcript_id", "gene_id"] for transcripts).

Return type:

tuple[list[str], list[str], list[str]]

Raises:

ValueError – If form is not in {"gene", "transcript", "translation"}.

_download_table_from_ftp(table_key, usecols=None)[source]

Download a table from Ensembl’s HTTPS MySQL dumps (no direct MySQL connection).

Parameters:
  • table_key (str)

  • usecols (list[str] | None)

Return type:

DataFrame

_ftp_db_dir_url()[source]

Return the HTTPS directory URL for the current core database dump.

Return type:

str

classmethod _ftp_find_core_db_dir(*, organism, genome_assembly, release)[source]

Locate the core DB directory name (may include patch letters) and its HTTPS directory URL.

Parameters:
  • organism (str)

  • genome_assembly (int)

  • release (int)

Return type:

tuple[str | None, str | None]

classmethod _ftp_mysql_root_candidates(*, organism, genome_assembly, release)[source]

Return candidate HTTPS roots to search for an Ensembl core DB directory.

Parameters:
  • organism (str)

  • genome_assembly (int)

  • release (int)

Return type:

tuple[str, …]

classmethod _ftp_schema_for_sql_url(sql_url)[source]

Return a table->columns mapping parsed from an Ensembl <db>.sql.gz schema dump.

Parameters:

sql_url (str)

Return type:

dict[str, list[str]]

_ftp_schema_url()[source]

Return a working schema-dump URL (*.sql.gz or *.sql.gz.bz2) for the current DB dump directory.

Return type:

str

classmethod _get_core_db_index(*, organism, genome_assembly)[source]

Return cached core-DB availability for an (organism, assembly) pair across all configured ports.

The Ensembl public MySQL service can host the same assembly on multiple ports depending on release (e.g. homo_sapiens assembly 37). To support full-history workflows we build a small in-memory index:

  • ports: ports probed for this (organism, assembly) in preference order

  • releases_by_port: releases available for this assembly on each reachable port

  • db_by_port_release: schema name for each available release on each reachable port

  • releases: sorted union of all releases across ports

  • port_for_release: deterministic choice of port for each release (first configured port that has it)

  • db_for_release: chosen schema name for each release (matching port_for_release, includes patch-letter suffixes)

Parameters:
  • organism (str) – Canonical Ensembl organism name (e.g. "homo_sapiens").

  • genome_assembly (int) – Genome assembly version (e.g. 38 for GRCh38).

Returns:

Mapping describing reachable releases/ports for this organism/assembly.

Return type:

dict[str, Any]

Raises:

ValueError – If organism or genome_assembly is not configured.
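The "first configured port that has it" rule for port_for_release can be sketched as follows. The function name choose_port_for_release and the sample release sets are illustrative; only the preference-order rule is taken from the description above.

```python
def choose_port_for_release(
    ports: list[int], releases_by_port: dict[int, set[int]]
) -> dict[int, int]:
    """Map each release to the first port, in preference order, that hosts it."""
    port_for_release: dict[int, int] = {}
    for port in ports:  # ports are given in preference order
        for release in releases_by_port.get(port, set()):
            # setdefault keeps the port chosen by an earlier, preferred entry.
            port_for_release.setdefault(release, port)
    return dict(sorted(port_for_release.items()))


# Illustrative availability data, not probed from a live server.
ports = [3306, 3337]
releases_by_port = {3306: {80, 81}, 3337: {75, 80}}
port_for_release = choose_port_for_release(ports, releases_by_port)
releases = sorted(port_for_release)  # sorted union of all releases
```

Release 80 is available on both ports, so the earlier entry in ports wins, which keeps the choice deterministic.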

classmethod _get_core_db_index_from_ftp(*, organism, genome_assembly, ports)[source]

Build a core-DB availability index by probing the Ensembl HTTPS/FTP MySQL dumps.

Parameters:
  • organism (str)

  • genome_assembly (int)

  • ports (list[int])

Return type:

dict[str, Any]

classmethod _is_retryable_http_read_error(exc)[source]

Heuristically decide whether an HTTP read/decompress error is transient and safe to retry.

Parameters:

exc (BaseException)

Return type:

bool

static _iter_exception_chain(exc)[source]

Return the exception chain (__cause__/__context__) as a list.

Parameters:

exc (BaseException)

Return type:

list[BaseException]
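
A chain walk of this kind might look like the sketch below. It assumes that an explicit __cause__ takes precedence over the implicit __context__ and that cycles should be guarded against; neither detail is confirmed by the documentation above.

```python
def iter_exception_chain(exc: BaseException) -> list[BaseException]:
    """Collect exc and its chained predecessors, outermost first (illustrative sketch)."""
    chain: list[BaseException] = []
    seen: set[int] = set()
    current = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against circular chains
        chain.append(current)
        # Assumption: prefer the explicit cause, fall back to the implicit context.
        if current.__cause__ is not None:
            current = current.__cause__
        else:
            current = current.__context__
    return chain
```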

static _open_decompressed_http_text(url)[source]

Yield a text stream for an Ensembl dump file, handling .gz, .bz2, and nested .gz.bz2.

Parameters:

url (str)
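
The layered decompression can be illustrated with the standard library alone. The sketch below works on in-memory bytes rather than an HTTP response, and open_decompressed_text is a hypothetical name; it only shows how suffixes map to decompression layers, with the outermost suffix peeled first.

```python
import bz2
import gzip
import io


def open_decompressed_text(raw: bytes, name: str) -> io.TextIOWrapper:
    """Return a text stream for raw, peeling one layer per compression suffix in name."""
    stream = io.BytesIO(raw)
    while True:
        if name.endswith(".bz2"):
            stream = bz2.BZ2File(stream)  # outer layer of a nested .gz.bz2 file
            name = name[: -len(".bz2")]
        elif name.endswith(".gz"):
            stream = gzip.GzipFile(fileobj=stream)
            name = name[: -len(".gz")]
        else:
            break
    return io.TextIOWrapper(stream, encoding="utf-8")
```

For a name ending in .sql.gz.bz2 the loop first wraps the bytes in a bz2 reader and then in a gzip reader, so the caller reads plain text either way.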

static _parse_apache_dir_listing_dirs(html)[source]

Extract directory names from an Apache-style directory listing HTML page.

Parameters:

html (str)

Return type:

list[str]

static _parse_apache_dir_listing_files(html)[source]

Extract file names from an Apache-style directory listing HTML page.

Parameters:

html (str)

Return type:

list[str]
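
One way to extract directory and file names from such a listing is to collect the href attributes and classify them by a trailing slash. The sketch below uses html.parser from the standard library; the function names and the filtering rules (skipping sort links, absolute links, and the parent-directory link) are assumptions, not the library's actual logic.

```python
from html.parser import HTMLParser


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every anchor tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def _listing_entries(html: str) -> list[str]:
    parser = _HrefCollector()
    parser.feed(html)
    # Assumption: skip column-sort links ("?C=N;O=D"), absolute links and "../".
    return [
        h for h in parser.hrefs
        if not h.startswith(("?", "/", "..")) and "://" not in h
    ]


def parse_listing_dirs(html: str) -> list[str]:
    """Directory entries are the relative links that end with a slash."""
    return [h.rstrip("/") for h in _listing_entries(html) if h.endswith("/")]


def parse_listing_files(html: str) -> list[str]:
    """File entries are the remaining relative links."""
    return [h for h in _listing_entries(html) if not h.endswith("/")]
```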

classmethod _probe_mysql_core_schemas_by_port(*, organism, genome_assembly, ports)[source]

Return per-port core DB releases and schema names from the live Ensembl MySQL service.

Parameters:
  • organism (str)

  • genome_assembly (int)

  • ports (list[int])

Return type:

tuple[dict[int, set[int]], dict[int, dict[int, str]], dict[int, Exception]]

classmethod _refresh_core_db_index_mysql(index, *, organism, genome_assembly)[source]

Refresh the MySQL-derived portion of a cached core-index in place.

Parameters:
  • index (dict[str, Any])

  • organism (str)

  • genome_assembly (int)

Return type:

dict[str, Any]

property available_releases: list[int]

Return Ensembl releases that are both reachable and within the ignore window.

The set is discovered via available_releases_versions(), filtered against ignore_before / ignore_after, sorted in ascending order, and cached for the lifetime of this DatabaseManager instance. The resulting list represents releases that can safely be queried and cached locally, guaranteeing reproducible downstream analyses.

Returns:

Sorted release numbers satisfying reachability and ignore-window constraints.

Return type:

list[int]
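
The filtering step can be sketched as below. Whether the ignore_before / ignore_after bounds are inclusive is not stated above; this illustration assumes they are, and filter_releases is a hypothetical helper name.

```python
from typing import Iterable, Optional


def filter_releases(
    reachable: Iterable[int],
    ignore_before: Optional[int] = None,
    ignore_after: Optional[int] = None,
) -> list[int]:
    """Keep releases inside the ignore window and sort them in ascending order."""
    lower = float("-inf") if ignore_before is None else ignore_before
    upper = float("inf") if ignore_after is None else ignore_after
    # Assumption: both bounds are inclusive.
    return sorted(r for r in set(reachable) if lower <= r <= upper)
```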

property available_releases_all_assemblies: list[int]

Return all Ensembl releases reachable across every configured assembly.

For clean-handoff species (e.g. mouse: 37 → 38 → 39), no single assembly spans the full release history. Graph construction and YAML template generation therefore need a release catalogue that is the union over all assemblies configured for organism in idtrack._db.DB.assembly_mysqlport_priority.

The list is filtered by the manager’s ignore_before / ignore_after window and cached for the lifetime of this DatabaseManager instance.

Returns:

Sorted release numbers available for at least one configured assembly.

Return type:

list[int]

property available_releases_no_save: list[int]

Return reachable Ensembl releases without persisting the discovery to disk.

Functionally identical to available_releases(), except that the discovered list is not written to the on-disk YAML cache. This helper is useful when users want a quick, read-only view of server availability—e.g., inside CI pipelines—without contaminating the persistent cache. The value is still memoized in memory for the current DatabaseManager instance.

Returns:

Sorted release numbers reachable on the remote MySQL server and compliant with the ignore window.

Return type:

list[int]

available_releases_versions(**kwargs)[source]

Discover valid Ensembl releases for the configured organism and assembly.

Availability is discovered via the cached, multi-port aware core index built by _get_core_db_index(). The resulting union of releases is filtered against the manager’s ignore_before / ignore_after bounds and returned in ascending order.

Parameters:

kwargs – Kept for backward compatibility; currently unused.

Returns:

Sorted list of release numbers that exist on the mirror and comply with the ignore window.

Return type:

list[int]

available_tables_mysql()[source]

Enumerate tables present in the selected Ensembl MySQL schema.

Intended to complement available_databases_mysql(): while that method lists databases (one per organism/release/assembly), this one will drill into the active database and return the table names themselves, such as "gene", "transcript", "xref", and so on.

Raises:

NotImplementedError – Always - the table enumeration logic has not yet been written.

change_assembly(genome_assembly, last_possible_ensembl_release=False)[source]

Clone the manager while targeting a new genome assembly (e.g. GRCh38 → GRCh37).

Genome assemblies are encoded as integers in Ensembl’s schema naming (38 for GRCh38, 37 for GRCh37, 39 for GRCm39, 111 for Sscrofa11.1, …). When last_possible_ensembl_release is True the method automatically picks the most recent Ensembl release that still provides MySQL dumps for the requested assembly, ensuring compatibility. All other settings are copied verbatim.

Parameters:
  • genome_assembly (int) – Assembly code configured under DB.assembly_mysqlport_priority for this manager’s organism.

  • last_possible_ensembl_release (bool) – When True override ensembl_release with the newest version available for genome_assembly. Defaults to False.

Returns:

New manager tied to the requested assembly (and possibly a recalculated release).

Return type:

DatabaseManager

change_form(form)[source]

Clone the manager while switching the biological form of interest.

A “form” denotes the identifier namespace to track—gene, transcript, translation, etc. This method preserves every other configuration knob (organism, release, assembly, cache directory, ignore windows, …) and returns a brand-new instance so that the original object remains unaffected.

Parameters:

form (str) – Target form/namespace recognised by __init__(). Typical values are "gene", "transcript", or "translation".

Returns:

An independent manager identical to self except for form.

Return type:

DatabaseManager

change_release(ensembl_release)[source]

Produce a new manager that targets a different Ensembl release.

The returned instance inherits organism, form, assembly, and all caching parameters, but points every subsequent query (MySQL, FTP, or REST) to ensembl_release. This is the recommended way to traverse releases in scripted analyses without mutating objects in-place.

Parameters:

ensembl_release (int) – Desired Ensembl release number (e.g. 111). Must be available for the current genome assembly or a NotImplementedError may be raised further down the call stack when data retrieval is attempted.

Returns:

Fresh manager initialised for ensembl_release.

Return type:

DatabaseManager

change_release_auto_assembly(ensembl_release)[source]

Clone the manager for ensembl_release while inferring a compatible genome assembly.

Unlike change_release(), this helper allows genome_assembly to change when the requested release is not present in the current assembly (common for clean-handoff species). Assembly inference follows the same priority rules as __init__() with genome_assembly=None: pick the highest-priority configured assembly that contains the requested release.

Parameters:

ensembl_release (int) – Desired Ensembl release number.

Returns:

Fresh manager initialised for ensembl_release with an inferred assembly.

Return type:

DatabaseManager
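
The inference rule described above (take the highest-priority configured assembly that contains the requested release) can be sketched as a simple ordered search. The helper name, its inputs, and the ValueError raised when nothing matches are assumptions made for illustration.

```python
from typing import Iterable, Mapping, Sequence


def infer_assembly(
    release: int,
    assembly_priority: Sequence[int],
    releases_by_assembly: Mapping[int, Iterable[int]],
) -> int:
    """Return the first assembly, in priority order, that provides release."""
    for assembly in assembly_priority:
        if release in releases_by_assembly.get(assembly, ()):
            return assembly
    # Assumption: the real method may signal this case differently.
    raise ValueError(f"Release {release} is not available for any configured assembly.")
```

With illustrative release ranges for a clean-handoff species, for example {39: range(103, 112), 38: range(68, 103)} and priority [39, 38], a request for release 90 resolves to assembly 38.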

check_if_change_assembly_works(db_manager, target_assembly)[source]

Evaluate whether db_manager can be cloned to operate on target_assembly.

A lightweight health-check that calls DatabaseManager.change_assembly() inside a try/except block and converts the outcome to a boolean flag rather than letting the exception propagate. It allows batch workflows to skip assemblies that are unavailable or invalid without interrupting processing.

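Parameters:
  • db_manager (DatabaseManager) – Manager instance to be cloned.

  • target_assembly (int) – Genome assembly code to switch to.

Returns: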

True if the assembly switch succeeds without raising ValueError; False otherwise.

Return type:

bool
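
The pattern described here, converting an exception into a boolean, can be sketched with a stand-in object. StubManager is purely hypothetical and mimics only the fact that change_assembly() raises ValueError for an unsupported assembly.

```python
class StubManager:
    """Hypothetical stand-in that exposes only change_assembly()."""

    def __init__(self, configured_assemblies):
        self.configured_assemblies = set(configured_assemblies)

    def change_assembly(self, genome_assembly):
        if genome_assembly not in self.configured_assemblies:
            raise ValueError(f"Assembly {genome_assembly} is not configured.")
        return StubManager(self.configured_assemblies)


def can_change_assembly(db_manager, target_assembly) -> bool:
    """Return True if the switch succeeds, False if it raises ValueError."""
    try:
        db_manager.change_assembly(target_assembly)
    except ValueError:
        return False
    return True
```

A batch workflow could then filter candidate assemblies with a list comprehension instead of wrapping every call in its own try/except.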

check_version_info()[source]

Infer whether the organism’s Ensembl IDs come with, without, or mixed versions.

The method scans all releases available for the current genome assembly and inspects the boolean flag in the version_info column of a pre-computed table (get_db("versioninfo")). Three mutually exclusive scenarios exist:

  • All releases lack version suffixes: "without_version"

  • All releases include suffixes: "with_version"

  • A mixture of both states: "add_version" (synthetic versions will be injected)

Returns:

One of "without_version", "with_version", or "add_version". Callers use the string to decide how to standardise identifier columns.

Return type:

str

Raises:

ValueError – If the version_info column in the source table is not strictly boolean, signalling a corrupted download or schema drift.
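
The three-way decision can be sketched as follows, using the convention documented for create_version_info() that a True flag means the release lacks version suffixes. The helper name and the handling of an empty input are assumptions.

```python
def summarise_version_flags(flags: list) -> str:
    """Map per-release boolean flags to one of the three documented labels."""
    # Assumption: an empty list is treated as invalid input.
    if not flags or not all(isinstance(flag, bool) for flag in flags):
        raise ValueError("version_info must contain only boolean values.")
    if all(flags):  # every release lacks version suffixes
        return "without_version"
    if not any(flags):  # every release includes version suffixes
        return "with_version"
    return "add_version"  # mixture: synthetic versions will be injected
```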

create_available_databases()[source]

Discover MySQL databases for the configured organism/assembly.

The manager issues a SHOW DATABASES query against the Ensembl public MySQL mirror and filters names that match ^{organism}_core_[0-9]+_.*$. The resulting list is returned as a single-column dataframe so that callers can seamlessly chain further pandas operations or persist the result.

Returns:

One column named "available_databases" listing all databases that match the organism, irrespective of Ensembl release or genome assembly.

Return type:

pandas.DataFrame

Raises:

ValueError – If the server response is not a sequence of single-field tuples or if any tuple element is not a string.
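
The name filter can be reproduced with the standard re module. The sketch below applies the pattern quoted above to a plain list of names; filter_core_databases is a hypothetical helper and the database names in the usage note are illustrative.

```python
import re


def filter_core_databases(names: list[str], organism: str) -> list[str]:
    """Keep names matching ^{organism}_core_[0-9]+_.*$ in their original order."""
    pattern = re.compile(rf"^{re.escape(organism)}_core_[0-9]+_.*$")
    return [name for name in names if pattern.match(name)]
```

Given "homo_sapiens_core_105_38", "homo_sapiens_variation_105_38", and "mus_musculus_core_105_39" with organism "homo_sapiens", only the first name is kept.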

create_database_content(just_download=False)[source]

Retrieve and optionally cache external-database metadata for every assembly, release, and form.

The helper iterates over every genome assembly configured for the current organism in idtrack._db.DB.assembly_mysqlport_priority, every available Ensembl release for each assembly, and every identifier form supported by the package, downloading the external_database table for each combination. The resulting frames are concatenated, enriched with assembly, release, form, and organism columns, and returned to the caller. When just_download is True the downloads are still performed (ensuring they are cached on disk for future runs) but an empty dataframe is returned to avoid unnecessary memory use.

Parameters:

just_download (bool) –

  • False - concatenate intermediate results and return the union dataframe (default).

  • True - download and cache each frame but return an empty dataframe.

Returns:

External-database relationships augmented with assembly, release, form, and organism columns. Empty when just_download is True.

Return type:

pandas.DataFrame

create_external_all(return_mode, narrow_external=True)[source]

Download and collate cross-reference mappings from every supported genome assembly.

The manager cycles through every genome assembly recognised for the current organism (ordered by idtrack._db.DB.assembly_mysqlport_priority), fetches either the filtered external_relevant mapping table (when narrow_external=True) or the full external mapping table (when narrow_external=False) for each via get_db(), labels every row with its source assembly, and finally concatenates the tables. Because this helper is intended for ad-hoc inspection only, it bypasses the get_db() caching layer and therefore never writes the result to the local repository.

Parameters:
  • return_mode (str) –

    Strategy for handling rows that appear in more than one assembly. Must be one of:

    • "all"

      Keep one copy of every unique (release, graph_id, id_db, name_db, ensembl_identity, xref_identity, assembly) combination. Duplicates are resolved within each assembly only.

    • "unique"

      Keep one copy of every unique (release, graph_id, id_db, name_db, ensembl_identity, xref_identity) combination across all assemblies, preferring the assembly with the highest priority. (Currently no downstream use case.)

    • "duplicated"

      Return only the rows that occur in more than one assembly as a pandas.core.groupby.generic.DataFrameGroupBy, keyed by the same column set used for "unique". (Currently no downstream use case.)

  • narrow_external (bool) – If True (default), restrict results to databases enabled in the external YAML configuration (external_relevant). If False, include all external databases provided by the Ensembl MySQL server (external).

Returns:

  • If return_mode is "all" or "unique", a de-duplicated cross-reference table with the columns release, graph_id, id_db, name_db, ensembl_identity, xref_identity, and assembly.

  • If return_mode is "duplicated", a group-by view containing only duplicated entries.

Return type:

Union[pandas.DataFrame, pandas.core.groupby.generic.DataFrameGroupBy]

Raises:

ValueError – If return_mode is not "all", "unique", or "duplicated".

create_external_db(filter_mode)[source]

Retrieve Ensembl-external-ID relationships and/or database statistics.

This consolidates a complex SQL join—spanning Ensembl core tables gene, transcript, translation and the cross-reference tables xref, object_xref, identity_xref, external_db, and external_synonym—into a single pandas dataframe. It enables downstream analyses such as mapping Ensembl gene models to UniProt, RefSeq, or CCDS identifiers, or summarising which external sources are represented in a given Ensembl release. The result type and granularity are controlled by filter_mode, allowing either the raw relationship rows or a per-database count to be returned.

The query executed is conceptually equivalent to the (simplified) MySQL statement below, though the actual SQL is constructed programmatically for flexibility and performance:

SELECT g.stable_id, t.stable_id, tr.stable_id, x.dbprimary_acc, edb.db_name, es.synonym, ix.*
FROM gene g
JOIN transcript t USING (gene_id)
JOIN translation tr USING (transcript_id)
JOIN object_xref ox ON (g.gene_id = ox.ensembl_id AND ox.ensembl_object_type = "Gene")
JOIN xref x ON (ox.xref_id = x.xref_id)
LEFT JOIN external_db edb ON (x.external_db_id = edb.external_db_id)
LEFT JOIN identity_xref ix ON (ox.object_xref_id = ix.object_xref_id)
LEFT JOIN external_synonym es ON (x.xref_id = es.xref_id)
LIMIT 10;

When tighter genomic scoping is required the gene table can be prefixed with coord_system and seq_region:

FROM coord_system cs
JOIN seq_region  sr USING (coord_system_id)
JOIN gene        g  USING (seq_region_id)

You can experiment interactively against the public Ensembl MySQL mirror:

mysql --user=anonymous --host=ensembldb.ensembl.org -D homo_sapiens_core_105_38 -A
# Schema reference:
# https://m.ensembl.org/info/docs/api/core/core_schema.html

Parameters:

filter_mode (str) –

Controls both the row subset and the output schema. Must be one of:

  • "all" - return every mapping found in MySQL, no post-filtering applied.

  • "relevant" - return only mappings whose external database is marked Include: true in the ExternalDatabases.give_list_for_case() YAML configuration.

  • "database" - return a two-column summary (name_db, count) for all external databases.

  • "relevant-database" - as above, but restricted to databases flagged Include: true.

The special values "relevant" and "relevant-database" implicitly consult the cached external_inst to honour the user’s curated allow-list.

Returns:

  • For "all" / "relevant" - six-column frame ["release", "graph_id", "id_db", "name_db", "ensembl_identity", "xref_identity"] holding one row per Ensembl→external identifier edge. graph_id is the Ensembl stable ID (+version), while the two identity columns store Smith-Waterman percent identities (float16) for QC.

  • For "database" / "relevant-database" - two-column frame ["name_db", "count"] giving how many distinct graph_id values each external database touches. count is an int64.

Return type:

pandas.DataFrame

Raises:

ValueError – If filter_mode is not one of the accepted literals or if the YAML allow-list claims a database that is absent from the retrieved mappings—indicating the configuration and MySQL data are out of sync.

Notes

Synonym handling - any synonym brought in from external_synonym is prefixed with DB.synonym_id_nodes_prefix, and its name_db is likewise prefixed so that synonym nodes remain distinguishable during graph building.

Caching - the heavy MySQL queries are executed only if the processed frame is not already present in the manager’s per-organism HDF5 cache; otherwise the cached frame is read from disk, ensuring repeat calls are inexpensive.

create_id_history(narrow)[source]

Retrieve historical relationships between successive Ensembl stable IDs.

Build a cross-release lineage table mapping every obsolete ID version to its immediate successor for the configured organism, form, and release window. The information is assembled from the Core tables stable_id_event and mapping_session and then normalised so that all identifiers follow the canonical <stable_id>.<version> convention. Downstream graph-construction utilities depend on this table to reconstruct how genes, transcripts, or translations evolve across Ensembl releases.

Parameters:

narrow (bool) – If True drop auxiliary columns (mapping session metadata, assembly labels, creation timestamps, etc.) to minimise on-disk footprint; otherwise return the full schema for exploratory analyses.

Returns:

Seven-column table with the following fields, ordered as listed—

  • old_stable_id - obsolete identifier (empty string for “birth” events).

  • old_version - version number paired with old_stable_id.

  • new_stable_id - successor identifier (empty string for “retirement” events).

  • new_version - version paired with new_stable_id.

  • score - homology score reported by Ensembl (NaN if unavailable).

  • old_release - Ensembl release where the old identifier last appeared.

  • new_release - release where the new identifier first appeared.

Return type:

pandas.DataFrame

Raises:

ValueError – If the identifier delimiter idtrack._db.DB.id_ver_delimiter is found inside any *_stable_id field, indicating malformed input.

create_id_history_fixed(narrow, inspect)[source]

Create a corrected ID-history table that repairs cyclic or duplicated version transitions (deprecated).

Certain edge cases in the raw idhistory extraction (e.g. Homo sapiens ENSG00000232423 at release 105) produce version transitions like 1 → 2, 2 → 3, 1 → 2, where an already retired version resurfaces later on. Such cycles violate the monotonic version semantics assumed by graph algorithms. This helper rewrites the offending rows so that once a version is superseded it never reappears, turning the final transition of the above sequence into the logically consistent 3 → 2. The routine is retained for reproducibility but superseded by DatabaseManager.create_id_history().

Parameters:
  • narrow (bool) – Propagated to the underlying data fetch—when True start from the column-reduced idhistory_narrow view instead of the full table.

  • inspect (bool) – When True add diagnostic columns (e.g. changed_old and changed_new) to aid manual auditing of the corrections; when False return only the cleaned canonical schema.

Returns:

Corrected seven-column table (old_stable_id, old_version, new_stable_id, new_version, score, old_release, new_release), ready for serialization and downstream use.

Return type:

pandas.DataFrame

Note

This function is deprecated and will be removed in a future major release once the core extractor fully addresses the ordering anomaly.

create_ids(form)[source]

Retrieve and normalise raw Ensembl identifier records for the requested molecular form.

This method pulls the appropriate MySQL table(s) for form, copes with schema differences across Ensembl releases (e.g. the historical *_stable_id split tables), coerces data types, and standardises column names so that downstream graph-building steps all consume the same shape. It finishes by delegating to DatabaseManager.version_uniformize() to ensure the Version field is either a proper integer or NaN across the entire DataFrame.

Parameters:

form (str) – Target molecular form - "gene", "transcript", or "translation". Anything else triggers a ValueError.

Returns:

A de-duplicated, index-reset table whose columns depend on form:

  • gene - gene_id, gene_stable_id, gene_version

  • transcript - transcript_id, gene_id, transcript_stable_id, transcript_version

  • translation - translation_id, transcript_id, translation_stable_id, translation_version

All ID columns are int64 except the *_stable_id strings; version columns are int64 or float64 (with NaN when absent).

Return type:

pandas.DataFrame

create_relation_archive()[source]

Retrieve a cross-release gene-transcript-translation mapping table.

This legacy helper pulls the Ensembl gene_archive table—spanning all releases for the current organism—via DatabaseManager.get_table(), drops columns unrelated to identifier mapping, and passes the result to DatabaseManager._create_relation_helper(). Because the archive contains known gaps, the preferred workflow is to call DatabaseManager.create_relation_current() once per release and concatenate the outputs.

Returns:

Same schema as DatabaseManager.create_relation_current() (gene, transcript, translation), but potentially with missing rows because Ensembl did not always back-populate older releases.

Return type:

pandas.DataFrame

create_relation_current()[source]

Build a current-release gene-transcript-translation mapping table.

The routine fetches the raw stable-ID/version tables for genes, transcripts and translations via DatabaseManager.get_db(), merges them into a single wide frame, and then delegates to DatabaseManager._create_relation_helper() to harmonise version columns and compress the information into three canonical node labels ("<stable_id>.<version>"). The resulting mapping is the authoritative per-release link between molecular forms and is consumed by downstream graph-building utilities such as DatabaseManager.create_graph().

Returns:

Three columns—gene, transcript, and translation—with one row per transcript. The translation column may contain empty strings where non-coding transcripts have no peptide. All data are UTF-8 strings; duplicates are removed and the index is reset.

Return type:

pandas.DataFrame

create_release_id()[source]

Return deduplicated stable-identifier/version pairs for the current form and release.

Raw identifiers are fetched via DatabaseManager.get_db(), normalised with DatabaseManager.version_fix(), trimmed to the canonical columns, and sanity-checked. Two integrity rules are enforced: (1) the delimiter idtrack._db.DB.id_ver_delimiter must not appear inside any stable identifier, and (2) every stable identifier must be unique after deduplication. Violations raise ValueError.

Returns:

Two-column dataframe [{form}_stable_id, {form}_version] with duplicates removed.

Return type:

pandas.DataFrame

Raises:

ValueError – If the delimiter is present inside any stable identifier or if identifiers are not unique after deduplication.
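
The two integrity rules can be sketched on plain (stable_id, version) tuples. The delimiter value "." is an assumption inferred from the <stable_id>.<version> convention; the actual value lives in idtrack._db.DB.id_ver_delimiter, and check_release_ids is a hypothetical helper name.

```python
def check_release_ids(pairs, delimiter="."):
    """Deduplicate (stable_id, version) pairs and enforce both integrity rules."""
    unique = list(dict.fromkeys(pairs))  # drops exact duplicates, keeps order
    stable_ids = [stable_id for stable_id, _ in unique]

    # Rule 1: the delimiter must not appear inside a stable identifier.
    if any(delimiter in stable_id for stable_id in stable_ids):
        raise ValueError("Delimiter found inside a stable identifier.")

    # Rule 2: each stable identifier may appear with one version only.
    if len(set(stable_ids)) != len(stable_ids):
        raise ValueError("Stable identifiers are not unique after deduplication.")

    return unique
```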

create_version_info()[source]

Determine whether each Ensembl release stores identifiers with or without version suffixes.

Ensembl stable identifiers can appear either with a .version facet (e.g. ENSG00000139618.17) or without it (e.g. YAL001C in S. cerevisiae). For robust cross-release tracking the package needs to know which convention applies to every release of the current organism. The method loops over available_releases, downloads the raw identifier table for self.form, and inspects the <form>_version column:

  • All values NaN → the release uses unversioned identifiers.

  • No values NaN → the release uses versioned identifiers.

  • Mixed NaN / non-NaN → unsupported; raises NotImplementedError.

The outcome is encoded as a Boolean flag per release and later consumed by check_version_info() to decide whether version strings should be kept, stripped, or synthesised.

Returns:

Two-column table with:
  • ensembl_release - integer release number.

  • version_info - True if all identifiers lack a version suffix, False if all identifiers include a version suffix.

Return type:

pandas.DataFrame

Raises:

NotImplementedError – If any individual release contains a mixture of versioned and unversioned identifiers, indicating an inconsistent upstream annotation.
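
The per-release classification can be sketched on a plain list of version values, where a missing version is represented by None or a float NaN. The helper name is hypothetical, and the behaviour for an empty list is not covered by the description above.

```python
import math


def release_lacks_versions(versions: list) -> bool:
    """Return True if every version is missing, False if none is missing."""
    missing = [
        v is None or (isinstance(v, float) and math.isnan(v))
        for v in versions
    ]
    if all(missing):
        return True  # unversioned identifiers
    if not any(missing):
        return False  # versioned identifiers
    raise NotImplementedError("Release mixes versioned and unversioned identifiers.")
```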

download_table(table_key, usecols=None)[source]

Download a raw Ensembl MySQL table and return it as a DataFrame.

The method forms the low-level backbone of all table acquisition in IDTrack. It opens a direct connection to the Ensembl Core (or comparable) MySQL schema configured on the current DatabaseManager instance, issues a SELECT statement against table_key, converts the results into a pandas.DataFrame, and performs a minimal sanitisation pass (bytes-to-string decoding, column subset validation, logging). Public code is expected to call DatabaseManager.get_table(), which wraps this helper with caching and post-processing, but keeping this routine separate allows fine-grained testing, mocking, and reuse in advanced workflows.

Parameters:
  • table_key (str) – Name of the raw table as it appears in the remote Ensembl database (e.g. 'gene', 'mapping_session', 'xref'). Must exist in the schema returned by DatabaseManager.mysql_database().

  • usecols (Optional[list[str]]) – Sequence of column names to project; None retrieves the entire table. Column order is preserved. An empty list is treated the same as None.

Returns:

A frame containing the requested columns in the exact order supplied via usecols (or all columns if usecols is None). Index is monotonic and zero-based.

Return type:

pandas.DataFrame

Raises:

ValueError – If any element of usecols is missing from table_key, or if the query returns binary payloads that cannot be coerced into native Python types.

property external_inst: ExternalDatabases

Instantiate and cache an ExternalDatabases helper for this manager.

The instance mirrors the configuration of the surrounding DatabaseManager—organism, Ensembl release, identifier form, local repository path, and genome assembly—so that all interactions with external data sources remain consistent throughout the session. Because the property is backed by functools.cached_property, the helper is created exactly once and reused on subsequent accesses, eliminating redundant network or file-system look-ups.

Returns:

A lazily created, configuration-matched helper object.

Return type:

ExternalDatabases

file_name(df_type, *args, ensembl_release=None, **kwargs)[source]

Resolve HDF5 hierarchy key and absolute file path for a dataframe request.

This internal helper centralises every rule that DatabaseManager uses to build HDF5 hierarchy keys and their corresponding on-disk filenames, ensuring that any two call-sites confronted with the same combination of organism, genome assembly, Ensembl release, dataframe kind, and optional column subset produce identical results. By funnelling every I/O operation through this method the wider package avoids silent cache misses, duplicate downloads, and hard-to-trace inconsistencies in downstream analytics. Public code is expected to invoke higher-level wrappers such as DatabaseManager.get_db(); use this routine only when implementing new caching utilities or in low-level tests.

Parameters:
  • df_type (str) – Category of dataframe whose name is required. Accepted values are "processed", "mysql", and "common"; any other string triggers ValueError.

  • ensembl_release (int, optional) – Ensembl release to encode in the filename. If None, the current DatabaseManager.ensembl_release is used instead.

  • kwargs – Additional keyword arguments forwarded to the helper that handles the selected df_type (currently only usecols for the mysql path).

  • args

    Positional arguments interpreted according to df_type:

    • processed - df_indicator (str): symbolic label such as "idhistory" or "idsraw_gene". The manager appends DatabaseManager.form so that artefacts for different biological forms do not collide.

    • mysql - table_key (str): raw MySQL table name (e.g. "gene", "exon"). An optional usecols (list[str]) must then be supplied via kwargs; the column list is embedded in the hierarchy using the delimiter held in DatabaseManager._column_sep.

    • common - df_indicator (str): same as the processed case but without the form suffix, allowing cross-form artefacts (e.g. "availabledatabases") to share a single key.

Returns:

Two-element tuple (hierarchy_key, file_path) where hierarchy_key is the internal node path (e.g. "ens111_mysql_gene_COL_gene_id") and file_path is the absolute path to <local_repository>/<organism>_assembly-<assembly>.h5. The path is not created on disk; callers remain responsible for reading or writing the HDF5 file.

Return type:

tuple[str, str]

Raises:

ValueError – If df_type is not one of the accepted categories or if the positional/keyword argument combination does not satisfy the expectations for that category (e.g. missing table_key when df_type is "mysql").

get_db(df_indicator, create_even_if_exist=False, save_after_calculation=True, overwrite_even_if_exist=False)[source]

Retrieve or create a cached data table defined by an indicator string.

This method is the central gateway for all tabular resources managed by DatabaseManager. It interprets a compact indicator string, decides whether the requested table already exists in the local HDF5 repository, and either loads the cached copy or triggers the appropriate builder (create_* helper) to download/assemble it. A consistent naming convention is maintained so that subsequent calls with the same indicator transparently reuse the on-disk cache, ensuring reproducible builds and minimal network traffic.

Supported base indicators

  • external — cross-reference database registry; optional qualifier relevant | database | relevant-database narrows the view.

  • idsraw — raw Ensembl identifiers for a given form ("gene", "transcript", "translation"); requires the form as qualifier.

  • ids — release-specific identifier table (no qualifier).

  • externalcontent — summary of per-database content.

  • relationcurrent — current gene/ID relationships.

  • relationarchive — historical gene/ID relationships across releases.

  • idhistory — full ID history; qualifier narrow restricts to current IDs.

  • versioninfo — version comparison across releases.

  • availabledatabases — list of locally cacheable resources.

Additional indicators may be introduced by subclass extensions; consult the module documentation for the authoritative list.

Parameters:
  • df_indicator (str) – Compact descriptor of the table to retrieve. Must follow the base[qualifier] pattern described above.

  • create_even_if_exist (bool) – Force a rebuild/download even if a cached copy is present. Defaults to False.

  • save_after_calculation (bool) – Persist a newly created table to the local HDF5 store. Has no effect when the table is merely loaded from disk. Defaults to True.

  • overwrite_even_if_exist (bool) – When saving, replace an existing HDF5 key with the same hierarchy (file-internal path). Defaults to False.

Returns:

The requested dataset. The exact shape, index, and column layout depend on df_indicator; see the indicator list above for semantic details.

Return type:

pandas.DataFrame | pandas.Series

Raises:

ValueError – If df_indicator is malformed, references an unsupported resource, or its qualifier violates the expected pattern (e.g., missing form for idsraw).

get_release_date()[source]

Return a mapping of Ensembl release numbers to their publication dates.

The future implementation will query the meta table of each reachable release, or fall back to the Ensembl REST API, to build a dictionary such as {105: date(2022, 11, 1), 106: date(2023, 2, 7), …}. Downstream routines can then translate between absolute dates and release numbers, enabling chronology-aware analyses and reporting.

Raises:

NotImplementedError – Always - date discovery is not yet implemented.

get_table(table_key, usecols=None, create_even_if_exist=False, save_after_calculation=True, overwrite_even_if_exist=False)[source]

Download, cache, or read a raw MySQL table for the current release.

A high-level wrapper that coordinates three steps:

  1. Path resolution - determines the HDF5 file and internal key under the local repository that belong to table_key (and usecols, if provided).

  2. Fetch or reuse - if the target key is absent, unreadable, or forcibly refreshed, delegates to download_table() to query the MySQL server; otherwise loads the dataframe from disk.

  3. Persistence - optionally stores the freshly downloaded dataframe back to disk, shrinking the number of future network calls.

Parameters:
  • table_key (str) – Name of the MySQL table (e.g. "gene", "xref", "mapping_session").

  • usecols (list[str] | None) – Column subset to retrieve. None (default) selects all columns.

  • create_even_if_exist (bool) – Ignore any on-disk cache and re-download the table unconditionally.

  • save_after_calculation (bool) – Persist the dataframe to the computed HDF5 path when True.

  • overwrite_even_if_exist (bool) – Replace an existing HDF5 key even when it is already present.

Returns:

The requested raw table with column order mirroring usecols when supplied, otherwise the server’s natural order.

Return type:

pandas.DataFrame

Raises:

ValueError – If usecols is an empty list, not a list, or otherwise fails basic validation.
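
The fetch-or-reuse flow above can be sketched with a plain dictionary standing in for the HDF5 store. The names get_or_download, store, and fake_download are hypothetical and not part of IDTrack; the sketch also assumes that saving leaves an existing key untouched unless overwrite_even_if_exist is set.

```python
from typing import Any, Callable, Dict, List


def get_or_download(
    key: str,
    store: Dict[str, Any],
    download: Callable[[str], Any],
    create_even_if_exist: bool = False,
    save_after_calculation: bool = True,
    overwrite_even_if_exist: bool = False,
) -> Any:
    """Return the cached value for key, downloading it when needed."""
    # Fetch or reuse: serve the cached copy unless a refresh is forced.
    if key in store and not create_even_if_exist:
        return store[key]
    table = download(key)
    # Persistence: keep an existing key unless overwriting is requested (assumption).
    if save_after_calculation and (key not in store or overwrite_even_if_exist):
        store[key] = table
    return table


calls: List[str] = []


def fake_download(name: str) -> List[str]:
    """Hypothetical stand-in for a MySQL query; records each call."""
    calls.append(name)
    return [f"{name}-row-{len(calls)}"]


cache: Dict[str, Any] = {}
first = get_or_download("gene", cache, fake_download)   # downloads and caches
second = get_or_download("gene", cache, fake_download)  # served from the cache
forced = get_or_download("gene", cache, fake_download, create_even_if_exist=True)
```

In this sketch the forced refresh returns fresh data but leaves the cached copy in place, because overwrite_even_if_exist stays False.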

id_ver_from_df(dbm_the_ids)[source]

Assemble fully qualified node names from a stable-ID / version DataFrame.

This convenience routine converts a two-column frame—usually produced by DatabaseManager.get_db() with the ids form—into the canonical node labels used throughout ID-track graphs (e.g. ENSG00000000001.1). It first validates that the input columns match self._identifiers (typically ["gene_stable_id", "gene_version"] or analogous for the current form), then delegates per-row processing to DatabaseManager.node_dict_maker() and DatabaseManager.node_name_maker(). The resulting list may be fed directly into downstream graph builders or written to disk for later reuse.

Parameters:

dbm_the_ids (pandas.DataFrame) – Two-column frame containing the stable identifiers and their Ensembl version numbers. The column order and names must exactly match self._identifiers; otherwise an exception is raised.

Returns:

Ordered list where each element is either "<ID>.<version>" when a valid numeric version is present or simply "<ID>" when the version is None / NaN / an alternative marker (see idtrack._db.DB.alternative_versions).

Return type:

list[str]

Raises:

ValueError – If dbm_the_ids does not contain the expected column names stored in self._identifiers.

property mysql_database: str

Return the canonical Ensembl Core schema name for the current organism, release, and assembly.

The schema naming convention is deterministic:

{organism}_core_{ensembl_release}_{genome_assembly}[<patch>]

For multi-port assemblies (e.g. sus_scrofa assembly 102), the port is selected in __init__() using _get_core_db_index(). Once the release is validated to exist on that chosen port, the schema name itself does not require another server-side discovery query.

Returns:

Schema name like "homo_sapiens_core_111_38".

Return type:

str

Raises:

ValueError – If the current release is not available for this (organism, assembly) pair.
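
As an illustration of the naming convention, the schema name can be assembled with a single format string. The function name core_schema_name is hypothetical, and treating the optional patch as a plain string suffix is an assumption.

```python
def core_schema_name(
    organism: str, ensembl_release: int, genome_assembly: int, patch: str = ""
) -> str:
    """Build {organism}_core_{ensembl_release}_{genome_assembly}[<patch>]."""
    # The patch suffix is appended verbatim; its real format is not specified here.
    return f"{organism}_core_{ensembl_release}_{genome_assembly}{patch}"


schema = core_schema_name("homo_sapiens", 111, 38)
```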

static node_dict_maker(id_entry, version_entry)[source]

Return a normalized ID/Version dictionary from raw column values.

This helper creates the canonical structure consumed by DatabaseManager.node_name_maker() and higher-level graph utilities, ensuring that version numbers are strictly integers whenever possible. It also recognises special placeholders defined in idtrack._db.DB.alternative_versions (e.g. "Retired" or "Void") and passes them through unchanged so that downstream code can handle deprecated or missing entries appropriately.

Parameters:
  • id_entry (str) – Stable identifier portion preceding the delimiter (e.g. "ENSG00000000001").

  • version_entry (Any) – Raw version value following the delimiter (e.g. 1 in "ENSG00000000001.1"). May be float, int, str, None, NaN, or an alternative placeholder such as "Retired".

Returns:

{"ID": id_entry, "Version": version_entry} with Version coerced to int when it represents a whole number.

Return type:

dict[str, Any]

Raises:

ValueError – If version_entry is numeric but contains a fractional component (e.g. 1.2), indicating a malformed identifier that cannot be represented as an integer version.
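
A minimal sketch of the normalisation described above, not the library's implementation. The function name node_dict_sketch is hypothetical, and ALTERNATIVE_VERSIONS is an assumed stand-in for idtrack._db.DB.alternative_versions that contains only the two placeholders named in the text.

```python
import math
from typing import Any, Dict

# Assumed stand-in for idtrack._db.DB.alternative_versions.
ALTERNATIVE_VERSIONS = {"Retired", "Void"}


def node_dict_sketch(id_entry: str, version_entry: Any) -> Dict[str, Any]:
    """Return {"ID": ..., "Version": ...} with whole-number versions as int."""
    # None and alternative placeholders pass through unchanged.
    if version_entry is None or version_entry in ALTERNATIVE_VERSIONS:
        return {"ID": id_entry, "Version": version_entry}
    # NaN also passes through unchanged.
    if isinstance(version_entry, float) and math.isnan(version_entry):
        return {"ID": id_entry, "Version": version_entry}
    as_float = float(version_entry)  # accepts int, float, and numeric strings
    if not as_float.is_integer():
        raise ValueError(f"Fractional version {version_entry!r} for {id_entry!r}")
    return {"ID": id_entry, "Version": int(as_float)}


versioned = node_dict_sketch("ENSG00000000001", 1.0)
retired = node_dict_sketch("ENSG00000000001", "Retired")
```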

static node_name_maker(node_dict)[source]

Concatenate ID and Version into a single node label.

Given the miniature dictionary returned by DatabaseManager.node_dict_maker(), this helper builds the string representation that uniquely identifies a biological entity within the graph layer. When a numeric version is available, it appends that value to the stable ID using idtrack._db.DB.id_ver_delimiter ("." by default). For organisms or datasets lacking versioned identifiers, it falls back to the bare stable ID to preserve compatibility.

Parameters:

node_dict (dict[str, Any]) – Mapping with exactly two keys, "ID" and "Version", as produced by DatabaseManager.node_dict_maker().

Returns:

Either "<ID>.<version>" or "<ID>" depending on whether a non-null, non-alternative version is present.

Return type:

str
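
The labelling rule can be sketched as follows. The function name node_name_sketch is hypothetical; ID_VER_DELIMITER and ALTERNATIVE_VERSIONS are assumed stand-ins for idtrack._db.DB.id_ver_delimiter and idtrack._db.DB.alternative_versions.

```python
import math
from typing import Any, Dict

ID_VER_DELIMITER = "."                      # default delimiter named in the text
ALTERNATIVE_VERSIONS = {"Retired", "Void"}  # assumed placeholder set


def node_name_sketch(node_dict: Dict[str, Any]) -> str:
    """Join ID and Version, falling back to the bare ID when no version applies."""
    version = node_dict["Version"]
    is_missing = version is None or (
        isinstance(version, float) and math.isnan(version)
    )
    if is_missing or version in ALTERNATIVE_VERSIONS:
        return node_dict["ID"]
    return f"{node_dict['ID']}{ID_VER_DELIMITER}{version}"


with_version = node_name_sketch({"ID": "ENSG00000000001", "Version": 1})
without_version = node_name_sketch({"ID": "ENSG00000000001", "Version": None})
```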

tables_in_disk()[source]

List all dataframes cached for this manager on local disk.

The helper inspects the HDF5 file located at the path generated by file_name() (df_type="common") and returns every key it contains. When the file does not exist yet, an empty list is returned instead of raising.

Returns:

Sorted HDF5 keys corresponding to dataframes already materialised for this manager.

Return type:

list[str]

version_fix(df, version_str, version_info=None)[source]

Apply a global ID-version policy to a DataFrame.

Depending on the organism and its historical annotation quirks, identifiers may (1) never include a version, (2) always include a version, or (3) require a synthetic version when mixing cross-release data. The version_info flag encodes that policy:

  • "without_version" — strip all versions (set column to NaN).

  • "with_version" — cast column to int64 (all values must exist).

  • "add_version" — fill missing entries with DB.first_version.

Parameters:
  • df (pandas.DataFrame) – Frame whose version_str column needs harmonising.

  • version_str (str) – Name of the column that stores version numbers.

  • version_info (Optional[str]) – One of "add_version", "without_version", or "with_version". When None (default) the method calls check_version_info() to determine the correct policy automatically.

Returns:

Same object df with version_str updated in-place.

Return type:

pandas.DataFrame

Raises:

ValueError – If version_info is not recognised.
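
The three policies can be sketched with pandas as below. This is an illustration under stated assumptions rather than the library code: version_fix_sketch is a hypothetical name, FIRST_VERSION = 1 is an assumed value for DB.first_version, and the automatic policy lookup through check_version_info() is left out.

```python
import numpy as np
import pandas as pd

FIRST_VERSION = 1  # assumed value standing in for DB.first_version


def version_fix_sketch(
    df: pd.DataFrame, version_str: str, version_info: str
) -> pd.DataFrame:
    """Apply one of the three version policies to df[version_str]."""
    if version_info == "without_version":
        df[version_str] = np.nan  # strip every version
    elif version_info == "with_version":
        df[version_str] = df[version_str].astype("int64")  # all values must exist
    elif version_info == "add_version":
        df[version_str] = df[version_str].fillna(FIRST_VERSION)  # fill the gaps
    else:
        raise ValueError(f"Unrecognised version_info: {version_info!r}")
    return df


frame = pd.DataFrame({"gene_stable_id": ["A", "B"], "gene_version": [2.0, np.nan]})
added = version_fix_sketch(frame.copy(), "gene_version", "add_version")
stripped = version_fix_sketch(frame.copy(), "gene_version", "without_version")
```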

version_fix_incomplete(df_fx, id_col_fx, ver_col_fx)[source]

Clean up version columns when some identifiers are entirely missing.

Ensembl translation tables occasionally encode parent IDs without a version while descendants retain one, producing frames where id_col_fx is NaN but ver_col_fx contains a number. This helper splits the frame, delegates to version_fix() for each subset, then stitches the pieces back together so that every row obeys a single “with/without/add version” policy.

Parameters:
  • df_fx (pandas.DataFrame) – Data to harmonise. The frame must include id_col_fx and ver_col_fx.

  • id_col_fx (str) – Column holding the stable part of the identifier (e.g. "translation_id").

  • ver_col_fx (str) – Column holding the integer version suffix.

Returns:

Frame whose ver_col_fx is consistent with the organism-level policy determined by check_version_info().

Return type:

pandas.DataFrame

version_uniformize(df, version_str)[source]

Normalise a Version column so every entry is either an int or NaN.

This post-processing helper finalises the output of DatabaseManager.create_ids(). Ensembl releases differ: some assign an explicit integer version to every stable identifier, whereas others omit the suffix entirely. Downstream code expects a uniform dtype, so this routine coerces the designated column to a proper integer when all entries are present or fills the entire column with np.nan when none are. Mixed presence is forbidden because it would break the ID-version pairing logic used by DatabaseManager.node_name_maker().

Parameters:
  • df (pandas.DataFrame) – Frame returned by create_ids(); must already contain a column named version_str.

  • version_str (str) – Name of the column that holds version information (e.g. "gene_version").

Returns:

Same object df with version_str either cast to int64 or overwritten with np.nan for every row.

Return type:

pandas.DataFrame

Raises:

NotImplementedError – If some rows have a version and others do not, indicating an Ensembl release with inconsistent schema. Such a release is currently unsupported.
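
A minimal pandas sketch of the all-or-nothing rule described above. The function name version_uniformize_sketch is hypothetical, and the sketch only mirrors the documented behaviour.

```python
import numpy as np
import pandas as pd


def version_uniformize_sketch(df: pd.DataFrame, version_str: str) -> pd.DataFrame:
    """Cast the version column to int64, or blank it out when no version exists."""
    missing = df[version_str].isna()
    if not missing.any():
        df[version_str] = df[version_str].astype("int64")  # every row has a version
    elif missing.all():
        df[version_str] = np.nan  # no row has a version
    else:
        raise NotImplementedError("Mixed presence of versions is not supported.")
    return df


all_present = version_uniformize_sketch(
    pd.DataFrame({"gene_version": [1.0, 4.0]}), "gene_version"
)
none_present = version_uniformize_sketch(
    pd.DataFrame({"gene_version": [np.nan, np.nan]}), "gene_version"
)
```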

Parameters:
  • organism (str)

  • form (str)

  • local_repository (str)

  • ensembl_release (int | None)

  • ignore_before (int | None)

  • ignore_after (int | float | None)

  • store_raw_always (bool)

  • genome_assembly (int | None)

class ExternalDatabases(organism, ensembl_release, form, local_repository, genome_assembly)[source]

Bases: object

Manage third-party metadata for Ensembl entities through YAML side-car files.

This helper encapsulates everything related to the external (i.e. non-Ensembl) databases that can be linked to a given organism, genome assembly, release, and biological form (gene / transcript / translation). Examples of such resources include ArrayExpress, RefSeq, Uniprot, HGNC, and dozens of smaller annotation providers. Rather than hard-coding those relationships, the wider ID-Track toolkit stores them in a human-readable YAML file that lives next to the local data cache managed by _database_manager.DatabaseManager.

The YAML workflow is:

  1. create_template_yaml() enumerates every known combination and writes a template where each entry is marked Include: false.

  2. A user (or an automated post-processing step) reviews the template, toggling Include to true for the resources they need.

  3. The modified YAML is saved under local_repository; subsequent calls to load_modified_yaml() return it as a plain dict for downstream logic.

  4. validate_yaml_file_up_to_date() warns if the user file lags behind a newer template (e.g. because a later Ensembl release introduced extra tables).

  5. Utility helpers such as give_list_for_case() expose convenient filtered views—e.g. all databases that should be downloaded for the current form, or all releases supported by assembly 38.

In short, ExternalDatabases provides a single, version-controlled “contract” describing which third-party tables belong in an ID-track run, while granting users explicit opt-in control over optional resources.

Instantiate a YAML controller tied to a specific organism, release, and assembly.

The constructor mirrors the core configuration of _database_manager.DatabaseManager so that both objects operate on the exact same coordinate system. No I/O is performed at construction time; paths are merely recorded, and loggers are configured. Heavy-weight actions—such as scanning the cache for existing YAMLs or writing new ones—happen lazily when the corresponding methods are called.

Parameters:
  • organism (str) – Canonical Ensembl species identifier in snake_case (e.g. "homo_sapiens"). Case-insensitive but must match Ensembl conventions.

  • ensembl_release (int) – Target Ensembl release number (e.g. 110). Must correspond to a release that actually exists for organism and genome_assembly.

  • form (str) – Entity level—"gene", "transcript", or "translation". Any other value raises ValueError in higher-level validation.

  • local_repository (str) – Writable directory where YAML files and downloads are cached. The directory need not pre-exist; if missing, most public methods will attempt to create it.

  • genome_assembly (int) – Genome assembly code as used in Ensembl core schema naming (e.g. 38 = human GRCh38, 37 = human GRCh37, 39 = mouse GRCm39, 111 = pig Sscrofa11.1). Used to disambiguate multiple assemblies available for the same organism/release pair.

create_template_yaml(df)[source]

Generate a template YAML enumerating external-database options.

This helper scans df (typically the dataframe returned by idtrack._database_manager.DatabaseManager.create_database_content()) and writes a scaffold configuration file to file_name_template_yaml(). The file lists every organism → form → database combination observed in df, grouped by genome assembly and Ensembl release. For each entry the template records whether the database should be included when building an ID-history graph, its integer Database Index, and an empty Potential Synonymous placeholder that future versions may use to flag overlapping resources.

Users are expected to edit the generated file—changing Include from false to true where appropriate—and rename it by appending _modified to the filename before the package will load it. A warning to that effect is emitted via logging.Logger.warning().

The resulting YAML resembles the structure below (truncated for brevity):

homo_sapiens:
    gene:
        ArrayExpress:
            Assembly:
                "37":
                    Ensembl release: 79,80,81,82,83,84,85,86,87,88,89
                    Include: false
                "38":
                    Ensembl release: 79,80,81,82,83,84,85,86,87,88,89
                    Include: false
            Database Index: 0
            Potential Synonymous: ""
        Clone-based (Ensembl):
            Assembly:
                "37":
                    Ensembl release: 79,80,81,82,83,84,85
                    Include: false
                "38":
                    Ensembl release: 79,80,81,82,83,84,85
                    Include: false
            Database Index: 5
            Potential Synonymous: ""

Editing guidelines

  • Set Include to true for every assembly of the databases you need.

  • Save the edited file with _modified appended to the base name so that downstream routines load the customised version.
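
The first guideline can also be applied programmatically once the template has been parsed into a dictionary. The sketch below works on a plain dict and leaves file reading and writing out; include_database is a hypothetical helper, and the release lists are shortened placeholders.

```python
from typing import Any, Dict


def include_database(
    config: Dict[str, Any], organism: str, form: str, database: str
) -> None:
    """Set Include to True for every assembly of one database, in place."""
    assemblies = config[organism][form][database]["Assembly"]
    for entry in assemblies.values():
        entry["Include"] = True


# Shortened, illustrative copy of the template structure shown above.
template = {
    "homo_sapiens": {
        "gene": {
            "ArrayExpress": {
                "Assembly": {
                    "37": {"Ensembl release": "79,80,81", "Include": False},
                    "38": {"Ensembl release": "79,80,81", "Include": False},
                },
                "Database Index": 0,
                "Potential Synonymous": "",
            }
        }
    }
}

include_database(template, "homo_sapiens", "gene", "ArrayExpress")
```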

Parameters:

df (pandas.DataFrame) – Dataframe containing at least the columns ["organism", "form", "name_db", "assembly", "release"]. It should be produced by idtrack._database_manager.DatabaseManager.create_database_content() so that the expected schema is guaranteed.

Raises:

ValueError – If df contains duplicate assembly entries for the same organism/form/database triple, causing an internal consistency check to fail.

Notes

Potential Synonymous is currently empty for every entry. A future version aims to use it to prevent synonymous databases from appearing in the list. Likewise, Database Index currently has no use in the program. It is important to follow the final warning raised by the method: "Please edit the file based on requested external databases and add '_modified' to the file name." Edit the file by changing Include from false to true; it is recommended to make the change for every assembly of a given database.

file_name_modified_yaml(mode)[source]

Resolve the path to a modified YAML file customised by the user or shipped with the package.

The method supports two modes that map to different storage locations:

  • "configured" - the user-edited file living in ExternalDatabases.local_repository.

  • "default" - the read-only fallback bundled under <package_root>/default_config for quick starts and unit tests.

By funnelling every lookup through this routine, higher-level helpers such as ExternalDatabases.load_modified_yaml() remain agnostic about the underlying directory structure and can focus on validation and parsing instead.

Parameters:

mode (str) – Either "configured" or "default" selecting the corresponding search location.

Returns:

Absolute path of the requested YAML file.

Return type:

str

Raises:

ValueError – If mode is not one of the recognised values.

file_name_template_yaml()[source]

Return absolute path to the template YAML configuration file.

A helper that deterministically builds the filename used by ExternalDatabases.create_template_yaml() when it first scaffolds the external-database configuration for organism. Centralising the logic here keeps every component of idtrack that may need the path (tests, CLI tools, future maintenance scripts) in perfect sync with a single implementation. The method performs no I/O; it merely concatenates ExternalDatabases.local_repository and the conventional filename pattern "<organism>_externals_template.yml" so callers can decide whether to create, read, or overwrite the file.

Returns:

Absolute path of <organism>_externals_template.yml located inside ExternalDatabases.local_repository.

Return type:

str

give_list_for_case(give_type)[source]

Return database names or assembly codes extracted from the external-DB YAML file.

The helper provides a lightweight way for higher-level components (e.g. DatabaseManager) to discover which external resources—or which genome assemblies—are currently eligible according to the user-editable YAML configuration created by ExternalDatabases.create_template_yaml(). Instead of forcing the caller to parse the YAML structure manually, the method filters the entries for the manager’s organism, form, Ensembl release and genome assembly and returns the requested slice.

Parameters:

give_type (str) –

Kind of list to return. Accepted values are

  • "db" - external-database names (str) whose Include flag is true for the current organism, form, assembly and Ensembl release.

  • "assembly" - genome-assembly codes (int) for which the YAML enables at least one external database (Include: true) at the current Ensembl release.

Returns:

  • When give_type is "db", a list of database names.

  • When give_type is "assembly", a list of assembly codes.

Return type:

list[str] | list[int]

Raises:

ValueError – If give_type is neither "db" nor "assembly", or if an unexpected internal inconsistency is encountered while traversing the YAML structure.
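
The filtering can be sketched on an already parsed configuration. Everything below is illustrative: list_for_case and releases_of are hypothetical names, the sketch assumes that Ensembl release is stored as a comma-separated string and that assembly keys are strings, and the sample entries are placeholders.

```python
from typing import Any, Dict, List, Set


def releases_of(entry: Dict[str, Any]) -> Set[int]:
    """Parse the comma-separated release list of one assembly entry."""
    return {int(r) for r in str(entry["Ensembl release"]).split(",") if r.strip()}


def list_for_case(
    config: Dict[str, Any], organism: str, form: str,
    release: int, assembly: int, give_type: str,
) -> List[Any]:
    """Return included database names or assembly codes for one release."""
    if give_type not in ("db", "assembly"):
        raise ValueError(f"Unknown give_type: {give_type!r}")
    result = set()
    for db_name, db_entry in config[organism][form].items():
        for asm, entry in db_entry["Assembly"].items():
            if not entry["Include"] or release not in releases_of(entry):
                continue
            if give_type == "assembly":
                result.add(int(asm))
            elif int(asm) == assembly:
                result.add(db_name)
    return sorted(result)


config = {"homo_sapiens": {"gene": {
    "ArrayExpress": {"Assembly": {
        "37": {"Ensembl release": "79,80", "Include": True},
        "38": {"Ensembl release": "79,80", "Include": False}}},
    "HGNC Symbol": {"Assembly": {
        "38": {"Ensembl release": "80,81", "Include": True}}},
}}}

databases = list_for_case(config, "homo_sapiens", "gene", 80, 38, "db")
assemblies = list_for_case(config, "homo_sapiens", "gene", 80, 38, "assembly")
```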

load_modified_yaml()[source]

Load the user-edited or default YAML configuration and verify release compatibility.

This convenience wrapper searches for the configured YAML file first; if it does not exist or lacks read permissions a warning is logged and the default YAML file shipped with the package is tried instead. Failure to locate either file aborts the process with FileNotFoundError. After loading, the method delegates to ExternalDatabases.validate_yaml_file_up_to_date() to ensure that the currently requested Ensembl release is represented in the configuration.

Returns:

Parsed YAML content keyed by {organism → form → database → Assembly → {...}}.

Return type:

dict

Raises:

FileNotFoundError – If neither the configured nor the default YAML file can be accessed.

validate_yaml_file_up_to_date(read_yaml_file)[source]

Assert that the YAML configuration lists the active Ensembl release.

The external-database mapping evolves with each Ensembl release. This helper extracts the set of releases encoded in read_yaml_file—no matter how deeply nested—and verifies that ExternalDatabases.ensembl_release is present. Triggering an exception here prevents downstream graph-construction logic from silently operating on incomplete or outdated metadata, prompting users to regenerate or update the YAML file before proceeding.

Parameters:

read_yaml_file (dict) – Dictionary produced by ExternalDatabases.load_modified_yaml() containing the loaded YAML structure.

Raises:

ValueError – If the current Ensembl release is absent from the YAML configuration.
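
A sketch of the check, assuming that releases are stored under an Ensembl release key as a comma-separated string, as in the template example. collect_releases and validate_release are hypothetical names.

```python
from typing import Any, Set


def collect_releases(node: Any) -> Set[int]:
    """Recursively gather every release number found in a nested mapping."""
    found: Set[int] = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ensembl release":
                found.update(int(r) for r in str(value).split(",") if r.strip())
            else:
                found.update(collect_releases(value))
    return found


def validate_release(config: dict, ensembl_release: int) -> None:
    """Raise ValueError when the requested release is absent from the config."""
    if ensembl_release not in collect_releases(config):
        raise ValueError(f"Release {ensembl_release} is missing from the YAML file.")


config = {"homo_sapiens": {"gene": {"ArrayExpress": {"Assembly": {
    "37": {"Ensembl release": "79,80", "Include": False},
    "38": {"Ensembl release": "80,81", "Include": True},
}}}}}

known = collect_releases(config)
validate_release(config, 81)  # passes silently
```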

class TheGraph(*args, **kwargs)[source]

Bases: MultiDiGraph

Represent a bio-identifier multigraph with IDTrack-specific helpers.

The class extends networkx.MultiDiGraph to model historical and cross-reference relationships between Ensembl identifiers (genes, transcripts, translations) and third-party database accessions (UniProt, RefSeq, …). It is built by idtrack._graph_maker.GraphMaker, then queried by idtrack.Track for high-performance path-finding across Ensembl releases and external resources.

Additional cached properties (e.g. rev, combined_edges, and hyperconnective_nodes) collapse expensive aggregate calculations into single attribute look-ups, while helpers such as attach_included_forms() record which biological forms were merged into a particular instance. Together these conveniences allow downstream algorithms to traverse millions of edges without the memory overhead of duplicating graphs or recomputing summaries.

Instantiate the multigraph and configure package logging.

All positional and keyword arguments are forwarded verbatim to networkx.MultiDiGraph, allowing callers to pre-seed the graph with nodes, edges, or name/metadata attributes exactly as they would with a vanilla NetworkX constructor. After delegating to super().__init__, the method initialises two convenience attributes:

  • log - a dedicated logging.Logger named "the_graph" for structured, per-instance diagnostics.

  • available_forms - a placeholder set to None until attach_included_forms() is called by idtrack._graph_maker.GraphMaker.

Parameters:
  • args (Any) – Positional arguments accepted by networkx.MultiDiGraph.__init__().

  • kwargs (Any) – Keyword arguments accepted by networkx.MultiDiGraph.__init__().

_attach_included_forms(available_forms)[source]

Record which Ensembl forms are present in the merged graph.

Graphs for gene, transcript, and protein are first built independently by GraphMaker and then merged into a single TheGraph instance. This helper runs after that merge to store the subset of forms that actually made it into the final graph—information required by several cached properties (e.g. available_external_databases) for consistency checks and downstream analyses. Calling the method before the merge would mis-report available forms and corrupt those caches.

Parameters:

available_forms (list[str]) – Exact list of included forms (typically ["gene", "transcript", "protein"]). Order is preserved so callers can rely on a deterministic iteration sequence.

Return type:

None

static _combined_edges(node_list, the_graph)[source]

Aggregate database/assembly/release metadata for the edges of node_list.

The routine is the work-horse behind the TheGraph.combined_edges family of cached properties. It iterates over every node in node_list, inspects each outgoing (or, when the_graph is a reversed view, incoming) edge, and builds a deterministic description of which external database, genome assembly, and Ensembl release the connection originates from.

Edges that link two nodes of the same node-type are ignored so that backbone history links (gene ↔ gene, transcript ↔ transcript, …) do not pollute the output (as tested in idtrack._track_tests.TrackTest.is_edge_with_same_nts_only_at_backbone_nodes()). For edges whose database key is one of the generic Ensembl forms (ensembl_gene, ensembl_transcript, …) the key is rewritten to the assembly-specific variant (e.g. assembly_38_ensembl_gene) to keep assemblies logically separate in downstream analyses.

Parameters:
  • node_list (NodeView | list[str]) – Nodes whose edge metadata will be consolidated. Accepts either a plain list or the networkx view returned by graph.nodes.

  • the_graph (nx.MultiDiGraph) – Graph to inspect. Pass self for the native orientation or self.rev when a reverse walk is required.

Returns:

Mapping {node: {database: {assembly: set[int]}}} that summarises every admissible edge attached to the requested nodes.

Return type:

dict

static _combined_edges_genes_helper(the_result)[source]

Merge per-neighbour edge metadata for gene-centric queries.

This helper is used exclusively by TheGraph.combined_edges_genes() and TheGraph.combined_edges_assembly_specific_genes() to post-process the dictionaries returned by TheGraph._combined_edges(). Because backbone gene nodes have no outgoing edges except to other gene nodes, the caller invokes TheGraph._combined_edges() on a reversed graph and receives one nested dictionary per neighbour. The present routine

  1. Flattens those per-neighbour sub-dicts so that information from multiple neighbours of the same external database and assembly is unified.

  2. Re-labels the generic ensembl_gene key to the assembly-qualified form assembly_<N>_ensembl_gene so that the provenance of every entry remains explicit and consistent with the rest of the code base.

Parameters:

the_result (dict) – Nested mapping produced by TheGraph._combined_edges() for a single gene node. The structure is {neighbour: {database: {assembly: set[int]}}}.

Returns:

Collapsed mapping {database: {assembly: set[int]}} where all neighbour-level dictionaries have been merged and database names have been renamed to their assembly-specific counterparts when appropriate.

Return type:

dict
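
The flattening and relabelling can be sketched with plain dictionaries. merge_neighbours is a hypothetical name, and the neighbour labels and release numbers in the sample are placeholders.

```python
from typing import Dict, Set

NestedEdges = Dict[str, Dict[int, Set[int]]]  # {database: {assembly: releases}}


def merge_neighbours(
    the_result: Dict[str, NestedEdges], generic_key: str = "ensembl_gene"
) -> NestedEdges:
    """Merge per-neighbour metadata into one {database: {assembly: set}} mapping."""
    merged: NestedEdges = {}
    for per_database in the_result.values():
        for database, per_assembly in per_database.items():
            for assembly, releases in per_assembly.items():
                # Relabel the generic key with its assembly, e.g. assembly_38_ensembl_gene.
                name = (
                    f"assembly_{assembly}_{generic_key}"
                    if database == generic_key
                    else database
                )
                merged.setdefault(name, {}).setdefault(assembly, set()).update(releases)
    return merged


sample = {
    "neighbour_a": {"uniprot": {38: {100, 101}}, "ensembl_gene": {38: {100}}},
    "neighbour_b": {"uniprot": {38: {102}, 37: {100}}},
}
collapsed = merge_neighbours(sample)
```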

_get_active_ranges_of_id(input_id)[source]

Compute Ensembl-release ranges for a single identifier, choosing logic by node type.

This private helper inspects the input_id and dispatches to an internal routine tailored to the node’s role in the graph:

  • _get_active_ranges_of_id_backbone() - deals with backbone nodes that form the primary versioned lineage.

  • _get_active_ranges_of_id_nonbackbone() - handles assembly-specific or auxiliary identifiers recorded in one of the combined-edges lookup tables.

Parameters:

input_id (str) – Identifier whose life-span across Ensembl releases is requested. Must exist in nodes.

Returns:

Ordered, non-overlapping [[start_rel, end_rel], …] where both ends are inclusive.

Return type:

list[list[int]]

_node_trios(the_id)[source]

Compute all origin trios for a single node.

The routine identifies the node-type, chooses the appropriate combined_edges cache, and expands any Ensembl release ranges so that every individual release is represented. Alternative-assembly backbone genes and assembly-specific genes receive special handling to ensure the correct database label is recorded.

Parameters:

the_id (str) – Canonical node name used inside the graph.

Returns:

Unique triples (<database>, <assembly>, <release>) describing every context in which the_id occurs.

Return type:

set[tuple[str, int, int]]
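
The range-expansion step can be illustrated in isolation. The helper expand_trios is hypothetical and covers only the conversion of inclusive release ranges into individual trios; the node-type dispatch and special cases described above are left out.

```python
from typing import List, Set, Tuple


def expand_trios(
    database: str, assembly: int, ranges: List[List[int]]
) -> Set[Tuple[str, int, int]]:
    """Turn inclusive [start, end] release ranges into one trio per release."""
    return {
        (database, assembly, release)
        for start, end in ranges
        for release in range(start, end + 1)  # both ends are inclusive
    }


trios = expand_trios("ensembl_gene", 38, [[100, 102], [105, 105]])
```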

property available_external_databases: set[str]

Return the set of external databases represented in the graph.

This helper inspects every node whose node-type flag matches idtrack._db.DB.nts_external and records the database name attached to the outbound edges. The resulting set is cached so that downstream routines—such as validating user-supplied database names or determining which third-party resources must be fetched—can query the information in O(1) time instead of re-scanning the graph.

Returns:

Unique names of all third-party (non-Ensembl) databases present in the current TheGraph instance.

Return type:

set[str]

property available_external_databases_assembly: dict[int, set[str]]

Return external databases available for each genome assembly.

For every assembly identifier in available_genome_assemblies, this method gathers the subset of external databases that are connected—directly or indirectly—to nodes annotated with that assembly. The per-assembly view is vital when users need to restrict conversions to genomes with consistent annotation coverage (e.g., choosing GRCh38-only resources for a human data set).

Returns:

Mapping from assembly number (for example 37 or 38) to the set of external databases that have at least one entry linked to that assembly.

Return type:

dict[int, set[str]]

property available_genome_assemblies: set[int]

Return the set of genome assemblies represented in the current graph.

The helper scans every identifier edge table cached on the instance (e.g. combined_edges, combined_edges_genes, combined_edges_assembly_specific_genes) and extracts the assembly component of each edge key. It therefore answers the question “Which genome builds does this graph actually know about?” Several public utilities depend on this information when validating user-supplied assembly arguments or iterating across assemblies in reproducible order (see DB.assembly_mysqlport_priority for organism-scoped priorities).

Returns:

Unique genome assembly identifiers (e.g. 38 for human GRCh38, 37 for human GRCh37, 39 for mouse GRCm39) present anywhere in the graph.

Return type:

set[int]

property available_releases_given_database_assembly: dict[tuple[str, int], set]

Map (database, assembly) pairs to the Ensembl releases in which they occur.

This expensive, cached property lets callers quickly answer “Which Ensembl releases contain at least one node from database D on assembly A?” Internally it delegates the per-pair work to the nested available_releases_given_database_assembly._inline_available_releases() helper, then augments the mapping with additional information gleaned from several idtrack.DB look-ups (e.g. DB.nts_assembly, DB.nts_base_ensembl, DB.nts_ensembl). Although heavy, the routine is indispensable for test suites and diagnostic notebooks that must reason about historical coverage across many releases.

Returns:

A dictionary whose keys are (database_name, assembly) tuples and whose values are the sets of Ensembl release numbers in which that pair is represented.

Return type:

dict[tuple[str, int], set[int]]

calculate_caches(for_test=False)[source]

Eagerly materialise every @cached_property to prime the cache.

Accessing a cached property for the first time triggers an expensive computation. Batch-loading all of them up-front improves latency for subsequent graph queries and simplifies unit-test expectations because no additional properties are computed lazily in the background.

The optional for_test flag activates a few heavyweight diagnostics that are normally skipped in production but useful for test suites and profiling.

Parameters:

for_test (bool) – If True, also compute caches that exist solely for testing or sanity-check purposes (e.g. external_database_connection_form). Defaults to False, which warms only the properties required at run-time.

Return type:

None
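
The priming pattern can be illustrated with functools.cached_property on a toy class. The class CacheDemo and its two properties are hypothetical and only show how eager access fills the cache; they do not reflect the properties of TheGraph.

```python
from functools import cached_property
from typing import List


class CacheDemo:
    """Toy class with one run-time cache and one test-only cache."""

    def __init__(self) -> None:
        self.computed: List[str] = []  # records which properties were evaluated

    @cached_property
    def runtime_summary(self) -> int:
        self.computed.append("runtime_summary")
        return sum(range(10))

    @cached_property
    def test_diagnostics(self) -> int:
        self.computed.append("test_diagnostics")
        return len(self.computed)

    def calculate_caches(self, for_test: bool = False) -> None:
        """Touch each cached property once so later look-ups are instant."""
        names = ["runtime_summary"]
        if for_test:
            names.append("test_diagnostics")
        for name in names:
            getattr(self, name)  # first access computes and stores the value


demo = CacheDemo()
demo.calculate_caches()
value = demo.runtime_summary  # served from the cache, not recomputed
```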

property combined_edges: dict

Aggregate outgoing-edge metadata for every non-gene node in the graph.

This cached view pre-computes, for each backbone or external identifier, which external databases, genome assemblies, and Ensembl releases are reachable through outgoing edges—while purposely excluding Ensembl gene and assembly-specific gene nodes. The summary accelerates synonym search and other traversal routines in idtrack.track.Track.pathfinder() because consumers can consult a compact dictionary instead of repeatedly iterating raw NetworkX edges and attributes.

Returns:

Nested mapping of the form {node_name: {database_name: {assembly: set[int]}}}, where

  • node_name (str) - Identifier of the start node whose edges were inspected.

  • database_name (str) - Canonical name of the external database or Ensembl sub-type (e.g. uniprot, refseq_rna, assembly_x_ensembl_gene).

  • assembly (str) - UCSC-style assembly label (e.g. GRCh38); None when the edge is not assembly-scoped.

  • set[int] - Collection of Ensembl release numbers in which the connection is valid.

Return type:

dict

Notes

Edges that link two nodes of the same node-type are ignored, ensuring the dictionary focuses on cross-type relationships that matter for ID translation.

property combined_edges_assembly_specific_genes: dict

Aggregate incoming-edge metadata for assembly-specific Ensembl gene nodes.

Assembly-specific gene identifiers (e.g. GRCh37:ENSG00000123456) represent loci that differ between reference builds. This property mirrors the logic of combined_edges_genes() but targets nodes not captured by that property, ensuring the three cached dictionaries are mutually exclusive and collectively exhaustive. Because each such gene belongs to exactly one assembly, the returned structure always contains a single assembly key per outer node.

Returns:

Mapping {assembly_specific_gene_id: {database_name: {assembly: set[int]}}} where the sole assembly key matches the assembly implied by the node’s own identifier.

Return type:

dict

property combined_edges_genes: dict

Aggregate incoming-edge metadata for Ensembl gene nodes.

Gene nodes only possess incoming edges (toward the gene); therefore the calculation traverses the graph in reverse (self.rev) to collect equivalent information to combined_edges(), but restricted solely to nodes whose idtrack._db.DB.node_type_str is DB.nts_ensembl["gene"]. The result merges edge data from all contributing external databases so that downstream callers receive one consolidated view per gene.

Returns:

Nested mapping {gene_id: {database_name: {assembly: set[int]}}}.

A single gene may appear under multiple assemblies when reference genomes share that transcript locus.

Return type:

dict

static compact_ranges(list_of_ranges)[source]

Collapse adjacent or touching integer ranges into the smallest possible set.

In the IDTrack graph every Ensembl identifier is active for one or more contiguous release intervals. Storing those intervals as [[start, end], …] is convenient but can become redundant when consecutive ranges abut each other. compact_ranges performs a single O(n) forward sweep that merges any pair of ranges where the gap between the end of the first and the start of the next is ≤ 1, returning a new list that covers the exact same discrete releases with the fewest possible intervals. The helper is a cornerstone for many caching utilities (e.g. TheGraph.get_active_ranges_of_id()) and is therefore optimised for speed and minimal allocations.

Parameters:

list_of_ranges (list[list[int]]) – Sorted, non-overlapping, inclusive ranges in the form [[start, end], …]. All numbers must be positive integers and start ≤ end for every range.

Returns:

A new list containing the minimal, non-overlapping, inclusive ranges that exactly cover the union of list_of_ranges.

Return type:

list[list[int]]
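The forward sweep described above can be sketched as follows. This is a minimal stand-alone version written from the documented contract (sorted, inclusive input; merge when the gap is at most one release), not the library's own source.

```python
def compact_ranges(list_of_ranges):
    """Merge sorted, inclusive ranges whose gap is at most one release."""
    merged = []
    for start, end in list_of_ranges:
        if merged and start - merged[-1][1] <= 1:
            # Touching or adjacent: extend the previous range.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Copy the pair so the caller's lists are never mutated.
            merged.append([start, end])
    return merged


print(compact_ranges([[1, 3], [4, 6], [8, 9]]))  # [[1, 6], [8, 9]]
```

Releases 3 and 4 are consecutive, so the first two ranges collapse into one; release 7 is missing, so [8, 9] stays separate.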

property external_database_connection_form: dict[str, str]

Infer which Ensembl identifier form each external database connects to.

External databases link to exactly one “form” of Ensembl identifier—gene, transcript, or translation—determined upstream by idtrack._external_databases.ExternalDatabases. The method walks the neighborhood of every external-database node, tallies the node-type of its Ensembl neighbours, and assigns the majority form. A mis-annotation that connects an external node directly to a non-Ensembl node is interpreted as a schema violation and aborts with ValueError.

Returns:

Dictionary whose keys are external-database names and whose values are one of "gene", "transcript", or "translation", indicating the form of Ensembl ID to which the database links.

Return type:

dict[str, str]

Raises:

ValueError – If any external-database node is found connected to a node that is not an Ensembl identifier, indicating an inconsistent graph state.
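The majority vote described above can be sketched with collections.Counter. The input layout (a mapping from database name to the node-types of its neighbours) and the name connection_form are assumptions made for this example; the library reads neighbours from the graph itself. The sketch also assumes every database has at least one neighbour.

```python
from collections import Counter

ENSEMBL_FORMS = ("gene", "transcript", "translation")


def connection_form(neighbour_types_by_database):
    """Assign each database the Ensembl form that most of its neighbours have."""
    result = {}
    for database, neighbour_types in neighbour_types_by_database.items():
        unexpected = set(neighbour_types) - set(ENSEMBL_FORMS)
        if unexpected:
            # Mirrors the documented schema check: non-Ensembl neighbours abort.
            raise ValueError(f"{database} touches non-Ensembl nodes: {sorted(unexpected)}")
        result[database] = Counter(neighbour_types).most_common(1)[0][0]
    return result


forms = connection_form(
    {
        "uniprot": ["translation", "translation", "transcript"],
        "refseq_rna": ["transcript"],
    }
)
print(forms)  # {'uniprot': 'translation', 'refseq_rna': 'transcript'}
```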

get_active_ranges_of_base_id_alternative(base_id)[source]

Return the Ensembl-release intervals during which a base gene identifier is active.

The routine unifies child-level history into an easy-to-query representation. A base Ensembl ID (e.g. ENSG00000123456) has one or more versioned descendants (ENSG00000123456.1, ENSG00000123456.2, …) whose lifetimes can never overlap. By walking the immediate neighbours of base_id and unioning every child’s get_active_ranges_of_id_ensembl_all_inclusive result, the method derives exactly the releases in which any descendant existed. This summary read-out is used by higher-level diagnostics (for example, range-overlap sanity checks) and by algorithms that need to reason about the birth and retirement of genes at the stable-ID level.

Parameters:

base_id (str) – Stable Ensembl gene identifier without version suffix. The node must have node_type == DB.nts_base_ensembl["gene"] inside the graph.

Returns:

Sorted, non-overlapping [start, end] slices inclusive at both ends. end may be np.inf when the gene is still present in the most recent release.

Return type:

list[list[int]]

property get_active_ranges_of_id: dict[str, list[list]]

Return inclusive Ensembl-release intervals in which every node in the graph is biologically active.

The convenience wrapper iterates over all nodes currently stored in this idtrack.the_graph.TheGraph instance and delegates the heavy lifting to _get_active_ranges_of_id(). The latter performs node-type-specific logic (backbone vs. assembly-specific) to determine contiguous release windows—at no point does this method examine which genome assembly the release originated from, because for downstream tasks (lifecycle analysis, deprecation reports, etc.) only the presence/absence across release numbers matters.

Returns:

Mapping {node_id: [[start_rel, end_rel], ...]} where every inner two-element list is an inclusive range. Ranges are sorted in ascending order and guaranteed not to overlap.

Return type:

dict[str, list[list[int]]]

get_active_ranges_of_id_ensembl_all_inclusive(the_id)[source]

Return the inclusive Ensembl-release ranges during which the_id is active across all assemblies.

This helper generalises get_active_ranges_of_id(), which only reports activity on the graph’s main assembly, by folding in evidence from every other assembly represented in combined_edges_genes. The resulting timeline therefore reflects all times at which the identifier (or any assembly-specific sibling) existed in Ensembl—crucial when downstream analyses must ignore assembly boundaries, e.g. when tracking identifier synonymy across genome builds. After merging, the routine validates that the main-assembly slice remains consistent with the authoritative backbone cache and aborts with a detailed error if divergence is detected.

Parameters:

the_id (str) – Ensembl gene identifier—either backbone (ENSG…) or assembly-qualified (assembly_<code>_ensembl_gene). The node’s DB.node_type_str must be one of DB.nts_ensembl["gene"] or the set in DB.nts_assembly_gene.

Returns:

A list of [start, end] pairs (inclusive, sorted, non-overlapping) covering every Ensembl release in which the_id was present on any assembly.

Return type:

list[list[int]]

Raises:

ValueError – If (1) activity inferred from combined_edges_genes disagrees with get_active_ranges_of_id() for the main assembly, or (2) the_id is not a recognised Ensembl-gene node type.

get_external_database_nodes(database_name)[source]

Collect identifiers that appear at least once in the specified external database.

The graph stores one node per identifier and attaches metadata—such as its origin database—to each node via self.nodes[node_name]. This helper filters that dictionary, returning every node whose metadata marks it as an external identifier belonging to database_name. The result is often fed into downstream integrity checks or exported so that analysts can cross-reference original accession lists.

Parameters:

database_name (str) – Name of the external resource (e.g. "UniProtKB"). Must be one of the values returned by TheGraph.available_external_databases().

Returns:

All unique node names (accessions) associated with database_name.

Return type:

set[str]

get_id_list(database, assembly, release)[source]

Return node identifiers for a specific (database, assembly, release) slice of the multigraph.

This helper exists primarily for unit-testing and exploratory analysis. Internally, the graph stores node metadata in the memory-intensive node_trios cache, keyed by a triple (database_or_node_type, assembly, release). get_id_list() hides that complexity, walking the full node set and extracting only those identifiers whose tuple key matches the requested slice. Because the traversal touches every node, the method is slow and scales poorly compared with the vectorised access paths used in production code. It is therefore not called in performance-critical workflows; its main purpose is to generate deterministic ground-truth lists that test-suites can compare against.

The method also reproduces legacy Ensembl behaviour: when database resolves to the canonical Ensembl gene node type on the primary assembly, identifiers whose Version attribute is one of idtrack._db.DB.alternative_versions are still included, ensuring that versioned and unversioned IDs appear together—exactly as they do in public Ensembl MySQL dumps.

Parameters:
  • database (str) – External database name for external nodes (e.g. "uniprot", "refseq") or an Ensembl node-type label such as "gene", "transcript", or "translation". Ensembl labels must match the keys defined in idtrack._db.DB.nts_ensembl.

  • assembly (int) – Genome assembly identifier (e.g. 38 for human GRCh38) that must be present in available_genome_assemblies().

  • release (int) – Ensembl release number (e.g. 111) corresponding to the graph snapshot of interest.

Returns:

A list of unique node names (identifiers) in insertion order that belong to the requested (database, assembly, release) tuple. The list may include versioned Ensembl genes as noted above.

Return type:

list[str]

Notes

The helper performs a linear scan over networkx.MultiDiGraph.nodes, so its runtime is O(|V|) and memory footprint equals that of node_trios. Prefer dedicated graph queries for production workloads and reserve this method for tests or ad-hoc inspection.
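The linear scan can be pictured with a plain dictionary standing in for the node_trios cache. The node names and trios below are invented, and the sketch leaves out the special handling of alternative Ensembl versions described above.

```python
def get_id_list(node_trios, database, assembly, release):
    """Return nodes whose trio set contains (database, assembly, release)."""
    wanted = (database, assembly, release)
    # One pass over every node, hence O(|V|); dict order gives insertion order.
    return [node for node, trios in node_trios.items() if wanted in trios]


# Hypothetical stand-in for the node_trios cache.
node_trios = {
    "example_gene.1": {("gene", 38, 110), ("gene", 38, 111)},
    "example_gene.2": {("gene", 38, 111)},
    "example_accession": {("uniprot", 38, 111)},
}
print(get_id_list(node_trios, "gene", 38, 111))  # ['example_gene.1', 'example_gene.2']
```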

static get_intersecting_ranges(lor1, lor2, compact=True)[source]

Return the set of releases common to both input range lists.

The routine computes the pairwise intersection between every range in lor1 and every range in lor2, yielding a list of ranges where the two original lists overlap. Optionally the result may be passed through TheGraph.compact_ranges() to merge adjacent slices and guarantee a minimal representation. Because the helper is frequently used inside path-finding algorithms it trades clarity for raw performance and therefore assumes both inputs are already sorted, non-overlapping, and inclusive as produced elsewhere in the library.

Parameters:
  • lor1 (list[list[int]]) – First list of inclusive, ascending, non-overlapping ranges.

  • lor2 (list[list[int]]) – Second list of ranges with the same invariants as lor1.

  • compact (bool) – When True (default) the raw intersections are passed to TheGraph.compact_ranges() before being returned.

Returns:

Inclusive integer ranges where lor1 and lor2 overlap. The list is empty when no overlap exists.

Return type:

list[list[int]]
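A minimal sketch of the pairwise intersection, assuming inclusive [start, end] pairs as documented. The compaction step is left out here, so the output corresponds to calling the method with compact=False; the name intersecting_ranges is chosen for this example.

```python
def intersecting_ranges(lor1, lor2):
    """Intersect every range in lor1 with every range in lor2."""
    overlaps = []
    for start1, end1 in lor1:
        for start2, end2 in lor2:
            low, high = max(start1, start2), min(end1, end2)
            if low <= high:
                # Inclusive bounds: a single shared release still counts.
                overlaps.append([low, high])
    return overlaps


print(intersecting_ranges([[1, 5], [8, 12]], [[4, 9]]))  # [[4, 5], [8, 9]]
```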

get_next_edge_releases(from_id, reverse)[source]

List the Ensembl releases reachable by the next (or previous) edges from from_id.

The method scans the immediate neighbourhood of a backbone gene node and extracts the release numbers that mark either the next chronological transition (reverse = False) or the previous one (reverse = True). It respects graph directionality, skips non-backbone connections, collapses duplicate multi-edges, and treats infinite self-loops as “still active” when stepping forward in time. The result is a de-duplicated, easy-to-use list that higher-level path-finding algorithms can feed directly into release-oriented traversals.

Parameters:
  • from_id (str) – Ensembl gene identifier that must belong to the backbone (DB.external_search_settings["nts_backbone"]).

  • reverse (bool) – If False return forward (old → new) transition releases; if True return backward (new → old) releases.

Returns:

Sorted list of unique Ensembl release numbers adjacent to from_id in the chosen temporal direction.

Return type:

list[int]

Raises:

ValueError – If from_id is not a backbone node—i.e. its node_type does not match DB.external_search_settings["nts_backbone"].

get_two_nodes_coinciding_releases(id1, id2, compact=True)[source]

Determine releases in which both graph nodes are simultaneously active.

Graph nodes (Ensembl genes, transcripts, proteins, or external IDs) exist only for defined release intervals. When integrating annotations it is often necessary to know the time span where two nodes co-exist—for example, when building an orthogonal mapping table or validating edge chronology. The method retrieves each node’s active ranges via TheGraph.get_active_ranges_of_id(), computes their intersection with TheGraph.get_intersecting_ranges(), and optionally compacts the result. The returned list therefore represents every Ensembl release in which id1 and id2 are valid simultaneously.

Parameters:
  • id1 (str) – Identifier of the first node (must exist in self.nodes).

  • id2 (str) – Identifier of the second node (must exist in self.nodes).

  • compact (bool) – Forwarded to TheGraph.get_intersecting_ranges(). When True (default) the final ranges are minimised; when False the raw intersections are returned.

Returns:

Inclusive release intervals [[start, end], …] where id1 and id2 overlap. The list is empty if the nodes never co-occur.

Return type:

list[list[int]]

property hyperconnective_nodes: dict[str, int]

Return hyper-connective external nodes and their out-degree counts.

Hyper-connective nodes are external identifiers whose out-degree (number of outgoing edges) exceeds idtrack._db.DB.hyperconnecting_threshold. Because such nodes may participate in tens of thousands of mappings, they explode the breadth-first frontier of the synonym pathfinder algorithm and become a major performance bottleneck. The algorithm therefore ignores these nodes, sacrificing a small amount of theoretical precision for a substantial speed-up.

In practice the precision penalty is negligible: hyper-connective nodes tend to be coarse-grained identifiers that already suffer from low mapping specificity (for example, generic protein or transcript accessions re-used across many unrelated biological entities). Meaningful, one-to-one synonym relationships are almost always reachable through alternative external identifiers. Consequently, ignoring hyper-connective nodes both accelerates the search and often improves the overall relevance of the results.

The value is computed lazily on first access and memoised via functools.cached_property(), so the underlying query runs at most once per TheGraph instance.

Returns:

Mapping from external node identifier to its out-degree, limited to nodes whose out-degree is greater than idtrack._db.DB.hyperconnecting_threshold and whose idtrack._db.DB.node_type_str equals idtrack._db.DB.nts_external.

Return type:

dict[str, int]
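The filter itself is simple. The sketch below applies it to a plain adjacency dictionary instead of the NetworkX graph; the node names, the "external" type label, and the threshold value are placeholders, not the library's actual constants.

```python
def hyperconnective_nodes(out_edges, node_types, threshold):
    """Return external nodes whose out-degree exceeds the threshold."""
    return {
        node: len(targets)
        for node, targets in out_edges.items()
        if node_types[node] == "external" and len(targets) > threshold
    }


# Hypothetical toy graph: node -> list of outgoing edge targets.
out_edges = {
    "broad_accession": ["t1", "t2", "t3", "t4"],
    "specific_accession": ["t1"],
    "t1": [],
}
node_types = {
    "broad_accession": "external",
    "specific_accession": "external",
    "t1": "transcript",
}
print(hyperconnective_nodes(out_edges, node_types, threshold=3))  # {'broad_accession': 4}
```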

static is_point_in_range(lor, p)[source]

Check whether a single integer lies inside any range in lor.

The helper performs a linear scan over lor (assumed sorted and non-overlapping) and returns as soon as p falls between a [start, end] pair. It is intentionally lightweight because it is called inside tight loops that filter large identifier sets by Ensembl release.

Parameters:
  • lor (list[list[int]]) – Inclusive, ascending, non-overlapping ranges against which p is tested.

  • p (int) – The release number to evaluate.

Returns:

True when p is covered by at least one range in lor; False otherwise.

Return type:

bool

static list_to_ranges(lst)[source]

Compact a sorted list of releases into minimal inclusive ranges.

The helper converts monotonically increasing, duplicate-free release numbers into a run-length representation (e.g. [1, 2, 3, 5] → [[1, 3], [5, 5]]). It is the logical inverse of TheGraph.ranges_to_list() and is frequently used to post-process the raw release sets collected from edge metadata.

Parameters:

lst (list[int]) – Releases strictly increasing with no repetitions. Supplying an unsorted or duplicate-containing list leads to undefined behaviour.

Returns:

Non-overlapping [start, end] intervals covering exactly the input elements. Each inner list is inclusive; singleton releases become [r, r].

Return type:

list[list[int]]
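A small sketch of the run-length conversion, assuming strictly increasing input as the parameter description requires. It is written from the documented behaviour rather than copied from the library.

```python
def list_to_ranges(lst):
    """Group consecutive releases into inclusive [start, end] pairs."""
    ranges = []
    for release in lst:
        if ranges and release == ranges[-1][1] + 1:
            # Consecutive with the current run: push its end forward.
            ranges[-1][1] = release
        else:
            # Gap (or first element): start a new singleton run.
            ranges.append([release, release])
    return ranges


print(list_to_ranges([1, 2, 3, 5]))  # [[1, 3], [5, 5]]
```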

property lower_chars_graph: dict[str, str]

Map lowercase node identifiers to their canonical graph node names.

The cached mapping enables case-insensitive queries against the graph by translating a lowercase version of every node into the exact identifier stored in self.nodes. ID-resolution helpers such as node_name_alternatives() rely on this cache to recover the intended node even when callers supply mixed-case or lowercase strings.

During construction the method iterates once over all nodes, lowers each identifier, and asserts that no two distinct nodes collide after lower-casing. The result is memoised via functools.cached_property, so subsequent accesses are O(1).

Returns:

{lowercase_id: original_id} giving a one-to-one mapping from lowercase node identifiers to the exact strings used in the graph.

Return type:

dict[str, str]

Raises:

ValueError – If two or more nodes become identical after converting to lowercase, indicating ambiguous casing in the underlying graph.
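The construction and its collision check can be sketched as below. The function name build_lowercase_index is hypothetical, and the node names are examples rather than real graph contents.

```python
def build_lowercase_index(node_names):
    """Map each lowercased node name to its original spelling."""
    index = {}
    for name in node_names:
        key = name.lower()
        if key in index and index[key] != name:
            # Two distinct nodes differ only by case, so lookups would be ambiguous.
            raise ValueError(f"case collision between {index[key]!r} and {name!r}")
        index[key] = name
    return index


index = build_lowercase_index(["ACTB", "ENSG00000123456.2"])
print(index["actb"])  # ACTB
```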

node_name_alternatives(identifier)[source]

Resolve a raw query identifier to the exact graph node label that ID-Track expects.

The routine shields downstream path-finding code from the myriad ways users may spell or format biological identifiers. It walks through a well-defined priority list (direct lookup, case-blind match, version-suffix trimming, and dash/underscore substitutions) before finally retrying the whole sequence with the synonym: prefix used by synonym_id_nodes_prefix. This makes interactive exploration tolerant to typos such as lower-case gene symbols (actb → ACTB) or versioned Ensembl IDs written with underscores (ENSG00000123456_2 → ENSG00000123456.2).

Parameters:

identifier (str) – Raw identifier supplied by the caller. May be an Ensembl ID, external database key, or any variant handled by the heuristics described above.

Returns:

  • The canonical node label or None when no match is possible.

  • True when identifier had to be modified (case change, suffix strip, etc.); False when an exact graph hit was found.

Return type:

tuple[Optional[str], bool]

Notes

Internally this is a thin wrapper that delegates the heavy lifting to the private _node_name_alternatives() helper, then retries once with the synonym prefix if the first pass fails. The helper itself is further decomposed into specialised sub-functions—see their individual docstrings for details.
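The priority list and the single retry with a prefix might look roughly like the sketch below. Everything here is an assumption made for illustration: the function name resolve, the exact substitutions tried, the "synonym:" prefix value, and the False flag returned when nothing matches. The real heuristics live in _node_name_alternatives() and may differ.

```python
def resolve(identifier, nodes, prefix="synonym:"):
    """Try progressively looser spellings of identifier against the node set."""
    lower_index = {name.lower(): name for name in nodes}

    def candidates(query):
        yield query                                # 1. direct lookup
        yield lower_index.get(query.lower())       # 2. case-blind match
        yield query.rsplit(".", 1)[0]              # 3. trim a version suffix
        for old, new in (("_", "."), ("-", "_"), ("_", "-")):
            yield query.replace(old, new)          # 4. separator substitutions

    # First pass with the raw identifier, second pass with the assumed prefix.
    for attempt in (identifier, prefix + identifier):
        for candidate in candidates(attempt):
            if candidate in nodes:
                return candidate, candidate != identifier
    return None, False


nodes = {"ACTB", "ENSG00000123456.2", "synonym:TP53"}
print(resolve("actb", nodes))               # ('ACTB', True)
print(resolve("ENSG00000123456_2", nodes))  # ('ENSG00000123456.2', True)
```

Ordering the attempts from strictest to loosest means an exact hit is never shadowed by a fuzzy one, which is why the boolean flag can reliably report whether the input was altered.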

property node_trios: dict[str, set[tuple]]

Return a full node → trio-set cache.

Builds the complete mapping once and stores it as a functools.cached_property. The mapping is memory-heavy but accelerates downstream helpers that repeatedly need the (<database>, <assembly>, <release>) origin of many nodes.

Returns:

Node identifier → the set of unique (database, assembly, release) combinations in which that node is active.

Return type:

dict[str, set[tuple[str, int, int]]]

Notes

The builder simply iterates self.nodes and delegates the per-node logic to _node_trios(). Expect a multi-second start-up on large graphs.

ranges_to_list(lor)[source]

Explode inclusive ranges back into a sorted list of releases.

This is the inverse of TheGraph.list_to_ranges(). Each [start, end] slice is expanded inclusive of both boundaries; if end is np.inf the interval is closed with max(self.graph["confident_for_release"]) so that downstream numeric operations continue to work on finite integers. The union of all expanded ranges is returned in ascending order without duplicates.

Parameters:

lor (list[list[int | float]]) – List of inclusive, non-overlapping [start, end] pairs. start must be > 0; end may be np.inf to denote open-ended activity.

Returns:

Strictly increasing sequence of releases represented by lor.

Return type:

list[int]
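The expansion, including the treatment of an open-ended range, can be sketched as follows. The latest_release argument is a stand-in for max(self.graph["confident_for_release"]), which the real method reads from the graph; math.inf is used in place of np.inf, and the two compare equal.

```python
import math


def ranges_to_list(lor, latest_release):
    """Expand inclusive ranges, closing an infinite end at latest_release."""
    releases = []
    for start, end in lor:
        stop = latest_release if math.isinf(end) else int(end)
        releases.extend(range(start, stop + 1))
    return releases


print(ranges_to_list([[1, 3], [5, math.inf]], latest_release=7))  # [1, 2, 3, 5, 6, 7]
```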

property rev: TheGraph

Return a view of the same graph with all edge directions reversed.

The call delegates to networkx.MultiDiGraph.reverse() with copy=False, meaning the returned object re-uses the underlying data structures and therefore consumes no additional memory. Use this property whenever a temporal walk must proceed backwards in history (e.g. when resolving identifiers from a newer to an older Ensembl release).

Returns:

A non-copying, lazily constructed reverse-orientation view that honours every node and edge attribute of the original graph.

Return type:

TheGraph

class GraphMaker(db_manager)[source]

Bases: object

Creates ID history graph.

It includes Ensembl gene ID history. Ensembl ID history is obtained from Ensembl resources and shows the connections between different Ensembl base IDs or different versions of the same Ensembl base ID. Ensembl transcripts (with base IDs and versions) are connected to genes, and Ensembl proteins are connected to transcripts. Additionally, a selected set of external databases is connected to the related Ensembl IDs: for example, UniProt IDs are associated with proteins, while RefSeq transcript IDs are associated with transcripts. The GraphMaker class also saves the resulting graph into the defined temporary directory for later calculations.

Class initialization.

Parameters:

db_manager (DatabaseManager) – Needed to download all necessary tables and data frames. It contains the temporary directory to save the resultant graph.

Raises:

ValueError – GraphMaker has to be created with the latest release possible in db_manager. If not, the exception is raised.

construct_graph(narrow=False, form_list=None, narrow_external=True)[source]

Main method to construct the graph.

It creates the graph with Ensembl gene, transcript and protein information. It also adds DB.nts_base_ensembl[f] nodes to the graph, which carry only the base Ensembl gene ID (no version). External database entries described in ExternalDatabases will be part of the graph. Normally, the user is not expected to call this method directly, as it is invoked by the get_graph method.

Parameters:
  • narrow (bool) – Determine whether some additional information should be added between Ensembl gene IDs, for example which genome assembly is used or when the connection was established. For usual uses, there is no need to set it to True.

  • form_list (list | None) – Determine which forms (transcript, translation, gene) should be included. If None, include all possible forms defined in DatabaseManager. It has to be a list composed of the following strings: ‘gene’, ‘transcript’, ‘translation’.

  • narrow_external (bool) – If set to False, all possible external databases defined in the Ensembl MySQL server will be included in the graph. The graph will be immensely larger, and the ID history travel calculation will be very slow. Additionally, the success of ID conversion under such a setting has not been tested yet.

Returns:

Resultant multiedge directed graph.

Raises:

ValueError – Unexpected error.

Return type:

TheGraph

construct_graph_form(narrow, db_manager)[source]

Creates a graph with connected nodes based on historical relationships between Ensembl IDs.

Parameters:
  • narrow (bool) – See parameter in Graph.construct_graph.narrow

  • db_manager (DatabaseManager) – The method reads ID history dataframe, and Ensembl IDs lists at each Ensembl release, provided by DatabaseManager.

Returns:

Resultant multi edge directed graph.

Raises:

ValueError – Unexpected error.

Return type:

TheGraph

create_file_name(narrow, form_list=None, narrow_external=True)[source]

File name creator which includes some information regarding the construction process.

This makes it easy to recognize a graph from its file name.

Parameters:
  • narrow (bool) – See parameter in Graph.construct_graph.narrow

  • form_list (list[str] | None)

  • narrow_external (bool)

Returns:

Absolute file path in the temporary directory provided by DatabaseManager.

Return type:

str

export_disk(g, file_path, overwrite)[source]

Write the pickle file in the provided file path, which contains the graph.

Parameters:
  • g (TheGraph) – Multi edge directed graph object to store on disk.

  • file_path (str) – Absolute target path, provided by Graph.create_file_name()

  • overwrite (bool) – See parameter in Graph.get_graph.overwrite_even_if_exist

get_graph(narrow=True, create_even_if_exist=False, save_after_calculation=True, overwrite_even_if_exist=False, *, form_list=None, narrow_external=True)[source]

Simplifies the graph construction process.

Parameters:
  • narrow (bool) – See parameter in Graph.construct_graph.narrow

  • create_even_if_exist (bool) – Determine whether to create the graph even if it already exists. If there is no graph in the provided temporary directory, the graph will be created regardless.

  • save_after_calculation (bool) – Determine whether the resultant graph will be saved or not.

  • overwrite_even_if_exist (bool) – If the graph will be saved, determine whether the program should overwrite an existing file. If False, it does not re-save the calculated (or loaded) graph.

  • form_list (list[str] | None)

  • narrow_external (bool)

Returns:

Resultant multi edge directed graph, which can be used in all future calculations.

Return type:

TheGraph
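One way to read the interaction of the three caching flags is the load-or-build pattern below. It is a sketch under that reading, using pickle and a throwaway dictionary in place of a graph; the helper name load_or_build and the build callback are hypothetical, and the real method may order its checks differently.

```python
import os
import pickle
import tempfile


def load_or_build(path, build, create_even_if_exist=False,
                  save_after_calculation=True, overwrite_even_if_exist=False):
    """Load a pickled object if present, otherwise build it, then maybe save."""
    existed = os.path.isfile(path)
    if existed and not create_even_if_exist:
        with open(path, "rb") as handle:
            graph = pickle.load(handle)
    else:
        graph = build()
    # Save when nothing is on disk yet, or when overwriting was requested.
    if save_after_calculation and (overwrite_even_if_exist or not existed):
        with open(path, "wb") as handle:
            pickle.dump(graph, handle)
    return graph


build_calls = []


def build():
    """Stand-in for the expensive graph construction."""
    build_calls.append(1)
    return {"build_number": len(build_calls)}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "graph.pickle")
    first = load_or_build(path, build)      # builds and saves
    second = load_or_build(path, build)     # loads the saved copy
    forced = load_or_build(path, build, create_even_if_exist=True)  # rebuilds, keeps old file
    reloaded = load_or_build(path, build)   # still the first saved copy
```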

initialize_downloads()[source]

Initialize the external database downloads.

Raises:

NotImplementedError – Not implemented yet. Currently, the necessary data sources are downloaded when needed during the graph construction process.

static read_exported(file_path)[source]

Read the pickle file in the provided file path, which contains the graph.

Parameters:

file_path (str) – Absolute path of the file of interest.

Returns:

Resultant multi edge directed graph.

Raises:

FileNotFoundError – When there is no file in the provided directory.

Return type:

TheGraph

static remove_non_gene_trees(graph, forms_remove=None)[source]

Removes the edges between the nodes with the same node type and removes abstract nodes (Void and Retired).

The edges between two nodes with the same DB.node_type_str will be removed. The nodes with versions DB.no_new_node_id and DB.no_old_node_id will also be removed.

Parameters:
  • graph (TheGraph) – The output of Graph.construct_graph or Graph.construct_graph_form.

  • forms_remove (list | None) – Determine which node types are of interest.

Returns:

Resultant multi edge directed graph.

Return type:

TheGraph

static split_id(id_to_split, which_part)[source]

Simpler method to retrieve ID or Version part of a node name.

Parameters:
  • id_to_split (str) – Query node name.

  • which_part (str) – Either ‘Version’ or ‘ID’.

Returns:

The requested substring of the node name.

Raises:

ValueError – If ‘which_part’ is assigned to some other value than ‘Version’, or ‘ID’.

Return type:

str | float

update_graph_with_the_new_release()[source]

When a new release arrives, just add the new nodes.

Raises:

NotImplementedError – Not implemented yet. Currently, the user is expected to recreate the whole graph using the get_graph method. Note that not all databases need to be re-downloaded; the program will only download the new release and reconstruct the graph.

class VerifyOrganism(organism_query)[source]

Bases: object

Resolve a tentative organism identifier to the formal Ensembl species name and its latest supported release.

The class shields end-users from the quirks of the Ensembl REST payload by converting any synonym—common name, scientific name, assembly accession, or NCBI taxon ID—into the canonical Ensembl species identifier (e.g. homo_sapiens) and the newest Ensembl release that still hosts that species. Because the mapping is refreshed on every instantiation through a live call to the Ensembl REST API, downstream workflows in idtrack always rely on up-to-date metadata rather than a possibly stale local cache.

After construction the instance offers two high-level helpers:

>>> resolver = VerifyOrganism("human")
>>> resolver.get_formal_name()      # 'homo_sapiens'
>>> resolver.get_latest_release()   # 117  (example)

Both helpers are backed by two public dataframes created during initialisation:

  • name_synonyms_dataframe — maps every synonym returned by the REST service to the chosen formal name and flags synonyms that are ambiguous across species.

  • ensembl_release_dataframe — one-row table (indexed by formal_name) holding the latest Ensembl release number.

Initialise the resolver and pre-fetch synonym/release tables from the Ensembl REST API.

The constructor immediately invokes fetch_organism_and_latest_release(), downloading the complete species list from {DB.rest_server_api}{DB.rest_server_ext} so that all subsequent look-ups run entirely in-memory. Any exceptions raised during that fetch are allowed to propagate unchanged so that callers can handle network or data-quality issues explicitly.

Parameters:

organism_query (str) – Organism identifier supplied by the user—common name ("human"), shorthand ("hsapiens"), taxon ID (9606), or fully qualified Ensembl species name ("homo_sapiens"). The value is converted to lower case before processing.

fetch_organism_and_latest_release(connect_timeout, read_timeout)[source]

Query the Ensembl REST API once and build lookup tables for species synonyms and latest releases.

This internal utility performs a single call to /info/species on the Ensembl REST server, parses the returned JSON, and constructs two pandas dataframes:

  • name_synonyms_df - one row per synonym, with columns synonym, formal_name and ambiguous (True if the synonym belongs to more than one species).

  • latest_ensembl_releases_df - indexed by formal_name and holding a single ensembl_release integer column.

Consolidating the REST query in one place avoids repeated network traffic and provides a cache-friendly structure for subsequent lookups.

Parameters:
  • connect_timeout (int) – Seconds to wait while establishing the TCP connection to the Ensembl server.

  • read_timeout (int) – Seconds to wait for the server to send the full response after the connection has been established.

Returns:

(name_synonyms_df, latest_ensembl_releases_df) as described above.

Return type:

tuple[pandas.DataFrame, pandas.DataFrame]

Raises:
  • TimeoutError – If the combined (connect_timeout, read_timeout) limit is exceeded.

  • ValueError – If the JSON schema differs from the expected {"species": [...]} structure or required keys are missing.
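To make the two tables concrete, the sketch below builds them from an already-decoded payload using plain Python containers instead of pandas. The payload is hand-written; its field names (name, common_name, taxon_id, aliases, release) are assumed to follow the /info/species response, and the release numbers are illustrative only. No network call is made.

```python
from collections import defaultdict


def build_tables(payload):
    """Return (synonym rows, release lookup) from a {'species': [...]} payload."""
    if not isinstance(payload, dict) or "species" not in payload:
        raise ValueError("expected a payload of the form {'species': [...]}")
    owners = defaultdict(set)  # synonym -> formal names that claim it
    releases = {}              # formal name -> release number
    for entry in payload["species"]:
        formal = entry["name"]
        releases[formal] = int(entry["release"])
        raw = [formal, entry.get("common_name"), entry.get("taxon_id"),
               *entry.get("aliases", [])]
        for synonym in raw:
            if synonym is not None:
                owners[str(synonym).lower()].add(formal)
    # One row per (synonym, formal_name); ambiguous when several species share it.
    rows = [
        (synonym, formal, len(formals) > 1)
        for synonym, formals in sorted(owners.items())
        for formal in sorted(formals)
    ]
    return rows, releases


# Hand-written sample payload; values are illustrative.
payload = {
    "species": [
        {"name": "homo_sapiens", "common_name": "human", "taxon_id": "9606",
         "aliases": ["hsapiens"], "release": 112},
        {"name": "mus_musculus", "common_name": "house mouse", "taxon_id": "10090",
         "aliases": ["mmusculus"], "release": 112},
    ]
}
rows, releases = build_tables(payload)
print(releases["homo_sapiens"])
```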

get_formal_name()[source]

Resolve the user’s organism query to the canonical Ensembl species name.

The method performs an exact match against the synonym column of name_synonyms_dataframe, which was pre-populated from the Ensembl REST species endpoint. Synonyms include scientific names, common names, NCBI TaxIDs, assembly accessions and other aliases, allowing flexible user input while guaranteeing that only one formally recognised organism is selected before any expensive data retrieval begins.

Returns:

The canonical Ensembl species identifier (always lower-case, e.g. "homo_sapiens").

Return type:

str

Raises:
  • KeyError – If the query string does not match any synonym in the dataframe.

  • ValueError – If the query matches more than one formal name, indicating an ambiguous synonym.
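The exact-match lookup and its two failure modes can be sketched with a dictionary in place of the synonym dataframe. The synonym_owners mapping and its entries are invented for this example.

```python
def get_formal_name(query, synonym_owners):
    """Resolve a synonym to exactly one formal species name."""
    matches = synonym_owners.get(query.lower(), set())
    if not matches:
        raise KeyError(f"unknown organism: {query!r}")
    if len(matches) > 1:
        raise ValueError(f"ambiguous organism {query!r}: {sorted(matches)}")
    return next(iter(matches))


# Hypothetical synonym table: synonym -> formal names that claim it.
synonym_owners = {
    "human": {"homo_sapiens"},
    "9606": {"homo_sapiens"},
    "mouse": {"mus_musculus"},
    "shared_alias": {"homo_sapiens", "mus_musculus"},
}
print(get_formal_name("Human", synonym_owners))  # homo_sapiens
```

Failing early on an ambiguous synonym keeps the error close to the user's input, before any downloads start.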

get_latest_release()[source]

Return the latest Ensembl release number associated with the queried organism.

This helper calls get_formal_name() to resolve the user-supplied organism query to the canonical Ensembl species name, then looks up that key in the dataframe prepared at instantiation time. Down-stream routines (e.g. database connectors, file download helpers) rely on this value to decide which Ensembl release to fetch, ensuring the entire pipeline stays on a single, internally consistent genome build.

Returns:

The most recent Ensembl release available for the resolved organism.

Return type:

int

class Track(db_manager, **kwargs)[source]

Bases: object

Bidirectional path-finding resolver for biological identifiers.

Track builds and queries a bio-ID multigraph that stitches together Ensembl history edges (genes, transcripts, proteins) and cross-reference edges to external databases (UniProt, RefSeq, …). Given a source identifier, a target Ensembl release, and/or a target database, the class:

  1. Normalises the source to an Ensembl gene node when necessary.

  2. Time-travels through historical edges—forward or backward—until it reaches the requested release, optionally “beaming-up” through external IDs when the backbone is disconnected.

  3. Converts the resolved Ensembl gene into the requested external database (or returns the gene itself) while annotating the result with confidence scores and the full traversal path.

Two mutually-recursive engines power the search:

  • _recursive_function — depth-first search along temporal edges.

  • _recursive_synonymous — search for synonymous nodes at a single release to enable the external “beam-up”.

graph

The pre-computed bio-ID graph produced by _graph_maker.GraphMaker.

Type:

networkx.MultiDiGraph

version_info

Metadata about the graph build (Ensembl releases included, build date, Git commit, etc.).

Type:

dict

_external_entrance_placeholder

Sentinel node IDs that mark artificial edges used when an external ID is pulled onto the Ensembl backbone (False → -1, True → 10001).

Type:

dict[bool, int]

_external_entrance_placeholders

Sorted list of the sentinel values above.

Type:

list[int]

Create a Track resolver and load (or build) its graph.

Parameters:
  • db_manager (DatabaseManager) – Connection manager that knows how to fetch Ensembl and cross-reference tables from a local cache or a live MySQL mirror. The same instance is forwarded to _graph_maker.GraphMaker.

  • kwargs – Additional keyword arguments forwarded verbatim to _graph_maker.GraphMaker.get_graph(). Common flags include force_rebuild (recompute the graph from scratch), species (restrict to one taxon), and cache_dir (override on-disk cache location).

property _calculate_node_scores_helper

Build and cache helper look-ups for node-scoring.

The property constructs two complementary data structures:

  • filter_set - the union of (1) every external-database node-type present in the graph, and (2) every Ensembl-specific node-type (gene, transcript, translation, …) across all assemblies. This set can therefore be passed unmodified to synonymous_nodes() to ask for “anything that is not an assembly-less backbone gene”.

  • ensembl_include - a mapping {form → set(node_type_str)} where each value lists the node-types that should be considered equivalent to that form (e.g. gene, transcript, translation) when computing richness metrics.

Returns:

(filter_set, ensembl_include) exactly as described above.

Return type:

tuple[set[str], dict[str, set[str]]]

_choose_relevant_synonym_helper(from_id, synonym_ids, to_release, from_release, mode)[source]

Select the most temporally relevant synonym(s) for an Ensembl gene-ID family.

The method evaluates each candidate in synonym_ids against the target release to_release and, when applicable, the source release from_release. Its job is to decide where the path should enter the Ensembl backbone and whether the remainder of the traversal must run in reverse (new → old) order.

Selection strategy

  1. Fixed from_release - If the caller already knows the release of the starting node, every candidate is paired with that same release and the correct reverse flag is derived trivially.

  2. Non-backbone start - When the starting node is not an Ensembl-gene backbone ID, the synonym whose active range edge is closest (or farthest, per mode) to to_release is chosen.

  3. Backbone start - If the query is itself an Ensembl-gene, the algorithm first looks for overlapping ranges between the query and each synonym; if none overlap, it falls back to the distance rule described in step 2.

Parameters:
  • from_id (str) – Identifier from which the path search will start.

  • synonym_ids (Sequence[str]) – Ensembl IDs considered synonyms of from_id (typically the same gene with different version numbers).

  • to_release (int) – Target Ensembl release that the overall conversion aims for.

  • from_release (int | None) – Release in which from_id is known to be active. If None, the method infers a suitable release for each candidate.

  • mode (str) – Either ‘closest’ or ‘distant’; controls whether the synonym chosen should minimise or maximise its distance to to_release.

Returns:

One or more triplets of the form [synonym_id, entry_release, reverse] where:

  • synonym_id - the chosen synonym,

  • entry_release - release at which to join the backbone, and

  • reverse - True if the subsequent history walk must run backwards in time.

Return type:

list[list[Union[str, int, bool]]]

Raises:

ValueError – If no synonym satisfies the distance/overlap criteria or if mode is invalid.

_create_priority_list_ensembl(from_id, to_release)[source]

Build a priority list of assemblies in which from_id is active.

The priorities are the numeric assembly rankings defined in DB.assembly_mysqlport_priority (smaller numbers mean higher priority).

Parameters:
  • from_id (str) – Ensembl gene identifier.

  • to_release (int) – Target Ensembl release; only assemblies that contain this release are considered.

Returns:

Sorted list of priority values (ascending).

Return type:

list[int]

_ensure_assembly_priority_cache()[source]

Ensure per-graph assembly-priority caches exist.

Some test fixtures construct Track via Track.__new__() (bypassing __init__). Keep the conversion code robust by lazily initialising the per-graph assembly-priority mapping.

Return type:

None

_final_conversion(converted, cnvt, final_database, ens_release, return_path, return_ensembl_alternative, prevent_assembly_jumps=True, account_for_hyperconnected_nodes=False)[source]

Convert an Ensembl gene node to the requested external database and merge the result back into converted.

The routine:

  1. Builds every legal synonym path from cnvt to final_database that is active in ens_release (or in any release as a fallback).

  2. Computes assembly-jump penalties for each path.

  3. Calls _final_conversion_dict_prepare() to create the conversion sub-dict.

  4. Optionally falls back to returning the Ensembl gene itself when no synonym exists and return_ensembl_alternative is True.

Parameters:
  • converted (dict) – The current accumulator being built by convert().

  • cnvt (str) – Ensembl gene identifier that is undergoing final conversion.

  • final_database (str) – Target external database.

  • ens_release (int) – Target Ensembl release.

  • return_path (bool) – If True, embed the path(s) that lead to each synonym.

  • return_ensembl_alternative (bool) – When no synonym can be found, add a fallback entry that keeps the Ensembl gene.

  • prevent_assembly_jumps (bool) – If True, disallow conversion paths that cross between different genome assemblies. Defaults to True.

  • account_for_hyperconnected_nodes (bool) – If True, skip nodes that are marked as hyperconnective (very high connectivity) to prevent search explosion and low-quality paths. Defaults to False.

Returns:

The same converted dict, updated in place (and also returned for convenience).

Return type:

dict

Raises:

EmptyConversionMetricsError – Raised when no valid conversion metrics are available and no alternative conversion path can be found.

static _final_conversion_dict_prepare(confidence, sysns, paths, min_priority_list, len_priority_list, add_ass_jump_list, final_database)[source]

Assemble the final-conversion section that will be attached to a candidate path.

The section contains a global conversion-confidence flag plus one entry per synonym that survived the path-finding stage. When paths is None the structure is identical but omits the ‘the_path’ member to save memory.

Parameters:
  • confidence (int | float) – Heuristic confidence for the whole conversion step - 0 for “perfect”, larger values for fallback scenarios, np.inf when no conversion was possible.

  • sysns (list) – List of synonym identifiers in the same order as the metric lists below.

  • paths (list[list] | None) – One walk (edge list) per synonym, or None if the caller does not want to expose paths.

  • min_priority_list (list) – Minimum assembly priority reached by each walk.

  • len_priority_list (list) – Number of distinct assembly priorities encountered by each walk.

  • add_ass_jump_list (list) – Additional assembly-jump penalty incurred during the synonym hop itself.

  • final_database (str) – Name of the database these synonyms belong to (e.g. ‘uniprot’ or DB.nts_ensembl[DB.backbone_form]).

Returns:

Nested dictionary ready to be stored under the key ‘final_conversion’.

Return type:

dict

static _minimum_assembly_jumps_helper(step_pri, current_priority, priorities, assembly_priority=None)[source]

Internal worker for minimum_assembly_jumps().

Given the priority sets for the remaining edges, iterate until all have been consumed while updating the current assembly priority and counting how often it must drop.

Parameters:
  • step_pri (list[int]) – Priority values of the edge currently under consideration.

  • current_priority (int) – Priority value inherited from previous steps.

  • priorities (list[list[int]]) – Priority lists for the rest of the path, already sorted for correct bisecting.

  • assembly_priority (list[int] | None) – Optional global priority lattice. If None, it is computed from step_pri, current_priority, and priorities.

Returns:

Same three-tuple as documented in minimum_assembly_jumps().

Return type:

tuple[int, list[int], int]

_path_score_sorter_all_targets(dict_of_dict, from_id, to_release)[source]

Select the overall best target(s) once every candidate Ensembl node has itself been reduced to its single best path.

The method linearises several per-path metrics into an importance order (see the tuple at the top of the function), then:

  1. Computes that ordered score for each pair (ensembl_gene, final_target).

  2. Finds the global minimum; if multiple pairs tie:
    • Prefer the target whose identifier is identical to from_id.

    • If more than one Ensembl gene still ties, fall back on calculate_node_scores() to favour the “richer” node.

  3. Returns a pruned copy of dict_of_dict that contains only the surviving Ensembl genes, each with only the winning final_elements entry. Additional provenance is written to filter_scores.

Parameters:
  • dict_of_dict (dict) – Nested result of calculate_score_and_select(). Keys are candidate Ensembl genes; values are dictionaries that already contain one best path per final target.

  • from_id (str) – Original query identifier; used to break ties in favour of “same as input”.

  • to_release (int) – Target Ensembl release; forwarded to calculate_node_scores() during tie-breaking.

Returns:

A reduced version of dict_of_dict holding only the winner(s) and enriched with a final_elements[*][‘filter_scores’] sub-dict that records the filters applied.

Return type:

dict

Raises:
  • AssertionError – If node-score tie-breaking results in an empty candidate set.

  • ValueError – If dict_of_dict is empty.

static _path_score_sorter_single_target(lst_of_dict)[source]

Select the best score dictionary for one conversion target.

The input is a list of dictionaries produced by calculate_score_and_select(). Each dictionary is converted into a tuple according to the lexicographic importance order

(“assembly_jump”, “external_jump”, “external_step”, “edge_scores_reduced”, “ensembl_step”)

and the dictionary with the smallest tuple is returned.

Parameters:

lst_of_dict (list[dict]) – Candidate score dictionaries for this target.

Returns:

The chosen “winner” score dictionary.

Return type:

dict

Raises:

ValueError – If the input list is empty.
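
The lexicographic selection can be sketched in a few lines. The metric names come from the importance order quoted above; the helper pick_best and the sample values are hypothetical, and the sketch assumes every metric is already oriented so that smaller is better.

```python
# Importance order quoted from the description above.
IMPORTANCE_ORDER = (
    "assembly_jump",
    "external_jump",
    "external_step",
    "edge_scores_reduced",
    "ensembl_step",
)


def pick_best(score_dicts):
    """Return the dictionary whose metric tuple is lexicographically smallest."""
    if not score_dicts:
        raise ValueError("No candidate score dictionaries were given.")
    return min(score_dicts, key=lambda d: tuple(d[k] for k in IMPORTANCE_ORDER))


# Two made-up candidates; they first differ on external_jump (1 versus 0).
candidates = [
    dict(zip(IMPORTANCE_ORDER, (0, 1, 2, 0.1, 3))),
    dict(zip(IMPORTANCE_ORDER, (0, 0, 0, 0.4, 5))),
]
best = pick_best(candidates)
print(best["ensembl_step"])  # 5
```

Because tuples compare element by element, a later metric such as ensembl_step only matters when all earlier metrics tie.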

_recursive_synonymous(_the_id, synonymous_ones, synonymous_ones_db, filter_node_type, the_path=None, the_path_db=None, depth_max=0, from_release=None, ensembl_backbone_shallow_search=False, account_for_hyperconnected_nodes=True)[source]

Helper method to be used in _graph.Track.synonymous_nodes().

Recursively explore the bio-ID graph to collect synonymous paths starting at _the_id and ending on a node whose type is a member of filter_node_type.

A path is a list of node identifiers (the_path) together with a parallel list of their node-type strings (the_path_db). The search is depth-limited: the depth of a path is defined as the maximum count of any single node-type it contains (e.g. a path with three external nodes has depth 3). Recursion stops when that depth would exceed depth_max.

Additional pruning rules:

  • The walk never visits the same node twice (no cycles).

  • It never traverses two consecutive edges whose source and target share the same node-type; this prevents “time-travel” within the Ensembl history backbone.

  • When ensembl_backbone_shallow_search is True, the search is restricted to the reverse direction except for node-types listed in DB.nts_bidirectional_synonymous_search.

On reaching a terminating node the method appends the discovered paths to synonymous_ones and synonymous_ones_db. It returns nothing; results accumulate in those two lists.

Parameters:
  • _the_id (str) – Identifier of the starting node (Ensembl or external).

  • synonymous_ones (list) – Mutable list that will receive each successful identifier path.

  • synonymous_ones_db (list) – Mutable list that will receive the corresponding node-type paths.

  • filter_node_type (set[str]) – Allowed node-types for the final node of a path (e.g. {‘ensembl_gene’}).

  • the_path (list | None) – Current path leading to _the_id; None for the root invocation.

  • the_path_db (list | None) – Node-type counterpart of the_path; None for the root invocation.

  • depth_max (int) – Maximum allowed depth as defined above.

  • from_release (int | None) – If given, only keep terminal nodes that are active in this Ensembl release.

  • ensembl_backbone_shallow_search (bool) – Activate the shallow, mostly-reverse search mode described above.

  • account_for_hyperconnected_nodes (bool) – If True, skip nodes that are marked as hyperconnective (very high connectivity) to prevent search explosion and low-quality paths. Defaults to True.
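
The depth definition used above (the highest count of any single node-type along a path) can be illustrated as follows. The helper path_depth and the node-type labels are hypothetical; only the counting rule is taken from the description.

```python
from collections import Counter


def path_depth(node_type_path):
    """Depth of a path: the highest count of any single node-type along it."""
    if not node_type_path:
        return 0
    return max(Counter(node_type_path).values())


# Made-up node-type path with three "external" nodes.
example = [
    "ensembl_gene",
    "external",
    "ensembl_transcript",
    "external",
    "ensembl_translation",
    "external",
]
print(path_depth(example))  # 3
```

Under this rule a long path that alternates between many different node-types can still be shallow, whereas repeatedly returning to the same node-type increases the depth.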

calculate_node_scores(the_id, ens_release)[source]

Rank competing Ensembl targets by the “richness” of their synonyms.

The method counts, within a radius of two synonym hops, how many unique identifiers of various categories point to each candidate and returns the counts as negative integers so that smaller is better for the up-stream sorter.

Parameters:
  • the_id (str) – Identifier that is being converted.

  • ens_release (int) – Target Ensembl release; only synonyms active in this release are considered.

Returns:

[-ext, -form₁, -form₂] where
  • ext - number of distinct external-database synonyms.

  • form₁ - number of distinct synonyms of the most important Ensembl form (typically gene).

  • form₂ - number of distinct synonyms of the second form (typically transcript or translation).

Return type:

list

Raises:

ValueError – If the graph does not expose exactly the two expected non-backbone forms, or if a synonym node’s type cannot be mapped to external, form₁, or form₂.
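
Returning negated counts lets an ordinary ascending comparison prefer the richer node. A minimal sketch of that idea, with a hypothetical helper richness_score and made-up counts:

```python
def richness_score(ext, form1, form2):
    """Negate the counts so that a richer node yields a smaller list."""
    return [-ext, -form1, -form2]


# Made-up counts for two competing candidates.
scores = {
    "gene_A": richness_score(ext=12, form1=1, form2=4),
    "gene_B": richness_score(ext=12, form1=1, form2=7),
}
# Lists compare element by element, so min() picks the richest node.
winner = min(scores, key=scores.get)
print(winner)  # gene_B
```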

calculate_score_and_select(all_possible_paths, reduction, remove_na, from_releases, to_release, score_of_the_queried_item, return_path, from_id)[source]

Collapse a set of candidate paths into the single best path per target.

For each path produced by the search engine the function:

  1. Computes an edge-score aggregate using reduction while handling missing values as directed by remove_na.

  2. Tallies external statistics (steps, jumps, initial conversion confidence) and assembly statistics (number of priority drops, final priority).

  3. Packs all metrics into a dictionary and stores it under the key of the path’s final destination node.

  4. Keeps only the lexicographically “smallest” dictionary per destination via _path_score_sorter_single_target().

Parameters:
  • all_possible_paths (tuple) – Sequence of edge-lists representing every admissible walk returned by the path-finder.

  • reduction (Callable) – Function such as np.mean or sum used to collapse edge weights into one number.

  • remove_na (str) – How to treat NaN edge weights - one of ‘omit’, ‘to_1’, ‘to_0’.

  • from_releases (Iterable[int]) – Release that each path starts from; must align with all_possible_paths.

  • to_release (int) – Target release - needed to know whether an edge is traversed forward or reverse.

  • score_of_the_queried_item (float) – Fallback weight for the implicit edge that represents the query ID itself.

  • return_path (bool) – If True, embed the full edge-list inside each score dict under the key ‘the_path’.

  • from_id (str) – Original identifier being converted - echoed back in the score dict for traceability.

Returns:

Mapping {destination_id → best_score_dict}. Each score dict contains (inter alia) assembly_jump, external_jump, external_step, edge_scores_reduced, and ensembl_step.

Return type:

dict

Raises:

ValueError – If an unexpected edge encoding is encountered, if an edge score is invalid/∞, or if remove_na is set to an unknown mode.
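
How reduction and remove_na might interact can be sketched as below. This is an assumption-laden illustration rather than the library's code: the helper reduce_edge_scores is hypothetical, and returning NaN when every score is omitted is a choice made for this sketch only.

```python
import math
from statistics import mean


def reduce_edge_scores(scores, reduction=mean, remove_na="omit"):
    """Collapse per-edge scores into one number after handling NaN values."""
    if remove_na == "omit":
        cleaned = [s for s in scores if not math.isnan(s)]
    elif remove_na == "to_1":
        cleaned = [1.0 if math.isnan(s) else s for s in scores]
    elif remove_na == "to_0":
        cleaned = [0.0 if math.isnan(s) else s for s in scores]
    else:
        raise ValueError(f"Unknown remove_na mode: {remove_na!r}")
    if not cleaned:
        # Assumption of this sketch: nothing left to reduce yields NaN.
        return math.nan
    return reduction(cleaned)


edge_scores = [0.5, math.nan, 1.0]
print(reduce_edge_scores(edge_scores, remove_na="omit"))  # 0.75
print(reduce_edge_scores(edge_scores, remove_na="to_0"))  # 0.5
```

The three modes differ in whether a missing weight is ignored, treated as the best-case value, or treated as the worst-case value; which of 0 and 1 is "best" depends on how the edge weights are defined in the graph.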

choose_relevant_synonym(the_id, depth_max, to_release, filter_node_type, from_release)[source]

Wrapper that discovers, clusters, and ranks synonymous Ensembl candidates for a given identifier.

The function performs three steps:

  1. Discover paths to all Ensembl-gene nodes that share the same biological identity (synonymous_nodes).

  2. Cluster those paths by gene ID (ignoring version).

  3. Rank each cluster with _choose_relevant_synonym_helper(), selecting the entry release (and direction) that best suits to_release.

Parameters:
  • the_id (str) – Source identifier (Ensembl or external).

  • depth_max (int) – Maximum depth passed to synonymous_nodes(); governs how far the synonym search is allowed to roam through external nodes.

  • to_release (int) – Target Ensembl release required by the overall conversion.

  • filter_node_type (set[str]) – Node-types that the synonym search must terminate on (usually {‘ensembl_gene’}).

  • from_release (int | None) – Known active release of the_id. If None, the helper will infer one.

Returns:

A list whose elements are

[synonym_id, entry_release, reverse, identifier_path, node_type_path]

where the last two items reproduce the path returned by synonymous_nodes().

Return type:

list[list[Any]]

Notes

The method purposefully keeps all equally-ranked candidates; further tie-breaking is deferred to the main path-scoring routine.
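
The clustering step (group candidates by gene ID, ignoring the version) can be sketched as follows. The helper cluster_by_gene_id is hypothetical, and the sketch assumes versioned identifiers of the form stable_id.version; the node naming inside the actual graph may differ.

```python
from collections import defaultdict


def cluster_by_gene_id(versioned_ids):
    """Group versioned identifiers by their version-less stable ID."""
    clusters = defaultdict(list)
    for versioned_id in versioned_ids:
        # Assumed format: "<stable_id>.<version>".
        stable_id, _, _version = versioned_id.partition(".")
        clusters[stable_id].append(versioned_id)
    return dict(clusters)


# Made-up identifiers used only to show the grouping.
ids = ["ENSG00000000001.1", "ENSG00000000001.2", "ENSG00000000002.5"]
print(cluster_by_gene_id(ids))
```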

convert(from_id, from_release=None, to_release=None, final_database=None, reduction=<function mean>, remove_na='omit', score_of_the_queried_item=nan, go_external=True, prioritize_to_one_filter=False, return_path=False, deprioritize_lrg_genes=True, return_ensembl_alternative=True)[source]

End-to-end ID conversion workflow.

Starting from from_id the routine

  1. Determines the correct time-travel direction if from_release is unspecified.

  2. Enumerates all admissible paths with get_possible_paths() (forward and/or reverse).

  3. Collapses those paths with calculate_score_and_select().

  4. Optionally converts the surviving Ensembl gene(s) into final_database via _final_conversion().

  5. Optionally applies a final global selection with _path_score_sorter_all_targets().

The output structure mirrors this decision tree and, when return_path is True, embeds the full edge list so that callers can audit every hop.

Parameters:
  • from_id (str) – Source identifier (Ensembl, UniProt, RefSeq, …).

  • from_release (int | None) – Starting Ensembl release. None → infer from the graph.

  • to_release (int | None) – Target Ensembl release. Defaults to the newest release contained in the graph.

  • final_database (str | None) – External database to convert into. None → stay on the Ensembl gene.

  • reduction (Callable) – Function (e.g. numpy.mean) used to collapse per-edge weights. Must accept an iterable of floats and return a float.

  • remove_na (str) – Strategy for NaN edge weights - ‘omit’, ‘to_1’, or ‘to_0’.

  • score_of_the_queried_item (float) – Weight assigned to the implicit edge that represents from_id itself.

  • go_external (bool) – Allow jumps through external databases when the backbone is disconnected.

  • prioritize_to_one_filter (bool) – After all scoring, keep only the single globally best target.

  • return_path (bool) – Embed the full edge list(s) in the returned dictionary.

  • deprioritize_lrg_genes (bool) – If True and other results exist, drop LRG_* genomic regions from the final set.

  • return_ensembl_alternative (bool) – When converting to an external database, also return the Ensembl gene as a fallback.

Returns:

  • dict - Structured result as described above.

  • None - No admissible path was found.

Return type:

dict | None

Raises:

ValueError – For non-callable reduction, unsupported remove_na modes, unknown final_database values, or logical inconsistencies detected during processing.

convert_optimized_multiple()[source]

Placeholder for a batch-optimised converter.

The intended behaviour is to accept multiple query IDs and choose a conversion target for each such that cross-sample clashes (e.g. duplicate loci) are minimised.

Note

This method is a placeholder for future implementation. Use idtrack._api.API.convert_identifier_multiple() for batch conversions until this optimised version is available.

Raises:

NotImplementedError – Always - the optimisation strategy is not yet implemented.

edge_key_orientor(n1, n2, n3)[source]

Return the stored orientation of a multigraph edge.

For multigraphs every logical edge is stored once, but the caller may hold (u, v, k) or (v, u, k). This helper resolves the ambiguity so that subsequent attribute look-ups succeed.

Parameters:
  • n1 (str) – One endpoint of the edge.

  • n2 (str) – The other endpoint.

  • n3 (int) – Edge key (index) within the networkx multi-edge.

Returns:

A triple that is guaranteed to exist as written in self.graph.

Return type:

tuple[str, str, int]

Raises:

AssertionError – If neither orientation is present in the graph.
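
The orientation check amounts to trying both endpoint orders. A minimal sketch under stated assumptions: the helper orient_edge is hypothetical, and a plain set of (u, v, key) triples stands in for the edges stored in self.graph.

```python
def orient_edge(stored_edges, n1, n2, key):
    """Return the (u, v, key) triple exactly as it appears in stored_edges."""
    if (n1, n2, key) in stored_edges:
        return (n1, n2, key)
    if (n2, n1, key) in stored_edges:
        return (n2, n1, key)
    raise AssertionError(f"Edge between {n1!r} and {n2!r} with key {key} not found.")


# Stand-in for the graph's edge store.
edges = {("A", "B", 0), ("B", "C", 1)}
print(orient_edge(edges, "B", "A", 0))  # ('A', 'B', 0)
```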

static get_from_release_and_reverse_vars(lor, p, mode)[source]

Derive a list of (release, reverse) tuples that indicate which Ensembl release to start the graph walk from and whether that walk should move backwards in time.

Given a collection of active-range intervals lor and a pivot release p, the algorithm selects one or two release points per interval depending on mode:

  • ‘closest’ - choose the release nearest to p within or at the ends of the interval.

  • ‘distant’ - choose the release farthest from p within the interval.

The boolean in each tuple is True when the walk should start after the selected release and move backwards (i.e. “reverse mode”), and False when it should move forwards.

Parameters:
  • lor (list) – List of inclusive (first_release, last_release) intervals in ascending order.

  • p (int) – Pivot release around which “closest” or “distant” is evaluated.

  • mode (str) – Either ‘closest’ or ‘distant’.

Returns:

Release / reverse-flag pairs, ordered in the sequence they should be tried by the path-finder.

Return type:

list[tuple[int, bool]]

Raises:

ValueError – If an interval in lor is malformed, mode is not recognised, or internal consistency checks fail.

get_next_edges(from_id, from_release, reverse, debugging=False)[source]

Enumerate chronologically admissible history edges from a node.

Starting at from_id and release from_release, the method scans outgoing (or incoming, when reverse is True) edges whose timestamps allow the path to advance in the desired temporal direction. It collapses duplicate “same-ID” transitions and flags self-loops so that later heuristics can treat branch points and tips differently.

Parameters:
  • from_id (str) – Current node from which the search will step.

  • from_release (int) – Release at which the current node is known to exist.

  • reverse (bool) – False to walk forward in history (old → new), True to walk backward (new → old).

  • debugging (bool) – If set, disables the duplicate-edge collapse so that unit tests can inspect the raw edge set.

Returns:

Sorted list of edge descriptors, each of which is

[edge_release, is_self_loop, src_node, dst_node, multiedge_key].

Return type:

list[list[Union[int, bool, str, int]]]

Raises:

ValueError – If inconsistent multi-edges (same nodes, same release) are detected—this signals a corrupted graph build.

get_possible_paths(from_id, from_release, to_release, reverse, go_external=True, increase_depth_until=2, increase_jump_until=0, from_release_inferred=False)[source]

Run path_search() under progressively relaxed settings until at least one viable path is found—or every relaxation level is exhausted.

Four search stages are attempted in order:

  1. Backbone-only - external jumps disabled.

  2. External enabled - allow external jumps; increment synonym depth and jump limit after each failure up to increase_depth_until/increase_jump_until.

  3. Backbone with multiple-Ensembl transition - external disabled but permit starting release to shift on external nodes.

  4. External + multiple-transition - most permissive search, with iterative depth/jump relaxation as in stage 2.

Parameters:
  • from_id (str) – Identifier to convert.

  • from_release (int) – Release at which the search begins.

  • to_release (int) – Desired target release.

  • reverse (bool) – Traverse the Ensembl history backwards if True, forwards otherwise.

  • go_external (bool) – If False, skip any stage that requires external jumps.

  • increase_depth_until (int) – Additional synonym-search depth to allow beyond the default.

  • increase_jump_until (int) – Additional external-jump count to allow beyond the default.

  • from_release_inferred (bool) – Reserved for future use. Indicates that from_release was chosen automatically rather than provided by the user.

Returns:

All paths discovered by the most restrictive stage that yielded at least one result, returned as an immutable tuple.

Return type:

tuple[tuple[tuple[str, str, int]]]

Notes

The function copies and mutates DB.external_search_settings internally; the caller’s copy is not modified.
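
The general pattern (try the strictest settings first and stop at the first stage that yields a result) can be sketched as below. Everything here is hypothetical: search_with_relaxation, toy_search, and the settings keys merely imitate the staged behaviour and are not the arguments of path_search().

```python
def search_with_relaxation(search, stages):
    """Run search with each settings dict in order; stop at the first hit."""
    for settings in stages:
        paths = search(**settings)
        if paths:
            return tuple(paths)
    return ()


def toy_search(go_external, depth):
    """Toy path-finder that only succeeds with external jumps and depth >= 2."""
    if go_external and depth >= 2:
        return [(("A", "B", 0),)]
    return []


stages = [
    {"go_external": False, "depth": 1},  # strictest stage
    {"go_external": True, "depth": 1},
    {"go_external": True, "depth": 2},  # most permissive stage
]
found = search_with_relaxation(toy_search, stages)
print(found)  # ((('A', 'B', 0),),)
```

Stopping at the first successful stage is what makes the result come from the most restrictive settings that work.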

identify_source(dataset_ids, mode)[source]

Infer the most likely origin (assembly and/or Ensembl release) of a heterogeneous identifier list.

The function tallies how often each origin triple appears among dataset_ids and returns the counts sorted in descending order.

Parameters:
  • dataset_ids (list[str]) – Collection of identifiers to analyse.

  • mode (str) – Granularity of the origin to extract; one of:

    • ‘complete’ → (assembly, db, release)

    • ‘ensembl_release’ → release only

    • ‘assembly’ → assembly only

    • ‘assembly_ensembl_release’ → (assembly, release)

Returns:

Pairs (origin, count) sorted by frequency.

Return type:

list[tuple[Any, int]]

Raises:

ValueError – If mode is not one of the recognised values.
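
The tallying step can be sketched with collections.Counter. The helper tally_origins is hypothetical and, unlike the real method, takes ready-made (assembly, db, release) triples instead of looking identifiers up in the graph; the sample triples are made up.

```python
from collections import Counter

# One selector per mode listed above; each maps a full triple to the reported key.
SELECTORS = {
    "complete": lambda origin: origin,
    "ensembl_release": lambda origin: origin[2],
    "assembly": lambda origin: origin[0],
    "assembly_ensembl_release": lambda origin: (origin[0], origin[2]),
}


def tally_origins(origins, mode):
    """Count origins at the requested granularity, most frequent first."""
    if mode not in SELECTORS:
        raise ValueError(f"Unknown mode: {mode!r}")
    select = SELECTORS[mode]
    return Counter(select(origin) for origin in origins).most_common()


origins = [
    (38, "ensembl_gene", 105),
    (38, "ensembl_gene", 105),
    (37, "ensembl_gene", 75),
]
print(tally_origins(origins, "assembly"))  # [(38, 2), (37, 1)]
```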

minimum_assembly_jumps(the_path, step_pri=None, current_priority=None)[source]

Compute the penalty incurred by assembly downgrades along a path.

Each path step may be annotated with one or more candidate assemblies. These are translated into priority values via the organism-scoped configuration DB.assembly_mysqlport_priority. The algorithm walks the path, tracking the current priority and counting how many times it must drop to a lower priority value—each drop constitutes an “assembly jump” penalty.

Parameters:
  • the_path (Iterable[tuple]) – Sequence of edge descriptors; each element is either (n1, n2, k) or (n1, n2, k, release).

  • step_pri (list[int] | None, optional) – Priority list for the first edge. If None, it is derived from the_path.

  • current_priority (int | None, optional) – Starting priority. If None, initialised to max(step_pri).

Returns:

  • assembly_jump - total number of priority drops.

  • step_pri - priority list of the last processed edge.

  • current_priority - priority value after the final edge.

Return type:

tuple[int, list[int], int]

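path_search(from_id, from_release, to_release, reverse, external_settings, external_jump, multiple_ensembl_transition)[source]

Enumerate every admissible history path from from_id at from_release to to_release.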

The algorithm performs a depth-first traversal of the Ensembl history edges. Whenever it becomes stranded on a non-backbone node it may “beam-up” via a synonym path through an external database, subject to the constraints in external_settings:

  • synonymous_max_depth - maximum depth of a synonym search.

  • jump_limit - maximum number of external “beam-up” jumps allowed.

  • nts_backbone - canonical node-type of the Ensembl backbone.

Additional flags control the initial conditions:

  • Setting external_jump to np.inf disables external jumps. Setting it to None enables them with the counter reset to 0.

  • multiple_ensembl_transition allows the algorithm to time-travel to a different release while still on an external node; this is useful when from_release was inferred and might not actually connect.

Parameters:
  • from_id (str) – Identifier to start the search from.

  • from_release (int) – Release number where from_id is considered active.

  • to_release (int) – Target Ensembl release.

  • reverse (bool) – If True, traverse the graph backwards in time; otherwise forwards.

  • external_settings (dict) – Copy of DB.external_search_settings that governs depth, jump limits, and backbone node-type.

  • external_jump (float | None) – Current external-jump count (None starts from zero, np.inf forbids any jump).

  • multiple_ensembl_transition (bool) – Permit the synonym engine to select a different release for an external node when no path exists at from_release.

Returns:

A set of edge-lists. Each edge is stored as (src, dst, key); an empty walk that terminates immediately is represented by ((None, from_id, None),).

Return type:

set[tuple[tuple[str, str, int]]]

Notes

The method is intentionally side-effect free; it constructs all intermediate data on the stack and returns a fresh set.

path_step_possible_assembly_jumps(n1, n2, n3, n4=None)[source]

Return the genome-assembly codes that can legally be used for a single edge.

The helper inspects the edge that connects n1 and n2 and filters the assemblies recorded on that edge against the release constraint n4:

  • None - the edge is treated as backbone history; the result is the graph-wide default assembly (usually the build on which the backbone was constructed).

  • int - keep only assemblies whose release set contains that single release.

  • set[int] - keep assemblies whose release set intersects the provided set.

Parameters:
  • n1 (str) – Source node identifier.

  • n2 (str) – Destination node identifier.

  • n3 (int) – Edge key within the NetworkX multigraph.

  • n4 (int | set[int] | None, optional) – Release filter as described above.

Returns:

Sorted list of assembly codes (species-specific integers; e.g. [37, 38] for human).

Return type:

list[int]

Raises:

ValueError – If n4 is of an unsupported type.
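
The three release-filter cases can be sketched as follows. The helper filter_assemblies, the shape of edge_assemblies (assembly code mapped to a set of releases), and the default_assembly argument are assumptions of this sketch, not the library's data model.

```python
def filter_assemblies(edge_assemblies, release_filter, default_assembly):
    """Return the sorted assembly codes compatible with release_filter."""
    if release_filter is None:
        # Backbone history: fall back to the graph-wide default assembly.
        return [default_assembly]
    if isinstance(release_filter, int):
        wanted = {release_filter}
    elif isinstance(release_filter, (set, frozenset)):
        wanted = set(release_filter)
    else:
        raise ValueError(f"Unsupported release filter: {release_filter!r}")
    return sorted(
        assembly
        for assembly, releases in edge_assemblies.items()
        if releases & wanted
    )


# Toy edge annotation: assembly code -> releases recorded on the edge.
edge = {37: {74, 75}, 38: {76, 77, 78}}
print(filter_assemblies(edge, 75, default_assembly=38))        # [37]
print(filter_assemblies(edge, {75, 76}, default_assembly=38))  # [37, 38]
print(filter_assemblies(edge, None, default_assembly=38))      # [38]
```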

should_graph_reversed(from_id, to_release)[source]

Determine the temporal orientation of the graph walk.

Given an identifier that is active in one or more release intervals, the routine decides whether the subsequent path-finder must move forward in time, backward in time, or explore both directions in parallel in order to reach the target release.

The decision is based on the closest boundary of every active interval returned by Track.get_from_release_and_reverse_vars() (mode=’closest’).

Parameters:
  • from_id (str) – The starting identifier (Ensembl gene, transcript, protein, or external ID).

  • to_release (int) – The Ensembl release the user wishes to convert to.

Returns:

A tuple (direction, start) in one of three forms:

  • (‘forward’, start_release) - walk old → new, starting at the earliest release in which from_id is active.

  • (‘reverse’, start_release) - walk new → old, starting at the latest active release.

  • (‘both’, (forward_start, reverse_start)) - split search: one forward walk and one reverse walk.

Return type:

tuple[str, Union[int, tuple[int, int]]]

Raises:

ValueError – If from_id is never active in or around to_release (i.e. no viable starting release can be found).
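Because the second tuple element is either an int or a pair of ints, callers have to branch on the direction string before using it. The sketch below shows one way to normalise the documented return value into explicit walks; describe_walk is a hypothetical helper, and the release numbers are made up.

```python
from typing import Union

Decision = tuple[str, Union[int, tuple[int, int]]]


def describe_walk(decision: Decision) -> list[tuple[str, int]]:
    """Expand a decision into explicit (direction, start_release) walks."""
    direction, start = decision
    if direction in ("forward", "reverse"):
        return [(direction, start)]
    if direction == "both":
        # The payload is (forward_start, reverse_start) in this case.
        forward_start, reverse_start = start
        return [("forward", forward_start), ("reverse", reverse_start)]
    raise ValueError(f"Unknown direction: {direction!r}")


print(describe_walk(("forward", 90)))
print(describe_walk(("both", (90, 104))))
```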

synonymous_nodes(the_id, depth_max, filter_node_type, from_release=None, ensembl_backbone_shallow_search=False, account_for_hyperconnected_nodes=True)[source]

Public wrapper around _recursive_synonymous().

The method returns all minimal-length synonym paths emanating from the_id.

The function first runs a default depth search determined by DB.external_search_settings[‘synonymous_max_depth’]. If no synonym is found and depth_max is greater than that default, a second, deeper search is attempted.

For every distinct target node the shortest path is kept; longer paths to the same target are discarded.

Parameters:
  • the_id (str) – Source identifier.

  • depth_max (int) – Maximum search depth to try if the default search fails.

  • filter_node_type (set[str]) – Node-types that are acceptable for the target node(s). Must not include the generic ‘external’ type—specify the concrete external DB instead.

  • from_release (int | None) – Constrain targets to those active in this Ensembl release.

  • ensembl_backbone_shallow_search (bool) – If True, restricts the graph traversal as explained in _recursive_synonymous().

  • account_for_hyperconnected_nodes (bool) – If True, skip nodes that are marked as hyperconnective (very high connectivity) to prevent search explosion and low-quality paths. Defaults to True.

Returns:

A list whose elements are [identifier_path, node_type_path] pairs, each representing the minimal route to one synonymous node.

Return type:

list[list[list[str]]]

Raises:

ValueError – If filter_node_type improperly contains the generic external type, or if depth_max is incompatible with ensembl_backbone_shallow_search.
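The rule "keep only the shortest path per target" can be sketched on plain lists. This is an illustration of the selection step only, not the graph search; keep_shortest_paths is a hypothetical name, and the identifiers and node types in the sample data are invented.

```python
def keep_shortest_paths(paths):
    """Keep, for every target node, the shortest [id_path, type_path] pair."""
    best = {}
    for id_path, type_path in paths:
        target = id_path[-1]  # the last identifier is the synonymous node
        current = best.get(target)
        if current is None or len(id_path) < len(current[0]):
            best[target] = [id_path, type_path]
    return list(best.values())


# Invented candidate paths: two routes reach TARGET_X, one reaches TARGET_Y.
candidates = [
    [["SRC", "MID_1", "MID_2", "TARGET_X"], ["db_a", "db_b", "db_c", "db_d"]],
    [["SRC", "MID_3", "TARGET_X"], ["db_a", "db_e", "db_d"]],
    [["SRC", "MID_1", "TARGET_Y"], ["db_a", "db_b", "db_d"]],
]
for id_path, type_path in keep_shortest_paths(candidates):
    print(id_path, type_path)
```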

class TrackTests(*args, **kwargs)[source]

Bases: Track, ABC

Developer-facing integrity-test harness for Track.

TrackTests is a mix-in that adds an extensive white-box test suite to a populated idtrack.Track instance. The class is for developers only; it should never be used in production pipelines. Every public method beginning with is_ returns a boolean that tells whether a specific invariant holds. Methods beginning with history_ execute heavier, end-to-end conversions and collect rich statistics. The class is intended to be mixed into a concrete Track subclass, or instantiated standalone, after the underlying graph and lookup tables have been fully built. It performs only read-only operations and therefore carries no risk of mutating state.

All test methods share the following contract:

  • They never raise on failure—return-value only—so they can be run in bulk without interrupting your session.

  • A return value of True means the invariant holds; False means a violation was detected.

  • Where useful, a verbose flag gives a tqdm progress bar so long-running checks remain user-friendly.

Typical use:

tests = TrackTests(...)
tests.is_id_functions_consistent_ensembl()  # Returns False if inconsistent (never raises).

Note

The class is not designed for production; instantiate it only in test suites or interactive debugging sessions.
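Since the is_ methods report through their return value, they lend themselves to a bulk runner. The sketch below uses a stand-in class, FakeChecks, because the real checks need a fully built graph; both FakeChecks and run_boolean_checks are hypothetical. It only calls methods that can be invoked without arguments.

```python
import inspect


class FakeChecks:
    """Stand-in for a TrackTests instance; the real checks need a built graph."""

    def is_first_invariant(self):
        return True

    def is_second_invariant(self, verbose=True):
        return False

    def is_needing_arguments(self, release):
        return True  # skipped by the runner: requires an argument

    def history_something(self):
        return {}  # skipped by the runner: not an is_ method


def _callable_without_arguments(method) -> bool:
    """True if every parameter of the bound method has a default."""
    for param in inspect.signature(method).parameters.values():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue
        if param.default is inspect.Parameter.empty:
            return False
    return True


def run_boolean_checks(obj) -> dict[str, bool]:
    """Call every argument-free ``is_`` method and collect the booleans."""
    results = {}
    for name in sorted(dir(obj)):
        if not name.startswith("is_"):
            continue
        method = getattr(obj, name)
        if callable(method) and _callable_without_arguments(method):
            results[name] = bool(method())
    return results


report = run_boolean_checks(FakeChecks())
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'VIOLATION'}")
```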

Initialize the test harness.

All positional and keyword arguments are forwarded verbatim to Track.__init__. Besides constructing the underlying graph, the initializer sets up a dedicated logging.Logger named "track_tests" so individual test routines can emit structured diagnostics without polluting the main application log.

Parameters:
  • args – Positional arguments accepted by Track.__init__.

  • kwargs – Keyword arguments accepted by Track.__init__.

_format_history_travel_testing_report(res, include_header=False, line_separation_at_end=True)[source]

Format a complete history travel testing report with metrics.

Parameters:
  • res (dict[str, Any]) – Results dictionary from history_travel_testing containing conversion metrics and parameters.

  • include_header (bool) – If True, include the header section with source/target information. Defaults to False.

  • line_separation_at_end (bool) – If True, append a blank line separator at the end of the report. Defaults to True.

Returns:

Lines of formatted text for the complete report.

Return type:

list[str]

_format_history_travel_testing_report_header(p)[source]

Format the header section for a history travel testing report.

Parameters:

p (dict[str, Any]) – Parameters dictionary containing from_database, from_assembly, from_release, to_database, and to_release.

Returns:

Lines of formatted text for the report header.

Return type:

list[str]

history_travel_testing(from_release, from_assembly, from_database, to_release, to_database, go_external, prioritize_to_one_filter, convert_using_release, from_fraction=1.0, verbose=True, verbose_detailed=False, return_ensembl_alternative=False)[source]

Run an end-to-end Ensembl-history conversion and collect granular QA metrics.

The routine samples identifiers from from_database/from_release (optionally down-sampling via from_fraction) and converts each one to to_database/to_release using idtrack.Track.convert(). It is intentionally non-fatal: every failure mode is caught, logged and tallied so that large regression suites can run unattended. All results are returned in a single nested metrics dictionary whose structure mirrors the printable report produced by _format_history_travel_testing_report().

The statistics fall into four conceptual groups; each counter not only records an absolute event count but also serves as a red-flag indicator for specific classes of mapping pathology. Use the guidelines below to interpret the numbers and decide whether a run is healthy, questionable, or action-required.

  • Failure / anomaly counters

    • history_voyage_failed_gracefully - the converter raised EmptyConversionMetricsError.

    • history_voyage_failed_unknown - any other unexpected exception.

    • query_not_in_the_graph - source ID absent from the graph.

    • lost_item - traversal finished but produced no final IDs.

    • lost_item_but_the_same_id_exists - special case of lost_item when the target and source DB are both Ensembl-gene and the target ID still exists in the graph.

    • found_ids_not_accurate - at least one returned target ID is not part of the authoritative ids_to reference set.

  • Mapping quality

    • one_to_one_ids - queries that resolved to exactly one target ID.

    • one_to_multiple_ids - queries with > 1 admissible targets.

    • one_to_multiple_final_conversion - subset of the above where exactly one traversal path was found (heuristics eliminated alternatives).

  • Collision analysis

    The list clashing_id_type == [clash_one_one, clash_multi_multi, clash_multi_one] classifies target IDs that were reached by more than one query:

    • clash_one_one - every colliding query was 1→1.

    • clash_multi_multi - every colliding query was 1→many.

    • clash_multi_one - mixture of 1→1 and 1→many queries (most alarming category).

  • Timings & book-keeping

    • time - wall-clock runtime in seconds.

    • conversion - per-query mapping result for all successful traversals.

    • converted_item_dict / converted_item_dict_reversed - raw per-ID caches used to derive the higher-level counters above.

    • parameters - echo of the function arguments.

    • ids - the sampled from and reference to ID sets.

Parameters:
  • from_release (int) – Ensembl release number of the source IDs.

  • from_assembly (int) – Genome assembly code of the source IDs.

  • from_database (str) – Node-type / database of the source IDs.

  • to_release (int) – Ensembl release number of the target IDs.

  • to_database (str) – Node-type / database to convert into.

  • go_external (bool) – Permit temporary detours through external IDs when native Ensembl history edges break.

  • prioritize_to_one_filter (bool) – Prefer 1→1 mappings over 1→many when multiple paths exist.

  • convert_using_release (bool) – Pass from_release straight into idtrack.Track.convert() instead of letting it infer the starting point.

  • from_fraction (float) – Fraction (0 < x ≤ 1) of the ids_from population to sample; speeds up smoke tests.

  • verbose (bool) – Show tqdm progress bar (coarse).

  • verbose_detailed (bool) – Embed live metric counters in the tqdm postfix.

  • return_ensembl_alternative (bool) – Forwarded to idtrack.Track.convert().

Raises:

ValueError – If either database argument refers to an Ensembl node-type (must use backbone helpers instead) or if from_fraction is outside the half-open interval (0, 1].

Returns:

Nested metrics dictionary with the layout described above. Use _format_history_travel_testing_report() for a human-readable summary.

Return type:

dict

Notes

  • All counters are absolute counts - divide by len(metrics['ids']['from']) to obtain rates.

  • The collision analysis is derived from the clash statistics computed at the end of the function and helps spot supposedly “unique” IDs that have become ambiguous. Keeping all three clash counters at zero is the gold standard for a healthy build.
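Turning counts into rates, as the first note suggests, might look like the sketch below. It assumes each counter is stored as a top-level integer under its own name and that the sampled source IDs live under metrics['ids']['from']; the exact dictionary layout and the helper name counters_to_rates are assumptions, and the sample values are invented.

```python
def counters_to_rates(metrics: dict, counter_names: list[str]) -> dict[str, float]:
    """Divide each absolute counter by the number of sampled source IDs."""
    n_queries = len(metrics["ids"]["from"])
    if n_queries == 0:
        raise ValueError("No source IDs were sampled; rates are undefined.")
    return {name: metrics[name] / n_queries for name in counter_names}


# Invented metrics dictionary with the assumed layout.
metrics = {
    "ids": {"from": ["id_1", "id_2", "id_3", "id_4"], "to": ["t_1", "t_2"]},
    "one_to_one_ids": 3,
    "lost_item": 1,
}
rates = counters_to_rates(metrics, ["one_to_one_ids", "lost_item"])
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```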

history_travel_testing_random(from_fraction, include_ensembl_source=True, include_external_source=True, include_ensembl_destination=True, include_external_destination=True, verbose=True, verbose_detailed=False, strict_forward=False, convert_using_release=False, prioritize_to_one_filter=True, return_result=False)[source]

Convenience wrapper around history_travel_testing().

The routine generates a random but internally consistent test case via history_travel_testing_random_arguments_generator(), logs the chosen parameters (unless verbose is False) and delegates the heavy lifting to history_travel_testing().

Parameters:
  • from_fraction (float) – Fraction of IDs to sample from the source set.

  • strict_forward (bool) – Forwarded to the argument generator.

  • convert_using_release (bool) – Forwarded to history_travel_testing().

  • prioritize_to_one_filter (bool) – Forwarded to history_travel_testing().

  • verbose (bool) – Show coarse progress information.

  • verbose_detailed (bool) – Include extended per-ID counters in the progress bar.

  • return_result (bool) – If True, return the metrics dictionary.

  • include_ensembl_source (bool) – Include Ensembl databases as valid sources.

  • include_external_source (bool) – Include external databases as valid sources.

  • include_ensembl_destination (bool) – Include Ensembl databases as valid destinations.

  • include_external_destination (bool) – Include external databases as valid destinations.

Returns:

The metrics dictionary returned by history_travel_testing().

Return type:

dict

history_travel_testing_random_arguments_generator(strict_forward, include_exclude_list)[source]

Generate a plausible random parameter set for history_travel_testing().

The helper picks compatible source/target assemblies, releases and databases so the subsequent conversion test has a realistic chance to succeed. When strict_forward is True the target release is guaranteed to be ≥ the source release (no time-travel back).

Parameters:
  • strict_forward (bool) – Enforce a non-decreasing release direction.

  • include_exclude_list (list[bool]) – A 4-element list of booleans controlling inclusion of [include_ensembl_source, include_external_source, include_ensembl_destination, include_external_destination].

Returns:

A dictionary with the keys from_assembly, from_release, to_release, from_database and to_database, ready to be unpacked as keyword arguments into history_travel_testing().

Return type:

dict

how_many_corresponding_path_ensembl(from_release, from_assembly, to_release, go_external, verbose=True)[source]

Count history paths between two Ensembl releases.

The method iterates over all Ensembl-gene stable IDs that exist in from_release/from_assembly. For every ID that is present in the graph it calls idtrack.Track.get_possible_paths() and records how many distinct paths the searcher finds to to_release.

The routine is non-destructive; it merely provides a quick way to gauge the density of the history graph or to spot releases where path-finding was unexpectedly difficult.

Parameters:
  • from_release (int) – Source Ensembl release number.

  • from_assembly (int) – Source genome assembly code.

  • to_release (int) – Target Ensembl release number.

  • go_external (bool) – If True, history paths are allowed to temporarily leave the Ensembl lineage via external databases.

  • verbose (bool) – Show a tqdm progress bar (default True).

Returns:

A list of two-element sub-lists [stable_id, n_paths] where n_paths is

  • an int ≥ 0 when the ID was in the graph, or

  • None when the source ID was absent.

Return type:

list[list[Union[str, int, None]]]
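The returned rows are easy to condense into a small summary. The sketch below works on an invented result list with the documented [stable_id, n_paths] shape; summarise_path_counts and the bucket names are hypothetical.

```python
from collections import Counter


def summarise_path_counts(rows) -> dict[str, int]:
    """Bucket [stable_id, n_paths] rows by how many paths were found."""
    summary = Counter()
    for _stable_id, n_paths in rows:
        if n_paths is None:
            summary["absent_from_graph"] += 1
        elif n_paths == 0:
            summary["no_path"] += 1
        elif n_paths == 1:
            summary["single_path"] += 1
        else:
            summary["multiple_paths"] += 1
    return dict(summary)


# Invented rows; real stable IDs would come from the chosen release.
rows = [["GENE_A", 1], ["GENE_B", 0], ["GENE_C", None], ["GENE_D", 3], ["GENE_E", 1]]
print(summarise_path_counts(rows))
```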

is_base_is_range_correct(verbose=True)[source]

Verify consistency of base-gene active-range calculations.

Each “base Ensembl gene” node (node_type == 'base_ensembl_gene') has an active release range—the list of Ensembl releases during which descendants of the gene were present. There are two independent ways to obtain this information:

  1. High-level helper graph.get_active_ranges_of_base_id_alternative - a cached convenience wrapper.

  2. Low-level reconstruction by aggregating the combined_edges table and converting the set of releases into compact [start, end] slices via graph.list_to_ranges.

This test iterates through all base-gene nodes and asserts that the two methods deliver byte-identical results.

Parameters:

verbose (bool) – If True (default) show a tqdm progress bar that updates with the current node under inspection.

Returns:

True if every base-gene has matching ranges; False as soon as a single mismatch is encountered.

Return type:

bool
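The compaction of a release set into [start, end] slices, mentioned in the second method, can be sketched as follows. This is an independent illustration of the idea; releases_to_ranges is a hypothetical name, and the exact output format of graph.list_to_ranges may differ.

```python
def releases_to_ranges(releases) -> list[list[int]]:
    """Collapse release numbers into inclusive, consecutive [start, end] slices."""
    ranges: list[list[int]] = []
    for rel in sorted(set(releases)):
        if ranges and rel == ranges[-1][1] + 1:
            ranges[-1][1] = rel  # extend the current slice
        else:
            ranges.append([rel, rel])  # start a new slice
    return ranges


print(releases_to_ranges({79, 80, 81, 85, 90, 91}))
```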

is_combined_edges_dicts_overlapping_and_complete()[source]

Check edge-dictionary partitioning invariants.

The Track graph materialises three edge caches, combined_edges and its two specialised siblings, each storing adjacency and release metadata for a different subset of nodes:

  • combined_edges - all nodes, including backbone genes.

  • combined_edges_genes - stable Ensembl genes (non-assembly-specific).

  • combined_edges_assembly_specific_genes - genes that exist only on a single assembly.

The design contract says:

  1. Disjointness - No node key may appear in more than one dictionary.

  2. Completeness - The union of the dictionaries must cover all graph nodes except those that represent alternative database versions (e.g. “EnsemblMetazoa”) which are intentionally kept separate.

This routine enforces both rules.

Returns:

True if the dictionaries are pair-wise disjoint and collectively cover every eligible node; False otherwise.

Return type:

bool
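The two rules amount to a partition check over dictionary keys. A minimal sketch, using placeholder dictionaries rather than the real caches: check_partition, the part names, and the node names are all hypothetical, and the excluded set stands in for the intentionally separate alternative-version nodes.

```python
from itertools import combinations


def check_partition(parts: dict[str, dict], all_nodes: set, excluded: set) -> bool:
    """Return True if the key sets are pairwise disjoint and cover all eligible nodes."""
    key_sets = [set(d) for d in parts.values()]
    # Rule 1: no node key appears in more than one dictionary.
    disjoint = all(not (a & b) for a, b in combinations(key_sets, 2))
    # Rule 2: the union covers every node that is not explicitly excluded.
    covered = set().union(*key_sets)
    complete = (all_nodes - excluded) <= covered
    return disjoint and complete


# Placeholder data for illustration only.
parts = {
    "part_a": {"n1": {}, "n2": {}},
    "part_b": {"n3": {}},
    "part_c": {"n4": {}},
}
nodes = {"n1", "n2", "n3", "n4", "alt_version_node"}
print(check_partition(parts, nodes, excluded={"alt_version_node"}))
```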

is_edge_with_same_nts_only_at_backbone_nodes()[source]

Assert same-node-type edges exist only between backbone genes.

The graph is a multilayer network where nodes of different

This method traverses every base-gene node and checks that condition.

Returns:

True when no overlaps are found; False otherwise. Each offending base ID triggers a warning with the conflicting ranges.

Return type:

bool

is_final_external_conversion_robust(convert_using_release=False, database=None, ens_rel=None, verbose=True, from_fraction=1.0, prioritize_to_one_filter=False)[source]

Validate Ensembl→external conversion against MySQL ground truth.

A random external database is chosen for every genome assembly. For the selected combination the method grabs the authoritative mapping table (graph-ID → external ID set) from MySQL and converts the same graph-IDs with idtrack.Track.convert().

Parameters:
  • convert_using_release (bool) – Whether to pin the from_release when calling the converter. Keeping this True usually speeds up the search and mimics user-facing behaviour.

  • verbose (bool) – Print the current assembly/database/release being tested.

  • prioritize_to_one_filter (bool) – If True, apply tie-breaking to select a single best target when multiple candidates exist.

  • ens_rel (int | None) – Specific Ensembl release to test. If None, a random release is chosen.

  • from_fraction (float) – Fraction of identifiers to sample for testing (0.0-1.0). Defaults to 1.0 (all identifiers).

  • database (str | None) – Specific external database to test. If None, a random database is chosen.

Returns:

True if every converted set equals the MySQL reference, False upon the first deviation.

Return type:

bool

Raises:

ValueError – Raised when test parameters are invalid or incompatible.

is_id_functions_consistent_ensembl(verbose=True)[source]

Ensure Ensembl ID-list helpers agree with the SQL back-end.

For every release listed in graph.graph["confident_for_release"] the test compares two independent sources of Ensembl-gene IDs for the current genome assembly:

  1. IDs retrieved directly from MySQL via DatabaseManager.

  2. IDs returned by idtrack.Track.get_id_list() from the graph.

A mismatch means either that the graph was built incompletely or that the helper functions have drifted out of sync with the database schema.

Parameters:

verbose (bool) – If True (default) show a tqdm progress-bar while iterating through the releases.

Returns:

True when all releases produce identical sets; else False (a descriptive warning is logged).

Return type:

bool

is_id_functions_consistent_ensembl_2(verbose=True)[source]

Cross-check Ensembl ID range helpers against raw edge data.

For every backbone Ensembl-gene node the routine computes the list of active release ranges in two distinct ways:

  1. Raw computation - by flattening graph.combined_edges_genes and compacting the releases with idtrack.Track.list_to_ranges().

  2. Cached lookup - via the lazily built dictionary graph.get_active_ranges_of_id.

The two lists must match exactly. A divergence would indicate that the cached helper is out of sync with the authoritative edge structure.

Parameters:

verbose (bool) – If True (default) wrap the iteration in a tqdm bar.

Returns:

True if all gene nodes pass; False after the first failure (a warning is emitted).

Return type:

bool

is_id_functions_consistent_external(verbose=True)[source]

Check external-ID list helpers against the raw MySQL tables.

For every combination of assembly, Ensembl release (limited to graph.graph["confident_for_release"]) and external database this test performs the following steps:

  1. Query the authoritative list of external IDs directly from the MySQL snapshot via DatabaseManager.

  2. Ask the in-memory graph for the same list via idtrack.Track.get_id_list().

  3. Normalise node names through idtrack.Track.node_name_alternatives() to cope with the occasional “_1” suffix.

  4. Compare the two sets. A mismatch is logged and the method returns False immediately.

The exhaustive traversal is expensive (minutes for large genomes) but ensures the graph's indexing helpers never drift from the actual database content.

Parameters:

verbose (bool) – If True (default) display a tqdm progress bar and emit log messages at INFO level. When False the method runs silently.

Returns:

True when every single comparison matched, False as soon as an inconsistency is encountered.

Return type:

bool

is_node_consistency_robust(verbose=True)[source]

Check for illegal neighbour relationships and multi-edges.

A pair of nodes with different node-types may be connected by only one edge. Edges between nodes of the same node-type are allowed only when that type is the Ensembl backbone (ensembl_gene). Any deviation, whether a lateral same-type connection or more than one parallel edge, is logged and aborts the test.

Parameters:

verbose (bool) – Print offending nodes when a violation is detected.

Returns:

True when the graph satisfies the topology rules, False otherwise.

Return type:

bool

is_range_functions_robust(verbose=True)[source]

Detect overlapping release ranges among sibling Ensembl IDs.

A base Ensembl-gene ID is the stable identifier that groups multiple versioned Ensembl-gene records (siblings). The gene-history model requires that the release ranges of sibling IDs never overlap - each release must be covered by exactly one child stable ID.

This method traverses every base-gene node and checks that condition.

Parameters:

verbose (bool) – If True (default) display a tqdm progress-bar.

Returns:

True when no overlaps are found; False otherwise. Each offending base ID triggers a warning with the conflicting ranges.

Return type:

bool
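The non-overlap condition can be sketched on plain inclusive [start, end] ranges. The helper find_overlaps, the sibling IDs, and the release numbers below are invented for illustration; the real method reads the ranges from the graph.

```python
from itertools import combinations


def find_overlaps(sibling_ranges: dict[str, list[list[int]]]):
    """Return (id_1, id_2, first_shared, last_shared) for overlapping siblings."""
    spans = [
        (sibling_id, start, end)
        for sibling_id, ranges in sibling_ranges.items()
        for start, end in ranges
    ]
    conflicts = []
    for (id1, s1, e1), (id2, s2, e2) in combinations(spans, 2):
        # Inclusive ranges overlap when each starts before the other ends.
        if id1 != id2 and s1 <= e2 and s2 <= e1:
            conflicts.append((id1, id2, max(s1, s2), min(e1, e2)))
    return conflicts


# Invented siblings: GENE.2 and GENE.3 both claim release 90.
siblings = {"GENE.1": [[80, 85]], "GENE.2": [[86, 90]], "GENE.3": [[90, 95]]}
print(find_overlaps(siblings))
```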

random_dataset_source_generator(assembly, include_external, include_ensembl, for_final_database, only_backbone_tests, release_lower_limit=None, form=None)[source]

Pick a random (<database>, <assembly>, <release>) tuple.

The function guarantees that the triple actually exists in the graph and - if release_lower_limit is provided - honours the minimum release constraint.

Parameters:
  • assembly (int) – Genome assembly code used in Ensembl core schema names (e.g. 38 = human GRCh38, 39 = mouse GRCm39, 111 = pig Sscrofa11.1).

  • include_ensembl (bool) – Whether Ensembl backbone databases may be returned as database.

  • release_lower_limit (int | None) – Smallest permissible Ensembl release number for the returned triple. None disables the filter.

  • form (str | None) – Restrict the draw to a particular connection form (protein/coding/gene). None means no restriction.

  • only_backbone_tests (bool) – If True, restrict selection to backbone-only databases (skip external databases entirely).

  • include_external (bool) – Whether external (non-Ensembl) databases may be included in the selection pool.

  • for_final_database (bool) – If True, exclude Ensembl assembly-specific databases from selection (useful when selecting final conversion targets).

Returns:

(<database>, <assembly>, <release>) or None when no matching release exists.

Return type:

tuple | None

Raises:

ValueError – Raised when no valid database/release combination can be found.

class DB[source]

Bases: object

Store constants shared across IDTrack modules for Ensembl data access and graph construction.

This class centralizes every constant that multiple components (e.g. idtrack.graph.GraphMaker, idtrack.pathfinder.PathFinder) rely on when talking to the Ensembl FTP mirror, REST API, and public MySQL instances. Housing the values in one immutable namespace prevents circular imports, ensures a single source of truth, and simplifies testing. DB is never instantiated; import the class and reference its attributes directly.

id_ver_delimiter

Character separating a stable identifier from its version suffix.

Type:

str

first_version

Default version assumed when an ID lacks an explicit version component.

Type:

int

connection_timeout

TCP connect timeout in seconds used by both FTP and REST clients.

Type:

int

reading_timeout

Socket read timeout in seconds applied to FTP and REST operations.

Type:

int

ensembl_ftp_base

Hostname of the Ensembl public FTP mirror.

Type:

str

rest_server_api

Root URL of the Ensembl REST API.

Type:

str

rest_server_ext

Resource path appended to rest_server_api to query species metadata.

Type:

str

mysql_host

Hostname of the Ensembl public MySQL server.

Type:

str

myqsl_user

Username for anonymous MySQL access.

Type:

str

mysql_togo

Placeholder string kept for backward compatibility when assembling connection URLs.

Type:

str

assembly_mysqlport_priority

Organism-aware mapping {organism -> {assembly -> {...}}} defining, for each assembly:

  • Ports - ordered list of MySQL ports to try for that assembly.

  • Priority - assembly priority within the organism (1 = newest / preferred).

Type:

dict[str, dict[int, dict[str, Any]]]

mysql_port_min_release

Minimum Ensembl release supported by each public MySQL port.

Type:

dict[int, int]

all_assemblies

Union of every configured assembly code across supported organisms.

Type:

set[int]

main_assembly

Backward-compatibility default assembly (human GRCh38 = 38).

Type:

int

synonym_id_nodes_prefix

Prefix inserted before node identifiers that represent synonym edges.

Type:

str

no_old_node_id

Sentinel used when a historical ID is retired.

Type:

str

no_new_node_id

Sentinel used when no future successor exists.

Type:

str

alternative_versions

Two sentinels—no_new_node_id and no_old_node_id.

Type:

set[str]

hyperconnecting_threshold

Maximum allowable out-degree before a node is considered hyper-connected and ignored by breadth-first expansions.

Type:

int

node_type_str

Edge/Node attribute key holding the node type value.

Type:

str

nts_external

Canonical node type assigned to entities originating outside Ensembl.

Type:

str

forms_in_order

Stable ordering of Ensembl entity forms (gene, transcript, translation). Order matters when inferring parent/child relationships.

Type:

list[str]

backbone_form

Form selected as backbone for graph traversals (always gene).

Type:

str

nts_ensembl

Map each canonical form to its namespaced node type (gene → ensembl_gene, etc.).

Type:

dict[str, str]

nts_ensembl_reverse

Reverse mapping of nts_ensembl.

Type:

dict[str, str]

nts_assembly

Form-to-assembly-specific node type map.

Type:

dict[str, dict[str, str]]

nts_assembly_reverse

Reverse mapping of nts_assembly.

Type:

dict[str, dict[str, str]]

nts_base_ensembl

Reduced node type names stripped of assembly suffixes.

Type:

dict[str, str]

nts_base_ensembl_reverse

Reverse mapping of nts_base_ensembl.

Type:

dict[str, str]

Node types for which synonym searches are performed bidirectionally.

Type:

set[str]

nts_assembly_gene

Every node type that represents a gene, regardless of assembly.

Type:

set[str]

connection_dict

Edge attribute key whose value stores connection metadata dictionaries.

Type:

str

conn_dict_str_ensembl_base

Constant placed under connection_dict when the edge points to an Ensembl data source.

Type:

str

external_search_settings

Default limits for outward traversal into external databases. Keys are jump_limit, synonymous_max_depth, and nts_backbone.

Type:

dict[str, Any]

placeholder_na

Sentinel stored in HDF5 datasets where a true NA/None value is not permitted or would break downstream type expectations.

Type:

str

UTF8

The literal string "utf-8"—a canonical spelling of the UTF-8 encoding name used when writing variable-length strings to HDF5 files.

Type:

str

UTF8_STR

Pre-configured variable-length UTF-8 string dtype created via h5py.string_dtype(). Pass this value when creating HDF5 datasets that should hold arbitrary Unicode text to avoid hard-coding datatypes throughout the codebase.

Type:

h5py.Datatype