Reference
- class API(local_repository)[source]
Bases: object
Provide a high-level façade for building graphs and converting biological identifiers with IDTrack.
This class centralises common workflows so users can quickly initialise the underlying graph (for a chosen organism and Ensembl release), configure logging, and run identifier-related operations. Internally it delegates to lower-level components such as idtrack.DatabaseManager for data access and idtrack.Track (or idtrack.TrackTests) for graph traversal and matching. It is intended as the primary entry point for day-to-day tasks like resolving an organism name, constructing the working graph snapshot, converting identifiers between releases or external databases, and inspecting available external data sources.
Bind the interface to a local repository used for data downloads and on-disk caches.
This initialiser wires up a dedicated logger for the API layer and records the path where IDTrack will keep its working files. The actual graph and tracking objects are created lazily (e.g. by idtrack.API.build_graph()) so that simply constructing idtrack.API is inexpensive.
- Parameters:
local_repository (str) – Absolute (recommended) or relative path to a writable directory where the package may store downloaded resources and precomputed artefacts. The caller is responsible for ensuring the path exists and is accessible.
- log
Logger named "api" for progress messages and diagnostics.
- Type:
logging.Logger
- logger_configured
False until idtrack.API.configure_logger() is called.
- Type:
bool
- local_repository
The given repository path.
- Type:
str
- track
Placeholder for the active tracker; populated after idtrack.API.build_graph() is invoked.
- Type:
idtrack.Track | idtrack.TrackTests
- _require_track()[source]
Return the active tracker or raise a clear error if the graph is not built yet.
- Return type:
- build_graph(organism_name, snapshot_release, genome_assembly=None, return_test=False, calculate_caches=True)[source]
Build the bio-ID graph for an organism and prepare the path-finding engine.
This method wires together the high-level components used throughout IDTrack. It first creates an idtrack._database_manager.DatabaseManager that ignores releases newer than snapshot_release. It then instantiates idtrack._track.Track (or idtrack._track_tests.TrackTests when testing), which loads or builds the underlying idtrack._the_graph.TheGraph via idtrack._graph_maker.GraphMaker. Optionally, it primes all graph caches to improve query latency. The resulting resolver is stored on self.track for subsequent conversions and inspections.
- Parameters:
organism_name (str) – Canonical Ensembl species name, typically the output of idtrack.API.resolve_organism().
snapshot_release (int) – Ensembl release anchoring this build. Data from later releases are ignored to ensure reproducible results.
genome_assembly (int | None) – Genome assembly code used in Ensembl core schema names (<organism>_core_<release>_<assembly>). This selects the primary assembly for the snapshot (default: highest-priority/newest configured for the organism). The snapshot graph can still include additional assemblies within the snapshot window; use idtrack.API.list_genome_assemblies() to inspect what is present.
return_test (bool) – If True, initialise idtrack._track_tests.TrackTests instead of the standard idtrack._track.Track to enable test and diagnostics helpers. Defaults to False.
calculate_caches (bool) – If True, eagerly compute the graph’s cached properties. When combined with return_test=True, test-only caches are included. Defaults to True.
- Return type:
None
See also
idtrack.API.get_database_manager(), idtrack.API.calculate_graph_caches(), idtrack._track.Track, idtrack._track_tests.TrackTests
- calculate_graph_caches(for_test=False)[source]
Prime the working graph by eagerly computing all cached properties.
This helper reduces first-call latency and makes test runs deterministic by batch-computing every @cached_property exposed by idtrack._the_graph.TheGraph. Use it after idtrack.API.build_graph() has attached a populated idtrack._track.Track (or idtrack._track_tests.TrackTests) to self.track. Internally it forwards to idtrack._the_graph.TheGraph.calculate_caches().
- Parameters:
for_test (bool) – If True, also compute heavyweight, test-only caches such as idtrack._the_graph.TheGraph.external_database_connection_form and idtrack._the_graph.TheGraph.available_releases_given_database_assembly. Defaults to False.
- Return type:
None
- classify_multiple_conversion(matchings)[source]
Group batch-conversion results into semantic bins for downstream reporting.
This post-processing step takes the per-identifier results produced by idtrack.API.convert_identifier_multiple() (or a compatible list of idtrack.API.convert_identifier() payloads) and organises them into logically meaningful categories. The bins distinguish between “no match,” one-to-one vs. one-to-many mappings, whether the output differs from the input, and whether a reported target is an Ensembl fallback due to a missing external synonym.
- Parameters:
matchings (list[dict[str, Any]]) – Collection of dictionaries returned by idtrack.API.convert_identifier_multiple(). Each element must contain, at minimum, the keys "query_id", "target_id", "no_corresponding", "no_conversion", and "no_target" as described in idtrack.API.convert_identifier().
- Returns:
A dictionary of category → list-of-results. Categories are not mutually exclusive; an item can appear in multiple bins (e.g. a changed 1→1 mapping also appears in the general 1→1 bin). Keys are:
- "input_identifiers": All input result objects, echoed unchanged (convenient for summary counts).
- "matching_1_to_0": Queries that could not be mapped to any target (either no_corresponding=True or no_conversion=True). Indicates a 1→0 outcome.
- "matching_1_to_1": Queries with exactly one target in "target_id". Includes both unchanged and changed outputs, and may overlap with "changed_only_1_to_1" or "alternative_target_1_to_1".
- "matching_1_to_n": Queries with more than one target in "target_id" (n > 1). May overlap with "changed_only_1_to_n" or "alternative_target_1_to_n".
- "changed_only_1_to_1": Subset of 1→1 where the single "target_id"[0] is different from "query_id" (i.e. the identifier changed across releases/databases).
- "changed_only_1_to_n": Subset of 1→n where none of the "target_id" entries equal "query_id" (the original identifier is not present among the alternatives).
- "alternative_target_1_to_1": Cases with exactly one "target_id" and no_target=True. This flags Ensembl fallbacks where the external database lacked a synonym; the single reported value is not a genuine external match.
- "alternative_target_1_to_n": As above, but with multiple entries in "target_id" (n > 1) while no_target=True (typically multiple Ensembl-side candidates with no external synonym).
- Return type:
dict[str, list[dict[str, Any]]]
- Raises:
ValueError – If any element in matchings has an empty "target_id" list despite no_corresponding and no_conversion both being False (indicates an unexpected upstream state).
See also
idtrack.API.convert_identifier_multiple(), idtrack.API.print_binned_conversion(), idtrack.API.convert_identifier().
Notes
The function does not mutate input dictionaries. The binning logic is intentionally overlapping so that “summary” buckets (matching_*) can be used alongside “diagnostic” buckets (changed_only_*, alternative_target_*) without additional passes.
- configure_logger(level=None)[source]
Configure process-wide logging with a concise, time-stamped console format.
This method is idempotent per idtrack.API instance: the first call sets up a basic configuration for the Python logging system (time, level, logger name, and message). Subsequent calls on the same instance will not reconfigure logging and instead emit an informational message via idtrack.API.log.
- Parameters:
level (int | str | None) – Desired logging level (e.g. logging.INFO, "INFO", logging.DEBUG). If None, defaults to logging.INFO.
- Return type:
None
Notes
The configuration applies to the root logger and therefore affects logging for the entire Python process, not only this package. Call this early in your application if you want IDTrack’s log output formatted consistently with the rest of your program.
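The per-instance idempotence described above can be sketched with the standard logging module. This is an assumption-laden illustration: the class ApiLike, the log message wording, and the exact format string are hypothetical and only mirror the documented behaviour (root-logger configuration on the first call, an informational message afterwards).

```python
import logging


class ApiLike:
    """Hypothetical stand-in for the logging-related part of idtrack.API."""

    def __init__(self) -> None:
        self.log = logging.getLogger("api")
        self.logger_configured = False

    def configure_logger(self, level=None) -> None:
        if self.logger_configured:
            # Later calls on the same instance do not reconfigure logging.
            self.log.info("Logger is already configured; call ignored.")
            return
        # basicConfig acts on the root logger, so it affects the whole process.
        # It is a no-op if the root logger already has handlers.
        logging.basicConfig(
            level=logging.INFO if level is None else level,
            format="%(asctime)s %(levelname)s:%(name)s: %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
        self.logger_configured = True


api = ApiLike()
api.configure_logger("DEBUG")  # first call configures the root logger
api.configure_logger("ERROR")  # ignored; only an informational message is logged
```

Note that the guard is per instance: a second ApiLike object would call logging.basicConfig again, which itself does nothing once the root logger has handlers.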
- convert_identifier(identifier, from_release=None, to_release=None, final_database=None, strategy='best', explain=False)[source]
Resolve a raw identifier and convert it to a target Ensembl release and (optionally) an external database.
This high-level helper wraps idtrack._track.Track.convert() and returns a compact, user-oriented summary of the result. It first normalises identifier to the canonical graph node label with idtrack._the_graph.TheGraph.node_name_alternatives(), then invokes the path-finding and final-conversion pipeline to reach the requested to_release and final_database. The output is designed for interactive use and downstream tooling: it reports whether the query is present in the graph, whether a conversion could be computed, and (if requested) the full path(s) followed through the Ensembl backbone and the external database hop.
- Parameters:
identifier (str) – Query identifier to resolve. May be an Ensembl stable ID, gene symbol, or a known synonym; case and common punctuation variations are tolerated by the normaliser.
from_release (int | None) – Ensembl release the identifier originates from. If None, the direction of time travel is inferred automatically. Supplying a value constrains the search to forward/reverse travel.
to_release (int | None) – Target Ensembl release. If None, the newest release available in the graph is used.
final_database (str | None) – Name of the external database to convert into (e.g. "uniprot"). If None, the result remains on the Ensembl gene backbone (reported as idtrack._db.DB.nts_ensembl[idtrack._db.DB.backbone_form]).
strategy (Literal["all", "best"]) – Selection strategy applied after scoring all admissible targets. "best" keeps a single globally best target; "all" keeps all scored targets. Defaults to "best".
explain (bool) – If True, include the concatenated edge list(s) that show how each result was reached.
- Returns:
Dictionary describing the conversion outcome with the following keys.
- "target_id" (list[str]): Unique identifiers in the requested final_database. When strategy="best" and a target exists, this list contains exactly one element. If final_database is None, the list contains the Ensembl gene ID(s).
- "last_node" (list[tuple[str, str]]): Pairs of (ensembl_gene_id, target_id) for every surviving candidate. The first element is the final Ensembl node reached by time travel; the second is the chosen target in final_database (or the Ensembl gene itself when staying on the backbone).
- "final_database" (str | None): The database name the target_id values come from. None only when the query was not found at all; otherwise this is either final_database or the Ensembl backbone label idtrack._db.DB.nts_ensembl[idtrack._db.DB.backbone_form].
- "graph_id" (str | None): Canonical node label used internally by the graph for identifier (e.g. "ACTB" for the symbol "actb"). None when the query has no corresponding graph node.
- "query_id" (str): Echo of the original identifier argument for bookkeeping.
- "no_corresponding" (bool): True if the query could not be matched to any graph node (nothing to convert). In this case "graph_id" is None and the other fields are empty or None.
- "no_conversion" (bool): True if the query exists in the graph but no admissible path to to_release and/or final_database could be constructed (a 1→0 mapping).
- "no_target" (bool): True if an Ensembl gene was reached but the requested final_database yielded no synonym. The result may fall back to returning the Ensembl gene itself; this flag lets callers distinguish that fallback from a genuine external match.
- "the_path" (dict[tuple[str, str], tuple[tuple]]): Present only when explain is True. Maps each (target_id, ensembl_gene_id) pair to an ordered tuple of edges representing the full walk: first the Ensembl history segment that reaches the gene, then the final-conversion hop into the external database. Each edge is expressed in the internal format used by idtrack._track.Track and may include auxiliary fields (e.g. release markers).
- Return type:
dict[str, Any]
- Raises:
ValueError – If strategy is not "all" or "best".
Notes
Interactions between the boolean flags:
- no_corresponding=True ⇒ no conversion is attempted; graph_id is None; target_id is [].
- no_conversion=True ⇒ query exists but path scoring/selection produced no admissible target.
- no_target=True ⇒ Ensembl history succeeded but the external database lacked a synonym; callers may still receive an Ensembl fallback target.
When strategy="best", the scoring and tie-breakers are those implemented by idtrack._track.Track.calculate_score_and_select() and its callers. When "all", no global tie-break is applied and all scored targets are returned.
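As a reading aid for the three boolean flags, the sketch below collapses them into a single status label. It assumes only the documented keys; the function name describe_outcome, the status labels, and the hand-written payloads (with placeholder values rather than real IDTrack output) are hypothetical.

```python
from typing import Any


def describe_outcome(result: dict[str, Any]) -> str:
    """Collapse the three boolean flags into one status label (labels are illustrative)."""
    if result["no_corresponding"]:
        return "not_in_graph"      # nothing to convert; graph_id is None
    if result["no_conversion"]:
        return "no_path"           # query exists, but no admissible path (1→0)
    if result["no_target"]:
        return "ensembl_fallback"  # Ensembl gene reached, external synonym missing
    return "converted"


# Hand-written payloads with placeholder values, not real IDTrack output.
not_found = {"query_id": "xyz", "graph_id": None, "target_id": [],
             "no_corresponding": True, "no_conversion": False, "no_target": False}
fallback = {"query_id": "actb", "graph_id": "ACTB", "target_id": ["<ensembl gene id>"],
            "no_corresponding": False, "no_conversion": False, "no_target": True}

statuses = [describe_outcome(not_found), describe_outcome(fallback)]
```

Checking the flags in this order matters: a fallback result still carries a non-empty "target_id", so inspecting "target_id" alone would not reveal that the value is not a genuine external match.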
- convert_identifier_multiple(identifier_list, verbose=True, pbar_prefix='', **kwargs)[source]
Convert a batch of identifiers and aggregate per-query conversion metadata.
This is a thin, progress-enabled wrapper around idtrack.API.convert_identifier(). It iterates over identifier_list in order, forwards **kwargs to the single-item converter, and collects each per-identifier result. Use this helper for bulk operations where you want progress feedback and a uniform result structure that mirrors the single-call API.
- Parameters:
identifier_list (list[str]) – Input identifiers to resolve and convert. Each element is passed to idtrack.API.convert_identifier() as its identifier argument, in the same order.
verbose (bool) – If True, display a tqdm progress bar (throttled to avoid excessive redraws). Set to False to disable the progress bar. Defaults to True.
pbar_prefix (str) – Optional label shown before the progress bar text (for distinguishing concurrent runs). Defaults to an empty string.
kwargs – Keyword arguments forwarded verbatim to idtrack.API.convert_identifier(). Common options include:
- from_release (int | None): Origin Ensembl release of the input identifier.
- to_release (int | None): Target Ensembl release to which to time-travel.
- final_database (str | None): Name of the external database to project into (e.g. "uniprot"). If None, results stay on the Ensembl backbone and are reported as idtrack._db.DB.nts_ensembl[idtrack._db.DB.backbone_form].
- strategy (Literal["best", "all"]): Selection policy after scoring candidates.
- explain (bool): If True, include full path details in the result (see "the_path" in idtrack.API.convert_identifier()).
- Returns:
One element per input identifier, preserving input order. Each dictionary matches the schema returned by idtrack.API.convert_identifier().
- Return type:
list[dict[str, Any]]
See also
idtrack.API.convert_identifier(), idtrack.API.classify_multiple_conversion(), idtrack.API.print_binned_conversion().
Notes
The output list preserves the order of identifier_list. Items are independent; failures for one query do not prevent processing of the others.
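The forwarding pattern can be sketched without IDTrack itself. In this minimal illustration, convert_many and fake_convert are hypothetical names; fake_convert is a toy stand-in for the single-item converter (it treats upper-case strings as "known"), and the progress bar is omitted.

```python
from typing import Any, Callable


def convert_many(
    convert_one: Callable[..., dict[str, Any]],
    identifier_list: list[str],
    **kwargs: Any,
) -> list[dict[str, Any]]:
    """Call convert_one for each identifier, in order, forwarding kwargs unchanged."""
    return [convert_one(identifier, **kwargs) for identifier in identifier_list]


def fake_convert(identifier: str, to_release=None, final_database=None) -> dict[str, Any]:
    """Toy converter: upper-case identifiers are 'known', everything else is not."""
    known = identifier.isupper()
    return {
        "query_id": identifier,
        "target_id": [identifier] if known else [],
        "final_database": final_database if known else None,
        "no_corresponding": not known,
        "no_conversion": False,
        "no_target": False,
    }


# "uniprot" is the example database name used in this reference.
batch = convert_many(fake_convert, ["ACTB", "unknown-id"], final_database="uniprot")
order = [item["query_id"] for item in batch]
```

An unmatched query is reported through its flags rather than by interrupting the loop, which is what keeps the output aligned one-to-one with the input list.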
- external_database_forms()[source]
Return the Ensembl form each external database connects through.
Provides a compact view of how third-party databases attach to the Ensembl backbone ("gene", "transcript", or "translation") via idtrack._the_graph.TheGraph.external_database_connection_form().
- Returns:
Mapping of external database name → Ensembl form (e.g., "gene").
- Return type:
dict[str, str]
- get_database_manager(organism_name, snapshot_release, genome_assembly=None, ignore_before=None)[source]
Create a database manager configured for an organism and a release-bounded snapshot.
Construct and return an idtrack._database_manager.DatabaseManager bound to organism_name and configured to ignore data newer than snapshot_release. The manager centralizes all download, caching, and version logic for graph builds and identifier conversions. The biological form is initialised from idtrack._db.DB.backbone_form, and all artefacts are stored under idtrack.API.local_repository.
- Parameters:
organism_name (str) – Canonical Ensembl species name (e.g. "homo_sapiens").
snapshot_release (int) – Most recent Ensembl release to include; later releases are ignored for reproducibility.
genome_assembly (int | None) – Genome assembly code used in Ensembl core schema names (<organism>_core_<release>_<assembly>). This selects the primary assembly for the snapshot (e.g. 38 = human GRCh38, 37 = human GRCh37, 39 = mouse GRCm39, 111 = pig Sscrofa11.1). If None (default), the highest-priority assembly configured for the organism is used. Note that the resulting snapshot graph can still include additional assemblies within the snapshot window; use idtrack.API.list_genome_assemblies() to inspect what is present.
ignore_before (int | None) – Earliest Ensembl release to include in the snapshot window. When None (default), use the earliest release supported by the public Ensembl MySQL/FTP dumps (see idtrack._db.DB.mysql_port_min_release). This default ensures multi-assembly history is retained for clean-handoff species (e.g. mouse) where older assemblies live entirely in earlier releases.
- Returns:
A manager ready for use by graph-building and conversion routines.
- Return type:
idtrack._database_manager.DatabaseManager
Notes
Any exceptions raised by idtrack._database_manager.DatabaseManager propagate unchanged.
- infer_identifier_source(id_list, mode='assembly_ensembl_release', report_only_winner=True)[source]
Infer the most likely source (database/assembly/release) for a heterogeneous identifier list.
This helper estimates which origin best explains the given IDs so users can pick a sensible graph configuration before running conversions at scale. Internally it resolves each input to a canonical node (where possible), consults idtrack._the_graph.TheGraph.node_trios to recover known origins, and tallies them via idtrack._track.Track.identify_source(). Under development: both the public signature and the scoring details may change in future releases.
- Parameters:
id_list (list[str]) – Identifiers to analyse. Each item should be a string; non-existent IDs are safely ignored (and logged) during the tally.
report_only_winner (bool) – If True, return the single highest-count origin for the requested mode. If False, return all candidate origins ranked by descending count.
mode (str) – Granularity of the origin to infer. One of:
- "complete" → return triples (database, assembly, release).
- "ensembl_release" → return the Ensembl release only (int).
- "assembly" → return the genome assembly only (int).
- "assembly_ensembl_release" → return pairs (assembly, release).
- Returns:
The inferred origin(s), depending on report_only_winner:
- If report_only_winner is True:
  - mode == "complete" → (database: str, assembly: int, release: int)
  - mode == "ensembl_release" → release: int
  - mode == "assembly" → assembly: int
  - mode == "assembly_ensembl_release" → (assembly: int, release: int)
- If report_only_winner is False: a list of (origin, count) pairs where origin has the corresponding shape above.
- Return type:
tuple[str, int, int] | tuple[int, int] | int | list[tuple[object, int]]
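The relationship between mode and the returned shape can be sketched with collections.Counter. This is a simplified illustration under stated assumptions: it takes a flat list of (database, assembly, release) trios and counts each once, whereas the real tally is performed by idtrack._track.Track.identify_source() on graph nodes. The name tally_origins and the database labels "db_a" and "db_b" are hypothetical.

```python
from collections import Counter

# Which positions of a (database, assembly, release) trio each mode keeps.
_MODE_FIELDS = {
    "complete": (0, 1, 2),
    "ensembl_release": (2,),
    "assembly": (1,),
    "assembly_ensembl_release": (1, 2),
}


def tally_origins(trios, mode="assembly_ensembl_release", report_only_winner=True):
    """Project each trio onto the requested mode and count the projections."""
    fields = _MODE_FIELDS[mode]  # KeyError for an unknown mode
    counts = Counter()
    for trio in trios:
        projected = tuple(trio[i] for i in fields)
        # Single-field modes yield a bare int, multi-field modes yield a tuple.
        counts[projected[0] if len(projected) == 1 else projected] += 1
    if not counts:
        raise ValueError("no known origins to tally")
    ranked = counts.most_common()  # (origin, count) pairs, descending count
    return ranked[0][0] if report_only_winner else ranked


trios = [("db_a", 38, 110), ("db_a", 38, 110), ("db_b", 38, 110), ("db_a", 37, 75)]
winner_pair = tally_origins(trios)                   # (assembly, release)
winner_assembly = tally_origins(trios, "assembly")   # assembly only
ranking = tally_origins(trios, "complete", report_only_winner=False)
```

With report_only_winner=False the caller can inspect how close the runner-up is before committing to a graph configuration.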
- list_ensembl_releases()[source]
List Ensembl releases reachable for the configured organism and assembly.
Wraps idtrack._database_manager.DatabaseManager.available_releases(), honoring any ignore window configured in the manager. The result is sorted in ascending order.
- Returns:
Sorted release numbers that can be queried and cached locally.
- Return type:
list[int]
- list_external_databases()[source]
Return the set of third-party (non-Ensembl) databases represented in the current graph.
- Returns:
- Unique external database names discovered via
- Return type:
set[str]
- list_external_databases_by_assembly()[source]
Map each genome assembly to the external databases present in that slice of the graph.
Delegates to idtrack._the_graph.TheGraph.available_external_databases_assembly() to reveal which third-party resources are available per assembly for the loaded organism/release window.
- Returns:
Mapping of assembly → set of external database names.
- Return type:
dict[int, set[str]]
- list_genome_assemblies()[source]
List genome assemblies represented in the currently loaded graph.
Exposes the assembly identifiers discovered when the graph was built. This is a thin wrapper over idtrack._the_graph.TheGraph.available_genome_assemblies() and requires that idtrack.API.build_graph() has been called.
- Returns:
Unique genome assembly identifiers present in the graph (e.g., 38 for GRCh38).
- Return type:
set[int]
- print_binned_conversion(classified)[source]
Log a structured multi-line summary of binned conversion results with percentages and rest counts.
- Parameters:
classified (dict[str, list[dict]]) – Output from classify_multiple_conversion.
- Return type:
None
- resolve_organism(tentative_organism_name)[source]
Normalize a tentative organism name and fetch the latest supported Ensembl release.
This shields callers from Ensembl naming quirks by resolving a user-provided synonym (e.g. common name, shorthand, taxon ID) to the canonical Ensembl species identifier (e.g. "homo_sapiens") and to the newest Ensembl release that still hosts that species. The lookup delegates to idtrack._verify_organism.VerifyOrganism, ensuring subsequent graph construction and data access use a consistent, up-to-date pair.
- Parameters:
tentative_organism_name (str) – Organism descriptor in any supported synonym form (e.g. "human", "hsapiens", "9606", or "homo_sapiens"). Matching is case-insensitive.
- Returns:
(formal_name, latest_release) where formal_name is the canonical Ensembl species string and latest_release is the most recent Ensembl release number known for that species.
- Return type:
tuple[str, int]
Process-scoped SOCKS bridging for restricted servers.
The primary entry point is idtrack.ConnectionBridge. It enables IDTrack to run on servers without direct internet access (e.g. HPC clusters) by routing the current Python process through a SOCKS5 proxy provided by an SSH reverse tunnel such as ssh -R 1080 user@server.
The bridge is intentionally lightweight and process-scoped:
- It does not modify system-wide proxy configuration.
- It only affects the current interpreter (one Python process / one Jupyter kernel).
- It is reversible via idtrack.ConnectionBridge.stop() and also cleaned up best-effort at interpreter exit.
- class ConnectionBridge(proxy_host='127.0.0.1', proxy_port=1080, *, set_env_proxy=True)[source]
Bases: object
Route this Python process’ outgoing TCP connections through an SSH-provided SOCKS proxy.
Many restricted environments block outbound internet access from compute nodes. IDTrack needs outbound access to Ensembl services (REST/HTTPS, FTP over HTTPS, and sometimes public MySQL). If you can SSH into the server from a machine with internet access, you can expose a SOCKS5 proxy on the server via OpenSSH remote dynamic forwarding:
    ssh -R 1080 user@server
Then, inside Python on the server (or inside a Jupyter notebook kernel running on the server), enable the bridge:
    import idtrack

    b = idtrack.ConnectionBridge(proxy_port=1080)
    b.start()  # applies process-scoped networking changes
    # ... run IDTrack ...
    b.stop()  # restores the previous networking configuration
Internals (for maintainers / power users)
- start() monkeypatches socket.socket to socks.socksocket (PySocks) and optionally sets the environment variables ALL_PROXY and all_proxy so subprocesses spawned from this process inherit the proxy.
- A private, process-wide _BridgeState singleton stores the original socket class, environment variables, and PySocks default proxy to ensure stop() can restore the prior state precisely. The singleton also implements a simple reference counter so multiple ConnectionBridge instances can share the same active bridge.
Notes
- The bridge affects only the current Python process (one Jupyter kernel). Closing the Python process/kernel automatically removes the monkeypatch.
- To avoid surprises, call start() before the first network access in your program.
- Status messages are emitted via the logger named "connection_bridge" and, when verbose=True, printed to stdout for immediate visibility in notebooks.
- Parameters:
proxy_host – SOCKS proxy host on the server. With ssh -R 1080 ... this is typically "127.0.0.1".
proxy_port – SOCKS proxy port on the server. Must match the port used in the SSH command.
set_env_proxy – If True (default), set ALL_PROXY/all_proxy while active so subprocesses inherit the proxy configuration.
Create a new bridge controller without applying any network changes.
- Parameters:
proxy_host – SOCKS proxy host on the server (default 127.0.0.1).
proxy_port – SOCKS proxy port on the server (default 1080).
set_env_proxy – If True, set ALL_PROXY/all_proxy while active so subprocesses inherit the proxy.
- log
Logger named "connection_bridge" for structured diagnostics.
- proxy_host
Effective proxy host for this instance.
- proxy_port
Effective proxy port for this instance.
- set_env_proxy
Whether this instance sets proxy environment variables when activating the bridge.
- _emit(message, *, verbose, level=20)[source]
Emit a status message via the instance logger and (optionally) stdout.
- Parameters:
message (str)
verbose (bool)
level (int)
- Return type:
None
- classmethod _emit_global(message, *, verbose, level=20)[source]
Emit a message without requiring an instance (used by atexit cleanup).
- Parameters:
message (str)
verbose (bool)
level (int)
- Return type:
None
- classmethod _force_disable_bridge(*, verbose)[source]
Disable the bridge regardless of which instance started it (best-effort).
This method is used by the atexit hook and by unit tests to ensure a clean process state. It intentionally bypasses instance-level bookkeeping (e.g. self._started flags).
- Parameters:
verbose (bool) – If True, print a status message to stdout.
- Return type:
None
- static _format_proxy_url(host, port)[source]
Return a SOCKS proxy URL suitable for environment variables.
- Parameters:
host (str)
port (int)
- Return type:
str
- static _require_pysocks()[source]
Import and return the PySocks module (import name: socks).
- Returns:
Imported socks module.
- Return type:
Any
- Raises:
ImportError – If PySocks is not installed.
- static _restore_socks_default_proxy(socks_module, original_proxy)[source]
Restore the PySocks default proxy configuration (best effort).
- Parameters:
socks_module (Any)
original_proxy (Any)
- Return type:
None
- property is_active: bool
Return True if this instance currently holds an active bridge reference.
- start(*, test=True, verbose=True)[source]
Enable the bridge for the current Python process.
The bridge is reference-counted across instances in the current interpreter. If another ConnectionBridge already enabled the bridge with the same proxy host/port, calling start() will simply increment the internal counter and return.
- Parameters:
test (bool) – If True (default), run test_connection() after enabling the bridge. If the test fails, the bridge is automatically disabled again and the method returns False.
verbose (bool) – If True (default), print status messages to stdout.
- Returns:
True if the bridge is enabled (and the optional test succeeds), otherwise False.
- Return type:
bool
- Raises:
RuntimeError – If a bridge is already active in this process but configured with a different proxy host/port.
- stop(*, verbose=True)[source]
Disable the bridge and restore normal networking for this process.
If multiple ConnectionBridge instances are active, the bridge is only fully disabled once the last instance calls stop().
- Parameters:
verbose (bool) – If True (default), print status messages to stdout.
- Return type:
None
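The reference-counting behaviour shared by start() and stop() can be sketched as follows. This is a minimal model, not the library's _BridgeState: the class BridgeStateSketch and its acquire/release methods are hypothetical, and the socket monkeypatching and environment handling are reduced to comments.

```python
class BridgeStateSketch:
    """Hypothetical process-wide state: one active endpoint plus a reference count."""

    def __init__(self) -> None:
        self.endpoint = None  # (host, port) while active, else None
        self.ref_count = 0

    def acquire(self, host: str, port: int) -> None:
        """Model of start(): share an active bridge or reject a conflicting one."""
        if self.ref_count and self.endpoint != (host, port):
            raise RuntimeError("bridge already active with a different proxy host/port")
        if self.ref_count == 0:
            # Real code would save the original state and patch networking here.
            self.endpoint = (host, port)
        self.ref_count += 1

    def release(self) -> bool:
        """Model of stop(): return True only when the last reference is dropped."""
        if self.ref_count == 0:
            return False  # nothing to stop
        self.ref_count -= 1
        if self.ref_count == 0:
            # Real code would restore the saved networking state here.
            self.endpoint = None
            return True
        return False


state = BridgeStateSketch()
state.acquire("127.0.0.1", 1080)  # first instance enables the bridge
state.acquire("127.0.0.1", 1080)  # second instance only increments the counter
first_stop = state.release()      # bridge stays active for the other holder
second_stop = state.release()     # last holder tears the bridge down
```

Keeping the counter in one shared object is what lets several controller instances coexist while the networking changes are applied and reverted exactly once.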
- test_connection(*, verbose=True, timeout_s=15.0)[source]
Verify connectivity to Ensembl services through the active bridge.
The Ensembl REST ping is treated as the authoritative signal for success. MySQL connectivity checks are reported as warnings because IDTrack can fall back to HTTPS/FTP in some workflows.
- Parameters:
verbose (bool) – If True (default), print status messages to stdout.
timeout_s (float) – Timeout (seconds) for the REST request.
- Returns:
True if Ensembl REST is reachable, otherwise False.
- Return type:
bool
- Raises:
RuntimeError – If the bridge is not active in this process.
- Parameters:
proxy_host (str)
proxy_port (int)
set_env_proxy (bool)
- class HarmonizeFeatures(project_name, data_h5ad_dict, project_local_repository, idtrack_local_repository, target_ensembl_release, final_database='HGNC Symbol', organism_name='homo_sapiens', graph_last_ensembl_release=114, verbose_level=2, debugging_variables=False, converted_id_column='converted_id')[source]
Bases: object
Harmonize gene/feature identifiers across multiple single-cell expression datasets.
This manager streamlines the otherwise error-prone task of bringing heterogeneous gene identifiers (Ensembl IDs, gene symbols, etc.) into a single, version-controlled namespace before integrated downstream analysis. Under the hood it leverages idtrack.api.API to resolve identifier mappings through a pre-computed Ensembl graph, handles one-to-many and one-to-zero conversions, logs any ambiguous or inconsistent matches, and finally produces harmonised anndata.AnnData objects ready for comparative or joint analysis.
The public workflow is intentionally simple:
- feature_harmonizer(): convert a single dataset and return the filtered AnnData plus before/after feature counts.
- unify_multiple_anndatas(): apply harmonisation across all supplied datasets and return an integrated object.
- get_idtrack_matchings_for_all_datasets(): inspect the raw IDTrack matchings used.
Instances keep several diagnostic attributes (e.g. removed_conversion_failed_identifiers, multiple_ensembl_dict) so that users can audit every decision that removed or altered a feature.
- Parameters:
project_name (str) – Human-readable label used in log messages and derived output file names.
data_h5ad_dict (dict[str, str]) – Mapping dataset_alias → absolute .h5ad path of the source single-cell expression matrices.
project_local_repository (str) – Writable directory where harmonised outputs, logs, and temporary artefacts will be stored.
idtrack_local_repository (str) – Local clone or cache directory understood by idtrack.api.API; used to read the pre-built identifier graph.
target_ensembl_release (int) – Ensembl release that all identifiers will be converted to. Must be ≤ graph_last_ensembl_release.
final_database (str) – Canonical namespace kept after conversion (e.g. "HGNC Symbol"). Defaults to "HGNC Symbol".
organism_name (str) – Ensembl-style organism short name (e.g. "homo_sapiens"). Defaults to "homo_sapiens".
graph_last_ensembl_release (int) – Highest release present in the on-disk IDTrack graph. Defaults to 114.
verbose_level (Literal[0, 1, 2]) – Logging verbosity; 0 = errors only, 1 = warnings, 2 = info. Defaults to 2.
debugging_variables (bool) – Retain heavy intermediate structures for post-mortem inspection. Defaults to False.
converted_id_column (str) – Column name used to store converted identifiers inside the resulting AnnData.var DataFrame. Defaults to "converted_id".
- idt
Lazily initialised IDTrack interface used for all identifier look-ups.
- Type:
idtrack.api.API
- multiple_ensembl_dict
Map of collapsed IDs to all Ensembl IDs that were originally associated with the same target identifier.
- Type:
dict[str, list[str]]
- removed_conversion_failed_identifiers
Features that failed conversion and were dropped from each dataset.
- Type:
dict[str, set[str]]
- kept_conversion_failed_identifiers
Non-convertible features kept because they were consistently non-convertible across all datasets.
- Type:
dict[str, set[str]]
- removed_inconsistent_identifier_matching
Features whose mappings disagreed between datasets and were therefore removed for consistency.
- Type:
dict[str, set[str]]
Instantiate the harmoniser and perform lightweight validation.
The constructor merely prepares the harmonisation context: it validates input paths, configures logging, and primes IDTrack. Heavy work—graph initialisation, identifier matching, gene-symbol resolution—happens lazily when the first harmonisation method is called.
- Parameters:
project_name (str) – See HarmonizeFeatures().
data_h5ad_dict (dict[str, str]) – See HarmonizeFeatures().
project_local_repository (str) – See HarmonizeFeatures().
idtrack_local_repository (str) – See HarmonizeFeatures().
target_ensembl_release (int) – See HarmonizeFeatures().
final_database (str) – See HarmonizeFeatures().
organism_name (str) – See HarmonizeFeatures().
graph_last_ensembl_release (int) – See HarmonizeFeatures().
verbose_level (Literal[0, 1, 2]) – See HarmonizeFeatures().
debugging_variables (bool) – See HarmonizeFeatures().
converted_id_column (str) – See HarmonizeFeatures().
- Raises:
ValueError – If verbose_level is not 0, 1, or 2.
- static _get_column_as_series(df, column)[source]
Safely extract a single column from a DataFrame, always returning a Series.
When a DataFrame has duplicate column names, df[column] returns a DataFrame instead of a Series. This method handles such edge cases by selecting the first matching column when duplicates exist.
- Parameters:
df (pd.DataFrame) – The DataFrame to extract from.
column (str) – The column name to extract.
- Returns:
The column data as a Series.
- Return type:
pd.Series
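The duplicate-column edge case can be illustrated with a minimal sketch. The function name get_column_as_series and the sample frame are assumptions made for illustration; this is not the library's own implementation.

```python
import pandas as pd


def get_column_as_series(df: pd.DataFrame, column: str) -> pd.Series:
    """Return df[column] as a Series, even when the column name is duplicated."""
    selected = df[column]
    if isinstance(selected, pd.DataFrame):
        # Duplicate column names: keep only the first matching column.
        selected = selected.iloc[:, 0]
    return selected


# Hypothetical frame with a duplicated "gene" column.
frame = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["gene", "gene", "other"])
first_gene = get_column_as_series(frame, "gene")
other = get_column_as_series(frame, "other")
```

With a unique column name the call behaves exactly like plain df[column].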
- _initialize()[source]
Populate diagnostic structures for failed or ambiguous identifier conversions.
Called once by HarmonizeFeatures.__init__(), this routine scans every input dataset and updates several reporting attributes (for example removed_conversion_failed_identifiers or removed_inconsistent_identifier_matching). It also derives multiple_ensembl_dict, a reverse map of ambiguous Ensembl ID → source identifiers, enabling downstream inspection of one-to-many relationships.
Returns None: All results are stored on self for later inspection.
Internally the method:
Extracts raw feature identifiers from each anndata.AnnData file.
Classifies identifiers into failure or inconsistency categories.
Records per-dataset membership via reporter_dict_creator().
Builds the multiple_ensembl_list used by HarmonizeFeatures.create_multiple_ensembl_dict().
Touches datataset_conversion_dataframe_issues() so the cached property is built eagerly.
- Return type:
None
- _initialize_idt()[source]
Instantiate the IDTrack interface on first use.
The public API defers expensive graph loading until it is actually required. This helper therefore checks whether idt is None and, if so, loads the on-disk identifier graph described by idtrack_local_repository and graph_last_ensembl_release, then configures release filters so that subsequent look-ups always target target_ensembl_release. Re-invocations are no-ops.
Returns None: The idt attribute is populated and ready for queries.
- Return type:
None
- property conversion_failed_but_consistent_identifiers: set[str]
Identify identifiers that consistently fail conversion in every dataset.
An identifier that fails conversion in every dataset can be retained (or at least logged once) without jeopardising dataset comparability. This property computes the set intersection of conversion_failed_identifiers across datasets and makes the result available for selective retention or downstream visualisation.
- Returns:
Identifiers that were never convertible but appeared in every dataset examined.
- Return type:
set[str]
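The intersection idea can be sketched as follows. The per-dataset dictionary layout and the function name consistently_failed are assumptions for illustration; the class may store these sets differently.

```python
def consistently_failed(failed_per_dataset: dict[str, set[str]]) -> set[str]:
    """Return identifiers that failed conversion in every dataset."""
    sets = list(failed_per_dataset.values())
    if not sets:
        # No datasets means nothing can be "consistently" failing.
        return set()
    return set.intersection(*sets)


# Hypothetical failure sets keyed by dataset alias.
failed = {
    "study_a": {"GENE_X", "GENE_Y", "GENE_Z"},
    "study_b": {"GENE_X", "GENE_Z"},
    "study_c": {"GENE_Z", "GENE_X", "GENE_W"},
}
always_failed = consistently_failed(failed)
```

Only GENE_X and GENE_Z fail everywhere, so only they would qualify for retention under this rule.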
- property conversion_failed_identifiers: set[str]
Return identifiers that could not be converted in at least one dataset.
The property wraps dict_1_to_not_1() and filters its "1-to-0" category so that downstream code can quickly query irrecoverable failures without iterating over the entire diagnostic structure.
- Returns:
Identifiers that failed conversion in at least one dataset or have inconsistent mappings (1-to-0, 1-to-n, or n-to-1 where not all datasets share the same mapping).
- Return type:
set[str]
- create_dataset_conversion_dataframe(gene_list, initialization_run)[source]
Build a three-column mapping table for a single dataset’s feature identifiers.
The routine transforms every source identifier in gene_list into the target namespace defined by self.final_database and Ensembl gene IDs. The resulting convertible subset is written into a new pandas.DataFrame with three columns (_ENSEMBL_GENE_COLUMN, self.final_database, and "Query ID"), while problematic identifiers are annotated or filtered according to the rules established during _initialize().
When called by _initialize() (initialization_run is True), the method writes provisional mappings without inspecting post-initialisation overrides. In subsequent calls (initialization_run is False) it resolves single-Ensembl ambiguities via self.datataset_conversion_dataframe_issues["final_database_chosen_single_ensembl_dict"] to guarantee a one-to-one relation between indices and feature rows.
- Parameters:
gene_list (Union[list[str], pd.Index]) – Ordered collection of source identifiers to convert for the current dataset.
initialization_run (bool) – True if invoked from _initialize(); disables the single-Ensembl disambiguation step applied in later passes.
- Returns:
Mapping table ready to become adata.var. Columns are _ENSEMBL_GENE_COLUMN, self.final_database, and the original "Query ID" for traceability.
- Return type:
pandas.DataFrame
- Raises:
AssertionError – If diagnostic sets such as conversion_failed_identifiers were not populated (indicating an incorrect call order) or if unexpected duplicate target IDs remain after processing.
ValueError – If a malformed conversion entry is encountered.
- create_intersection_column_values(adata_var)[source]
Flag features present in every study after harmonisation.
The merged .var table produced by unify_multiple_anndatas() contains one gene-symbol column per study, each named f"{self.converted_id_column}_{handle}", where handle is the dictionary key that identifies the originating dataset. A cell in one of those columns holds the gene symbol originally reported by the study, or idtrack._db.DB.placeholder_na if the gene was absent or could not be mapped to the target namespace.
This helper collapses the per-study presence/absence information into a single binary intersection flag, later exposed to users as adata.var["intersection"]. A value of 1 indicates that the feature survived the intersect filter (i.e., it has a valid symbol in all studies), whereas 0 marks features missing from at least one dataset. The resulting NumPy vector is inserted by the caller; this routine is intentionally pure and side-effect free.
- Parameters:
adata_var (pandas.DataFrame) – The .var table of the already concatenated anndata.AnnData object. It must contain one or more columns whose names start with f"{self.converted_id_column}_"; each such column is assumed to encode the gene symbol for a particular study.
- Returns:
A 1-D array of int (values 0 or 1) with len(adata_var) elements. The i-th entry equals 1 if the i-th feature is present (not idtrack._db.DB.placeholder_na) in every per-study symbol column; otherwise it is 0.
- Return type:
numpy.ndarray
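A rough sketch of the flag computation is shown below. The placeholder value "NA", the constant names, and the function name intersection_flags are assumptions for illustration; the real placeholder is whatever idtrack._db.DB.placeholder_na holds.

```python
import numpy as np
import pandas as pd

PLACEHOLDER_NA = "NA"  # hypothetical stand-in for idtrack._db.DB.placeholder_na
CONVERTED_ID_COLUMN = "converted_id"  # mirrors the documented default


def intersection_flags(adata_var: pd.DataFrame) -> np.ndarray:
    """Return 1 where a feature has a valid symbol in every per-study column."""
    prefix = f"{CONVERTED_ID_COLUMN}_"
    study_columns = [c for c in adata_var.columns if str(c).startswith(prefix)]
    if not study_columns:
        raise ValueError("no per-study symbol columns found")
    # A feature is present in a study when its cell is not the placeholder.
    present_everywhere = adata_var[study_columns].ne(PLACEHOLDER_NA).all(axis=1)
    return present_everywhere.to_numpy().astype(int)


var = pd.DataFrame(
    {
        "converted_id_study_a": ["TP53", "NA", "BRCA1"],
        "converted_id_study_b": ["TP53", "EGFR", "NA"],
        "unrelated": [1, 2, 3],  # ignored: does not start with the prefix
    }
)
flags = intersection_flags(var)
```

The function does not modify var; the caller decides where to store the returned vector.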
- create_multiple_ensembl_dict()[source]
Reverse map ambiguous Ensembl target IDs to their originating source identifiers.
During scanning, _initialize() collects every (source_id, target_ensembl_id) pair that falls outside the consistent one-to-one category into multiple_ensembl_list. This helper consolidates that list into a dictionary keyed by target_ensembl_id with a sorted list of associated source_id values, allowing auditors to quickly discover all inputs that collapsed onto the same Ensembl record.
- Returns:
{target_ensembl_id: [source_id₁, source_id₂, …]} with duplicates removed and values sorted alphanumerically.
- Return type:
dict[str, list[str]]
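The consolidation step can be sketched like this, assuming the list holds plain (source_id, target_ensembl_id) tuples. The function name and identifiers are illustrative, and plain lexicographic sorting stands in for the "alphanumeric" order mentioned above.

```python
from collections import defaultdict


def build_multiple_ensembl_dict(pairs):
    """Group source IDs by the Ensembl ID they collapsed onto."""
    grouped = defaultdict(set)  # sets drop duplicate pairs automatically
    for source_id, target_ensembl_id in pairs:
        grouped[target_ensembl_id].add(source_id)
    return {target: sorted(sources) for target, sources in grouped.items()}


# Hypothetical pairs, including one exact duplicate.
pairs = [
    ("SYMBOL_B", "ENSG_1"),
    ("SYMBOL_A", "ENSG_1"),
    ("SYMBOL_A", "ENSG_1"),
    ("SYMBOL_C", "ENSG_2"),
]
multiple_ensembl = build_multiple_ensembl_dict(pairs)
```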
- property datataset_conversion_dataframe_issues: DataFrame
Aggregate conversion failures and ambiguities into a tidy diagnostic table.
The cached DataFrame has one row per source identifier encountered across all datasets and the following columns:
dataset – Dataset alias that triggered the row (duplicates possible).
reason – Underscore-delimited label from reporter_dict_creator_helper_reason_finder().
target_identifier – The resolved identifier or NaN if conversion failed.
was_removed (bool) – Whether the feature was ultimately dropped from the dataset.
This compact view is ideal for spreadsheet export or in-notebook inspection because it condenses the richer nested structures stored on the class into a flat, analysis-friendly format.
- Returns:
Combined diagnostic table sorted lexicographically by dataset and reason.
- Return type:
pandas.DataFrame
- property dict_1_to_not_1: dict[str, set[str]]
Collect identifiers involved in one-to-many or one-to-zero conversions.
This helper scans unified_matching_dict and extracts every source identifier whose conversion to the target namespace is not a strict one-to-one mapping. Two situations are considered problematic:
1 → 0 (conversion failure): no target identifier could be resolved.
1 → n (ambiguous hit): multiple targets share the best score, preventing an unambiguous choice.
The resulting dictionary is later consumed by reporter_dict_creator() to populate the diagnostic attributes exposed to users and by create_dataset_conversion_dataframe() to decide which features should be dropped or flagged in each anndata.AnnData object.
- Returns:
{problem_class: {source_id₁, source_id₂, …}} where problem_class is either "1-to-0" or "1-to-n".
- Return type:
dict[str, set[str]]
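A minimal sketch of this classification, assuming unified_matching_dict maps each source identifier to a list of candidate targets (as described for that property). The function name and sample identifiers are illustrative.

```python
def classify_not_one_to_one(unified_matching_dict):
    """Split non one-to-one conversions into "1-to-0" and "1-to-n" sets."""
    problems = {"1-to-0": set(), "1-to-n": set()}
    for source_id, targets in unified_matching_dict.items():
        if len(targets) == 0:
            problems["1-to-0"].add(source_id)  # conversion failure
        elif len(targets) > 1:
            problems["1-to-n"].add(source_id)  # ambiguous hit
        # exactly one target: clean one-to-one mapping, nothing to record
    return problems


matching = {
    "GENE_OK": ["ENSG_1"],
    "GENE_LOST": [],
    "GENE_AMBIGUOUS": ["ENSG_2", "ENSG_3"],
}
problems = classify_not_one_to_one(matching)
```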
- extract_source_identifiers_from_anndata(dataset_path)[source]
Load an .h5ad file and harvest the raw feature identifiers.
To prepare inputs for ID-Track, this routine opens the single-cell expression matrix at dataset_path, reads the .var DataFrame, and extracts either the "gene_id" field (if present) or the index itself as the source identifier. Identifiers are returned in file order so that downstream procedures can preserve the original feature ordering when reconstructing matrices.
- Parameters:
dataset_path (str) – Absolute or project-relative path to an .h5ad file containing a valid anndata.AnnData object.
- Returns:
Ordered list of identifier strings exactly as they appear in the source file.
- Return type:
list[str]
- feature_harmonizer(dataset_name)[source]
Convert one dataset’s feature space into the unified target namespace.
This convenience wrapper reads a single .h5ad file, removes identifiers deemed unusable during _initialize(), applies the conversion mapping from create_dataset_conversion_dataframe(), and returns a new anndata.AnnData object with harmonised features. The function is intentionally side-effect-free: it never alters the source file, and large temporary matrices are deleted immediately to minimise memory usage.
- Parameters:
dataset_name (str) – Key from data_h5ad_dict identifying which dataset to load and harmonise.
- Returns:
resulting_adata (anndata.AnnData) - Dataset whose var now contains _ENSEMBL_GENE_COLUMN (matching Ensembl gene ID) as index and self.final_database as a column.
t0 (int) - Number of features before filtering and harmonisation.
t1 (int) - Number of features after the procedure (i.e., retained in resulting_adata).
- Return type:
tuple
- Raises:
AssertionError – If duplicate Ensembl or target-database IDs slip past the conversion checks, which would break one-to-one mapping assumptions.
- get_idtrack_matchings_for_all_datasets()[source]
Return raw ID-Track matchings for every dataset in the project.
This helper exposes the unfiltered mapping tables produced by ID-Track so that users can inspect exactly how each source identifier was converted (or failed to convert) in every individual dataset. Internally it triggers run_idtrack_for_single_dataset() for any dataset that has not yet been processed, caches the resulting tables in memory, and then assembles a {dataset_name: dataframe} dictionary whose keys align one-to-one with data_h5ad_dict.
Each returned pandas.DataFrame includes at least the following columns: source_id, target_id, conversion_status, reason, and any custom metadata injected by idtrack.api.API.
- Returns:
Mapping of dataset alias to its full, row-level ID-Track matching table. The dictionary order follows the insertion order of data_h5ad_dict.
- Return type:
dict[str, pandas.DataFrame]
- n_to_1_within_individual_dataset(dataset_name, dataset_matching_list)[source]
Detect n-to-1 collapses inside one dataset and populate diagnostic caches.
In the ID-Track context n-to-1 means several source identifiers (query_id) converging on the same target identifier (matched_id). Such collapses are problematic because they merge distinct features when building the harmonised expression matrix. This helper inspects the raw matching rows for a single dataset, discovers all many-to-one events (including those that passed through the alternative target database), and records the results in a family of per-project dictionaries so that later stages (merging, filtering, and reporting) can make informed decisions.
The routine never returns a value; instead it mutates the following public attributes:
dict_n_to_1 - {matched_id: [dataset₁, dataset₂, …]} listing every dataset where the collapse occurred.
dict_n_to_1_with_query - {matched_id: {(query_id₁,…): [dataset]}} for cases where the matched_id also appears in the collapsing query set.
dict_n_to_1_with_query_reverse - {query_id: {matched_id: [dataset]}} for a query-centric view.
dict_n_to_1_without_query - collapses where the target never appears in its own query set.
Returns None: All information is stored on the instance for subsequent pipeline stages.
- Parameters:
dataset_name (str) – Human-readable alias used throughout the project for this dataset.
dataset_matching_list (list[dict]) – Raw per-feature matchings returned by idtrack.api.API. Each dictionary must provide at least the keys "query_id", "last_node", and "final_database".
- Return type:
None
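The core of n-to-1 detection is grouping queries by target. The sketch below uses simplified rows with a hypothetical "matched_id" key (the real rows carry "last_node" and "final_database" instead) and returns values rather than mutating instance attributes.

```python
from collections import defaultdict


def find_n_to_1(matchings):
    """Return (with_query, without_query) collapses for one dataset."""
    queries_by_target = defaultdict(set)
    for row in matchings:
        queries_by_target[row["matched_id"]].add(row["query_id"])
    # A collapse is any target reached by more than one distinct query.
    collapses = {t: sorted(q) for t, q in queries_by_target.items() if len(q) > 1}
    with_query = {t: q for t, q in collapses.items() if t in q}
    without_query = {t: q for t, q in collapses.items() if t not in q}
    return with_query, without_query


rows = [
    {"query_id": "A", "matched_id": "A"},
    {"query_id": "A_old", "matched_id": "A"},
    {"query_id": "B1", "matched_id": "B"},
    {"query_id": "B2", "matched_id": "B"},
    {"query_id": "C", "matched_id": "C"},
]
with_query, without_query = find_n_to_1(rows)
```

Splitting on whether the target appears among its own queries mirrors the distinction between dict_n_to_1_with_query and dict_n_to_1_without_query.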
- reporter_dict_creator(the_dict, the_set, dataset_name)[source]
Update or create per-identifier diagnostic entries for a single dataset.
Each identifier in the_set is ensured to exist as a key inside the_dict. The entry’s "reason" field is generated exactly once using reporter_dict_creator_helper_reason_finder(); dataset_name is then appended to its "datasets_containing" list. This allows quick aggregation of “where did this problematic identifier occur?” across all datasets.
Returns None: the_dict is modified in-place.
- Parameters:
the_dict (dict[str, dict]) – Target dictionary that stores diagnostic metadata. Keys are source identifiers; values have keys "reason" (str) and "datasets_containing" (list[str]).
the_set (set[str]) – Identifiers that belong to the diagnostic category represented by the_dict.
dataset_name (str) – Human-readable alias of the dataset currently being processed.
- Return type:
None
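The in-place update pattern might look roughly like this. The free function update_reporter_dict and the injected reason_finder callable are assumptions for illustration; in the class the reason comes from reporter_dict_creator_helper_reason_finder().

```python
def update_reporter_dict(the_dict, the_set, dataset_name, reason_finder):
    """Record that every identifier in the_set occurred in dataset_name."""
    for the_id in sorted(the_set):
        if the_id not in the_dict:
            # The reason is computed only when the entry is first created.
            the_dict[the_id] = {"reason": reason_finder(the_id), "datasets_containing": []}
        the_dict[the_id]["datasets_containing"].append(dataset_name)


reason_calls = []


def fake_reason(the_id):
    """Hypothetical reason finder that logs each invocation."""
    reason_calls.append(the_id)
    return "1-to-0"


report = {}
update_reporter_dict(report, {"GENE_X"}, "study_a", fake_reason)
update_reporter_dict(report, {"GENE_X", "GENE_Y"}, "study_b", fake_reason)
```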
- reporter_dict_creator_helper_reason_finder(the_id)[source]
Infer why a particular identifier failed or produced a non-one-to-one conversion.
The algorithm inspects unified_matching_dict and assigns the_id one or more non-exclusive reasons:
"n-to-1" – The identifier was part of an n → 1 collapse within at least one dataset.
"1-to-0" – No target identifier was returned (conversion failure).
"1-to-n" – The conversion yielded multiple targets (ambiguous mapping).
The final label is a single string where multiple reasons are concatenated with underscores, e.g. "1-to-0_1-to-n".
- Parameters:
the_id (str) – Source identifier whose conversion outcome needs explanation.
- Returns:
Underscore-delimited reason string describing the failure or ambiguity class.
- Return type:
str
- run_idtrack_for_single_dataset(dataset_name, dataset_path)[source]
Convert identifiers for one dataset and cache the raw ID-Track output.
Given a dataset alias and its on-disk location, this method:
Calls extract_source_identifiers_from_anndata() to obtain the feature list.
Feeds those identifiers to idtrack.api.API and collects the per-feature match results.
Stores the resulting pandas.DataFrame inside the _idtrack_matchings_per_dataset cache so repeated calls are O(1).
Updates unified_matching_dict so that cross-dataset diagnostics remain consistent.
Users rarely call this directly (get_idtrack_matchings_for_all_datasets() handles the orchestration), but it remains public for advanced, dataset-by-dataset debugging.
- Parameters:
dataset_name (str) – Human-friendly alias used as the key inside diagnostic dictionaries.
dataset_path (str) – Absolute or project-relative .h5ad path passed straight to extract_source_identifiers_from_anndata().
- Returns:
Full ID-Track matching table for dataset_name with columns source_id, target_id, conversion_status, and any extra metadata returned by the API.
- Return type:
pandas.DataFrame
- property unified_matching_dict
Expose the full source-to-target identifier mapping produced by IDTrack.
The dictionary is created during _initialize() when the IDTrack graph is first queried. Keys are source identifiers (as found in input files); values are all candidate target IDs returned by the graph query, ordered by decreasing score. A value may therefore be
a single-element list (unambiguous one-to-one),
a multi-element list (ambiguous one-to-n), or
an empty list (1-to-0 conversion failure).
Public access to this attribute enables advanced users to perform their own diagnostics or to reproduce the algorithm’s decisions outside the class.
- Returns:
Mapping {source_id: [target_id₁, target_id₂, …]} in the order delivered by the IDTrack query.
- Return type:
dict[str, list[str]]
- unify_multiple_anndatas(mode='union', obs_columns_to_keep=None, numeric_var_columns=None, numeric_obs_columns=None, handle_anndata_key='handle_anndata')[source]
Merge several study-specific anndata.AnnData objects into a single, consolidated dataset.
This helper finalises the feature-harmonisation workflow. Earlier stages ensure that every source study expresses its features (e.g. genes or proteins) in a consistent identifier namespace and that per-cell metadata follow a shared schema. unify_multiple_anndatas takes those already normalised objects, stored in data_h5ad_dict, and fuses them into one coherent AnnData ready for joint analysis (dimensionality reduction, batch correction, integrated clustering, etc.).
Two strategies govern how the function reconciles mismatched feature sets:
"union" (default) preserves the superset of all identifiers. If a particular study lacks a feature, its expression values are imputed as exact zeros. This choice maximises information retention at the cost of a sparse matrix with assay-dependent missingness.
"intersect" retains only the identifiers present in every study, implicitly discarding features unique to a subset. This yields a denser matrix that is easier to factorise but sacrifices potentially informative study-specific biology.
Beyond concatenating the main X matrices, the routine also harmonises associated annotations:
- .var (feature annotations)
All columns are outer-joined across studies. Non-shared categorical values are unioned; numeric columns specified in numeric_var_columns are cast to floating point and NaNs inserted where data are missing. In union mode an additional boolean "intersection" column flags whether a feature survived the intersect filter, enabling fast subsetting later.
- .obs (cell annotations)
Each original column is kept if its name appears in obs_columns_to_keep or if it exists in every study. Missing columns are created and populated with pandas.NA. Columns listed in numeric_obs_columns are coerced to float64. A new column named handle_anndata_key stores the handle (dictionary key) that identifies the originating study, making it trivial to stratify analyses.
- .layers, .obsp, .varp, .uns
These slots are merged by delegating to anndata.concat().
The implementation is mindful of scalability: concatenation leverages SciPy CSR/CSC sparse formats, avoiding densification, and streaming allocation prevents double memory use for extremely large datasets.
- Parameters:
mode (Literal["union", "intersect"]) – Strategy for reconciling discordant feature sets. "union" keeps every identifier observed across studies (padding absent entries with zeros); "intersect" restricts the result to identifiers common to all studies. Defaults to "union".
obs_columns_to_keep (list[str] | None) – Names of per-cell metadata columns that must survive the merge even if they appear in only a subset of studies (e.g. cell_type, donor_age). When a column is missing from a particular study, it is inserted and filled with pandas.NA. Provide an empty list to allow the routine to decide purely by intersection; None means “no user preference”.
numeric_var_columns (set[str] | None) – Columns in .var that should retain numeric dtype. The function validates that each specified column can be losslessly converted to floating point; otherwise it raises ValueError. Non-listed columns default to category dtype to conserve memory. If None an empty set is assumed.
numeric_obs_columns (set[str] | None) – Analogous to numeric_var_columns but applied to .obs. Conversions are performed after the table has been unioned, ensuring consistent dtype across the final concatenated frame. If None an empty set is assumed.
handle_anndata_key (str) – Name of the column inserted into .obs that records the dictionary key of the source study. This provenance tag facilitates stratified visualisation (e.g. UMAP coloured by batch) and downstream batch-correction utilities that expect a “batch” column. Defaults to "handle_anndata".
- Returns:
A fully merged expression matrix whose .X contains either the union or intersection of all study features. Index ordering follows the order in which studies were supplied, ensuring deterministic output for reproducible pipelines. The result inherits sparse/dense representation from the first study unless mode forces feature padding, in which case CSR/CSC is chosen automatically to keep memory use in check.
- Return type:
anndata.AnnData
- Raises:
ValueError – If mode is not "union" or "intersect"; if any column listed in numeric_var_columns or numeric_obs_columns fails numeric coercion; or if feature identifiers clash across studies after harmonisation (e.g. two studies mapping different genes to the same ID).
AssertionError – If duplicate cell or feature indices are detected post-merge, a condition that would break many Scanpy workflows and indicates upstream validation errors.
Notes
Performance considerations: The operation is CPU-bound when aligning large sparse matrices. For datasets exceeding ~1 million cells, empirical benchmarks show that running on Python 3.11 with MKL yields a 2-3x speed-up over Python 3.8 due to better sparse BLAS threading. Provide pre-compressed datasets (hdf5, zarr) to further lower I/O overhead.
Thread safety: The method is re-entrant but not thread-safe because it mutates the source AnnData objects in-place to reduce copying. Invoke one instance per process or deep-copy the inputs beforehand if concurrent harmonisation is required.
Extensibility: Sub-classes may override private hooks _before_concat(), _after_concat(), and _merge_uns() to refine behaviour without re-implementing the full algorithm.
- class DatabaseManager(organism, form, local_repository, ensembl_release=None, ignore_before=None, ignore_after=None, store_raw_always=True, genome_assembly=None)[source]
Bases: object
Manage retrieval, preprocessing, and storage of Ensembl Core and related external datasets.
The DatabaseManager centralizes all low-level operations required for ID-track analyses, including discovering which Ensembl releases are available for a given organism/assembly, downloading the corresponding MySQL tables, normalizing column names, persisting raw and processed files under a local cache directory, and orchestrating auxiliary look-ups to third-party resources via ExternalDatabases. By funnelling every data-access path through a single object the wider package gains:
Stable, reproducible builds - every graph, lookup table, or ID-history file is anchored to the exact Ensembl release, genome assembly, and form (gene, transcript, translation, …) with which the manager was configured.
Transparent caching - expensive downloads happen once; subsequent requests are served from disk, making large iterative analyses feasible on modest hardware.
Unified version logic - helper methods such as version_uniformize() and check_version_info() guarantee that cross-release identifier changes are captured and resolved consistently across the codebase.
Key public methods/attributes
available_releases() – list releases that can be queried and saved locally.
change_release() – switch the manager to another Ensembl release in-place.
download_table() – fetch a single MySQL table and write it to local_repository.
create_external_all() – pull every supported external resource (UniProt, RefSeq, …).
organism, form, ensembl_release, genome_assembly – core configuration knobs, surfaced for quick inspection.
The class is stateful: helpers that change its configuration also update the internal cached properties, so the instance always reflects its current settings. Use the built-in __str__() for a concise, human-readable dump of that state.
Initialize a DatabaseManager for a specific organism, release, and assembly.
- Parameters:
organism (str) – Canonical species name in Ensembl schema (e.g. "homo_sapiens" or "mus_musculus"). Unsupported organisms raise NotImplementedError.
form (str) – Biological entity level of interest (one of "gene", "transcript", "translation", …), governing which stable-ID columns will be expected downstream.
local_repository (str) – Absolute or relative path to a writable directory that will hold all downloaded MySQL dumps, intermediate parquet/Feather files, and ready-to-use artefacts. The directory must already exist and be both readable and writable.
ensembl_release (Optional[int]) – Target Ensembl release number. If None the most recent release available for genome_assembly is selected automatically.
ignore_before (Optional[int]) – Earliest release to include when building cross-release ID histories. Defaults to the minimum release supported by the selected assembly.
ignore_after (Optional[int | float]) – Latest release to include when building histories. np.inf (the default) disables the upper bound and includes all newer releases.
store_raw_always (bool) – When True raw MySQL tables are always copied to local_repository before conversion; when False they are kept only in memory.
genome_assembly (Optional[int]) – Genome assembly code used in Ensembl core schema names (<organism>_core_<release>_<assembly>). This selects the primary assembly used for data access (e.g. 38 = human GRCh38, 39 = mouse GRCm39, 111 = pig Sscrofa11.1). If omitted, the highest-priority assembly for organism is used. If ensembl_release is provided, the selection is restricted to assemblies that actually contain that release.
- Raises:
ValueError – If form is not in the supported list or local_repository fails basic path/read/write checks.
RuntimeError – If internal port/release configuration is inconsistent.
NotImplementedError – If organism is not yet supported by the package.
- _create_relation_helper(df)[source]
Convert an ID/version matrix into the canonical three-column relationship table.
The helper is shared by DatabaseManager.create_relation_current() and DatabaseManager.create_relation_archive() and is not intended for direct use. It validates the incoming frame, fixes inconsistent version numbers (via DatabaseManager.version_fix() and DatabaseManager.version_fix_incomplete()), converts missing translations to NaN-compatible floats, casts all stable-ID columns to string, and finally compresses each ID + version pair into the compact node label used throughout ID-track graphs.
- Parameters:
df (pandas.DataFrame) – A six-column frame with exactly the following names (order irrelevant): gene_stable_id, gene_version, transcript_stable_id, transcript_version, translation_stable_id, translation_version.
- Returns:
Three columns (gene, transcript, translation), deduplicated and index-reset, ready for graph construction.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If df does not contain the required six columns or if version columns cannot be coerced to the expected numeric dtype.
- static _determine_usecols_ids(form)[source]
Determine column subsets needed to fetch identifier tables for a given Ensembl molecular form.
The helper translates a user-facing form string ("gene", "transcript", or "translation") into three ordered lists that drive low-level SQL selects throughout ID-track. Splitting the information this way lets public routines such as DatabaseManager.create_ids() assemble the minimal column set required for each organism/release while still keeping associated keys available for later joins.
- Parameters:
form (str) – Molecular form whose identifier columns are requested. Must be one of idtrack._db.DB.forms_in_order ("gene", "transcript", or "translation").
- Returns:
stable_id_version - always ["stable_id", "version"]; the canonical ID and its version counter.
usecols_core - primary-key column for form plus stable_id_version.
usecols_asso - foreign-key columns linking form to upstream forms, enabling later joins (e.g., ["transcript_id", "gene_id"] for transcripts).
- Return type:
tuple[list[str], list[str], list[str]]
- Raises:
ValueError – If form is not in {"gene", "transcript", "translation"}.
- _download_table_from_ftp(table_key, usecols=None)[source]
Download a table from Ensembl’s HTTPS MySQL dumps (no direct MySQL connection).
- Parameters:
table_key (str)
usecols (list[str] | None)
- Return type:
DataFrame
- _ftp_db_dir_url()[source]
Return the HTTPS directory URL for the current core database dump.
- Return type:
str
- classmethod _ftp_find_core_db_dir(*, organism, genome_assembly, release)[source]
Locate the core DB directory name (may include patch letters) and its HTTPS directory URL.
- Parameters:
organism (str)
genome_assembly (int)
release (int)
- Return type:
tuple[str | None, str | None]
- classmethod _ftp_mysql_root_candidates(*, organism, genome_assembly, release)[source]
Return candidate HTTPS roots to search for an Ensembl core DB directory.
- Parameters:
organism (str)
genome_assembly (int)
release (int)
- Return type:
tuple[str, …]
- classmethod _ftp_schema_for_sql_url(sql_url)[source]
Return a table->columns mapping parsed from an Ensembl <db>.sql.gz schema dump.
- Parameters:
sql_url (str)
- Return type:
dict[str, list[str]]
- _ftp_schema_url()[source]
Return a working schema-dump URL (*.sql.gz or *.sql.gz.bz2) for the current DB dump directory.
- Return type:
str
- classmethod _get_core_db_index(*, organism, genome_assembly)[source]
Return cached core-DB availability for an (organism, assembly) pair across all configured ports.
The Ensembl public MySQL service can host the same assembly on multiple ports depending on release (e.g. homo_sapiens assembly 37). To support full-history workflows we build a small in-memory index:
ports: ports probed for this (organism, assembly) in preference order
releases_by_port: releases available for this assembly on each reachable port
db_by_port_release: schema name for each available release on each reachable port
releases: sorted union of all releases across ports
port_for_release: deterministic choice of port for each release (first configured port that has it)
db_for_release: chosen schema name for each release (matching port_for_release, includes patch-letter suffixes)
- Parameters:
organism (str) – Canonical Ensembl organism name (e.g. "homo_sapiens").
genome_assembly (int) – Genome assembly version (e.g. 38 for GRCh38).
- Returns:
Mapping describing reachable releases/ports for this organism/assembly.
- Return type:
dict[str, Any]
- Raises:
ValueError – If organism or genome_assembly is not configured.
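The "first configured port that has it" rule behind port_for_release can be sketched as below. The function name, port numbers, and release numbers are illustrative assumptions, not values read from the package.

```python
def choose_ports(ports, releases_by_port):
    """Derive the sorted release union and a deterministic port per release."""
    releases = sorted(set().union(*releases_by_port.values()))
    port_for_release = {}
    for release in releases:
        for port in ports:  # ports are listed in preference order
            if release in releases_by_port.get(port, set()):
                port_for_release[release] = port
                break  # first configured port that hosts the release wins
    return releases, port_for_release


# Hypothetical availability: release 75 is hosted on both ports.
ports = [3337, 3306]
releases_by_port = {3337: {74, 75}, 3306: {75, 76}}
releases, port_for_release = choose_ports(ports, releases_by_port)
```

Because the preference order is fixed, repeated runs pick the same port for an overlapping release.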
- classmethod _get_core_db_index_from_ftp(*, organism, genome_assembly, ports)[source]
Build a core-DB availability index by probing the Ensembl HTTPS/FTP MySQL dumps.
- Parameters:
organism (str)
genome_assembly (int)
ports (list[int])
- Return type:
dict[str, Any]
- classmethod _is_retryable_http_read_error(exc)[source]
Heuristically decide whether an HTTP read/decompress error is transient and safe to retry.
- Parameters:
exc (BaseException)
- Return type:
bool
- static _iter_exception_chain(exc)[source]
Return the exception chain (__cause__/__context__) as a list.
- Parameters:
exc (BaseException)
- Return type:
list[BaseException]
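A minimal sketch of walking an exception chain, assuming that an explicit `__cause__` takes precedence over the implicit `__context__`; the actual method may order or deduplicate the chain differently.

```python
def iter_exception_chain(exc: BaseException) -> list[BaseException]:
    """Collect exc and its linked causes/contexts, guarding against cycles."""
    chain: list[BaseException] = []
    seen: set[int] = set()
    current = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        chain.append(current)
        # Assumption: prefer the explicit cause, fall back to the implicit context.
        current = current.__cause__ or current.__context__
    return chain


try:
    try:
        raise ConnectionResetError("peer closed connection")
    except ConnectionResetError as inner:
        raise OSError("read failed") from inner
except OSError as outer:
    chain = iter_exception_chain(outer)
```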
- static _open_decompressed_http_text(url)[source]
Yield a text stream for an Ensembl dump file, handling .gz, .bz2, and nested .gz.bz2.
- Parameters:
url (str)
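The suffix handling can be illustrated in memory, without any network access. This sketch assumes the payload has already been downloaded as bytes and that compression layers are peeled off from the outermost suffix inwards; the real method streams over HTTP, and the helper name `open_decompressed_text` is hypothetical.

```python
import bz2
import gzip
import io


def open_decompressed_text(name: str, payload: bytes) -> io.StringIO:
    """Undo .gz / .bz2 layers implied by the file name and return a text stream."""
    data = payload
    suffixes = name.lower().split(".")
    # Peel compression layers from the outermost (last) suffix inwards.
    while suffixes and suffixes[-1] in ("gz", "bz2"):
        ext = suffixes.pop()
        data = gzip.decompress(data) if ext == "gz" else bz2.decompress(data)
    return io.StringIO(data.decode("utf-8"))


schema_sql = "CREATE TABLE gene (gene_id INT);\n"
# A nested dump: gzip first, then bzip2 on top, as the name *.sql.gz.bz2 suggests.
nested = bz2.compress(gzip.compress(schema_sql.encode("utf-8")))
text = open_decompressed_text("example_core.sql.gz.bz2", nested).read()
```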
- static _parse_apache_dir_listing_dirs(html)[source]
Extract directory names from an Apache-style directory listing HTML page.
- Parameters:
html (str)
- Return type:
list[str]
- static _parse_apache_dir_listing_files(html)[source]
Extract file names from an Apache-style directory listing HTML page.
- Parameters:
html (str)
- Return type:
list[str]
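Both listing parsers can be approximated with a single regular expression over the anchor tags. The sketch below is an assumption about the approach, not the library's code: it skips column-sort links (which start with `?`) and absolute links such as the parent directory, then separates directories from files by the trailing slash.

```python
import re

# Relative hrefs only: skip "?C=N;O=D" sort links and absolute "/..." links.
_HREF = re.compile(r'<a href="([^"?/][^"?]*)"')


def parse_listing(html: str) -> tuple[list[str], list[str]]:
    """Return (directory names, file names) from an Apache-style index page."""
    dirs: list[str] = []
    files: list[str] = []
    for href in _HREF.findall(html):
        if href.endswith("/"):
            dirs.append(href.rstrip("/"))
        else:
            files.append(href)
    return dirs, files


sample_html = """
<a href="?C=N;O=D">Name</a>
<a href="/pub/release-112/">Parent Directory</a>
<a href="homo_sapiens_core_112_38/">homo_sapiens_core_112_38/</a>
<a href="mus_musculus_core_112_39/">mus_musculus_core_112_39/</a>
<a href="CHECKSUMS">CHECKSUMS</a>
"""
listing_dirs, listing_files = parse_listing(sample_html)
```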
- classmethod _probe_mysql_core_schemas_by_port(*, organism, genome_assembly, ports)[source]
Return per-port core DB releases and schema names from the live Ensembl MySQL service.
- Parameters:
organism (str)
genome_assembly (int)
ports (list[int])
- Return type:
tuple[dict[int, set[int]], dict[int, dict[int, str]], dict[int, Exception]]
- classmethod _refresh_core_db_index_mysql(index, *, organism, genome_assembly)[source]
Refresh the MySQL-derived portion of a cached core-index in place.
- Parameters:
index (dict[str, Any])
organism (str)
genome_assembly (int)
- Return type:
dict[str, Any]
- property available_releases: list[int]
Return Ensembl releases that are both reachable and within the ignore window.
The set is discovered via available_releases_versions(), filtered against ignore_before / ignore_after, sorted in ascending order, and cached for the lifetime of this DatabaseManager instance. The resulting list represents releases that can safely be queried and cached locally, guaranteeing reproducible downstream analyses.
- Returns:
Sorted release numbers satisfying reachability and ignore-window constraints.
- Return type:
list[int]
- property available_releases_all_assemblies: list[int]
Return all Ensembl releases reachable across every configured assembly.
For clean-handoff species (e.g. mouse: 37 → 38 → 39), no single assembly spans the full release history. Graph construction and YAML template generation therefore need a release catalogue that is the union over all assemblies configured for organism in idtrack._db.DB.assembly_mysqlport_priority.
The list is filtered by the manager’s ignore_before / ignore_after window and cached for the lifetime of this DatabaseManager instance.
- Returns:
Sorted release numbers available for at least one configured assembly.
- Return type:
list[int]
- property available_releases_no_save: list[int]
Return reachable Ensembl releases without persisting the discovery to disk.
Functionally identical to available_releases(), except that the discovered list is not written to the on-disk YAML cache. This helper is useful when users want a quick, read-only view of server availability (e.g., inside CI pipelines) without contaminating the persistent cache. The value is still memoized in memory for the current DatabaseManager instance.
- Returns:
Sorted release numbers reachable on the remote MySQL server and compliant with the ignore window.
- Return type:
list[int]
- available_releases_versions(**kwargs)[source]
Discover valid Ensembl releases for the configured organism and assembly.
Availability is discovered via the cached, multi-port aware core index built by _get_core_db_index(). The resulting union of releases is filtered against the manager’s ignore_before / ignore_after bounds and returned in ascending order.
- Parameters:
kwargs – Kept for backward compatibility; currently unused.
- Returns:
Sorted list of release numbers that exist on the mirror and comply with the ignore window.
- Return type:
list[int]
- available_tables_mysql()[source]
Enumerate tables present in the selected Ensembl MySQL schema.
Intended to complement available_databases_mysql(): while that method lists databases (one per organism/release/assembly), this one is meant to drill into the active database and return the table names themselves, such as "gene", "transcript", "xref", and so on.
- Raises:
NotImplementedError – Always - the table enumeration logic has not yet been written.
- change_assembly(genome_assembly, last_possible_ensembl_release=False)[source]
Clone the manager while targeting a new genome assembly (e.g. GRCh38 → GRCh37).
Genome assemblies are encoded as integers in Ensembl’s schema naming (38 for GRCh38, 37 for GRCh37, 39 for GRCm39, 111 for Sscrofa11.1, …). When last_possible_ensembl_release is True, the method automatically picks the most recent Ensembl release that still provides MySQL dumps for the requested assembly, ensuring compatibility. All other settings are copied verbatim.
- Parameters:
genome_assembly (int) – Assembly code configured under DB.assembly_mysqlport_priority for this manager’s organism.
last_possible_ensembl_release (bool) – When True, override ensembl_release with the newest version available for genome_assembly. Defaults to False.
- Returns:
New manager tied to the requested assembly (and possibly a recalculated release).
- Return type:
DatabaseManager
- change_form(form)[source]
Clone the manager while switching the biological form of interest.
A “form” denotes the identifier namespace to track: gene, transcript, translation, etc. This method preserves every other configuration knob (organism, release, assembly, cache directory, ignore windows, …) and returns a brand-new instance so that the original object remains unaffected.
- Parameters:
form (str) – Target form/namespace recognised by __init__(). Typical values are "gene", "transcript", or "translation".
- Returns:
An independent manager identical to self except for form.
- Return type:
DatabaseManager
- change_release(ensembl_release)[source]
Produce a new manager that targets a different Ensembl release.
The returned instance inherits organism, form, assembly, and all caching parameters, but points every subsequent query (MySQL, FTP, or REST) to ensembl_release. This is the recommended way to traverse releases in scripted analyses without mutating objects in-place.
- Parameters:
ensembl_release (int) – Desired Ensembl release number (e.g. 111). Must be available for the current genome assembly; otherwise a NotImplementedError may be raised further down the call stack when data retrieval is attempted.
- Returns:
Fresh manager initialised for ensembl_release.
- Return type:
DatabaseManager
- change_release_auto_assembly(ensembl_release)[source]
Clone the manager for ensembl_release while inferring a compatible genome assembly.
Unlike change_release(), this helper allows genome_assembly to change when the requested release is not present in the current assembly (common for clean-handoff species). Assembly inference follows the same priority rules as __init__() with genome_assembly=None: pick the highest-priority configured assembly that contains the requested release.
- Parameters:
ensembl_release (int) – Desired Ensembl release number.
- Returns:
Fresh manager initialised for ensembl_release with an inferred assembly.
- Return type:
DatabaseManager
- check_if_change_assembly_works(db_manager, target_assembly)[source]
Evaluate whether db_manager can be cloned to operate on target_assembly.
A lightweight health-check that calls DatabaseManager.change_assembly() inside a try/except block and converts the outcome to a boolean flag rather than letting the exception propagate. It allows batch workflows to skip assemblies that are unavailable or invalid without interrupting processing.
- Parameters:
db_manager (DatabaseManager) – Manager instance to probe.
target_assembly (int) – Genome-assembly code to test (a key of idtrack._db.DB.assembly_mysqlport_priority for the manager’s organism).
- Returns:
True if the assembly switch succeeds without raising ValueError; False otherwise.
- Return type:
bool
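The try/except-to-boolean pattern described above can be sketched with a stand-in manager. `StubManager` and `can_change_assembly` are hypothetical names used only for illustration; the stub merely mimics a `change_assembly()` that raises `ValueError` for unconfigured assemblies.

```python
class StubManager:
    """Hypothetical stand-in for a manager; not the real DatabaseManager."""

    def __init__(self, configured_assemblies: set[int]) -> None:
        self.configured_assemblies = configured_assemblies

    def change_assembly(self, genome_assembly: int) -> "StubManager":
        if genome_assembly not in self.configured_assemblies:
            raise ValueError(f"Assembly {genome_assembly} is not configured.")
        return StubManager(self.configured_assemblies)


def can_change_assembly(db_manager: StubManager, target_assembly: int) -> bool:
    """Convert the outcome of change_assembly() into a boolean flag."""
    try:
        db_manager.change_assembly(target_assembly)
    except ValueError:
        return False
    return True


manager = StubManager({37, 38})
# Batch workflows can keep only the assemblies that are actually usable.
usable = [a for a in (36, 37, 38, 39) if can_change_assembly(manager, a)]
```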
- check_version_info()[source]
Infer whether the organism’s Ensembl IDs come with, without, or mixed versions.
The method scans all releases available for the current genome assembly and inspects the boolean flag in the version_info column of a pre-computed table (get_db("versioninfo")). Three mutually exclusive scenarios exist:
All releases lack version suffixes: "without_version"
All releases include suffixes: "with_version"
A mixture of both states: "add_version" (synthetic versions will be injected)
- Returns:
One of "without_version", "with_version", or "add_version". Callers use the string to decide how to standardise identifier columns.
- Return type:
str
- Raises:
ValueError – If the version_info column in the source table is not strictly boolean, signalling a corrupted download or schema drift.
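The three scenarios reduce to a small decision over the per-release flags, where True means the release lacks version suffixes (as produced by create_version_info()). A minimal sketch; the function name is hypothetical and rejecting an empty input is an assumption, not documented behaviour.

```python
def classify_version_info(flags: list[bool]) -> str:
    """Map per-release version_info flags to one of the three scenario strings."""
    # Assumption: an empty table is treated as invalid input.
    if not flags or not all(isinstance(flag, bool) for flag in flags):
        raise ValueError("version_info must be a non-empty, strictly boolean column.")
    if all(flags):
        return "without_version"  # every release lacks version suffixes
    if not any(flags):
        return "with_version"  # every release includes version suffixes
    return "add_version"  # mixture: synthetic versions will be injected


scenario_versioned = classify_version_info([False, False, False])
scenario_mixed = classify_version_info([True, False, False])
```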
- create_available_databases()[source]
Discover MySQL databases for the configured organism/assembly.
The manager issues a SHOW DATABASES query against the Ensembl public MySQL mirror and filters names that match ^{organism}_core_[0-9]+_.*$. The resulting list is returned as a single-column dataframe so that callers can seamlessly chain further pandas operations or persist the result.
- Returns:
One column named "available_databases" listing all databases that match the organism, irrespective of Ensembl release or genome assembly.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If the server response is not a sequence of single-field tuples or if any tuple element is not a string.
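The name filter can be reproduced with the standard `re` module. This sketch applies the documented pattern to an in-memory list instead of a live SHOW DATABASES result; the sample database names are illustrative.

```python
import re


def filter_core_databases(names: list[str], organism: str) -> list[str]:
    """Keep names matching ^{organism}_core_[0-9]+_.*$ for the given organism."""
    pattern = re.compile(rf"^{re.escape(organism)}_core_[0-9]+_.*$")
    return [name for name in names if pattern.match(name)]


# Illustrative server response, already flattened from single-field tuples.
server_names = [
    "homo_sapiens_core_105_38",
    "homo_sapiens_core_75_37",
    "homo_sapiens_variation_105_38",
    "mus_musculus_core_105_39",
    "information_schema",
]
human_core = filter_core_databases(server_names, "homo_sapiens")
```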
- create_database_content(just_download=False)[source]
Retrieve and optionally cache external-database metadata for every assembly, release, and form.
The helper iterates over every genome assembly configured for the current organism in idtrack._db.DB.assembly_mysqlport_priority, every available Ensembl release for each assembly, and every identifier form supported by the package, downloading the external_database table for each combination. The resulting frames are concatenated, enriched with assembly, release, form, and organism columns, and returned to the caller. When just_download is True the downloads are still performed (ensuring they are cached on disk for future runs) but an empty dataframe is returned to avoid unnecessary memory use.
- Parameters:
just_download (bool) –
False - concatenate intermediate results and return the union dataframe (default).
True - download and cache each frame but return an empty dataframe.
- Returns:
External-database relationships augmented with assembly, release, form, and organism columns. Empty when just_download is True.
- Return type:
pandas.DataFrame
- create_external_all(return_mode, narrow_external=True)[source]
Download and collate cross-reference mappings from every supported genome assembly.
The manager cycles through every genome assembly recognised for the current organism (ordered by idtrack._db.DB.assembly_mysqlport_priority), fetches either the filtered external_relevant mapping table (when narrow_external=True) or the full external mapping table (when narrow_external=False) for each via get_db(), labels every row with its source assembly, and finally concatenates the tables. Because this helper is intended for ad-hoc inspection only, it bypasses the get_db() caching layer and therefore never writes the result to the local repository.
- Parameters:
return_mode (str) –
Strategy for handling rows that appear in more than one assembly. Must be one of:
"all" - Keep one copy of every unique (release, graph_id, id_db, name_db, ensembl_identity, xref_identity, assembly) combination. Duplicates are resolved within each assembly only.
"unique" - Keep one copy of every unique (release, graph_id, id_db, name_db, ensembl_identity, xref_identity) combination across all assemblies, preferring the assembly with the highest priority. (Currently no downstream use case.)
"duplicated" - Return only the rows that occur in more than one assembly as a pandas.core.groupby.generic.DataFrameGroupBy, keyed by the same column set used for "unique". (Currently no downstream use case.)
narrow_external (bool) – If True (default), restrict results to databases enabled in the external YAML configuration (external_relevant). If False, include all external databases provided by the Ensembl MySQL server (external).
- Returns:
If return_mode is "all" or "unique", a de-duplicated cross-reference table with the columns release, graph_id, id_db, name_db, ensembl_identity, xref_identity, and assembly.
If return_mode is "duplicated", a group-by view containing only duplicated entries.
- Return type:
Union[pandas.DataFrame, pandas.core.groupby.generic.DataFrameGroupBy]
- Raises:
ValueError – If return_mode is not "all", "unique", or "duplicated".
- create_external_db(filter_mode)[source]
Retrieve Ensembl-external-ID relationships and/or database statistics.
This consolidates a complex SQL join—spanning Ensembl core tables gene, transcript, translation and the cross-reference tables xref, object_xref, identity_xref, external_db, and external_synonym—into a single pandas dataframe. It enables downstream analyses such as mapping Ensembl gene models to UniProt, RefSeq, or CCDS identifiers, or summarising which external sources are represented in a given Ensembl release. The result type and granularity are controlled by filter_mode, allowing either the raw relationship rows or a per-database count to be returned.
The query executed is conceptually equivalent to the (simplified) MySQL statement below, though the actual SQL is constructed programmatically for flexibility and performance:
    SELECT g.stable_id, t.stable_id, tr.stable_id, x.dbprimary_acc, edb.db_name, es.synonym, ix.*
    FROM gene g
    JOIN transcript t USING (gene_id)
    JOIN translation tr USING (transcript_id)
    JOIN object_xref ox ON (g.gene_id = ox.ensembl_id AND ox.ensembl_object_type = "Gene")
    JOIN xref x ON (ox.xref_id = x.xref_id)
    LEFT JOIN external_db edb ON (x.external_db_id = edb.external_db_id)
    LEFT JOIN identity_xref ix ON (ox.object_xref_id = ix.object_xref_id)
    LEFT JOIN external_synonym es ON (x.xref_id = es.xref_id)
    LIMIT 10;
When tighter genomic scoping is required the gene table can be prefixed with coord_system and seq_region:
    FROM coord_system cs
    JOIN seq_region sr USING (coord_system_id)
    JOIN gene g USING (seq_region_id)
You can experiment interactively against the public Ensembl MySQL mirror:
    mysql --user=anonymous --host=ensembldb.ensembl.org -D homo_sapiens_core_105_38 -A
    # Schema reference:
    # https://m.ensembl.org/info/docs/api/core/core_schema.html
- Parameters:
filter_mode (str) –
Controls both the row subset and the output schema. Must be one of:
"all" - return every mapping found in MySQL, no post-filtering applied.
"relevant" - return only mappings whose external database is marked Include: true in the ExternalDatabases.give_list_for_case() YAML configuration.
"database" - return a two-column summary (name_db, count) for all external databases.
"relevant-database" - as above, but restricted to databases flagged Include: true.
The special values "relevant" and "relevant-database" implicitly consult the cached external_inst to honour the user’s curated allow-list.
- Returns:
For "all" / "relevant" - six-column frame ["release", "graph_id", "id_db", "name_db", "ensembl_identity", "xref_identity"] holding one row per Ensembl→external identifier edge. graph_id is the Ensembl stable ID (+version), while the two identity columns store Smith-Waterman percent identities (float16) for QC.
For "database" / "relevant-database" - two-column frame ["name_db", "count"] giving how many distinct graph_id values each external database touches. count is an int64.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If filter_mode is not one of the accepted literals or if the YAML allow-list claims a database that is absent from the retrieved mappings—indicating the configuration and MySQL data are out of sync.
Notes
Synonym handling - any synonym brought in from external_synonym is prefixed with DB.synonym_id_nodes_prefix, and its name_db is likewise prefixed so that synonym nodes remain distinguishable during graph building.
Caching - the heavy MySQL queries are executed only if the processed frame is not already present in the manager’s per-organism HDF5 cache; otherwise the cached frame is read from disk, ensuring repeat calls are inexpensive.
- create_id_history(narrow)[source]
Retrieve historical relationships between successive Ensembl stable IDs.
Build a cross-release lineage table mapping every obsolete ID version to its immediate successor for the configured organism, form, and release window. The information is assembled from the Core tables stable_id_event and mapping_session and then normalised so that all identifiers follow the canonical <stable_id>.<version> convention. Downstream graph-construction utilities depend on this table to reconstruct how genes, transcripts, or translations evolve across Ensembl releases.
- Parameters:
narrow (bool) – If True, drop auxiliary columns (mapping session metadata, assembly labels, creation timestamps, etc.) to minimise on-disk footprint; otherwise return the full schema for exploratory analyses.
- Returns:
Seven-column table with the following fields, ordered as listed:
old_stable_id - obsolete identifier (empty string for “birth” events).
old_version - version number paired with old_stable_id.
new_stable_id - successor identifier (empty string for “retirement” events).
new_version - version paired with new_stable_id.
score - homology score reported by Ensembl (NaN if unavailable).
old_release - Ensembl release where the old identifier last appeared.
new_release - release where the new identifier first appeared.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If the identifier delimiter idtrack._db.DB.id_ver_delimiter is found inside any *_stable_id field, indicating malformed input.
- create_id_history_fixed(narrow, inspect)[source]
Create a corrected ID-history table that repairs cyclic or duplicated version transitions (deprecated).
Certain edge cases in the raw idhistory extraction (e.g. Homo sapiens ENSG00000232423 at release 105) produce sequences like 1 → 2, 2 → 3, 1 → 2 where an already retired version resurfaces later on. Such cycles violate the monotonic version semantics assumed by graph algorithms. This helper rewrites the offending rows so that once a version is superseded it never reappears, transforming the above sequence into the logically consistent 3 → 2. The routine is retained for reproducibility but superseded by DatabaseManager.create_id_history().
- Parameters:
narrow (bool) – Propagated to the underlying data fetch: when True, start from the column-reduced idhistory_narrow view instead of the full table.
inspect (bool) – When True, add diagnostic columns (e.g. changed_old and changed_new) to aid manual auditing of the corrections; when False, return only the cleaned canonical schema.
- Returns:
Corrected seven-column table (old_stable_id, old_version, new_stable_id, new_version, score, old_release, new_release), ready for serialization and downstream use.
- Return type:
pandas.DataFrame
Note
This function is deprecated and will be removed in a future major release once the core extractor fully addresses the ordering anomaly.
- create_ids(form)[source]
Retrieve and normalise raw Ensembl identifier records for the requested molecular form.
This method pulls the appropriate MySQL table(s) for form, copes with schema differences across Ensembl releases (e.g. the historical *_stable_id split tables), coerces data types, and standardises column names so that downstream graph-building steps all consume the same shape. It finishes by delegating to DatabaseManager.version_uniformize() to ensure the Version field is either a proper integer or NaN across the entire DataFrame.
- Parameters:
form (str) – Target molecular form - "gene", "transcript", or "translation". Anything else triggers a ValueError.
- Returns:
A de-duplicated, index-reset table whose columns depend on form:
gene - gene_id, gene_stable_id, gene_version
transcript - transcript_id, gene_id, transcript_stable_id, transcript_version
translation - translation_id, transcript_id, translation_stable_id, translation_version
All ID columns are int64 except the *_stable_id strings; version columns are int64 or float64 (with NaN when absent).
- Return type:
pandas.DataFrame
- create_relation_archive()[source]
Retrieve a cross-release gene-transcript-translation mapping table.
This legacy helper pulls the Ensembl gene_archive table, spanning all releases for the current organism, via DatabaseManager.get_table(), drops columns unrelated to identifier mapping, and passes the result to DatabaseManager._create_relation_helper(). Because the archive contains known gaps, the preferred workflow is to call DatabaseManager.create_relation_current() once per release and concatenate the outputs.
- Returns:
Same schema as DatabaseManager.create_relation_current() (gene, transcript, translation), but potentially with missing rows because Ensembl did not always back-populate older releases.
- Return type:
pandas.DataFrame
- create_relation_current()[source]
Build a current-release gene-transcript-translation mapping table.
The routine fetches the raw stable-ID/version tables for genes, transcripts and translations via DatabaseManager.get_db(), merges them into a single wide frame, and then delegates to DatabaseManager._create_relation_helper() to harmonise version columns and compress the information into three canonical node labels ("<stable_id>.<version>"). The resulting mapping is the authoritative per-release link between molecular forms and is consumed by downstream graph-building utilities such as DatabaseManager.create_graph().
- Returns:
Three columns (gene, transcript, and translation) with one row per transcript. The translation column may contain empty strings where non-coding transcripts have no peptide. All data are UTF-8 strings; duplicates are removed and the index is reset.
- Return type:
pandas.DataFrame
- create_release_id()[source]
Return deduplicated stable-identifier/version pairs for the current form and release.
Raw identifiers are fetched via DatabaseManager.get_db(), normalised with DatabaseManager.version_fix(), trimmed to the canonical columns, and sanity-checked. Two integrity rules are enforced: (1) the delimiter idtrack._db.DB.id_ver_delimiter must not appear inside any stable identifier, and (2) every stable identifier must be unique after deduplication. Violations raise ValueError.
- Returns:
Two-column dataframe [{form}_stable_id, {form}_version] with duplicates removed.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If the delimiter is present inside any stable identifier or if identifiers are not unique after deduplication.
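The two integrity rules can be sketched on plain (stable_id, version) tuples. This is an illustration only: the helper name is hypothetical, the delimiter "." is assumed as the value of idtrack._db.DB.id_ver_delimiter, and the real method operates on a pandas dataframe.

```python
def check_release_ids(
    pairs: list[tuple[str, int]], delimiter: str = "."
) -> list[tuple[str, int]]:
    """Deduplicate (stable_id, version) pairs and enforce the two integrity rules."""
    # Remove exact duplicate rows while preserving order.
    unique_pairs = list(dict.fromkeys(pairs))
    stable_ids = [stable_id for stable_id, _ in unique_pairs]

    # Rule 1: the delimiter must not appear inside any stable identifier.
    if any(delimiter in stable_id for stable_id in stable_ids):
        raise ValueError("Delimiter found inside a stable identifier.")
    # Rule 2: each stable identifier must be unique after deduplication.
    if len(set(stable_ids)) != len(stable_ids):
        raise ValueError("Stable identifiers are not unique after deduplication.")
    return unique_pairs


clean_ids = check_release_ids(
    [("ENSG00000000001", 1), ("ENSG00000000001", 1), ("ENSG00000000002", 3)]
)
```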
- create_version_info()[source]
Determine whether each Ensembl release stores identifiers with or without version suffixes.
Ensembl stable identifiers can appear either with a .version facet (e.g. ENSG00000139618.17) or without it (e.g. YAL001C in S. cerevisiae). For robust cross-release tracking the package needs to know which convention applies to every release of the current organism. The method loops over available_releases, downloads the raw identifier table for self.form, and inspects the <form>_version column:
All values NaN → the release uses unversioned identifiers.
No values NaN → the release uses versioned identifiers.
Mixed NaN / non-NaN → unsupported; raises NotImplementedError.
The outcome is encoded as a Boolean flag per release and later consumed by check_version_info() to decide whether version strings should be kept, stripped, or synthesised.
- Returns:
Two-column table with:
ensembl_release - integer release number.
version_info - True if all identifiers lack a version suffix, False if all identifiers include a version suffix.
- Return type:
pandas.DataFrame
- Raises:
NotImplementedError – If any individual release contains a mixture of versioned and unversioned identifiers, indicating an inconsistent upstream annotation.
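The per-release decision can be sketched on a plain list of version values. The helper name is hypothetical, and treating None like NaN is an assumption made so the example runs without pandas.

```python
import math
from typing import Optional


def release_lacks_versions(versions: list[Optional[float]]) -> bool:
    """Return True if every version is missing, False if none is missing."""
    missing = [
        value is None or (isinstance(value, float) and math.isnan(value))
        for value in versions
    ]
    if all(missing):
        return True  # unversioned identifiers
    if not any(missing):
        return False  # versioned identifiers
    raise NotImplementedError("Release mixes versioned and unversioned identifiers.")


nan = float("nan")
flag_unversioned = release_lacks_versions([nan, nan])
flag_versioned = release_lacks_versions([17.0, 3.0])
```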
- download_table(table_key, usecols=None)[source]
Download a raw Ensembl MySQL table and return it as a DataFrame.
The method forms the low-level backbone of all table acquisition in IDTrack. It opens a direct connection to the Ensembl Core (or comparable) MySQL schema configured on the current DatabaseManager instance, issues a SELECT statement against table_key, converts the results into a pandas.DataFrame, and performs a minimal sanitisation pass (bytes-to-string decoding, column subset validation, logging). Public code is expected to call DatabaseManager.get_table(), which wraps this helper with caching and post-processing, but keeping this routine separate allows fine-grained testing, mocking, and reuse in advanced workflows.
- Parameters:
table_key (str) – Name of the raw table as it appears in the remote Ensembl database (e.g. 'gene', 'mapping_session', 'xref'). Must exist in the schema returned by DatabaseManager.mysql_database().
usecols (Optional[list[str]]) – Sequence of column names to project; None retrieves the entire table. Column order is preserved. An empty list is treated the same as None.
- Returns:
A frame containing the requested columns in the exact order supplied via usecols (or all columns if usecols is None). Index is monotonic and zero-based.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If any element of usecols is missing from table_key, or if the query returns binary payloads that cannot be coerced into native Python types.
- property external_inst: ExternalDatabases
Instantiate and cache an ExternalDatabases helper for this manager.
The instance mirrors the configuration of the surrounding DatabaseManager (organism, Ensembl release, identifier form, local repository path, and genome assembly) so that all interactions with external data sources remain consistent throughout the session. Because the property is backed by functools.cached_property, the helper is created exactly once and reused on subsequent accesses, eliminating redundant network or file-system look-ups.
- Returns:
A lazily created, configuration-matched helper object.
- Return type:
ExternalDatabases
- file_name(df_type, *args, ensembl_release=None, **kwargs)[source]
Resolve HDF5 hierarchy key and absolute file path for a dataframe request.
This internal helper centralises every rule that DatabaseManager uses to build HDF5 hierarchy keys and their corresponding on-disk filenames, ensuring that any two call-sites confronted with the same combination of organism, genome assembly, Ensembl release, dataframe kind, and optional column subset produce identical results. By funnelling every I/O operation through this method the wider package avoids silent cache misses, duplicate downloads, and hard-to-trace inconsistencies in downstream analytics. Public code is expected to invoke higher-level wrappers such as DatabaseManager.get_db(); use this routine only when implementing new caching utilities or in low-level tests.
- Parameters:
df_type (str) – Category of dataframe whose name is required. Accepted values are "processed", "mysql", and "common"; any other string triggers ValueError.
ensembl_release (int, optional) – Ensembl release to encode in the filename. If None, the current DatabaseManager.ensembl_release is used instead.
kwargs – Additional keyword arguments forwarded to the helper that handles the selected df_type (currently only usecols for the mysql path).
args –
Positional arguments interpreted according to df_type:
processed - df_indicator (str): symbolic label such as "idhistory" or "idsraw_gene". The manager appends DatabaseManager.form so that artefacts for different biological forms do not collide.
mysql - table_key (str): raw MySQL table name (e.g. "gene", "exon"). An optional usecols (list[str]) must then be supplied via kwargs; the column list is embedded in the hierarchy using the delimiter held in DatabaseManager._column_sep.
common - df_indicator (str): same as the processed case but without the form suffix, allowing cross-form artefacts (e.g. "availabledatabases") to share a single key.
- Returns:
Two-element tuple (hierarchy_key, file_path) where hierarchy_key is the internal node path (e.g. "ens111_mysql_gene_COL_gene_id") and file_path is the absolute path to <local_repository>/<organism>_assembly-<assembly>.h5. The path is not created on disk; callers remain responsible for reading or writing the HDF5 file.
- Return type:
tuple[str, str]
- Raises:
ValueError – If df_type is not one of the accepted categories or if the positional/keyword argument combination does not satisfy the expectations for that category (e.g. missing table_key when df_type is "mysql").
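The naming rule for the mysql case can be illustrated with a standalone function. This is a sketch inferred from the documented example key and file pattern: the function name is hypothetical, the delimiter "_COL_" is read off the example "ens111_mysql_gene_COL_gene_id", and joining several columns with that same delimiter is an assumption.

```python
import os
from typing import Optional


def mysql_hierarchy_key(
    local_repository: str,
    organism: str,
    assembly: int,
    release: int,
    table_key: str,
    usecols: Optional[list[str]] = None,
    column_sep: str = "_COL_",  # assumed value of DatabaseManager._column_sep
) -> tuple[str, str]:
    """Build (hierarchy_key, file_path) for a raw MySQL table request."""
    hierarchy_key = f"ens{release}_mysql_{table_key}"
    if usecols:
        # Assumption: each requested column is appended with the same delimiter.
        hierarchy_key += column_sep + column_sep.join(usecols)
    file_path = os.path.join(local_repository, f"{organism}_assembly-{assembly}.h5")
    return hierarchy_key, file_path


key, path = mysql_hierarchy_key(
    "/tmp/idtrack_repo", "homo_sapiens", 38, 111, "gene", usecols=["gene_id"]
)
```

Deriving both values from the same inputs in one place is what lets two independent call-sites land on the same cache entry.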
- get_db(df_indicator, create_even_if_exist=False, save_after_calculation=True, overwrite_even_if_exist=False)[source]
Retrieve or create a cached data table defined by an indicator string.
This method is the central gateway for all tabular resources managed by DatabaseManager. It interprets a compact indicator string, decides whether the requested table already exists in the local HDF5 repository, and either loads the cached copy or triggers the appropriate builder (create_* helper) to download/assemble it. A consistent naming convention is maintained so that subsequent calls with the same indicator transparently reuse the on-disk cache, ensuring reproducible builds and minimal network traffic.
Supported base indicators
external - cross-reference database registry; optional qualifier relevant | database | relevant-database narrows the view.
idsraw - raw Ensembl identifiers for a given form (gene, transcript, translation); requires the form as qualifier.
ids - release-specific identifier table (no qualifier).
externalcontent - summary of per-database content.
relationcurrent - current gene/ID relationships.
relationarchive - historical gene/ID relationships across releases.
idhistory - full ID history; qualifier narrow restricts to current IDs.
versioninfo - version comparison across releases.
availabledatabases - list of locally cacheable resources.
Additional indicators may be introduced by subclass extensions; consult the module documentation for the authoritative list.
- Parameters:
df_indicator (str) – Compact descriptor of the table to retrieve. Must follow the base[qualifier] pattern described above.
create_even_if_exist (bool) – Force a rebuild/download even if a cached copy is present. Defaults to False.
save_after_calculation (bool) – Persist a newly created table to the local HDF5 store. Has no effect when the table is merely loaded from disk. Defaults to True.
overwrite_even_if_exist (bool) – When saving, replace an existing HDF5 key with the same hierarchy (file-internal path). Defaults to False.
- Returns:
The requested dataset. The exact shape, index, and column layout depend on df_indicator; see the indicator list above for semantic details.
- Return type:
pandas.DataFrame | pandas.Series
- Raises:
ValueError – If df_indicator is malformed, references an unsupported resource, or its qualifier violates the expected pattern (e.g., missing form for idsraw).
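Validation of an indicator string might look like the sketch below. It assumes the base and qualifier are joined by an underscore, as in the "idsraw_gene" label mentioned for file_name(); the rule table and function name are hypothetical and only mirror the indicator list above.

```python
from typing import Optional

# Hypothetical rule table: base -> (allowed qualifiers, qualifier required?)
INDICATOR_RULES: dict[str, tuple[set[str], bool]] = {
    "external": ({"relevant", "database", "relevant-database"}, False),
    "idsraw": ({"gene", "transcript", "translation"}, True),
    "ids": (set(), False),
    "externalcontent": (set(), False),
    "relationcurrent": (set(), False),
    "relationarchive": (set(), False),
    "idhistory": ({"narrow"}, False),
    "versioninfo": (set(), False),
    "availabledatabases": (set(), False),
}


def parse_indicator(df_indicator: str) -> tuple[str, Optional[str]]:
    """Split an indicator into (base, qualifier) and validate both parts."""
    # Assumption: the first underscore separates base from qualifier.
    base, _, qualifier = df_indicator.partition("_")
    if base not in INDICATOR_RULES:
        raise ValueError(f"Unsupported indicator: {df_indicator!r}")
    allowed, required = INDICATOR_RULES[base]
    if not qualifier:
        if required:
            raise ValueError(f"Indicator {base!r} requires a qualifier.")
        return base, None
    if qualifier not in allowed:
        raise ValueError(f"Invalid qualifier {qualifier!r} for {base!r}.")
    return base, qualifier


parsed_idsraw = parse_indicator("idsraw_gene")
parsed_external = parse_indicator("external_relevant-database")
```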
- get_release_date()[source]
Return a mapping of Ensembl release numbers to their publication dates.
The future implementation will query the meta table of each reachable release, or fall back to the Ensembl REST API, to build a dictionary such as {105: date(2022, 11, 1), 106: date(2023, 2, 7), …}. Downstream routines can then translate between absolute dates and release numbers, enabling chronology-aware analyses and reporting.
- Raises:
NotImplementedError – Always - date discovery is not yet implemented.
- get_table(table_key, usecols=None, create_even_if_exist=False, save_after_calculation=True, overwrite_even_if_exist=False)[source]
Download, cache, or read a raw MySQL table for the current release.
A high-level wrapper that coordinates three steps:
Path resolution - determines the HDF5 file and internal key under the local repository that belong to table_key (and usecols, if provided).
Fetch or reuse - if the target key is absent, unreadable, or forcibly refreshed, delegates to download_table() to query the MySQL server; otherwise loads the dataframe from disk.
Persistence - optionally stores the freshly downloaded dataframe back to disk, shrinking the number of future network calls.
- Parameters:
table_key (str) – Name of the MySQL table (e.g. "gene", "xref", "mapping_session").
usecols (list[str] | None) – Column subset to retrieve. None (default) selects all columns.
create_even_if_exist (bool) – Ignore any on-disk cache and re-download the table unconditionally.
save_after_calculation (bool) – Persist the dataframe to the computed HDF5 path when True.
overwrite_even_if_exist (bool) – Replace an existing HDF5 key even when it is already present.
- Returns:
- The requested raw table with column order mirroring usecols when supplied,
otherwise the server’s natural order.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If usecols is an empty list, not a list, or otherwise fails basic validation.
- id_ver_from_df(dbm_the_ids)[source]
Assemble fully qualified node names from a stable-ID / version DataFrame.
This convenience routine converts a two-column frame—usually produced by
DatabaseManager.get_db()with the ids form—into the canonical node labels used throughout ID-track graphs (e.g.ENSG00000000001.1). It first validates that the input columns matchself._identifiers(typically["gene_stable_id", "gene_version"]or analogous for the currentform), then delegates per-row processing toDatabaseManager.node_dict_maker()andDatabaseManager.node_name_maker(). The resulting list may be fed directly into downstream graph builders or written to disk for later reuse.- Parameters:
dbm_the_ids (pandas.DataFrame) – Two-column frame containing the stable identifiers and their Ensembl version numbers. The column order and names must exactly match
self._identifiers; otherwise an exception is raised.- Returns:
- Ordered list where each element is either
"<ID>.<version>"when a valid numeric version is present or simply"<ID>"when the version is None / NaN / an alternative marker (seeidtrack._db.DB.alternative_versions).
- Return type:
list[str]
- Raises:
ValueError – If
dbm_the_idsdoes not contain the expected column names stored inself._identifiers.
- property mysql_database: str
Return the canonical Ensembl Core schema name for the current organism, release, and assembly.
The schema naming convention is deterministic:
{organism}_core_{ensembl_release}_{genome_assembly}[<patch>]For multi-port assemblies (e.g. sus_scrofa assembly
102), the port is selected in__init__()using_get_core_db_index(). Once the release is validated to exist on that chosen port, the schema name itself does not require another server-side discovery query.- Returns:
Schema name like
"homo_sapiens_core_111_38".- Return type:
str
- Raises:
ValueError – If the current release is not available for this (organism, assembly) pair.
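The naming convention above can be illustrated with a short sketch. The helper name `core_schema_name` is hypothetical (it is not part of IDTrack), and the optional `<patch>` suffix mentioned in the pattern is deliberately left out because its exact form is not described here.

```python
def core_schema_name(organism: str, ensembl_release: int, genome_assembly: int) -> str:
    """Build a schema name following the documented pattern (hypothetical helper).

    The optional patch suffix from the pattern is intentionally omitted.
    """
    return f"{organism}_core_{ensembl_release}_{genome_assembly}"


print(core_schema_name("homo_sapiens", 111, 38))  # homo_sapiens_core_111_38
```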
- static node_dict_maker(id_entry, version_entry)[source]
Return a normalized ID/Version dictionary from raw column values.
This helper creates the canonical structure consumed by
DatabaseManager.node_name_maker()and higher-level graph utilities, ensuring that version numbers are strictly integers whenever possible. It also recognises special placeholders defined inidtrack._db.DB.alternative_versions(e.g."Retired"or"Void") and passes them through unchanged so that downstream code can handle deprecated or missing entries appropriately.- Parameters:
id_entry (str) – Stable identifier portion preceding the delimiter (e.g.
"ENSG00000000001").version_entry (Any) – Raw version value following the delimiter (e.g.
1in"ENSG00000000001.1"). May be float, int, str, None, NaN, or an alternative placeholder such as"Retired".
- Returns:
{"ID": id_entry, "Version": version_entry}with Version coerced to int when it represents a whole number.- Return type:
dict[str, Any]
- Raises:
ValueError – If
version_entryis numeric but contains a fractional component (e.g.1.2), indicating a malformed identifier that cannot be represented as an integer version.
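As a rough illustration of the normalisation described above, the following sketch coerces whole-number versions to int and rejects fractional ones. It is not the library's implementation: the function name `node_dict_sketch` is hypothetical, the placeholder set is assumed from the examples in the text ("Retired", "Void"), and passing None / NaN through unchanged is an assumption.

```python
import math
from typing import Any

# Assumed placeholder values; the real set lives in idtrack._db.DB.alternative_versions.
ALTERNATIVE_VERSIONS = {"Retired", "Void"}


def node_dict_sketch(id_entry: str, version_entry: Any) -> dict:
    """Return {"ID": ..., "Version": ...} with whole-number versions as int."""
    # Missing values are passed through unchanged (assumption).
    if version_entry is None:
        return {"ID": id_entry, "Version": None}
    if isinstance(version_entry, float) and math.isnan(version_entry):
        return {"ID": id_entry, "Version": version_entry}
    # Placeholders such as "Retired" are passed through unchanged.
    if version_entry in ALTERNATIVE_VERSIONS:
        return {"ID": id_entry, "Version": version_entry}
    as_float = float(version_entry)
    if not as_float.is_integer():
        raise ValueError(f"Fractional version {version_entry!r} for {id_entry!r}")
    return {"ID": id_entry, "Version": int(as_float)}


print(node_dict_sketch("ENSG00000000001", 1.0))  # {'ID': 'ENSG00000000001', 'Version': 1}
```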
- static node_name_maker(node_dict)[source]
Concatenate ID and Version into a single node label.
Given the miniature dictionary returned by
DatabaseManager.node_dict_maker(), this helper builds the string representation that uniquely identifies a biological entity within the graph layer. When a numeric version is available, it appends that value to the stable ID usingidtrack._db.DB.id_ver_delimiter("."by default). For organisms or datasets lacking versioned identifiers, it falls back to the bare stable ID to preserve compatibility.- Parameters:
node_dict (dict[str, Any]) – Mapping with exactly two keys,
"ID"and"Version", as produced byDatabaseManager.node_dict_maker().- Returns:
- Either
"<ID>.<version>"or"<ID>"depending on whether a non-null, non-alternative version is present.
- Return type:
str
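The labelling rule can be sketched as follows. This is a minimal illustration, not the library code: `node_name_sketch` is a hypothetical name, the delimiter is the documented default ".", and the placeholder set is assumed from the examples given for idtrack._db.DB.alternative_versions.

```python
ID_VER_DELIMITER = "."  # documented default of DB.id_ver_delimiter
ALTERNATIVE_VERSIONS = {"Retired", "Void"}  # assumed placeholder values


def node_name_sketch(node_dict: dict) -> str:
    """Join ID and Version, falling back to the bare ID when no numeric version exists."""
    version = node_dict["Version"]
    # NaN is the only value that is not equal to itself.
    is_missing = version is None or version != version
    if is_missing or version in ALTERNATIVE_VERSIONS:
        return node_dict["ID"]
    return f"{node_dict['ID']}{ID_VER_DELIMITER}{version}"


print(node_name_sketch({"ID": "ENSG00000000001", "Version": 1}))  # ENSG00000000001.1
```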
- tables_in_disk()[source]
List all dataframes cached for this manager on local disk.
The helper inspects the HDF5 file located at the path generated by
file_name()(df_type="common") and returns every key it contains. When the file does not exist yet, an empty list is returned instead of raising.- Returns:
Sorted HDF5 keys corresponding to dataframes already materialised for this manager.
- Return type:
list[str]
- version_fix(df, version_str, version_info=None)[source]
Apply a global ID-version policy to a DataFrame.
Depending on the organism and its historical annotation quirks, identifiers may (1) never include a version, (2) always include a version, or (3) require a synthetic version when mixing cross-release data. The version_info flag encodes that policy:
"without_version": strip all versions (set column to NaN).
"with_version": cast column to int64 (all values must exist).
"add_version": fill missing entries with DB.first_version.
- Parameters:
df (pandas.DataFrame) – Frame whose version_str column needs harmonising.
version_str (str) – Name of the column that stores version numbers.
version_info (Optional[str]) – One of
"add_version","without_version", or"with_version". WhenNone(default) the method callscheck_version_info()to determine the correct policy automatically.
- Returns:
Same object df with version_str updated in-place.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If version_info is not recognised.
- version_fix_incomplete(df_fx, id_col_fx, ver_col_fx)[source]
Clean up version columns when some identifiers are entirely missing.
Ensembl translation tables occasionally encode parent IDs without a version while descendants retain one, producing frames where id_col_fx is
NaNbut ver_col_fx contains a number. This helper splits the frame, delegates toversion_fix()for each subset, then stitches the pieces back together so that every row obeys a single “with/without/add version” policy.- Parameters:
df_fx (pandas.DataFrame) – Data to harmonise. The frame must include id_col_fx and ver_col_fx.
id_col_fx (str) – Column holding the stable part of the identifier (e.g.
"translation_id").ver_col_fx (str) – Column holding the integer version suffix.
- Returns:
- Frame whose ver_col_fx is consistent with the organism-level policy
determined by
check_version_info().
- Return type:
pandas.DataFrame
- version_uniformize(df, version_str)[source]
Normalise a Version column so every entry is either an
intorNaN.This post-processing helper finalises the output of
DatabaseManager.create_ids(). Ensembl releases differ: some assign an explicit integer version to every stable identifier, whereas others omit the suffix entirely. Downstream code expects a uniform dtype, so this routine coerces the designated column to a proper integer when all entries are present or fills the entire column withnp.nanwhen none are. Mixed presence is forbidden because it would break the ID-version pairing logic used byDatabaseManager.node_name_maker().- Parameters:
df (pandas.DataFrame) – Frame returned by
create_ids(); must already contain a column named version_str.version_str (str) – Name of the column that holds version information (e.g.
"gene_version").
- Returns:
- Same object df with version_str either cast to
int64or overwritten with
np.nanfor every row.
- Return type:
pandas.DataFrame
- Raises:
NotImplementedError – If some rows have a version and others do not, indicating an Ensembl release with inconsistent schema. Such a release is currently unsupported.
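The all-or-nothing rule described for version_uniformize() can be sketched with pandas as below. This is an assumption-laden illustration rather than the actual method: `uniformize_sketch` is a hypothetical name, and the handling of an empty frame (treated as "no versions present") is a guess.

```python
import numpy as np
import pandas as pd


def uniformize_sketch(df: pd.DataFrame, version_str: str) -> pd.DataFrame:
    """Cast the version column to int64, or blank it out, but never accept a mix."""
    missing = df[version_str].isna()
    if missing.all():
        # No row carries a version: overwrite the whole column with NaN.
        df[version_str] = np.nan
    elif not missing.any():
        # Every row carries a version: enforce a proper integer dtype.
        df[version_str] = df[version_str].astype("int64")
    else:
        raise NotImplementedError("Mixed presence of versions is not supported.")
    return df


frame = pd.DataFrame({"gene_stable_id": ["ENSG01", "ENSG02"], "gene_version": [1.0, 2.0]})
print(uniformize_sketch(frame, "gene_version")["gene_version"].dtype)  # int64
```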
- Parameters:
organism (str)
form (str)
local_repository (str)
ensembl_release (int | None)
ignore_before (int | None)
ignore_after (int | float | None)
store_raw_always (bool)
genome_assembly (int | None)
- class ExternalDatabases(organism, ensembl_release, form, local_repository, genome_assembly)[source]
Bases:
object
Manage third-party metadata for Ensembl entities through YAML side-car files.
This helper encapsulates everything related to the external (i.e. non-Ensembl) databases that can be linked to a given organism, genome assembly, release, and biological form (gene / transcript / translation). Examples of such resources include ArrayExpress, RefSeq, Uniprot, HGNC, and dozens of smaller annotation providers. Rather than hard-coding those relationships, the wider ID-Track toolkit stores them in a human-readable YAML file that lives next to the local data cache managed by
_database_manager.DatabaseManager.
The YAML workflow is:
create_template_yaml() enumerates every known combination and writes a template where each entry is marked Include: false.
A user (or an automated post-processing step) reviews the template, toggling Include to true for the resources they need.
The modified YAML is saved under local_repository; subsequent calls to load_modified_yaml() return it as a plain dict for downstream logic.
validate_yaml_file_up_to_date() warns if the user file lags behind a newer template (e.g. because a later Ensembl release introduced extra tables).
Utility helpers such as give_list_for_case() expose convenient filtered views, e.g. all databases that should be downloaded for the current form, or all releases supported by assembly 38.
In short, ExternalDatabases provides a single, version-controlled “contract” describing which third-party tables belong in an ID-track run, while granting users explicit opt-in control over optional resources.
Instantiate a YAML controller tied to a specific organism, release, and assembly.
The constructor mirrors the core configuration of
_database_manager.DatabaseManagerso that both objects operate on the exact same coordinate system. No I/O is performed at construction time; paths are merely recorded, and loggers are configured. Heavy-weight actions—such as scanning the cache for existing YAMLs or writing new ones—happen lazily when the corresponding methods are called.- Parameters:
organism (str) – Canonical Ensembl species identifier in snake_case (e.g.
"homo_sapiens"). Case-insensitive but must match Ensembl conventions.ensembl_release (int) – Target Ensembl release number (e.g.
110). Must correspond to a release that actually exists for organism and genome_assembly.form (str) – Entity level—
"gene","transcript", or"translation". Any other value raisesValueErrorin higher-level validation.local_repository (str) – Writable directory where YAML files and downloads are cached. The directory need not pre-exist; if missing, most public methods will attempt to create it.
genome_assembly (int) – Genome assembly code as used in Ensembl core schema naming (e.g.
38= human GRCh38,37= human GRCh37,39= mouse GRCm39,111= pig Sscrofa11.1). Used to disambiguate multiple assemblies available for the same organism/release pair.
- create_template_yaml(df)[source]
Generate a template YAML enumerating external-database options.
This helper scans df—typically the dataframe returned by
idtrack._database_manager.DatabaseManager.create_database_content()—and writes a scaffold configuration file tofile_name_template_yaml(). The file lists every organism → form → database combination observed in df, grouped by genome assembly and Ensembl release. For each entry the template records whether the database should be included when building an ID-history graph, its integer Database Index, and an empty Potential Synonymous placeholder that future versions may use to flag overlapping resources.Users are expected to edit the generated file—changing
Includefromfalsetotruewhere appropriate—and rename it by appending_modifiedto the filename before the package will load it. A warning to that effect is emitted vialogging.Logger.warning().The resulting YAML resembles the structure below (truncated for brevity):
    homo_sapiens:
      gene:
        ArrayExpress:
          Assembly:
            "37":
              Ensembl release: 79,80,81,82,83,84,85,86,87,88,89
              Include: false
            "38":
              Ensembl release: 79,80,81,82,83,84,85,86,87,88,89
              Include: false
          Database Index: 0
          Potential Synonymous: ""
        Clone-based (Ensembl):
          Assembly:
            "37":
              Ensembl release: 79,80,81,82,83,84,85
              Include: false
            "38":
              Ensembl release: 79,80,81,82,83,84,85
              Include: false
          Database Index: 5
          Potential Synonymous: ""
Editing guidelines
Set
Includeto true for every assembly of the databases you need.Save the edited file with
_modifiedappended to the base name so that downstream routines load the customised version.
- Parameters:
df (pandas.DataFrame) – Dataframe containing at least the columns
["organism", "form", "name_db", "assembly", "release"]. It should be produced byidtrack._database_manager.DatabaseManager.create_database_content()so that the expected schema is guaranteed.- Raises:
ValueError – If df contains duplicate assembly entries for the same organism/form/database triple, causing an internal consistency check to fail.
Notes
Potential Synonymous is currently empty for every entry; a future version is intended to use it to prevent synonymous databases from appearing in the list. Likewise, Database Index currently has no use in the program. It is important to follow the final warning emitted by the method: “Please edit the file based on requested external databases and add ‘_modified’ to the file name.” Edit the file by changing the Include fields from false to true. It is recommended to make the change for each assembly of a given database.
- file_name_modified_yaml(mode)[source]
Resolve the path to a modified YAML file customised by the user or shipped with the package.
The method supports two modes that map to different storage locations:
"configured" - the user-edited file living in ExternalDatabases.local_repository.
"default" - the read-only fallback bundled under <package_root>/default_config for quick starts and unit tests.
By funnelling every lookup through this routine, higher-level helpers such as
ExternalDatabases.load_modified_yaml()remain agnostic about the underlying directory structure and can focus on validation and parsing instead.- Parameters:
mode (str) – Either
"configured"or"default"selecting the corresponding search location.- Returns:
Absolute path of the requested YAML file.
- Return type:
str
- Raises:
ValueError – If mode is not one of the recognised values.
- file_name_template_yaml()[source]
Return absolute path to the template YAML configuration file.
A helper that deterministically builds the filename used by
ExternalDatabases.create_template_yaml()when it first scaffolds the external-database configuration for organism. Centralising the logic here keeps every component of idtrack that may need the path (tests, CLI tools, future maintenance scripts) in perfect sync with a single implementation. The method performs no I/O; it merely concatenatesExternalDatabases.local_repositoryand the conventional filename pattern"<organism>_externals_template.yml"so callers can decide whether to create, read, or overwrite the file.- Returns:
Absolute path of
<organism>_externals_template.ymllocated insideExternalDatabases.local_repository.- Return type:
str
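The path construction described above amounts to a simple join. The sketch below uses a hypothetical free function, `template_yaml_path`, in place of the method, and assumes the result is normalised with os.path.abspath.

```python
import os


def template_yaml_path(local_repository: str, organism: str) -> str:
    """Join the repository directory with the conventional template filename."""
    filename = f"{organism}_externals_template.yml"
    # No I/O happens here; the file may or may not exist yet.
    return os.path.abspath(os.path.join(local_repository, filename))


print(os.path.basename(template_yaml_path("repo", "homo_sapiens")))
# homo_sapiens_externals_template.yml
```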
- give_list_for_case(give_type)[source]
Return database names or assembly codes extracted from the external-DB YAML file.
The helper provides a lightweight way for higher-level components (e.g.
DatabaseManager) to discover which external resources—or which genome assemblies—are currently eligible according to the user-editable YAML configuration created byExternalDatabases.create_template_yaml(). Instead of forcing the caller to parse the YAML structure manually, the method filters the entries for the manager’s organism, form, Ensembl release and genome assembly and returns the requested slice.- Parameters:
give_type (str) –
Kind of list to return. Accepted values are:
"db": external-database names (str) whose Include flag is true for the current organism, form, assembly and Ensembl release.
"assembly": genome-assembly codes (int) for which the YAML enables at least one external database (Include: true) at the current Ensembl release.
- Returns:
When give_type is
"db", a list of database names.When give_type is
"assembly", a list of assembly codes.
- Return type:
list[str] | list[int]
- Raises:
ValueError – If give_type is neither "db" nor "assembly", or if an unexpected internal inconsistency is encountered while traversing the YAML structure.
- load_modified_yaml()[source]
Load the user-edited or default YAML configuration and verify release compatibility.
This convenience wrapper searches for the configured YAML file first; if it does not exist or lacks read permissions a warning is logged and the default YAML file shipped with the package is tried instead. Failure to locate either file aborts the process with
FileNotFoundError. After loading, the method delegates toExternalDatabases.validate_yaml_file_up_to_date()to ensure that the currently requested Ensembl release is represented in the configuration.- Returns:
Parsed YAML content keyed by
{organism → form → database → Assembly → {...}}.- Return type:
dict
- Raises:
FileNotFoundError – If neither the configured nor the default YAML file can be accessed.
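Once loaded, the plain dict can be filtered in the way give_list_for_case() describes for its "db" mode. The following is only a sketch under stated assumptions: the function names are hypothetical, the sample data is invented for illustration, assembly keys are assumed to be strings (as quoted in the template), and the "Ensembl release" field is assumed to arrive as a comma-separated string.

```python
def parse_releases(field) -> set:
    """Parse a comma-separated release field such as "79,80,81" (assumed format)."""
    return {int(tok) for tok in str(field).split(",") if tok.strip()}


def included_databases(config: dict, organism: str, form: str, assembly: int, release: int) -> list:
    """List databases whose Include flag is true for the given assembly and release."""
    result = []
    for db_name, entry in config[organism][form].items():
        per_assembly = entry["Assembly"].get(str(assembly))
        if per_assembly is None:
            continue  # database not available on this assembly
        if per_assembly["Include"] and release in parse_releases(per_assembly["Ensembl release"]):
            result.append(db_name)
    return sorted(result)


# Invented sample mirroring the template layout.
config = {
    "homo_sapiens": {
        "gene": {
            "ArrayExpress": {
                "Assembly": {
                    "37": {"Ensembl release": "79,80,81", "Include": False},
                    "38": {"Ensembl release": "79,80,81", "Include": True},
                },
                "Database Index": 0,
                "Potential Synonymous": "",
            },
            "Clone-based (Ensembl)": {
                "Assembly": {"38": {"Ensembl release": "79,80", "Include": True}},
                "Database Index": 5,
                "Potential Synonymous": "",
            },
        }
    }
}

print(included_databases(config, "homo_sapiens", "gene", 38, 81))  # ['ArrayExpress']
```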
- validate_yaml_file_up_to_date(read_yaml_file)[source]
Assert that the YAML configuration lists the active Ensembl release.
The external-database mapping evolves with each Ensembl release. This helper extracts the set of releases encoded in read_yaml_file—no matter how deeply nested—and verifies that
ExternalDatabases.ensembl_releaseis present. Triggering an exception here prevents downstream graph-construction logic from silently operating on incomplete or outdated metadata, prompting users to regenerate or update the YAML file before proceeding.- Parameters:
read_yaml_file (dict) – Dictionary produced by
ExternalDatabases.load_modified_yaml()containing the loaded YAML structure.- Raises:
ValueError – If the current Ensembl release is absent from the YAML configuration.
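The "no matter how deeply nested" extraction can be sketched as a small recursive walk. This is an illustration only: the function names are hypothetical, the sample dict is invented, and the "Ensembl release" field is assumed to be a comma-separated string.

```python
def collect_releases(node) -> set:
    """Recursively gather every release listed under an "Ensembl release" key."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ensembl release":
                found.update(int(tok) for tok in str(value).split(",") if tok.strip())
            else:
                found |= collect_releases(value)
    return found


def assert_release_listed(config: dict, ensembl_release: int) -> None:
    """Raise ValueError when the active release is absent from the configuration."""
    if ensembl_release not in collect_releases(config):
        raise ValueError(f"Release {ensembl_release} is not listed in the YAML configuration.")


# Invented sample mirroring the template layout.
sample = {
    "homo_sapiens": {
        "gene": {
            "ArrayExpress": {
                "Assembly": {"38": {"Ensembl release": "79,80,81", "Include": True}},
                "Database Index": 0,
            }
        }
    }
}

assert_release_listed(sample, 80)  # passes silently
```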
- class TheGraph(*args, **kwargs)[source]
Bases:
MultiDiGraph
Represent a bio-identifier multigraph with IDTrack-specific helpers.
The class extends
networkx.MultiDiGraphto model historical and cross-reference relationships between Ensembl identifiers (genes, transcripts, translations) and third-party database accessions (UniProt, RefSeq, …). It is built byidtrack._graph_maker.GraphMaker, then queried byidtrack.Trackfor high-performance path-finding across Ensembl releases and external resources.Additional cached properties (e.g.
rev,combined_edges, andhyperconnective_nodes) collapse expensive aggregate calculations into single attribute look-ups, while helpers such asattach_included_forms()record which biological forms were merged into a particular instance. Together these conveniences allow downstream algorithms to traverse millions of edges without the memory overhead of duplicating graphs or recomputing summaries.Instantiate the multigraph and configure package logging.
All positional and keyword arguments are forwarded verbatim to
networkx.MultiDiGraph, allowing callers to pre-seed the graph with nodes, edges, or name/metadata attributes exactly as they would with a vanilla NetworkX constructor. After delegating to super().__init__, the method initialises two convenience attributes:
log: a dedicated logging.Logger named "the_graph" for structured, per-instance diagnostics.
available_forms: a placeholder set to None until attach_included_forms() is called by idtrack._graph_maker.GraphMaker.
- Parameters:
args (Any) – Positional arguments accepted by
networkx.MultiDiGraph.__init__().kwargs (Any) – Keyword arguments accepted by
networkx.MultiDiGraph.__init__().
- _attach_included_forms(available_forms)[source]
Record which Ensembl forms are present in the merged graph.
Graphs for gene, transcript, and protein are first built independently by
GraphMakerand then merged into a singleTheGraphinstance. This helper runs after that merge to store the subset of forms that actually made it into the final graph—information required by several cached properties (e.g.available_external_databases) for consistency checks and downstream analyses. Calling the method before the merge would mis-report available forms and corrupt those caches.- Parameters:
available_forms (list[str]) – Exact list of included forms (typically
["gene", "transcript", "protein"]). Order is preserved so callers can rely on a deterministic iteration sequence.- Return type:
None
- static _combined_edges(node_list, the_graph)[source]
Aggregate database/assembly/release metadata for the edges of node_list.
The routine is the work-horse behind the
TheGraph.combined_edgesfamily of cached properties. It iterates over every node in node_list, inspects each outgoing (or, when the_graph is a reversed view, incoming) edge, and builds a deterministic description of which external database, genome assembly, and Ensembl release the connection originates from.Edges that link two nodes of the same node-type are ignored so that backbone history links (gene ↔ gene, transcript ↔ transcript, …) do not pollute the output (as tested in
idtrack._track_tests.TrackTest.is_edge_with_same_nts_only_at_backbone_nodes()). For edges whose database key is one of the generic Ensembl forms (ensembl_gene,ensembl_transcript, …) the key is rewritten to the assembly-specific variant (e.g.assembly_38_ensembl_gene) to keep assemblies logically separate in downstream analyses.- Parameters:
node_list (NodeView | list[str]) – Nodes whose edge metadata will be consolidated. Accepts either a plain list or the
networkxview returned bygraph.nodes.the_graph (nx.MultiDiGraph) – Graph to inspect. Pass
selffor the native orientation orself.revwhen a reverse walk is required.
- Returns:
- Mapping
{node: {database: {assembly: set[int]}}}that summarises every admissible edge attached to the requested nodes.
- Return type:
dict
- static _combined_edges_genes_helper(the_result)[source]
Merge per-neighbour edge metadata for gene-centric queries.
This helper is used exclusively by
TheGraph.combined_edges_genes() and TheGraph.combined_edges_assembly_specific_genes() to post-process the dictionaries returned by TheGraph._combined_edges(). Because backbone gene nodes have no outgoing edges except to other gene nodes, the caller invokes TheGraph._combined_edges() on a reversed graph and receives one nested dictionary per neighbour. The present routine
- Flattens those per-neighbour sub-dicts so that information from multiple neighbours of the same external database and assembly is unified.
- Re-labels the generic ensembl_gene key to the assembly-qualified form assembly_<N>_ensembl_gene so that the provenance of every entry remains explicit and consistent with the rest of the code base.
- Parameters:
the_result (dict) – Nested mapping produced by
TheGraph._combined_edges()for a single gene node. The structure is{neighbour: {database: {assembly: set[int]}}}.- Returns:
- Collapsed mapping
{database: {assembly: set[int]}}where all neighbour-level dictionaries have been merged and database names have been renamed to their assembly-specific counterparts when appropriate.
- Return type:
dict
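The flatten-and-relabel step can be pictured with the sketch below. It is a simplified stand-in, not the library routine: `merge_neighbour_edges` is a hypothetical name, the node labels in the sample are invented, and the relabelling rule is inferred from the description (only the literal key ensembl_gene is rewritten).

```python
from collections import defaultdict


def merge_neighbour_edges(the_result: dict) -> dict:
    """Collapse {neighbour: {database: {assembly: set}}} into {database: {assembly: set}}."""
    merged = defaultdict(lambda: defaultdict(set))
    for per_database in the_result.values():
        for database, per_assembly in per_database.items():
            for assembly, releases in per_assembly.items():
                # Qualify the generic key with its assembly (assumed naming rule).
                if database == "ensembl_gene":
                    name = f"assembly_{assembly}_ensembl_gene"
                else:
                    name = database
                merged[name][assembly] |= releases
    # Convert back to plain dicts for a predictable return type.
    return {db: dict(assemblies) for db, assemblies in merged.items()}


example = {
    "neighbour_a": {"ensembl_gene": {38: {100, 101}}},
    "neighbour_b": {"ensembl_gene": {38: {102}}, "uniprot": {38: {100}}},
}
print(merge_neighbour_edges(example))
```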
- _get_active_ranges_of_id(input_id)[source]
Compute Ensembl-release ranges for a single identifier, choosing logic by node type.
This private helper inspects the
input_id and dispatches to an internal routine tailored to the node’s role in the graph:
_get_active_ranges_of_id_backbone() - deals with backbone nodes that form the primary versioned lineage.
_get_active_ranges_of_id_nonbackbone() - handles assembly-specific or auxiliary identifiers recorded in one of the combined-edges lookup tables.
- Parameters:
input_id (str) – Identifier whose life-span across Ensembl releases is requested. Must exist in
nodes.- Returns:
- Ordered, non-overlapping
[[start_rel, end_rel], …] where both ends are inclusive.
- Return type:
list[list[int]]
- _node_trios(the_id)[source]
Compute all origin trios for a single node.
The routine identifies the node-type, chooses the appropriate combined_edges cache, and expands any Ensembl release ranges so that every individual release is represented. Alternative-assembly backbone genes and assembly-specific genes receive special handling to ensure the correct database label is recorded.
- Parameters:
the_id (str) – Canonical node name used inside the graph.
- Returns:
- Unique triples
(<database>, <assembly>, <release>)describing every context in which the_id occurs.
- Return type:
set[tuple[str, int, int]]
- property available_external_databases: set[str]
Return the set of external databases represented in the graph.
This helper inspects every node whose node-type flag matches
idtrack._db.DB.nts_externaland records the database name attached to the outbound edges. The resulting set is cached so that downstream routines—such as validating user-supplied database names or determining which third-party resources must be fetched—can query the information in O(1) time instead of re-scanning the graph.- Returns:
- Unique names of all third-party (non-Ensembl) databases present
in the current
TheGraphinstance.
- Return type:
set[str]
- property available_external_databases_assembly: dict[int, set[str]]
Return external databases available for each genome assembly.
For every assembly identifier in
available_genome_assemblies, this method gathers the subset of external databases that are connected—directly or indirectly—to nodes annotated with that assembly. The per-assembly view is vital when users need to restrict conversions to genomes with consistent annotation coverage (e.g., choosing GRCh38-only resources for a human data set).- Returns:
- Mapping from assembly number (for example
37or 38) to the set of external databases that have at least one entry linked to that assembly.
- Return type:
dict[int, set[str]]
- property available_genome_assemblies: set[int]
Return the set of genome assemblies represented in the current graph.
The helper scans every identifier edge table cached on the instance (e.g.
combined_edges,combined_edges_genes,combined_edges_assembly_specific_genes) and extracts the assembly component of each edge key. It therefore answers the question “Which genome builds does this graph actually know about?” Several public utilities depend on this information when validating user-supplied assembly arguments or iterating across assemblies in reproducible order (seeDB.assembly_mysqlport_priorityfor organism-scoped priorities).- Returns:
Unique genome assembly identifiers (e.g.
38for human GRCh38,37for human GRCh37,39for mouse GRCm39) present anywhere in the graph.- Return type:
set[int]
- property available_releases_given_database_assembly: dict[tuple[str, int], set]
Map (database, assembly) pairs to the Ensembl releases in which they occur.
This expensive, cached property lets callers quickly answer “Which Ensembl releases contain at least one node from database D on assembly A?” Internally it delegates the per-pair work to the nested
available_releases_given_database_assembly._inline_available_releases()helper, then augments the mapping with additional information gleaned from severalidtrack.DBlook-ups (e.g.DB.nts_assembly,DB.nts_base_ensembl,DB.nts_ensembl). Although heavy, the routine is indispensable for test suites and diagnostic notebooks that must reason about historical coverage across many releases.- Returns:
- A dictionary whose keys are (database_name, assembly)
tuples and whose values are the sets of Ensembl release numbers in which that pair is represented.
- Return type:
dict[tuple[str, int], set[int]]
- calculate_caches(for_test=False)[source]
Eagerly materialise every
@cached_propertyto prime the cache.Accessing a cached property for the first time triggers an expensive computation. Batch-loading all of them up-front improves latency for subsequent graph queries and simplifies unit-test expectations because no additional properties are computed lazily in the background.
The optional for_test flag activates a few heavyweight diagnostics that are normally skipped in production but useful for test suites and profiling.
- Parameters:
for_test (bool) – If True, also compute caches that exist solely for testing or sanity-check purposes (e.g. external_database_connection_form). When False (the default, per the signature), only the properties required at run-time are warmed.
- Return type:
None
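The cache-priming idea can be shown with a toy class built on functools.cached_property. Everything here is illustrative: `CachedDemo` and its two properties are invented names, and the split between run-time and test-only caches is an assumption about how the for_test flag is used.

```python
from functools import cached_property


class CachedDemo:
    """Toy stand-in showing how touching cached properties primes them."""

    def __init__(self):
        self.computed = []  # records which properties were actually evaluated

    @cached_property
    def runtime_summary(self):
        self.computed.append("runtime_summary")
        return {"nodes": 3}

    @cached_property
    def test_only_summary(self):
        self.computed.append("test_only_summary")
        return {"checked": True}

    def calculate_caches(self, for_test=False):
        # Accessing a cached property evaluates it once and stores the result.
        _ = self.runtime_summary
        if for_test:
            _ = self.test_only_summary


demo = CachedDemo()
demo.calculate_caches()
print(demo.computed)  # ['runtime_summary']
```

Calling calculate_caches() a second time does not recompute anything, because the stored values are returned directly from the instance.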
- property combined_edges: dict
Aggregate outgoing-edge metadata for every non-gene node in the graph.
This cached view pre-computes, for each backbone or external identifier, which external databases, genome assemblies, and Ensembl releases are reachable through outgoing edges—while purposely excluding Ensembl gene and assembly-specific gene nodes. The summary accelerates synonym search and other traversal routines in
idtrack.track.Track.pathfinder()because consumers can consult a compact dictionary instead of repeatedly iterating raw NetworkX edges and attributes.- Returns:
Nested mapping of the form
{node_name: {database_name: {assembly: set[int]}}}, where
- node_name (str) - Identifier of the start node whose edges were inspected.
- database_name (str) - Canonical name of the external database or Ensembl sub-type (e.g. uniprot, refseq_rna, assembly_x_ensembl_gene).
- assembly (str) - UCSC-style assembly label (e.g. GRCh38); None when the edge is not assembly-scoped.
- set[int] - Collection of Ensembl release numbers in which the connection is valid.
- Return type:
dict
Notes
Edges that link two nodes of the same node-type are ignored, ensuring the dictionary focuses on cross-type relationships that matter for ID translation.
- property combined_edges_assembly_specific_genes: dict
Aggregate incoming‐edge metadata for assembly-specific Ensembl gene nodes.
Assembly-specific gene identifiers (e.g.
GRCh37:ENSG00000123456) represent loci that differ between reference builds. This property mirrors the logic ofcombined_edges_genes()but targets nodes not captured by that property, ensuring the three cached dictionaries are mutually exclusive and collectively exhaustive. Because each such gene belongs to exactly one assembly, the returned structure always contains a single assembly key per outer node.- Returns:
- Mapping
{assembly_specific_gene_id: {database_name: {assembly: set[int]}}}where the sole assembly key matches the assembly implied by the node’s own identifier.
- Return type:
dict
- property combined_edges_genes: dict
Aggregate incoming-edge metadata for Ensembl gene nodes.
Gene nodes only possess incoming edges (toward the gene); therefore the calculation traverses the graph in reverse (
self.rev) to collect equivalent information tocombined_edges(), but restricted solely to nodes whoseidtrack._db.DB.node_type_strisDB.nts_ensembl["gene"]. The result merges edge data from all contributing external databases so that downstream callers receive one consolidated view per gene.- Returns:
- Nested mapping
{gene_id: {database_name: {assembly: set[int]}}}. A single gene may appear under multiple assemblies when reference genomes share that transcript locus.
- Return type:
dict
- static compact_ranges(list_of_ranges)[source]
Collapse adjacent or touching integer ranges into the smallest possible set.
In the IDTrack graph every Ensembl identifier is active for one or more contiguous release intervals. Storing those intervals as
[[start, end], …] is convenient but can become redundant when consecutive ranges abut each other. compact_ranges performs a single O(n) forward sweep that merges any pair of ranges where the gap between end of the first and start of the next is ≤ 1, returning a new list that covers the exact same discrete releases with the fewest possible intervals. The helper is a cornerstone for many caching utilities (e.g. TheGraph.get_active_ranges_of_id()) and is therefore optimised for speed and minimal allocations.
- Parameters:
list_of_ranges (list[list[int]]) – Sorted, non-overlapping, inclusive ranges in the form
[[start, end], …]. All numbers must be positive integers andstart ≤ endfor every range.- Returns:
- A new list containing the minimal, non-overlapping, inclusive ranges that exactly
cover the union of list_of_ranges.
- Return type:
list[list[int]]
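The forward sweep described above can be sketched in a few lines. This is a minimal stand-alone illustration, assuming the input already satisfies the documented invariants (sorted, inclusive ranges); it is not the library's actual implementation.

```python
def compact_ranges(list_of_ranges):
    """Merge sorted, inclusive [start, end] ranges whose gap is at most one."""
    merged = []
    for start, end in list_of_ranges:
        if merged and start - merged[-1][1] <= 1:
            # Touching or adjacent to the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Start a fresh range; a new inner list keeps the input untouched.
            merged.append([start, end])
    return merged


print(compact_ranges([[1, 3], [4, 6], [8, 9], [9, 12]]))  # [[1, 6], [8, 12]]
```

Each input range is visited exactly once, which is where the O(n) bound comes from.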
- property external_database_connection_form: dict[str, str]
Infer which Ensembl identifier form each external database connects to.
External databases link to exactly one “form” of Ensembl identifier—gene, transcript, or translation—determined upstream by
idtrack._external_databases.ExternalDatabases. The property walks the neighbourhood of every external-database node, tallies the node type of its Ensembl neighbours, and assigns the majority form. A mis-annotation that connects an external node directly to a non-Ensembl node is interpreted as a schema violation and aborts with ValueError.
- Returns:
- Dictionary whose keys are external-database names and
whose values are one of
"gene","transcript", or"translation", indicating the form of Ensembl ID to which the database links.
- Return type:
dict[str, str]
- Raises:
ValueError – If any external-database node is found connected to a node that is not an Ensembl identifier, indicating an inconsistent graph state.
- get_active_ranges_of_base_id_alternative(base_id)[source]
Return the Ensembl-release intervals during which a base gene identifier is active.
The routine unifies child-level history into an easy-to-query representation. A base Ensembl ID (e.g.
ENSG00000123456) has one or more versioned descendants (ENSG00000123456.1, ENSG00000123456.2, …) whose lifetimes can never overlap. By walking the immediate neighbours of base_id and unioning every child’s get_active_ranges_of_id_ensembl_all_inclusive result, the method derives exactly the releases in which any descendant existed. This summary read-out is used by higher-level diagnostics (for example, range-overlap sanity checks) and by algorithms that need to reason about the birth and retirement of genes at the stable-ID level.
- Parameters:
base_id (str) – Stable Ensembl gene identifier without version suffix. The node must have node_type == DB.nts_base_ensembl["gene"] inside the graph.
- Returns:
Sorted, non-overlapping [start, end] slices inclusive at both ends. end may be np.inf when the gene is still present in the most recent release.
- Return type:
list[list[int]]
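The union step over the children's ranges might look like the sketch below. The helper name union_child_ranges is hypothetical, float("inf") stands in for np.inf, and merging directly adjacent ranges is an assumption rather than documented behaviour.

```python
def union_child_ranges(children_ranges):
    """Union several lists of inclusive [start, end] ranges into one sorted list."""
    flat = sorted(r for ranges in children_ranges for r in ranges)
    merged = []
    for start, end in flat:
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the previous range.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


# Hypothetical lifetimes of three versioned children of one base ID.
children = [[[80, 85]], [[86, 90]], [[95, float("inf")]]]
print(union_child_ranges(children))  # [[80, 90], [95, inf]]
```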
- property get_active_ranges_of_id: dict[str, list[list]]
Return inclusive Ensembl-release intervals in which every node in the graph is biologically active.
The convenience wrapper iterates over all nodes currently stored in this
idtrack.the_graph.TheGraphinstance and delegates the heavy lifting to_get_active_ranges_of_id(). The latter performs node-type-specific logic (backbone vs. assembly-specific) to determine contiguous release windows—at no point does this method examine which genome assembly the release originated from, because for downstream tasks (lifecycle analysis, deprecation reports, etc.) only the presence/absence across release numbers matters.- Returns:
Mapping {node_id: [[start_rel, end_rel], ...]} where every inner two-element list is an inclusive range. Ranges are sorted in ascending order and guaranteed not to overlap.
- Return type:
dict[str, list[list[int]]]
- get_active_ranges_of_id_ensembl_all_inclusive(the_id)[source]
Return the inclusive Ensembl-release ranges during which the_id is active across all assemblies.
This helper generalises
get_active_ranges_of_id(), which only reports activity on the graph’s main assembly, by folding in evidence from every other assembly represented incombined_edges_genes. The resulting timeline therefore reflects all times at which the identifier (or any assembly-specific sibling) existed in Ensembl—crucial when downstream analyses must ignore assembly boundaries, e.g. when tracking identifier synonymy across genome builds. After merging, the routine validates that the main-assembly slice remains consistent with the authoritative backbone cache and aborts with a detailed error if divergence is detected.- Parameters:
the_id (str) – Ensembl gene identifier—either backbone (
ENSG…) or assembly-qualified (assembly_<code>_ensembl_gene). The node’sDB.node_type_strmust be one ofDB.nts_ensembl["gene"]or the set inDB.nts_assembly_gene.- Returns:
A list of
[start, end]pairs (inclusive, sorted, non-overlapping) covering every Ensembl release in which the_id was present on any assembly.- Return type:
list[list[int]]
- Raises:
ValueError – If (1) activity inferred from
combined_edges_genesdisagrees withget_active_ranges_of_id()for the main assembly, or (2) the_id is not a recognised Ensembl-gene node type.
- get_external_database_nodes(database_name)[source]
Collect identifiers that appear at least once in the specified external database.
The graph stores one node per identifier and attaches metadata—such as its origin database—to each node via
self.nodes[node_name]. This helper filters that dictionary, returning every node whose metadata marks it as an external identifier belonging to database_name. The result is often fed into downstream integrity checks or exported so that analysts can cross-reference original accession lists.- Parameters:
database_name (str) – Name of the external resource (e.g.
"UniProtKB"). Must be one of the values returned byTheGraph.available_external_databases().- Returns:
All unique node names (accessions) associated with database_name.
- Return type:
set[str]
- get_id_list(database, assembly, release)[source]
Return node identifiers for a specific (database, assembly, release) slice of the multigraph.
This helper exists primarily for unit-testing and exploratory analysis. Internally, the graph stores node metadata in the memory-intensive
node_trios cache, keyed by a triple (database_or_node_type, assembly, release). get_id_list() hides that complexity, walking the full node set and extracting only those identifiers whose tuple key matches the requested slice. Because the traversal touches every node, the method is slow and scales poorly compared with the vectorised access paths used in production code. It is therefore not called in performance-critical workflows; its main purpose is to generate deterministic ground-truth lists that test-suites can compare against.
The method also reproduces legacy Ensembl behaviour: when database resolves to the canonical Ensembl gene node type on the primary assembly, identifiers whose Version attribute is one of idtrack._db.DB.alternative_versions are still included, ensuring that versioned and unversioned IDs appear together, exactly as they do in public Ensembl MySQL dumps.
- Parameters:
database (str) – External database name for external nodes (e.g. "uniprot", "refseq") or an Ensembl node-type label such as "gene", "transcript", or "translation". Ensembl labels must match the keys defined in idtrack._db.DB.nts_ensembl.
assembly (int) – Genome assembly identifier (e.g. 38 for human GRCh38) that must be present in available_genome_assemblies().
release (int) – Ensembl release number (e.g. 111) corresponding to the graph snapshot of interest.
- Returns:
A list of unique node names (identifiers) in insertion order that belong to the requested
(database, assembly, release)tuple. The list may include versioned Ensembl genes as noted above.- Return type:
list[str]
Notes
The helper performs a linear scan over
networkx.MultiDiGraph.nodes, so its runtime isO(|V|)and memory footprint equals that ofnode_trios. Prefer dedicated graph queries for production workloads and reserve this method for tests or ad-hoc inspection.
- static get_intersecting_ranges(lor1, lor2, compact=True)[source]
Return the set of releases common to both input range lists.
The routine computes the pairwise intersection between every range in lor1 and every range in lor2, yielding a list of ranges where the two original lists overlap. Optionally the result may be passed through
TheGraph.compact_ranges()to merge adjacent slices and guarantee a minimal representation. Because the helper is frequently used inside path-finding algorithms it trades clarity for raw performance and therefore assumes both inputs are already sorted, non-overlapping, and inclusive as produced elsewhere in the library.- Parameters:
lor1 (list[list[int]]) – First list of inclusive, ascending, non-overlapping ranges.
lor2 (list[list[int]]) – Second list of ranges with the same invariants as lor1.
compact (bool) – When
True(default) the raw intersections are passed toTheGraph.compact_ranges()before being returned.
- Returns:
- Inclusive integer ranges where lor1 and lor2 overlap. The list is empty when
no overlap exists.
- Return type:
list[list[int]]
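As a rough illustration of the pairwise intersection, the sketch below compares every range in one list with every range in the other and keeps the non-empty overlaps. It omits the optional compaction step and is only an approximation of the documented behaviour, not the library's code.

```python
def get_intersecting_ranges(lor1, lor2):
    """Return the inclusive overlaps between two lists of [start, end] ranges."""
    result = []
    for start1, end1 in lor1:
        for start2, end2 in lor2:
            start, end = max(start1, start2), min(end1, end2)
            if start <= end:  # Non-empty overlap, inclusive at both ends.
                result.append([start, end])
    return result


print(get_intersecting_ranges([[1, 5], [8, 12]], [[4, 9]]))  # [[4, 5], [8, 9]]
```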
- get_next_edge_releases(from_id, reverse)[source]
List the Ensembl releases reachable by the next (or previous) edges from from_id.
The method scans the immediate neighbourhood of a backbone gene node and extracts the release numbers that mark either the next chronological transition (reverse =
False) or the previous one (reverse =True). It respects graph directionality, skips non-backbone connections, collapses duplicate multi-edges, and treats infinite self-loops as “still active” when stepping forward in time. The result is a de-duplicated, easy-to-use list that higher-level path-finding algorithms can feed directly into release-oriented traversals.- Parameters:
from_id (str) – Ensembl gene identifier that must belong to the backbone (DB.external_search_settings["nts_backbone"]).
reverse (bool) – If False return forward (old → new) transition releases; if True return backward (new → old) releases.
- Returns:
- Sorted list of unique Ensembl release numbers adjacent to from_id in the chosen temporal
direction.
- Return type:
list[int]
- Raises:
ValueError – If from_id is not a backbone node—i.e. its
node_typedoes not matchDB.external_search_settings["nts_backbone"].
- get_two_nodes_coinciding_releases(id1, id2, compact=True)[source]
Determine releases in which both graph nodes are simultaneously active.
Graph nodes (Ensembl genes, transcripts, proteins, or external IDs) exist only for defined release intervals. When integrating annotations it is often necessary to know the time span where two nodes co-exist—for example, when building an orthogonal mapping table or validating edge chronology. The method retrieves each node’s active ranges via
TheGraph.get_active_ranges_of_id(), computes their intersection withTheGraph.get_intersecting_ranges(), and optionally compacts the result. The returned list therefore represents every Ensembl release in which id1 and id2 are valid simultaneously.- Parameters:
id1 (str) – Identifier of the first node (must exist in self.nodes).
id2 (str) – Identifier of the second node (must exist in self.nodes).
compact (bool) – Forwarded to TheGraph.get_intersecting_ranges(). When True (default) the final ranges are minimised; when False the raw intersections are returned.
- Returns:
Inclusive release intervals
[[start, end], …]where id1 and id2 overlap. The list is empty if the nodes never co-occur.- Return type:
list[list[int]]
- property hyperconnective_nodes: dict[str, int]
Return hyper-connective external nodes and their out-degree counts.
Hyper-connective nodes are external identifiers whose out-degree (number of outgoing edges) exceeds
idtrack._db.DB.hyperconnecting_threshold. Because such nodes may participate in tens of thousands of mappings, they explode the breadth-first frontier of the synonym pathfinder algorithm and become a major performance bottleneck. The algorithm therefore ignores these nodes, sacrificing a small amount of theoretical precision for a substantial speed-up.
In practice the precision penalty is negligible: hyper-connective nodes tend to be coarse-grained identifiers that already suffer from low mapping specificity (for example, generic protein or transcript accessions re-used across many unrelated biological entities). Meaningful, one-to-one synonym relationships are almost always reachable through alternative external identifiers. Consequently, ignoring hyper-connective nodes both accelerates the search and often improves the overall relevance of the results.
The value is computed lazily on first access and memoised via
functools.cached_property(), so the underlying query runs at most once perTheGraphinstance.- Returns:
- Mapping from external node identifier to its out-degree, limited to nodes whose
out-degree is greater than
idtrack._db.DB.hyperconnecting_thresholdand whoseidtrack._db.DB.node_type_strequalsidtrack._db.DB.nts_external.
- Return type:
dict[str, int]
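The out-degree filter can be approximated with plain networkx calls. In this sketch the node attribute key "node_type", the value "external" and the threshold are hypothetical stand-ins for idtrack._db.DB.node_type_str, idtrack._db.DB.nts_external and idtrack._db.DB.hyperconnecting_threshold.

```python
import networkx as nx


def find_hyperconnective_nodes(graph, threshold, external_type="external"):
    """Map external nodes to their out-degree when it exceeds the threshold."""
    return {
        node: degree
        for node, degree in graph.out_degree()
        if degree > threshold
        and graph.nodes[node].get("node_type") == external_type
    }


g = nx.MultiDiGraph()
g.add_node("generic_accession", node_type="external")
g.add_node("specific_accession", node_type="external")
for target in ("gene_a", "gene_b", "gene_c"):
    g.add_edge("generic_accession", target)
g.add_edge("specific_accession", "gene_a")

hubs = find_hyperconnective_nodes(g, threshold=2)
print(hubs)  # {'generic_accession': 3}
```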
- static is_point_in_range(lor, p)[source]
Check whether a single integer lies inside any range in lor.
The helper performs a linear scan over lor (assumed sorted and non-overlapping) and returns as soon as p falls between a
[start, end]pair. It is intentionally lightweight because it is called inside tight loops that filter large identifier sets by Ensembl release.- Parameters:
lor (list[list[int]]) – Inclusive, ascending, non-overlapping ranges against which p is tested.
p (int) – The release number to evaluate.
- Returns:
Truewhen p is covered by at least one range in lor;Falseotherwise.- Return type:
bool
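A linear scan with an early return, as described above, can be written as follows. This is an illustrative sketch rather than the library's own code.

```python
def is_point_in_range(lor, p):
    """Return True when p falls inside any inclusive [start, end] range in lor."""
    for start, end in lor:
        if start <= p <= end:
            return True  # Stop at the first range that covers p.
    return False


print(is_point_in_range([[1, 3], [7, 9]], 8))  # True
print(is_point_in_range([[1, 3], [7, 9]], 5))  # False
```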
- static list_to_ranges(lst)[source]
Compact a sorted list of releases into minimal inclusive ranges.
The helper converts monotonically increasing, duplicate-free release numbers into a run-length representation (e.g.
[1, 2, 3, 5] → [[1, 3], [5, 5]]). It is the logical inverse ofTheGraph.ranges_to_list()and is frequently used to post-process the raw release sets collected from edge metadata.- Parameters:
lst (list[int]) – Releases strictly increasing with no repetitions. Supplying an unsorted or duplicate-containing list leads to undefined behaviour.
- Returns:
- Non-overlapping
[start, end]intervals covering exactly the input elements. Each inner list is inclusive; singleton releases become
[r, r].
- Non-overlapping
- Return type:
list[list[int]]
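The run-length conversion can be sketched as below, assuming the documented precondition of a strictly increasing, duplicate-free input. It reproduces the example given in the description but is not taken from the library source.

```python
def list_to_ranges(lst):
    """Convert strictly increasing integers into inclusive [start, end] ranges."""
    ranges = []
    for release in lst:
        if ranges and release == ranges[-1][1] + 1:
            ranges[-1][1] = release  # Consecutive release: extend the current run.
        else:
            ranges.append([release, release])  # Gap: open a new singleton run.
    return ranges


print(list_to_ranges([1, 2, 3, 5]))  # [[1, 3], [5, 5]]
```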
- property lower_chars_graph: dict[str, str]
Map lowercase node identifiers to their canonical graph node names.
The cached mapping enables case-insensitive queries against the graph by translating a lowercase version of every node into the exact identifier stored in
self.nodes. ID-resolution helpers such as node_name_alternatives() rely on this cache to recover the intended node even when callers supply mixed-case or lowercase strings.
During construction the method iterates once over all nodes, lowers each identifier, and asserts that no two distinct nodes collide after lower-casing. The result is memoised via functools.cached_property, so subsequent accesses are O(1).
- Returns:
{lowercase_id: original_id} giving a one-to-one mapping from lowercase node identifiers to the exact strings used in the graph.
- Return type:
dict[str, str]
- Raises:
ValueError – If two or more nodes become identical after converting to lowercase, indicating ambiguous casing in the underlying graph.
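A minimal sketch of the collision-checked index follows. The function name build_lowercase_index is hypothetical, and the node names are passed in as a plain iterable instead of being read from self.nodes.

```python
def build_lowercase_index(node_names):
    """Map each lower-cased node name to its original spelling."""
    index = {}
    for name in node_names:
        key = name.lower()
        if key in index and index[key] != name:
            # Two distinct nodes differ only by letter case.
            raise ValueError(
                f"Nodes {index[key]!r} and {name!r} collide after lower-casing."
            )
        index[key] = name
    return index


index = build_lowercase_index(["ACTB", "ENSG00000123456.2"])
print(index["actb"])  # ACTB
```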
- node_name_alternatives(identifier)[source]
Resolve a raw query identifier to the exact graph node label that IDTrack expects.
The routine shields downstream path-finding code from the myriad ways users may spell or format biological identifiers. It walks through a well-defined priority list—direct lookup, case-blind match, version-suffix trimming, and dash/underscore substitutions—before finally retrying the whole sequence with the
synonym:prefix used bysynonym_id_nodes_prefix. This makes interactive exploration tolerant to typos such as lower-case gene symbols (actb→ACTB) or versioned Ensembl IDs written with underscores (ENSG00000123456_2→ENSG00000123456.2).- Parameters:
identifier (str) – Raw identifier supplied by the caller. May be an Ensembl ID, external database key, or any variant handled by the heuristics described above.
- Returns:
A two-element tuple. The first element is the canonical node label, or None when no match is possible. The second element is True when identifier had to be modified (case change, suffix strip, etc.) and False when an exact graph hit was found.
- Return type:
tuple[Optional[str], bool]
Notes
Internally this is a thin wrapper that delegates the heavy lifting to the private
_node_name_alternatives()helper, then retries once with the synonym prefix if the first pass fails. The helper itself is further decomposed into specialised sub-functions—see their individual docstrings for details.
- property node_trios: dict[str, set[tuple]]
Return a full node → trio-set cache.
Builds the complete mapping once and stores it as a
functools.cached_property. The mapping is memory-heavy but accelerates downstream helpers that repeatedly need the(<database>, <assembly>, <release>)origin of many nodes.- Returns:
- Node identifier → the set of unique
(database, assembly, release)combinations in which that node is active.
- Return type:
dict[str, set[tuple[str, int, int]]]
Notes
The builder simply iterates self.nodes and delegates the per-node logic to _node_trios(). Expect a multi-second start-up on large graphs.
- ranges_to_list(lor)[source]
Explode inclusive ranges back into a sorted list of releases.
This is the inverse of
TheGraph.list_to_ranges(). Each[start, end]slice is expanded inclusive of both boundaries; if end isnp.infthe interval is closed withmax(self.graph["confident_for_release"])so that downstream numeric operations continue to work on finite integers. The union of all expanded ranges is returned in ascending order without duplicates.- Parameters:
lor (list[list[int | float]]) – List of inclusive, non-overlapping
[start, end]pairs.startmust be> 0;endmay benp.infto denote open-ended activity.- Returns:
Strictly increasing sequence of releases represented by lor.
- Return type:
list[int]
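The expansion, including the handling of an open-ended range, might be sketched as follows. The latest_release parameter is an assumption that stands in for max(self.graph["confident_for_release"]), and float("inf") stands in for np.inf.

```python
import math


def ranges_to_list(lor, latest_release):
    """Expand inclusive [start, end] ranges into a sorted list of releases."""
    releases = set()
    for start, end in lor:
        if math.isinf(end):
            end = latest_release  # Close the open-ended range at the newest release.
        releases.update(range(int(start), int(end) + 1))
    return sorted(releases)


print(ranges_to_list([[1, 3], [5, float("inf")]], latest_release=7))
# [1, 2, 3, 5, 6, 7]
```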
- property rev: TheGraph
Return a view of the same graph with all edge directions reversed.
The call delegates to
networkx.MultiDiGraph.reverse() with copy=False, meaning the returned object re-uses the underlying data structures instead of duplicating the graph. Use this property whenever a temporal walk must proceed backwards in history (e.g. when resolving identifiers from a newer to an older Ensembl release).
- Returns:
- A non-copying, lazily constructed reverse-orientation
view that honours every node and edge attribute of the original graph.
- Return type:
TheGraph
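The behaviour of a non-copying reverse view can be observed directly with networkx. The node names and the release attribute below are invented for illustration and do not reflect the real graph schema.

```python
import networkx as nx

g = nx.MultiDiGraph()
g.add_edge("ENSG_old.1", "ENSG_new.1", release=100)

# copy=False returns a view that shares data with g instead of duplicating it.
rev = g.reverse(copy=False)
print(list(rev.successors("ENSG_new.1")))  # ['ENSG_old.1']

# Edge attributes are visible through the view.
print(rev["ENSG_new.1"]["ENSG_old.1"][0]["release"])  # 100

# Edges added to the original graph afterwards show up in the view as well.
g.add_edge("ENSG_new.1", "ENSG_newer.1", release=105)
print(rev.has_edge("ENSG_newer.1", "ENSG_new.1"))  # True
```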
- class GraphMaker(db_manager)[source]
Bases:
object
Creates ID history graph.
It includes Ensembl gene ID history. Ensembl ID history is obtained from Ensembl resources, which show the connections between different Ensembl base IDs or different versions of the same Ensembl base ID. Ensembl transcripts (with base IDs and versions) are connected to genes, and Ensembl proteins are connected to transcripts. Additionally, a selected set of external databases is connected to the related Ensembl IDs: for example, UniProt IDs are associated with proteins, while RefSeq transcript IDs are associated with transcripts. The GraphMaker class also saves the resulting graph into the defined temporary directory for later calculations.
Class initialization.
- Parameters:
db_manager (DatabaseManager) – Needed to download all necessary tables and data frames. It contains the temporary directory to save the resultant graph.
- Raises:
ValueError – If GraphMaker is not created with the latest release available in db_manager.
- construct_graph(narrow=False, form_list=None, narrow_external=True)[source]
Main method to construct the graph.
It creates the graph with Ensembl gene, transcript and protein information. It also adds DB.nts_base_ensembl[f] nodes into the graph, which carry only the base Ensembl gene ID (no version). External database entries described in ExternalDatabases will be part of the graph. Normally, the user is not expected to call this method directly, as it is used by the get_graph method.
- Parameters:
narrow (bool) – Determine whether some more information should be added between Ensembl gene IDs, for example which genome assembly is used or when the connection was established. For usual uses, there is no need to set it True.
form_list (list | None) – Determine which forms (transcript, translation, gene) should be included. If None, then include all possible forms defined in DatabaseManager. It has to be a list composed of the following strings: ‘gene’, ‘transcript’, ‘translation’.
narrow_external (bool) – If set False, all possible external databases defined in the Ensembl MySQL server will be included in the graph. The graph will be immensely larger, and the ID history travel calculation will be very slow. Additionally, the success of ID conversion under such a setting has not been tested yet.
- Returns:
Resultant multiedge directed graph.
- Raises:
ValueError – Unexpected error.
- Return type:
- construct_graph_form(narrow, db_manager)[source]
Creates a graph with connected nodes based on historical relationships between Ensembl IDs.
- Parameters:
narrow (bool) – See parameter in Graph.construct_graph.narrow
db_manager (DatabaseManager) – The method reads the ID history dataframe and the lists of Ensembl IDs at each Ensembl release, provided by DatabaseManager.
- Returns:
Resultant multi edge directed graph.
- Raises:
ValueError – Unexpected error.
- Return type:
- create_file_name(narrow, form_list=None, narrow_external=True)[source]
File name creator which includes some information regarding the construction process.
This makes it easier to recognize the graph from its file name.
- Parameters:
narrow (bool) – See parameter in Graph.construct_graph.narrow
form_list (list[str] | None)
narrow_external (bool)
- Returns:
Absolute file path in the temporary directory provided by
DatabaseManager.- Return type:
str
- export_disk(g, file_path, overwrite)[source]
Write the pickle file in the provided file path, which contains the graph.
- Parameters:
g (TheGraph) – Multi edge directed graph object to store on disk.
file_path (str) – Absolute target path, provided by Graph.create_file_name()
overwrite (bool) – See parameter in Graph.get_graph.overwrite_even_if_exist
- get_graph(narrow=True, create_even_if_exist=False, save_after_calculation=True, overwrite_even_if_exist=False, *, form_list=None, narrow_external=True)[source]
Simplifies the graph construction process.
- Parameters:
narrow (bool) – See parameter in Graph.construct_graph.narrow
create_even_if_exist (bool) – Determine whether to create the graph even if it exists. If there is no graph in the provided temporary directory, the graph will be created regardless.
save_after_calculation (bool) – Determine whether the resultant graph will be saved or not.
overwrite_even_if_exist (bool) – If the graph will be saved, determine whether the program should overwrite. If False, it does not re-save the calculated (or loaded) graph.
form_list (list[str] | None)
narrow_external (bool)
- Returns:
Resultant multi edge directed graph, which can be used in all future calculations.
- Return type:
- initialize_downloads()[source]
Initialize the external database downloads.
- Raises:
NotImplementedError – Not implemented yet. Currently, the necessary data sources are downloaded when needed during the graph construction process.
- static read_exported(file_path)[source]
Read the pickle file in the provided file path, which contains the graph.
- Parameters:
file_path (str) – Absolute path of the file of interest.
- Returns:
Resultant multi edge directed graph.
- Raises:
FileNotFoundError – When there is no file in the provided directory.
- Return type:
- static remove_non_gene_trees(graph, forms_remove=None)[source]
Removes the edges between the nodes with the same node type and removes abstract nodes (Void and Retired).
The edges between two nodes with the same DB.node_type_str will be removed. Also, the nodes with versions DB.no_new_node_id and DB.no_old_node_id will be removed.
- static split_id(id_to_split, which_part)[source]
Simpler method to retrieve ID or Version part of a node name.
- Parameters:
id_to_split (str) – Query node name.
which_part (str) – Either ‘Version’ or ‘ID’.
- Returns:
The requested substring of the node name.
- Raises:
ValueError – If ‘which_part’ is assigned a value other than ‘Version’ or ‘ID’.
- Return type:
str | float
- update_graph_with_the_new_release()[source]
When a new release arrives, just add the new nodes.
- Raises:
NotImplementedError – Not implemented yet. Currently, the user is expected to recreate the whole graph using the get_graph method. Note that not all databases need to be re-downloaded: the program will only download the new release and reconstruct the graph.
- class VerifyOrganism(organism_query)[source]
Bases:
object
Resolve a tentative organism identifier to the formal Ensembl species name and its latest supported release.
The class shields end-users from the quirks of the Ensembl REST payload by converting any synonym—common name, scientific name, assembly accession, or NCBI taxon ID—into the canonical Ensembl species identifier (e.g.
homo_sapiens) and the newest Ensembl release that still hosts that species. Because the mapping is refreshed on every instantiation through a live call to the Ensembl REST API, downstream workflows in idtrack always rely on up-to-date metadata rather than a possibly stale local cache.
After construction the instance offers two high-level helpers:
>>> resolver = VerifyOrganism("human")
>>> resolver.get_formal_name()  # 'homo_sapiens'
>>> resolver.get_latest_release()  # 117 (example)
Both helpers are backed by two public dataframes created during initialisation:
name_synonyms_dataframe - maps every synonym returned by the REST service to the chosen formal name and flags synonyms that are ambiguous across species.
ensembl_release_dataframe - one-row table (indexed by formal_name) holding the latest Ensembl release number.
Initialise the resolver and pre-fetch synonym/release tables from the Ensembl REST API.
The constructor immediately invokes
fetch_organism_and_latest_release(), downloading the complete species list from {DB.rest_server_api}{DB.rest_server_ext} so that all subsequent look-ups run entirely in-memory. Any exceptions raised during that fetch are allowed to propagate unchanged so that callers can handle network or data-quality issues explicitly.
- Parameters:
organism_query (str) – Organism identifier supplied by the user: common name ("human"), shorthand ("hsapiens"), taxon ID (9606), or fully qualified Ensembl species name ("homo_sapiens"). The value is converted to lower case before processing.
- fetch_organism_and_latest_release(connect_timeout, read_timeout)[source]
Query the Ensembl REST API once and build lookup tables for species synonyms and latest releases.
This internal utility performs a single call to
/info/species on the Ensembl REST server, parses the returned JSON, and constructs two pandas dataframes:
name_synonyms_df - one row per synonym, with columns synonym, formal_name and ambiguous (True if the synonym belongs to more than one species).
latest_ensembl_releases_df - indexed by formal_name and holding a single ensembl_release integer column.
Consolidating the REST query in one place avoids repeated network traffic and provides a cache-friendly structure for subsequent lookups.
- Parameters:
connect_timeout (int) – Seconds to wait while establishing the TCP connection to the Ensembl server.
read_timeout (int) – Seconds to wait for the server to send the full response after the connection has been established.
- Returns:
(name_synonyms_df, latest_ensembl_releases_df)as described above.- Return type:
tuple[pandas.DataFrame, pandas.DataFrame]
- Raises:
TimeoutError – If the combined (connect_timeout, read_timeout) limit is exceeded.
ValueError – If the JSON schema differs from the expected
{"species": [...]}structure or required keys are missing.
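The parsing step can be sketched without any network access by feeding it a hand-written payload. Everything below is an assumption made for illustration: the function name build_species_tables, the sample payload, and the field names (name, release, common_name, taxon_id, aliases). The sketch also returns plain Python structures where the documented method returns pandas dataframes.

```python
from collections import defaultdict


def build_species_tables(payload):
    """Derive synonym rows and latest releases from an /info/species-style payload."""
    if not isinstance(payload, dict) or "species" not in payload:
        raise ValueError("Expected a JSON object with a top-level 'species' list.")

    formal_names = defaultdict(set)  # synonym -> formal names that claim it
    latest_release = {}              # formal name -> release number

    for entry in payload["species"]:
        try:
            formal = entry["name"]
            latest_release[formal] = int(entry["release"])
        except KeyError as err:
            raise ValueError(f"Species entry lacks required key {err}.") from err
        # Assumed field names; adjust them to the real payload.
        candidates = [formal, entry.get("common_name"), entry.get("taxon_id")]
        candidates.extend(entry.get("aliases", []))
        for synonym in candidates:
            if synonym not in (None, ""):
                formal_names[str(synonym).lower()].add(formal)

    rows = [
        {"synonym": synonym, "formal_name": formal, "ambiguous": len(names) > 1}
        for synonym, names in sorted(formal_names.items())
        for formal in sorted(names)
    ]
    return rows, latest_release


# Hypothetical payload, hand-written for illustration only.
sample = {
    "species": [
        {"name": "homo_sapiens", "common_name": "human", "taxon_id": "9606",
         "aliases": ["hsapiens"], "release": 111},
        {"name": "mus_musculus", "common_name": "house mouse", "taxon_id": "10090",
         "aliases": ["mouse"], "release": 111},
    ]
}
rows, releases = build_species_tables(sample)
print(releases["homo_sapiens"])  # 111
```

A synonym claimed by more than one species yields one row per formal name, each flagged ambiguous, which matches the column layout described above.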
- get_formal_name()[source]
Resolve the user’s organism query to the canonical Ensembl species name.
The method performs an exact match against the synonym column of
name_synonyms_dataframe, which was pre-populated from the Ensembl REST species endpoint. Synonyms include scientific names, common names, NCBI TaxIDs, assembly accessions and other aliases, allowing flexible user input while guaranteeing that only one formally recognised organism is selected before any expensive data retrieval begins.- Returns:
The canonical Ensembl species identifier (always lower-case, e.g.
"homo_sapiens").- Return type:
str
- Raises:
KeyError – If the query string does not match any synonym in the dataframe.
ValueError – If the query matches more than one formal name, indicating an ambiguous synonym.
- get_latest_release()[source]
Return the latest Ensembl release number associated with the queried organism.
This helper calls
get_formal_name()to resolve the user-supplied organism query to the canonical Ensembl species name, then looks up that key in the dataframe prepared at instantiation time. Down-stream routines (e.g. database connectors, file download helpers) rely on this value to decide which Ensembl release to fetch, ensuring the entire pipeline stays on a single, internally consistent genome build.- Returns:
The most recent Ensembl release available for the resolved organism.
- Return type:
int
- class Track(db_manager, **kwargs)[source]
Bases:
objectBidirectional path-finding resolver for biological identifiers.
Track builds and queries a bio-ID multigraph that stitches together Ensembl history edges (genes, transcripts, proteins) and cross-reference edges to external databases (UniProt, RefSeq, …). Given a source identifier, a target Ensembl release, and/or a target database, the class:
Normalises the source to an Ensembl gene node when necessary.
Time-travels through historical edges—forward or backward—until it reaches the requested release, optionally “beaming-up” through external IDs when the backbone is disconnected.
Converts the resolved Ensembl gene into the requested external database (or returns the gene itself) while annotating the result with confidence scores and the full traversal path.
Two mutually-recursive engines power the search:
_recursive_function — depth-first search along temporal edges.
_recursive_synonymous — search for synonymous nodes at a single release to enable the external “beam-up”.
- graph
The pre-computed bio-ID graph produced by
_graph_maker.GraphMaker.- Type:
networkx.MultiDiGraph
- version_info
Metadata about the graph build (Ensembl releases included, build date, Git commit, etc.).
- Type:
dict
- _external_entrance_placeholder
Sentinel node IDs that mark artificial edges used when an external ID is pulled onto the Ensembl backbone (False → -1, True → 10001).
- Type:
dict[bool, int]
- _external_entrance_placeholders
Sorted list of the sentinel values above.
- Type:
list[int]
Create a Track resolver and load (or build) its graph.
- Parameters:
db_manager (DatabaseManager) – Connection manager that knows how to fetch Ensembl and cross-reference tables from a local cache or a live MySQL mirror. The same instance is forwarded to _graph_maker.GraphMaker.
kwargs – Additional keyword arguments forwarded verbatim to _graph_maker.GraphMaker.get_graph(). Per the signature documented for that method, the accepted flags are narrow, create_even_if_exist (rebuild the graph even if a saved copy exists), save_after_calculation, overwrite_even_if_exist, form_list, and narrow_external.
- property _calculate_node_scores_helper
Build and cache helper look-ups for node-scoring.
The property constructs two complementary data structures:
filter_set - the union of (1) every external-database node-type present in the graph, and (2) every Ensembl-specific node-type (gene, transcript, translation, …) across all assemblies. This set can therefore be passed unmodified to synonymous_nodes() to ask for “anything that is not an assembly-less backbone gene”.
ensembl_include - a mapping {form → set(node_type_str)} where each value lists the node-types that should be considered equivalent to that form (e.g. gene, transcript, translation) when computing richness metrics.
- Returns:
(filter_set, ensembl_include) exactly as described above.
- Return type:
tuple[set[str], dict[str, set[str]]]
- _choose_relevant_synonym_helper(from_id, synonym_ids, to_release, from_release, mode)[source]
Select the most temporally relevant synonym(s) for an Ensembl gene-ID family.
The method evaluates each candidate in synonym_ids against the target release to_release and, when applicable, the source release from_release. Its job is to decide where the path should enter the Ensembl backbone and whether the remainder of the traversal must run in reverse (new → old) order.
Selection strategy
1. Fixed from_release - If the caller already knows the release of the starting node, every candidate is paired with that same release and the correct reverse flag is derived trivially.
2. Non-backbone start - When the starting node is not an Ensembl-gene backbone ID, the synonym whose active range edge is closest (or farthest, per mode) to to_release is chosen.
3. Backbone start - If the query is itself an Ensembl gene, the algorithm first looks for overlapping ranges between the query and each synonym; if none overlap, it falls back to the distance rule described in step 2.
- Parameters:
from_id (str) – Identifier from which the path search will start.
synonym_ids (Sequence[str]) – Ensembl IDs considered synonyms of from_id (typically the same gene with different version numbers).
to_release (int) – Target Ensembl release that the overall conversion aims for.
from_release (int | None) – Release in which from_id is known to be active. If None, the method infers a suitable release for each candidate.
mode (str) – Either ‘closest’ or ‘distant’; controls whether the synonym chosen should minimise or maximise its distance to to_release.
- Returns:
One or more triplets of the form [synonym_id, entry_release, reverse] where:
synonym_id - the chosen synonym,
entry_release - release at which to join the backbone, and
reverse - True if the subsequent history walk must run backwards in time.
- Return type:
list[list[Union[str, int, bool]]]
- Raises:
ValueError – If no synonym satisfies the distance/overlap criteria or if mode is invalid.
- _create_priority_list_ensembl(from_id, to_release)[source]
Build a priority list of assemblies in which from_id is active.
The priorities are the numeric assembly rankings defined in DB.assembly_mysqlport_priority (smaller numbers mean higher priority).
- Parameters:
from_id (str) – Ensembl gene identifier.
to_release (int) – Target Ensembl release; only assemblies that contain this release are considered.
- Returns:
Sorted list of priority values (ascending).
- Return type:
list[int]
- _ensure_assembly_priority_cache()[source]
Ensure per-graph assembly-priority caches exist.
Some test fixtures construct Track via Track.__new__() (bypassing __init__). Keep the conversion code robust by lazily initialising the per-graph assembly-priority mapping.
- Return type:
None
- _final_conversion(converted, cnvt, final_database, ens_release, return_path, return_ensembl_alternative, prevent_assembly_jumps=True, account_for_hyperconnected_nodes=False)[source]
Convert an Ensembl gene node to the requested external database and merge the result back into converted.
The routine:
Builds every legal synonym path from cnvt to final_database that is active in ens_release (or in any release as a fallback).
Computes assembly-jump penalties for each path.
Calls _final_conversion_dict_prepare() to create the conversion sub-dict.
Optionally falls back to returning the Ensembl gene itself when no synonym exists and return_ensembl_alternative is True.
- Parameters:
converted (dict) – The current accumulator being built by convert().
cnvt (str) – Ensembl gene identifier that is undergoing final conversion.
final_database (str) – Target external database.
ens_release (int) – Target Ensembl release.
return_path (bool) – If True, embed the path(s) that lead to each synonym.
return_ensembl_alternative (bool) – When no synonym can be found, add a fallback entry that keeps the Ensembl gene.
prevent_assembly_jumps (bool) – If True, disallow conversion paths that cross between different genome assemblies. Defaults to True.
account_for_hyperconnected_nodes (bool) – If True, skip nodes that are marked as hyperconnective (very high connectivity) to prevent search explosion and low-quality paths. Defaults to False.
- Returns:
The same converted dict, updated in place (and also returned for convenience).
- Return type:
dict
- Raises:
EmptyConversionMetricsError – Raised when no valid conversion metrics are available and no alternative conversion path can be found.
- static _final_conversion_dict_prepare(confidence, sysns, paths, min_priority_list, len_priority_list, add_ass_jump_list, final_database)[source]
Assemble the final-conversion section that will be attached to a candidate path.
The section contains a global conversion-confidence flag plus one entry per synonym that survived the path-finding stage. When paths is None the structure is identical but omits the ‘the_path’ member to save memory.
- Parameters:
confidence (int | float) – Heuristic confidence for the whole conversion step - 0 for “perfect”, larger values for fallback scenarios, np.inf when no conversion was possible.
sysns (list) – List of synonym identifiers in the same order as the metric lists below.
paths (list[list] | None) – One walk (edge list) per synonym, or None if the caller does not want to expose paths.
min_priority_list (list) – Minimum assembly priority reached by each walk.
len_priority_list (list) – Number of distinct assembly priorities encountered by each walk.
add_ass_jump_list (list) – Additional assembly-jump penalty incurred during the synonym hop itself.
final_database (str) – Name of the database these synonyms belong to (e.g. ‘uniprot’ or DB.nts_ensembl[DB.backbone_form]).
- Returns:
Nested dictionary ready to be stored under the key ‘final_conversion’.
- Return type:
dict
- static _minimum_assembly_jumps_helper(step_pri, current_priority, priorities, assembly_priority=None)[source]
Internal worker for minimum_assembly_jumps().
Given the priority sets for the remaining edges, iterate until all have been consumed while updating the current assembly priority and counting how often it must drop.
- Parameters:
step_pri (list[int]) – Priority values of the edge currently under consideration.
current_priority (int) – Priority value inherited from previous steps.
priorities (list[list[int]]) – Priority lists for the rest of the path, already sorted for correct bisecting.
assembly_priority (list[int] | None) – Optional global priority lattice. If None, it is computed from step_pri, current_priority, and priorities.
- Returns:
Same three-tuple as documented in minimum_assembly_jumps().
- Return type:
tuple[int, list[int], int]
- _path_score_sorter_all_targets(dict_of_dict, from_id, to_release)[source]
Select the overall best target(s) once every candidate Ensembl node has itself been reduced to its single best path.
The method linearises several per-path metrics into an importance order (see the tuple at the top of the function), then:
Computes that ordered score for each pair (ensembl_gene, final_target).
- Finds the global minimum; if multiple pairs tie:
Prefer the target whose identifier is identical to from_id.
- If more than one Ensembl gene still ties, fall back on calculate_node_scores() to favour the “richer” node.
Returns a pruned copy of dict_of_dict that contains only the surviving Ensembl genes, each with only the winning final_elements entry. Additional provenance is written to filter_scores.
- Parameters:
dict_of_dict (dict) – Nested result of calculate_score_and_select(). Keys are candidate Ensembl genes; values are dictionaries that already contain one best path per final target.
from_id (str) – Original query identifier; used to break ties in favour of “same as input”.
to_release (int) – Target Ensembl release; forwarded to calculate_node_scores() during tie-breaking.
- Returns:
- A reduced version of dict_of_dict holding only the winner(s) and enriched with a
final_elements[*][‘filter_scores’] sub-dict that records the filters applied.
- Return type:
dict
- Raises:
AssertionError – If node-score tie-breaking results in an empty candidate set.
ValueError – If dict_of_dict is empty.
- static _path_score_sorter_single_target(lst_of_dict)[source]
Select the best score dictionary for one conversion target.
The input is a list of dictionaries produced by calculate_score_and_select(). Each dictionary is converted into a tuple according to the lexicographic importance order
(“assembly_jump”, “external_jump”, “external_step”, “edge_scores_reduced”, “ensembl_step”)
and the dictionary with the smallest tuple is returned.
- Parameters:
lst_of_dict (list[dict]) – Candidate score dictionaries for this target.
- Returns:
The chosen “winner” score dictionary.
- Return type:
dict
- Raises:
ValueError – If the input list is empty.
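The tuple comparison described above can be sketched in a few lines. This is an illustrative re-implementation, not IDTrack's source: the function name pick_best_score and the sample values are assumptions, and only the five metric names come from the description.

```python
# Illustrative sketch only; pick_best_score is a hypothetical name.
IMPORTANCE_ORDER = (
    "assembly_jump",
    "external_jump",
    "external_step",
    "edge_scores_reduced",
    "ensembl_step",
)


def pick_best_score(candidates):
    """Return the dict whose metrics form the smallest tuple in IMPORTANCE_ORDER."""
    if not candidates:
        raise ValueError("No candidate score dictionaries were supplied.")
    # Python compares tuples element by element, so earlier metrics dominate.
    return min(candidates, key=lambda d: tuple(d[k] for k in IMPORTANCE_ORDER))


# Made-up candidates: the second wins because it needs no external jump,
# even though its reduced edge score and Ensembl step count are larger.
candidates = [
    {"assembly_jump": 0, "external_jump": 1, "external_step": 2,
     "edge_scores_reduced": 0.1, "ensembl_step": 3},
    {"assembly_jump": 0, "external_jump": 0, "external_step": 0,
     "edge_scores_reduced": 0.4, "ensembl_step": 5},
]
best = pick_best_score(candidates)
```

Because the comparison is lexicographic, a later metric only matters when every earlier metric ties.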
- _recursive_synonymous(_the_id, synonymous_ones, synonymous_ones_db, filter_node_type, the_path=None, the_path_db=None, depth_max=0, from_release=None, ensembl_backbone_shallow_search=False, account_for_hyperconnected_nodes=True)[source]
Helper method to be used in _graph.Track.synonymous_nodes().
Recursively explore the bio-ID graph to collect synonymous paths starting at _the_id and ending on a node whose type is a member of filter_node_type.
A path is a list of node identifiers (_the_path) together with a parallel list of their node-type strings (_the_path_db). The search is depth-limited: the depth of a path is defined as the maximum count of any single node-type it contains (e.g. a path with three external nodes has depth 3). Recursion stops when that depth would exceed depth_max.
Additional pruning rules:
The walk never visits the same node twice (no cycles).
- It never traverses two consecutive edges whose source and target
share the same node-type—this prevents “time-travel” within the Ensembl history backbone.
- When ensembl_backbone_shallow_search is True, the search is
restricted to the reverse direction except for node-types listed in
DB.nts_bidirectional_synonymous_search.
On reaching a terminating node the method appends the discovered paths to synonymous_ones and their node-type counterparts to synonymous_ones_db. It returns nothing; results are accumulated in those two lists.
- Parameters:
_the_id (str) – Identifier of the starting node (Ensembl or external).
synonymous_ones (list) – Mutable list that will receive each successful identifier path.
synonymous_ones_db (list) – Mutable list that will receive the corresponding node-type paths.
filter_node_type (set[str]) – Allowed node-types for the final node of a path (e.g. {‘ensembl_gene’}).
the_path (list | None) – Current path leading to _the_id; None for the root invocation.
the_path_db (list | None) – Node-type counterpart of the_path; None for the root invocation.
depth_max (int) – Maximum allowed depth as defined above.
from_release (int | None) – If given, only keep terminal nodes that are active in this Ensembl release.
ensembl_backbone_shallow_search (bool) – Activate the shallow, mostly-reverse search mode described above.
account_for_hyperconnected_nodes (bool) – If True, skip nodes that are marked as hyperconnective (very high connectivity) to prevent search explosion and low-quality paths. Defaults to True.
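The depth rule used for pruning can be illustrated with a small sketch. It assumes a path's node-type list is available as plain strings; the helper names path_depth and may_extend are hypothetical, and ‘refseq’ is an invented node-type label.

```python
from collections import Counter


def path_depth(node_type_path):
    """Depth = the largest number of times any single node-type occurs."""
    return max(Counter(node_type_path).values(), default=0)


def may_extend(node_type_path, next_node_type, depth_max):
    """True if appending next_node_type keeps the path within depth_max."""
    return path_depth([*node_type_path, next_node_type]) <= depth_max


# Three nodes of the same external type give depth 3.
demo_depth = path_depth(["ensembl_gene", "uniprot", "uniprot", "uniprot"])
# A path of distinct types has depth 1, so one more 'uniprot' still fits depth 2.
demo_ok = may_extend(["ensembl_gene", "uniprot", "refseq"], "uniprot", depth_max=2)
```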
- calculate_node_scores(the_id, ens_release)[source]
Rank competing Ensembl targets by the “richness” of their synonyms.
The method counts, within a radius of two synonym hops, how many unique identifiers of various categories point to each candidate and returns the counts as negative integers so that smaller is better for the up-stream sorter.
- Parameters:
the_id (str) – Identifier that is being converted.
ens_release (int) – Target Ensembl release; only synonyms active in this release are considered.
- Returns:
- [-ext, -form₁, -form₂] where
ext - number of distinct external-database synonyms.
form₁ - number of distinct synonyms of the most important Ensembl form (typically gene).
form₂ - number of distinct synonyms of the second form (typically transcript or translation).
- Return type:
list
- Raises:
ValueError – If the graph does not expose exactly the two expected non-backbone forms, or if a synonym node’s type cannot be mapped to external, form₁, or form₂.
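The sign convention can be seen in a tiny sketch: because the counts are negated, an ordinary ascending comparison picks the richest node. The gene labels and counts below are invented placeholders, not real query results.

```python
# Hypothetical [-ext, -form1, -form2] score lists for two candidate genes.
node_scores = {
    "GENE_DEMO_A": [-5, -2, -1],
    "GENE_DEMO_B": [-5, -3, 0],
}

# Lists compare element by element: both have 5 external synonyms, so the
# second position decides, and -3 (three synonyms) beats -2 (two synonyms).
richest = min(node_scores, key=node_scores.get)
```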
- calculate_score_and_select(all_possible_paths, reduction, remove_na, from_releases, to_release, score_of_the_queried_item, return_path, from_id)[source]
Collapse a set of candidate paths into the single best path per target.
For each path produced by the search engine the function:
1. Computes an edge-score aggregate using reduction while handling missing values as directed by remove_na.
2. Tallies external statistics (steps, jumps, initial conversion confidence) and assembly statistics (number of priority drops, final priority).
3. Packs all metrics into a dictionary and stores it under the key of the path’s final destination node.
4. Keeps only the lexicographically “smallest” dictionary per destination via _path_score_sorter_single_target().
- Parameters:
all_possible_paths (tuple) – Sequence of edge-lists representing every admissible walk returned by the path-finder.
reduction (Callable) – Function such as np.mean or sum used to collapse edge weights into one number.
remove_na (str) – How to treat NaN edge weights - one of ‘omit’, ‘to_1’, ‘to_0’.
from_releases (Iterable[int]) – Release that each path starts from; must align with all_possible_paths.
to_release (int) – Target release - needed to know whether an edge is traversed forward or reverse.
score_of_the_queried_item (float) – Fallback weight for the implicit edge that represents the query ID itself.
return_path (bool) – If True, embed the full edge-list inside each score dict under the key ‘the_path’.
from_id (str) – Original identifier being converted - echoed back in the score dict for traceability.
- Returns:
Mapping {destination_id → best_score_dict}. Each score dict contains (inter alia) assembly_jump, external_jump, external_step, edge_scores_reduced, and ensembl_step.
- Return type:
dict
- Raises:
ValueError – If an unexpected edge encoding is encountered, if an edge score is invalid/∞, or if remove_na is set to an unknown mode.
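Step 1 (the NaN handling followed by the reduction) might look roughly like the sketch below. The function name reduce_edge_scores is hypothetical, and returning NaN when every score was omitted is an assumption rather than documented behaviour.

```python
import math
from statistics import fmean


def reduce_edge_scores(scores, reduction=fmean, remove_na="omit"):
    """Collapse per-edge scores into one number after handling NaN values."""
    if not callable(reduction):
        raise ValueError("reduction must be callable.")
    if remove_na == "omit":
        cleaned = [s for s in scores if not math.isnan(s)]
    elif remove_na == "to_1":
        cleaned = [1.0 if math.isnan(s) else s for s in scores]
    elif remove_na == "to_0":
        cleaned = [0.0 if math.isnan(s) else s for s in scores]
    else:
        raise ValueError(f"Unknown remove_na mode: {remove_na!r}")
    # Assumption: with nothing left to reduce, report NaN instead of failing.
    return reduction(cleaned) if cleaned else math.nan


edge_scores = [0.5, math.nan, 1.0]
omitted = reduce_edge_scores(edge_scores)                    # mean of 0.5 and 1.0
as_zero = reduce_edge_scores(edge_scores, remove_na="to_0")  # mean of 0.5, 0.0, 1.0
```

The three modes give different aggregates for the same path, so the choice of remove_na can change which path wins the later tuple comparison.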
- choose_relevant_synonym(the_id, depth_max, to_release, filter_node_type, from_release)[source]
Wrapper that discovers, clusters, and ranks synonymous Ensembl candidates for a given identifier.
The function performs three steps:
Discover paths to all Ensembl-gene nodes that share the same biological identity (synonymous_nodes).
Cluster those paths by gene ID (ignoring version).
Rank each cluster with
_choose_relevant_synonym_helper(), selecting the entry release (and direction) that best suits to_release.
- Parameters:
the_id (str) – Source identifier (Ensembl or external).
depth_max (int) – Maximum depth passed to synonymous_nodes(); governs how far the synonym search is allowed to roam through external nodes.
to_release (int) – Target Ensembl release required by the overall conversion.
filter_node_type (set[str]) – Node-types that the synonym search must terminate on (usually {‘ensembl_gene’}).
from_release (int | None) – Known active release of the_id. If None, the helper will infer one.
- Returns:
A list whose elements are
[synonym_id, entry_release, reverse, identifier_path, node_type_path]
where the last two items reproduce the path returned by synonymous_nodes().
- Return type:
list[list[Any]]
Notes
The method purposefully keeps all equally-ranked candidates; further tie-breaking is deferred to the main path-scoring routine.
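The clustering step (grouping by gene ID while ignoring the version) can be sketched as follows. This assumes identifiers are written as "stable_id.version", which is how Ensembl writes versioned stable IDs; whether IDTrack stores its node names this way is an assumption, and the IDs below are placeholders.

```python
from collections import defaultdict


def cluster_by_stable_id(versioned_ids):
    """Group versioned identifiers under their version-less stable ID."""
    clusters = defaultdict(list)
    for versioned_id in versioned_ids:
        # partition() leaves IDs without a dot untouched.
        stable_id, _, _version = versioned_id.partition(".")
        clusters[stable_id].append(versioned_id)
    return dict(clusters)


clusters = cluster_by_stable_id(["GENE_X.1", "GENE_X.2", "GENE_Y.5"])
```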
- convert(from_id, from_release=None, to_release=None, final_database=None, reduction=<function mean>, remove_na='omit', score_of_the_queried_item=nan, go_external=True, prioritize_to_one_filter=False, return_path=False, deprioritize_lrg_genes=True, return_ensembl_alternative=True)[source]
End-to-end ID conversion workflow.
Starting from from_id the routine
Determines the correct time-travel direction if from_release is unspecified.
Enumerates all admissible paths with get_possible_paths() (forward and/or reverse).
Collapses those paths with calculate_score_and_select().
Optionally converts the surviving Ensembl gene(s) into final_database via _final_conversion().
Optionally applies a final global selection with _path_score_sorter_all_targets().
The output structure mirrors this decision tree and, when return_path is True, embeds the full edge list so that callers can audit every hop.
- Parameters:
from_id (str) – Source identifier (Ensembl, UniProt, RefSeq, …).
from_release (int | None) – Starting Ensembl release. None → infer from the graph.
to_release (int | None) – Target Ensembl release. Defaults to the newest release contained in the graph.
final_database (str | None) – External database to convert into. None → stay on the Ensembl gene.
reduction (Callable) – Function (e.g. numpy.mean) used to collapse per-edge weights. Must accept an iterable of floats and return a float.
remove_na (str) – Strategy for NaN edge weights - ‘omit’, ‘to_1’, or ‘to_0’.
score_of_the_queried_item (float) – Weight assigned to the implicit edge that represents from_id itself.
go_external (bool) – Allow jumps through external databases when the backbone is disconnected.
prioritize_to_one_filter (bool) – After all scoring, keep only the single globally best target.
return_path (bool) – Embed the full edge list(s) in the returned dictionary.
deprioritize_lrg_genes (bool) – If True and other results exist, drop LRG_* genomic regions from the final set.
return_ensembl_alternative (bool) – When converting to an external database, also return the Ensembl gene as a fallback.
- Returns:
dict - Structured result as described above.
None - No admissible path was found.
- Return type:
dict | None
- Raises:
ValueError – For non-callable reduction, unsupported remove_na modes, unknown final_database values, or logical inconsistencies detected during processing.
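A caller typically has to handle both documented outcomes: a result dictionary, or None when no admissible path exists. The wrapper below is a hedged sketch of that pattern. The name summarise_conversion and the stub class are hypothetical, and reading the top-level keys as the surviving targets is an assumption about the result layout.

```python
def summarise_conversion(track, from_id, to_release=None, final_database=None):
    """Call track.convert() and report whether any admissible path was found."""
    result = track.convert(
        from_id,
        to_release=to_release,
        final_database=final_database,
        return_path=False,
    )
    if result is None:  # documented: None means no admissible path
        return {"query": from_id, "found": False, "targets": []}
    # Assumption: top-level keys identify the surviving targets.
    return {"query": from_id, "found": True, "targets": sorted(result)}


class _StubTrack:
    """Hypothetical stand-in so the sketch runs without a built graph."""

    def convert(self, from_id, **kwargs):
        return {"GENE_DEMO_1": {}} if from_id == "known_id" else None


hit = summarise_conversion(_StubTrack(), "known_id", to_release=110)
miss = summarise_conversion(_StubTrack(), "unknown_id", to_release=110)
```

With a real graph, the stub would be replaced by the Track instance itself.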
- convert_optimized_multiple()[source]
Placeholder for a batch-optimised converter.
The intended behaviour is to accept multiple query IDs and choose a conversion target for each such that cross-sample clashes (e.g. duplicate loci) are minimised.
Note
This method is a placeholder for future implementation. Use idtrack._api.API.convert_identifier_multiple() for batch conversions until this optimised version is available.
- Raises:
NotImplementedError – Always - the optimisation strategy is not yet implemented.
- edge_key_orientor(n1, n2, n3)[source]
Return the stored orientation of a multigraph edge.
For multigraphs every logical edge is stored once, but the caller may hold (u, v, k) or (v, u, k). This helper resolves the ambiguity so that subsequent attribute look-ups succeed.
- Parameters:
n1 (str) – One endpoint of the edge.
n2 (str) – The other endpoint.
n3 (int) – Edge key (index) within the networkx multi-edge.
- Returns:
A triple that is guaranteed to exist as written in self.graph.
- Return type:
tuple[str, str, int]
- Raises:
AssertionError – If neither orientation is present in the graph.
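The orientation check amounts to trying both endpoint orders. The sketch below uses a plain set of (u, v, key) triples as a stand-in for the graph's edge store, so it does not depend on networkx; orient_edge and the node labels are hypothetical.

```python
def orient_edge(stored_edges, n1, n2, key):
    """Return the (u, v, key) triple exactly as it is stored."""
    if (n1, n2, key) in stored_edges:
        return (n1, n2, key)
    if (n2, n1, key) in stored_edges:
        return (n2, n1, key)
    raise AssertionError(f"Edge between {n1!r} and {n2!r} with key {key} not found.")


# Stand-in edge store: the edge is stored as A -> B, but the caller holds B, A.
stored = {("NODE_A", "NODE_B", 0)}
oriented = orient_edge(stored, "NODE_B", "NODE_A", 0)
```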
- static get_from_release_and_reverse_vars(lor, p, mode)[source]
Derive a list of (release, reverse) tuples that indicate which Ensembl release to start the graph walk from and whether that walk should move backwards in time.
Given a collection of active-range intervals lor and a pivot release p, the algorithm selects one or two release points per interval depending on mode:
- ‘closest’ - choose the release nearest to p within or at
the ends of the interval.
- ‘distant’ - choose the release farthest from p within the
interval.
The boolean in each tuple is True when the walk should start after the selected release and move backwards (i.e. “reverse mode”), and False when it should move forwards.
- Parameters:
lor (list) – List of inclusive (first_release, last_release) intervals in ascending order.
p (int) – Pivot release around which “closest” or “distant” is evaluated.
mode (str) – Either ‘closest’ or ‘distant’.
- Returns:
Release / reverse-flag pairs, ordered in the sequence they should be tried by the path-finder.
- Return type:
list[tuple[int, bool]]
- Raises:
ValueError – If an interval in lor is malformed, mode is not recognised, or internal consistency checks fail.
- get_next_edges(from_id, from_release, reverse, debugging=False)[source]
Enumerate chronologically admissible history edges from a node.
Starting at from_id and release from_release, the method scans outgoing (or incoming, when reverse is True) edges whose timestamps allow the path to advance in the desired temporal direction. It collapses duplicate “same-ID” transitions and flags self-loops so that later heuristics can treat branch points and tips differently.
- Parameters:
from_id (str) – Current node from which the search will step.
from_release (int) – Release at which the current node is known to exist.
reverse (bool) – False to walk forward in history (old → new), True to walk backward (new → old).
debugging (bool) – If set, disables the duplicate-edge collapse so that unit tests can inspect the raw edge set.
- Returns:
Sorted list of edge descriptors, each of which is
[edge_release, is_self_loop, src_node, dst_node, multiedge_key].
- Return type:
list[list[Union[int, bool, str, int]]]
- Raises:
ValueError – If inconsistent multi-edges (same nodes, same release) are detected—this signals a corrupted graph build.
- get_possible_paths(from_id, from_release, to_release, reverse, go_external=True, increase_depth_until=2, increase_jump_until=0, from_release_inferred=False)[source]
Run path_search() under progressively relaxed settings until at least one viable path is found or every relaxation level is exhausted.
Four search stages are attempted in order:
Backbone-only - external jumps disabled.
- External enabled - allow external jumps; increment synonym depth
and jump limit after each failure up to increase_depth_until/increase_jump_until.
- Backbone with multiple-Ensembl transition - external disabled but
permit starting release to shift on external nodes.
- External + multiple-transition - most permissive search, with
iterative depth/jump relaxation as in stage 2.
- Parameters:
from_id (str) – Identifier to convert.
from_release (int) – Release at which the search begins.
to_release (int) – Desired target release.
reverse (bool) – Traverse the Ensembl history backwards if True, forwards otherwise.
go_external (bool) – If False, skip any stage that requires external jumps.
increase_depth_until (int) – Additional synonym-search depth to allow beyond the default.
increase_jump_until (int) – Additional external-jump count to allow beyond the default.
from_release_inferred (bool) – Reserved for future use. Indicates that from_release was chosen automatically rather than provided by the user.
- Returns:
All paths discovered by the most restrictive stage that yielded at least one result, returned as an immutable tuple.
- Return type:
tuple[tuple[tuple[str, str, int]]]
Notes
The function copies and mutates DB.external_search_settings internally; the caller’s copy is not modified.
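The four-stage relaxation can be sketched as a loop that stops at the first non-empty result. Everything here is illustrative: staged_search and the search callable's keyword names are hypothetical, the order in which depth and jump limits are raised is an assumption, and so is returning an empty tuple when every stage fails.

```python
from itertools import product


def staged_search(search, go_external=True, increase_depth_until=2, increase_jump_until=0):
    """Return the first non-empty result, trying the four stages in order."""
    for multiple_transition in (False, True):
        # Stages 1 and 3: backbone only, external jumps disabled.
        found = search(external=False, extra_depth=0, extra_jump=0,
                       multiple_transition=multiple_transition)
        if found:
            return tuple(found)
        if not go_external:
            continue
        # Stages 2 and 4: external jumps enabled, limits relaxed step by step.
        for extra_depth, extra_jump in product(
            range(increase_depth_until + 1), range(increase_jump_until + 1)
        ):
            found = search(external=True, extra_depth=extra_depth,
                           extra_jump=extra_jump,
                           multiple_transition=multiple_transition)
            if found:
                return tuple(found)
    return ()


calls = []


def fake_search(**settings):
    """Hypothetical search that only succeeds with one extra level of depth."""
    calls.append(settings)
    if settings["external"] and settings["extra_depth"] >= 1:
        return [(("NODE_A", "NODE_B", 0),)]
    return []


paths = staged_search(fake_search)
```

Stopping at the first success is what makes the returned paths come from the most restrictive stage that produced anything.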
- identify_source(dataset_ids, mode)[source]
Infer the most likely origin (assembly and/or Ensembl release) of a heterogeneous identifier list.
The function tallies how often each origin triple appears among dataset_ids and returns the counts sorted in descending order.
- Parameters:
dataset_ids (list[str]) – Collection of identifiers to analyse.
mode (str) – Granularity of the origin to extract; one of:
- ‘complete’ → (assembly, db, release)
- ‘ensembl_release’ → release only
- ‘assembly’ → assembly only
- ‘assembly_ensembl_release’ → (assembly, release)
- Returns:
Pairs (origin, count) sorted by frequency.
- Return type:
list[tuple[Any, int]]
- Raises:
ValueError – If mode is not one of the recognised values.
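The tally itself is a frequency count over projected origin triples. The sketch below assumes the (assembly, db, release) triples have already been looked up for each identifier; tally_origins and the sample triples are hypothetical.

```python
from collections import Counter

# How each mode projects an (assembly, db, release) triple.
PROJECTIONS = {
    "complete": lambda assembly, db, release: (assembly, db, release),
    "ensembl_release": lambda assembly, db, release: release,
    "assembly": lambda assembly, db, release: assembly,
    "assembly_ensembl_release": lambda assembly, db, release: (assembly, release),
}


def tally_origins(origin_triples, mode):
    """Count projected origins and return (origin, count) pairs, most frequent first."""
    if mode not in PROJECTIONS:
        raise ValueError(f"Unknown mode: {mode!r}")
    project = PROJECTIONS[mode]
    return Counter(project(*triple) for triple in origin_triples).most_common()


# Made-up triples standing in for the per-identifier lookups.
triples = [(38, "ensembl_gene", 110), (38, "ensembl_gene", 110), (37, "ensembl_gene", 75)]
by_assembly = tally_origins(triples, "assembly")
```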
- minimum_assembly_jumps(the_path, step_pri=None, current_priority=None)[source]
Compute the penalty incurred by assembly downgrades along a path.
Each path step may be annotated with one or more candidate assemblies. These are translated into priority values via the organism-scoped configuration DB.assembly_mysqlport_priority. The algorithm walks the path, tracking the current priority and counting how many times it must drop to a lower priority value; each drop constitutes an “assembly jump” penalty.
- Parameters:
the_path (Iterable[tuple]) – Sequence of edge descriptors; each element is either (n1, n2, k) or (n1, n2, k, release).
step_pri (list[int] | None, optional) – Priority list for the first edge. If None, it is derived from the_path.
current_priority (int | None, optional) – Starting priority. If None, initialised to max(step_pri).
- Returns:
assembly_jump - total number of priority drops.
step_pri - priority list of the last processed edge.
current_priority - priority value after the final edge.
- Return type:
tuple[int, list[int], int]
- path_search(from_id, from_release, to_release, reverse, external_settings, external_jump=None, multiple_ensembl_transition=False)[source]
Enumerate every admissible history path from from_id at from_release to to_release.
The algorithm performs a depth-first traversal of the Ensembl history edges. Whenever it becomes stranded on a non-backbone node it may “beam-up” via a synonym path through an external database, subject to the constraints in external_settings:
synonymous_max_depth - maximum depth of a synonym search.
jump_limit - maximum number of external “beam-up” jumps allowed.
nts_backbone - canonical node-type of the Ensembl backbone.
Additional flags control the initial conditions:
- Setting external_jump to np.inf disables external jumps.
Setting it to None enables them with the counter reset to 0.
- multiple_ensembl_transition allows the algorithm to time-travel to
a different release while still on an external node; this is useful when from_release was inferred and might not actually connect.
- Parameters:
from_id (str) – Identifier to start the search from.
from_release (int) – Release number where from_id is considered active.
to_release (int) – Target Ensembl release.
reverse (bool) – If True, traverse the graph backwards in time; otherwise forwards.
external_settings (dict) – Copy of DB.external_search_settings that governs depth, jump limits, and backbone node-type.
external_jump (float | None) – Current external-jump count (None starts from zero, np.inf forbids any jump).
multiple_ensembl_transition (bool) – Permit the synonym engine to select a different release for an external node when no path exists at from_release.
- Returns:
A set of edge-lists. Each edge is stored as (src, dst, key); an empty walk that terminates immediately is represented by ((None, from_id, None),).
- Return type:
set[tuple[tuple[str, str, int]]]
Notes
The method is intentionally side-effect free; it constructs all intermediate data on the stack and returns a fresh set.
- path_step_possible_assembly_jumps(n1, n2, n3, n4=None)[source]
Return the genome-assembly codes that can legally be used for a single edge.
The helper inspects the edge that connects n1 → n2 and filters the assemblies recorded on that edge against the release constraint n4:
- None - the edge is treated as backbone history; the result is the
graph-wide default assembly (usually the build on which the backbone was constructed).
int - keep only assemblies whose release set contains that single release.
set[int] - keep assemblies whose release set intersects the provided set.
- Parameters:
n1 (str) – Source node identifier.
n2 (str) – Destination node identifier.
n3 (int) – Edge key within the NetworkX multigraph.
n4 (int | set[int] | None, optional) – Release filter as described above.
- Returns:
Sorted list of assembly codes (species-specific integers; e.g. [37, 38] for human).
- Return type:
list[int]
- Raises:
ValueError – If n4 is of an unsupported type.
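The three filter modes can be sketched on a plain mapping. This assumes an edge's assembly annotation is available as {assembly: set_of_releases}; filter_assemblies, the default_assembly argument and the sample data are all hypothetical.

```python
def filter_assemblies(edge_assemblies, release_filter=None, default_assembly=38):
    """Return the sorted assembly codes allowed for one edge."""
    if release_filter is None:
        # Backbone history: fall back to the graph-wide default assembly.
        return [default_assembly]
    if isinstance(release_filter, int):
        wanted = {release_filter}
    elif isinstance(release_filter, (set, frozenset)):
        wanted = set(release_filter)
    else:
        raise ValueError(f"Unsupported release filter: {release_filter!r}")
    # Keep assemblies whose release set shares at least one release with the filter.
    return sorted(a for a, releases in edge_assemblies.items() if releases & wanted)


# Made-up edge annotation: assembly code -> releases recorded on the edge.
edge = {37: {75, 80}, 38: {80, 110}}
single_release = filter_assemblies(edge, 80)
release_set = filter_assemblies(edge, {110})
```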
- should_graph_reversed(from_id, to_release)[source]
Determine the temporal orientation of the graph walk.
Given an identifier that is active in one or more release intervals, the routine decides whether the subsequent path-finder must move forward in time, backward in time, or explore both directions in parallel in order to reach the target release.
The decision is based on the closest boundary of every active interval returned by Track.get_from_release_and_reverse_vars() (mode=‘closest’).
- Parameters:
from_id (str) – The starting identifier (Ensembl gene, transcript, protein, or external ID).
to_release (int) – The Ensembl release the user wishes to convert to.
- Returns:
- ‘forward’ - walk old → new, starting at the earliest release in which from_id is active → return (‘forward’, start_release)
- ‘reverse’ - walk new → old, starting at the latest active release → return (‘reverse’, start_release)
- ‘both’ - split search: one forward walk and one reverse walk → return (‘both’, (forward_start, reverse_start))
- Return type:
tuple[str, Union[int, tuple[int, int]]]
- Raises:
ValueError – If from_id is never active in or around to_release (i.e. no viable starting release can be found).
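Because the second element of the return value changes shape with the direction, callers may find it convenient to normalise it into a list of (start_release, reverse) pairs. The helper below is a hypothetical sketch built only on the three documented return shapes.

```python
def expand_direction(decision):
    """Turn a should_graph_reversed-style result into (start_release, reverse) pairs."""
    direction, start = decision
    if direction == "forward":
        return [(start, False)]
    if direction == "reverse":
        return [(start, True)]
    if direction == "both":
        forward_start, reverse_start = start
        return [(forward_start, False), (reverse_start, True)]
    raise ValueError(f"Unexpected direction: {direction!r}")


# Made-up release numbers for illustration.
walks = expand_direction(("both", (90, 105)))
```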
- synonymous_nodes(the_id, depth_max, filter_node_type, from_release=None, ensembl_backbone_shallow_search=False, account_for_hyperconnected_nodes=True)[source]
Public wrapper around _recursive_synonymous().
The method returns all minimal-length synonym paths emanating from the_id.
The function first runs a default depth search determined by DB.external_search_settings[‘synonymous_max_depth’]. If no synonym is found and depth_max is greater than that default, a second, deeper search is attempted.
For every distinct target node the shortest path is kept; longer paths to the same target are discarded.
- Parameters:
the_id (str) – Source identifier.
depth_max (int) – Maximum search depth to try if the default search fails.
filter_node_type (set[str]) – Node-types that are acceptable for the target node(s). Must not include the generic ‘external’ type—specify the concrete external DB instead.
from_release (int | None) – Constrain targets to those active in this Ensembl release.
ensembl_backbone_shallow_search (bool) – If True, restricts the graph traversal as explained in _recursive_synonymous().
account_for_hyperconnected_nodes (bool) – If True, skip nodes that are marked as hyperconnective (very high connectivity) to prevent search explosion and low-quality paths. Defaults to True.
- Returns:
- A list whose elements are [identifier_path, node_type_path] pairs,
each representing the minimal route to one synonymous node.
- Return type:
list[list[list[str]]]
- Raises:
ValueError – If filter_node_type improperly contains the generic external type, or if depth_max is incompatible with ensembl_backbone_shallow_search.
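The "shortest path per target" rule can be sketched as a single pass over [identifier_path, node_type_path] pairs. The function name, the node labels and the ‘refseq’ node-type are invented for illustration, and keeping the first path on a length tie is an assumption.

```python
def keep_shortest_per_target(paths):
    """Keep, for each final node, the shortest [identifier_path, node_type_path] pair."""
    best = {}
    for id_path, type_path in paths:
        target = id_path[-1]
        # Strict '<' keeps the first path seen when two have equal length.
        if target not in best or len(id_path) < len(best[target][0]):
            best[target] = [id_path, type_path]
    return list(best.values())


# Two routes reach GENE_T; only the direct one survives.
found = [
    [["PROT_A", "XREF_X", "GENE_T"], ["uniprot", "refseq", "ensembl_gene"]],
    [["PROT_A", "GENE_T"], ["uniprot", "ensembl_gene"]],
    [["PROT_A", "GENE_U"], ["uniprot", "ensembl_gene"]],
]
minimal = keep_shortest_per_target(found)
```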
- class TrackTests(*args, **kwargs)[source]
Bases:
Track, ABC
Developer-facing integrity-test harness for Track.
This module defines TrackTests, a mix-in that adds an extensive white-box test suite to a populated idtrack.Track instance. The class is for developers only; it should never be used in production pipelines. Every public method beginning with is_ returns a boolean that tells whether a specific invariant holds. Methods beginning with history_ execute heavier, end-to-end conversions and collect rich statistics. The class is intended to be mixed into a concrete Track subclass, or instantiated standalone, after the underlying graph and lookup tables have been fully built. It performs purely read-only operations and therefore imposes no risk of mutating state.
All test methods share the following contract:
They never raise on failure—return-value only—so they can be run in bulk without interrupting your session.
A return value of
Truemeans the invariant holds;Falsemeans a violation was detected.Where useful, a verbose flag gives a tqdm progress bar so long- running checks remain user-friendly.
Typical use:
tests = TrackTests(...)
tests.is_id_functions_consistent_ensembl()  # Returns False if inconsistent.
Note
The class is not designed for production; instantiate it only in test suites or interactive debugging sessions.
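Because every is_ method returns a boolean instead of raising, the checks lend themselves to bulk execution. The following is a minimal sketch of that pattern, not part of IDTrack: `run_boolean_checks` and `FakeTests` are hypothetical names, the stand-in class replaces a real TrackTests instance (which needs a fully built graph), and the sketch assumes each check can be called with its default arguments.

```python
from typing import Dict


def run_boolean_checks(obj: object, prefix: str = "is_") -> Dict[str, bool]:
    """Call every method whose name starts with ``prefix`` and collect the results."""
    results: Dict[str, bool] = {}
    for name in sorted(dir(obj)):
        if not name.startswith(prefix):
            continue
        method = getattr(obj, name)
        if callable(method):
            # Each check is expected to return True (invariant holds) or False.
            results[name] = bool(method())
    return results


class FakeTests:
    """Hypothetical stand-in for a test harness; not an IDTrack class."""

    def is_alpha_ok(self) -> bool:
        return True

    def is_beta_ok(self) -> bool:
        return False


report = run_boolean_checks(FakeTests())
failed = [name for name, ok in report.items() if not ok]
print(failed)
```

Collecting the results in a dictionary keeps the run uninterrupted and leaves the decision about what to do with failures to the caller.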
Initialize the test harness.
All positional and keyword arguments are forwarded verbatim to __init__. Besides constructing the underlying graph, the initializer sets up a dedicated logging.Logger named "track_tests" so individual test routines can emit structured diagnostics without polluting the main application log.
- Parameters:
args – Positional arguments accepted by __init__.
kwargs – Keyword arguments accepted by __init__.
- _format_history_travel_testing_report(res, include_header=False, line_separation_at_end=True)[source]
Format a complete history travel testing report with metrics.
- Parameters:
res (dict[str, Any]) – Results dictionary from history_travel_testing containing conversion metrics and parameters.
include_header (bool) – If True, include the header section with source/target information. Defaults to False.
line_separation_at_end (bool) – If True, append a blank line separator at the end of the report. Defaults to True.
- Returns:
Lines of formatted text for the complete report.
- Return type:
list[str]
- _format_history_travel_testing_report_header(p)[source]
Format the header section for a history travel testing report.
- Parameters:
p (dict[str, Any]) – Parameters dictionary containing from_database, from_assembly, from_release, to_database, and to_release.
- Returns:
Lines of formatted text for the report header.
- Return type:
list[str]
- history_travel_testing(from_release, from_assembly, from_database, to_release, to_database, go_external, prioritize_to_one_filter, convert_using_release, from_fraction=1.0, verbose=True, verbose_detailed=False, return_ensembl_alternative=False)[source]
Run an end-to-end Ensembl-history conversion and collect granular QA metrics.
The routine samples identifiers from from_database/from_release (optionally down-sampling via from_fraction) and converts each one to to_database/to_release using idtrack.Track.convert(). It is intentionally non-fatal: every failure mode is caught, logged and tallied so that large regression suites can run unattended. All results are returned in a single nested metrics dictionary whose structure mirrors the printable report produced by format_history_travel_testing_report().
The statistics fall into four conceptual groups; each counter not only records an absolute event count but also serves as a red-flag indicator for specific classes of mapping pathology. Use the guidelines below to interpret the numbers and decide whether a run is healthy, questionable, or action-required.
Failure / anomaly counters
history_voyage_failed_gracefully - the converter raised EmptyConversionMetricsError.
history_voyage_failed_unknown - any other unexpected exception.
query_not_in_the_graph - source ID absent from the graph.
lost_item - traversal finished but produced no final IDs.
lost_item_but_the_same_id_exists - special case of lost_item when the target and source DB are both Ensembl-gene and the target ID still exists in the graph.
found_ids_not_accurate - at least one returned target ID is not part of the authoritative ids_to reference set.
Mapping quality
one_to_one_ids - queries that resolved to exactly one target ID.
one_to_multiple_ids - queries with > 1 admissible targets.
one_to_multiple_final_conversion - subset of the above where exactly one traversal path was found (heuristics eliminated alternatives).
Collision analysis
The list clashing_id_type == [clash_one_one, clash_multi_multi, clash_multi_one] classifies target IDs that were reached by more than one query:
clash_one_one - every colliding query was 1→1.
clash_multi_multi - every colliding query was 1→many.
clash_multi_one - mixture of 1→1 and 1→many queries (most alarming category).
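The three clash classes can be illustrated with a small sketch. It assumes, purely for illustration, that the conversion result is available as a mapping from each query ID to its set of target IDs; `classify_clashes` is a hypothetical helper and may differ from how the function computes these counters internally.

```python
from collections import defaultdict
from typing import Dict, List, Set


def classify_clashes(conversion: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Group target IDs reached by more than one query into three clash classes."""
    reached_by: Dict[str, List[str]] = defaultdict(list)
    for query, targets in conversion.items():
        for target in targets:
            reached_by[target].append(query)

    clashes: Dict[str, List[str]] = {
        "clash_one_one": [],
        "clash_multi_multi": [],
        "clash_multi_one": [],
    }
    for target, queries in sorted(reached_by.items()):
        if len(queries) < 2:
            continue  # reached by a single query: no collision
        one_to_one = [len(conversion[q]) == 1 for q in queries]
        if all(one_to_one):
            clashes["clash_one_one"].append(target)
        elif not any(one_to_one):
            clashes["clash_multi_multi"].append(target)
        else:
            clashes["clash_multi_one"].append(target)
    return clashes


# Toy input with made-up IDs.
example = {
    "Q1": {"T1"},        # 1→1
    "Q2": {"T1"},        # 1→1, collides with Q1 on T1
    "Q3": {"T2", "T3"},  # 1→many
    "Q4": {"T2", "T3"},  # 1→many, collides with Q3 on T2 and T3
    "Q5": {"T4"},        # 1→1
    "Q6": {"T4", "T5"},  # 1→many, collides with Q5 on T4
}
result = classify_clashes(example)
print(result)
```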
Timings & book-keeping
time - wall-clock runtime in seconds.
conversion - per-query mapping result for all successful traversals.
converted_item_dict / converted_item_dict_reversed - raw per-ID caches used to derive the higher-level counters above.
parameters - echo of the function arguments.
ids - the sampled from and reference to ID sets.
- Parameters:
from_release (int) – Ensembl release number of the source IDs.
from_assembly (int) – Genome assembly code of the source IDs.
from_database (str) – Node-type / database of the source IDs.
to_release (int) – Ensembl release number of the target IDs.
to_database (str) – Node-type / database to convert into.
go_external (bool) – Permit temporary detours through external IDs when native Ensembl history edges break.
prioritize_to_one_filter (bool) – Prefer 1→1 mappings over 1→many when multiple paths exist.
convert_using_release (bool) – Pass from_release straight into idtrack.Track.convert() instead of letting it infer the starting point.
from_fraction (float) – Fraction (0 < x ≤ 1) of the ids_from population to sample; speeds up smoke tests.
verbose (bool) – Show tqdm progress bar (coarse).
verbose_detailed (bool) – Embed live metric counters in the tqdm postfix.
return_ensembl_alternative (bool) – Forwarded to idtrack.Track.convert().
- Raises:
ValueError – If either database argument refers to an Ensembl node-type (must use backbone helpers instead) or if from_fraction is outside the half-open interval (0, 1].
- Returns:
- Nested metrics dictionary with the layout described above.
Use format_history_travel_testing_report() for a human-readable summary.
- Return type:
dict
Notes
All counters are absolute counts - divide by len(metrics['ids']['from']) to obtain rates.
The collision analysis is inspired by the clash statistics logic implemented at the end of the function and helps spot discrepant “unique” IDs that suddenly become ambiguous. Keeping all three clash counters at zero is the gold standard for a healthy build.
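The normalisation mentioned in the notes can be sketched as follows. This assumes the counters are stored as plain integers under top-level keys of the metrics dictionary; `counters_to_rates` is a hypothetical helper, and the toy dictionary is hand-written rather than real output.

```python
from typing import Any, Dict, Iterable


def counters_to_rates(metrics: Dict[str, Any], counters: Iterable[str]) -> Dict[str, float]:
    """Divide selected absolute counters by the number of sampled source IDs."""
    n_queries = len(metrics["ids"]["from"])
    if n_queries == 0:
        raise ValueError("No source IDs were sampled; rates are undefined.")
    return {name: metrics[name] / n_queries for name in counters}


# Hand-written toy dictionary; the numbers are illustrative, not real results.
toy_metrics = {
    "ids": {"from": ["A", "B", "C", "D"]},
    "one_to_one_ids": 3,
    "lost_item": 1,
}
rates = counters_to_rates(toy_metrics, ["one_to_one_ids", "lost_item"])
print(rates)
```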
- history_travel_testing_random(from_fraction, include_ensembl_source=True, include_external_source=True, include_ensembl_destination=True, include_external_destination=True, verbose=True, verbose_detailed=False, strict_forward=False, convert_using_release=False, prioritize_to_one_filter=True, return_result=False)[source]
Convenience wrapper around history_travel_testing().
The routine generates a random but internally consistent test case via history_travel_testing_random_arguments_generator(), logs the chosen parameters (unless verbose is False) and delegates the heavy lifting to history_travel_testing().
- Parameters:
from_fraction (float) – Fraction of IDs to sample from the source set.
strict_forward (bool) – Forwarded to the argument generator.
convert_using_release (bool) – Forwarded to history_travel_testing().
prioritize_to_one_filter (bool) – Forwarded to history_travel_testing().
verbose (bool) – Show coarse progress information.
verbose_detailed (bool) – Include extended per-ID counters in the progress bar.
return_result (bool) – If True, return the metrics dictionary.
include_ensembl_source (bool) – Include Ensembl databases as valid sources.
include_external_source (bool) – Include external databases as valid sources.
include_ensembl_destination (bool) – Include Ensembl databases as valid destinations.
include_external_destination (bool) – Include external databases as valid destinations.
- Returns:
The metrics dictionary returned by history_travel_testing() when return_result is True.
- Return type:
dict
- history_travel_testing_random_arguments_generator(strict_forward, include_exclude_list)[source]
Generate a plausible random parameter set for history_travel_testing().
The helper picks compatible source/target assemblies, releases and databases so the subsequent conversion test has a realistic chance to succeed. When strict_forward is True the target release is guaranteed to be ≥ the source release (no time-travel back).
- Parameters:
strict_forward (bool) – Enforce a non-decreasing release direction.
include_exclude_list (list[bool]) – A 4-element list of booleans controlling inclusion of [include_ensembl_source, include_external_source, include_ensembl_destination, include_external_destination].
- Returns:
Keys from_assembly, from_release, to_release, from_database, to_database ready to be splatted into history_travel_testing().
- Return type:
dict
- how_many_corresponding_path_ensembl(from_release, from_assembly, to_release, go_external, verbose=True)[source]
Count history paths between two Ensembl releases.
The method iterates over all Ensembl-gene stable IDs that exist in from_release/from_assembly. For every ID that is present in the graph it calls idtrack.Track.get_possible_paths() and records how many distinct paths the searcher finds to to_release.
The routine is non-destructive; it merely provides a quick way to gauge the density of the history graph or to spot releases where path-finding was unexpectedly difficult.
- Parameters:
from_release (int) – Source Ensembl release number.
from_assembly (int) – Source genome assembly code.
to_release (int) – Target Ensembl release number.
go_external (bool) – If True history paths are allowed to temporarily leave the Ensembl lineage via external databases.
verbose (bool) – Show a tqdm progress bar (default True).
- Returns:
A list of two-element sub-lists [stable_id, n_paths] where n_paths is an int ≥ 0 when the ID was in the graph, or None when the source ID was absent.
- Return type:
list[list[Union[str, int, None]]]
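A returned list of this shape can be condensed into a few headline numbers. The sketch below is only an illustration of how the three cases (absent, zero paths, at least one path) might be told apart; `summarise_path_counts` is a hypothetical helper and the IDs are made-up placeholders.

```python
from typing import Dict, List, Union

PathCountRow = List[Union[str, int, None]]


def summarise_path_counts(rows: List[PathCountRow]) -> Dict[str, int]:
    """Count IDs that were absent, had no path, or had at least one path."""
    absent = sum(1 for _, n_paths in rows if n_paths is None)
    no_path = sum(1 for _, n_paths in rows if n_paths == 0)
    with_path = sum(1 for _, n_paths in rows if n_paths is not None and n_paths > 0)
    return {"absent": absent, "no_path": no_path, "with_path": with_path}


rows = [["ID_A", 2], ["ID_B", 0], ["ID_C", None], ["ID_D", 1]]
summary = summarise_path_counts(rows)
print(summary)
```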
- is_base_is_range_correct(verbose=True)[source]
Verify consistency of base-gene active-range calculations.
Each “base Ensembl gene” node (node_type == 'base_ensembl_gene') has an active release range: the list of Ensembl releases during which descendants of the gene were present. There are two independent ways to obtain this information:
High-level helper graph.get_active_ranges_of_base_id_alternative - a cached convenience wrapper.
Low-level reconstruction by aggregating the combined_edges table and converting the set of releases into compact [start, end] slices via graph.list_to_ranges.
This test iterates through all base-gene nodes and asserts that the two methods deliver byte-identical results.
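The compaction of a release set into [start, end] slices can be pictured with a short sketch. It is not the library's list_to_ranges implementation; `releases_to_ranges` is a hypothetical function that assumes inclusive bounds and treats consecutive integers as one run.

```python
from typing import Iterable, List


def releases_to_ranges(releases: Iterable[int]) -> List[List[int]]:
    """Compact a collection of release numbers into inclusive [start, end] pairs."""
    ranges: List[List[int]] = []
    for release in sorted(set(releases)):
        if ranges and release == ranges[-1][1] + 1:
            ranges[-1][1] = release  # extend the current run
        else:
            ranges.append([release, release])  # start a new run
    return ranges


compact = releases_to_ranges({90, 91, 92, 95, 97, 98})
print(compact)  # [[90, 92], [95, 95], [97, 98]]
```

Two range lists produced this way can be compared with plain equality, which is the kind of exact match the test above describes.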
- Parameters:
verbose (bool) – If True (default) show a tqdm progress bar that updates with the current node under inspection.
- Returns:
True if every base-gene has matching ranges; False as soon as a single mismatch is encountered.
- Return type:
bool
- is_combined_edges_dicts_overlapping_and_complete()[source]
Check edge-dictionary partitioning invariants.
The Track graph materialises three edge caches (combined_edges and its two specialised siblings), each storing adjacency and release metadata for a different subset of nodes:
combined_edges - all nodes, including backbone genes.
combined_edges_genes - stable Ensembl genes (non-assembly-specific).
combined_edges_assembly_specific_genes - genes that exist only on a single assembly.
The design contract says:
Disjointness - No node key may appear in more than one dictionary.
Completeness - The union of the dictionaries must cover all graph nodes except those that represent alternative database versions (e.g. “EnsemblMetazoa”) which are intentionally kept separate.
This routine enforces both rules.
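The two rules amount to checking that a family of key sets partitions a universe of eligible nodes. A minimal sketch under that reading follows; `is_partition` and the part names are hypothetical, and the caller is assumed to have already removed the intentionally excluded nodes from `universe`.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, Set


def is_partition(parts: Dict[str, Set[Hashable]], universe: Iterable[Hashable]) -> bool:
    """Return True if the parts are pair-wise disjoint and together equal the universe."""
    for (_, left), (_, right) in combinations(parts.items(), 2):
        if left & right:
            return False  # disjointness violated
    covered = set().union(*parts.values())
    return covered == set(universe)  # completeness


nodes = {"n1", "n2", "n3", "n4"}
ok = is_partition({"a": {"n1", "n2"}, "b": {"n3"}, "c": {"n4"}}, nodes)
overlapping = is_partition({"a": {"n1", "n2"}, "b": {"n2", "n3"}, "c": {"n4"}}, nodes)
incomplete = is_partition({"a": {"n1"}, "b": {"n3"}, "c": {"n4"}}, nodes)
print(ok, overlapping, incomplete)
```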
- Returns:
True if the dictionaries are pair-wise disjoint and collectively cover every eligible node; False otherwise.
- Return type:
bool
- is_edge_with_same_nts_only_at_backbone_nodes()[source]
Assert same-node-type edges exist only between backbone genes.
The graph is a multilayer network where nodes of different
This method traverses every base-gene node and checks that condition.
- Returns:
True when no overlaps are found; False otherwise. Each offending base ID triggers a warning with the conflicting ranges.
- Return type:
bool
- is_final_external_conversion_robust(convert_using_release=False, database=None, ens_rel=None, verbose=True, from_fraction=1.0, prioritize_to_one_filter=False)[source]
Validate Ensembl→external conversion against MySQL ground truth.
A random external database is chosen for every genome assembly. For the selected combination the method grabs the authoritative mapping table (graph-ID → external ID set) from MySQL and converts the same graph-IDs with idtrack.Track.convert().
- Parameters:
convert_using_release (bool) – Whether to pin the from_release when calling the converter. Setting this to True usually speeds up the search and mimics user-facing behaviour.
verbose (bool) – Print the current assembly/database/release being tested.
prioritize_to_one_filter (bool) – If True, apply tie-breaking to select a single best target when multiple candidates exist.
ens_rel (int | None) – Specific Ensembl release to test. If None, a random release is chosen.
from_fraction (float) – Fraction of identifiers to sample for testing (0.0-1.0). Defaults to 1.0 (all identifiers).
database (str | None) – Specific external database to test. If None, a random database is chosen.
- Returns:
- True if every converted set equals the MySQL reference,
False upon the first deviation.
- Return type:
bool
- Raises:
ValueError – Raised when test parameters are invalid or incompatible.
- is_id_functions_consistent_ensembl(verbose=True)[source]
Ensure Ensembl ID-list helpers agree with the SQL back-end.
For every release listed in graph.graph["confident_for_release"] the test compares two independent sources of Ensembl-gene IDs for the current genome assembly:
IDs retrieved directly from MySQL via DatabaseManager.
IDs returned by idtrack.Track.get_id_list() from the graph.
A mismatch means either the graph was built incompletely or the helper functions have drifted out of sync with the database schema.
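When two ID sources disagree, it helps to know in which direction. The sketch below shows one way to report that with set differences; `diff_id_sets` is a hypothetical helper and the IDs are placeholders, so it says nothing about how the test itself logs a mismatch.

```python
from typing import Dict, Iterable, Set


def diff_id_sets(reference: Iterable[str], candidate: Iterable[str]) -> Dict[str, Set[str]]:
    """Report IDs present in only one of the two collections."""
    ref, cand = set(reference), set(candidate)
    return {
        "missing_in_candidate": ref - cand,
        "unexpected_in_candidate": cand - ref,
    }


diff = diff_id_sets(["G1", "G2", "G3"], ["G2", "G3", "G4"])
consistent = not any(diff.values())
print(diff, consistent)
```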
- Parameters:
verbose (bool) – If True (default) show a tqdm progress-bar while iterating through the releases.
- Returns:
True when all releases produce identical sets; else False (a descriptive warning is logged).
- Return type:
bool
- is_id_functions_consistent_ensembl_2(verbose=True)[source]
Cross-check Ensembl ID range helpers against raw edge data.
For every backbone Ensembl-gene node the routine computes the list of active release ranges in two distinct ways:
Raw computation - by flattening graph.combined_edges_genes and compacting the releases with idtrack.Track.list_to_ranges().
Cached lookup - via the lazily built dictionary graph.get_active_ranges_of_id.
The two lists must match exactly. A divergence would indicate that the cached helper is out of sync with the authoritative edge structure.
- Parameters:
verbose (bool) – If True (default) wrap the iteration in a tqdm bar.
- Returns:
True if all gene nodes pass; False after the first failure (a warning is emitted).
- Return type:
bool
- is_id_functions_consistent_external(verbose=True)[source]
Check external-ID list helpers against the raw MySQL tables.
For every combination of assembly, Ensembl release (limited to graph.graph["confident_for_release"]) and external database this test performs the following steps:
Query the authoritative list of external IDs directly from the MySQL snapshot via DatabaseManager.
Ask the in-memory graph for the same list via idtrack.Track.get_id_list().
Normalise node names through idtrack.Track.node_name_alternatives() to cope with the occasional “_1” suffix.
Compare the two sets. A mismatch is logged and the method returns False immediately.
The exhaustive traversal is expensive (minutes for large genomes) but ensures the graph's indexing helpers never drift from the actual database content.
- Parameters:
verbose (bool) – If True (default) display a tqdm progress bar and emit log messages at INFO level. When False the method runs silently.
- Returns:
True when every single comparison matched, False as soon as an inconsistency is encountered.
- Return type:
bool
- is_node_consistency_robust(verbose=True)[source]
Check for illegal neighbour relationships and multi-edges.
Any two connected nodes of different node-types may share only one edge. Edges between nodes of the same node-type are only allowed when that type is the Ensembl backbone (ensembl_gene). Any deviation - a lateral same-type connection or more than one edge between the same pair of nodes - is logged and aborts the test.
- Parameters:
verbose (bool) – Print offending nodes when a violation is detected.
- Returns:
True when the graph satisfies the topology rules, False otherwise.
- Return type:
bool
- is_range_functions_robust(verbose=True)[source]
Detect overlapping release ranges among sibling Ensembl IDs.
A base Ensembl-gene ID is the stable identifier that groups multiple versioned Ensembl-gene records (siblings). The gene-history model requires that the release ranges of sibling IDs never overlap - each release must be covered by exactly one child stable ID.
This method traverses every base-gene node and checks that condition.
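The non-overlap condition can be sketched with inclusive release ranges. The data layout (sibling ID mapped to a list of (start, end) tuples), the helper name `find_overlaps` and the IDs are assumptions made for illustration only.

```python
from itertools import combinations
from typing import Dict, List, Tuple

ReleaseRange = Tuple[int, int]


def find_overlaps(sibling_ranges: Dict[str, List[ReleaseRange]]) -> List[Tuple[str, str]]:
    """Return pairs of different sibling IDs whose inclusive ranges share a release."""
    flat = [
        (start, end, sibling)
        for sibling, ranges in sorted(sibling_ranges.items())
        for start, end in ranges
    ]
    overlaps: List[Tuple[str, str]] = []
    for (s1, e1, id1), (s2, e2, id2) in combinations(flat, 2):
        # Two inclusive ranges overlap when each starts before the other ends.
        if id1 != id2 and s1 <= e2 and s2 <= e1:
            overlaps.append((id1, id2))
    return overlaps


siblings = {
    "GENE.1": [(80, 89)],
    "GENE.2": [(90, 99)],
    "GENE.3": [(99, 105)],  # release 99 is also claimed by GENE.2
}
conflicts = find_overlaps(siblings)
print(conflicts)
```

The pairwise comparison is quadratic in the number of ranges, which is acceptable for the handful of siblings under one base ID.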
- Parameters:
verbose (bool) – If True (default) display a tqdm progress-bar.
- Returns:
True when no overlaps are found; False otherwise. Each offending base ID triggers a warning with the conflicting ranges.
- Return type:
bool
- random_dataset_source_generator(assembly, include_external, include_ensembl, for_final_database, only_backbone_tests, release_lower_limit=None, form=None)[source]
Pick a random (<database>, <assembly>, <release>) tuple.
The function guarantees that the triple actually exists in the graph and - if release_lower_limit is provided - honours the minimum release constraint.
- Parameters:
assembly (int) – Genome assembly code used in Ensembl core schema names (e.g. 38 = human GRCh38, 39 = mouse GRCm39, 111 = pig Sscrofa11.1).
include_ensembl (bool) – Whether Ensembl backbone databases may be returned as database.
release_lower_limit (int | None) – Smallest permissible Ensembl release number for the returned triple. None disables the filter.
form (str | None) – Restrict the draw to a particular connection form (protein/coding/gene). None means no restriction.
only_backbone_tests (bool) – If True, restrict selection to backbone-only databases (skip external databases entirely).
include_external (bool) – Whether external (non-Ensembl) databases may be included in the selection pool.
for_final_database (bool) – If True, exclude Ensembl assembly-specific databases from selection (useful when selecting final conversion targets).
- Returns:
(<database>, <assembly>, <release>) or None when no matching release exists.
- Return type:
tuple | None
- Raises:
ValueError – Raised when no valid database/release combination can be found.
- class DB[source]
Bases:
object
Store constants shared across IDTrack modules for Ensembl data access and graph construction.
This class centralizes every constant that multiple components (e.g. idtrack.graph.GraphMaker, idtrack.pathfinder.PathFinder) rely on when talking to the Ensembl FTP mirror, REST API, and public MySQL instances. Housing the values in one immutable namespace prevents circular imports, ensures a single source of truth, and simplifies testing. DB is never instantiated; import the class and reference its attributes directly.
- id_ver_delimiter
Character separating a stable identifier from its version suffix.
- Type:
str
- first_version
Default version assumed when an ID lacks an explicit version component.
- Type:
int
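The interplay of the two constants above can be sketched as follows. The concrete values are assumptions, not read from IDTrack: "." mirrors the familiar Ensembl notation such as ENSG00000139618.15, and 1 is a guess for the default version. `split_id_version` is a hypothetical helper.

```python
from typing import Tuple

ID_VER_DELIMITER = "."  # assumed value of the delimiter
FIRST_VERSION = 1       # assumed default version


def split_id_version(identifier: str) -> Tuple[str, int]:
    """Split 'ID<delimiter>version'; fall back to the default version if none is given."""
    base, sep, version = identifier.rpartition(ID_VER_DELIMITER)
    if sep and version.isdigit():
        return base, int(version)
    return identifier, FIRST_VERSION


with_version = split_id_version("ENSG00000139618.15")
without_version = split_id_version("ENSG00000139618")
print(with_version, without_version)
```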
- connection_timeout
TCP connect timeout in seconds used by both FTP and REST clients.
- Type:
int
- reading_timeout
Socket read timeout in seconds applied to FTP and REST operations.
- Type:
int
- ensembl_ftp_base
Hostname of the Ensembl public FTP mirror.
- Type:
str
- rest_server_api
Root URL of the Ensembl REST API.
- Type:
str
- rest_server_ext
Resource path appended to rest_server_api to query species metadata.
- Type:
str
- mysql_host
Hostname of the Ensembl public MySQL server.
- Type:
str
- myqsl_user
Username for anonymous MySQL access.
- Type:
str
- mysql_togo
Placeholder string kept for backward compatibility when assembling connection URLs.
- Type:
str
- assembly_mysqlport_priority
Organism-aware mapping {organism -> {assembly -> {...}}} defining:
Ports: ordered list of MySQL ports to try for that assembly
Priority: assembly priority within the organism (1 = newest / preferred)
- Type:
dict[str, dict[int, dict[str, Any]]]
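A mapping of this type might look like the sketch below. The organism key, the port numbers and the helper `preferred_assembly` are illustrative assumptions; only the nesting and the Ports / Priority keys follow the description above.

```python
from typing import Any, Dict, List

AssemblyConfig = Dict[str, Dict[int, Dict[str, Any]]]

# Illustrative values only; the real mapping lives in idtrack's DB class.
example_config: AssemblyConfig = {
    "homo_sapiens": {
        38: {"Ports": [3306], "Priority": 1},
        37: {"Ports": [3337], "Priority": 2},
    },
}


def preferred_assembly(config: AssemblyConfig, organism: str) -> int:
    """Return the assembly with the lowest Priority number (1 = preferred)."""
    assemblies = config[organism]
    return min(assemblies, key=lambda assembly: assemblies[assembly]["Priority"])


best = preferred_assembly(example_config, "homo_sapiens")
ports: List[int] = example_config["homo_sapiens"][best]["Ports"]
print(best, ports)
```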
- mysql_port_min_release
Minimum Ensembl release supported by each public MySQL port.
- Type:
dict[int, int]
- all_assemblies
Union of every configured assembly code across supported organisms.
- Type:
set[int]
- main_assembly
Backward-compatibility default assembly (human GRCh38 = 38).
- Type:
int
- synonym_id_nodes_prefix
Prefix inserted before node identifiers that represent synonym edges.
- Type:
str
- no_old_node_id
Sentinel used when a historical ID is retired.
- Type:
str
- no_new_node_id
Sentinel used when no future successor exists.
- Type:
str
- alternative_versions
Two sentinels: no_new_node_id and no_old_node_id.
- Type:
set[str]
- hyperconnecting_threshold
Maximum allowable out-degree before a node is considered hyper-connected and ignored by breadth-first expansions.
- Type:
int
- node_type_str
Edge/Node attribute key holding the node type value.
- Type:
str
- nts_external
Canonical node type assigned to entities originating outside Ensembl.
- Type:
str
- forms_in_order
Stable ordering of Ensembl entity forms (gene, transcript, translation). Order matters when inferring parent/child relationships.
- Type:
list[str]
- backbone_form
Form selected as backbone for graph traversals (always gene).
- Type:
str
- nts_ensembl
Map each canonical form to its namespaced node type (gene → ensembl_gene, etc.).
- Type:
dict[str, str]
- nts_ensembl_reverse
Reverse mapping of nts_ensembl.
- Type:
dict[str, str]
- nts_assembly
Form-to-assembly-specific node type map.
- Type:
dict[str, dict[str, str]]
- nts_assembly_reverse
Reverse mapping of nts_assembly.
- Type:
dict[str, dict[str, str]]
- nts_base_ensembl
Reduced node type names stripped of assembly suffixes.
- Type:
dict[str, str]
- nts_base_ensembl_reverse
Reverse mapping of nts_base_ensembl.
- Type:
dict[str, str]
- nts_bidirectional_synonymous_search
Node types for which synonym searches are performed bidirectionally.
- Type:
set[str]
- nts_assembly_gene
Every node type that represents a gene, regardless of assembly.
- Type:
set[str]
- connection_dict
Edge attribute key whose value stores connection metadata dictionaries.
- Type:
str
- conn_dict_str_ensembl_base
Constant placed under connection_dict when the edge points to an Ensembl data source.
- Type:
str
- external_search_settings
Default limits for outward traversal into external databases. Keys are jump_limit, synonymous_max_depth, and nts_backbone.
- Type:
dict[str, Any]
- placeholder_na
Sentinel stored in HDF5 datasets where a true NA/None value is not permitted or would break downstream type expectations.
- Type:
str
- UTF8
The literal string "utf-8", a canonical spelling of the UTF-8 encoding name used when writing variable-length strings to HDF5 files.
- Type:
str
- UTF8_STR
Pre-configured variable-length UTF-8 string dtype created via h5py.string_dtype(). Pass this value when creating HDF5 datasets that should hold arbitrary Unicode text to avoid hard-coding datatypes throughout the codebase.
- Type:
h5py.Datatype