Part 7 — Advanced Topics
Last updated: 2026-01-09
This notebook collects advanced, production-oriented topics for power users:
custom external database profiles (strict vs broad)
programmatic YAML management (useful in pipelines)
troubleshooting & diagnostics (common failure modes)
restricted-server SSH bridging (ConnectionBridge)
integration patterns (Snakemake/Nextflow-friendly workflows)
Tip: You don’t need this notebook for day-to-day conversions. It’s here for when you want reproducibility at scale.
7.1 — Custom External Database Inclusion
The external YAML is an explicit contract: it defines which external namespaces are allowed to influence your graph.
Two common profiles:
Strict profile (recommended for most analyses): small allowlist, lower ambiguity, faster queries.
Broad profile (exploration): larger allowlist, more coverage, but higher ambiguity and slower builds.
In IDTrack today, the active YAML file name is fixed (<organism>_externals_modified.yml). A practical pattern is to keep multiple profiles as side-by-side files and copy/rename the one you want for a given run.
The next cell demonstrates how to generate two profile files without overwriting your active YAML.
1
from __future__ import annotations

import os
from copy import deepcopy
from pathlib import Path

import yaml

try:
    import idtrack
    IDTRACK_OK = True
except Exception as e:
    print('idtrack import failed ->', repr(e))
    IDTRACK_OK = False

LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)
print('Local repository:', LOCAL_REPOSITORY)

if IDTRACK_OK:
    organism = 'homo_sapiens'
    # Prefer your configured YAML if it exists; otherwise fall back to the package default.
    configured = LOCAL_REPOSITORY / f'{organism}_externals_modified.yml'
    default_cfg = Path(idtrack.__file__).resolve().parent / 'default_config' / f'{organism}_externals_modified.yml'
    source_path = configured if configured.exists() else default_cfg
    print('Reading YAML from:', source_path)
    y = yaml.safe_load(source_path.read_text(encoding='utf-8'))
    form = list(y[organism].keys())[0]

    strict_allowlist = {
        'HGNC Symbol',
        'EntrezGene',
        'UniProtKB',
        'RefSeq_mRNA',
    }
    broad_allowlist = strict_allowlist | {
        # Add cautiously; broad profiles can increase ambiguity.
        'RefSeq_peptide',
        'ArrayExpress',
    }

    def make_profile(base: dict, allowlist: set[str]) -> dict:
        # Work on a copy so the loaded YAML stays untouched.
        out = deepcopy(base)
        for db_name, db_block in out[organism][form].items():
            include = db_name in allowlist
            for _asm, attrs in db_block.get('Assembly', {}).items():
                attrs['Include'] = bool(include)
        return out

    strict_yaml = make_profile(y, strict_allowlist)
    broad_yaml = make_profile(y, broad_allowlist)
    strict_path = LOCAL_REPOSITORY / f'{organism}_externals_modified_strict.yml'
    broad_path = LOCAL_REPOSITORY / f'{organism}_externals_modified_broad.yml'
    strict_path.write_text(yaml.safe_dump(strict_yaml, sort_keys=False, allow_unicode=True), encoding='utf-8')
    broad_path.write_text(yaml.safe_dump(broad_yaml, sort_keys=False, allow_unicode=True), encoding='utf-8')
    print('Wrote strict profile:', strict_path.name)
    print('Wrote broad profile:', broad_path.name)
    print()
    print('To activate a profile: copy/rename it to:')
    print(' ', configured)
else:
    print('Skipping YAML profile demo (idtrack not imported).')
Local repository: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache
Reading YAML from: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/homo_sapiens_externals_modified.yml
Wrote strict profile: homo_sapiens_externals_modified_strict.yml
Wrote broad profile: homo_sapiens_externals_modified_broad.yml
To activate a profile: copy/rename it to:
/ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/homo_sapiens_externals_modified.yml
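Activating a profile is a plain file copy. The sketch below shows one way to script it, assuming the side-by-side naming used above (<organism>_externals_modified_<profile>.yml). The helper name activate_profile and the .bak backup file are illustrative choices, not part of IDTrack.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def activate_profile(repo: Path, organism: str, profile: str) -> Path:
    """Copy a side-by-side profile file over the active external YAML.

    Hypothetical helper (not an IDTrack API). Assumes profile files are named
    '<organism>_externals_modified_<profile>.yml' inside `repo`.
    """
    source = repo / f"{organism}_externals_modified_{profile}.yml"
    active = repo / f"{organism}_externals_modified.yml"
    if not source.is_file():
        raise FileNotFoundError(f"Profile file not found: {source}")
    if active.exists():
        # Keep one backup so the previous configuration can be restored.
        shutil.copy2(active, active.with_name(active.name + ".bak"))
    shutil.copy2(source, active)
    return active
```

With the files written by the cell above, activate_profile(LOCAL_REPOSITORY, 'homo_sapiens', 'strict') would make the strict profile the active one and leave the previous file next to it with a .bak suffix.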
7.2 — Programmatic YAML Management
For pipelines you often want YAML changes to be scriptable and repeatable.
A safe automation pattern:
Generate (or refresh) a template YAML.
Apply a curated allowlist in code.
Write the resulting YAML to a file you commit alongside your pipeline.
The next cell shows a reusable helper that:
reads a template YAML
applies an allowlist
writes a modified YAML
Tip: Keep allowlists per organism in your pipeline repository (as Python sets or a small TOML/YAML). That makes your configuration reviewable.
2
from __future__ import annotations

import os
from copy import deepcopy
from pathlib import Path

import yaml


def apply_allowlist_to_yaml(template_path: Path, organism: str, allowlist: set[str], out_path: Path) -> None:
    y = yaml.safe_load(template_path.read_text(encoding='utf-8'))
    form = list(y[organism].keys())[0]
    out = deepcopy(y)
    enabled = []
    for db_name, db_block in out[organism][form].items():
        include = db_name in allowlist
        if include:
            enabled.append(db_name)
        # Set the flag for every assembly of this database.
        for _asm, attrs in db_block.get('Assembly', {}).items():
            attrs['Include'] = bool(include)
    out_path.write_text(yaml.safe_dump(out, sort_keys=False, allow_unicode=True), encoding='utf-8')
    print('Wrote:', out_path)
    print('Enabled (count):', len(enabled))
    print('Enabled (sample):', enabled[:15])


# Example: apply a mouse allowlist if a template exists in your local repository
LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
organism = 'mus_musculus'
template = LOCAL_REPOSITORY / f'{organism}_externals_template.yml'
out_yaml = LOCAL_REPOSITORY / f'{organism}_externals_modified.yml'
allowlist_mouse = {'MGI Symbol', 'EntrezGene', 'UniProtKB', 'RefSeq_mRNA'}

if template.exists():
    apply_allowlist_to_yaml(template, organism=organism, allowlist=allowlist_mouse, out_path=out_yaml)
else:
    print('Template not found:', template)
    print('Generate it first with Part 2 (02_prepare_new_external_yaml.ipynb).')
Wrote: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/mus_musculus_externals_modified.yml
Enabled (count): 2
Enabled (sample): ['EntrezGene', 'MGI Symbol']
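In the run above, the allowlist held four names but only two databases were enabled. This suggests that allowlist entries with no matching database name in the template are ignored without any message. The sketch below is one way to surface such entries. It assumes the same YAML layout the helper above relies on (organism, then form, then database name). The function name unmatched_allowlist_names, the form name 'gene', and the toy values are hypothetical, and unlike the helper above it checks every form rather than only the first.

```python
from __future__ import annotations


def unmatched_allowlist_names(cfg: dict, organism: str, allowlist: set[str]) -> list[str]:
    """Return allowlist entries that match no database name in the YAML.

    Hypothetical helper. `cfg` is the dict returned by yaml.safe_load, and the
    layout (organism -> form -> database name) is assumed to match the cell above.
    """
    known: set[str] = set()
    for form_block in cfg[organism].values():
        known.update(form_block.keys())
    return sorted(allowlist - known)


# Toy configuration with the same shape as the external YAML (values are made up).
toy_cfg = {
    "mus_musculus": {
        "gene": {
            "EntrezGene": {"Assembly": {38: {"Include": False}}},
            "MGI Symbol": {"Assembly": {38: {"Include": False}}},
        }
    }
}
missing = unmatched_allowlist_names(
    toy_cfg, "mus_musculus", {"MGI Symbol", "EntrezGene", "UniProtKB", "RefSeq_mRNA"}
)
print(missing)  # ['RefSeq_mRNA', 'UniProtKB']
```

Failing the pipeline step when this list is non-empty is a reasonable default, since a typo in a database name would otherwise shrink the allowlist unnoticed.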
7.3 — Troubleshooting & Diagnostics
Common failure modes (and what they usually mean):
Permission errors in ``IDTRACK_LOCAL_REPO``: your cache directory is not writable.
REST timeouts: network issues or Ensembl REST is temporarily slow.
MySQL connection errors: outbound MySQL ports may be blocked; IDTrack will fall back to HTTPS/FTP dumps (slower but functional). Port 3337 is used only for the human GRCh37 archive.
``ValueError: release not included in YAML``: your external YAML does not include the snapshot boundary you chose.
Unexpected 1→n explosion: your external allowlist is too broad or contains promiscuous namespaces.
The next cell is a compact diagnostic report you can paste into issues or lab notes.
3
from __future__ import annotations

import os
import socket
from pathlib import Path

report = {}
LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)
report['local_repository'] = str(LOCAL_REPOSITORY)
report['local_repository_writable'] = os.access(LOCAL_REPOSITORY, os.W_OK)

# YAML + graph snapshot inventory
report['yaml_files'] = sorted(p.name for p in LOCAL_REPOSITORY.glob('*_externals_modified.yml'))
report['graph_snapshots'] = sorted(p.name for p in LOCAL_REPOSITORY.glob('graph_*.pickle'))

# REST connectivity
try:
    import requests
    try:
        r = requests.get('https://rest.ensembl.org/info/ping', headers={'Content-Type': 'application/json'}, timeout=15)
        report['ensembl_rest_status'] = r.status_code
    except Exception as e:
        report['ensembl_rest_status'] = f'failed: {e.__class__.__name__}'
except Exception as e:
    report['ensembl_rest_status'] = f'requests_missing: {e.__class__.__name__}'

# MySQL ports (best-effort)
# Note: port 3337 is for the human GRCh37 archive; other species typically use 3306/5306.
try:
    from idtrack._db import DB
    host = DB.mysql_host
    port_status = {}
    for port in [3306, 5306, 3337]:
        try:
            with socket.create_connection((host, port), timeout=2):
                port_status[port] = 'ok'
        except OSError as e:
            port_status[port] = e.__class__.__name__
    report['ensembl_mysql_ports'] = port_status
except Exception as e:
    report['ensembl_mysql_ports'] = f'skipped: {e.__class__.__name__}'

# Print report
for k, v in report.items():
    print(f'{k}: {v}')

# Optional: quick integrity checks (only if a human snapshot exists)
# NOTE: TrackTests can be expensive. We only run a very small, cheap check here.
if report['graph_snapshots'] and any('homo_sapiens' in g for g in report['graph_snapshots']):
    try:
        import idtrack
        api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
        api.configure_logger()
        org, latest = api.resolve_organism('human')
        # Load existing snapshot if present; build if missing (may be slow).
        api.build_graph(organism_name=org, snapshot_release=latest, return_test=True, calculate_caches=False)
        ok = api.track.is_edge_with_same_nts_only_at_backbone_nodes()
        print()
        print('Quick TrackTests check (edge nts invariant):', ok)
    except Exception as e:
        print()
        print('TrackTests quick check skipped/failed ->', repr(e))
2026-01-17 14:59:09 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.
local_repository: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache
local_repository_writable: True
yaml_files: ['homo_sapiens_externals_modified.yml', 'mus_musculus_externals_modified.yml', 'sus_scrofa_externals_modified.yml']
graph_snapshots: ['graph_homo_sapiens_min48_max115_narrow.pickle', 'graph_mus_musculus_min48_max115_narrow.pickle', 'graph_sus_scrofa_min48_max115_narrow.pickle']
ensembl_rest_status: 200
ensembl_mysql_ports: {3306: 'TimeoutError', 5306: 'TimeoutError', 3337: 'TimeoutError'}
2026-01-17 14:59:34 INFO:database_manager: Using assembly-specific release range for homo_sapiens assembly 38: releases 76-115 (from config [76, None])
2026-01-17 15:00:28 INFO:graph_maker: The graph is being read: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/graph_homo_sapiens_min48_max115_narrow.pickle
2026-01-17 15:05:09 INFO:the_graph: Cached properties being calculated: available_genome_assemblies
2026-01-17 15:05:09 INFO:the_graph: Cached properties being calculated: combined_edges
2026-01-17 15:08:47 INFO:the_graph: Cached properties being calculated: combined_edges_genes
2026-01-17 15:10:11 INFO:the_graph: Cached properties being calculated: combined_edges_assembly_specific_genes
Quick TrackTests check (edge nts invariant): True
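When a REST timeout is transient, retrying with an increasing delay is often enough. The sketch below is a generic retry helper, not an IDTrack feature. The name retry_with_backoff and its parameters are hypothetical, and the delay schedule (doubling from base_delay) is just one common choice.

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (OSError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func` until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error unchanged.
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")
```

A call such as retry_with_backoff(lambda: api.resolve_organism('human')) would wrap the lookup used in the cell above. Whether the exceptions raised by a given IDTrack call fall under the default retry_on tuple is an assumption to verify for your version; pass a different tuple if they do not.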
7.4 — Integration Patterns
IDTrack works best in pipelines when you make the snapshot boundary and cache location explicit.
Key ideas:
Set IDTRACK_LOCAL_REPO to a stable, shared path (per project or per compute environment).
Build snapshots once, then reuse them across jobs.
Store your external YAML alongside your pipeline so the configuration is reviewable.
Below are lightweight patterns you can adapt.
4
# Snakemake / Nextflow snippets (printed as plain text)
snakemake_rule = """
rule idtrack_build_human_graph:
output:
'idtrack_cache/graph_homo_sapiens_*.pickle'
shell:
"python - <<'PY'\nimport os\nfrom pathlib import Path\nimport idtrack\n\nos.environ['IDTRACK_LOCAL_REPO'] = os.path.abspath('idtrack_cache')\nPath(os.environ['IDTRACK_LOCAL_REPO']).mkdir(parents=True, exist_ok=True)\n\napi = idtrack.API(local_repository=os.environ['IDTRACK_LOCAL_REPO'])\norg, latest = api.resolve_organism('human')\napi.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)\nprint('Built:', org, latest)\nPY"
""".strip()
nextflow_process = """
process BUILD_IDTRACK_HUMAN {
output:
path "idtrack_cache/graph_homo_sapiens_*.pickle"
script:
"export IDTRACK_LOCAL_REPO=$PWD/idtrack_cache\npython - <<'PY'\nimport os\nfrom pathlib import Path\nimport idtrack\n\nrepo = os.environ['IDTRACK_LOCAL_REPO']\nPath(repo).mkdir(parents=True, exist_ok=True)\n\napi = idtrack.API(local_repository=repo)\norg, latest = api.resolve_organism('human')\napi.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)\nprint('Built:', org, latest)\nPY"
}
""".strip()
print('--- Snakemake example ---')
print(snakemake_rule)
print('--- Nextflow example ---')
print(nextflow_process)
--- Snakemake example ---
rule idtrack_build_human_graph:
output:
'idtrack_cache/graph_homo_sapiens_*.pickle'
shell:
"python - <<'PY'
import os
from pathlib import Path
import idtrack
os.environ['IDTRACK_LOCAL_REPO'] = os.path.abspath('idtrack_cache')
Path(os.environ['IDTRACK_LOCAL_REPO']).mkdir(parents=True, exist_ok=True)
api = idtrack.API(local_repository=os.environ['IDTRACK_LOCAL_REPO'])
org, latest = api.resolve_organism('human')
api.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)
print('Built:', org, latest)
PY"
--- Nextflow example ---
process BUILD_IDTRACK_HUMAN {
output:
path "idtrack_cache/graph_homo_sapiens_*.pickle"
script:
"export IDTRACK_LOCAL_REPO=$PWD/idtrack_cache
python - <<'PY'
import os
from pathlib import Path
import idtrack
repo = os.environ['IDTRACK_LOCAL_REPO']
Path(repo).mkdir(parents=True, exist_ok=True)
api = idtrack.API(local_repository=repo)
org, latest = api.resolve_organism('human')
api.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)
print('Built:', org, latest)
PY"
}
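Building once and reusing across jobs implies a check for an existing snapshot before a job starts a build. The sketch below is one way to do that check from Python. It assumes snapshot files follow the graph_<organism>_*.pickle naming seen in the diagnostic output in 7.3; the helper name existing_snapshots is hypothetical, and the pattern may differ between IDTrack versions.

```python
from __future__ import annotations

from pathlib import Path


def existing_snapshots(repo: Path, organism: str) -> list[Path]:
    """List graph snapshot files for `organism` in `repo`, sorted by name.

    Hypothetical helper. The 'graph_<organism>_*.pickle' pattern is inferred from
    file names seen in this notebook and is an assumption, not a documented contract.
    """
    return sorted(repo.glob(f"graph_{organism}_*.pickle"))
```

A job can call existing_snapshots(Path('idtrack_cache'), 'homo_sapiens') and only run the build step when the returned list is empty.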
7.5 — No-Internet Servers: SSH ConnectionBridge
Some HPC clusters block outbound internet access from compute nodes. IDTrack needs to contact Ensembl services, so in these environments you can route networking through your local machine using:
an SSH reverse SOCKS proxy (ssh -R ...) that creates a SOCKS5 proxy on the remote machine, and
idtrack.ConnectionBridge inside the Python process so Python uses that proxy.
The most important rule
The SOCKS proxy must exist on the same machine where your Python code is running.
If your Python runs on the login node → the proxy must be on the login node.
If your Python runs on a compute node → the proxy must be on that compute node.
If your Jupyter kernel runs on a compute node → the proxy must be on that compute node (not on the login node).
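A quick way to check this rule is to test, from inside the Python process itself, whether anything is listening on the proxy port of the local machine. The sketch below uses only the standard library; the helper name proxy_port_is_listening is hypothetical, and port 1080 is simply the port used in the examples in this section.

```python
from __future__ import annotations

import socket


def proxy_port_is_listening(port: int = 1080, host: str = "127.0.0.1", timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    This only shows that a listener exists on this machine; it does not
    verify that the listener is a working SOCKS5 proxy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False in the process that runs IDTrack, the tunnel most likely ends on a different machine (for example the login node instead of the compute node) or the SSH session has closed.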
Why “two SSH tabs” is normal (inbound vs outbound)
Inbound traffic: your browser (local) needs to reach Jupyter (remote) → solved with ssh -L (Jupyter tunnel).
Outbound traffic: your remote Python process needs to reach the internet (Ensembl) → solved with ssh -R + idtrack.ConnectionBridge.
These are independent problems, so using two separate SSH sessions/tabs is common. If your cluster allows you to SSH directly to the machine running the kernel (often via ssh -J login compute), you can also combine -L and -R in a single SSH command.
Scenario 1 — Python runs on the server/login node (no compute node)
On your local machine (keep this SSH session open):
ssh -N -R 127.0.0.1:1080 user@server
On the server (run IDTrack inside this Python process):
import idtrack
with idtrack.ConnectionBridge(proxy_port=1080):
# ... run IDTrack code ...
pass
Scenario 2 — Python runs on a compute node (Slurm), not Jupyter (plain Python script)
This is the same idea, but the target machine is the compute node.
Start your job and learn the compute node hostname (often from job output or hostname).
On your local machine, create the proxy on that compute node (often via the login node as a jump host):
ssh -N -J user@login -R 127.0.0.1:1080 user@compute123
Run your Python script on the compute node, and enable ConnectionBridge at the start of the program.
Scenario 3 — Jupyter kernel runs on a compute node (browser on your local machine)
You usually need BOTH:
an inbound Jupyter tunnel (ssh -L ...) so your browser can reach Jupyter, and
an outbound SOCKS tunnel (ssh -R ...) so the kernel can reach Ensembl.
Option A (common): keep your existing Jupyter tunnel workflow, and open a second SSH tab for the ssh -R ... tunnel.
Option B (single SSH session): only if you can SSH to the compute node (often via -J). Example (adjust ports/users):
ssh -N -J user@login -L 8888:127.0.0.1:8888 -R 127.0.0.1:1080 user@compute123
What ConnectionBridge changes (briefly)
It patches socket.socket to PySocks’ socks.socksocket so most Python networking stacks transparently use the proxy (requests/urllib3, urllib, PyMySQL, etc.).
It optionally sets ALL_PROXY/all_proxy so child processes inherit the proxy.
It restores the original state on stop() (also best-effort on interpreter exit).
Tip: Prefer the context-manager form (with ConnectionBridge(...)) in scripts so cleanup is guaranteed.
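The patch-and-restore behaviour is easier to reason about in isolation. The sketch below is not IDTrack's implementation: it is a minimal illustration of temporarily swapping socket.socket and restoring it afterwards. The names patched_socket_class and TaggedSocket are hypothetical, and TaggedSocket is only a stand-in for a proxy-aware class such as PySocks' socks.socksocket.

```python
from __future__ import annotations

import socket
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def patched_socket_class(replacement: type) -> Iterator[None]:
    """Temporarily replace socket.socket, restoring it even if the body raises."""
    original = socket.socket
    socket.socket = replacement
    try:
        yield
    finally:
        socket.socket = original


class TaggedSocket(socket.socket):
    """Stand-in for a proxy-aware socket class (e.g. PySocks' socks.socksocket)."""


with patched_socket_class(TaggedSocket):
    inside = socket.socket is TaggedSocket
print(inside, socket.socket is TaggedSocket)  # True False
```

The try/finally is what makes the context-manager form safe: the original class comes back even when the body raises, which a manual start/stop pair only guarantees if stop is called on every code path.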
Copy/paste helper: ConnectionBridge SSH command
There is a ready-to-use helper script in this repo:
idtrack/reproducibility/scripts/connection_bridge_tunnel.sh
It is meant to run on your local machine. It prints only the SSH command (one line). Paste that line into a terminal tab and keep it open.
Examples:
# Show full beginner-oriented help:
bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --help
# Server/login node (Python runs on the login node):
bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh user@server
# Compute node (Python runs on compute node; go via login as jump host):
bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --jump user@login compute123
# Compute node host auto-detected from Jupyter URL (kernel on compute node):
bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --jump user@login "http://compute123:8888/lab?token=..."
# Optional: one combined SSH command with both Jupyter (-L) and SOCKS (-R):
bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --jump user@login --with-jupyter "http://compute123:8888/lab?token=..."
5
from __future__ import annotations
import idtrack
# Enable the bridge inside the current Python process (e.g. a Jupyter kernel running on the server).
# If `test=True` (default), the bridge pings Ensembl REST and rolls back automatically on failure.
b = idtrack.ConnectionBridge(proxy_port=1080)
ok = b.start(test=True)
print('Bridge enabled:', ok)
# ... run IDTrack code here ...
# api = idtrack.API(local_repository='./idtrack_cache')
# org, latest = api.resolve_organism('human')
# api.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)
# Recommended pattern in scripts:
# with idtrack.ConnectionBridge(proxy_port=1080) as _b:
# ... do work ...
b.stop()
print('Bridge disabled')
2026-01-17 15:10:38 INFO:connection_bridge: [idtrack] ConnectionBridge enabled: all TCP sockets in this Python process are routed through 127.0.0.1:1080. Call `b.stop()` to restore normal networking.
2026-01-17 15:10:39 ERROR:connection_bridge: [idtrack] ConnectionBridge check failed: SOCKSHTTPSConnectionPool(host='rest.ensembl.org', port=443): Max retries exceeded with url: /info/ping (Caused by NewConnectionError("SOCKSHTTPSConnection(host='rest.ensembl.org', port=443): Failed to establish a new connection: [Errno 111] Connection refused"))
Troubleshooting:
- Ensure the SSH session is established from your local machine: `ssh -R 1080 user@server`
- On the server, check the SOCKS port is listening: `ss -tlnp | grep 1080`
2026-01-17 15:10:39 WARNING:connection_bridge: [idtrack] ConnectionBridge test failed; disabling bridge.
2026-01-17 15:10:39 INFO:connection_bridge: [idtrack] ConnectionBridge disabled: normal networking restored.
2026-01-17 15:10:39 INFO:connection_bridge: [idtrack] ConnectionBridge already stopped.
Bridge enabled: False
Bridge disabled