Part 7 — Advanced Topics

Last updated: 2026-01-09

This notebook collects advanced, production-oriented topics for power users:

custom external database profiles (strict vs broad)
programmatic YAML management (useful in pipelines)
troubleshooting & diagnostics (common failure modes)
integration patterns (Snakemake/Nextflow-friendly workflows)
restricted-server SSH bridging (ConnectionBridge)

Tip: You don’t need this notebook for day-to-day conversions. It’s here for when you want reproducibility at scale.

7.1 — Custom External Database Inclusion

The external YAML is an explicit contract: it defines which external namespaces are allowed to influence your graph.

Two common profiles:

Strict profile (recommended for most analyses): small allowlist, lower ambiguity, faster queries.
Broad profile (exploration): larger allowlist, more coverage, but higher ambiguity and slower builds.

In IDTrack today, the active YAML file name is fixed (<organism>_externals_modified.yml). A practical pattern is to keep multiple profiles as side-by-side files and copy/rename the one you want for a given run.

The next cell demonstrates how to generate two profile files without overwriting your active YAML.

from __future__ import annotations

import os
from copy import deepcopy
from pathlib import Path

import yaml

try:
    import idtrack

    IDTRACK_OK = True
except Exception as e:
    print('idtrack import failed ->', repr(e))
    IDTRACK_OK = False

LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)

print('Local repository:', LOCAL_REPOSITORY)

if IDTRACK_OK:
    organism = 'homo_sapiens'

    # Prefer your configured YAML if it exists; otherwise fall back to the package default.
    configured = LOCAL_REPOSITORY / f'{organism}_externals_modified.yml'
    default_cfg = Path(idtrack.__file__).resolve().parent / 'default_config' / f'{organism}_externals_modified.yml'

    source_path = configured if configured.exists() else default_cfg
    print('Reading YAML from:', source_path)

    y = yaml.safe_load(source_path.read_text(encoding='utf-8'))
    form = list(y[organism].keys())[0]

    strict_allowlist = {
        'HGNC Symbol',
        'EntrezGene',
        'UniProtKB',
        'RefSeq_mRNA',
    }

    broad_allowlist = strict_allowlist | {
        # Add cautiously; broad profiles can increase ambiguity.
        'RefSeq_peptide',
        'ArrayExpress',
    }

    def make_profile(base: dict, allowlist: set[str]) -> dict:
        out = deepcopy(base)
        for db_name, db_block in out[organism][form].items():
            include = db_name in allowlist
            for _asm, attrs in db_block.get('Assembly', {}).items():
                attrs['Include'] = bool(include)
        return out

    strict_yaml = make_profile(y, strict_allowlist)
    broad_yaml = make_profile(y, broad_allowlist)

    strict_path = LOCAL_REPOSITORY / f'{organism}_externals_modified_strict.yml'
    broad_path = LOCAL_REPOSITORY / f'{organism}_externals_modified_broad.yml'

    strict_path.write_text(yaml.safe_dump(strict_yaml, sort_keys=False, allow_unicode=True), encoding='utf-8')
    broad_path.write_text(yaml.safe_dump(broad_yaml, sort_keys=False, allow_unicode=True), encoding='utf-8')

    print('Wrote strict profile:', strict_path.name)
    print('Wrote broad  profile:', broad_path.name)

    print()
    print('To activate a profile: copy/rename it to:')
    print(' ', configured)
else:
    print('Skipping YAML profile demo (idtrack not imported).')

Local repository: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache
Reading YAML from: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/homo_sapiens_externals_modified.yml
Wrote strict profile: homo_sapiens_externals_modified_strict.yml
Wrote broad  profile: homo_sapiens_externals_modified_broad.yml

To activate a profile: copy/rename it to:
  /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/homo_sapiens_externals_modified.yml
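
The copy/rename step mentioned above can be scripted. The following is a minimal sketch, not part of IDTrack: `activate_profile` is a hypothetical helper name, and the `.bak` backup suffix is an assumption chosen so the backup does not match the `*_externals_modified.yml` pattern. It runs in a temporary directory so no real configuration is touched.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def activate_profile(profile_path: Path, active_path: Path) -> Optional[Path]:
    """Copy a profile over the active YAML, keeping a backup of the previous one.

    Returns the backup path, or None if there was no active file to back up.
    """
    backup_path = None
    if active_path.exists():
        backup_path = active_path.with_name(active_path.name + '.bak')
        shutil.copy2(active_path, backup_path)
    shutil.copy2(profile_path, active_path)
    return backup_path


# Demo in a throwaway directory (file contents are placeholders, not real profiles).
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    active = repo / 'homo_sapiens_externals_modified.yml'
    strict = repo / 'homo_sapiens_externals_modified_strict.yml'
    active.write_text('profile: old\n', encoding='utf-8')
    strict.write_text('profile: strict\n', encoding='utf-8')

    backup = activate_profile(strict, active)
    active_text = active.read_text(encoding='utf-8')
    backup_text = backup.read_text(encoding='utf-8')

print('Active now:', active_text.strip())
print('Backup holds:', backup_text.strip())
```

Copying rather than renaming keeps the profile file in place, so the same profile can be activated again for a later run.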

7.2 — Programmatic YAML Management

For pipelines you often want YAML changes to be scriptable and repeatable.

A safe automation pattern:

Generate (or refresh) a template YAML.
Apply a curated allowlist in code.
Write the resulting YAML to a file you commit alongside your pipeline.

The next cell shows a reusable helper that:

reads a template YAML
applies an allowlist
writes a modified YAML

Tip: Keep allowlists per organism in your pipeline repository (as Python sets or a small TOML/YAML). That makes your configuration reviewable.

from __future__ import annotations

import os
from copy import deepcopy
from pathlib import Path

import yaml


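def apply_allowlist_to_yaml(template_path: Path, organism: str, allowlist: set[str], out_path: Path) -> None:
    """Enable only the allowlisted databases in a template YAML and write the result.

    Allowlist names that match no database in the template are ignored silently,
    so compare the printed 'Enabled' list against the allowlist you passed in.
    """
    y = yaml.safe_load(template_path.read_text(encoding='utf-8'))
    # Use the first key found under the organism block.
    form = list(y[organism].keys())[0]

    out = deepcopy(y)
    enabled = []

    for db_name, db_block in out[organism][form].items():
        include = db_name in allowlist
        if include:
            enabled.append(db_name)
        # Apply the same Include flag to every assembly of this database.
        for _asm, attrs in db_block.get('Assembly', {}).items():
            attrs['Include'] = include

    out_path.write_text(yaml.safe_dump(out, sort_keys=False, allow_unicode=True), encoding='utf-8')
    print('Wrote:', out_path)
    print('Enabled (count):', len(enabled))
    print('Enabled (sample):', enabled[:15])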


# Example: apply a mouse allowlist if a template exists in your local repository
LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
organism = 'mus_musculus'

template = LOCAL_REPOSITORY / f'{organism}_externals_template.yml'
out_yaml = LOCAL_REPOSITORY / f'{organism}_externals_modified.yml'

allowlist_mouse = {'MGI Symbol', 'EntrezGene', 'UniProtKB', 'RefSeq_mRNA'}

if template.exists():
    apply_allowlist_to_yaml(template, organism=organism, allowlist=allowlist_mouse, out_path=out_yaml)
else:
    print('Template not found:', template)
    print('Generate it first with Part 2 (02_prepare_new_external_yaml.ipynb).')

Wrote: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/mus_musculus_externals_modified.yml
Enabled (count): 2
Enabled (sample): ['EntrezGene', 'MGI Symbol']
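
In the output above only two of the four allowlisted names were enabled, which suggests the other two did not match any database name in the template. A small check can surface such names before a pipeline run. The sketch below is an illustration only: `split_allowlist` is a hypothetical helper, and `available_names` is made-up data standing in for the database names read from a real template (for example `set(y[organism][form])`).

```python
def split_allowlist(available: set, allowlist: set) -> tuple:
    """Return (matched, unmatched) allowlist names, each as a sorted list."""
    matched = sorted(allowlist & available)
    unmatched = sorted(allowlist - available)
    return matched, unmatched


# Hypothetical template contents: only the database names matter here.
available_names = {'EntrezGene', 'MGI Symbol', 'Some Other DB'}
allowlist_mouse = {'MGI Symbol', 'EntrezGene', 'UniProtKB', 'RefSeq_mRNA'}

matched, unmatched = split_allowlist(available_names, allowlist_mouse)
print('Matched:', matched)
print('Unmatched (check spelling against the template):', unmatched)
```

Failing the pipeline when `unmatched` is non-empty is one way to keep a typo in an allowlist from silently shrinking the graph.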

7.3 — Troubleshooting & Diagnostics

Common failure modes (and what they usually mean):

Permission errors in ``IDTRACK_LOCAL_REPO``: your cache directory is not writable.
REST timeouts: network issues or Ensembl REST is temporarily slow.
MySQL connection errors: outbound MySQL ports (typically 3306/5306) may be blocked; IDTrack will fall back to HTTPS/FTP dumps (slower but functional). Port 3337 is used only for the human GRCh37 archive.
``ValueError: release not included in YAML``: your external YAML does not include the snapshot boundary you chose.
Unexpected 1→n explosion: your external allowlist is too broad or contains promiscuous namespaces.

The next cell is a compact diagnostic report you can paste into issues or lab notes.

from __future__ import annotations

import os
import socket
from pathlib import Path

report = {}

LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()
LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)

report['local_repository'] = str(LOCAL_REPOSITORY)
report['local_repository_writable'] = os.access(LOCAL_REPOSITORY, os.W_OK)

# YAML + graph snapshot inventory
report['yaml_files'] = sorted(p.name for p in LOCAL_REPOSITORY.glob('*_externals_modified.yml'))
report['graph_snapshots'] = sorted(p.name for p in LOCAL_REPOSITORY.glob('graph_*.pickle'))

# REST connectivity
try:
    import requests

    try:
        r = requests.get('https://rest.ensembl.org/info/ping', headers={'Content-Type': 'application/json'}, timeout=15)
        report['ensembl_rest_status'] = r.status_code
    except Exception as e:
        report['ensembl_rest_status'] = f'failed: {e.__class__.__name__}'
except Exception as e:
    report['ensembl_rest_status'] = f'requests_missing: {e.__class__.__name__}'

# MySQL ports (best-effort)
# Note: port 3337 is for the human GRCh37 archive; other species typically use 3306/5306.
try:
    from idtrack._db import DB

    host = DB.mysql_host
    port_status = {}
    for port in [3306, 5306, 3337]:
        try:
            with socket.create_connection((host, port), timeout=2):
                port_status[port] = 'ok'
        except OSError as e:
            port_status[port] = e.__class__.__name__
    report['ensembl_mysql_ports'] = port_status
except Exception as e:
    report['ensembl_mysql_ports'] = f'skipped: {e.__class__.__name__}'

# Print report
for k, v in report.items():
    print(f'{k}: {v}')

# Optional: quick integrity checks (only if a human snapshot exists)
# NOTE: TrackTests can be expensive, so only one small invariant is checked here.
# Loading the snapshot and computing its cached properties can still take several minutes.
if report['graph_snapshots'] and any('homo_sapiens' in g for g in report['graph_snapshots']):
    try:
        import idtrack

        api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))
        api.configure_logger()
        org, latest = api.resolve_organism('human')

        # Load existing snapshot if present; build if missing (may be slow).
        api.build_graph(organism_name=org, snapshot_release=latest, return_test=True, calculate_caches=False)

        ok = api.track.is_edge_with_same_nts_only_at_backbone_nodes()
        print()
        print('Quick TrackTests check (edge nts invariant):', ok)
    except Exception as e:
        print()
        print('TrackTests quick check skipped/failed ->', repr(e))

2026-01-17 14:59:09 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.

local_repository: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache
local_repository_writable: True
yaml_files: ['homo_sapiens_externals_modified.yml', 'mus_musculus_externals_modified.yml', 'sus_scrofa_externals_modified.yml']
graph_snapshots: ['graph_homo_sapiens_min48_max115_narrow.pickle', 'graph_mus_musculus_min48_max115_narrow.pickle', 'graph_sus_scrofa_min48_max115_narrow.pickle']
ensembl_rest_status: 200
ensembl_mysql_ports: {3306: 'TimeoutError', 5306: 'TimeoutError', 3337: 'TimeoutError'}

2026-01-17 14:59:34 INFO:database_manager: Using assembly-specific release range for homo_sapiens assembly 38: releases 76-115 (from config [76, None])
2026-01-17 15:00:28 INFO:graph_maker: The graph is being read: /ictstr01/home/icb/kemal.inecik/work/codes/idtrack/docs/_notebooks/idtrack_cache/graph_homo_sapiens_min48_max115_narrow.pickle
2026-01-17 15:05:09 INFO:the_graph: Cached properties being calculated: available_genome_assemblies
2026-01-17 15:05:09 INFO:the_graph: Cached properties being calculated: combined_edges
2026-01-17 15:08:47 INFO:the_graph: Cached properties being calculated: combined_edges_genes
2026-01-17 15:10:11 INFO:the_graph: Cached properties being calculated: combined_edges_assembly_specific_genes


Quick TrackTests check (edge nts invariant): True
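
For the "unexpected 1→n explosion" failure mode listed above, a fan-out summary of conversion results can help spot promiscuous namespaces. This is a minimal sketch under assumptions: the `source ID -> list of target IDs` dictionary is a hypothetical shape, not IDTrack's return format, and `fanout_report` and its `threshold` parameter are illustrative names.

```python
from collections import Counter


def fanout_report(mapping: dict, threshold: int = 3) -> dict:
    """Summarise how many distinct targets each source ID maps to."""
    sizes = {src: len(set(targets)) for src, targets in mapping.items()}
    return {
        # Number of source IDs per fan-out size, e.g. {1: 120, 2: 7, ...}.
        'histogram': dict(Counter(sizes.values())),
        # Source IDs whose fan-out reaches the threshold.
        'suspicious': sorted(src for src, n in sizes.items() if n >= threshold),
    }


# Hypothetical conversion result.
example = {
    'GENE_A': ['T1'],
    'GENE_B': ['T2', 'T3'],
    'GENE_C': ['T4', 'T5', 'T6', 'T7'],
    'GENE_D': [],
}

fanout = fanout_report(example)
print('Fan-out histogram:', fanout['histogram'])
print('Suspicious sources:', fanout['suspicious'])
```

Comparing the histogram between a strict and a broad profile shows whether the extra namespaces add coverage or mostly add ambiguity.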

7.4 — Integration Patterns

IDTrack works best in pipelines when you make the snapshot boundary and cache location explicit.

Key ideas:

Set IDTRACK_LOCAL_REPO to a stable, shared path (per project or per compute environment).
Build snapshots once, then reuse them across jobs.
Store your external YAML alongside your pipeline so the configuration is reviewable.

Below are lightweight patterns you can adapt.

# Snakemake / Nextflow snippets (printed as plain text)
# Raw strings keep the snippets verbatim; multi-line commands need triple-quoted blocks or `run:`.

snakemake_rule = r"""
rule idtrack_build_human_graph:
    output:
        # Snakemake outputs cannot be shell globs, so a flag file marks a finished build.
        touch('idtrack_cache/human_graph.done')
    run:
        import os
        import idtrack

        repo = os.path.abspath('idtrack_cache')
        os.makedirs(repo, exist_ok=True)
        os.environ['IDTRACK_LOCAL_REPO'] = repo

        api = idtrack.API(local_repository=repo)
        org, latest = api.resolve_organism('human')
        api.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)
        print('Built:', org, latest)
""".strip()

nextflow_process = r"""
process BUILD_IDTRACK_HUMAN {
  output:
    path "idtrack_cache/graph_homo_sapiens_*.pickle"
  script:
    '''
export IDTRACK_LOCAL_REPO=$PWD/idtrack_cache
python - <<'PY'
import os
from pathlib import Path
import idtrack

repo = os.environ['IDTRACK_LOCAL_REPO']
Path(repo).mkdir(parents=True, exist_ok=True)

api = idtrack.API(local_repository=repo)
org, latest = api.resolve_organism('human')
api.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)
print('Built:', org, latest)
PY
    '''
}
""".strip()

print('--- Snakemake example ---')
print(snakemake_rule)
print('--- Nextflow example ---')
print(nextflow_process)

--- Snakemake example ---
rule idtrack_build_human_graph:
    output:
        # Snakemake outputs cannot be shell globs, so a flag file marks a finished build.
        touch('idtrack_cache/human_graph.done')
    run:
        import os
        import idtrack

        repo = os.path.abspath('idtrack_cache')
        os.makedirs(repo, exist_ok=True)
        os.environ['IDTRACK_LOCAL_REPO'] = repo

        api = idtrack.API(local_repository=repo)
        org, latest = api.resolve_organism('human')
        api.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)
        print('Built:', org, latest)
--- Nextflow example ---
process BUILD_IDTRACK_HUMAN {
  output:
    path "idtrack_cache/graph_homo_sapiens_*.pickle"
  script:
    '''
export IDTRACK_LOCAL_REPO=$PWD/idtrack_cache
python - <<'PY'
import os
from pathlib import Path
import idtrack

repo = os.environ['IDTRACK_LOCAL_REPO']
Path(repo).mkdir(parents=True, exist_ok=True)

api = idtrack.API(local_repository=repo)
org, latest = api.resolve_organism('human')
api.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)
print('Built:', org, latest)
PY
    '''
}
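
The "build once, then reuse" idea can be sketched as a check for an existing snapshot file before a job triggers a build. The file name pattern `graph_<organism>_*.pickle` is taken from the diagnostic output earlier in this notebook and may differ in other IDTrack versions; `find_snapshots` is a hypothetical helper, and the demo uses empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_snapshots(repo: Path, organism: str) -> list:
    """List cached graph snapshot files for one organism, sorted by name."""
    return sorted(repo.glob(f'graph_{organism}_*.pickle'))


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    # Empty placeholder files standing in for real snapshots.
    (repo / 'graph_homo_sapiens_min48_max115_narrow.pickle').touch()
    (repo / 'graph_mus_musculus_min48_max115_narrow.pickle').touch()

    human = [p.name for p in find_snapshots(repo, 'homo_sapiens')]
    pig = [p.name for p in find_snapshots(repo, 'sus_scrofa')]

print('Human snapshots:', human)
print('Build needed for pig:', not pig)
```

A pipeline step can skip the build when the list is non-empty, which keeps repeated jobs from rebuilding the same graph.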

7.5 — No-Internet Servers: SSH ConnectionBridge

Some HPC clusters block outbound internet access from compute nodes. IDTrack needs to contact Ensembl services, so in these environments you can route networking through your local machine using:

an SSH reverse SOCKS proxy (ssh -R ... with only a port and no destination, which requires OpenSSH 7.6 or newer on your local machine) that creates a SOCKS5 proxy on the remote machine, and
idtrack.ConnectionBridge inside the Python process so Python uses that proxy.

The most important rule

The SOCKS proxy must exist on the same machine where your Python code is running.

If your Python runs on the login node → the proxy must be on the login node.
If your Python runs on a compute node → the proxy must be on that compute node.
If your Jupyter kernel runs on a compute node → the proxy must be on that compute node (not on the login node).
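
One way to confirm that the proxy sits on the same machine as the Python process is to probe the loopback port from inside that process. The sketch below only checks that something accepts TCP connections on the port; it does not verify that the listener speaks SOCKS. `proxy_is_listening` is a hypothetical helper, and the demo opens a throwaway listener to stand in for the SSH tunnel.

```python
import socket


def proxy_is_listening(port: int, host: str = '127.0.0.1', timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-contained demo: a temporary listener plays the role of the tunnel.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

open_result = proxy_is_listening(demo_port)
listener.close()
closed_result = proxy_is_listening(demo_port)

print('Running on host:', socket.gethostname())
print('Listener open   ->', open_result)
print('Listener closed ->', closed_result)
```

In practice you would call `proxy_is_listening(1080)` in the kernel or script itself; a `False` result on a compute node usually means the tunnel was created on the login node instead.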

Why “two SSH tabs” is normal (inbound vs outbound)

Inbound traffic: your browser (local) needs to reach Jupyter (remote) → solved with ssh -L (Jupyter tunnel).
Outbound traffic: your remote Python process needs to reach the internet (Ensembl) → solved with ssh -R + idtrack.ConnectionBridge.

These are independent problems, so using two separate SSH sessions/tabs is common. If your cluster allows you to SSH directly to the machine running the kernel (often via ssh -J login compute), you can also combine -L and -R in a single SSH command.

Scenario 1 — Python runs on the server/login node (no compute node)

On your local machine (keep this SSH session open):

ssh -N -R 127.0.0.1:1080 user@server

On the server (run IDTrack inside this Python process):

import idtrack

with idtrack.ConnectionBridge(proxy_port=1080):
    # ... run IDTrack code ...
    pass

Scenario 2 — Python runs on a compute node (Slurm), not Jupyter (plain Python script)

This is the same idea, but the target machine is the compute node.

Start your job and find the compute node's hostname (often from the job output or by running hostname on the node).
On your local machine, create the proxy on that compute node (often via the login node as a jump host):

ssh -N -J user@login -R 127.0.0.1:1080 user@compute123

Run your Python script on the compute node, and enable ConnectionBridge at the start of the program.

Scenario 3 — Jupyter kernel runs on a compute node (browser on your local machine)

You usually need BOTH:

an inbound Jupyter tunnel (ssh -L ...) so your browser can reach Jupyter, and
an outbound SOCKS tunnel (ssh -R ...) so the kernel can reach Ensembl.

Option A (common): keep your existing Jupyter tunnel workflow, and open a second SSH tab for the ssh -R ... tunnel.

Option B (single SSH session): only if you can SSH to the compute node (often via -J). Example (adjust ports/users):

ssh -N -J user@login -L 8888:127.0.0.1:8888 -R 127.0.0.1:1080 user@compute123
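
The three SSH commands shown in the scenarios differ only in the jump host and the optional Jupyter forward. A short sketch makes that structure explicit. `build_tunnel_command` is a hypothetical function written for this illustration; it is not the repository's helper script and only reproduces the flags that appear in the commands above.

```python
import shlex
from typing import List, Optional


def build_tunnel_command(
    target: str,
    jump: Optional[str] = None,
    proxy_port: int = 1080,
    jupyter_port: Optional[int] = None,
) -> List[str]:
    """Assemble the ssh argument list for the reverse SOCKS tunnel."""
    cmd = ['ssh', '-N']
    if jump:
        cmd += ['-J', jump]  # reach the target through a jump host
    if jupyter_port is not None:
        # Inbound: local browser -> Jupyter on the remote machine.
        cmd += ['-L', f'{jupyter_port}:127.0.0.1:{jupyter_port}']
    # Outbound: SOCKS proxy on the remote machine's loopback interface.
    cmd += ['-R', f'127.0.0.1:{proxy_port}', target]
    return cmd


scenario_1 = shlex.join(build_tunnel_command('user@server'))
scenario_2 = shlex.join(build_tunnel_command('user@compute123', jump='user@login'))
scenario_3 = shlex.join(
    build_tunnel_command('user@compute123', jump='user@login', jupyter_port=8888)
)

for line in (scenario_1, scenario_2, scenario_3):
    print(line)
```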

What `ConnectionBridge` changes (briefly)

It patches socket.socket to PySocks’ socks.socksocket so most Python networking stacks transparently use the proxy (requests/urllib3, urllib, PyMySQL, etc.).
It optionally sets ALL_PROXY/all_proxy so child processes inherit the proxy.
It restores the original state on stop() (also best-effort on interpreter exit).

Tip: Prefer the context-manager form (with ConnectionBridge(...)) in scripts so cleanup is guaranteed.
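
The patch-and-restore behaviour described above follows a general pattern that can be shown without IDTrack or PySocks. The sketch below is not ConnectionBridge's implementation: `TaggedSocket` is a stand-in class that does no proxying, `patched_networking` is a hypothetical name, and the proxy URL is an assumed example value. It only demonstrates why a context manager guarantees that the original state comes back.

```python
import os
import socket
from contextlib import contextmanager


class TaggedSocket(socket.socket):
    """Stand-in for a proxy-aware socket class (hypothetical, does no proxying)."""


@contextmanager
def patched_networking(proxy_url: str):
    """Swap socket.socket and set ALL_PROXY, restoring both on exit."""
    original_socket = socket.socket
    original_env = os.environ.get('ALL_PROXY')
    socket.socket = TaggedSocket
    os.environ['ALL_PROXY'] = proxy_url
    try:
        yield
    finally:
        # Runs even if the body raises, so the patch never leaks.
        socket.socket = original_socket
        if original_env is None:
            os.environ.pop('ALL_PROXY', None)
        else:
            os.environ['ALL_PROXY'] = original_env


before = socket.socket
env_before = os.environ.get('ALL_PROXY')

with patched_networking('socks5://127.0.0.1:1080'):
    inside_is_patched = socket.socket is TaggedSocket
    inside_env = os.environ['ALL_PROXY']

after_is_restored = socket.socket is before
env_restored = os.environ.get('ALL_PROXY') == env_before

print('Patched inside block:', inside_is_patched)
print('Restored afterwards:', after_is_restored and env_restored)
```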

Copy/paste helper: ConnectionBridge SSH command

There is a ready-to-use helper script in this repo:

idtrack/reproducibility/scripts/connection_bridge_tunnel.sh

It is meant to run on your local machine. It prints only the SSH command (one line). Paste that line into a terminal tab and keep it open.

Examples:

# Show full beginner-oriented help:
bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --help

# Server/login node (Python runs on the login node):
bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh user@server

# Compute node (Python runs on compute node; go via login as jump host):
bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --jump user@login compute123

# Compute node host auto-detected from Jupyter URL (kernel on compute node):
bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --jump user@login "http://compute123:8888/lab?token=..."

# Optional: one combined SSH command with both Jupyter (-L) and SOCKS (-R):
bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --jump user@login --with-jupyter "http://compute123:8888/lab?token=..."
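
Reading the compute node's host name from a Jupyter URL, as in the last two examples, can be done with the standard library. This is a sketch of the idea only; how the helper script itself parses the URL is not shown here, and `host_from_jupyter_url` is a hypothetical name.

```python
from urllib.parse import urlsplit


def host_from_jupyter_url(url: str) -> tuple:
    """Return (hostname, port) from a Jupyter URL; port is None if absent."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f'No hostname found in URL: {url!r}')
    return parts.hostname, parts.port


host, port = host_from_jupyter_url('http://compute123:8888/lab?token=...')
print('Compute node:', host)
print('Jupyter port:', port)
```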

from __future__ import annotations

import idtrack

# Enable the bridge inside the current Python process (e.g. a Jupyter kernel running on the server).
# If `test=True` (default), the bridge pings Ensembl REST and rolls back automatically on failure.
b = idtrack.ConnectionBridge(proxy_port=1080)
ok = b.start(test=True)
print('Bridge enabled:', ok)

# ... run IDTrack code here ...
# api = idtrack.API(local_repository='./idtrack_cache')
# org, latest = api.resolve_organism('human')
# api.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)

# Recommended pattern in scripts:
# with idtrack.ConnectionBridge(proxy_port=1080) as _b:
#     ... do work ...

b.stop()
print('Bridge disabled')

2026-01-17 15:10:38 INFO:connection_bridge: [idtrack] ConnectionBridge enabled: all TCP sockets in this Python process are routed through 127.0.0.1:1080. Call `b.stop()` to restore normal networking.
2026-01-17 15:10:39 ERROR:connection_bridge: [idtrack] ConnectionBridge check failed: SOCKSHTTPSConnectionPool(host='rest.ensembl.org', port=443): Max retries exceeded with url: /info/ping (Caused by NewConnectionError("SOCKSHTTPSConnection(host='rest.ensembl.org', port=443): Failed to establish a new connection: [Errno 111] Connection refused"))
  Troubleshooting:
  - Ensure the SSH session is established from your local machine: `ssh -R 1080 user@server`
  - On the server, check the SOCKS port is listening: `ss -tlnp | grep 1080`
2026-01-17 15:10:39 WARNING:connection_bridge: [idtrack] ConnectionBridge test failed; disabling bridge.
2026-01-17 15:10:39 INFO:connection_bridge: [idtrack] ConnectionBridge disabled: normal networking restored.
2026-01-17 15:10:39 INFO:connection_bridge: [idtrack] ConnectionBridge already stopped.

[idtrack] ConnectionBridge enabled: all TCP sockets in this Python process are routed through 127.0.0.1:1080. Call `b.stop()` to restore normal networking.
[idtrack] ConnectionBridge check failed: SOCKSHTTPSConnectionPool(host='rest.ensembl.org', port=443): Max retries exceeded with url: /info/ping (Caused by NewConnectionError("SOCKSHTTPSConnection(host='rest.ensembl.org', port=443): Failed to establish a new connection: [Errno 111] Connection refused"))
  Troubleshooting:
  - Ensure the SSH session is established from your local machine: `ssh -R 1080 user@server`
  - On the server, check the SOCKS port is listening: `ss -tlnp | grep 1080`
[idtrack] ConnectionBridge test failed; disabling bridge.
[idtrack] ConnectionBridge disabled: normal networking restored.
Bridge enabled: False
[idtrack] ConnectionBridge already stopped.
Bridge disabled

In the run above, no SSH tunnel was listening on port 1080. The connectivity test therefore failed with "Connection refused", start() returned False, and the bridge rolled back on its own, so the later b.stop() call had nothing left to undo.