{ "cells": [ { "cell_type": "markdown", "id": "360bda19", "metadata": {}, "source": [ "# Part 7 — Advanced Topics\n", "\n", "*Last updated:* 2026-01-09\n", "\n", "This notebook collects advanced, production-oriented topics for power users:\n", "\n", "- custom external database profiles (strict vs broad)\n", "- programmatic YAML management (useful in pipelines)\n", "- troubleshooting & diagnostics (common failure modes)\n", "- integration patterns (Snakemake/Nextflow-friendly workflows)\n", "- restricted-server SSH bridging (ConnectionBridge)\n", "\n", "> **Tip:** You don’t need this notebook for day-to-day conversions. It’s here for when you want reproducibility at scale.\n" ] }, { "cell_type": "markdown", "id": "15441637", "metadata": {}, "source": [ "## 7.1 — Custom External Database Inclusion\n", "\n", "The external YAML is an explicit **contract**: it defines which external namespaces are allowed to influence your graph.\n", "\n", "Two common profiles:\n", "\n", "- **Strict profile (recommended for most analyses):** small allowlist, lower ambiguity, faster queries.\n", "- **Broad profile (exploration):** larger allowlist, more coverage, but higher ambiguity and slower builds.\n", "\n", "In IDTrack today, the active YAML file name is fixed (`<organism>_externals_modified.yml`).\n", "A practical pattern is to keep multiple profiles as *side-by-side files* and copy/rename the one you want for a given run.\n", "\n", "The next cell demonstrates how to generate two profile files **without overwriting** your active YAML.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1d90a415", "metadata": {}, "outputs": [], "source": [ "from __future__ import annotations\n", "\n", "import os\n", "from copy import deepcopy\n", "from pathlib import Path\n", "\n", "import yaml\n", "\n", "try:\n", "    import idtrack\n", "\n", "    IDTRACK_OK = True\n", "except Exception as e:\n", "    print('idtrack import failed ->', repr(e))\n", "    IDTRACK_OK = False\n", "\n", "LOCAL_REPOSITORY = 
Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n", "LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)\n", "\n", "print('Local repository:', LOCAL_REPOSITORY)\n", "\n", "if IDTRACK_OK:\n", "    organism = 'homo_sapiens'\n", "\n", "    # Prefer your configured YAML if it exists; otherwise fall back to the package default.\n", "    configured = LOCAL_REPOSITORY / f'{organism}_externals_modified.yml'\n", "    default_cfg = Path(idtrack.__file__).resolve().parent / 'default_config' / f'{organism}_externals_modified.yml'\n", "\n", "    source_path = configured if configured.exists() else default_cfg\n", "    print('Reading YAML from:', source_path)\n", "\n", "    y = yaml.safe_load(source_path.read_text(encoding='utf-8'))\n", "    form = list(y[organism].keys())[0]\n", "\n", "    strict_allowlist = {\n", "        'HGNC Symbol',\n", "        'EntrezGene',\n", "        'UniProtKB',\n", "        'RefSeq_mRNA',\n", "    }\n", "\n", "    broad_allowlist = strict_allowlist | {\n", "        # Add cautiously; broad profiles can increase ambiguity.\n", "        'RefSeq_peptide',\n", "        'ArrayExpress',\n", "    }\n", "\n", "    def make_profile(base: dict, allowlist: set[str]) -> dict:\n", "        out = deepcopy(base)\n", "        for db_name, db_block in out[organism][form].items():\n", "            include = db_name in allowlist\n", "            for _asm, attrs in db_block.get('Assembly', {}).items():\n", "                attrs['Include'] = bool(include)\n", "        return out\n", "\n", "    strict_yaml = make_profile(y, strict_allowlist)\n", "    broad_yaml = make_profile(y, broad_allowlist)\n", "\n", "    strict_path = LOCAL_REPOSITORY / f'{organism}_externals_modified_strict.yml'\n", "    broad_path = LOCAL_REPOSITORY / f'{organism}_externals_modified_broad.yml'\n", "\n", "    strict_path.write_text(yaml.safe_dump(strict_yaml, sort_keys=False, allow_unicode=True), encoding='utf-8')\n", "    broad_path.write_text(yaml.safe_dump(broad_yaml, sort_keys=False, allow_unicode=True), encoding='utf-8')\n", "\n", "    print('Wrote strict profile:', strict_path.name)\n",
"    print('Wrote broad profile:', broad_path.name)\n", "\n", "    print()\n", "    print('To activate a profile: copy/rename it to:')\n", "    print(' ', configured)\n", "else:\n", "    print('Skipping YAML profile demo (idtrack not imported).')\n" ] }, { "cell_type": "markdown", "id": "4cc22683", "metadata": {}, "source": [ "## 7.2 — Programmatic YAML Management\n", "\n", "For pipelines you often want YAML changes to be **scriptable** and **repeatable**.\n", "\n", "A safe automation pattern:\n", "\n", "1. Generate (or refresh) a template YAML.\n", "2. Apply a curated allowlist in code.\n", "3. Write the resulting YAML to a file you commit alongside your pipeline.\n", "\n", "The next cell shows a reusable helper that:\n", "- reads a template YAML\n", "- applies an allowlist\n", "- writes a modified YAML\n", "\n", "> **Tip:** Keep allowlists per organism in your pipeline repository (as Python sets or a small TOML/YAML). That makes your configuration reviewable.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1980ce0f", "metadata": {}, "outputs": [], "source": [ "from __future__ import annotations\n", "\n", "import os\n", "from copy import deepcopy\n", "from pathlib import Path\n", "\n", "import yaml\n", "\n", "\n", "def apply_allowlist_to_yaml(template_path: Path, organism: str, allowlist: set[str], out_path: Path) -> None:\n", "    y = yaml.safe_load(template_path.read_text(encoding='utf-8'))\n", "    form = list(y[organism].keys())[0]\n", "\n", "    out = deepcopy(y)\n", "    enabled = []\n", "\n", "    for db_name, db_block in out[organism][form].items():\n", "        include = db_name in allowlist\n", "        if include:\n", "            enabled.append(db_name)\n", "        for _asm, attrs in db_block.get('Assembly', {}).items():\n", "            attrs['Include'] = bool(include)\n", "\n", "    out_path.write_text(yaml.safe_dump(out, sort_keys=False, allow_unicode=True), encoding='utf-8')\n", "    print('Wrote:', out_path)\n", "    print('Enabled (count):', len(enabled))\n", "    print('Enabled (sample):', enabled[:15])\n", "\n", "\n", "# Example: apply a mouse 
allowlist if a template exists in your local repository\n", "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n", "organism = 'mus_musculus'\n", "\n", "template = LOCAL_REPOSITORY / f'{organism}_externals_template.yml'\n", "out_yaml = LOCAL_REPOSITORY / f'{organism}_externals_modified.yml'\n", "\n", "allowlist_mouse = {'MGI Symbol', 'EntrezGene', 'UniProtKB', 'RefSeq_mRNA'}\n", "\n", "if template.exists():\n", "    apply_allowlist_to_yaml(template, organism=organism, allowlist=allowlist_mouse, out_path=out_yaml)\n", "else:\n", "    print('Template not found:', template)\n", "    print('Generate it first with Part 2 (02_prepare_new_external_yaml.ipynb).')\n" ] }, { "cell_type": "markdown", "id": "281588a8", "metadata": {}, "source": [ "## 7.3 — Troubleshooting & Diagnostics\n", "\n", "Common failure modes (and what they usually mean):\n", "\n", "- **Permission errors in `IDTRACK_LOCAL_REPO`:** your cache directory is not writable.\n", "- **REST timeouts:** network issues, or the Ensembl REST service is temporarily slow.\n", "- **MySQL connection errors:** outbound MySQL ports may be blocked; IDTrack will fall back to HTTPS/FTP dumps (slower but functional). 
Port `3337` is used only for the human GRCh37 archive.\n", "- **`ValueError: release not included in YAML`:** your external YAML does not include the snapshot boundary you chose.\n", "- **Unexpected 1→n explosion:** your external allowlist is too broad or contains promiscuous namespaces.\n", "\n", "The next cell prints a compact diagnostic report you can paste into issues or lab notes.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e62acf76", "metadata": {}, "outputs": [], "source": [ "from __future__ import annotations\n", "\n", "import os\n", "import socket\n", "from pathlib import Path\n", "\n", "report = {}\n", "\n", "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n", "LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)\n", "\n", "report['local_repository'] = str(LOCAL_REPOSITORY)\n", "report['local_repository_writable'] = os.access(LOCAL_REPOSITORY, os.W_OK)\n", "\n", "# YAML + graph snapshot inventory\n", "report['yaml_files'] = sorted(p.name for p in LOCAL_REPOSITORY.glob('*_externals_modified.yml'))\n", "report['graph_snapshots'] = sorted(p.name for p in LOCAL_REPOSITORY.glob('graph_*.pickle'))\n", "\n", "# REST connectivity\n", "try:\n", "    import requests\n", "\n", "    try:\n", "        r = requests.get('https://rest.ensembl.org/info/ping', headers={'Content-Type': 'application/json'}, timeout=15)\n", "        report['ensembl_rest_status'] = r.status_code\n", "    except Exception as e:\n", "        report['ensembl_rest_status'] = f'failed: {e.__class__.__name__}'\n", "except Exception as e:\n", "    report['ensembl_rest_status'] = f'requests_missing: {e.__class__.__name__}'\n", "\n", "# MySQL ports (best-effort)\n", "# Note: port 3337 is for the human GRCh37 archive; other species typically use 3306/5306.\n", "try:\n", "    from idtrack._db import DB\n", "\n", "    host = DB.mysql_host\n", "    port_status = {}\n", "    for port in [3306, 5306, 3337]:\n", "        try:\n", "            with socket.create_connection((host, port), timeout=2):\n",
"                port_status[port] = 'ok'\n", "        except OSError as e:\n", "            port_status[port] = e.__class__.__name__\n", "    report['ensembl_mysql_ports'] = port_status\n", "except Exception as e:\n", "    report['ensembl_mysql_ports'] = f'skipped: {e.__class__.__name__}'\n", "\n", "# Print report\n", "for k, v in report.items():\n", "    print(f'{k}: {v}')\n", "\n", "# Optional: quick integrity checks (only if a human snapshot exists)\n", "# NOTE: TrackTests can be expensive. We only run a very small, cheap check here.\n", "if report['graph_snapshots'] and any('homo_sapiens' in g for g in report['graph_snapshots']):\n", "    try:\n", "        import idtrack\n", "\n", "        api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n", "        api.configure_logger()\n", "        org, latest = api.resolve_organism('human')\n", "\n", "        # Load existing snapshot if present; build if missing (may be slow).\n", "        api.build_graph(organism_name=org, snapshot_release=latest, return_test=True, calculate_caches=False)\n", "\n", "        ok = api.track.is_edge_with_same_nts_only_at_backbone_nodes()\n", "        print()\n", "        print('Quick TrackTests check (edge nts invariant):', ok)\n", "    except Exception as e:\n", "        print()\n", "        print('TrackTests quick check skipped/failed ->', repr(e))\n" ] }, { "cell_type": "markdown", "id": "52ee6099", "metadata": {}, "source": [ "## 7.4 — Integration Patterns\n", "\n", "IDTrack works best in pipelines when you make the snapshot boundary and cache location explicit.\n", "\n", "Key ideas:\n", "\n", "- Set `IDTRACK_LOCAL_REPO` to a stable, shared path (per project or per compute environment).\n", "- Build snapshots once, then reuse them across jobs.\n", "- Store your external YAML alongside your pipeline so the configuration is reviewable.\n", "\n", "Below are lightweight patterns you can adapt.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "38da584f", "metadata": {}, "outputs": [], "source": [ "# Snakemake / Nextflow snippets (printed as plain text)\n", "# Raw strings keep the `\\n` escapes literal, so each printed command stays a valid one-line quoted string.\n", "\n", "snakemake_rule = r\"\"\"\n",
"rule idtrack_build_human_graph:\n", "    output:\n", "        'idtrack_cache/graph_homo_sapiens_*.pickle'\n", "    shell:\n", "        \"python - <<'PY'\\nimport os\\nfrom pathlib import Path\\nimport idtrack\\n\\nos.environ['IDTRACK_LOCAL_REPO'] = os.path.abspath('idtrack_cache')\\nPath(os.environ['IDTRACK_LOCAL_REPO']).mkdir(parents=True, exist_ok=True)\\n\\napi = idtrack.API(local_repository=os.environ['IDTRACK_LOCAL_REPO'])\\norg, latest = api.resolve_organism('human')\\napi.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)\\nprint('Built:', org, latest)\\nPY\"\n", "\"\"\".strip()\n", "\n", "# In Nextflow, `$` must be escaped so that PWD is expanded by the shell rather than by Nextflow.\n", "nextflow_process = r\"\"\"\n", "process BUILD_IDTRACK_HUMAN {\n", "    output:\n", "    path \"idtrack_cache/graph_homo_sapiens_*.pickle\"\n", "    script:\n", "    \"export IDTRACK_LOCAL_REPO=\\$PWD/idtrack_cache\\npython - <<'PY'\\nimport os\\nfrom pathlib import Path\\nimport idtrack\\n\\nrepo = os.environ['IDTRACK_LOCAL_REPO']\\nPath(repo).mkdir(parents=True, exist_ok=True)\\n\\napi = idtrack.API(local_repository=repo)\\norg, latest = api.resolve_organism('human')\\napi.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)\\nprint('Built:', org, latest)\\nPY\"\n", "}\n", "\"\"\".strip()\n", "\n", "print('--- Snakemake example ---')\n", "print(snakemake_rule)\n", "print('--- Nextflow example ---')\n", "print(nextflow_process)\n" ] }, { "cell_type": "markdown", "id": "a21f4f3b", "metadata": {}, "source": [ "## 7.5 — No-Internet Servers: SSH ConnectionBridge\n", "\n", "Some HPC clusters block outbound internet access from compute nodes. 
IDTrack needs to contact Ensembl services, so in\n", "these environments you can route networking through your **local machine** using:\n", "\n", "1) an SSH reverse SOCKS proxy (`ssh -R ...`, reverse dynamic forwarding, which needs an OpenSSH 7.6+ client) that creates a SOCKS5 proxy *on the remote machine*, and\n", "2) `idtrack.ConnectionBridge` inside the Python process so Python uses that proxy.\n", "\n", "### The most important rule\n", "\n", "**The SOCKS proxy must exist on the same machine where your Python code is running.**\n", "\n", "- If Python runs on the login node → the proxy must be on the login node.\n", "- If Python runs on a compute node → the proxy must be on that compute node.\n", "- If your Jupyter kernel runs on a compute node → the proxy must be on that compute node (not on the login node).\n", "\n", "### Why “two SSH tabs” is normal (inbound vs outbound)\n", "\n", "- **Inbound** traffic: your browser (local) needs to reach Jupyter (remote) → solved with `ssh -L` (Jupyter tunnel).\n", "- **Outbound** traffic: your remote Python process needs to reach the internet (Ensembl) → solved with `ssh -R` +\n", "  `idtrack.ConnectionBridge`.\n", "\n", "These are independent problems, so using two separate SSH sessions/tabs is common. If your cluster allows you to SSH\n", "directly to the machine running the kernel (often via `ssh -J login compute`), you can also combine `-L` and `-R`\n", "in a single SSH command.\n", "\n", "### Scenario 1 — Python runs on the server/login node (no compute node)\n", "\n", "On your **local machine** (keep this SSH session open):\n", "\n", "```bash\n", "ssh -N -R 127.0.0.1:1080 user@server\n", "```\n", "\n", "On the **server** (run IDTrack inside this Python process):\n", "\n", "```python\n", "import idtrack\n", "\n", "with idtrack.ConnectionBridge(proxy_port=1080):\n",
"    # ... run IDTrack code ...\n", "    pass\n", "```\n", "\n", "### Scenario 2 — A plain Python script (no Jupyter) runs on a compute node (Slurm)\n", "\n", "This is the same idea, but the target machine is the **compute node**.\n", "\n", "1) Start your job and learn the compute node hostname (often from job output or `hostname`).\n", "2) On your **local machine**, create the proxy on that compute node (often via the login node as a jump host):\n", "\n", "```bash\n", "ssh -N -J user@login -R 127.0.0.1:1080 user@compute123\n", "```\n", "\n", "3) Run your Python script on the **compute node**, and enable `ConnectionBridge` at the start of the program.\n", "\n", "### Scenario 3 — Jupyter kernel runs on a compute node (browser on your local machine)\n", "\n", "You usually need BOTH:\n", "\n", "- an **inbound** Jupyter tunnel (`ssh -L ...`) so your browser can reach Jupyter, and\n", "- an **outbound** SOCKS tunnel (`ssh -R ...`) so the kernel can reach Ensembl.\n", "\n", "Option A (common): keep your existing Jupyter tunnel workflow, and open a second SSH tab for the `ssh -R ...` tunnel.\n", "\n", "Option B (single SSH session): only if you can SSH to the compute node (often via `-J`). 
Example (adjust ports/users):\n", "\n", "```bash\n", "ssh -N -J user@login -L 8888:127.0.0.1:8888 -R 127.0.0.1:1080 user@compute123\n", "```\n", "\n", "### What `ConnectionBridge` changes (briefly)\n", "\n", "- It patches `socket.socket` to PySocks’ `socks.socksocket` so most Python networking stacks transparently use the proxy\n", " (requests/urllib3, urllib, PyMySQL, etc.).\n", "- It optionally sets `ALL_PROXY`/`all_proxy` so child processes inherit the proxy.\n", "- It restores the original state on `stop()` (also best-effort on interpreter exit).\n", "\n", "> **Tip:** Prefer the context-manager form (`with ConnectionBridge(...)`) in scripts so cleanup is guaranteed.\n" ] }, { "cell_type": "markdown", "id": "56b1a7c6", "metadata": {}, "source": [ "### Copy/paste helper: ConnectionBridge SSH command\n", "\n", "There is a ready-to-use helper script in this repo:\n", "\n", "- `idtrack/reproducibility/scripts/connection_bridge_tunnel.sh`\n", "\n", "It is meant to run on your **local machine**. It prints only the SSH command (one line). 
Paste that line into a\n", "terminal tab and keep it open.\n", "\n", "Examples:\n", "\n", "```bash\n", "# Show full beginner-oriented help:\n", "bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --help\n", "\n", "# Server/login node (Python runs on the login node):\n", "bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh user@server\n", "\n", "# Compute node (Python runs on compute node; go via login as jump host):\n", "bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --jump user@login compute123\n", "\n", "# Compute node host auto-detected from Jupyter URL (kernel on compute node):\n", "bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --jump user@login \"http://compute123:8888/lab?token=...\"\n", "\n", "# Optional: one combined SSH command with both Jupyter (-L) and SOCKS (-R):\n", "bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --jump user@login --with-jupyter \"http://compute123:8888/lab?token=...\"\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d7c0d6d1", "metadata": {}, "outputs": [], "source": [ "from __future__ import annotations\n", "\n", "import idtrack\n", "\n", "# Enable the bridge inside the current Python process (e.g. a Jupyter kernel running on the server).\n", "# If `test=True` (default), the bridge pings Ensembl REST and rolls back automatically on failure.\n", "b = idtrack.ConnectionBridge(proxy_port=1080)\n", "ok = b.start(test=True)\n", "print('Bridge enabled:', ok)\n", "\n", "# ... run IDTrack code here ...\n", "# api = idtrack.API(local_repository='./idtrack_cache')\n", "# org, latest = api.resolve_organism('human')\n", "# api.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)\n", "\n", "# Recommended pattern in scripts:\n", "# with idtrack.ConnectionBridge(proxy_port=1080) as _b:\n", "# ... 
do work ...\n", "\n", "b.stop()\n", "print('Bridge disabled')\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }