{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "360bda19",
   "metadata": {},
   "source": [
    "# Part 7 — Advanced Topics\n",
    "\n",
    "*Last updated:* 2026-01-09\n",
    "\n",
    "This notebook collects advanced, production-oriented topics for power users:\n",
    "\n",
    "- custom external database profiles (strict vs broad)\n",
    "- programmatic YAML management (useful in pipelines)\n",
    "- troubleshooting & diagnostics (common failure modes)\n",
    "- restricted-server SSH bridging (ConnectionBridge)\n",
    "- integration patterns (Snakemake/Nextflow-friendly workflows)\n",
    "\n",
    "> **Tip:** You don’t need this notebook for day-to-day conversions. It’s here for when you want reproducibility at scale.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15441637",
   "metadata": {},
   "source": [
    "## 7.1 — Custom External Database Inclusion\n",
    "\n",
    "The external YAML is an explicit **contract**: it defines which external namespaces are allowed to influence your graph.\n",
    "\n",
    "Two common profiles:\n",
    "\n",
    "- **Strict profile (recommended for most analyses):** small allowlist, lower ambiguity, faster queries.\n",
    "- **Broad profile (exploration):** larger allowlist, more coverage, but higher ambiguity and slower builds.\n",
    "\n",
    "In IDTrack today, the active YAML file name is fixed (`<organism>_externals_modified.yml`).\n",
    "A practical pattern is to keep multiple profiles as *side-by-side files* and copy/rename the one you want for a given run.\n",
    "\n",
    "The next cell demonstrates how to generate two profile files **without overwriting** your active YAML.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d90a415",
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import os\n",
    "from copy import deepcopy\n",
    "from pathlib import Path\n",
    "\n",
    "import yaml\n",
    "\n",
    "try:\n",
    "    import idtrack\n",
    "\n",
    "    IDTRACK_OK = True\n",
    "except Exception as e:\n",
    "    print('idtrack import failed ->', repr(e))\n",
    "    IDTRACK_OK = False\n",
    "\n",
    "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n",
    "LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "print('Local repository:', LOCAL_REPOSITORY)\n",
    "\n",
    "if IDTRACK_OK:\n",
    "    organism = 'homo_sapiens'\n",
    "\n",
    "    # Prefer your configured YAML if it exists; otherwise fall back to the package default.\n",
    "    configured = LOCAL_REPOSITORY / f'{organism}_externals_modified.yml'\n",
    "    default_cfg = Path(idtrack.__file__).resolve().parent / 'default_config' / f'{organism}_externals_modified.yml'\n",
    "\n",
    "    source_path = configured if configured.exists() else default_cfg\n",
    "    print('Reading YAML from:', source_path)\n",
    "\n",
    "    y = yaml.safe_load(source_path.read_text(encoding='utf-8'))\n",
    "    form = list(y[organism].keys())[0]\n",
    "\n",
    "    strict_allowlist = {\n",
    "        'HGNC Symbol',\n",
    "        'EntrezGene',\n",
    "        'UniProtKB',\n",
    "        'RefSeq_mRNA',\n",
    "    }\n",
    "\n",
    "    broad_allowlist = strict_allowlist | {\n",
    "        # Add cautiously; broad profiles can increase ambiguity.\n",
    "        'RefSeq_peptide',\n",
    "        'ArrayExpress',\n",
    "    }\n",
    "\n",
    "    def make_profile(base: dict, allowlist: set[str]) -> dict:\n",
    "        out = deepcopy(base)\n",
    "        for db_name, db_block in out[organism][form].items():\n",
    "            include = db_name in allowlist\n",
    "            for _asm, attrs in db_block.get('Assembly', {}).items():\n",
    "                attrs['Include'] = bool(include)\n",
    "        return out\n",
    "\n",
    "    strict_yaml = make_profile(y, strict_allowlist)\n",
    "    broad_yaml = make_profile(y, broad_allowlist)\n",
    "\n",
    "    strict_path = LOCAL_REPOSITORY / f'{organism}_externals_modified_strict.yml'\n",
    "    broad_path = LOCAL_REPOSITORY / f'{organism}_externals_modified_broad.yml'\n",
    "\n",
    "    strict_path.write_text(yaml.safe_dump(strict_yaml, sort_keys=False, allow_unicode=True), encoding='utf-8')\n",
    "    broad_path.write_text(yaml.safe_dump(broad_yaml, sort_keys=False, allow_unicode=True), encoding='utf-8')\n",
    "\n",
    "    print('Wrote strict profile:', strict_path.name)\n",
    "    print('Wrote broad  profile:', broad_path.name)\n",
    "\n",
    "    print()\n",
    "    print('To activate a profile: copy/rename it to:')\n",
    "    print(' ', configured)\n",
    "else:\n",
    "    print('Skipping YAML profile demo (idtrack not imported).')\n"
   ]
  },
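  {
   "cell_type": "markdown",
   "id": "b3f1c7a2",
   "metadata": {},
   "source": [
    "The cell above stops at printing the target path, so activation is still a manual copy/rename. A minimal sketch of that step follows, assuming activation is nothing more than copying a profile onto the fixed active file name described in this section. `activate_profile` and the `.bak` backup suffix are hypothetical names for illustration, not part of IDTrack.\n",
    "\n",
    "```python\n",
    "from __future__ import annotations\n",
    "\n",
    "import shutil\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def activate_profile(profile_path: Path, active_path: Path) -> Path | None:\n",
    "    \"\"\"Copy a side-by-side profile onto the active YAML name.\n",
    "\n",
    "    Returns the backup path if an active YAML already existed, else None.\n",
    "    \"\"\"\n",
    "    if not profile_path.is_file():\n",
    "        raise FileNotFoundError(f'Profile not found: {profile_path}')\n",
    "\n",
    "    backup_path = None\n",
    "    if active_path.exists():\n",
    "        # Keep the previous active YAML so the switch is reversible.\n",
    "        backup_path = active_path.with_name(active_path.name + '.bak')\n",
    "        shutil.copy2(active_path, backup_path)\n",
    "\n",
    "    # The profile file itself is left in place; only the active name changes content.\n",
    "    shutil.copy2(profile_path, active_path)\n",
    "    return backup_path\n",
    "```\n",
    "\n",
    "For example, `activate_profile(strict_path, configured)` would make the strict profile generated above the active one for the next run.\n"
   ]
  },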
  {
   "cell_type": "markdown",
   "id": "4cc22683",
   "metadata": {},
   "source": [
    "## 7.2 — Programmatic YAML Management\n",
    "\n",
    "For pipelines you often want YAML changes to be **scriptable** and **repeatable**.\n",
    "\n",
    "A safe automation pattern:\n",
    "\n",
    "1. Generate (or refresh) a template YAML.\n",
    "2. Apply a curated allowlist in code.\n",
    "3. Write the resulting YAML to a file you commit alongside your pipeline.\n",
    "\n",
    "The next cell shows a reusable helper that:\n",
    "- reads a template YAML\n",
    "- applies an allowlist\n",
    "- writes a modified YAML\n",
    "\n",
    "> **Tip:** Keep allowlists per organism in your pipeline repository (as Python sets or a small TOML/YAML). That makes your configuration reviewable.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1980ce0f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import os\n",
    "from copy import deepcopy\n",
    "from pathlib import Path\n",
    "\n",
    "import yaml\n",
    "\n",
    "\n",
    "def apply_allowlist_to_yaml(template_path: Path, organism: str, allowlist: set[str], out_path: Path) -> None:\n",
    "    y = yaml.safe_load(template_path.read_text(encoding='utf-8'))\n",
    "    form = list(y[organism].keys())[0]\n",
    "\n",
    "    out = deepcopy(y)\n",
    "    enabled = []\n",
    "\n",
    "    for db_name, db_block in out[organism][form].items():\n",
    "        include = db_name in allowlist\n",
    "        if include:\n",
    "            enabled.append(db_name)\n",
    "        for _asm, attrs in db_block.get('Assembly', {}).items():\n",
    "            attrs['Include'] = bool(include)\n",
    "\n",
    "    out_path.write_text(yaml.safe_dump(out, sort_keys=False, allow_unicode=True), encoding='utf-8')\n",
    "    print('Wrote:', out_path)\n",
    "    print('Enabled (count):', len(enabled))\n",
    "    print('Enabled (sample):', enabled[:15])\n",
    "\n",
    "\n",
    "# Example: apply a mouse allowlist if a template exists in your local repository\n",
    "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n",
    "organism = 'mus_musculus'\n",
    "\n",
    "template = LOCAL_REPOSITORY / f'{organism}_externals_template.yml'\n",
    "out_yaml = LOCAL_REPOSITORY / f'{organism}_externals_modified.yml'\n",
    "\n",
    "allowlist_mouse = {'MGI Symbol', 'EntrezGene', 'UniProtKB', 'RefSeq_mRNA'}\n",
    "\n",
    "if template.exists():\n",
    "    apply_allowlist_to_yaml(template, organism=organism, allowlist=allowlist_mouse, out_path=out_yaml)\n",
    "else:\n",
    "    print('Template not found:', template)\n",
    "    print('Generate it first with Part 2 (02_prepare_new_external_yaml.ipynb).')\n"
   ]
  },
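  {
   "cell_type": "markdown",
   "id": "c94e0d15",
   "metadata": {},
   "source": [
    "The tip above suggests keeping allowlists in a small TOML file. One way this could look is sketched below, assuming Python 3.11 or newer (for the standard-library `tomllib`). The `[allowlists]` table layout and the `load_allowlists` helper are hypothetical choices for illustration, not an IDTrack convention.\n",
    "\n",
    "```python\n",
    "from __future__ import annotations\n",
    "\n",
    "import tomllib  # standard library from Python 3.11\n",
    "\n",
    "\n",
    "def load_allowlists(toml_text: str) -> dict[str, set[str]]:\n",
    "    \"\"\"Parse per-organism allowlists from TOML text into sets of database names.\"\"\"\n",
    "    data = tomllib.loads(toml_text)\n",
    "    return {organism: set(names) for organism, names in data['allowlists'].items()}\n",
    "\n",
    "\n",
    "# Hypothetical file content; in a pipeline this would live in a committed .toml file.\n",
    "EXAMPLE_TOML = '''\n",
    "[allowlists]\n",
    "homo_sapiens = [\"HGNC Symbol\", \"EntrezGene\", \"UniProtKB\", \"RefSeq_mRNA\"]\n",
    "mus_musculus = [\"MGI Symbol\", \"EntrezGene\", \"UniProtKB\", \"RefSeq_mRNA\"]\n",
    "'''\n",
    "\n",
    "allowlists = load_allowlists(EXAMPLE_TOML)\n",
    "print(sorted(allowlists['mus_musculus']))\n",
    "```\n",
    "\n",
    "Each resulting set can be passed as `allowlist=` to `apply_allowlist_to_yaml` from the cell above, so a change to the allowlist shows up as a one-line diff in code review.\n"
   ]
  },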
  {
   "cell_type": "markdown",
   "id": "281588a8",
   "metadata": {},
   "source": [
    "## 7.3 — Troubleshooting & Diagnostics\n",
    "\n",
    "Common failure modes (and what they usually mean):\n",
    "\n",
    "- **Permission errors in `IDTRACK_LOCAL_REPO`:** your cache directory is not writable.\n",
    "- **REST timeouts:** network issues or Ensembl REST is temporarily slow.\n",
    "- **MySQL connection errors:** outbound MySQL ports may be blocked; IDTrack will fall back to HTTPS/FTP dumps (slower but functional). Port `3337` is used only for the human GRCh37 archive.\n",
    "- **`ValueError: release not included in YAML`:** your external YAML does not include the snapshot boundary you chose.\n",
    "- **Unexpected 1→n explosion:** your external allowlist is too broad or contains promiscuous namespaces.\n",
    "\n",
    "The next cell is a compact diagnostic report you can paste into issues or lab notes.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e62acf76",
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import os\n",
    "import socket\n",
    "from pathlib import Path\n",
    "\n",
    "report = {}\n",
    "\n",
    "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n",
    "LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "report['local_repository'] = str(LOCAL_REPOSITORY)\n",
    "report['local_repository_writable'] = os.access(LOCAL_REPOSITORY, os.W_OK)\n",
    "\n",
    "# YAML + graph snapshot inventory\n",
    "report['yaml_files'] = sorted(p.name for p in LOCAL_REPOSITORY.glob('*_externals_modified.yml'))\n",
    "report['graph_snapshots'] = sorted(p.name for p in LOCAL_REPOSITORY.glob('graph_*.pickle'))\n",
    "\n",
    "# REST connectivity\n",
    "try:\n",
    "    import requests\n",
    "\n",
    "    try:\n",
    "        r = requests.get('https://rest.ensembl.org/info/ping', headers={'Content-Type': 'application/json'}, timeout=15)\n",
    "        report['ensembl_rest_status'] = r.status_code\n",
    "    except Exception as e:\n",
    "        report['ensembl_rest_status'] = f'failed: {e.__class__.__name__}'\n",
    "except Exception as e:\n",
    "    report['ensembl_rest_status'] = f'requests_missing: {e.__class__.__name__}'\n",
    "\n",
    "# MySQL ports (best-effort)\n",
    "# Note: port 3337 is for the human GRCh37 archive; other species typically use 3306/5306.\n",
    "try:\n",
    "    from idtrack._db import DB\n",
    "\n",
    "    host = DB.mysql_host\n",
    "    port_status = {}\n",
    "    for port in [3306, 5306, 3337]:\n",
    "        try:\n",
    "            with socket.create_connection((host, port), timeout=2):\n",
    "                port_status[port] = 'ok'\n",
    "        except OSError as e:\n",
    "            port_status[port] = e.__class__.__name__\n",
    "    report['ensembl_mysql_ports'] = port_status\n",
    "except Exception as e:\n",
    "    report['ensembl_mysql_ports'] = f'skipped: {e.__class__.__name__}'\n",
    "\n",
    "# Print report\n",
    "for k, v in report.items():\n",
    "    print(f'{k}: {v}')\n",
    "\n",
    "# Optional: quick integrity checks (only if a human snapshot exists)\n",
    "# NOTE: TrackTests can be expensive. We only run a very small, cheap check here.\n",
    "if report['graph_snapshots'] and any('homo_sapiens' in g for g in report['graph_snapshots']):\n",
    "    try:\n",
    "        import idtrack\n",
    "\n",
    "        api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n",
    "        api.configure_logger()\n",
    "        org, latest = api.resolve_organism('human')\n",
    "\n",
    "        # Load existing snapshot if present; build if missing (may be slow).\n",
    "        api.build_graph(organism_name=org, snapshot_release=latest, return_test=True, calculate_caches=False)\n",
    "\n",
    "        ok = api.track.is_edge_with_same_nts_only_at_backbone_nodes()\n",
    "        print()\n",
    "        print('Quick TrackTests check (edge nts invariant):', ok)\n",
    "    except Exception as e:\n",
    "        print()\n",
    "        print('TrackTests quick check skipped/failed ->', repr(e))\n"
   ]
  },
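  {
   "cell_type": "markdown",
   "id": "e58a2b6f",
   "metadata": {},
   "source": [
    "When a conversion shows an unexpected 1→n explosion, a useful first check is which databases the active YAML actually enables. The sketch below works on an already-parsed YAML dictionary and assumes the same nesting the earlier cells rely on (organism, then form, then database, then `Assembly`, then `Include`). `enabled_databases` is a hypothetical helper, and the form name `gene` and assembly key `38` in the example are placeholders.\n",
    "\n",
    "```python\n",
    "from __future__ import annotations\n",
    "\n",
    "\n",
    "def enabled_databases(cfg: dict, organism: str) -> dict[str, list[str]]:\n",
    "    \"\"\"Map each form to the databases with at least one assembly marked Include.\"\"\"\n",
    "    enabled = {}\n",
    "    for form, databases in cfg[organism].items():\n",
    "        enabled[form] = sorted(\n",
    "            db_name\n",
    "            for db_name, db_block in databases.items()\n",
    "            if any(attrs.get('Include') for attrs in db_block.get('Assembly', {}).values())\n",
    "        )\n",
    "    return enabled\n",
    "\n",
    "\n",
    "# Tiny hand-written config (placeholder form and assembly keys) to show the shape.\n",
    "example = {\n",
    "    'homo_sapiens': {\n",
    "        'gene': {\n",
    "            'HGNC Symbol': {'Assembly': {38: {'Include': True}}},\n",
    "            'ArrayExpress': {'Assembly': {38: {'Include': False}}},\n",
    "        }\n",
    "    }\n",
    "}\n",
    "print(enabled_databases(example, 'homo_sapiens'))\n",
    "```\n",
    "\n",
    "To audit a real file, load it with `yaml.safe_load(path.read_text(encoding='utf-8'))` as in the earlier cells and compare the result against the allowlist you intended.\n"
   ]
  },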
  {
   "cell_type": "markdown",
   "id": "52ee6099",
   "metadata": {},
   "source": [
    "## 7.4 — Integration Patterns\n",
    "\n",
    "IDTrack works best in pipelines when you make the snapshot boundary and cache location explicit.\n",
    "\n",
    "Key ideas:\n",
    "\n",
    "- Set `IDTRACK_LOCAL_REPO` to a stable, shared path (per project or per compute environment).\n",
    "- Build snapshots once, then reuse them across jobs.\n",
    "- Store your external YAML alongside your pipeline so the configuration is reviewable.\n",
    "\n",
    "Below are lightweight patterns you can adapt.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38da584f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Snakemake / Nextflow snippets (printed as plain text)\n",
    "\n",
    "snakemake_rule = \"\"\"\n",
    "rule idtrack_build_human_graph:\n",
    "    output:\n",
    "        'idtrack_cache/graph_homo_sapiens_*.pickle'\n",
    "    shell:\n",
    "        \"python - <<'PY'\\nimport os\\nfrom pathlib import Path\\nimport idtrack\\n\\nos.environ['IDTRACK_LOCAL_REPO'] = os.path.abspath('idtrack_cache')\\nPath(os.environ['IDTRACK_LOCAL_REPO']).mkdir(parents=True, exist_ok=True)\\n\\napi = idtrack.API(local_repository=os.environ['IDTRACK_LOCAL_REPO'])\\norg, latest = api.resolve_organism('human')\\napi.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)\\nprint('Built:', org, latest)\\nPY\"\n",
    "\"\"\".strip()\n",
    "\n",
    "nextflow_process = \"\"\"\n",
    "process BUILD_IDTRACK_HUMAN {\n",
    "  output:\n",
    "    path \"idtrack_cache/graph_homo_sapiens_*.pickle\"\n",
    "  script:\n",
    "    \"export IDTRACK_LOCAL_REPO=$PWD/idtrack_cache\\npython - <<'PY'\\nimport os\\nfrom pathlib import Path\\nimport idtrack\\n\\nrepo = os.environ['IDTRACK_LOCAL_REPO']\\nPath(repo).mkdir(parents=True, exist_ok=True)\\n\\napi = idtrack.API(local_repository=repo)\\norg, latest = api.resolve_organism('human')\\napi.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)\\nprint('Built:', org, latest)\\nPY\"\n",
    "}\n",
    "\"\"\".strip()\n",
    "\n",
    "print('--- Snakemake example ---')\n",
    "print(snakemake_rule)\n",
    "print('--- Nextflow example ---')\n",
    "print(nextflow_process)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a21f4f3b",
   "metadata": {},
	   "source": [
	    "## 7.5 — No-Internet Servers: SSH ConnectionBridge\n",
	    "\n",
	    "Some HPC clusters block outbound internet access from compute nodes. IDTrack needs to contact Ensembl services, so in\n",
	    "these environments you can route networking through your **local machine** using:\n",
	    "\n",
	    "1) an SSH reverse SOCKS proxy (`ssh -R ...`) that creates a SOCKS5 proxy *on the remote machine*, and\n",
	    "2) `idtrack.ConnectionBridge` inside the Python process so Python uses that proxy.\n",
	    "\n",
	    "### The most important rule\n",
	    "\n",
	    "**The SOCKS proxy must exist on the same machine where your Python code is running.**\n",
	    "\n",
	    "- If your Python runs on the login node → the proxy must be on the login node.\n",
	    "- If your Python runs on a compute node → the proxy must be on that compute node.\n",
	    "- If your Jupyter kernel runs on a compute node → the proxy must be on that compute node (not on the login node).\n",
	    "\n",
	    "### Why “two SSH tabs” is normal (inbound vs outbound)\n",
	    "\n",
	    "- **Inbound** traffic: your browser (local) needs to reach Jupyter (remote) → solved with `ssh -L` (Jupyter tunnel).\n",
	    "- **Outbound** traffic: your remote Python process needs to reach the internet (Ensembl) → solved with `ssh -R` +\n",
	    "  `idtrack.ConnectionBridge`.\n",
	    "\n",
	    "These are independent problems, so using two separate SSH sessions/tabs is common. If your cluster allows you to SSH\n",
	    "directly to the machine running the kernel (often via `ssh -J login compute`), you can also combine `-L` and `-R`\n",
	    "in a single SSH command.\n",
	    "\n",
	    "### Scenario 1 — Python runs on the server/login node (no compute node)\n",
	    "\n",
	    "On your **local machine** (keep this SSH session open):\n",
	    "\n",
	    "```bash\n",
	    "ssh -N -R 127.0.0.1:1080 user@server\n",
	    "```\n",
	    "\n",
	    "On the **server** (run IDTrack inside this Python process):\n",
	    "\n",
	    "```python\n",
	    "import idtrack\n",
	    "\n",
	    "with idtrack.ConnectionBridge(proxy_port=1080):\n",
	    "    # ... run IDTrack code ...\n",
	    "    pass\n",
	    "```\n",
	    "\n",
	    "### Scenario 2 — Python runs on a compute node (Slurm), not Jupyter (plain Python script)\n",
	    "\n",
	    "This is the same idea, but the target machine is the **compute node**.\n",
	    "\n",
	    "1) Start your job and learn the compute node hostname (often from job output or `hostname`).\n",
	    "2) On your **local machine**, create the proxy on that compute node (often via the login node as a jump host):\n",
	    "\n",
	    "```bash\n",
	    "ssh -N -J user@login -R 127.0.0.1:1080 user@compute123\n",
	    "```\n",
	    "\n",
	    "3) Run your Python script on the **compute node**, and enable `ConnectionBridge` at the start of the program.\n",
	    "\n",
	    "### Scenario 3 — Jupyter kernel runs on a compute node (browser on your local machine)\n",
	    "\n",
	    "You usually need BOTH:\n",
	    "\n",
	    "- an **inbound** Jupyter tunnel (`ssh -L ...`) so your browser can reach Jupyter, and\n",
	    "- an **outbound** SOCKS tunnel (`ssh -R ...`) so the kernel can reach Ensembl.\n",
	    "\n",
	    "Option A (common): keep your existing Jupyter tunnel workflow, and open a second SSH tab for the `ssh -R ...` tunnel.\n",
	    "\n",
	    "Option B (single SSH session): only if you can SSH to the compute node (often via `-J`). Example (adjust ports/users):\n",
	    "\n",
	    "```bash\n",
	    "ssh -N -J user@login -L 8888:127.0.0.1:8888 -R 127.0.0.1:1080 user@compute123\n",
	    "```\n",
	    "\n",
	    "### What `ConnectionBridge` changes (briefly)\n",
	    "\n",
	    "- It patches `socket.socket` to PySocks’ `socks.socksocket` so most Python networking stacks transparently use the proxy\n",
	    "  (requests/urllib3, urllib, PyMySQL, etc.).\n",
	    "- It optionally sets `ALL_PROXY`/`all_proxy` so child processes inherit the proxy.\n",
	    "- It restores the original state on `stop()` (also best-effort on interpreter exit).\n",
	    "\n",
	    "> **Tip:** Prefer the context-manager form (`with ConnectionBridge(...)`) in scripts so cleanup is guaranteed.\n"
	   ]
	  },
	  {
	   "cell_type": "markdown",
   "id": "56b1a7c6",
	   "metadata": {},
	   "source": [
	    "### Copy/paste helper: ConnectionBridge SSH command\n",
	    "\n",
	    "There is a ready-to-use helper script in this repo:\n",
	    "\n",
	    "- `idtrack/reproducibility/scripts/connection_bridge_tunnel.sh`\n",
	    "\n",
	    "It is meant to run on your **local machine**. It prints only the SSH command (one line). Paste that line into a\n",
	    "terminal tab and keep it open.\n",
	    "\n",
	    "Examples:\n",
	    "\n",
	    "```bash\n",
	    "# Show full beginner-oriented help:\n",
	    "bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --help\n",
	    "\n",
	    "# Server/login node (Python runs on the login node):\n",
	    "bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh user@server\n",
	    "\n",
	    "# Compute node (Python runs on compute node; go via login as jump host):\n",
	    "bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --jump user@login compute123\n",
	    "\n",
	    "# Compute node host auto-detected from Jupyter URL (kernel on compute node):\n",
	    "bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --jump user@login \"http://compute123:8888/lab?token=...\"\n",
	    "\n",
	    "# Optional: one combined SSH command with both Jupyter (-L) and SOCKS (-R):\n",
	    "bash idtrack/reproducibility/scripts/connection_bridge_tunnel.sh --jump user@login --with-jupyter \"http://compute123:8888/lab?token=...\"\n",
	    "```\n"
	   ]
	  },
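  {
   "cell_type": "markdown",
   "id": "f07d3c91",
   "metadata": {},
   "source": [
    "Because the proxy must exist on the machine where Python runs, it can help to confirm that something is listening on the proxy port before enabling the bridge. The sketch below uses only the standard library. `proxy_is_listening` is a hypothetical helper, and it only checks for a TCP listener; it does not verify that the listener speaks SOCKS5 or can reach Ensembl.\n",
    "\n",
    "```python\n",
    "from __future__ import annotations\n",
    "\n",
    "import socket\n",
    "\n",
    "\n",
    "def proxy_is_listening(port: int = 1080, host: str = '127.0.0.1', timeout: float = 2.0) -> bool:\n",
    "    \"\"\"Return True if something accepts TCP connections on host:port.\"\"\"\n",
    "    try:\n",
    "        with socket.create_connection((host, port), timeout=timeout):\n",
    "            return True\n",
    "    except OSError:\n",
    "        # Connection refused, timeout, or unreachable host all mean no usable listener.\n",
    "        return False\n",
    "\n",
    "\n",
    "if not proxy_is_listening(1080):\n",
    "    print('No listener on 127.0.0.1:1080. Is the ssh -R session open for this machine?')\n",
    "```\n",
    "\n",
    "Run this check before starting `ConnectionBridge`, while `socket.socket` is still unpatched, so the probe connects directly rather than through the proxy.\n"
   ]
  },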
	  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7c0d6d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import idtrack\n",
    "\n",
    "# Enable the bridge inside the current Python process (e.g. a Jupyter kernel running on the server).\n",
    "# If `test=True` (default), the bridge pings Ensembl REST and rolls back automatically on failure.\n",
    "b = idtrack.ConnectionBridge(proxy_port=1080)\n",
    "ok = b.start(test=True)\n",
    "print('Bridge enabled:', ok)\n",
    "\n",
    "# ... run IDTrack code here ...\n",
    "# api = idtrack.API(local_repository='./idtrack_cache')\n",
    "# org, latest = api.resolve_organism('human')\n",
    "# api.build_graph(organism_name=org, snapshot_release=latest, calculate_caches=True)\n",
    "\n",
    "# Recommended pattern in scripts:\n",
    "# with idtrack.ConnectionBridge(proxy_port=1080) as _b:\n",
    "#     ... do work ...\n",
    "\n",
    "b.stop()\n",
    "print('Bridge disabled')\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}