{ "cells": [ { "cell_type": "markdown", "id": "e93912ed", "metadata": {}, "source": [ "# Part 2 — External Database Configuration\n", "\n", "*Last updated:* 2026-01-08\n", "\n", "This notebook is the step-by-step guide for creating and validating the **external YAML** files that control which **external databases**\n", "(HGNC, MGI, UniProt, RefSeq, …) are included when IDTrack builds the identifier graph.\n", "\n", "By the end, you will:\n", "- have `_externals_modified.yml` files in your local repository\n", "- understand why keeping your external database selection **small and curated** improves mapping quality\n", "- know how to validate that your YAML is compatible with your chosen snapshot boundary\n", "\n", "This notebook covers:\n", "- **2.1 Human** (*Homo sapiens*) — multi-assembly (GRCh38 + GRCh37, and older archives when available)\n", "- **2.2 Mouse** (*Mus musculus*) — clean handoff (one maintained assembly per release: GRCm37 → GRCm38 → GRCm39)\n", "- **2.3 Pig** (*Sus scrofa*) — clean handoff (one maintained assembly per release: Sscrofa9.2 → Sscrofa10.2 → Sscrofa11.1)\n", "- **2.4 Adding a new organism** (advanced; may require a small code configuration step)\n", "\n", "In IDTrack, assemblies are a first-class dimension (not just “primary vs legacy”): the templates keep all assembly entries that Ensembl\n", "exposes for the species. 
Keeping them is usually the right choice, especially when you integrate datasets annotated with different GTFs or\n", "reference packages.\n", "\n", "> **Tip:** If you're new, start with `00_idtrack_overview.ipynb` (concepts) and `01_installation_guide.ipynb` (setup).\n" ] }, { "cell_type": "markdown", "id": "9f8dcd10", "metadata": {}, "source": [ "## 2.0 — Pre-requisites (what you need before you start)\n", "\n", "- A working Python environment with IDTrack installed (`pip install idtrack`)\n", "- Network access (first-time runs download Ensembl metadata)\n", "- A writable **local repository** folder (IDTrack cache)\n", "\n", "**What you get at the end:**\n", "- `homo_sapiens_externals_modified.yml`\n", "- `mus_musculus_externals_modified.yml`\n", "- `sus_scrofa_externals_modified.yml`\n", "\n", "Each file lives in your local repository and is safe to share with collaborators.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "fad329f4", "metadata": {}, "outputs": [], "source": [ "# Load notebook utilities (collapsible output magic for tutorials)\n", "%load_ext _notebook_utils" ] }, { "cell_type": "code", "execution_count": 2, "id": "0136c539", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Local repository: /Users/kemalinecik/git_nosync/master_idtrack/idtrack/docs/_notebooks/idtrack_cache\n" ] } ], "source": [ "# 1) Setup (run this once)\n", "from __future__ import annotations\n", "\n", "import os\n", "from pathlib import Path\n", "\n", "import yaml\n", "import idtrack\n", "\n", "LOCAL_REPOSITORY = Path(os.environ.get('IDTRACK_LOCAL_REPO', './idtrack_cache')).resolve()\n", "LOCAL_REPOSITORY.mkdir(parents=True, exist_ok=True)\n", "\n", "api = idtrack.API(local_repository=str(LOCAL_REPOSITORY))\n", "api.configure_logger()\n", "\n", "print('Local repository:', LOCAL_REPOSITORY)" ] }, { "cell_type": "markdown", "id": "764a9be6", "metadata": {}, "source": [ "### What is this local repository?\n", "\n", "Think of it as 
your **IDTrack workspace**. It will contain:\n", "- cached Ensembl tables (downloaded once, reused many times)\n", "- graph snapshot files (`graph__... .pickle`)\n", "- your external YAML files (`*_externals_modified.yml`)\n", "\n", "If you work on multiple projects, you can use one shared local repository (big but convenient),\n", "or separate local repositories (clean separation, easier to archive/share).\n" ] }, { "cell_type": "markdown", "id": "ae470099", "metadata": {}, "source": [ "## 2. External YAML in plain language\n", "\n", "The YAML answers the question:\n", "\n", "> **Which external databases should IDTrack trust and include as edges in the graph?**\n", "\n", "Ensembl knows about *many* external resources. Some are high quality and useful, some are redundant, and some\n", "create a lot of branching (ambiguity).\n", "\n", "A good rule of thumb:\n", "- enable a **small, curated** set of strong databases\n", "- avoid enabling lots of near-duplicates\n", "\n", "### YAML structure (what you will see)\n", "\n", "The generated template is nested like this (names in angle brackets are illustrative placeholders for the organism, the external database, and the assembly code):\n", "\n", "```yaml\n", "<organism>:\n", "  gene:\n", "    <database_name>:\n", "      Assembly:\n", "        <assembly_code>:\n", "          Ensembl release: \"90,91,92,...\"\n", "          Include: false\n", "          Database Index: 123\n", "          Potential Synonymous: \"\"\n", "```\n", "\n", "**You mainly edit one thing:** `Include: false` → `Include: true`.\n", "\n", "Why is there an `Assembly:` level?\n", "\n", "- Some external databases are only available (or only well-populated) on specific assembly/release combinations.\n", "- IDTrack can map across assemblies, so it needs to know which edges exist in which build context.\n", "\n", "For most curated databases, enabling them for **all assemblies listed in the template** is the right default.\n" ] }, { "cell_type": "markdown", "id": "1297c6b1", "metadata": {}, "source": [ "## 3. 
Choose your snapshot release (reproducibility knob)\n", "\n", "When you build graphs later, you will pick a **snapshot release** (maximum Ensembl release).\n", "\n", "For beginners, the best choice is usually:\n", "- **the latest release** for the organism\n", "\n", "For reproducible research projects, consider pinning:\n", "- the Ensembl release used by your reference annotation (e.g. the one your pipeline used)\n", "- or the release used by a published dataset you integrate\n" ] }, { "cell_type": "markdown", "id": "c8016c4a", "metadata": {}, "source": [ "## 2.1 — Human (Homo sapiens)\n", "\n", "Human is special in one convenient way: IDTrack ships a **default external YAML** for human.\n", "\n", "An important conceptual point before you start editing:\n", "\n", "- Assemblies are not just “primary vs legacy” in IDTrack.\n", "- Assemblies are part of the mapping problem.\n", "\n", "This matters in real projects (especially atlas building), where you often combine datasets annotated with different genome builds:\n", "- GRCh37-era references / GTFs\n", "- GRCh38-era references / GTFs\n", "\n", "IDTrack’s snapshot graphs are **multi-assembly**. Keeping multiple assembly blocks enabled in the external YAML helps the path-finder:\n", "- interpret input identifiers in the correct assembly context\n", "- map across assemblies when needed (in addition to mapping across releases)\n", "- use external databases that exist only on specific assemblies as additional “bridges”\n", "\n", "For human, the shipped default YAML already includes the common human assembly codes exposed by Ensembl (typically `38` = GRCh38 and\n", "`37` = GRCh37). Depending on release coverage, you may also see older archive assemblies in templates.\n", "\n", "You have two good options:\n", "\n", "1. **Use the shipped default as a starting point** (fast, recommended for most users)\n", "2. 
**Regenerate a fresh template from live Ensembl metadata** (slower, but useful if you want to refresh external database lists)\n", "\n", "Recommended external databases for human (high signal, widely used):\n", "- **HGNC Symbol** (human-readable gene symbols)\n", "- **EntrezGene**\n", "- **UniProtKB**\n", "- **RefSeq_mRNA** (optionally also RefSeq proteins if your workflow needs them)\n", "\n", "> **Warning:** Assembly codes are species-specific. `38` means GRCh38 for human, but `38` means GRCm38 for mouse.\n", "\n", "> **Tip:** Keep your allowlist small at first. You can expand later once you understand how ambiguity shows up in your results.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "f73a89e5", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2026-01-11 00:11:47 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.\n" ] }, { "data": { "text/plain": [ "('homo_sapiens', 115)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 4.1 Resolve organism name + latest release\n", "organism_hs, latest_release_hs = api.resolve_organism('human')\n", "SNAPSHOT_RELEASE_HS = latest_release_hs # change if you want to pin\n", "organism_hs, SNAPSHOT_RELEASE_HS\n" ] }, { "cell_type": "markdown", "id": "dec8a72f", "metadata": {}, "source": [ "### 2.1.1 Create a DatabaseManager snapshot\n", "\n", "This object is responsible for talking to Ensembl (downloads + caching) **bounded by your snapshot release**.\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "e1919603", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2026-01-11 00:12:05 INFO:database_manager: Using assembly-specific release range for homo_sapiens assembly 38: releases 76-115 (from config [76, None])\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm_hs = 
api.get_database_manager(organism_name=organism_hs, snapshot_release=SNAPSHOT_RELEASE_HS)\n", "dm_hs\n" ] }, { "cell_type": "markdown", "id": "d6cfc180", "metadata": {}, "source": [ "### 2.1.2 Generate a template YAML (optional for human, required for mouse/pig)\n", "\n", "This step can take time, because IDTrack enumerates database metadata across releases.\n", "You typically do it once per organism (and then keep the YAML).\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "45c5ce0a", "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "{'template_yaml': '/Users/kemalinecik/git_nosync/master_idtrack/idtrack/docs/_notebooks/idtrack_cache/homo_sapiens_externals_template.yml',\n", " 'modified_yaml': '/Users/kemalinecik/git_nosync/master_idtrack/idtrack/docs/_notebooks/idtrack_cache/homo_sapiens_externals_modified.yml'}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "
\n", "Click to show download logs\n", "
2026-01-11 00:12:13 INFO:database_manager: Processing assembly 38 for homo_sapiens...\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `76`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `77`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `78`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `79`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `80`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `81`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `82`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `83`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `84`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `85`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `86`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `87`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `88`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `89`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `90`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `91`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `92`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `93`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `94`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `95`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `96`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `97`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `98`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `99`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `100`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `101`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `102`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `103`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `104`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `105`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `106`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `107`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `108`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `109`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `110`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `111`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `112`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `113`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `114`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `gene`, ensembl release `115`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `76`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `77`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `78`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `79`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `80`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `81`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `82`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `83`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `84`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `85`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `86`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `87`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `88`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `89`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `90`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `91`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `92`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `93`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `94`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `95`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `96`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `97`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `98`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `99`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `100`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `101`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `102`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `103`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `104`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `105`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `106`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `107`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `108`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `109`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `110`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `111`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `112`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `113`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `114`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `transcript`, ensembl release `115`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `76`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `77`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `78`\n",
       "2026-01-11 00:12:13 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `79`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `80`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `81`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `82`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `83`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `84`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `85`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `86`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `87`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `88`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `89`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `90`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `91`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `92`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `93`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `94`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `95`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `96`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `97`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `98`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `99`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `100`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `101`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `102`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `103`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `104`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `105`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `106`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `107`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `108`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `109`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `110`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `111`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `112`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `113`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `114`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `38`, form `translation`, ensembl release `115`\n",
       "2026-01-11 00:12:14 INFO:database_manager: Processing assembly 37 for homo_sapiens...\n",
       "2026-01-11 00:12:15 INFO:database_manager: Using assembly-specific release range for homo_sapiens assembly 37: releases 55-115 (from config [55, None])\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `55`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `56`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `57`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `58`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `59`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `60`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `61`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `62`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `63`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `64`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `65`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `66`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `67`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `68`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `69`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `70`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `71`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `72`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `73`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `74`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `75`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `76`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `77`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `78`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `79`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `80`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `81`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `82`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `83`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `84`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `85`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `86`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `87`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `88`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `89`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `90`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `91`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `92`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `93`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `94`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `95`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `96`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `97`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `98`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `99`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `100`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `101`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `102`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `103`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `104`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `105`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `106`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `107`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `108`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `109`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `110`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `111`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `112`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `113`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `114`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `gene`, ensembl release `115`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `55`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `56`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `57`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `58`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `59`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `60`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `61`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `62`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `63`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `64`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `65`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `66`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `67`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `68`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `69`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `70`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `71`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `72`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `73`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `74`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `75`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `76`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `77`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `78`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `79`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `80`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `81`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `82`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `83`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `84`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `85`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `86`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `87`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `88`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `89`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `90`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `91`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `92`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `93`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `94`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `95`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `96`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `97`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `98`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `99`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `100`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `101`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `102`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `103`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `104`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `105`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `106`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `107`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `108`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `109`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `110`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `111`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `112`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `113`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `114`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `transcript`, ensembl release `115`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `55`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `56`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `57`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `58`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `59`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `60`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `61`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `62`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `63`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `64`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `65`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `66`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `67`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `68`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `69`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `70`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `71`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `72`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `73`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `74`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `75`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `76`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `77`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `78`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `79`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `80`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `81`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `82`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `83`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `84`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `85`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `86`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `87`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `88`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `89`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `90`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `91`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `92`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `93`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `94`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `95`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `96`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `97`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `98`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `99`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `100`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `101`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `102`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `103`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `104`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `105`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `106`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `107`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `108`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `109`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `110`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `111`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `112`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `113`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `114`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `37`, form `translation`, ensembl release `115`\n",
       "2026-01-11 00:12:31 INFO:database_manager: Processing assembly 36 for homo_sapiens...\n",
       "2026-01-11 00:12:32 INFO:database_manager: Using assembly-specific release range for homo_sapiens assembly 36: releases 48-54 (from config [48, 54])\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `gene`, ensembl release `48`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `gene`, ensembl release `49`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `gene`, ensembl release `50`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `gene`, ensembl release `51`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `gene`, ensembl release `52`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `gene`, ensembl release `53`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `gene`, ensembl release `54`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `transcript`, ensembl release `48`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `transcript`, ensembl release `49`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `transcript`, ensembl release `50`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `transcript`, ensembl release `51`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `transcript`, ensembl release `52`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `transcript`, ensembl release `53`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `transcript`, ensembl release `54`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `translation`, ensembl release `48`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `translation`, ensembl release `49`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `translation`, ensembl release `50`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `translation`, ensembl release `51`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `translation`, ensembl release `52`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `translation`, ensembl release `53`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Database content is being created for `homo_sapiens`, assembly `36`, form `translation`, ensembl release `54`\n",
       "2026-01-11 00:12:38 INFO:database_manager: Assembly processing complete for homo_sapiens: Successfully processed 3/3 assemblies. Processed: [36, 37, 38], Failed: none\n",
       "2026-01-11 00:12:38 WARNING:external_databases: File created on /Users/kemalinecik/git_nosync/master_idtrack/idtrack/docs/_notebooks/idtrack_cache/homo_sapiens_externals_template.yml\n",
       "Please edit the file based on requested external databases and add '_modified' to the file name. See package documentation for further detail.
\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "{'template_yaml': '/Users/kemalinecik/git_nosync/master_idtrack/idtrack/docs/_notebooks/idtrack_cache/homo_sapiens_externals_template.yml',\n", " 'modified_yaml': '/Users/kemalinecik/git_nosync/master_idtrack/idtrack/docs/_notebooks/idtrack_cache/homo_sapiens_externals_modified.yml'}" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%collapse Click to show download logs\n", "# Included for tutorial purposes only.\n", "\n", "# NOTE: This can take a while on first run (network + caching).\n", "df_hs = dm_hs.create_database_content(just_download=False)\n", "dm_hs.external_inst.create_template_yaml(df_hs)\n", "\n", "template_hs = Path(dm_hs.external_inst.file_name_template_yaml())\n", "modified_hs = Path(dm_hs.external_inst.file_name_modified_yaml(mode='configured'))\n", "{\n", " 'template_yaml': str(template_hs),\n", " 'modified_yaml': str(modified_hs),\n", "}" ] }, { "cell_type": "markdown", "id": "24fb9c9d", "metadata": {}, "source": [ "### 2.1.3 Edit the YAML (two options)\n", "\n", "#### Option A — edit by hand (recommended at least once)\n", "1. Open the template file shown above (ends with `_externals_template.yml`).\n", "2. Search for a database you care about (e.g. `HGNC Symbol`).\n", "3. Change `Include: false` to `Include: true` for **every assembly block listed** (recommended).\n", " - This keeps the database available as a bridge regardless of which genome build your inputs were annotated against.\n", " - Only restrict to a subset of assemblies if you intentionally want a build-specific configuration.\n", "4. 
Save as `_externals_modified.yml` in the same folder (IDTrack will look for this first).\n", "\n", "#### Option B — programmatic toggles (great for reproducibility)\n", "The next cell applies the **shipped default configuration** used by IDTrack (`homo_sapiens_externals_modified.yml`).\n", "\n", "This default enables **63 form/database/assembly combinations** (the count printed by the cell below) across three forms:\n", "- **gene**: HGNC Symbol, EntrezGene, NCBI gene, UniProtKB Gene Name, Havana gene, Vega gene, RFAM, Clone-based identifiers, and their synonym variants\n", "- **transcript**: CCDS, Havana transcript, RefSeq mRNA/ncRNA (curated + predicted)\n", "- **translation**: Havana translation, RefSeq peptide (curated + predicted), UniProt (Swiss-Prot + TrEMBL)\n", "\n", "You can modify `DEFAULT_HS_EXTERNALS` in the cell below to customize your allowlist." ] }, { "cell_type": "code", "execution_count": 6, "id": "ac4fb89c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Enabled 63 form/database/assembly combinations (the shipped default).\n", "Wrote: /Users/kemalinecik/git_nosync/master_idtrack/idtrack/docs/_notebooks/idtrack_cache/homo_sapiens_externals_modified.yml\n" ] } ], "source": [ "# Default external databases for HUMAN (matches the shipped default config)\n", "# This dict is structured as: form -> database -> list of assemblies where Include=True\n", "# Assemblies that are missing from the template for a given database are skipped when the config is applied.\n", "\n", "DEFAULT_HS_EXTERNALS = {\n", "    'gene': {\n", "        'Clone_based_ensembl_gene': [36, 37, 38],\n", "        'Clone_based_vega_gene': [36, 37, 38],\n", "        'EntrezGene': [36, 37, 38],\n", "        'HGNC Symbol': [36, 37, 38],\n", "        'Havana gene': [36, 37, 38],\n", "        'NCBI gene': [36, 37, 38],\n", "        'NCBI gene (formerly Entrezgene)': [36, 37, 38],\n", "        'RFAM': [36, 37, 38],\n", "        'UniProtKB Gene Name': [36, 37, 38],\n", "        'Vega gene': [36, 37, 38],\n", "        'Vega_gene': [36, 37, 38],\n", " 
'synonym_id::EntrezGene': [36, 37, 38],\n", "        'synonym_id::HGNC Symbol': [36, 37, 38],\n", "        'synonym_id::NCBI gene': [36, 37, 38],\n", "        'synonym_id::NCBI gene (formerly Entrezgene)': [36, 37, 38],\n", "        'synonym_id::UniProtKB Gene Name': [36, 37, 38],\n", "    },\n", "    'transcript': {\n", "        'CCDS': [36, 37, 38],\n", "        'Havana transcript': [36, 37, 38],\n", "        'RefSeq_mRNA': [36, 37, 38],\n", "        'RefSeq_mRNA_predicted': [36, 37, 38],\n", "        'RefSeq_ncRNA': [36, 37, 38],\n", "        'RefSeq_ncRNA_predicted': [36, 37, 38],\n", "    },\n", "    'translation': {\n", "        'Havana translation': [36, 37, 38],\n", "        'RefSeq_peptide': [36, 37, 38],\n", "        'RefSeq_peptide_predicted': [36, 37, 38],\n", "        'Uniprot/SPTREMBL': [36, 37, 38],\n", "        'Uniprot/SWISSPROT': [36, 37, 38],\n", "    },\n", "}\n", "\n", "# Load the template YAML\n", "y = yaml.safe_load(template_hs.read_text(encoding='utf-8'))\n", "\n", "# Apply the default configuration: set Include=True for matching form/database/assembly\n", "enabled_count = 0\n", "for form, databases in DEFAULT_HS_EXTERNALS.items():\n", "    if form not in y[organism_hs]:\n", "        continue\n", "    for db_name, assemblies in databases.items():\n", "        if db_name not in y[organism_hs][form]:\n", "            print(f' [SKIP] {form}/{db_name} not found in template')\n", "            continue\n", "        for asm_code in assemblies:\n", "            asm_str = str(asm_code)\n", "            if asm_str in y[organism_hs][form][db_name]['Assembly']:\n", "                y[organism_hs][form][db_name]['Assembly'][asm_str]['Include'] = True\n", "                enabled_count += 1\n", "\n", "modified_hs.write_text(yaml.safe_dump(y, sort_keys=False, allow_unicode=True), encoding='utf-8')\n", "\n", "print(f'Enabled {enabled_count} form/database/assembly combinations (the shipped default).')\n", "print('Wrote:', modified_hs)" ] }, { "cell_type": "markdown", "id": "4618fbc2", "metadata": {}, "source": [ "### 2.1.4 Validate + preview your selections\n", "\n", "This load step is important: it confirms your YAML contains your snapshot release and that IDTrack can parse 
it.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "0db45f45", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Enabled DBs: ['EntrezGene', 'HGNC Symbol', 'UniProtKB Gene Name', 'synonym_id::NCBI gene (formerly Entrezgene)', 'NCBI gene (formerly Entrezgene)', 'synonym_id::HGNC Symbol', 'RFAM']\n", "Assemblies in play: [37, 38]\n", "Enabled DBs: ['Clone_based_vega_gene', 'HGNC Symbol', 'Havana gene', 'synonym_id::HGNC Symbol', 'Clone_based_ensembl_gene']\n", "Assemblies in play: [36]\n" ] } ], "source": [ "_ = dm_hs.external_inst.load_modified_yaml()\n", "print('Enabled DBs:', dm_hs.external_inst.give_list_for_case('db'))\n", "print('Assemblies in play:', dm_hs.external_inst.give_list_for_case('assembly'))\n", "\n", "# dm_hs was created with snapshot_release=115, but Ensembl data for the NCBI36 assembly (code 36) only goes up to\n", "# around release 54-55 (when it was superseded by GRCh37), so load release 50 to inspect that assembly.\n", "_dm_hs = dm_hs.change_release_auto_assembly(50)\n", "_ = _dm_hs.external_inst.load_modified_yaml()\n", "print('Enabled DBs:', _dm_hs.external_inst.give_list_for_case('db'))\n", "print('Assemblies in play:', _dm_hs.external_inst.give_list_for_case('assembly'))" ] }, { "cell_type": "markdown", "id": "66639317", "metadata": {}, "source": [ "## 2.2 — Mouse (Mus musculus)\n", "\n", "Mouse does not ship with a default external YAML in the package, so the usual workflow is:\n", "\n", "1. generate a template YAML (from Ensembl metadata)\n", "2. enable a curated allowlist of databases\n", "3. validate the YAML\n", "\n", "\n", "Mouse templates list multiple assemblies across history, but mouse is a clean-handoff species in Ensembl (one maintained assembly per release). 
You will usually see assembly codes like:\n", "- `39` = GRCm39\n", "- `38` = GRCm38\n", "- `37` = GRCm37\n", "\n", "> **Tip:** For clean-handoff species, enabling the same database across the listed assemblies is usually fine;\n", "> the per-assembly release ranges are disjoint, so you typically do not get overlapping assemblies within a single release.\n", "\n", "Mouse-specific, commonly useful databases:\n", "- **MGI Symbol** (mouse gene symbols)\n", "- **EntrezGene**\n", "- **UniProtKB**\n", "- **RefSeq_mRNA**\n", "\n", "> **Expected output:** You will create `mus_musculus_externals_modified.yml` in your local repository.\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "b35d7f2a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2026-01-11 00:12:39 INFO:verify_organism: Ensembl Rest API query to get the organism names and associated releases.\n" ] }, { "data": { "text/plain": [ "('mus_musculus', 115)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 5.1 Resolve organism name + latest release\n", "organism_mm, latest_release_mm = api.resolve_organism('mus musculus')\n", "SNAPSHOT_RELEASE_MM = latest_release_mm\n", "organism_mm, SNAPSHOT_RELEASE_MM\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "2bff46e2", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2026-01-11 00:12:40 INFO:database_manager: Using assembly-specific release range for mus_musculus assembly 39: releases 103-115 (from config [103, None])\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 5.2 Create DatabaseManager snapshot\n", "dm_mm = api.get_database_manager(organism_name=organism_mm, snapshot_release=SNAPSHOT_RELEASE_MM)\n", "dm_mm\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6098e5e2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "Click to show download logs\n", "
2026-01-11 00:12:43 INFO:database_manager: Processing assembly 39 for mus_musculus...\n",
       "2026-01-11 00:12:43 INFO:database_manager: Database content is being created for `mus_musculus`, assembly `39`, form `gene`, ensembl release `103`\n",
       "2026-01-11 00:12:44 INFO:database_manager: Raw table for `gene` on ensembl release `103` was downloaded for following columns: gene_id, stable_id, version.\n",
       "2026-01-11 00:12:44 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens103_mysql_gene_COL_gene_id_COL_stable_id_COL_version`\n",
       "2026-01-11 00:12:45 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens103_processed_idsraw_gene_gene`\n",
       "2026-01-11 00:13:01 INFO:database_manager: Raw table for `object_xref` on ensembl release `103` was downloaded for following columns: ensembl_id, ensembl_object_type, xref_id, object_xref_id.\n",
       "2026-01-11 00:13:01 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens103_mysql_object_xref_COL_ensembl_id_COL_ensembl_object_type_COL_object_xref_id_COL_xref_id`\n",
       "2026-01-11 00:13:51 INFO:database_manager: Raw table for `xref` on ensembl release `103` was downloaded for following columns: xref_id, external_db_id, dbprimary_acc, display_label.\n",
       "2026-01-11 00:13:51 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens103_mysql_xref_COL_dbprimary_acc_COL_display_label_COL_external_db_id_COL_xref_id`\n",
       "2026-01-11 00:14:10 INFO:database_manager: Raw table for `external_db` on ensembl release `103` was downloaded for following columns: external_db_id, db_name, db_display_name.\n",
       "2026-01-11 00:14:10 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens103_mysql_external_db_COL_db_display_name_COL_db_name_COL_external_db_id`\n",
       "2026-01-11 00:14:12 INFO:database_manager: Raw table for `identity_xref` on ensembl release `103` was downloaded for following columns: ensembl_identity, xref_identity, object_xref_id.\n",
       "2026-01-11 00:14:12 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens103_mysql_identity_xref_COL_ensembl_identity_COL_object_xref_id_COL_xref_identity`\n",
       "2026-01-11 00:14:14 INFO:database_manager: Raw table for `external_synonym` on ensembl release `103` was downloaded for following columns: xref_id, synonym.\n",
       "2026-01-11 00:14:14 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens103_mysql_external_synonym_COL_synonym_COL_xref_id`\n",
       "2026-01-11 00:14:17 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens103_processed_external_database_gene`\n",
       "2026-01-11 00:14:17 INFO:database_manager: Database content is being created for `mus_musculus`, assembly `39`, form `gene`, ensembl release `104`\n",
       "2026-01-11 00:14:18 INFO:database_manager: Raw table for `gene` on ensembl release `104` was downloaded for following columns: gene_id, stable_id, version.\n",
       "2026-01-11 00:14:18 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens104_mysql_gene_COL_gene_id_COL_stable_id_COL_version`\n",
       "2026-01-11 00:14:19 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens104_processed_idsraw_gene_gene`\n",
       "2026-01-11 00:14:48 INFO:database_manager: Raw table for `object_xref` on ensembl release `104` was downloaded for following columns: ensembl_id, ensembl_object_type, xref_id, object_xref_id.\n",
       "2026-01-11 00:14:48 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens104_mysql_object_xref_COL_ensembl_id_COL_ensembl_object_type_COL_object_xref_id_COL_xref_id`\n",
       "2026-01-11 00:16:10 INFO:database_manager: Raw table for `xref` on ensembl release `104` was downloaded for following columns: xref_id, external_db_id, dbprimary_acc, display_label.\n",
       "2026-01-11 00:16:10 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens104_mysql_xref_COL_dbprimary_acc_COL_display_label_COL_external_db_id_COL_xref_id`\n",
       "2026-01-11 00:16:34 INFO:database_manager: Raw table for `external_db` on ensembl release `104` was downloaded for following columns: external_db_id, db_name, db_display_name.\n",
       "2026-01-11 00:16:35 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens104_mysql_external_db_COL_db_display_name_COL_db_name_COL_external_db_id`\n",
       "2026-01-11 00:16:36 INFO:database_manager: Raw table for `identity_xref` on ensembl release `104` was downloaded for following columns: ensembl_identity, xref_identity, object_xref_id.\n",
       "2026-01-11 00:16:36 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens104_mysql_identity_xref_COL_ensembl_identity_COL_object_xref_id_COL_xref_identity`\n",
       "2026-01-11 00:16:38 INFO:database_manager: Raw table for `external_synonym` on ensembl release `104` was downloaded for following columns: xref_id, synonym.\n",
       "2026-01-11 00:16:38 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens104_mysql_external_synonym_COL_synonym_COL_xref_id`\n",
       "2026-01-11 00:16:42 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens104_processed_external_database_gene`\n",
       "2026-01-11 00:16:42 INFO:database_manager: Database content is being created for `mus_musculus`, assembly `39`, form `gene`, ensembl release `105`\n",
       "2026-01-11 00:16:43 INFO:database_manager: Raw table for `gene` on ensembl release `105` was downloaded for following columns: gene_id, stable_id, version.\n",
       "2026-01-11 00:16:43 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens105_mysql_gene_COL_gene_id_COL_stable_id_COL_version`\n",
       "2026-01-11 00:16:44 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens105_processed_idsraw_gene_gene`\n",
       "2026-01-11 00:17:08 INFO:database_manager: Raw table for `object_xref` on ensembl release `105` was downloaded for following columns: ensembl_id, ensembl_object_type, xref_id, object_xref_id.\n",
       "2026-01-11 00:17:08 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens105_mysql_object_xref_COL_ensembl_id_COL_ensembl_object_type_COL_object_xref_id_COL_xref_id`\n",
       "2026-01-11 00:18:16 INFO:database_manager: Raw table for `xref` on ensembl release `105` was downloaded for following columns: xref_id, external_db_id, dbprimary_acc, display_label.\n",
       "2026-01-11 00:18:16 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens105_mysql_xref_COL_dbprimary_acc_COL_display_label_COL_external_db_id_COL_xref_id`\n",
       "2026-01-11 00:18:43 INFO:database_manager: Raw table for `external_db` on ensembl release `105` was downloaded for following columns: external_db_id, db_name, db_display_name.\n",
       "2026-01-11 00:18:43 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens105_mysql_external_db_COL_db_display_name_COL_db_name_COL_external_db_id`\n",
       "2026-01-11 00:18:46 INFO:database_manager: Raw table for `identity_xref` on ensembl release `105` was downloaded for following columns: ensembl_identity, xref_identity, object_xref_id.\n",
       "2026-01-11 00:18:46 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens105_mysql_identity_xref_COL_ensembl_identity_COL_object_xref_id_COL_xref_identity`\n",
       "2026-01-11 00:18:49 INFO:database_manager: Raw table for `external_synonym` on ensembl release `105` was downloaded for following columns: xref_id, synonym.\n",
       "2026-01-11 00:18:49 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens105_mysql_external_synonym_COL_synonym_COL_xref_id`\n",
       "2026-01-11 00:18:53 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens105_processed_external_database_gene`\n",
       "2026-01-11 00:18:53 INFO:database_manager: Database content is being created for `mus_musculus`, assembly `39`, form `gene`, ensembl release `106`\n",
       "2026-01-11 00:18:56 INFO:database_manager: Raw table for `gene` on ensembl release `106` was downloaded for following columns: gene_id, stable_id, version.\n",
       "2026-01-11 00:18:56 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens106_mysql_gene_COL_gene_id_COL_stable_id_COL_version`\n",
       "2026-01-11 00:18:57 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens106_processed_idsraw_gene_gene`\n",
       "2026-01-11 00:19:23 INFO:database_manager: Raw table for `object_xref` on ensembl release `106` was downloaded for following columns: ensembl_id, ensembl_object_type, xref_id, object_xref_id.\n",
       "2026-01-11 00:19:23 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens106_mysql_object_xref_COL_ensembl_id_COL_ensembl_object_type_COL_object_xref_id_COL_xref_id`\n",
       "2026-01-11 00:20:28 INFO:database_manager: Raw table for `xref` on ensembl release `106` was downloaded for following columns: xref_id, external_db_id, dbprimary_acc, display_label.\n",
       "2026-01-11 00:20:28 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens106_mysql_xref_COL_dbprimary_acc_COL_display_label_COL_external_db_id_COL_xref_id`\n",
       "2026-01-11 00:20:54 INFO:database_manager: Raw table for `external_db` on ensembl release `106` was downloaded for following columns: external_db_id, db_name, db_display_name.\n",
       "2026-01-11 00:20:54 INFO:database_manager: Exporting to the following file `mus_musculus_assembly-39.h5` with key `ens106_mysql_external_db_COL_db_display_name_COL_db_name_COL_external_db_id`\n",
       "2026-01-11 00:20:59 INFO:database_manager: Raw table for `identity_xref` on ensembl release `106` was downloaded for following columns: ensembl_identity, xref_identity, object_xref_id.
\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%collapse Click to show download logs\n", "# Included for tutorial purposes only.\n", "\n", "# 5.3 Generate template YAML\n", "df_mm = dm_mm.create_database_content(just_download=False)\n", "dm_mm.external_inst.create_template_yaml(df_mm)\n", "\n", "template_mm = Path(dm_mm.external_inst.file_name_template_yaml())\n", "modified_mm = Path(dm_mm.external_inst.file_name_modified_yaml(mode='configured'))\n", "{\n", " 'template_yaml': str(template_mm),\n", " 'modified_yaml': str(modified_mm),\n", "}" ] }, { "cell_type": "markdown", "id": "cdd9713d", "metadata": {}, "source": [ "### 2.2.1 Programmatic allowlist (mouse)\n", "\n", "Mouse gene symbols come from **MGI** (Mouse Genome Informatics), so a typical set includes `MGI Symbol`.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9dfbcdb9", "metadata": {}, "outputs": [], "source": [ "allowlist_mm = [\n", " 'MGI Symbol',\n", " 'EntrezGene',\n", " 'UniProtKB',\n", " 'RefSeq_mRNA',\n", "]\n", "\n", "y = yaml.safe_load(template_mm.read_text(encoding='utf-8'))\n", "the_form = list(y[organism_mm].keys())[0]\n", "\n", "enabled = []\n", "for db_name in allowlist_mm:\n", " if db_name not in y[organism_mm][the_form]:\n", " continue\n", " for _asm, attrs in y[organism_mm][the_form][db_name]['Assembly'].items():\n", " attrs['Include'] = True\n", " enabled.append(db_name)\n", "\n", "modified_mm.write_text(yaml.safe_dump(y, sort_keys=False, allow_unicode=True), encoding='utf-8')\n", "print('Enabled (found in template):', enabled)\n", "print('Wrote:', modified_mm)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "87275860", "metadata": {}, "outputs": [], "source": [ "_ = dm_mm.external_inst.load_modified_yaml()\n", "print('Enabled DBs:', dm_mm.external_inst.give_list_for_case('db'))\n", "print('Assemblies in play:', dm_mm.external_inst.give_list_for_case('assembly'))\n" ] }, { "cell_type": "markdown", "id": 
"69056a20", "metadata": {}, "source": [ "## 2.3 — Pig (Sus scrofa)\n", "\n", "Pig templates list multiple assemblies across history, but Ensembl is a clean-handoff species (one maintained assembly per release). Common Ensembl assembly codes you will see include:\n", "- `111` = Sscrofa11.1\n", "- `102` = Sscrofa10.2\n", "- `9` = Sscrofa9.2\n", "\n", "> **Tip:** For clean-handoff species, enabling the same database across the listed assemblies is usually fine;\n", "> older assemblies mainly matter for legacy datasets and archive releases.\n", "\n", "\n", "\n", "Pig also requires generating a template YAML and then enabling a curated allowlist.\n", "\n", "Pig datasets often mix different naming conventions, so enabling a small set of high-signal externals is especially important.\n", "\n", "Commonly useful databases for pig:\n", "- **EntrezGene**\n", "- **UniProtKB**\n", "- **RefSeq_mRNA**\n", "\n", "> **Expected output:** You will create `sus_scrofa_externals_modified.yml` in your local repository.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6e93e334", "metadata": {}, "outputs": [], "source": [ "# 6.1 Resolve organism name + latest release\n", "organism_ss, latest_release_ss = api.resolve_organism('sus scrofa')\n", "SNAPSHOT_RELEASE_SS = latest_release_ss\n", "organism_ss, SNAPSHOT_RELEASE_SS\n" ] }, { "cell_type": "code", "execution_count": null, "id": "bc0401c3", "metadata": {}, "outputs": [], "source": [ "# 6.2 Create DatabaseManager snapshot\n", "dm_ss = api.get_database_manager(organism_name=organism_ss, snapshot_release=SNAPSHOT_RELEASE_SS)\n", "dm_ss\n" ] }, { "cell_type": "code", "execution_count": null, "id": "cc1ec799", "metadata": {}, "outputs": [], "source": [ "%%collapse Click to show download logs\n", "# 6.3 Generate template YAML\n", "df_ss = dm_ss.create_database_content(just_download=False)\n", "dm_ss.external_inst.create_template_yaml(df_ss)\n", "\n", "template_ss = Path(dm_ss.external_inst.file_name_template_yaml())\n", 
"modified_ss = Path(dm_ss.external_inst.file_name_modified_yaml(mode='configured'))\n", "{\n", " 'template_yaml': str(template_ss),\n", " 'modified_yaml': str(modified_ss),\n", "}" ] }, { "cell_type": "markdown", "id": "07a93389", "metadata": {}, "source": [ "### 2.3.1 Programmatic allowlist (pig)\n", "\n", "Pig symbol databases can vary across releases. A safe starter set often includes `EntrezGene` and `UniProtKB`.\n", "Use the template to inspect what is available for your snapshot release.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8fd76a37", "metadata": {}, "outputs": [], "source": [ "allowlist_ss = [\n", " 'EntrezGene',\n", " 'UniProtKB',\n", " 'RefSeq_mRNA',\n", "]\n", "\n", "y = yaml.safe_load(template_ss.read_text(encoding='utf-8'))\n", "the_form = list(y[organism_ss].keys())[0]\n", "\n", "enabled = []\n", "for db_name in allowlist_ss:\n", " if db_name not in y[organism_ss][the_form]:\n", " continue\n", " for _asm, attrs in y[organism_ss][the_form][db_name]['Assembly'].items():\n", " attrs['Include'] = True\n", " enabled.append(db_name)\n", "\n", "modified_ss.write_text(yaml.safe_dump(y, sort_keys=False, allow_unicode=True), encoding='utf-8')\n", "print('Enabled (found in template):', enabled)\n", "print('Wrote:', modified_ss)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0ac3408c", "metadata": {}, "outputs": [], "source": [ "_ = dm_ss.external_inst.load_modified_yaml()\n", "print('Enabled DBs:', dm_ss.external_inst.give_list_for_case('db'))\n", "print('Assemblies in play:', dm_ss.external_inst.give_list_for_case('assembly'))\n" ] }, { "cell_type": "markdown", "id": "f5edec7c", "metadata": {}, "source": [ "## 2.4 — Adding a New Organism (Advanced)\n", "\n", "IDTrack currently ships with built-in support for a small set of organisms (human/mouse/pig). If you want to use a different\n", "Ensembl-supported species, there are **two layers** to set up:\n", "\n", "1. 
**Core configuration (required):** IDTrack must know the canonical Ensembl species name and which numeric **assembly codes**\n", " appear in schema names like `_core__` (configure `idtrack/_db.py`).\n", " - Direct MySQL connectivity is optional: when ports are blocked, IDTrack downloads the same tables from the HTTPS/FTP MySQL dumps.\n", "2. **External YAML (required for externals):** once the organism/assemblies are configured, you generate a template YAML and curate an\n", " allowlist exactly like mouse/pig above.\n", "\n", "Practical recipe:\n", "\n", "- Step A: Resolve the canonical Ensembl species name (snake_case) with `api.resolve_organism(...)`.\n", "- Step B: Configure the organism and its relevant assemblies in `idtrack/_db.py` by extending `DB.assembly_mysqlport_priority`.\n", " - Listing multiple assemblies is what enables cross-assembly mapping and assembly-scoped external databases.\n", "- Step C: Re-run the YAML-template generation workflow (`DatabaseManager.create_database_content(...)` → `create_template_yaml(...)`).\n", "- Step D: Build a graph snapshot (Part 3) and run the sanity checks.\n", "\n", "> **Warning:** Until Step B is done, `DatabaseManager` will raise `NotImplementedError` for that organism.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9b3a982c", "metadata": {}, "outputs": [], "source": [ "# Minimal helper cell: check if an organism is supported by *this* IDTrack version.\n", "# (Safe to run; it will not modify your installation.)\n", "\n", "from idtrack._db import DB\n", "\n", "print('Supported organisms in this IDTrack version:')\n", "print(' ' + ', '.join(DB.supported_organisms))\n", "\n", "# Example: resolve an organism name via Ensembl REST (works even if the organism isn't configured locally)\n", "organism_query = 'danio rerio' # zebrafish (example)\n", "formal_name, latest_release = api.resolve_organism(organism_query)\n", "print('Resolved via Ensembl REST ->', organism_query, '→', formal_name, '(latest 
release:', latest_release, ')')\n", "\n", "if formal_name not in DB.supported_organisms:\n", " print()\n", " print('This organism is not yet configured in this IDTrack version.')\n", " print('To add it: extend DB.assembly_mysqlport_priority in idtrack/_db.py with its assembly code + port list,')\n", " print('then regenerate the external YAML using the same workflow as mouse/pig above.')\n", "else:\n", " print()\n", " print('This organism is already configured. You can now generate a template YAML and continue.')\n" ] }, { "cell_type": "markdown", "id": "70a1fc84", "metadata": {}, "source": [ "## 2.5 — Final checklist (before you build graphs)\n", "\n", "You should now have these three files in your local repository:\n", "- `homo_sapiens_externals_modified.yml`\n", "- `mus_musculus_externals_modified.yml`\n", "- `sus_scrofa_externals_modified.yml`\n", "\n", "The next notebook (`03_initialization_graph.ipynb`) will build graphs using these configs.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "3572adf9", "metadata": {}, "outputs": [], "source": [ "# Quick existence check\n", "expected = [\n", " LOCAL_REPOSITORY / 'homo_sapiens_externals_modified.yml',\n", " LOCAL_REPOSITORY / 'mus_musculus_externals_modified.yml',\n", " LOCAL_REPOSITORY / 'sus_scrofa_externals_modified.yml',\n", "]\n", "for p in expected:\n", " print('OK' if p.exists() else 'MISSING', '-', p)\n" ] }, { "cell_type": "markdown", "id": "ec4018c3", "metadata": {}, "source": [ "## 2.6 — Best practices (once you get comfortable)\n", "\n", "1. **Keep allowlists small**: enabling many overlapping databases often increases ambiguity.\n", "2. **Multiple profiles**: you can maintain different YAMLs per project (e.g. ‘strict’ vs ‘broad’).\n", "3. **Assembly awareness**:\n", " - In Ensembl schema names, the last number encodes the genome assembly (e.g. 
human 38 = GRCh38; mouse 39 = GRCm39; pig 111 = Sscrofa11.1).\n", " - Assembly codes are species-specific, and external databases can be assembly-scoped.\n", " - Keeping all assembly blocks in your YAML is usually beneficial when you integrate mixed-build datasets.\n", "4. **Re-running**: you can regenerate templates after a new Ensembl release and re-apply your allowlist.\n" ] } ], "metadata": { "kernelspec": { "display_name": "idtrack_dev_env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.12" } }, "nbformat": 4, "nbformat_minor": 5 }