About CABase

Interactive scientific database for carbonic anhydrase pathway research. Parkkila Group · Tampere University

Overview

CABase is a panel-based interactive platform for exploring the human carbonic anhydrase (CA) gene family across genomic, transcriptomic, epigenomic, proteomic, and clinical dimensions. It integrates data from GTEx, Human Protein Atlas, TCGA, CELLxGENE, Tahoe-100M, ENCODE, ChIP-Atlas, LINCS L1000, and more into a unified sidebar + tabbed interface with interactive Plotly.js charts, an IGV.js genome browser, D3 force-directed co-expression networks, and an AI-powered research assistant grounded in 11K+ carbonic anhydrase publications.

The platform covers 15 human carbonic anhydrase genes organized by subcellular localization: cytosolic (CA1, CA2, CA3, CA7, CA13), mitochondrial (CA5A, CA5B), membrane-associated (CA4 GPI-anchored, CA9/CA12/CA14 transmembrane), secreted (CA6), and CA-related proteins lacking catalytic activity (CA8, CA10, CA11). Carbonic anhydrases catalyze the reversible hydration of CO2 to bicarbonate and are critical for pH regulation, electrolyte secretion, and biosynthetic reactions. Key disease associations include tumor hypoxia (CA9, CA12), renal tubular acidosis (CA2), and glaucoma (CA2, CA4). Carbonic anhydrases are also important drug targets for sulfonamide inhibitors.
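The localization grouping above can be written down as a small lookup table. This is only a sketch: the gene symbols and group names come from the paragraph above, while the variable and function names are hypothetical and are not taken from the CABase code base.

```python
# Grouping copied from the paragraph above; names here are illustrative only.
CA_GENES_BY_LOCALIZATION = {
    "cytosolic": ["CA1", "CA2", "CA3", "CA7", "CA13"],
    "mitochondrial": ["CA5A", "CA5B"],
    "membrane-associated": ["CA4", "CA9", "CA12", "CA14"],
    "secreted": ["CA6"],
    "CA-related (no catalytic activity)": ["CA8", "CA10", "CA11"],
}

# Flat list of every gene symbol, in group order.
ALL_CA_GENES = [
    gene for genes in CA_GENES_BY_LOCALIZATION.values() for gene in genes
]


def localization_of(gene: str) -> str:
    """Return the localization group for a gene symbol, or raise KeyError."""
    for group, genes in CA_GENES_BY_LOCALIZATION.items():
        if gene in genes:
            return group
    raise KeyError(gene)
```

For example, `localization_of("CA9")` returns `"membrane-associated"`, and `len(ALL_CA_GENES)` matches the 15 genes stated above.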

Data Sections

Gene Summary tab: Gene Summary

Gene Identity Card (full name, genomic coordinates, aliases, NCBI summary, cross-reference IDs: Entrez, HGNC, Ensembl, UniProt, RefSeq, Pfam, PDB) plus AI-generated summaries across 5 dimensions: Genomic, Expression, Pathway, Functional, Clinical.

Good for: Quick overview of any gene's role, function, disease associations, expression patterns, and database identifiers.

Genomic Context tab: Genomic

IGV.js genome browser (hg38) with an expandable track catalog.

Good for: Gene structure, exon/intron layout, clinical variants, regulatory landscape, TF binding, epigenomics, conservation.

RNA-Seq tab: RNA-Seq

Plotly.js charts across 6 data sources (GTEx, HPA, FANTOM5, DepMap, TCGA, PRECOG), with a Normal Tissue / Cancer & Disease toggle.

Good for: Tissue expression patterns, cancer expression, cell line data, survival analysis, immunotherapy survival, cross-database comparison, multi-gene analysis.

Proteomics tab: Proteomics

Protein-level data across 5 sources with an interactive 3D structure viewer.

Good for: Protein expression across tissues and cancers, post-translational modifications, kinase-substrate relationships, 3D structure visualization with PTM overlays.

scRNA-Seq tab: scRNA-Seq

Good for: Cell-type-specific expression, which cell types express a gene, tumor microenvironment expression.

Correlation Analysis tab: Correlation

Good for: Co-expression partners, tissue-specific networks, pathway enrichment, functional associations.

Perturbations tab: Perturbations

Two complementary perturbation datasets accessible via sub-tab toggle, identifying drugs and genetic manipulations that significantly alter expression of carbonic anhydrase genes.

LINCS L1000 — Compound Perturbations

LINCS L1000 — Genetic Perturbations (CRISPR, shRNA, Overexpression)

14 of 15 CA genes are measured in the L1000 panel. Not measured: CA13.

Tahoe-100M — Single-Cell Perturbations

Good for: Drug discovery, target validation, single-cell perturbation responses, identifying compounds that modulate CA gene expression, cross-referencing bulk (LINCS) and single-cell (Tahoe) evidence.

Cross References tab: Cross Refs

Dynamic links to 30+ external databases (GeneCards, UniProt, NCBI, Ensembl, KEGG, Reactome, ClinVar, OMIM, etc.).

Good for: Finding a gene in other databases, jumping to external resources.

AI Research Assistant

RAG-powered chatbot grounded in 11,712 indexed carbonic anhydrase publications (15,883 text chunks). Accessible via the floating widget from any tab. Supports multi-turn conversation with history, markdown rendering, and adjustable temperature.

Architecture

Data Integration

The AI assistant can optionally include structured experimental data from the site, chosen from selectable data sections, alongside literature context.

Changelog

May 11, 2026 — Tampere Rebrand & Engine Sync from WntHub
  • Tampere University / Parkkila Group Branding: CABase is a Parkkila-group project (Faculty of Medicine and Health Technology, Department of Anatomy, Tampere University) — the Oulu branding was carried over by accident from the shared engine template. Replaced University of Oulu logo (and link) with the official Tampere University logo across the sidebar, about-page subtitle, and about-page footer; updated credit lines to Parkkila Group · Tampere University; updated contact email to tuni.fi.
  • Engine Cherry-Picks from WntHub: Pulled in several robustness improvements developed during WntHub’s gene-set expansion: hash-based cache invalidation in build_network_json.py (per-(gene, tissue) network JSONs self-invalidate when the gene set changes via embedded _geneSetHash); LINCS pipelines gained per-gene skip-if-exists with --force override; gene-list dedup across build_*.py and gtex/06_build_site_correlations.py (single source of truth in config.py); HPA pipeline fixed for renamed FANTOM5 column and retired rna_cancer.tsv.zip endpoint.
  • Site-Neutral Engine Identifiers: CABase’s engine code now uses GENES / GENE_SET / GENE_SET_HASH / _geneSetHash / pathway_gene / gene_adj instead of CA-prefixed names. CA_GENES is kept as a one-line alias in config.py for site-specific scripts. Engine commits from WntHub now cherry-pick cleanly instead of conflicting on parallel renames.
  • Site-Config-Driven Identity: config.py reads site-config.json at import and re-exports SITE_NAME, GENE_SET_LABEL, GENE_SET_FULL, DATA_SUFFIX. Engine code never needs hardcoded “CABase” / “CA” / “CAs” strings.
  • Numeric-Aware Gene Picker: Sidebar dropdown now sorts with Intl.Collator and the numeric: true option, so CA10 follows CA9 instead of CA1 (lexicographic sort previously put CA10..CA14 before CA2). The same fix benefits any sister-site gene family with embedded numbers.
  • Frontend: CELLxGENE & Symlink Fix: Front-end now loads CELLxGENE_gene_expression.latest.CAs.tsv.gz rather than chasing a dated filename. The .latest file is a real copy (not a symlink) because Netlify’s CDN doesn’t follow symlinks — lesson learned during the WntHub deploy.
  • Stub-ified gtex/04_network_json.py: The unused per-tissue duplicate of the active scripts/build_network_json.py is now a tiny deprecation stub that prints a redirect and exits with code 2. Closes the dual-maintenance hazard that had already let the duplicate drift from its sibling.
  • About Page RAG Corpus Counts Corrected: The RAG section had been carrying WntHub’s 23,323 publications / 31,773 chunks — copied at fork time and never updated. CABase’s actual indexed corpus: 11,712 indexed publications, 15,883 text chunks. Tagline updated to “grounded in 11K+” (was “grounded in CA publications” without a count).
  • Automated Site Stats: New engine scripts (scripts/build_site_stats.py + scripts/render_site_text.py) compute every data-derived number from data/ + config.py + manual overrides and substitute them into HTML/JS via <span data-stat="key"> markers. Wired into master_rebuild_all.sh as Step 17, before the _site/ rsync. About-page stat drift caught en route: GTEx 55 → 53 subtissues; ChIP-Atlas 1,086 → 931 TFs; TCGA 1,451 → 426 curves; PRECOG 2,145/2,129/490 → 710/685/172 records; iPTMnet 1,118 → 103 sites; networks 609 → 553; LINCS V1 22,927 → 11,456; LINCS V2 13,177 → 6,581; Tahoe 145,026 → 72,505 records.
  • Sidebar Logo Aspect Fix: Tampere logo’s “2-line” layout has a ~1.74:1 aspect ratio (vs Oulu’s ~2.47:1). The legacy width="200" height="60" HTML attributes were forcing browsers to stretch the new logo ~1.9× horizontally. Added width: auto + object-fit: contain to the CSS so the rendered width derives from the image’s actual intrinsic ratio.
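The numeric-aware gene ordering mentioned in this entry can be illustrated outside the browser. The site itself does this in JavaScript with Intl.Collator; the sketch below is a Python analogue of the same idea, and the helper name `natural_key` is hypothetical. Its results may differ from Intl.Collator in edge cases not covered by CA gene symbols.

```python
import re


def natural_key(symbol: str):
    """Split a symbol into text and digit runs so digit runs compare as integers."""
    parts = re.split(r"(\d+)", symbol)
    return [int(part) if part.isdigit() else part.lower() for part in parts]


genes = [
    "CA1", "CA10", "CA11", "CA12", "CA13", "CA14", "CA2", "CA3",
    "CA4", "CA5A", "CA5B", "CA6", "CA7", "CA8", "CA9",
]

# Plain string sort: CA10..CA14 land between CA1 and CA2.
lexicographic = sorted(genes)

# Numeric-aware sort: CA9 comes before CA10.
numeric_aware = sorted(genes, key=natural_key)
```

With the plain sort, `lexicographic[1]` is `"CA10"`; with the numeric-aware key, `numeric_aware[1]` is `"CA2"` and the list ends with `"CA14"`.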
April 16, 2026 — Engine Port & Feature Sync from WntHub
  • SITE_CONFIG Parameterization: Engine code now reads all site-specific values (brand name, gene-set labels, data-file suffix, function URLs, RAG metadata) from js/site-config.js (frontend) and site-config.json + site-knowledge.txt (backend). Engine JS files are now byte-identical with WntHub — future fixes cherry-pick between repos.
  • RNA-Seq Normal/Cancer Toggle: New sub-tab toggle on the RNA-Seq tab. Normal Tissue view (GTEx, HPA, FANTOM5, Cross-Database) and Cancer & Disease view (DepMap, TCGA+Survival, PRECOG). Cancer charts lazy-render on first switch. Multi-Gene Comparison always visible.
  • Perturbation Enrichment Pills: Enrichment chart now switchable between MOA, Drug Target, Cell Lineage, and Primary Disease (both LINCS and Tahoe sub-tabs).
  • Tahoe Volcano Plot: New panel on Tahoe sub-tab: effect score vs −log10(BH p-value) with significance + effect-size thresholds.
  • Waterfall Paired Bars: Drugs with experiments in both directions now show a dominant solid bar + a hatched secondary bar in the same row, placed in the dominant-direction section.
  • Radar Hover Fix: Top Tissues radar plots now show tissue name + expression value on hover (was showing "trace 0").
  • Netlify Functions Renamed: ca-query-* → query-* (shared filenames across sites). SITE_CONFIG.functionPrefix drives URL routing.
  • About Page Counts Corrected: ChIP-Atlas 1,086 → 1,846; GTEx 53 → 55 subtissues; CELLxGENE 38/187 → 64/867; Tahoe 116M → 2.3M DMSO; networks 2,088 → 609; L1000 coverage + Tahoe record counts updated to CA-specific values.
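As a rough illustration of the Tahoe volcano plot described in this entry, each record can be mapped to a point at (effect score, −log10 of the BH-adjusted p-value) and labeled using a significance cutoff plus an effect-size cutoff. The cutoff values, field names, and function name below are assumptions made for the example; the changelog does not state the thresholds the site actually uses.

```python
import math


def volcano_point(effect: float, bh_p: float,
                  p_cut: float = 0.05, effect_cut: float = 1.0) -> dict:
    """Map one record to volcano-plot coordinates and a category label.

    p_cut and effect_cut are placeholder thresholds, not CABase's real values.
    """
    # Clamp to a tiny floor so a reported p-value of 0 does not break log10.
    y = -math.log10(max(bh_p, 1e-300))
    significant = bh_p < p_cut and abs(effect) >= effect_cut
    if not significant:
        label = "not significant"
    elif effect > 0:
        label = "up"
    else:
        label = "down"
    return {"x": effect, "y": y, "label": label}
```

A point passes only when both thresholds are met, so a large effect with a weak p-value, or a strong p-value with a negligible effect, stays in the "not significant" group.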
March 24, 2026 — Tahoe-100M Perturbations & iPTMnet Improvements
  • Tahoe-100M Perturbations: New sub-tab on the Perturbations page integrating single-cell RNA-seq drug response data from the Tahoe-100M dataset (~77M cells, 379 drugs, 50 cell lines, 14 plates). Pseudobulk replicate approach: cells split into 25- or 50-cell replicates, Wilcoxon rank-sum test vs matched DMSO replicates, BH FDR correction. 217,915 records across 32 genes (84.3% BH-significant). Waterfall chart, MOA enrichment, and filterable table with p-values, replicate counts, and confidence tiers
  • Info Tooltips: Added contextual info icons across all 10 site-wide data tables explaining statistical metrics, column meanings, and data sources. Floating tooltip design escapes all CSS stacking contexts
  • Tab Persistence Fix: Fixed Plotly charts going blank when navigating away from a tab and returning. Charts now re-render from cached data on tab switch
  • iPTMnet Plot: Y-axis now uses iPTMnet score (instead of known enzymes), circle size represents number of enzymes. Hover tooltip includes enzyme list and publication count
  • iPTMnet Table: Rebuilt with TableViewer for sorting, filtering, and export. Added Score and Position columns, enzyme names link to UniProt, dropdown filters on Type/Score/Known/Evidence
  • Mol* Viewer Fix: Fixed issue where switching from PTM overlays to structure property themes would fail. Viewer now reloads cleanly when transitioning between overlay categories
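The Tahoe-100M entry above mentions Benjamini-Hochberg (BH) FDR correction applied after the Wilcoxon rank-sum tests. A minimal sketch of the BH step alone, in plain Python, is shown here; it is a textbook version of the procedure and is not taken from the CABase pipeline, and the function name is hypothetical.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity and a cap of 1.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each raw p-value is scaled by n / rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values increase.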
March 22, 2026 — Perturbations Tab & Proteomics Enhancements
  • Perturbations Tab: New tab integrating LINCS L1000 compound perturbation data (720K experiments, 33K compounds, 230 cell lines). Waterfall chart of top activators/repressors, MOA enrichment analysis, and full filterable/exportable table with CLUE.io compound links. 39 of 46 genes measured
  • Genetic Perturbations: LINCS L1000 CRISPR (142K), shRNA (238K), and overexpression (34K) data in two views: (1) genetic perturbations affecting each CA gene (45,868 records, 39 genes), (2) downstream effects split into Knockout (CRISPR + shRNA) and Overexpression sections (59,533 records, 37 genes). Waterfall charts + filterable tables for all sections
  • PRECOG Survival Analysis: PRECOG v2 survival z-scores added to RNA-Seq tab. Three databases: Adult (46/46 genes, 51 cancers, ~28K patients), Pediatric (44/46 genes, 12 cancers, ~3K patients), ICI immunotherapy (46/46 genes, 20 cancers, ~4K patients). Waterfall charts + filterable tables. ICI table enriched with ICI target, tumor stage, treatment status, cohort size, outcome type, and study source. Pipeline: scripts/pipelines/precog/01_extract_survival_zscores.py
  • LINCS Pipelines: Compound pipeline (scripts/pipelines/lincs/01_extract_perturbations.py) extracts per-gene perturbation data at |modz| ≥ 3.0 from Level 5 GCTX files (401K perturbations, 12,735 compounds, 702 MOA classes). Genetic pipeline (scripts/pipelines/lincs/02_extract_genetic_perturbations.py) extracts CRISPR/shRNA/overexpression data with the same threshold
  • PTM Table: PMID counts replaced with clickable PubMed links (collapsible when >3). Table card scrollable with sticky header
  • Structure Viewer: Overlay selection persists across gene changes. New built-in Mol* overlays: secondary structure, hydrophobicity, residue type, sequence position, B-factor/pLDDT. Fullscreen now fills the viewport
  • Dev Workflow: Added [dev] publish = "." to netlify.toml — local dev serves from project root, no rsync needed. Production deploys via netlify deploy --prod (build is automatic)
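The |modz| ≥ 3.0 cutoff described for the LINCS pipelines in this entry amounts to a simple filter followed by a split into activators and repressors. The sketch below assumes each record is a dict with a `modz` field; that record layout, the function names, and the placeholder compound names are hypothetical and do not reflect the real pipeline scripts.

```python
MODZ_THRESHOLD = 3.0  # cutoff quoted in the changelog entry


def filter_perturbations(records, threshold=MODZ_THRESHOLD):
    """Keep records whose absolute modz score meets the threshold."""
    return [r for r in records if abs(r["modz"]) >= threshold]


def split_by_direction(hits):
    """Split hits into activators (modz > 0) and repressors (modz < 0), strongest first."""
    activators = sorted((r for r in hits if r["modz"] > 0), key=lambda r: -r["modz"])
    repressors = sorted((r for r in hits if r["modz"] < 0), key=lambda r: r["modz"])
    return activators, repressors


# Made-up records for illustration only.
records = [
    {"gene": "CA9", "compound": "compound_A", "modz": 4.2},
    {"gene": "CA9", "compound": "compound_B", "modz": -3.0},
    {"gene": "CA9", "compound": "compound_C", "modz": 1.5},
    {"gene": "CA9", "compound": "compound_D", "modz": -5.1},
]
hits = filter_perturbations(records)
activators, repressors = split_by_direction(hits)
```

Here `compound_C` is dropped, `compound_B` is kept because the cutoff is inclusive, and the strongest repressor (`compound_D`) sorts first in its list.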
March 21, 2026 — Proteomics Tab & AI Query Classifier
  • Proteomics Tab: New dedicated tab with protein-level data from 5 sources — HPA IHC (normal tissue + cancer), CPTAC mass-spec proteomics (11 cancers), ProteomicsDB (67 tissues), iPTMnet PTM sites (1,118 sites with kinase-substrate relationships)
  • 3D Structure Viewer: PDBe Mol* integration — interactive PDB and AlphaFold structures with PTM overlays (phosphorylation, ubiquitination). Structure descriptions fetched from RCSB PDB and AlphaFold APIs. 26 genes with PDB structures, all genes with AlphaFold
  • AI Query Classifier: Qwen3-14B pre-classifies queries as site/hybrid/science. Site questions ("which species?", "where is variant data?") skip RAG entirely and answer in ~1-2s. Science questions proceed through full RAG pipeline with zero overhead
  • Data Pipelines: New pipelines for HPA protein IHC, CPTAC, ProteomicsDB (OData API), and iPTMnet (REST API). All follow existing retrieve-filter-compress pattern
  • Gene Identity Card: New card at top of Gene Summary tab with full name, genomic coordinates, aliases, NCBI summary, and cross-reference IDs (Entrez, HGNC, Ensembl, UniProt, RefSeq, Pfam, PDB) with clickable links
  • Tab Rename & Reorder: Expression → RNA-Seq, Single Cell → scRNA-Seq. New order: Gene Summary → Genomic → RNA-Seq → Proteomics → scRNA-Seq → Correlation → Cross Refs
  • Sidebar Fixes: Ensembl ID parsing improvements (nested JSON fallback). Ensembl ID text removed from sidebar info card. CABase logo links to Overview tab. University of Oulu logo enlarged
March 20, 2026 — AI Data Integration & TCGA Restructure
  • AI — GTEx Correlations: New data section sends top co-expressed genes to AI. Automatic pairwise lookup when multiple genes queried. Tissue-aware (uses specific tissue or ALL_SAMPLES)
  • AI — TCGA Expression: New data section sends tumor expression stats (median, IQR, n) across 33 TCGA cancer types. Smart cancer detection from natural language ("liver cancer" → LIHC)
  • AI — Conversation: Multi-turn chat with history. Gene detection from conversation context. Temperature slider. Markdown rendering. Inline PMID citations. New Chat button
  • AI — Model: Upgraded to Qwen3-Next-80B MoE (fast inference with reliable instruction following)
  • Data: TCGA data reorganized under data/TCGA/ (survival + expression). All paths and pipelines updated
  • UI: AI tab removed (floating widget only). Hero page updated. About page restyled. Chatbot colors matched to site palette
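The per-cancer expression statistics mentioned in this entry (median, IQR, n) can be computed with the Python standard library. This is a sketch under assumptions: the function name is hypothetical, and the quartile convention used here (the default "exclusive" method of `statistics.quantiles`) may differ from whatever CABase uses.

```python
from statistics import median, quantiles


def expression_summary(values):
    """Summarize expression values as median, interquartile range, and count.

    Needs at least two values, since quartiles are undefined for fewer.
    """
    ordered = sorted(values)
    q1, _, q3 = quantiles(ordered, n=4)  # default method is "exclusive"
    return {"median": median(ordered), "iqr": q3 - q1, "n": len(ordered)}
```

For the values 1 through 8, this gives a median of 4.5, an IQR of 4.5, and n of 8.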
March 20, 2026 — TCGA Survival, Pipeline & Network Upgrades
  • TCGA Survival: KM curves from UCSC Xena (33 cancers, 1,451 curves). Box plot with scatter overlay. All on same TCGA row
  • Co-expression Networks: Proper p-values + BH FDR. 55 subtissues. 2,088 network files
  • Data Pipeline: 3-stage architecture (Retrieve/Analyze/Format). Vectorized extraction is 10x faster. Site size reduced from 524 MB to 404 MB
  • Multi-Gene: Database selector (GTEx/HPA/FANTOM5). Auto-populated. Tall charts
  • UI: AI tab removed (floating widget only). Compact export buttons. Model display fixed
March 19, 2026 — Phase 7: Tracks, Networks & Polish
  • IGV.js Tracks: ENCODE cCREs, DNA Methylation Atlas (39 cell types), RNA-seq signal. Info tooltips. Auto-reload
  • Networks: All-vs-all Spearman across 50+ GTEx tissues. D3 force-directed two-hop ego networks
  • Skeleton Loading: Shimmer animations across all tabs
  • UI: Unified sidebar gene selector. Side-by-side network + tables. Fullscreen fix for SVG
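The Spearman correlation named in this entry is the Pearson correlation of rank-transformed values. A small dependency-free sketch follows; the function names are hypothetical, and a production pipeline would more likely use a vectorized library routine than these loops.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks matter, any strictly increasing relationship (for example x against x squared for positive x) yields a coefficient of 1.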
March 18, 2026 — Phases 1-6: Full Redesign
  • Vertical-scroll SPA replaced with sidebar + 8 tabbed views + panel grid
  • Plotly.js kitchen-sink expression charts, IGV.js genome browser, enhanced TableViewer
  • LLM optimization: Nemotron-3 Super default (~10s responses), BAAI/bge-base-en-v1.5 embeddings
September 2025 — Initial Release
  • CABase platform launch with genomic context, expression, correlation, gene regulation, cross-references, and AI assistant

Credits

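CABase integrates data from GTEx, Human Protein Atlas, TCGA, CELLxGENE, Tahoe-100M, ENCODE, ChIP-Atlas, LINCS L1000, and other sources.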

Developed at Tampere University, Faculty of Medicine and Health Technology, Department of Anatomy (Parkkila Group). Contact: harlan[dot]barker[at]tuni.fi