Overview Virtual Cell Atlas · Web Software
In liver-targeted drug delivery and single-cell omics, researchers face format fragmentation, complex pipelines, and limited cross-platform visualization. This web tool streamlines data loading, QC, analysis, and rich visualization to accelerate studies on the mechanism of action of all-trans retinoic acid (ATRA) in HCC and normal hepatocytes.
Datasets arrive in many formats (.h5ad, .zarr, .loom, 10x formats), workflows span cleaning → DR → clustering → trajectory, and collaboration suffers without unified interactive tools. Our design lowers the entry barrier while keeping rigorous, reproducible outputs.
Research Background & Challenges
Format & Scale
- Massive single-cell transcriptomes
- Diverse formats: .h5ad, .zarr, .loom, 10x (.h5/.mtx)
Pipeline Complexity
- QC, cleaning, NaN/Inf repair
- DR: PCA/UMAP/t-SNE, clustering & trajectory (PAGA)
Collaboration & Viz
- Hard to share/replicate parameters
- Lack of unified, interactive web UI
Highlights
Data Ingestion
Multi-source & multi-format
.h5ad .zarr .loom .csv/.tsv/.txt 10x .mtx .zip
Smart Repair & Cache
- Auto-handle NaN/Inf, drop all-zero genes/cells
- Auto-select best expression matrix (X/raw.X/layers)
- Local caching for fast reloads
Analysis & Viz
- PCA, UMAP, t-SNE, Leiden, PAGA
- Matplotlib / Seaborn / Plotly options
Validated
- Benchmarked on public DBs
- iGEM experimental datasets
Tool Components
Core Backend
- scanpy, anndata, lamindb
- Caching: .h5ad + metadata JSON
Frontend
- Streamlit web interface
- Interactive parameters & plots
Dependencies
- requirements.txt and pyproject.toml for reproducibility
Data Cache Layer
- Auto-generate fast reload artifacts
- Repeatable analysis across sessions
User Experience & API
Web UI
- Upload, configure, visualize in browser
- Intuitive panels & progress feedback
Unified API
- Data loading & cache management
- Easy extension / secondary dev
Compatibility
- Windows / Linux / macOS
- Recommended Python 3.11
Openness
- GenBank / standard formats
- Integrates with bioinformatics tools
Collaboration & Validation
- Experimental Validation: Public single-cell DBs and iGEM team data show high consistency with experimental findings.
- Collaboration: Deep integration with open-source platforms like LaminDB to ensure source reliability.
- User Feedback: Piloted in small research teams; features iterated based on feedback.
Supported Formats & Data Sources (full list)
Formats: .h5ad, .zarr, .loom, .csv/.tsv/.txt, 10x .h5/.mtx triplets, .zip.
Sources: Local upload, Hugging Face Hub, LaminDB instances, public clouds (S3/GCS).
Data Preprocessing & QC
Principle
VCAExplorer provides a reliable, fault-tolerant, and user-friendly environment for processing and visualizing liver-related single-cell transcriptomics.
Goals and Challenges
- Diverse data formats that are hard to handle uniformly.
- Inconsistent quality (NaN/Inf, all-zero cells/genes).
- Complex steps (dimensionality reduction, clustering, trajectory) that may fail on outliers.
- Repeated analyses without efficient caching.
Core Design Philosophy
- Automatic Matrix Selection: pick the most informative matrix from X, raw.X, and layers, without manual work.
- Intelligent Repair: replace NaN/Inf, remove all-zero rows/columns, and merge duplicates to stabilize inputs.
- Hierarchical DR with Fallback: prefer PCA → UMAP/t-SNE → PAGA; if a step fails, automatically fall back to a PCA embedding so plots are always available.
- Caching & Reproducibility: save results and metadata locally for fast reloads and reproducible runs.
- Modular Extensibility: analysis logic is decoupled from the UI, making it easy to add new methods (e.g., PHATE, scVelo).
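The repair and matrix-selection ideas above can be illustrated with a minimal NumPy sketch on dense arrays. The function names `sanitize_matrix` and `pick_most_informative` are hypothetical and are not the tool's own `sanitize_and_prepare_matrix()` or `pick_nonzero_matrix()`. The rule "most finite, nonzero entries wins" is an assumed heuristic, since the exact selection rule is not stated here.

```python
import numpy as np

def sanitize_matrix(X):
    """Replace NaN/Inf with 0, then drop all-zero rows (cells) and columns (genes)."""
    X = np.asarray(X, dtype=float)
    X = np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0)
    keep_rows = np.abs(X).sum(axis=1) > 0
    keep_cols = np.abs(X).sum(axis=0) > 0
    return X[keep_rows][:, keep_cols], keep_rows, keep_cols

def pick_most_informative(candidates):
    """Return (name, matrix) of the candidate with the most finite, nonzero entries.

    `candidates` maps a label such as "X" or "raw.X" to a 2D array.
    This scoring rule is an assumption made for illustration.
    """
    def score(M):
        M = np.asarray(M, dtype=float)
        return int(np.count_nonzero(np.isfinite(M) & (M != 0)))
    name = max(candidates, key=lambda key: score(candidates[key]))
    return name, candidates[name]
```

Returning the row and column masks alongside the cleaned matrix lets the caller apply the same subsetting to metadata and embeddings.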
QC & DR (simulated)
Zero-rate filter (threshold = 20%)
Illustrative distribution: samples above 20% zero-rate are flagged for removal; others are retained.
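A zero-rate filter of this kind could look roughly like the sketch below. It assumes rows are the samples being screened and that "above 20%" means strictly greater than the threshold; the function name is hypothetical.

```python
import numpy as np

def zero_rate_flags(X, threshold=0.20):
    """Return per-row zero rates and a Boolean flag for rows above the threshold."""
    X = np.asarray(X)
    zero_rate = (X == 0).mean(axis=1)   # fraction of zero entries in each row
    flagged = zero_rate > threshold     # True means "flag for removal"
    return zero_rate, flagged
```

Rows where `flagged` is False are retained, for example with `X[~flagged]`.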
Data Processing Workflow
Key Implementation Details
- Data source integration: local uploads, Hugging Face Hub, LaminDB, and cloud storage; remote data fetched via a universal file-system interface and cached locally.
- Dependencies and ecosystem: mainstream single-cell and data-management libraries; multiple plotting backends for visualization.
- Cache management: results and metadata are saved with timestamps, source modes, and instance identifiers for quick reuse and auditability.
- User configuration: personalized parameters are saved to keep sessions consistent.
- Error handling: detect and fix row/column orientation; automatically switch algorithms when a DR method fails.
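As an illustration of the cache-management point, the sketch below writes a `metadata.json` next to a cached result using only the standard library. The directory layout, the field names, and the functions `cache_key` and `write_metadata` are assumptions for this example, not the tool's actual `persist_to_cache()`.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def cache_key(source_mode, instance_id, params):
    """Derive a stable key from the data source and analysis parameters."""
    payload = json.dumps(
        {"source_mode": source_mode, "instance_id": instance_id, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def write_metadata(cache_dir, source_mode, instance_id, params):
    """Create <cache_dir>/<key>/metadata.json and return its path."""
    key = cache_key(source_mode, instance_id, params)
    entry_dir = Path(cache_dir) / key
    entry_dir.mkdir(parents=True, exist_ok=True)
    meta = {
        "key": key,
        "source_mode": source_mode,
        "instance_id": instance_id,
        "params": params,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    path = entry_dir / "metadata.json"
    path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return path
```

Because the key is built from sorted JSON, the same source and parameters map to the same cache entry regardless of dictionary ordering. The cached `.h5ad` file would be written into the same directory.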
Design Advantages
- Plug-and-play: Users can perform data analysis without needing to understand underlying file structures.
- Highly fault-tolerant: Built-in auto-repair and fallback mechanisms ensure uninterrupted workflow.
- Efficient and reproducible: Caching and metadata logging enhance repeat experiment efficiency.
- Scalable: Supports additional algorithms and external database integration, facilitating team collaboration and future feature upgrades.
User Interface and Compatibility (Web UI, API, formats, UX)
Web UI
Interaction Framework: Streamlit front-end, no CLI required.
Design: Modular views for upload, parameter setup, and visualization. Users can:
- Upload local files;
- Select remote sources (e.g., LaminDB, Hugging Face Hub);
- Adjust parameters (e.g., clustering resolution, neighbors);
- Preview/export PCA/UMAP/t-SNE figures.
Dynamic Feedback: progress bars, warnings, and concise run summaries.
API Interface
Data Loading
- auto_fetch_adata() — fetch remote data → AnnData
- load_uploaded_to_adata() — parse local uploads → unified object
- persist_to_cache() — write results and metadata.json
Config
- load_user_config() / save_user_config() — user preferences
Sanitization
- sanitize_and_prepare_matrix()
- pick_nonzero_matrix()
Format Compatibility
Supported inputs
- Professional: .h5ad, .zarr, .loom, 10x .h5 / .mtx
- General: .csv, .tsv, .txt (auto orientation & transpose)
- Archives: .zip (auto extract & detect)
Data sources
- Local upload
- Hugging Face Hub (hf://)
- LaminDB instances
- S3/GCS (anonymous supported)
OS: Windows/Linux/macOS; Python 3.11 recommended.
User Experience
- Caching: store .h5ad + metadata.json to avoid re-compute
- Personalization: user_config.json remembers preferences
- Error tolerance: instant feedback & repair hints on bad files
- Visual options: Matplotlib / Seaborn / Plotly exportables
Tutorials
1. Homepage Access
Overview: Launch the app, read the title bar, check the basic info (cell/gene counts, cache time, data source, matrix source), then move to the top navigation with eight modules.
This flow highlights initial checks before exploring modules.
2. Data/Filter Module
Process: Choose the tab → select metadata columns (cell_type, disease, cell_line, tissue) → pick values (multi-select) → see live counts and success prompt → set a grouping field for coloring/statistics.
Filtering narrows to the relevant cells/genes and prepares grouping for plots.
3. Embedding Visualization Module
Process: Enter the tab → optionally repair poor-quality data → select UMAP/t-SNE/PCA/PHATE → set 3D/facet/color-by options → inspect interactive scatter and optional density overlay.
Pick the embedding that best separates populations; density helps reveal structure.
4. Clustering Module
Process: Select Leiden/KMeans/HDBSCAN and parameters → run → results overlay on UMAP → evaluate with silhouette and ARI/NMI → optional cell cycle, gene-set scoring, doublet detection.
Tune resolution/cluster number to balance granularity and biological interpretability.
5. Enrichment Analysis Module
Process: Differential analysis (optional reference) → Volcano/MA with thresholds → Enrichr (db names, run, CSV) → GSEA prerank (group, score, gene-set DBs).
Use Volcano/MA to verify signal direction; Enrichr/GSEA summarize pathways.
6. Proportion Analysis Module
Process: Choose grouping/stratification → view proportion tables and stacked bars → build Sankey for category flows → run proportion difference tests and export CSV.
Proportion and flow views help compare cohorts and trace cell-state transitions.
7. General Operation Tips
- Data Saving: Important results from each module can be downloaded through the “Export/Save” tab.
- Interactivity: All charts are interactive, supporting zoom and hover for detailed information.
- Real-time Feedback: The system displays progress status and completion prompts after operations.
- Error Handling: If operations fail, the system shows detailed error messages and solution suggestions.
Data/Filter Module
1. Module Purpose
Core Purpose: The Data/Filter module serves as the entry checkpoint of the analysis workflow. Its primary goal is to allow researchers to quickly and intuitively extract specific cell populations of interest from large-scale single-cell datasets, thereby enabling targeted downstream analyses.
Main Functions:
- Conditional filtering: Selects subsets of cells based on metadata (e.g., cell type, tissue of origin, disease state, experimental condition).
- Population grouping: Groups filtered results by user-specified fields (e.g., cell_type, tissue), for visualization coloring and comparative analysis.
- Data subsetting: Applies consistent subsetting across the expression matrix, metadata, and embedding matrices to preserve integrity.
- Prepares inputs for downstream modules: Ensures the filtered dataset can directly feed into clustering, enrichment analysis, proportion analysis, and other modules.
In short, this module acts like a “data magnifier”, focusing the global view onto the specific populations of biological interest.
2. Assumptions
- Data structure assumption: The input is a standard single-cell object (e.g., AnnData), containing expression data, metadata, and embeddings.
- Metadata completeness: At least one categorical attribute (e.g., cell_type, tissue) is available for filtering.
- Embedding availability: Dimensionality reduction results (UMAP/PCA/t-SNE/PHATE) have already been computed, or can be reconstructed after filtering.
- Filter validity: User-selected conditions must match existing metadata values; if overly strict, the system must provide warnings.
- Data consistency: Subsetting must update expression, metadata, and embeddings simultaneously to prevent misaligned analyses or visualizations.
3. Mathematical Model
From a mathematical perspective, the Data/Filter module performs a set selection based on Boolean masking.
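- Expression matrix \(X \in \mathbf{R}^{n\times p}\), where \(n\) = number of cells, \(p\) = number of genes.
- Metadata matrix \(O\) with \(n\) rows and \(m\) annotation fields.
- Embedding matrix \(E \in \mathbf{R}^{n\times d}\), representing each cell in a low-dimensional space.
User filtering conditions are defined as:
\[ F = \{\,(c,\ V(c)) : c \in C\,\} \]
Where: \(C\) is the set of chosen metadata fields, e.g. \(\{\text{cell\_type}, \text{tissue}\}\), and \(V(c)\) is the set of allowed values within field \(c\), e.g. \(\{\text{Tcell}, \text{Bcell}\}\).
For each cell \(i\), define a Boolean mask:
\[ m_i = \prod_{c\in C} \mathbf{1}\!\left[\,O_{i,c} \in V(c)\,\right] \in \{0,1\}, \]
so a cell is kept only if it matches an allowed value in every chosen field.
The filtered dataset is:
\[ X' = X[m,:],\qquad O' = O[m,:],\qquad E' = E[m,:], \] where \(m=(m_1,\ldots,m_n)\) selects the rows with \(m_i=1\).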
4. Data Sources
Inputs:
- Expression matrix (cells × genes), either raw counts or normalized expression values.
- Metadata (obs): cell annotations, such as cell_type, disease status, tissue origin, experimental condition.
- Low-dimensional embeddings (obsm): UMAP/t-SNE/PCA/PHATE coordinates.
- User input: chosen filtering fields and values.
Outputs:
- A new subsetted AnnData object, containing:
- Filtered expression matrix (X')
- Filtered metadata (O')
- Synchronized embedding matrix (E')
5. Implementation (Conceptual Workflow)
Logical steps (code-agnostic description):
- Detect available fields: The system scans metadata to identify categorical attributes suitable for filtering.
- User selects conditions: The researcher specifies one or more fields and values (e.g., tissue = liver, cell_type = T cell).
- Generate filtering mask: A Boolean vector is constructed to mark retained cells.
- Apply subsetting:
- Expression matrix: rows corresponding to selected cells are kept.
- Metadata: rows are subset accordingly.
- Embedding matrices: rows are trimmed to maintain alignment.
- Output and reporting: A new dataset is returned, and the system reports the number of cells and genes after filtering.
This ensures seamless integration into downstream modules without requiring additional preprocessing.
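The conceptual steps above can be sketched with plain NumPy arrays standing in for the AnnData fields. Here `obs` is a dictionary of metadata columns, and `build_mask` and `subset` are hypothetical names. Values within a field are combined with OR and fields are combined with AND, which matches examples such as "T and B cells in liver tissue".

```python
import numpy as np

def build_mask(obs, conditions):
    """Boolean mask over cells; `conditions` maps field -> allowed values."""
    n = len(next(iter(obs.values())))
    mask = np.ones(n, dtype=bool)
    for field, allowed in conditions.items():
        if field not in obs:
            raise KeyError(f"metadata field not found: {field}")
        # OR across allowed values within a field, AND across fields
        mask &= np.isin(np.asarray(obs[field]), list(allowed))
    return mask

def subset(X, obs, E, mask):
    """Apply the same row mask to expression, metadata, and embedding."""
    X_sub = np.asarray(X)[mask]
    obs_sub = {name: np.asarray(col)[mask] for name, col in obs.items()}
    E_sub = np.asarray(E)[mask]
    return X_sub, obs_sub, E_sub
```

With an actual AnnData object, Boolean indexing such as `adata[mask]` subsets `X`, `obs`, and `obsm` together, so the explicit three-way subsetting here is only for illustration. An all-False mask is the case where a "conditions too strict" warning would be raised.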
6. Outputs
The module produces results on two levels:
- Data result:
- A new subsetted AnnData object, with consistent expression data, metadata, and embeddings.
- Ready-to-use for downstream visualization and analysis.
- Interface feedback:
- Displays the number of cells before and after filtering (e.g., from 20,000 to 3,245).
- Gene count (usually unchanged, retaining the full gene set).
- Highlights selected grouping fields (e.g., cell_type) for coloring and statistical analysis.
7. Module Features and Highlights
- User-friendly interaction:
- Researchers can apply complex filtering through simple multi-select operations, without coding.
- High flexibility:
- Supports multi-field, multi-value combinations, enabling complex cohort comparisons (e.g., “T and B cells in liver tissue”).
- Consistency guarantee:
- Ensures expression, metadata, and embeddings are always updated in sync, avoiding misalignment.
- Robustness:
- Provides warnings when conditions are too strict or fields are missing, instead of failing.
- Performance optimized:
- Based on sparse matrix and Boolean indexing, ensuring scalability to million-cell datasets.
- Seamless downstream integration:
- Outputs can directly feed into Embedding Visualization, Clustering, Enrichment Analysis, Proportion Analysis, forming a coherent pipeline.
Embedding Visualization Module
1. Module Purpose
The Embedding Visualization module transforms high-dimensional single-cell gene expression matrices into low-dimensional spaces, enabling intuitive visualization of relationships among cells. It allows researchers to directly observe cell-type distributions, differentiation trajectories, and population structures.
Main Objectives:
- Dimensionality reduction visualization: Project tens of thousands to millions of cells into 2D or 3D space.
- Structural discovery: Reveal local and global structures such as clusters, gradients, and lineage transitions.
- Metadata integration: Combine biological annotations (e.g., tissue, disease, cell type) for color-based grouping.
- Exploratory analysis entry point: Serve as a visual gateway to downstream modules such as clustering, enrichment, and proportion analysis.
2. Assumptions
- Preprocessing assumption: The input data has undergone quality control (QC), normalization, and batch correction.
- Manifold assumption: The high-dimensional gene expression data lies on a low-dimensional manifold with meaningful geometry.
- Distance preservation assumption: Similar cells remain close together in the embedded space, while distinct groups are separable.
- Interpretability assumption: The embedding structure aligns with biological metadata to support meaningful interpretation.
3. Mathematical Model
The mathematical foundation of this module is nonlinear dimensionality reduction, which seeks a mapping that preserves the geometric or statistical relationships of the high-dimensional data in a lower-dimensional representation.
Given an expression matrix:
\[ X = [x_1, x_2, \ldots, x_n]^{\top} \in \mathbf{R}^{n\times p} \] where \(n\) is the number of cells and \(p\) is the number of genes, we aim to find a mapping:
\[ f:\ \mathbf{R}^{p} \rightarrow \mathbf{R}^{d},\quad d \in \{2,3\} \]
such that the embedded coordinates
\[ Y = f(X) \in \mathbf{R}^{n\times d} \] preserve as much structural information from \(X\) as possible.
(1) PCA (Principal Component Analysis)
PCA finds a linear projection that maximizes variance: \[ Y = X W_d,\qquad W_d = [\,w_1, w_2, \ldots, w_d\,] \] where \(X\) is assumed to be column-centered (each gene has zero mean) and \(W_d\) consists of the top \(d\) eigenvectors of \(X^{\top} X\), i.e. those with the largest eigenvalues. Objective function: \[ \max_{W_d^{\top} W_d = I_d}\ \mathrm{Tr}\!\left(W_d^{\top}\, X^{\top} X\, W_d\right). \]
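A minimal NumPy sketch of this projection, assuming a dense matrix and using the SVD of the centered data in place of an explicit eigendecomposition of \(X^{\top}X\) (the right singular vectors are the same eigenvectors). The function name is illustrative.

```python
import numpy as np

def pca_embed(X, d=2):
    """Project rows of X onto the top-d principal axes."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each gene (column)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:d].T                                  # columns = top-d eigenvectors of Xc^T Xc
    Y = Xc @ W                                    # n x d embedding
    explained = (S[:d] ** 2) / np.sum(S ** 2)     # explained variance ratio per component
    return Y, W, explained
```

For large sparse single-cell matrices a truncated solver would normally be used instead; the dense SVD is only meant to mirror the formula.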
(2) t-SNE (t-distributed Stochastic Neighbor Embedding)
High-D similarities: \[ P_{j|i}=\frac{\exp\!\left(-\|x_i-x_j\|^{2}/(2\sigma_i^{2})\right)} {\sum_{k\neq i}\exp\!\left(-\|x_i-x_k\|^{2}/(2\sigma_i^{2})\right)},\qquad P_{ij}=\frac{P_{j|i}+P_{i|j}}{2n}. \] Low-D similarities: \[ Q_{ij}=\frac{\bigl(1+\|y_i-y_j\|^{2}\bigr)^{-1}} {\sum_{k\neq l}\bigl(1+\|y_k-y_l\|^{2}\bigr)^{-1}}. \] Minimize the divergence: \[ \min_{Y}\ \mathrm{KL}(P\Vert Q)=\sum_{i\neq j} P_{ij}\,\log\frac{P_{ij}}{Q_{ij}}. \]
(3) UMAP (Uniform Manifold Approximation and Projection)
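Construct fuzzy graphs \(G_H\) and \(G_L\) with connectivities \(p_{ij}\) and \(q_{ij}\); minimize the cross-entropy:
\[ \min_{Y}\ \sum_{i<j}\left[\, p_{ij}\,\log\frac{p_{ij}}{q_{ij}} + (1-p_{ij})\,\log\frac{1-p_{ij}}{1-q_{ij}} \,\right]. \]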
(4) PHATE
Preserve diffusion geometry by matching diffusion distances. Let \(P_H\) and \(P_L\) be diffusion operators with timescale \(t\): \[ D_H=\log\!\bigl(P_H^{\,t}+\varepsilon\bigr),\qquad D_L=\log\!\bigl(P_L^{\,t}+\varepsilon\bigr), \] and minimize \[ \min_{Y}\ \|D_H - D_L\|_F^2. \]
4. Data Sources
- Inputs:
- High-dimensional expression matrix (normalized and batch-corrected).
- Metadata annotations (e.g., cell type, tissue, disease) for grouping and coloring.
- Optional: filtered subsets from the Data/Filter module.
- Outputs:
- Low-dimensional embedding matrix \(Y \in \mathbf{R}^{n\times d}\), stored in adata.obsm.
- Visualization configuration data (color schemes, scales, labels, and axes information).
5. Implementation
- Algorithm selection: Users can choose PCA, t-SNE, UMAP, or PHATE.
- Embedding computation: The system computes the mapping from high- to low-dimensional space and produces the coordinate matrix.
- Result storage: The embedding and relevant metadata are stored in the AnnData object for reuse.
- Visualization generation:
- 2D/3D scatter plots with color coding by metadata (e.g., cell_type, tissue, disease).
- Expression-based coloring (Feature plots).
- Cluster highlighting.
- Interactive exploration: The interface allows zooming, rotation, and selection of specific cells for detailed analysis.
6. Outputs
- Data-level outputs:
- Low-dimensional coordinate matrix \(Y\), containing each cell’s 2D/3D coordinates.
- Embedding parameters (e.g., number of neighbors, minimum distance, random seed) for reproducibility.
- Updated AnnData object ready for downstream analysis.
- Visualization-level outputs:
- 2D scatter plots: Reveal overall structure, cluster boundaries, and transitional regions.
- 3D interactive plots: Enable rotation and depth exploration for complex systems.
- Multi-mode coloring: Supports overlays for metadata, gene expression, and clustering labels.
- Statistical summaries: Display the number of groups, density distributions, and explained variance ratios.
- Export options: Allow exporting images, embedding matrices, and parameter reports for publication or presentation.
7. Features and Highlights
- Multi-algorithm integration: Combines linear (PCA) and nonlinear (t-SNE, UMAP, PHATE) approaches for flexible analysis.
- High interpretability: Provides intuitive visualization of biological relationships and transitions.
- Interactive exploration: Supports real-time zooming, labeling, and local analysis.
- Layered visualization: Enables combined display of metadata and gene expression gradients.
- High computational efficiency: Uses sparse matrices and neighbor-based acceleration for large-scale datasets.
- Seamless integration: Directly connects with the Data/Filter module upstream and provides embeddings for downstream modules such as Clustering, Enrichment, and Proportion Analysis.
Clustering Module
1. Module Purpose
The Clustering module uncovers cellular heterogeneity in single-cell transcriptomics by grouping cells with similar gene-expression profiles into subpopulations, enabling identification of cell types, functional states, and developmental lineages.
Primary objectives:
- Group high-dimensional single-cell data into biologically meaningful clusters.
- Reveal hidden structures such as subpopulations and transitional states.
- Provide cluster labels for downstream modules (enrichment, proportion, mechanism analysis).
- Enable visual exploration in embedding spaces (UMAP / t-SNE).
2. Assumptions
- Expression similarity: cells with similar transcriptional profiles tend to share biological functions.
- Manifold: cells lie on a low-dimensional manifold; neighborhood relations carry structure.
- Separability: distinct subgroups form compact, separable regions in the embedding space.
- Embedding validity: input embeddings (e.g., PCA/UMAP) retain essential biological information.
- Stability: under similar parameters, clustering is consistent and reproducible.
3. Mathematical Model
Let the embedded matrix after dimensionality reduction be
\[ Y = [\,y_1, y_2, \ldots, y_n\,]^{\top} \in \mathbf{R}^{\,n\times d}, \] where \(y_i \in \mathbf{R}^{\,d}\) is the embedding of the \(i\)-th cell. The goal is to divide all cells into \(k\) clusters
\[ \mathcal{C}=\{C_1,\ldots,C_k\},\quad \bigcup_{j=1}^{k} C_j=\{1,2,\ldots,n\},\quad C_a\cap C_b = \{\}\ \ (a\ne b) \]
and minimize intra-cluster variance while maximizing inter-cluster separation. Let \(D(\cdot,\cdot)\) be a distance and \(\mu_j\) be the centroid of cluster \(C_j\):
\[ \min_{\{\mu_j\},\,\mathcal{C}}\ \sum_{j=1}^{k}\ \sum_{i\in C_j}\ \|\,y_i-\mu_j\,\|^2 . \]
(1) K-Means (centroid-based)
Objective (Within-Cluster Sum of Squares, WCSS):
\[ \min_{\{\mu_j\}}\ \sum_{j=1}^{k}\ \sum_{i\in C_j}\ \|\,y_i-\mu_j\,\|^2 . \]
Iteration: (i) assignment — send each \(y_i\) to the nearest centroid; (ii) update — recompute \(\mu_j\) as the mean of assigned points; repeat until convergence.
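The assignment/update iteration can be written as a short NumPy sketch. Initial centroids are drawn at random from the data, which is an assumption; other initializations such as k-means++ are common.

```python
import numpy as np

def kmeans(Y, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns labels, centroids, and the WCSS objective."""
    Y = np.asarray(Y, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = Y[rng.choice(len(Y), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # (i) assignment: squared distance from every point to every centroid
        d2 = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # (ii) update: mean of the assigned points (empty clusters keep their centroid)
        new_centroids = centroids.copy()
        for j in range(k):
            members = Y[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    d2 = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wcss = float(d2[np.arange(len(Y)), labels].sum())
    return labels, centroids, wcss
```

The returned `wcss` is the value of the objective above at convergence, which is also the quantity reported as inertia.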
(2) Louvain / Leiden (graph-based community detection)
Build a k-nearest-neighbor graph \(G=(V,E)\) on \(\{y_i\}\) with edge weights \[ w_{ij}=\exp\!\left(-\frac{\|y_i-y_j\|^{2}}{2\sigma^{2}}\right). \] Optimize network modularity: \[ Q=\frac{1}{2m}\sum_{i,j}\Big[\,w_{ij}-\frac{k_i k_j}{2m}\Big]\ \delta(c_i,c_j), \]
Here \(m = \frac{1}{2}\sum_{i,j} w_{ij}\) is the total edge weight, \(k_i = \sum_j w_{ij}\) is the (weighted) degree, and \(\delta(c_i,c_j)=1\) if nodes \(i\) and \(j\) are in the same community (0 otherwise).
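To make the modularity formula concrete, the sketch below evaluates \(Q\) for a given partition of a small weighted graph. It only scores a partition and does not perform the Louvain or Leiden optimization itself. A dense symmetric weight matrix is assumed.

```python
import numpy as np

def modularity(W, labels):
    """Newman modularity Q of a partition for a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    degree = W.sum(axis=1)                       # k_i
    two_m = W.sum()                              # 2m = sum of all w_ij
    same = labels[:, None] == labels[None, :]    # delta(c_i, c_j)
    return float(((W - np.outer(degree, degree) / two_m) * same).sum() / two_m)
```

For two disconnected triangles, placing each triangle in its own community gives \(Q = 0.5\), while merging everything into one community gives \(Q = 0\).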
(3) Gaussian Mixture Model (GMM)
Assume a mixture of \(K\) Gaussians: \[ p(y_i)=\sum_{k=1}^{K}\ \pi_k\,\mathcal{N}\!\left(y_i\mid \mu_k,\ \Sigma_k\right), \] where \(\pi_k\) are mixture weights and \((\mu_k,\Sigma_k)\) are mean and covariance. Estimate parameters by maximizing the log-likelihood via EM: \[ \max_{\Theta}\ \sum_{i=1}^{n}\log p(y_i),\qquad \Theta=\{\pi_k,\mu_k,\Sigma_k\}_{k=1}^{K}. \]
(4) Spectral Clustering
Given similarity \(W=[w_{ij}]\) and degree matrix \(D=\mathrm{diag}(d_1,\ldots,d_n)\) with \(d_i=\sum_j w_{ij}\), define the Laplacian \[ L=D-W,\qquad L_{\mathrm{sym}}=I-D^{-1/2} W D^{-1/2}. \] Compute the \(k\) eigenvectors of \(L_{\mathrm{sym}}\) with the smallest eigenvalues (equivalently, the top \(k\) eigenvectors of \(D^{-1/2} W D^{-1/2}\)) to form \(U\in\mathbf{R}^{\,n\times k}\), then run K-Means on the rows of \(U\) to obtain cluster labels.
4. Data Sources
- Input:
- Low-dimensional embeddings (from PCA, UMAP, t-SNE, PHATE, etc.).
- Cell–cell similarity or adjacency matrix.
- Optional metadata for biological annotation.
- Output:
- Cluster labels (cluster IDs).
- Cluster statistics (size, centroid, intra-variance, inter-distance, modularity).
- Updated AnnData object (adata.obs["cluster"]).
5. Implementation
5.1 Data Input
- Load the embedding matrix from the Embedding Visualization module.
- Auto-detect dense/sparse format.
- Choose algorithm based on dataset size and structure.
5.2 Processing Flow
- Neighborhood construction: compute pairwise distances (Euclidean / cosine) and build a kNN graph.
- Algorithm & parameters: select K-Means / Leiden / GMM / Spectral; set \(k\), resolution, neighbor size, seed.
- Clustering computation: run the selected algorithm to assign labels.
- Evaluation & tuning: assess with Silhouette, Modularity, Inertia, or Davies–Bouldin; auto-tune if needed.
- Result integration: store labels to AnnData and wire to visualization for color mapping.
5.3 Output Results
- Cluster label per cell.
- Per-cluster statistics (size, density, centroid).
- Quality metrics (Silhouette, Modularity, Inertia, DB-Index).
- Exportable results (CSV/JSON) and interactive visual components.
Silhouette (−1 to 1): higher is better; compares intra-cluster cohesion \(a(i)\) to nearest-cluster separation \(b(i)\).
\[ s(i)=\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}}. \]
Modularity \(Q\) (−0.5 to 1): higher is better; measures community structure in graphs, with values near 0 indicating no more structure than expected by chance.
\[ Q=\frac{1}{2m}\sum_{i,j}\Big[\,w_{ij}-\frac{k_i k_j}{2m}\Big]\ \delta(c_i,c_j). \]
Inertia (WCSS): lower is better; total within-cluster squared distance.
\[ \mathrm{Inertia}=\sum_{j=1}^{k}\ \sum_{i\in C_j}\ \|\,y_i-\mu_j\,\|^2 . \]
Davies–Bouldin Index (DBI): lower is better; ratio of within-cluster scatter to between-cluster separation.
\[ \mathrm{DBI}=\frac{1}{k}\sum_{j=1}^{k}\ \max_{l\neq j}\ \frac{S_j+S_l}{M_{jl}}, \] where \(S_j\) is scatter of cluster \(j\) and \(M_{jl}\) is distance between centroids \(\mu_j,\mu_l\).
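The silhouette and Davies–Bouldin formulas above can be checked on small inputs with the following NumPy sketch. It uses Euclidean distance, computes the full pairwise distance matrix (so it is only suitable for small \(n\)), and assumes at least two clusters. Function names are illustrative.

```python
import numpy as np

def silhouette_mean(Y, labels):
    """Mean silhouette s(i) = (b - a) / max(a, b) over all points."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))
    uniq = np.unique(labels)
    scores = np.zeros(len(Y))
    for i in range(len(Y)):
        own = labels == labels[i]
        n_own = int(own.sum())
        if n_own <= 1:
            continue                               # singleton cluster: s(i) = 0
        a = D[i, own].sum() / (n_own - 1)          # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in uniq if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

def davies_bouldin(Y, labels):
    """DBI with S_j = mean distance to centroid, M_jl = centroid distance."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    uniq = np.unique(labels)
    cents = np.array([Y[labels == c].mean(axis=0) for c in uniq])
    S = np.array([np.linalg.norm(Y[labels == c] - cents[j], axis=1).mean()
                  for j, c in enumerate(uniq)])
    k = len(uniq)
    worst = [max((S[j] + S[l]) / np.linalg.norm(cents[j] - cents[l])
                 for l in range(k) if l != j) for j in range(k)]
    return float(np.mean(worst))
```

On two tight, well-separated groups, a correct labeling yields a silhouette close to 1 and a small DBI, while a labeling that mixes the groups yields a negative silhouette and a much larger DBI.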
6. Output Explanation
- Data-level: cluster IDs and centroids; cells per cluster; intra-variance; quality metrics; parameter logs (algorithm, seed, kNN size).
- Visualization-level: UMAP/t-SNE with color-coded clusters; cluster-distance heatmaps; mean-expression heatmaps; interactive selection and metadata inspection.
7. Module Features and Highlights
- Multi-Algorithm Integration: K-Means, Louvain/Leiden, GMM, Spectral.
- Topology Preservation: graph-based modeling maintains neighborhood structure.
- High Biological Interpretability: aligns with metadata for cell-type annotation.
- Automatic Parameter Optimization: internal metrics for self-tuning.
- Scalability: efficient for large datasets with sparse/GPU acceleration.
- Seamless Integration: works with Embedding, Enrichment, Proportion modules.
- Interactive Visualization: instant sync between labels and graphics.
Enrichment Analysis Module
1. Module Purpose
The Enrichment Analysis module bridges data-driven clustering and biological mechanism discovery. It tests specific clusters or DEG sets to find significantly enriched biological processes, molecular functions, pathways, and regulatory programs.
Main objectives:
- Reveal functional characteristics and active pathways of each cluster.
- Identify key processes (e.g., metabolism, immune response, differentiation).
- Link gene expression changes to regulatory/signaling mechanisms.
- Provide pathway-level input for mechanism analysis with interpretable visuals.
2. Assumptions
- Functional modularity: genes act in modules (pathways/complexes) rather than in isolation.
- Expression–function correlation: up/down-regulated genes imply activation/inhibition of functions.
- Approximate independence: gene-level statistics are sufficiently independent for set-level inference.
- Annotation reliability: curated databases (GO, KEGG, Reactome, MSigDB, etc.) are assumed to be accurate and up to date.
- Sample representativeness: each analyzed cluster contains enough cells for robust statistics.
3. Mathematical Model
The core principle is to test whether a predefined functional gene set \(S\) is statistically over-represented in a target set \(T\).
Let: \(M\) = total background genes; \(M_S\) = genes in set \(S\); \(N\) = genes in target \(T\); \(k\) = overlap of \(S\) and \(T\).
(1) Hypergeometric Test — for discrete sets (e.g., up-regulated DEGs).
\[ C(n,k) \;=\; \frac{n!}{\,k!\,(n-k)!\,}. \]
\[ P \;=\; 1 \;-\; \sum_{i=0}^{\,k-1} \frac{\,C(M_S,\,i)\; C(M-M_S,\,N-i)\,}{\,C(M,\,N)\,}\, . \]
If \(P<0.05\), the pathway or set is considered significantly enriched (assuming sampling from the background universe).
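The hypergeometric tail can be evaluated exactly with integer binomial coefficients from the standard library. The sketch sums the upper tail directly, which equals one minus the lower-tail sum in the formula above and avoids cancellation when \(P\) is very small. The function name is illustrative.

```python
from math import comb

def enrichment_p_value(M, M_S, N, k):
    """P(overlap >= k) when N genes are drawn from M, of which M_S are in the set."""
    start = max(k, 0)
    total = comb(M, N)
    upper = sum(comb(M_S, i) * comb(M - M_S, N - i)
                for i in range(start, min(M_S, N) + 1))
    return upper / total
```

For example, with \(M=10\), \(M_S=4\), \(N=3\), and \(k=2\), the tail is \((36+4)/120 = 1/3\).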
(2) Gene Set Enrichment Analysis (GSEA) — for ranked, continuous expression.
Rank genes \(g_1,\ldots,g_N\) with scores \(r_i\) (e.g., logFC, t-statistic). Define cumulative distributions:
\[ P_{\mathrm{hit}}(i) \;=\; \sum_{\,g_j\in S,\ j\le i}\ \frac{|r_j|^{\,p}}{\displaystyle \sum_{g\in S}|r_g|^{\,p}}, \qquad P_{\mathrm{miss}}(i) \;=\; \sum_{\,g_j\notin S,\ j\le i}\ \frac{1}{\,N-|S|\,}. \]
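Enrichment score: the maximum deviation of the running difference from zero, \[ ES \;=\; P_{\mathrm{hit}}(i^{*}) - P_{\mathrm{miss}}(i^{*}),\qquad i^{*}=\arg\max_{\,i}\ \Big| P_{\mathrm{hit}}(i) - P_{\mathrm{miss}}(i) \Big|, \] so \(ES\) is positive for sets concentrated at the top of the ranking and negative for sets concentrated at the bottom. Normalized enrichment score (permutation based): \[ NES \;=\; \frac{ES}{\mathrm{avg}\!\left(|ES_{\mathrm{random}}|\right)} . \]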
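A small NumPy sketch of the running-sum computation follows. It takes genes already sorted by score, returns the signed maximum deviation from zero as the enrichment score, and omits the permutation step needed for NES. It assumes the set has at least one hit with a nonzero score and that at least one gene lies outside the set.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, running) for a ranked gene list and one gene set."""
    gene_set = set(gene_set)
    in_set = np.array([g in gene_set for g in ranked_genes])
    weights = np.abs(np.asarray(scores, dtype=float)) ** p
    hit = np.where(in_set, weights, 0.0)
    P_hit = np.cumsum(hit) / hit.sum()                 # weighted fraction of hits so far
    P_miss = np.cumsum(~in_set) / (~in_set).sum()      # fraction of misses so far
    running = P_hit - P_miss
    es = float(running[np.argmax(np.abs(running))])    # signed maximum deviation
    return es, running
```

Plotting `running` against rank gives the familiar GSEA curve, with the peak or trough marking the enrichment score.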
(3) GSVA (Gene Set Variation Analysis) — non-parametric estimation of pathway activity per sample/cell.
\[ \mathrm{GSVA}(S,c) \;=\; \text{Rank-based kernel} \!\left(\, \frac{\exp\big(r(S,c)\big) - \exp\big(r_{\mathrm{background}}(c)\big)} {\sigma_c} \right), \] where \(r(S,c)\) is the rank statistic for set \(S\) in cell/sample \(c\), and \(\sigma_c\) is a scale term.
(4) Significance and Correction
Multiple testing adjusts raw \(P\)-values to control the expected false discovery proportion across many gene-set queries.
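The correction procedure is not named here. Assuming the widely used Benjamini–Hochberg method, adjusted values can be computed as in this sketch.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)          # p_(i) * m / i
    # enforce monotonicity: each value is the minimum of itself and all larger ranks
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted
```

Gene sets whose adjusted value falls below the chosen cutoff (for example 0.05) would then be reported as significant.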
4. Data Sources
- Input:
- DE results (gene, logFC, P-value).
- Cluster labels from the Clustering module.
- Functional databases: GO, KEGG, Reactome, MSigDB, WikiPathways, TRANSFAC, TRRUST.
- Optional user-defined sets (GMT/CSV).
- Output:
- Significant pathways/functions per cluster.
- Statistics (ES, NES, P-value, FDR) with visual and tabular summaries.
5. Implementation
5.1 Data Input
- Load cluster information from the Clustering module.
- Import DEG lists (logFC, P-value).
- Load gene-set libraries (GO, KEGG, MSigDB, etc.).
- Optionally accept custom sets.
5.2 Processing Flow
- Preparation: deduplicate/normalize genes (HGNC/ENSEMBL); define background.
- Computation: ORA (hypergeometric) for discrete lists; GSEA for ranked lists; GSVA for cell/sample-level activities.
- Statistics: compute P/FDR and ES/NES; filter by \(FDR<0.05\).
- Integration: map enriched pathways to clusters; auto-annotate cluster functions.
- Visualization: bubble/bar plots, GSEA curves, heatmaps; interactive drill-down.
5.3 Output Results
- Enrichment table (pathway, category, P, FDR, gene count, ES/NES).
- Key gene lists within enriched pathways.
- Visual outputs: bubble plot, top-bar plot, GSEA curve, heatmap.
- Export formats: CSV, Excel, JSON, PDF.
6. Output Explanation
- Data-level: pathway statistics (P, FDR, ES/NES); contributing genes; cluster–pathway matrix.
- Visualization-level: bubble plot (x=enrichment score; y=pathway; size=gene count), top-bar plot, GSEA trend, heatmap; optional network to show gene overlap.
- Statistical summary: number of significant pathways, direction (up/down), mean FDR; auto-generated report for publication/export.
7. Module Features and Highlights
- Multi-algorithm integration: ORA, GSEA, GSVA, PAGE.
- Comprehensive databases: GO, KEGG, Reactome, MSigDB.
- Hierarchical analysis: single-cluster, multi-cluster, global comparisons.
- Statistical rigor: FDR correction and permutation testing.
- High interpretability: connects gene-level changes to pathway-level meaning.
- Extensible: custom gene sets and cross-species annotation.
- Dynamic visualization: interactive and filterable outputs.
- Seamless integration with Clustering, Mechanism, and Proportion modules.
Proportion Analysis Module
1. Module Purpose
The Proportion Analysis module quantifies how cell population fractions change across experimental conditions (e.g., control vs. treatment). Based on clustering results, it computes per-cluster fractions per sample, performs statistical testing, and visualizes composition shifts.
Main objectives:
- Quantify cluster proportions under different conditions.
- Detect statistically significant changes in composition.
- Reveal structural remodeling under perturbations.
- Provide bar/stacked/heatmap views and supply results to downstream modules.
2. Assumptions
- Label accuracy: clustering labels are biologically meaningful.
- Sample comparability: comparable depth/size so that proportions reflect biology.
- Independence: cell assignments across samples are treated as independent observations.
- Stability: replicates show similar proportions without perturbation.
- Closed population: total counts are fixed or normalized across samples.
3. Mathematical Model
The objective is to determine whether cell-type proportions differ significantly between conditions. Let:
- \(N_{ij}\): number of cells in cluster \(j\) for sample \(i\);
- \(n_i=\sum_j N_{ij}\): total cells in sample \(i\);
- Cell proportion: \(\displaystyle p_{ij}=\frac{N_{ij}}{n_i}\).
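The definitions above can be sketched with pandas. This is a minimal illustration, not the module's actual code: the toy DataFrame stands in for `adata.obs`, and the `sample` column name is an assumption.

```python
import pandas as pd

# Toy stand-in for adata.obs; the 'sample' column name is an assumption.
obs = pd.DataFrame({
    "sample":  ["s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2", "s2"],
    "cluster": ["A",  "A",  "B",  "C",  "A",  "B",  "B",  "B",  "C"],
})

# N_ij: number of cells of cluster j in sample i (rows = samples, columns = clusters).
counts = pd.crosstab(obs["sample"], obs["cluster"])

# n_i: total cells per sample; p_ij = N_ij / n_i.
totals = counts.sum(axis=1)
proportions = counts.div(totals, axis=0)

print(proportions)
```

Each row of `proportions` sums to 1, so the table can be passed directly to the tests and plots described in the following sections.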
3.1 Statistical Tests
(1) Proportion Difference Test (two groups) — Chi-square test for contingency counts.
\[ \chi^{2} \;=\; \sum_{j=1}^{k} \frac{(O_j - E_j)^{2}}{E_j}, \]
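where \(O_j\) is the observed count and \(E_j\) is the expected count under the null hypothesis of equal composition. If \(P<0.05\), the cluster composition differs significantly between the two groups; testing one cluster against all others (a 2×2 table) attributes the change to a specific cell type.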
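As a hedged illustration of the chi-square test, the sketch below applies `scipy.stats.chi2_contingency` to a toy condition × cluster count table (the counts are invented) and recomputes the statistic by hand from the observed and expected counts.

```python
from scipy.stats import chi2_contingency

# Rows = conditions (control, treatment); columns = clusters. Toy counts.
observed = [
    [120, 80, 50],
    [90, 60, 100],
]

# For tables larger than 2x2 no continuity correction is applied.
chi2, p_value, dof, expected = chi2_contingency(observed)

# Same statistic from the formula: sum of (O - E)^2 / E over all table cells.
manual = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.2e}, manual={manual:.3f}")
```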
Fisher’s exact test — for small counts (2×2 table). The exact tail probability is computed from the hypergeometric distribution.
\[ P \;=\; \sum_{\text{tables } t \ \text{as or more extreme}} \frac{\,C(n_{1\cdot},\,t_{11})\,C(n_{2\cdot},\,t_{21})\,}{\,C(n_{\cdot\cdot},\,n_{\cdot 1})\,}, \] using \(C(n,k)=\frac{n!}{k!\,(n-k)!}\) for combinations.
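The hypergeometric sum can be written out directly with the standard library. This sketch assumes the common two-sided convention in which "as or more extreme" means every table, with the same margins, whose probability does not exceed that of the observed table; the function name is illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def table_prob(t11):
        # Hypergeometric probability of the table with top-left cell t11.
        return comb(row1, t11) * comb(row2, col1 - t11) / comb(total, col1)

    p_obs = table_prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    probs = [table_prob(t) for t in range(lowest, highest + 1)]
    # Small relative tolerance so float round-off does not drop equal-probability tables.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))

# Rows = condition; columns = cells inside / outside the cluster (toy counts).
print(fisher_exact_two_sided(3, 1, 1, 3))
```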
(2) Multi-group comparison — ANOVA on proportions (when normality is acceptable). Tests equality of group means of \(p_{ij}\).
Report \(F\) statistic and \(P\) value; proceed with post-hoc comparisons if significant.
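A minimal sketch of the ANOVA step, assuming per-sample proportions of a single cluster have already been grouped by condition. The group names and values are invented for illustration.

```python
from scipy.stats import f_oneway

# Per-sample proportions of one cluster, grouped by condition (toy values).
control = [0.20, 0.22, 0.19]
low_dose = [0.25, 0.27, 0.24]
high_dose = [0.35, 0.33, 0.36]

# One-way ANOVA: tests whether the group means of p_ij are equal.
f_stat, p_value = f_oneway(control, low_dose, high_dose)

print(f"F={f_stat:.2f}, p={p_value:.4g}")
```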
Kruskal–Wallis (rank-based, nonparametric) — for \(g\) groups with total \(N\) observations.
\[ H \;=\; \frac{12}{\,N(N+1)\,}\ \sum_{i=1}^{g} n_i \left(\,\bar{R}_i - \bar{R}\,\right)^{2}, \]
where \(\bar{R}_i\) is the mean rank of group \(i\), \(n_i\) is here the number of observations in group \(i\) (not the per-sample cell total defined earlier), and \(\bar{R}=(N+1)/2\) is the overall mean rank. A significant result (\(P<0.05\)) indicates that at least one group differs.
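The sketch below evaluates the \(H\) formula by hand and compares it with `scipy.stats.kruskal`. The helper assumes there are no tied values (SciPy additionally applies a tie correction), and the data are toy proportions.

```python
from scipy.stats import kruskal

def kruskal_wallis_h(*groups):
    """H statistic from the mean-rank formula; assumes no tied values."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # 1-based ranks
    n_total = len(pooled)
    mean_rank_all = (n_total + 1) / 2
    spread = sum(
        len(g) * (sum(rank[v] for v in g) / len(g) - mean_rank_all) ** 2
        for g in groups
    )
    return 12 / (n_total * (n_total + 1)) * spread

# Per-sample proportions of one cluster in three conditions (toy values).
control = [0.20, 0.22, 0.19]
low_dose = [0.25, 0.27, 0.24]
high_dose = [0.35, 0.33, 0.36]

h_manual = kruskal_wallis_h(control, low_dose, high_dose)
h_scipy, p_value = kruskal(control, low_dose, high_dose)

print(f"H (formula)={h_manual:.3f}, H (scipy)={h_scipy:.3f}, p={p_value:.4f}")
```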
(3) Bayesian proportion modeling — with Beta priors for per-cluster proportions.
\[ p_{ij} \sim \mathrm{Beta}(\alpha_{ij},\,\beta_{ij}), \] and posterior credible intervals are compared across conditions to assess proportion differences.
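One way to make the Bayesian comparison concrete is a conjugate Beta-binomial update. The sketch below is an assumption-laden illustration: it uses a uniform Beta(1, 1) prior, treats each condition's counts as a single binomial draw, and approximates the posterior by Monte Carlo sampling with the standard library. The counts and function names are invented.

```python
import random

def posterior_samples(k, n, alpha0=1.0, beta0=1.0, draws=20000, seed=0):
    """Draw from Beta(alpha0 + k, beta0 + n - k), the posterior for k of n cells."""
    rng = random.Random(seed)
    return [rng.betavariate(alpha0 + k, beta0 + n - k) for _ in range(draws)]

def credible_interval(samples, level=0.95):
    """Equal-tailed credible interval read off the sorted samples."""
    ordered = sorted(samples)
    lower = ordered[int((1 - level) / 2 * len(ordered))]
    upper = ordered[int((1 + level) / 2 * len(ordered)) - 1]
    return lower, upper

# Toy counts: cluster j holds 40 of 400 control cells and 90 of 450 treated cells.
ctrl = posterior_samples(k=40, n=400, seed=1)
treat = posterior_samples(k=90, n=450, seed=2)

ctrl_ci = credible_interval(ctrl)
treat_ci = credible_interval(treat)

# Posterior probability that the treated proportion exceeds the control one.
prob_increase = sum(t > c for t, c in zip(treat, ctrl)) / len(ctrl)

print(ctrl_ci, treat_ci, prob_increase)
```

Non-overlapping credible intervals, or a posterior probability close to 0 or 1, point to a shift in that cluster's proportion.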
(4) Fold-Change (intuitive comparison)
\[ \mathrm{FC}_{j} \;=\; \frac{p^{\mathrm{treatment}}_{j}}{p^{\mathrm{control}}_{j}}, \qquad \log_{2}(\mathrm{FC}_{j}). \]
Values >1 (positive \(\log_{2}\mathrm{FC}\)) indicate an increased proportion under treatment; values <1 (negative \(\log_{2}\mathrm{FC}\)) indicate a decrease.
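A short worked example of the fold-change formula. The small `pseudo` offset is an assumption added here so that a cluster absent from the control does not cause a division by zero; it is not part of the formula above.

```python
from math import log2

def log2_fold_change(p_treatment, p_control, pseudo=1e-6):
    """log2 of the treatment/control proportion ratio.

    pseudo is an assumed small offset that keeps the ratio finite
    when a proportion is exactly zero.
    """
    return log2((p_treatment + pseudo) / (p_control + pseudo))

doubled = log2_fold_change(0.20, 0.10)   # proportion doubles under treatment
quartered = log2_fold_change(0.05, 0.20)  # proportion drops to a quarter
new_cluster = log2_fold_change(0.03, 0.0)  # cluster absent in control

print(doubled, quartered, new_cluster)
```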
4. Data Sources
- Input data:
- Cluster labels (adata.obs['cluster']).
- Condition labels (adata.obs['condition']).
- Per-sample cell count matrices; sample metadata.
- Output data:
- Cell proportion table for each cluster across groups.
- Statistical results (P, FDR, fold-change).
- Visualizations (bar, stacked, heatmap).
5. Implementation
5.1 Data Input
- Load cluster and condition labels; allow user-defined grouping (tissue, timepoint, etc.).
- Normalize counts per sample to ensure comparability.
5.2 Processing Flow
- Aggregation & proportions: count \(N_{ij}\); compute \(p_{ij}=N_{ij}/n_i\); form a sample×cluster proportion matrix.
- Testing & significance: Chi-square / Fisher for two groups; ANOVA or Kruskal–Wallis for multiple groups; compute \(P\) values, adjust for multiple testing (FDR), and flag clusters with \(\mathrm{FDR}<0.05\).
- Visualization & reporting: bar, stacked, heatmap, volcano/bubble; annotate significant clusters; export tables.
- Integration: link proportion shifts with enrichment/mechanism results.
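The FDR adjustment in the testing step can be sketched as follows. The text does not name the correction procedure, so the Benjamini–Hochberg method used here is an assumption, and the p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# One p-value per cluster (toy values).
p_per_cluster = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(p_per_cluster)
significant = [q < 0.05 for q in fdr]

print(fdr, significant)
```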
5.3 Output Results
- Proportion matrix across conditions.
- Summary table (P, FDR, \(\log_{2}\)FC).
- Visual outputs: bar, stacked, heatmap, volcano, bubble.
- Exportable files (CSV, Excel, JSON).
6. Output Explanation
- Data-level: cluster×condition matrix; mean, sd, \(\log_{2}\)FC, P, FDR; markers for increases/decreases.
- Visualization: bar (per group), stacked (global composition), heatmap (sample comparison), volcano/bubble (effect vs. significance).
- Statistical report: summary of shifts; method details; multiple-testing correction summary.
7. Features and Highlights
- Automated quantification across conditions.
- Comprehensive testing (χ²/Fisher; ANOVA/Kruskal–Wallis).
- High interpretability through linkage with clustering and enrichment.
- Rich but clean visualization options.
- Statistical rigor with FDR control.
- Extensible to multi-condition/time-series/spatial data.
- Interactive exploration for quick insight.
- Seamless integration with Embedding, Clustering, Enrichment modules.
8. Summary and Resources
The Virtual Cell Atlas Explorer (VCAExplorer) forms an end-to-end, fault-tolerant, and interactive framework for liver-related single-cell research, integrating LaminDB connectivity, filtering, embedding, clustering, enrichment, and proportion analysis into a coherent pipeline for reproducible biological insight.
Scientific Contribution
- Data-level integration: Harmonizes diverse single-cell data formats (.h5ad, .zarr, .loom, 10x .mtx) for unified downstream analysis.
- Algorithmic innovation: Implements multi-stage fallback mechanisms to ensure successful dimensionality reduction and clustering even on noisy datasets.
- Functional insight: Bridges molecular-level data (gene expression) with system-level phenomena (cell-type composition and pathway activation).
- Reproducibility: Caching and metadata tracking keep analyses reproducible across sessions and users.
- Extensibility: Each module supports modular expansion (e.g., PHATE, scVelo, Harmony), maintaining long-term scalability.
Impact across pipeline stages
Program Architecture Overview
Design Features
- Fully automated preprocessing, normalization, and metadata alignment.
- Interactive Streamlit-based Web UI with multi-format visualization.
- API layer for developers to embed VCAExplorer functions in larger workflows.
- Cloud-compatible architecture supporting LaminDB, Hugging Face, and S3/GCS storage.
Error Handling
- Smart auto-repair of corrupted or misaligned data matrices.
- Automatic fallback to simpler embeddings (PCA) when advanced algorithms fail.
- Detailed logs and metadata for reproducibility and debugging.
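The fallback behaviour described above can be sketched as a small wrapper that tries methods in order and logs each failure. This is a hypothetical illustration, not VCAExplorer's code: `embed_with_fallback`, `flaky_umap`, and `simple_pca` are placeholder names, and the two stand-in functions do not compute a real UMAP or PCA.

```python
import logging

logger = logging.getLogger("embedding_fallback")

def embed_with_fallback(data, methods):
    """Try each (name, function) pair in order; return the first success."""
    errors = []
    for name, func in methods:
        try:
            return name, func(data)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            logger.warning("%s failed (%s); trying next method", name, exc)
            errors.append((name, exc))
    raise RuntimeError(f"all embedding methods failed: {errors}")

# Placeholder stand-ins (hypothetical): one always fails, one always succeeds.
def flaky_umap(data):
    raise ValueError("neighbors graph could not be built")

def simple_pca(data):
    return [row[:2] for row in data]  # not a real PCA, just a 2-column slice

method_used, embedding = embed_with_fallback(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [("umap", flaky_umap), ("pca", simple_pca)],
)

print(method_used, embedding)
```

Returning the name of the method that succeeded lets the caller record it in the run metadata, which supports the logging goal listed above.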
Output and Application
Research Use Cases
- ATRA mechanism studies in liver cancer and healthy hepatocytes.
- Drug-response profiling and cellular phenotype tracking.
- Immune & metabolic pathway analysis under treatment conditions.
Output Formats
- Interactive visualizations: UMAP, t-SNE, PHATE, and bar/heat/volcano plots.
- Exportable tables & metadata: .csv, .json, .h5ad.
- Reports: summaries of cell composition, enrichment scores, and statistical significance.
Selected items will be exported with readable filenames.
Resources and Availability
| Resource Type | Description |
|---|---|
| Repository | Available on open-source hosting platforms (e.g., GitHub / Hugging Face). Contains full codebase, documentation, and dependencies (requirements.txt, pyproject.toml). |
| Documentation | Detailed bilingual tutorials (English/Chinese) covering installation, workflow, and parameter configuration. |
| Databases | Integrated with LaminDB, Hugging Face datasets, and public single-cell resources (e.g., Human Cell Atlas, GEO). |
| Supported Formats | .h5ad, .zarr, .loom, .csv, .mtx, .zip archives (auto-detection and decompression supported). |
| Environment | Python ≥ 3.11, compatible with Windows, Linux, and macOS. |
| Visualization Libraries | Matplotlib, Seaborn, Plotly (3D and interactive support). |
| External APIs | LaminDB API, Hugging Face Hub, fsspec cloud storage connectors. |
Future Directions
| Direction | Description | Status |
|---|---|---|
| Mechanism Analysis Module | Integration of transcription factor (TF) networks, ligand–receptor interactions, and gene regulatory modeling. | |
| Multi-omics Integration | Expansion to scATAC-seq and spatial transcriptomics for joint inference of chromatin–transcriptome coupling. | |
| AI-driven Analysis | Machine learning modules for automated feature extraction and biomarker discovery. | |
| Collaborative Research Platform | Cloud dashboards and database-linked annotations to enhance team sharing and review. | |
Status scale: Planned → In progress → Beta → Done.
Conclusion
VCAExplorer delivers a comprehensive, transparent, and modular platform for single-cell liver research. It bridges computational rigor with biological interpretability, helping scientists move from raw transcriptomic data to mechanistic understanding and therapeutic insight.
Through its open-source design, database connectivity, and reproducible architecture, the system sets a benchmark for integrated bioinformatics workflows in precision medicine and liver-targeted drug discovery.