Description | SYPHU-CHINA - iGEM 2025

Description


Overview Virtual Cell Atlas · Web Software

In liver-targeted drug delivery and single-cell omics, researchers face format fragmentation, complex pipelines, and limited cross-platform visualization. This web tool streamlines data loading, quality control (QC), analysis, and visualization to accelerate studies of the mechanism of action of all-trans retinoic acid (ATRA) in hepatocellular carcinoma (HCC) cells and normal hepatocytes.

Problem Statement. Datasets are massive and heterogeneous (.h5ad, .zarr, .loom, 10x formats), workflows span cleaning → dimensionality reduction (DR) → clustering → trajectory inference, and collaboration suffers without unified interactive tools. Our design lowers the entry barrier while keeping outputs rigorous and reproducible.

Research Background & Challenges

Format & Scale

  • Massive single-cell transcriptomes
  • Diverse formats: .h5ad, .zarr, .loom, 10x (.h5/.mtx)

Pipeline Complexity

  • QC, cleaning, NaN/Inf repair
  • DR: PCA/UMAP/t-SNE, clustering & trajectory (PAGA)

Collaboration & Viz

  • Hard to share/replicate parameters
  • Lack of unified, interactive web UI

Highlights

Data Ingestion

Multi-source & multi-format

Sources: Hugging Face, LaminDB, S3/GCS, Local

Formats: .h5ad, .zarr, .loom, .csv/.tsv/.txt, 10x .mtx, .zip

Smart Repair & Cache

  • Auto-handle NaN/Inf, drop all-zero genes/cells
  • Auto-select best expression matrix (X/raw.X/layers)
  • Local caching for fast reloads

Analysis & Viz

  • PCA, UMAP, t-SNE, Leiden, PAGA
  • Matplotlib / Seaborn / Plotly options

Validated

  • Benchmarked on public DBs
  • iGEM experimental datasets

Tool Components

Core Backend

  • scanpy, anndata, lamindb
  • Caching: .h5ad + metadata JSON

Frontend

  • Streamlit web interface
  • Interactive parameters & plots

Dependencies

  • requirements.txt
  • pyproject.toml for reproducibility

Data Cache Layer

  • Auto-generate fast reload artifacts
  • Repeatable analysis across sessions

User Experience & API

Web UI

  • Upload, configure, visualize in browser
  • Intuitive panels & progress feedback

Unified API

  • Data loading & cache management
  • Easy extension / secondary dev

Compatibility

  • Windows / Linux / macOS
  • Recommended Python 3.11

Openness

  • GenBank / standard formats
  • Integrates with bioinformatics tools

Collaboration & Validation

  • Experimental Validation: Public single-cell DBs and iGEM team data show high consistency with experimental findings.
  • Collaboration: Deep integration with open-source platforms like LaminDB to ensure source reliability.
  • User Feedback: Piloted in small research teams; features iterated based on feedback.
Supported Formats & Data Sources (full list)

Formats: .h5ad, .zarr, .loom, .csv/.tsv/.txt, 10x .h5/.mtx triplets, .zip.

Sources: Local upload, Hugging Face Hub, LaminDB instances, public clouds (S3/GCS).



Data Preprocessing & QC

Principle

VCAExplorer provides a reliable, fault-tolerant, and user-friendly environment for processing and visualizing liver-related single-cell transcriptomics.

Goals and Challenges

  • Diverse data formats that are hard to handle uniformly.
  • Inconsistent quality (NaN/Inf, all-zero cells/genes).
  • Complex steps (dimensionality reduction, clustering, trajectory) that may fail on outliers.
  • Repeated analyses without efficient caching.

Core Design Philosophy

  1. Automatic Matrix Selection: pick the most informative matrix from X, raw.X, and layers, without manual work.
  2. Intelligent Repair: replace NaN/Inf, remove all-zero rows/columns, and merge duplicates to stabilize inputs.
  3. Hierarchical DR with Fallback: prefer PCA → UMAP/t-SNE → PAGA; if a step fails, automatically fall back to a PCA embedding so plots are always available.
  4. Caching & Reproducibility: save results and metadata locally for fast reloads and reproducible runs.

  5. Modular Extensibility: analysis logic is decoupled from the UI, making it easy to add new methods (e.g., PHATE, scVelo).
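The "Intelligent Repair" principle can be illustrated with a small dependency-free sketch. It is not the VCAExplorer implementation: the function name `repair_matrix`, the choice of 0 as the replacement value, and the nested-list representation are assumptions made for illustration, and duplicate merging is not covered.

```python
import math

def repair_matrix(matrix, fill_value=0.0):
    """Replace NaN/Inf with fill_value, then drop all-zero rows and columns.

    matrix: list of rows (cells), each a list of numbers (genes).
    Returns (repaired, kept_rows, kept_cols).
    """
    # Step 1: replace non-finite entries so later steps never see NaN/Inf.
    cleaned = [
        [v if math.isfinite(v) else fill_value for v in row]
        for row in matrix
    ]
    n_cols = len(cleaned[0]) if cleaned else 0

    # Step 2: find rows (cells) and columns (genes) with a non-zero value.
    kept_rows = [i for i, row in enumerate(cleaned) if any(v != 0 for v in row)]
    kept_cols = [
        j for j in range(n_cols)
        if any(cleaned[i][j] != 0 for i in kept_rows)
    ]

    # Step 3: subset rows and columns using the kept indices.
    repaired = [[cleaned[i][j] for j in kept_cols] for i in kept_rows]
    return repaired, kept_rows, kept_cols


raw = [
    [1.0, float("nan"), 0.0],
    [0.0, 0.0, 0.0],
    [float("inf"), 2.0, 0.0],
]
repaired, kept_rows, kept_cols = repair_matrix(raw)
```

Returning the kept indices alongside the repaired matrix lets the caller subset cell and gene annotations with the same indices, which keeps them aligned with the data.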

QC & DR (simulated)

Figure (simulated): zero-rate filter with a 20% cutoff. Axis ticks run from 0% to 80%; the legend distinguishes kept samples, removed samples, and the cutoff.

Illustrative distribution: samples above 20% zero-rate are flagged for removal; others are retained.
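A minimal sketch of the rule described above, assuming that "zero-rate" means the fraction of entries equal to zero in a sample and that the 20% cutoff is exclusive (a sample at exactly 20% is kept). The function names are hypothetical, and each row is assumed to be non-empty.

```python
def zero_rate(values):
    """Fraction of entries equal to zero (values must be non-empty)."""
    return sum(1 for v in values if v == 0) / len(values)

def flag_by_zero_rate(rows, threshold=0.20):
    """Split row indices into (kept, removed) by comparing zero rate to threshold."""
    kept, removed = [], []
    for i, row in enumerate(rows):
        (removed if zero_rate(row) > threshold else kept).append(i)
    return kept, removed

samples = [
    [5, 3, 0, 2, 1],   # zero rate 0.2: not above the cutoff, kept
    [0, 0, 4, 1, 0],   # zero rate 0.6: removed
    [7, 1, 2, 3, 9],   # zero rate 0.0: kept
]
kept, removed = flag_by_zero_rate(samples)
```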

Data Processing Workflow

Figure: Data Processing Workflow.

Key Implementation Details

  • Data source integration: local uploads, Hugging Face Hub, LaminDB, and cloud storage; remote data fetched via a universal file-system interface and cached locally.
  • Dependencies and ecosystem: mainstream single-cell and data-management libraries; multiple plotting backends for visualization.
  • Cache management: results and metadata are saved with timestamps, source modes, and instance identifiers for quick reuse and auditability.
  • User configuration: personalized parameters are saved to keep sessions consistent.
  • Error handling: detect and fix row/column orientation; automatically switch algorithms when a DR method fails.
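The "automatically switch algorithms when a DR method fails" behaviour can be pictured as an ordered chain of attempts. The sketch below shows that general pattern only; `embed_with_fallback` and the two stand-in functions are hypothetical and are not taken from the tool's code.

```python
def embed_with_fallback(data, methods):
    """Try each (name, function) in order; return the first embedding that succeeds.

    Returns (name, embedding, errors), where errors records the failed attempts.
    """
    errors = []
    for name, fn in methods:
        try:
            return name, fn(data), errors
        except Exception as exc:  # a real tool would catch narrower exception types
            errors.append((name, str(exc)))
    raise RuntimeError(f"all embedding methods failed: {errors}")


# Hypothetical stand-ins for real embedding routines.
def failing_umap(data):
    raise ValueError("too few cells for the neighbor graph")

def simple_pca_stub(data):
    # Placeholder: keeps the first two features instead of running a real PCA.
    return [row[:2] for row in data]

name, embedding, errors = embed_with_fallback(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [("umap", failing_umap), ("pca", simple_pca_stub)],
)
```

Recording the failed attempts, rather than silently swallowing them, lets the interface tell the user which method was actually used for the plot.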

Design Advantages

  • Plug-and-play

    Users can perform data analysis without needing to understand underlying file structures.

  • Highly fault-tolerant

    Built-in auto-repair and fallback mechanisms ensure uninterrupted workflow.

  • Efficient and reproducible

    Caching and metadata logging enhance repeat experiment efficiency.

  • Scalable

    Supports additional algorithms and external database integration, facilitating team collaboration and future feature upgrades.
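As an illustration of the caching and metadata logging mentioned above, the following sketch writes a JSON sidecar containing a timestamp, source mode, and instance identifier. The file layout, field names, and hashing scheme are assumptions for illustration; the tool's actual cache format is not documented here.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path

def cache_key(source_mode, instance_id, params):
    """Stable key derived from the data source and analysis parameters."""
    payload = json.dumps(
        {"source_mode": source_mode, "instance_id": instance_id, "params": params},
        sort_keys=True,  # same parameters in any order give the same key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def write_cache_metadata(cache_dir, source_mode, instance_id, params):
    """Write a JSON sidecar describing a cached result; return its path."""
    key = cache_key(source_mode, instance_id, params)
    meta = {
        "key": key,
        "source_mode": source_mode,
        "instance_id": instance_id,
        "params": params,
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    path = Path(cache_dir) / f"{key}.json"
    path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return path

with tempfile.TemporaryDirectory() as tmp:
    meta_path = write_cache_metadata(tmp, "lamindb", "demo-instance", {"n_pcs": 50})
    loaded = json.loads(meta_path.read_text(encoding="utf-8"))
```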

User Interface and Compatibility (Web UI, API, formats, UX)

Tutorials

1. Homepage Access

Overview: Launch the app, read the title bar, check the basic info (cell/gene counts, cache time, data source, matrix source), then move to the top navigation with eight modules.

Figure: Homepage flow diagram (vertical).

This flow highlights initial checks before exploring modules.

Figure: Homepage web UI.

2. Data/Filter Module

Process: Choose the tab → select metadata columns (cell_type, disease, cell_line, tissue) → pick values (multi-select) → see live counts and success prompt → set a grouping field for coloring/statistics.

Figure: Data/Filter flow diagram (vertical).

Filtering narrows to the relevant cells/genes and prepares grouping for plots.

Figure: Data/Filter web UI (filter panel).

3. Embedding Visualization Module

Process: Enter the tab → optionally repair poor-quality data → select UMAP/t-SNE/PCA/PHATE → set 3D/facet/color-by options → inspect interactive scatter and optional density overlay.

Figure: Embedding flow diagram (vertical).

Pick the embedding that best separates populations; density helps reveal structure.

Figure: Embedding web UI (embedding options).

4. Clustering Module

Process: Select Leiden/KMeans/HDBSCAN and parameters → run → results overlay on UMAP → evaluate with silhouette and ARI/NMI → optional cell cycle, gene-set scoring, doublet detection.

Figure: Clustering flow diagram (vertical).

Tune resolution/cluster number to balance granularity and biological interpretability.

Figure: Clustering web UI (clustering controls).

5. Enrichment Analysis Module

Process: Differential analysis (optional reference) → Volcano/MA with thresholds → Enrichr (db names, run, CSV) → GSEA prerank (group, score, gene-set DBs).

Figure: Enrichment flow diagram (vertical).

Use Volcano/MA to verify signal direction; Enrichr/GSEA summarize pathways.

Figure: Enrichment web UI (differential & enrichment).

6. Proportion Analysis Module

Process: Choose grouping/stratification → view proportion tables and stacked bars → build Sankey for category flows → run proportion difference tests and export CSV.

Figure: Proportion flow diagram (vertical).

Proportion and flow views help compare cohorts and trace cell-state transitions.

Figure: Proportion analysis web UI (proportions & Sankey).

7. General Operation Tips

  • Data Saving: Important results from each module can be downloaded through the “Export/Save” tab.
  • Interactivity: All charts are interactive, supporting zoom and hover for detailed information.
  • Real-time Feedback: The system displays progress status and completion prompts after operations.
  • Error Handling: If operations fail, the system shows detailed error messages and solution suggestions.

Data/Filter Module

1. Module Purpose

Core Purpose: The Data/Filter module serves as the entry checkpoint of the analysis workflow. Its primary goal is to allow researchers to quickly and intuitively extract specific cell populations of interest from large-scale single-cell datasets, thereby enabling targeted downstream analyses.

Main Functions:

  1. Conditional filtering: Selects subsets of cells based on metadata (e.g., cell type, tissue of origin, disease state, experimental condition).
  2. Population grouping: Groups filtered results by user-specified fields (e.g., cell_type, tissue), for visualization coloring and comparative analysis.
  3. Data subsetting: Applies consistent subsetting across the expression matrix, metadata, and embedding matrices to preserve integrity.
  4. Prepares inputs for downstream modules: Ensures the filtered dataset can directly feed into clustering, enrichment analysis, proportion analysis, and other modules.

In short, this module acts like a “data magnifier”, focusing the global view onto the specific populations of biological interest.

Concept overview: selection of metadata fields defines a Boolean mask, which is applied consistently to expression, metadata, and embeddings.
  • Chosen fields: cell_type = {T cell, B cell}; tissue = {liver}; disease = {healthy, disease}. Values are combined with OR within a field and AND across fields.
  • Boolean mask: mask(i) = ∧ over fields; true = keep cell i.
  • Subsetting: Expression (X) → X′, Metadata (O) → O′, Embeddings (E) → E′; rows: kept cells.

2. Assumptions

  1. Data structure assumption: The input is a standard single-cell object (e.g., AnnData), containing expression data, metadata, and embeddings.
  2. Metadata completeness: At least one categorical attribute (e.g., cell_type, tissue) is available for filtering.
  3. Embedding availability: Dimensionality reduction results (UMAP/PCA/t-SNE/PHATE) have already been computed, or can be reconstructed after filtering.
  4. Filter validity: User-selected conditions must match existing metadata values; if overly strict, the system must provide warnings.
  5. Data consistency: Subsetting must update expression, metadata, and embeddings simultaneously to prevent misaligned analyses or visualizations.
Filter logic guide:
Within a chosen field, multiple values are combined with OR (e.g., tissue = liver OR blood). Across fields, conditions are combined with AND (e.g., tissue = liver AND cell_type = T cell).

3. Mathematical Model

From a mathematical perspective, the Data/Filter module performs a set selection based on Boolean masking.

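  • Expression matrix \(X \in \mathbf{R}^{n\times p}\), where \(n\) = number of cells and \(p\) = number of genes.
  • Metadata matrix \(O\), with \(n\) rows and \(m\) annotation fields.
  • Embedding matrix \(E\), with \(n\) rows, representing each cell in a low-dimensional space.

User filtering conditions are defined as:

\[ F = \{\,(c,\ V(c)) : c \in C\,\} \]

where \(C\) is the set of chosen metadata fields, e.g. \(\{\text{cell\_type}, \text{tissue}\}\), and \(V(c)\) is the set of allowed values within field \(c\), e.g. \(\{\text{T cell}, \text{B cell}\}\).

For each cell \(i\), define a Boolean mask:

\[ \mathrm{mask}(i) = \bigwedge_{c \in C} \big[\, O_{i,c} \in V(c) \,\big] \]

Cell \(i\) is kept if, for every chosen field, its value lies in the allowed set.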

The filtered dataset is:

X′ = X[mask, :]
O′ = O[mask, :]
E′ = E[mask, :]
Expression, metadata, and embeddings are subset simultaneously under the same mask (consistency guarantee).
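The masking rule and the synchronized subsetting can be sketched in a few lines of plain Python. The helper names `build_mask` and `subset_rows` and the toy data are hypothetical; with an AnnData object, indexing with a Boolean array (for example `adata[mask, :]`) subsets X, obs, and obsm together, so no manual bookkeeping is needed there.

```python
def build_mask(obs, conditions):
    """obs: list of per-cell dicts; conditions: {field: set of allowed values}.

    OR within a field (membership in the allowed set), AND across fields.
    """
    return [
        all(cell.get(field) in allowed for field, allowed in conditions.items())
        for cell in obs
    ]

def subset_rows(rows, mask):
    """Keep the rows whose mask entry is True."""
    return [row for row, keep in zip(rows, mask) if keep]

obs = [
    {"cell_type": "T cell", "tissue": "liver"},
    {"cell_type": "B cell", "tissue": "blood"},
    {"cell_type": "Hepatocyte", "tissue": "liver"},
    {"cell_type": "B cell", "tissue": "liver"},
]
X = [[1, 0], [0, 2], [3, 3], [0, 5]]                    # expression, cells x genes
E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]    # embedding, cells x 2

conditions = {"cell_type": {"T cell", "B cell"}, "tissue": {"liver"}}
mask = build_mask(obs, conditions)
# One mask, applied to all three structures, keeps them aligned.
X_f, obs_f, E_f = (subset_rows(a, mask) for a in (X, obs, E))
```

A mask with no True entries signals an overly strict filter, which is the case where the interface is expected to warn rather than fail.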

4. Data Sources

Inputs:

  • Expression matrix (cells × genes), either raw counts or normalized expression values.
  • Metadata (obs): cell annotations, such as cell_type, disease status, tissue origin, experimental condition.
  • Low-dimensional embeddings (obsm): UMAP/t-SNE/PCA/PHATE coordinates.
  • User input: chosen filtering fields and values.

Outputs:

  • A new subsetted AnnData object, containing:
    • Filtered expression matrix (X')
    • Filtered metadata (O')
    • Synchronized embedding matrix (E')
Data flow: inputs (expression, metadata, embeddings, user-selected fields and values) → filtering and subsetting (Boolean mask, consistency guarantee) → outputs (X′, O′, E′), ready for downstream modules.

5. Implementation (Conceptual Workflow)

Logical steps (code-agnostic description):

  1. Detect available fields: The system scans metadata to identify categorical attributes suitable for filtering.
  2. User selects conditions: The researcher specifies one or more fields and values (e.g., tissue = liver, cell_type = T cell).
  3. Generate filtering mask: A Boolean vector is constructed to mark retained cells.
  4. Apply subsetting:
    • Expression matrix: rows corresponding to selected cells are kept.
    • Metadata: rows are subset accordingly.
    • Embedding matrices: rows are trimmed to maintain alignment.
  5. Output and reporting: A new dataset is returned, and the system reports the number of cells and genes after filtering.

This ensures seamless integration into downstream modules without requiring additional preprocessing.


6. Outputs

The module produces results on two levels:

  1. Data result:
    • A new subsetted AnnData object, with consistent expression data, metadata, and embeddings.
    • Ready-to-use for downstream visualization and analysis.
  2. Interface feedback:
    • Displays the number of cells before and after filtering (e.g., from 20,000 to 3,245).
    • Gene count (usually unchanged, retaining the full gene set).
    • Highlights selected grouping fields (e.g., cell_type) for coloring and statistical analysis.

7. Module Features and Highlights

  1. User-friendly interaction:
    • Researchers can apply complex filtering through simple multi-select operations, without coding.
  2. High flexibility:
    • Supports multi-field, multi-value combinations, enabling complex cohort comparisons (e.g., “T and B cells in liver tissue”).
  3. Consistency guarantee:
    • Ensures expression, metadata, and embeddings are always updated in sync, avoiding misalignment.
  4. Robustness:
    • Provides warnings when conditions are too strict or fields are missing, instead of failing.
  5. Performance optimized:
    • Built on sparse matrices and Boolean indexing, which are designed to scale to million-cell datasets.
  6. Seamless downstream integration:
    • Outputs can directly feed into Embedding Visualization, Clustering, Enrichment Analysis, Proportion Analysis, forming a coherent pipeline.

Embedding Visualization Module

1. Module Purpose

The Embedding Visualization module transforms high-dimensional single-cell gene expression matrices into low-dimensional spaces, enabling intuitive visualization of relationships among cells. It allows researchers to directly observe cell-type distributions, differentiation trajectories, and population structures.

Main Objectives:

  • Dimensionality reduction visualization: Project tens of thousands to millions of cells into 2D or 3D space.
  • Structural discovery: Reveal local and global structures such as clusters, gradients, and lineage transitions.
  • Metadata integration: Combine biological annotations (e.g., tissue, disease, cell type) for color-based grouping.
  • Exploratory analysis entry point: Serve as a visual gateway to downstream modules such as clustering, enrichment, and proportion analysis.
Concept overview: mapping high-dimensional expression \(X\) to a low-dimensional embedding \(Y\) for visual interpretation.
Diagram: high-dimensional expression \(X\) → mapping \(f: X \to Y\) (preserves geometry/statistics) → low-dimensional embedding \(Y\).

2. Assumptions

  1. Preprocessing assumption: The input data has undergone quality control (QC), normalization, and batch correction.
  2. Manifold assumption: The high-dimensional gene expression data lies on a low-dimensional manifold with meaningful geometry.
  3. Distance preservation assumption: Similar cells remain close together in the embedded space, while distinct groups are separable.
  4. Interpretability assumption: The embedding structure aligns with biological metadata to support meaningful interpretation.

3. Mathematical Model

The mathematical foundation of this module is dimensionality reduction, either linear (PCA) or nonlinear (t-SNE, UMAP, PHATE), which seeks a mapping that preserves the geometric or statistical relationships of the high-dimensional data in a lower-dimensional representation.

Given an expression matrix:

\[ X = [x_1, x_2, \ldots, x_n]^{\top} \in \mathbf{R}^{n\times p} \] where \(n\) is the number of cells and \(p\) is the number of genes, we aim to find a mapping:

\[ f:\ \mathbf{R}^{p} \rightarrow \mathbf{R}^{d},\quad d \in \{2,3\} \]

such that the embedded coordinates

\[ Y = f(X) \in \mathbf{R}^{n\times d} \] preserve as much structural information from \(X\) as possible.

Algorithms:

(1) PCA (Principal Component Analysis)

PCA finds a linear projection that maximizes variance. With \(X\) column-centered (each gene has zero mean across cells): \[ Y = X W_d,\qquad W_d = [\,w_1, w_2, \ldots, w_d\,] \] where \(W_d\) consists of the top \(d\) eigenvectors of \(X^{\top} X\), i.e. those with the largest eigenvalues. Objective function: \[ \max_{W_d^{\top} W_d = I_d}\ \mathrm{Tr}\!\left(W_d^{\top}\, X^{\top} X\, W_d\right). \]
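The projection can be computed with a singular value decomposition, since the right singular vectors of the centered matrix are the eigenvectors of \(X^{\top} X\). The sketch below assumes NumPy is available; `pca_embed` is a hypothetical helper and not the routine used by the tool.

```python
import numpy as np

def pca_embed(X, d=2):
    """Project the rows of X onto the top-d principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # column-center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W_d = Vt[:d].T                               # top-d eigenvectors of Xc^T Xc
    Y = Xc @ W_d                                 # n x d embedding
    explained = (S[:d] ** 2) / np.sum(S ** 2)    # explained variance ratio
    return Y, W_d, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 0] *= 10.0                                  # make one direction dominate
Y, W_d, explained = pca_embed(X, d=2)
```

The columns of `W_d` are orthonormal, matching the constraint \(W_d^{\top} W_d = I_d\) in the objective, and `explained` corresponds to the explained variance ratios reported among the statistical summaries.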

4. Data Sources

  • Inputs:
    • High-dimensional expression matrix (normalized and batch-corrected).
    • Metadata annotations (e.g., cell type, tissue, disease) for grouping and coloring.
    • Optional: filtered subsets from the Data/Filter module.
  • Outputs:
    • Low-dimensional embedding matrix \(Y \in \mathbf{R}^{n\times d}\), stored in adata.obsm.
    • Visualization configuration data (color schemes, scales, labels, and axes information).

5. Implementation

  1. Algorithm selection: Users can choose PCA, t-SNE, UMAP, or PHATE.
  2. Embedding computation: The system computes the mapping from high- to low-dimensional space and produces the coordinate matrix.
  3. Result storage: The embedding and relevant metadata are stored in the AnnData object for reuse.
  4. Visualization generation:
    • 2D/3D scatter plots with color coding by metadata (e.g., cell_type, tissue, disease).
    • Expression-based coloring (Feature plots).
    • Cluster highlighting.
  5. Interactive exploration: The interface allows zooming, rotation, and selection of specific cells for detailed analysis.

6. Outputs

  1. Data-level outputs:
    • Low-dimensional coordinate matrix \(Y\), containing each cell’s 2D/3D coordinates.
    • Embedding parameters (e.g., number of neighbors, minimum distance, random seed) for reproducibility.
    • Updated AnnData object ready for downstream analysis.
  2. Visualization-level outputs:
    • 2D scatter plots: Reveal overall structure, cluster boundaries, and transitional regions.
    • 3D interactive plots: Enable rotation and depth exploration for complex systems.
    • Multi-mode coloring: Supports overlays for metadata, gene expression, and clustering labels.
    • Statistical summaries: Display the number of groups, density distributions, and explained variance ratios.
    • Export options: Allow exporting images, embedding matrices, and parameter reports for publication or presentation.
Export options example — page 1
Export options example — page 2

7. Features and Highlights

  1. Multi-algorithm integration: Combines linear (PCA) and nonlinear (t-SNE, UMAP, PHATE) approaches for flexible analysis.
  2. High interpretability: Provides intuitive visualization of biological relationships and transitions.
  3. Interactive exploration: Supports real-time zooming, labeling, and local analysis.
  4. Layered visualization: Enables combined display of metadata and gene expression gradients.
  5. High computational efficiency: Uses sparse matrices and neighbor-based acceleration for large-scale datasets.
  6. Seamless integration: Directly connects with the Data/Filter module upstream and provides embeddings for downstream modules such as Clustering, Enrichment, and Proportion Analysis.

Clustering Module

1. Module Purpose

The Clustering module uncovers cellular heterogeneity in single-cell transcriptomics by grouping cells with similar gene-expression profiles into subpopulations, enabling identification of cell types, functional states, and developmental lineages.

Primary objectives:

  • Group high-dimensional single-cell data into biologically meaningful clusters.
  • Reveal hidden structures such as subpopulations and transitional states.
  • Provide cluster labels for downstream modules (enrichment, proportion, mechanism analysis).
  • Enable visual exploration in embedding spaces (UMAP / t-SNE).

2. Assumptions

  1. Expression similarity: cells with similar transcriptional profiles tend to share biological functions.
  2. Manifold: cells lie on a low-dimensional manifold; neighborhood relations carry structure.
  3. Separability: distinct subgroups form compact, separable regions in the embedding space.
  4. Embedding validity: input embeddings (e.g., PCA/UMAP) retain essential biological information.
  5. Stability: under similar parameters, clustering is consistent and reproducible.

3. Mathematical Model

Let the embedded matrix after dimensionality reduction be

\[ Y = [\,y_1, y_2, \ldots, y_n\,]^{\top} \in \mathbf{R}^{\,n\times d}, \] where \(y_i \in \mathbf{R}^{\,d}\) is the embedding of the \(i\)-th cell. The goal is to divide all cells into \(k\) clusters

\[ \mathcal{C}=\{C_1,\ldots,C_k\},\quad \bigcup_{j=1}^{k} C_j=\{1,2,\ldots,n\},\quad C_a\cap C_b = \varnothing\ \ (a\ne b) \]

that minimizes intra-cluster variance while maximizing inter-cluster separation. With squared Euclidean distance and \(\mu_j\) the centroid of cluster \(C_j\), the objective is:

\[ \min_{\{\mu_j\},\,\mathcal{C}}\ \sum_{j=1}^{k}\ \sum_{i\in C_j}\ \|\,y_i-\mu_j\,\|^2 . \]

Algorithms:

(1) K-Means (centroid-based)

Objective (Within-Cluster Sum of Squares, WCSS):

\[ \min_{\{\mu_j\},\,\mathcal{C}}\ \sum_{j=1}^{k}\ \sum_{i\in C_j}\ \|\,y_i-\mu_j\,\|^2 . \]

Iteration: (i) assignment — send each \(y_i\) to the nearest centroid; (ii) update — recompute \(\mu_j\) as the mean of assigned points; repeat until convergence.
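The two-step iteration can be written compactly. This is a minimal sketch of Lloyd's algorithm on coordinate tuples with caller-supplied initial centroids; the function name and the empty-cluster rule (keep the previous centroid) are choices made for illustration.

```python
def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, wcss). Requires max_iter >= 1."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [tuple(c) for c in init_centroids]   # copy, do not mutate the input
    labels = None
    for _ in range(max_iter):
        # (i) assignment: nearest centroid for each point
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:                     # assignments stopped changing
            break
        labels = new_labels
        # (ii) update: mean of the points assigned to each centroid
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                              # empty cluster keeps its centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids, wcss = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

The returned `wcss` is the value of the objective above at convergence, which is the "Inertia" metric listed among the quality measures.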

4. Data Sources

  • Input:
    • Low-dimensional embeddings (from PCA, UMAP, t-SNE, PHATE, etc.).
    • Cell–cell similarity or adjacency matrix.
    • Optional metadata for biological annotation.
  • Output:
    • Cluster labels (cluster IDs).
    • Cluster statistics (size, centroid, intra-variance, inter-distance, modularity).
    • Updated AnnData object (adata.obs["cluster"]).

5. Implementation

5.1 Data Input

  • Load the embedding matrix from the Embedding Visualization module.
  • Auto-detect dense/sparse format.
  • Choose algorithm based on dataset size and structure.

5.2 Processing Flow

  1. Neighborhood construction: compute pairwise distances (Euclidean / cosine) and build a kNN graph.
  2. Algorithm & parameters: select K-Means / Leiden / GMM / Spectral; set \(k\), resolution, neighbor size, seed.
  3. Clustering computation: run the selected algorithm to assign labels.
  4. Evaluation & tuning: assess with Silhouette, Modularity, Inertia, or Davies–Bouldin; auto-tune if needed.
  5. Result integration: store labels to AnnData and wire to visualization for color mapping.

5.3 Output Results

  • Cluster label per cell.
  • Per-cluster statistics (size, density, centroid).
  • Quality metrics (Silhouette, Modularity, Inertia, DB-Index).
  • Exportable results (CSV/JSON) and interactive visual components.
Quality metrics:

Silhouette (−1 to 1): higher is better. For cell \(i\), \(a(i)\) is the mean distance to the other cells in its own cluster (cohesion) and \(b(i)\) is the mean distance to the cells of the nearest other cluster (separation).

\[ s(i)=\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}}. \]
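A direct, quadratic-time sketch of the silhouette formula with Euclidean distance is shown below. It assumes at least two clusters and assigns 0 to cells in singleton clusters, a common convention; `silhouette_scores` is a hypothetical helper name.

```python
import math

def silhouette_scores(points, labels):
    """Per-point silhouette s(i) = (b - a) / max(a, b) with Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:                       # singleton cluster: score set to 0
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
scores = silhouette_scores(points, [0, 0, 1, 1])
mean_score = sum(scores) / len(scores)
```

For datasets with many cells this all-pairs computation becomes expensive, so a practical tool would likely evaluate it on a subsample or use a library routine.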

6. Output Explanation

  1. Data-level: cluster IDs and centroids; cells per cluster; intra-variance; quality metrics; parameter logs (algorithm, seed, kNN size).
  2. Visualization-level: UMAP/t-SNE with color-coded clusters; cluster-distance heatmaps; mean-expression heatmaps; interactive selection and metadata inspection.
Clustered embedding view — example page 1
Cluster statistics and controls — example page 2

7. Module Features and Highlights

  1. Multi-Algorithm Integration: K-Means, Louvain/Leiden, GMM, Spectral.
  2. Topology Preservation: graph-based modeling maintains neighborhood structure.
  3. High Biological Interpretability: aligns with metadata for cell-type annotation.
  4. Automatic Parameter Optimization: internal metrics for self-tuning.
  5. Scalability: efficient for large datasets with sparse/GPU acceleration.
  6. Seamless Integration: works with Embedding, Enrichment, Proportion modules.
  7. Interactive Visualization: instant sync between labels and graphics.

Enrichment Analysis Module

1. Module Purpose

The Enrichment Analysis module bridges data-driven clustering and biological mechanism discovery. It tests specific clusters or DEG sets to find significantly enriched biological processes, molecular functions, pathways, and regulatory programs.

Main objectives:

  • Reveal functional characteristics and active pathways of each cluster.
  • Identify key processes (e.g., metabolism, immune response, differentiation).
  • Link gene expression changes to regulatory/signaling mechanisms.
  • Provide pathway-level input for mechanism analysis with interpretable visuals.

2. Assumptions

  1. Functional modularity: genes act in modules (pathways/complexes) rather than in isolation.
  2. Expression–function correlation: up/down-regulated genes imply activation/inhibition of functions.
  3. Approximate independence: gene-level statistics are sufficiently independent for set-level inference.
  4. Annotation reliability: curated databases (GO, KEGG, Reactome, MSigDB, etc.) are assumed to be accurate and sufficiently complete.
  5. Sample representativeness: each analyzed cluster contains enough cells for robust statistics.

3. Mathematical Model

The core principle is to test whether a predefined functional gene set \(S\) is statistically over-represented in a target set \(T\).

Let: \(M\) = number of background genes; \(M_S\) = number of genes in set \(S\); \(N\) = number of genes in target \(T\); \(k\) = number of genes shared by \(S\) and \(T\).

Methods:

(1) Hypergeometric Test — for discrete sets (e.g., up-regulated DEGs).

\[ C(a,b) \;=\; \frac{a!}{\,b!\,(a-b)!\,} \quad \text{(number of ways to choose \(b\) items from \(a\))}. \]

\[ P \;=\; 1 \;-\; \sum_{i=0}^{\,k-1} \frac{\,C(M_S,\,i)\; C(M-M_S,\,N-i)\,}{\,C(M,\,N)\,}\, . \]

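If \(P<0.05\), the pathway or set is considered significantly enriched (assuming sampling from the background universe). When many gene sets are tested, the FDR-adjusted value is used instead, as in the \(FDR<0.05\) filter of the processing flow.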
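The tail probability can be evaluated exactly with integer binomial coefficients from the standard library. This is a minimal sketch using the symbols defined above; the function name is hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(M, M_S, N, k):
    """P(overlap >= k) when N genes are drawn from M, of which M_S are in the set."""
    # Sum the lower tail P(overlap = 0 .. k-1) as an exact integer numerator.
    lower_tail = sum(comb(M_S, i) * comb(M - M_S, N - i) for i in range(k))
    return 1.0 - lower_tail / comb(M, N)

# Toy example: 10 background genes, 4 in the pathway, 3 in the target, 2 shared.
p = hypergeom_enrichment_p(M=10, M_S=4, N=3, k=2)
```

For very small P-values, subtracting the lower tail from 1 loses floating-point precision; summing the upper tail directly, or using a survival function from a statistics library, avoids this.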

4. Data Sources

  • Input:
    • DE results (gene, logFC, P-value).
    • Cluster labels from the Clustering module.
    • Functional databases: GO, KEGG, Reactome, MSigDB, WikiPathways, TRANSFAC, TRRUST.
    • Optional user-defined sets (GMT/CSV).
  • Output:
    • Significant pathways/functions per cluster.
    • Statistics (ES, NES, P-value, FDR) with visual and tabular summaries.

5. Implementation

5.1 Data Input

  • Load cluster information from the Clustering module.
  • Import DEG lists (logFC, P-value).
  • Load gene-set libraries (GO, KEGG, MSigDB, etc.).
  • Optionally accept custom sets.

5.2 Processing Flow

  1. Preparation: deduplicate/normalize genes (HGNC/ENSEMBL); define background.
  2. Computation: ORA (hypergeometric) for discrete lists; GSEA for ranked lists; GSVA for cell/sample-level activities.
  3. Statistics: compute P/FDR and ES/NES; filter by \(FDR<0.05\).
  4. Integration: map enriched pathways to clusters; auto-annotate cluster functions.
  5. Visualization: bubble/bar plots, GSEA curves, heatmaps; interactive drill-down.
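The FDR step in the flow above does not name a specific procedure. A common choice is the Benjamini-Hochberg adjustment, sketched here as an assumption rather than as the module's confirmed method.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = [q < 0.05 for q in fdr]
```

In this toy example three raw P-values fall below 0.05 but only one survives the adjustment, which is why the filter is applied to FDR rather than to raw P.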

5.3 Output Results

  • Enrichment table (pathway, category, P, FDR, gene count, ES/NES).
  • Key gene lists within enriched pathways.
  • Visual outputs: bubble plot, top-bar plot, GSEA curve, heatmap.
  • Export formats: CSV, Excel, JSON, PDF.

6. Output Explanation

  1. Data-level: pathway statistics (P, FDR, ES/NES); contributing genes; cluster–pathway matrix.
  2. Visualization-level: bubble plot (x=enrichment score; y=pathway; size=gene count), top-bar plot, GSEA trend, heatmap; optional network to show gene overlap.
  3. Statistical summary: number of significant pathways, direction (up/down), mean FDR; auto-generated report for publication/export.
Enrichment analysis visualization example

7. Module Features and Highlights

  1. Multi-algorithm integration: ORA, GSEA, GSVA, PAGE.
  2. Comprehensive databases: GO, KEGG, Reactome, MSigDB.
  3. Hierarchical analysis: single-cluster, multi-cluster, global comparisons.
  4. Statistical rigor: FDR correction and permutation testing.
  5. High interpretability: connects gene-level changes to pathway-level meaning.
  6. Extensible: custom gene sets and cross-species annotation.
  7. Dynamic visualization: interactive and filterable outputs.
  8. Seamless integration with Clustering, Mechanism, and Proportion modules.

Proportion Analysis Module

1. Module Purpose

The Proportion Analysis module quantifies how cell population fractions change across experimental conditions (e.g., control vs. treatment). Based on clustering results, it computes per-cluster fractions per sample, performs statistical testing, and visualizes composition shifts.

Main objectives:

  • Quantify cluster proportions under different conditions.
  • Detect statistically significant changes in composition.
  • Reveal structural remodeling under perturbations.
  • Provide bar/stacked/heatmap views and supply results to downstream modules.

2. Assumptions

  1. Label accuracy: clustering labels are biologically meaningful.
  2. Sample comparability: comparable depth/size so that proportions reflect biology.
  3. Independence: cell assignments across samples are treated as independent observations.
  4. Stability: replicates show similar proportions without perturbation.
  5. Closed population: total counts are fixed or normalized across samples.

3. Mathematical Model

The objective is to determine whether cell-type proportions differ significantly between conditions. Let:

  • \(N_{ij}\): number of cells in cluster \(j\) for sample \(i\);
  • \(n_i=\sum_j N_{ij}\): total cells in sample \(i\);
  • Cell proportion: \(\displaystyle p_{ij}=\frac{N_{ij}}{n_i}\).
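These definitions translate directly into a counting step. The sketch below works on per-cell sample and cluster labels; `cluster_proportions` and the toy labels are illustrative only.

```python
from collections import Counter, defaultdict

def cluster_proportions(sample_ids, cluster_labels):
    """Return {sample: {cluster: p_ij}} from per-cell sample and cluster labels."""
    counts = defaultdict(Counter)
    for sample, cluster in zip(sample_ids, cluster_labels):
        counts[sample][cluster] += 1               # N_ij
    proportions = {}
    for sample, per_cluster in counts.items():
        n_i = sum(per_cluster.values())            # n_i = sum over j of N_ij
        proportions[sample] = {c: n / n_i for c, n in per_cluster.items()}
    return proportions

samples = ["ctrl", "ctrl", "ctrl", "ctrl", "treat", "treat", "treat", "treat"]
clusters = ["A", "A", "B", "C", "A", "B", "B", "B"]
props = cluster_proportions(samples, clusters)
```

Clusters that are absent from a sample do not appear in its dictionary and should be read as a proportion of 0 when building the sample × cluster matrix.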
Composition illustration (mock): stacked fractions of Cluster A, Cluster B, and Cluster C for two groups (Control and Treatment).

3.1 Statistical Tests

(1) Proportion Difference Test (two groups) — Chi-square test for contingency counts.

\[ \chi^{2} \;=\; \sum_{j=1}^{k} \frac{(O_j - E_j)^{2}}{E_j}, \]

where \(j\) runs over the \(k\) cells of the contingency table, \(O_j\) is the observed count, and \(E_j\) is the expected count under the null hypothesis of equal composition. If \(P<0.05\), the cluster composition differs significantly between the groups.
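For a group × cluster count table, the expected count of each cell is (row total × column total) / grand total. The sketch below computes the statistic and its degrees of freedom only; converting it to a P-value needs the chi-square distribution, which is usually taken from a statistics library (for example `scipy.stats.chi2_contingency`, which by default applies a continuity correction to 2×2 tables).

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and cluster.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Rows: control, treatment; columns: cells inside / outside the cluster of interest.
chi2, dof = chi_square_statistic([[10, 20], [20, 10]])
```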

Fisher’s exact test — for small counts (2×2 table). The exact tail probability is computed from the hypergeometric distribution.

\[ P \;=\; \sum_{\text{tables } t \ \text{as or more extreme}} \frac{\,C(n_{1\cdot},\,t_{11})\,C(n_{2\cdot},\,t_{21})\,}{\,C(n_{\cdot\cdot},\,n_{\cdot 1})\,}, \] using \(C(n,k)=\frac{n!}{k!\,(n-k)!}\) for combinations.
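The sum over tables can be enumerated directly for a 2×2 table. The sketch below interprets "as or more extreme" as "no more probable than the observed table", which is one common two-sided convention and an assumption here; the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher exact P-value for a 2x2 table [[t11, t12], [t21, t22]]."""
    (t11, t12), (t21, t22) = table
    n1, n2 = t11 + t12, t21 + t22            # row totals
    c1 = t11 + t21                           # first column total
    total = n1 + n2

    def prob(x):                             # probability of a table with t11 = x
        return comb(n1, x) * comb(n2, c1 - x) / comb(total, c1)

    p_obs = prob(t11)
    lo, hi = max(0, c1 - n2), min(n1, c1)    # feasible range of the top-left cell
    # Small relative tolerance so ties with the observed table are included.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)
    )

p = fisher_exact_two_sided([[3, 1], [1, 3]])
```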

4. Data Sources

  • Input data:
    • Cluster labels (adata.obs['cluster']).
    • Condition labels (adata.obs['condition']).
    • Per-sample cell count matrices; sample metadata.
  • Output data:
    • Cell proportion table for each cluster across groups.
    • Statistical results (P, FDR, fold-change).
    • Visualizations (bar, stacked, heatmap).

5. Implementation

5.1 Data Input

  • Load cluster and condition labels; allow user-defined grouping (tissue, timepoint, etc.).
  • Normalize counts per sample to ensure comparability.

5.2 Processing Flow

  1. Aggregation & proportions: count \(N_{ij}\); compute \(p_{ij}=N_{ij}/n_i\); form a sample×cluster proportion matrix.
  2. Testing & significance: Chi-square / Fisher for two groups; ANOVA or Kruskal–Wallis for multiple groups; compute \(P\) and adjust FDR; flag \(FDR<0.05\).
  3. Visualization & reporting: bar, stacked, heatmap, volcano/bubble; annotate significant clusters; export tables.
  4. Integration: link proportion shifts with enrichment/mechanism results.
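The testing step of the flow above can be sketched as follows, assuming a Kruskal–Wallis test per cluster for multiple groups and Benjamini–Hochberg adjustment for the FDR. The source does not state which FDR procedure is used, so Benjamini–Hochberg is an assumption, as are the helper name `benjamini_hochberg` and all numbers.

```python
import numpy as np
from scipy.stats import kruskal

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest P value downward
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

# Hypothetical per-sample proportions of one cluster in three groups
group_a = [0.10, 0.12, 0.11]
group_b = [0.20, 0.22, 0.21]
group_c = [0.30, 0.31, 0.33]
h_stat, p_cluster = kruskal(group_a, group_b, group_c)

# Hypothetical P values, one per cluster (the first comes from the test above)
pvals = [p_cluster, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(pvals)
significant = fdr < 0.05   # clusters flagged at FDR < 0.05

print(h_stat, p_cluster, fdr, significant)
```

With only a few samples per group, rank-based tests have limited power, so flagged clusters are best read together with the effect size.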

5.3 Output Results

  • Proportion matrix across conditions.
  • Summary table (P, FDR, \(\log_{2}\)FC).
  • Visual outputs: bar, stacked, heatmap, volcano, bubble.
  • Exportable files (CSV, Excel, JSON).
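A summary table of this kind could be assembled and exported roughly as shown below. This is a sketch under stated assumptions: the column names, the pseudocount, and the values are hypothetical, and the export is shown as in-memory strings instead of files.

```python
import numpy as np
import pandas as pd

# Hypothetical mean proportions per cluster in two conditions
summary = pd.DataFrame({
    "cluster":   ["A", "B", "C"],
    "control":   [0.30, 0.50, 0.20],
    "treatment": [0.45, 0.35, 0.20],
})

# A small pseudocount (an assumption) avoids division by zero for empty clusters
eps = 1e-6
summary["log2FC"] = np.log2((summary["treatment"] + eps) / (summary["control"] + eps))

csv_text = summary.to_csv(index=False)           # CSV as a string
json_text = summary.to_json(orient="records")    # JSON list of row objects

print(csv_text)
print(json_text)
```

Passing a file path to `to_csv` or `to_json` writes the same content to disk.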
Figure: proportion analysis, visualization example.

6. Output Explanation

  1. Data-level: cluster×condition matrix; mean, sd, \(\log_{2}\)FC, P, FDR; markers for increases/decreases.
  2. Visualization: bar (per group), stacked (global composition), heatmap (sample comparison), volcano/bubble (effect vs. significance).
  3. Statistical report: summary of shifts; method details; multiple-testing correction summary.

7. Features and Highlights

  1. Automated quantification across conditions.
  2. Comprehensive testing (χ²/Fisher; ANOVA/Kruskal–Wallis).
  3. High interpretability through linkage with clustering and enrichment.
  4. Rich but clean visualization options.
  5. Statistical rigor with FDR control.
  6. Extensible to multi-condition/time-series/spatial data.
  7. Interactive exploration for quick insight.
  8. Seamless integration with Embedding, Clustering, Enrichment modules.

8. Summary and Resources

The Virtual Cell Atlas Explorer (VCAExplorer) forms an end-to-end, fault-tolerant, and interactive framework for liver-related single-cell research, integrating LaminDB connectivity, filtering, embedding, clustering, enrichment, and proportion analysis into a coherent pipeline for reproducible biological insight.

Scientific Contribution

  • Data-level integration: Harmonizes diverse single-cell data formats (.h5ad, .zarr, .loom, 10x .mtx) for unified downstream analysis.
  • Algorithmic innovation: Implements multi-stage fallback mechanisms to ensure successful dimensionality reduction and clustering even on noisy datasets.
  • Functional insight: Bridges molecular-level data (gene expression) with system-level phenomena (cell-type composition and pathway activation).
  • Reproducibility: Caching and metadata tracking make analyses repeatable across sessions and users.
  • Extensibility: Each module supports modular expansion (e.g., PHATE, scVelo, Harmony), maintaining long-term scalability.
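To illustrate the caching and metadata tracking mentioned above, the sketch below derives a deterministic cache key from the data source and run parameters and writes a metadata JSON next to where a cached .h5ad file would live. The function names, file naming scheme, and source string are hypothetical and do not describe VCAExplorer's actual cache layout.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cache_key(source, params):
    """Deterministic key: the same source and parameters always map to the same key."""
    # sort_keys makes the key independent of parameter order
    payload = json.dumps({"source": source, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def write_metadata(cache_dir, source, params):
    """Write run metadata; the matrix itself would be stored as <key>.h5ad."""
    key = cache_key(source, params)
    meta_path = Path(cache_dir) / f"{key}.meta.json"
    meta_path.write_text(json.dumps({"source": source, "params": params}, indent=2))
    return meta_path

with tempfile.TemporaryDirectory() as tmp:
    path = write_metadata(tmp, "local://liver.h5ad", {"n_pcs": 30, "resolution": 1.0})
    restored = json.loads(path.read_text())

print(restored)
```

Keying the cache on both source and parameters means a changed setting produces a new cache entry instead of silently reusing stale results.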

Impact across pipeline stages

Interactive chart: impact score per pipeline stage (Ingestion, Embedding, Clustering, Enrichment, Proportion) on a scale from 0 (none) to 5 (strong). For data integration, impact is strongest in ingestion and embedding.
Reproducibility checklist: cache (.h5ad + JSON), run metadata, user config.

Program Architecture Overview

Data Flow: Source → Filter → Embedding → Clustering → Enrichment → Proportion → Mechanism

  • Data Source: hf:// · LaminDB · S3/GCS · Local
  • Data/Filter: metadata subsetting & sync
  • Embedding Visualization: PCA fallback ready
  • Clustering: accepts PCA fallback input
  • Enrichment: ORA · GSEA · GSVA
  • Proportion: per-group fractions
  • Mechanism Analysis: integrative reasoning

Program Architecture: UI · API · Core Analytics · Storage & Cache · Cloud

  • Web UI (Streamlit): interactive controls & multi-format plots
  • API Layer: Load · Cache · Sanitize · Run
  • Core Analytics: scanpy · anndata · lamindb
  • Storage & Cache: local cache (.h5ad + meta.json)
  • Cloud & Data Sources: LaminDB · Hugging Face · S3/GCS

Design Features

  • Fully automated preprocessing, normalization, and metadata alignment.
  • Interactive Streamlit-based Web UI with multi-format visualization.
  • API layer for developers to embed VCAExplorer functions in larger workflows.
  • Cloud-compatible architecture supporting LaminDB, Hugging Face, and S3/GCS storage.

Error Handling

  • Smart auto-repair of corrupted or misaligned data matrices.
  • Automatic fallback to simpler embeddings (PCA) when advanced algorithms fail.
  • Detailed logs and metadata for reproducibility and debugging.
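The fallback behaviour listed above can be sketched as a wrapper that tries a primary embedding and falls back to PCA if it raises. The names `embed_with_fallback`, `pca_embedding`, and `failing_umap` are hypothetical, the PCA is a plain NumPy SVD instead of the tool's scanpy call, and replacing NaN/Inf with zeros is one simple repair policy assumed here for illustration.

```python
import numpy as np

def pca_embedding(X, n_components=2):
    """PCA coordinates via SVD of the column-centered matrix."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def embed_with_fallback(X, primary, n_components=2):
    """Try the primary embedding; fall back to PCA if it raises."""
    # Assumed repair policy: replace NaN/Inf with 0 before embedding
    X = np.nan_to_num(np.asarray(X, dtype=float), nan=0.0, posinf=0.0, neginf=0.0)
    try:
        return primary(X), "primary"
    except Exception as err:
        # Log the failure so the fallback is visible in the run record
        print(f"primary embedding failed ({err!r}); falling back to PCA")
        return pca_embedding(X, n_components), "pca"

def failing_umap(X):
    # Stand-in for an advanced method that fails on this input
    raise RuntimeError("simulated failure")

rng = np.random.default_rng(0)
coords, method = embed_with_fallback(rng.normal(size=(20, 5)), failing_umap)
print(method, coords.shape)
```

Returning the name of the method that was actually used lets downstream steps and logs record when a fallback occurred.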

Output and Application

Research Use Cases

  • ATRA mechanism studies in liver cancer and healthy hepatocytes.
  • Drug-response profiling and cellular phenotype tracking.
  • Immune & metabolic pathway analysis under treatment conditions.

Output Formats

  • Interactive visualizations: UMAP, t-SNE, PHATE, and bar/heat/volcano plots.
  • Exportable tables & metadata: .csv, .json, .h5ad.
  • Reports: summaries of cell composition, enrichment scores, and statistical significance.

Resources and Availability

  • Repository: Available on open-source hosting platforms (e.g., GitHub / Hugging Face). Contains the full codebase, documentation, and dependencies (requirements.txt, pyproject.toml).
  • Documentation: Detailed bilingual tutorials (English/Chinese) covering installation, workflow, and parameter configuration.
  • Databases: Integrated with LaminDB, Hugging Face datasets, and public single-cell resources (e.g., Human Cell Atlas, GEO).
  • Supported Formats: .h5ad, .zarr, .loom, .csv, .mtx, .zip archives (auto-detection and decompression supported).
  • Environment: Python ≥ 3.11, compatible with Windows, Linux, and macOS.
  • Visualization Libraries: Matplotlib, Seaborn, Plotly (3D and interactive support).
  • External APIs: LaminDB API, Hugging Face Hub, fsspec cloud storage connectors.

Future Directions

  • Mechanism Analysis Module: Integration of transcription factor (TF) networks, ligand–receptor interactions, and gene regulatory modeling.
  • Multi-omics Integration: Expansion to scATAC-seq and spatial transcriptomics for joint inference of chromatin–transcriptome coupling.
  • AI-driven Analysis: Machine learning modules for automated feature extraction and biomarker discovery.
  • Collaborative Research Platform: Cloud dashboards and database-linked annotations to enhance team sharing and review.

Conclusion

VCAExplorer delivers a comprehensive, transparent, and modular platform for single-cell liver research. It bridges computational rigor with biological interpretability—empowering scientists to move seamlessly from raw transcriptomic data to mechanistic understanding and therapeutic insight.

Through its open-source design, database connectivity, and reproducible architecture, the system sets a benchmark for integrated bioinformatics workflows in precision medicine and liver-targeted drug discovery.

Software · Desktop App

Demo (Desktop): A dedicated page to preview and explain the browser version of our software.
Opens a separate page that keeps the same look & feel as this site.