Software | SYPHU-CHINA

Overview Virtual Cell Atlas · Web Software

In liver-targeted drug delivery and single-cell omics, researchers face format fragmentation, complex pipelines, and limited cross-platform visualization. This web tool streamlines data loading, QC, analysis, and rich visualization to accelerate studies on the mechanism of action of all-trans retinoic acid (ATRA) in HCC and normal hepatocytes.

Problem Statement. Datasets are massive and heterogeneous (.h5ad, .zarr, .loom, 10x formats), workflows span cleaning → dimensionality reduction (DR) → clustering → trajectory, and collaboration suffers without unified interactive tools. Our design lowers the entry barrier while keeping outputs rigorous and reproducible.

Research Background & Challenges

Format & Scale

Massive single-cell transcriptomes
Diverse formats: .h5ad, .zarr, .loom, 10x (.h5/.mtx)

Pipeline Complexity

QC, cleaning, NaN/Inf repair
DR: PCA/UMAP/t-SNE, clustering & trajectory (PAGA)

Collaboration & Viz

Hard to share/replicate parameters
Lack of unified, interactive web UI

Highlights

Data Ingestion

Multi-source & multi-format

Sources: Hugging Face, LaminDB, S3/GCS, Local

Formats: .h5ad, .zarr, .loom, .csv/.tsv/.txt, 10x .mtx, .zip

Smart Repair & Cache

Auto-handle NaN/Inf, drop all-zero genes/cells
Auto-select best expression matrix (X/raw.X/layers)
Local caching for fast reloads

Analysis & Viz

PCA, UMAP, t-SNE, Leiden, PAGA
Matplotlib / Seaborn / Plotly options

Validated

Benchmarked on public DBs
iGEM experimental datasets

Tool Components

Core Backend

scanpy, anndata, lamindb
Caching: .h5ad + metadata JSON

Frontend

Streamlit web interface
Interactive parameters & plots

Dependencies

requirements.txt
pyproject.toml for reproducibility

Data Cache Layer

Auto-generate fast reload artifacts
Repeatable analysis across sessions

User Experience & API

Web UI

Upload, configure, visualize in browser
Intuitive panels & progress feedback

Unified API

Data loading & cache management
Easy extension / secondary dev

Compatibility

Windows / Linux / macOS
Recommended Python 3.11

Openness

GenBank / standard formats
Integrates with bioinformatics tools

Collaboration & Validation

Experimental Validation: Public single-cell DBs and iGEM team data show high consistency with experimental findings.
Collaboration: Deep integration with open-source platforms like LaminDB to ensure source reliability.
User Feedback: Piloted in small research teams; features iterated based on feedback.

Supported Formats & Data Sources (full list)

Formats: .h5ad, .zarr, .loom, .csv/.tsv/.txt, 10x .h5/.mtx triplets, .zip.

Sources: Local upload, Hugging Face Hub, LaminDB instances, public clouds (S3/GCS).


Data Preprocessing & QC

Principle

VCAExplorer (Virtual Cell Atlas Explorer) provides a reliable, fault-tolerant, and user-friendly environment for processing and visualizing liver-related single-cell transcriptomics.

Goals and Challenges

Diverse data formats that are hard to handle uniformly.
Inconsistent quality (NaN/Inf, all-zero cells/genes).
Complex steps (dimensionality reduction, clustering, trajectory) that may fail on outliers.
Repeated analyses without efficient caching.

Core Design Philosophy

Automatic Matrix Selection: pick the most informative matrix from X, raw.X, and layers, without manual work.
Intelligent Repair: replace NaN/Inf, remove all-zero rows/columns, and merge duplicates to stabilize inputs.
Hierarchical DR with Fallback: prefer PCA → UMAP/t-SNE → PAGA; if a step fails, automatically fall back to a PCA embedding so plots are always available.
Caching & Reproducibility: save results and metadata locally for fast reloads and reproducible runs.

Modular Extensibility: analysis logic is decoupled from the UI, making it easy to add new methods (e.g., PHATE, scVelo).
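The "Intelligent Repair" step can be illustrated with a minimal sketch. It assumes a small dense cells × genes matrix held as nested lists, and the function name `repair_matrix` and the fill value of 0 are hypothetical choices, not the tool's actual API. Merging duplicates is not covered here.

```python
import math

def repair_matrix(matrix, fill_value=0.0):
    """Replace NaN/Inf entries, then drop all-zero rows (cells) and columns (genes)."""
    # Step 1: replace non-finite values (NaN, +Inf, -Inf) with fill_value.
    cleaned = [
        [v if math.isfinite(v) else fill_value for v in row]
        for row in matrix
    ]
    # Step 2: keep only rows (cells) with at least one non-zero entry.
    row_idx = [i for i, row in enumerate(cleaned) if any(v != 0 for v in row)]
    # Step 3: keep only columns (genes) that are non-zero in at least one kept row.
    n_cols = len(cleaned[0]) if cleaned else 0
    col_idx = [j for j in range(n_cols) if any(cleaned[i][j] != 0 for i in row_idx)]
    repaired = [[cleaned[i][j] for j in col_idx] for i in row_idx]
    return repaired, row_idx, col_idx

# Toy input: one NaN, one Inf, one all-zero cell, two all-zero genes after repair.
raw = [
    [1.0, float("nan"), 0.0],
    [0.0, 0.0, 0.0],
    [2.0, float("inf"), 0.0],
]
repaired, kept_cells, kept_genes = repair_matrix(raw)
print(repaired, kept_cells, kept_genes)
```

Returning the kept row and column indices makes it possible to subset metadata and embeddings with the same selection, so the repaired matrix stays aligned with its annotations.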

QC & DR (simulated)

Zero-rate filter (threshold = 20%)

Illustrative distribution (legend: kept, removed, cutoff): samples with a zero-rate above 20% are flagged for removal; all others are retained.
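As a rough sketch of the zero-rate rule described above, the snippet below flags rows whose fraction of zero entries exceeds a cutoff. The helper names and the toy matrix are hypothetical, and the 20% default simply mirrors the illustrative threshold; rows are assumed to be non-empty.

```python
def zero_rate(values):
    """Fraction of entries in a row that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)

def split_by_zero_rate(matrix, threshold=0.20):
    """Return (kept, removed) row indices; rows strictly above the threshold are removed."""
    kept, removed = [], []
    for idx, row in enumerate(matrix):
        (removed if zero_rate(row) > threshold else kept).append(idx)
    return kept, removed

toy = [
    [5, 3, 1, 2, 4],  # 0% zeros
    [0, 3, 1, 2, 4],  # 20% zeros, not above the cutoff
    [0, 0, 1, 2, 4],  # 40% zeros
]
kept, removed = split_by_zero_rate(toy)
print(kept, removed)
```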

Data Processing Workflow

Figure: Data Processing Workflow.

Key Implementation Details

Data source integration: local uploads, Hugging Face Hub, LaminDB, and cloud storage; remote data fetched via a universal file-system interface and cached locally.
Dependencies and ecosystem: mainstream single-cell and data-management libraries; multiple plotting backends for visualization.
Cache management: results and metadata are saved with timestamps, source modes, and instance identifiers for quick reuse and auditability.
User configuration: personalized parameters are saved to keep sessions consistent.
Error handling: detect and fix row/column orientation; automatically switch algorithms when a DR method fails.
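The cache-management item above mentions timestamps, source modes, and instance identifiers. A minimal sketch of such a metadata sidecar is shown below; the file name `cache_meta.json`, the field names, and both helper functions are assumptions made for illustration, not the tool's actual cache layout.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def write_cache_metadata(cache_dir, source_mode, instance_id, n_cells, n_genes):
    """Write a JSON sidecar describing a cached dataset (hypothetical schema)."""
    meta = {
        "saved_at": datetime.now(timezone.utc).isoformat(),  # timestamp
        "source_mode": source_mode,                          # e.g. local, hf, lamindb
        "instance_id": instance_id,                          # instance identifier
        "n_cells": n_cells,
        "n_genes": n_genes,
    }
    path = Path(cache_dir) / "cache_meta.json"
    path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return path

def read_cache_metadata(cache_dir):
    """Return the cached metadata, or None when no cache entry exists."""
    path = Path(cache_dir) / "cache_meta.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    write_cache_metadata(tmp, "local", "demo-instance", n_cells=100, n_genes=50)
    loaded = read_cache_metadata(tmp)
print(loaded["source_mode"], loaded["n_cells"])
```

Keeping the metadata in a plain JSON file next to the cached data means a cache entry can be audited without loading the full dataset.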

Design Advantages

Plug-and-play: Users can perform data analysis without needing to understand underlying file structures.
Highly fault-tolerant: Built-in auto-repair and fallback mechanisms ensure an uninterrupted workflow.
Efficient and reproducible: Caching and metadata logging make repeated experiments more efficient.
Scalable: Supports additional algorithms and external database integration, facilitating team collaboration and future feature upgrades.

User Interface and Compatibility (Web UI, API, formats, UX)

Tutorials

1. Homepage Access

Overview: Launch the app, read the title bar, check the basic info (cell/gene counts, cache time, data source, matrix source), then move to the top navigation with eight modules.

This flow highlights initial checks before exploring modules.

2. Data/Filter Module

Process: Choose the tab → select metadata columns (cell_type, disease, cell_line, tissue) → pick values (multi-select) → see live counts and success prompt → set a grouping field for coloring/statistics.

Filtering narrows to the relevant cells/genes and prepares grouping for plots.

3. Embedding Visualization Module

Process: Enter the tab → optionally repair poor-quality data → select UMAP/t-SNE/PCA/PHATE → set 3D/facet/color-by options → inspect interactive scatter and optional density overlay.

Pick the embedding that best separates populations; density helps reveal structure.

4. Clustering Module

Process: Select Leiden/KMeans/HDBSCAN and parameters → run → results overlay on UMAP → evaluate with silhouette and ARI/NMI → optional cell cycle, gene-set scoring, doublet detection.

Tune resolution/cluster number to balance granularity and biological interpretability.

5. Enrichment Analysis Module

Process: Differential analysis (optional reference) → Volcano/MA with thresholds → Enrichr (db names, run, CSV) → GSEA prerank (group, score, gene-set DBs).

Use Volcano/MA to verify signal direction; Enrichr/GSEA summarize pathways.

6. Proportion Analysis Module

Process: Choose grouping/stratification → view proportion tables and stacked bars → build Sankey for category flows → run proportion difference tests and export CSV.

Proportion and flow views help compare cohorts and trace cell-state transitions.

7. General Operation Tips

Data Saving: Important results from each module can be downloaded through the “Export/Save” tab.
Interactivity: All charts are interactive, supporting zoom and hover for detailed information.
Real-time Feedback: The system displays progress status and completion prompts after operations.
Error Handling: If operations fail, the system shows detailed error messages and solution suggestions.

Data/Filter Module

1. Module Purpose

Core Purpose: The Data/Filter module serves as the entry checkpoint of the analysis workflow. Its primary goal is to allow researchers to quickly and intuitively extract specific cell populations of interest from large-scale single-cell datasets, thereby enabling targeted downstream analyses.

Main Functions:

Conditional filtering: Selects subsets of cells based on metadata (e.g., cell type, tissue of origin, disease state, experimental condition).
Population grouping: Groups filtered results by user-specified fields (e.g., cell_type, tissue), for visualization coloring and comparative analysis.
Data subsetting: Applies consistent subsetting across the expression matrix, metadata, and embedding matrices to preserve integrity.
Prepares inputs for downstream modules: Ensures the filtered dataset can directly feed into clustering, enrichment analysis, proportion analysis, and other modules.

In short, this module acts like a “data magnifier”, focusing the global view onto the specific populations of biological interest.

Concept overview: selection of metadata fields defines a Boolean mask, which is applied consistently to expression, metadata, and embeddings.

2. Assumptions

Data structure assumption: The input is a standard single-cell object (e.g., AnnData), containing expression data, metadata, and embeddings.
Metadata completeness: At least one categorical attribute (e.g., cell_type, tissue) is available for filtering.
Embedding availability: Dimensionality reduction results (UMAP/PCA/t-SNE/PHATE) have already been computed, or can be reconstructed after filtering.
Filter validity: User-selected conditions must match existing metadata values; if overly strict, the system must provide warnings.
Data consistency: Subsetting must update expression, metadata, and embeddings simultaneously to prevent misaligned analyses or visualizations.

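Filter logic guide:

Within a chosen field, multiple values are combined with OR (e.g., tissue = liver OR blood). Across fields, conditions are combined with AND, so a cell is kept only if it satisfies every chosen field.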

3. Mathematical Model

From a mathematical perspective, the Data/Filter module performs a set selection based on Boolean masking.

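Expression matrix \(X \in \mathbf{R}^{n\times p}\), where \(n\) = number of cells and \(p\) = number of genes.
Metadata matrix \(O\), with \(n\) rows and \(m\) annotation fields.
Embedding matrix \(E \in \mathbf{R}^{n\times d}\), representing each cell in a \(d\)-dimensional space.

User filtering conditions are defined as:

\[ F = \{\,(c,\,V(c)) : c \in C\,\} \]

Where: \(C\): the set of chosen metadata fields, e.g. \(\{\text{cell\_type}, \text{tissue}\}\). \(V(c)\): the allowed values within each field, e.g. \(\{\text{Tcell}, \text{Bcell}\}\).

For each cell \(i\), define a Boolean mask:

\[ \mathrm{mask}(i) = \bigwedge_{c\in C} \big[\, O_{i,c} \in V(c) \,\big] \]

That is, the mask is the AND over the chosen fields, and cell \(i\) is kept if its value in every chosen field belongs to the allowed set.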

The filtered dataset is:

X′ = X[mask, :]

O′ = O[mask, :]

E′ = E[mask, :]

Expression, metadata, and embeddings are subset simultaneously under the same mask (consistency guarantee).
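The masking model can be sketched in a few lines of plain Python. The data below are toy values and the helper names `build_mask` and `apply_mask` are hypothetical; with an AnnData object, indexing by a Boolean array (`adata[mask]`) subsets the matrix, `obs`, and `obsm` together in a comparable way.

```python
def build_mask(obs, conditions):
    """obs: one dict per cell; conditions: {field: set of allowed values}.

    Values within a field act as OR (set membership); fields are combined with AND.
    """
    return [
        all(row.get(field) in allowed for field, allowed in conditions.items())
        for row in obs
    ]

def apply_mask(rows, mask):
    """Keep the rows whose mask entry is True."""
    return [row for row, keep in zip(rows, mask) if keep]

obs = [
    {"cell_type": "Tcell", "tissue": "liver"},
    {"cell_type": "Bcell", "tissue": "blood"},
    {"cell_type": "hepatocyte", "tissue": "liver"},
    {"cell_type": "Bcell", "tissue": "liver"},
]
X = [[1, 0], [0, 2], [3, 3], [0, 5]]                    # expression (cells x genes)
E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]    # embedding (cells x 2)

conditions = {"cell_type": {"Tcell", "Bcell"}, "tissue": {"liver"}}
mask = build_mask(obs, conditions)
# The same mask is applied to all three tables, so rows stay aligned.
X_f, obs_f, E_f = (apply_mask(table, mask) for table in (X, obs, E))
print(mask, len(X_f))
```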

4. Data Sources

Inputs:

Expression matrix (cells × genes), either raw counts or normalized expression values.
Metadata (obs): cell annotations, such as cell_type, disease status, tissue origin, experimental condition.
Low-dimensional embeddings (obsm): UMAP/t-SNE/PCA/PHATE coordinates.
User input: chosen filtering fields and values.

Outputs:

A new subsetted AnnData object, containing:
- Filtered expression matrix (X')
- Filtered metadata (O')
- Synchronized embedding matrix (E')

5. Implementation (Conceptual Workflow)

Logical steps (code-agnostic description):

Detect available fields: The system scans metadata to identify categorical attributes suitable for filtering.
User selects conditions: The researcher specifies one or more fields and values (e.g., tissue = liver, cell_type = T cell).
Generate filtering mask: A Boolean vector is constructed to mark retained cells.
Apply subsetting:
- Expression matrix: rows corresponding to selected cells are kept.
- Metadata: rows are subset accordingly.
- Embedding matrices: rows are trimmed to maintain alignment.
Output and reporting: A new dataset is returned, and the system reports the number of cells and genes after filtering.

This ensures seamless integration into downstream modules without requiring additional preprocessing.
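The first and last workflow steps (detecting usable fields, and reporting instead of failing) might look like the sketch below. The heuristic for "categorical" (string-valued, at most `max_unique` distinct values) and both function names are assumptions for illustration only.

```python
def detect_categorical_fields(obs, max_unique=50):
    """Return fields whose values are all strings with few distinct values."""
    values_by_field = {}
    for row in obs:
        for key, value in row.items():
            values_by_field.setdefault(key, set()).add(value)
    return sorted(
        key for key, values in values_by_field.items()
        if all(isinstance(v, str) for v in values) and len(values) <= max_unique
    )

def check_filter(obs, conditions, min_cells=1):
    """Return (n_kept, warnings) rather than raising when a filter is too strict."""
    warnings = []
    for field, allowed in conditions.items():
        present = {row[field] for row in obs if field in row}
        if not present:
            warnings.append(f"field '{field}' is missing from metadata")
        elif not (set(allowed) & present):
            warnings.append(f"no value of '{field}' matches {sorted(allowed)}")
    n_kept = sum(
        all(row.get(f) in allowed for f, allowed in conditions.items())
        for row in obs
    )
    if n_kept < min_cells:
        warnings.append(f"only {n_kept} cells pass the filter")
    return n_kept, warnings

demo_obs = [
    {"cell_type": "Tcell", "tissue": "liver", "n_counts": 1200},
    {"cell_type": "Bcell", "tissue": "blood", "n_counts": 900},
]
fields = detect_categorical_fields(demo_obs)
n_kept, messages = check_filter(demo_obs, {"tissue": {"kidney"}})
print(fields, n_kept, messages)
```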


6. Outputs

The module produces results on two levels:

Data result:
- A new subsetted AnnData object, with consistent expression data, metadata, and embeddings.
- Ready-to-use for downstream visualization and analysis.
Interface feedback:
- Displays the number of cells before and after filtering (e.g., from 20,000 to 3,245).
- Gene count (usually unchanged, retaining the full gene set).
- Highlights selected grouping fields (e.g., cell_type) for coloring and statistical analysis.


7. Module Features and Highlights

User-friendly interaction:
- Researchers can apply complex filtering through simple multi-select operations, without coding.
High flexibility:
- Supports multi-field, multi-value combinations, enabling complex cohort comparisons (e.g., “T and B cells in liver tissue”).
Consistency guarantee:
- Ensures expression, metadata, and embeddings are always updated in sync, avoiding misalignment.
Robustness:
- Provides warnings when conditions are too strict or fields are missing, instead of failing.
Performance optimized:
- Based on sparse matrix and Boolean indexing, ensuring scalability to million-cell datasets.
Seamless downstream integration:
- Outputs can directly feed into Embedding Visualization, Clustering, Enrichment Analysis, Proportion Analysis, forming a coherent pipeline.

Embedding Visualization Module

1. Module Purpose

The Embedding Visualization module transforms high-dimensional single-cell gene expression matrices into low-dimensional spaces, enabling intuitive visualization of relationships among cells. It allows researchers to directly observe cell-type distributions, differentiation trajectories, and population structures.

Main Objectives:

Dimensionality reduction visualization: Project tens of thousands to millions of cells into 2D or 3D space.
Structural discovery: Reveal local and global structures such as clusters, gradients, and lineage transitions.
Metadata integration: Combine biological annotations (e.g., tissue, disease, cell type) for color-based grouping.
Exploratory analysis entry point: Serve as a visual gateway to downstream modules such as clustering, enrichment, and proportion analysis.

Concept overview: mapping high-dimensional expression \(X\) to a low-dimensional embedding \(Y\) for visual interpretation.

2. Assumptions

Preprocessing assumption: The input data has undergone quality control (QC), normalization, and batch correction.
Manifold assumption: The high-dimensional gene expression data lies on a low-dimensional manifold with meaningful geometry.
Distance preservation assumption: Similar cells remain close together in the embedded space, while distinct groups are separable.
Interpretability assumption: The embedding structure aligns with biological metadata to support meaningful interpretation.

3. Mathematical Model

The mathematical foundation of this module is dimensionality reduction, either linear (PCA) or nonlinear (t-SNE, UMAP, PHATE), which seeks a mapping that preserves the geometric or statistical relationships of the high-dimensional data in a lower-dimensional representation.

Given an expression matrix:

\[ X = [x_1, x_2, \ldots, x_n]^{\top} \in \mathbf{R}^{n\times p} \] where \(n\) is the number of cells and \(p\) is the number of genes, we aim to find a mapping:

\[ f:\ \mathbf{R}^{p} \rightarrow \mathbf{R}^{d},\quad d \in \{2,3\} \]

such that the embedded coordinates

\[ Y = f(X) \in \mathbf{R}^{n\times d} \] preserve as much structural information from \(X\) as possible.

Algorithms:

(1) PCA (Principal Component Analysis)

PCA finds a linear projection that maximizes variance: \[ Y = X W_d,\qquad W_d = [\,w_1, w_2, \ldots, w_d\,] \] where \(X\) is column-centered and \(W_d\) consists of the top \(d\) eigenvectors of \(X^{\top} X\) (those with the largest eigenvalues). Objective function: \[ \max_{W_d^{\top} W_d = I_d}\ \mathrm{Tr}\!\left(W_d^{\top}\, X^{\top} X\, W_d\right). \]
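To make the PCA objective concrete, here is a small pure-Python sketch that centers the columns of \(X\), forms \(X^{\top} X\), and extracts leading eigenvectors by power iteration with deflation. Power iteration is swapped in here only for illustration; in practice a library routine (for example scanpy's `sc.tl.pca`) would normally be used, and all function names below are hypothetical.

```python
import math

def center_columns(X):
    """Subtract each column's mean so that X^T X is proportional to the covariance."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def matvec(S, w):
    return [sum(s * x for s, x in zip(row, w)) for row in S]

def top_eigenvector(S, n_iter=500):
    """Power iteration for the leading eigenvector of a symmetric PSD matrix."""
    p = len(S)
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        v = matvec(S, w)
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        w = [x / norm for x in v]
    return w

def pca(X, d=2):
    """Return (Y, components): Y = Xc W_d with W_d the top-d eigenvectors of Xc^T Xc."""
    Xc = center_columns(X)
    p = len(Xc[0])
    S = [[sum(row[a] * row[b] for row in Xc) for b in range(p)] for a in range(p)]
    components = []
    for _ in range(d):
        w = top_eigenvector(S)
        lam = sum(wi * vi for wi, vi in zip(w, matvec(S, w)))  # Rayleigh quotient
        components.append(w)
        # Deflation: subtract the found component so the next pass finds the next one.
        S = [[S[a][b] - lam * w[a] * w[b] for b in range(p)] for a in range(p)]
    Y = [[sum(x * wj for x, wj in zip(row, w)) for w in components] for row in Xc]
    return Y, components

# Toy data lying almost exactly along the direction (1, 2, 0).
X_demo = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.1], [3.0, 6.0, -0.1], [4.0, 8.0, 0.0]]
Y_demo, W_demo = pca(X_demo, d=2)
print(W_demo[0])
```

The sign of each component is arbitrary, so comparisons should use absolute values or a fixed sign convention.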

4. Data Sources

Inputs:
- High-dimensional expression matrix (normalized and batch-corrected).
- Metadata annotations (e.g., cell type, tissue, disease) for grouping and coloring.
- Optional: filtered subsets from the Data/Filter module.
Outputs:
- Low-dimensional embedding matrix \(Y \in \mathbf{R}^{n\times d}\), stored in adata.obsm.
- Visualization configuration data (color schemes, scales, labels, and axes information).

5. Implementation

Algorithm selection: Users can choose PCA, t-SNE, UMAP, or PHATE.
Embedding computation: The system computes the mapping from high- to low-dimensional space and produces the coordinate matrix.
Result storage: The embedding and relevant metadata are stored in the AnnData object for reuse.
Visualization generation:
- 2D/3D scatter plots with color coding by metadata (e.g., cell_type, tissue, disease).
- Expression-based coloring (Feature plots).
- Cluster highlighting.
Interactive exploration: The interface allows zooming, rotation, and selection of specific cells for detailed analysis.

6. Outputs

Data-level outputs:
- Low-dimensional coordinate matrix \(Y\), containing each cell’s 2D/3D coordinates.
- Embedding parameters (e.g., number of neighbors, minimum distance, random seed) for reproducibility.
- Updated AnnData object ready for downstream analysis.
Visualization-level outputs:
- 2D scatter plots: Reveal overall structure, cluster boundaries, and transitional regions.
- 3D interactive plots: Enable rotation and depth exploration for complex systems.
- Multi-mode coloring: Supports overlays for metadata, gene expression, and clustering labels.
- Statistical summaries: Display the number of groups, density distributions, and explained variance ratios.
- Export options: Allow exporting images, embedding matrices, and parameter reports for publication or presentation.

7. Features and Highlights

Multi-algorithm integration: Combines linear (PCA) and nonlinear (t-SNE, UMAP, PHATE) approaches for flexible analysis.
High interpretability: Provides intuitive visualization of biological relationships and transitions.
Interactive exploration: Supports real-time zooming, labeling, and local analysis.
Layered visualization: Enables combined display of metadata and gene expression gradients.
High computational efficiency: Uses sparse matrices and neighbor-based acceleration for large-scale datasets.
Seamless integration: Directly connects with the Data/Filter module upstream and provides embeddings for downstream modules such as Clustering, Enrichment, and Proportion Analysis.

Clustering Module

1. Module Purpose

The Clustering module uncovers cellular heterogeneity in single-cell transcriptomics by grouping cells with similar gene-expression profiles into subpopulations, enabling identification of cell types, functional states, and developmental lineages.

Primary objectives:

Group high-dimensional single-cell data into biologically meaningful clusters.
Reveal hidden structures such as subpopulations and transitional states.
Provide cluster labels for downstream modules (enrichment, proportion, mechanism analysis).
Enable visual exploration in embedding spaces (UMAP / t-SNE).

2. Assumptions

Expression similarity: cells with similar transcriptional profiles tend to share biological functions.
Manifold: cells lie on a low-dimensional manifold; neighborhood relations carry structure.
Separability: distinct subgroups form compact, separable regions in the embedding space.
Embedding validity: input embeddings (e.g., PCA/UMAP) retain essential biological information.
Stability: under similar parameters, clustering is consistent and reproducible.

3. Mathematical Model

Let the embedded matrix after dimensionality reduction be

\[ Y = [\,y_1, y_2, \ldots, y_n\,]^{\top} \in \mathbf{R}^{\,n\times d}, \] where \(y_i \in \mathbf{R}^{\,d}\) is the embedding of the \(i\)-th cell. The goal is to divide all cells into \(k\) clusters

\[ \mathcal{C}=\{C_1,\ldots,C_k\},\quad \bigcup_{j=1}^{k} C_j=\{1,2,\ldots,n\},\quad C_a\cap C_b = \{\}\ \ (a\ne b) \]

and minimize intra-cluster variance while maximizing inter-cluster separation. With squared Euclidean distance and \(\mu_j\) the centroid of cluster \(C_j\), the objective is:

\[ \min_{\{\mu_j\},\,\mathcal{C}}\ \sum_{j=1}^{k}\ \sum_{i\in C_j}\ \|\,y_i-\mu_j\,\|^2 . \]

Algorithms:

(1) K-Means (centroid-based)

Objective (Within-Cluster Sum of Squares, WCSS):

\[ \min_{\{\mu_j\}}\ \sum_{j=1}^{k}\ \sum_{i\in C_j}\ \|\,y_i-\mu_j\,\|^2 . \]

Iteration: (i) assignment — send each \(y_i\) to the nearest centroid; (ii) update — recompute \(\mu_j\) as the mean of assigned points; repeat until convergence.
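The assignment/update iteration can be written out directly. The sketch below is a minimal Lloyd-style K-Means on toy 2D points; initial centroids are passed in explicitly to keep the run deterministic, which is an assumption of this example rather than a statement about how the tool initializes clusters.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, init_centroids, max_iter=100):
    """Return (labels, centroids, wcss) after iterating assignment and update steps."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # (i) assignment: send each point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # (ii) update: recompute each centroid as the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, wcss = kmeans(points, init_centroids=[points[0], points[3]])
print(labels, round(wcss, 4))
```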

4. Data Sources

Input:
- Low-dimensional embeddings (from PCA, UMAP, t-SNE, PHATE, etc.).
- Cell–cell similarity or adjacency matrix.
- Optional metadata for biological annotation.
Output:
- Cluster labels (cluster IDs).
- Cluster statistics (size, centroid, intra-variance, inter-distance, modularity).
- Updated AnnData object (adata.obs["cluster"]).

5. Implementation

5.1 Data Input

Load the embedding matrix from the Embedding Visualization module.
Auto-detect dense/sparse format.
Choose algorithm based on dataset size and structure.

5.2 Processing Flow

Neighborhood construction: compute pairwise distances (Euclidean / cosine) and build a kNN graph.
Algorithm & parameters: select K-Means / Leiden / GMM / Spectral; set \(k\), resolution, neighbor size, seed.
Clustering computation: run the selected algorithm to assign labels.
Evaluation & tuning: assess with Silhouette, Modularity, Inertia, or Davies–Bouldin; auto-tune if needed.
Result integration: store labels to AnnData and wire to visualization for color mapping.
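The neighborhood-construction step can be illustrated with a brute-force kNN graph under Euclidean distance. This is only a sketch for small inputs and the function name is hypothetical; graph-based methods such as Leiden are typically run on a neighbor graph built by a library (for example scanpy's `sc.pp.neighbors`), which uses faster approximate search.

```python
import math

def knn_graph(points, k):
    """Return {cell index: list of its k nearest neighbor indices} (Euclidean)."""
    graph = {}
    for i, p in enumerate(points):
        # Distances from cell i to every other cell, sorted ascending.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph

embedding = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
graph = knn_graph(embedding, k=1)
print(graph)
```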

5.3 Output Results

Cluster label per cell.
Per-cluster statistics (size, density, centroid).
Quality metrics (Silhouette, Modularity, Inertia, DB-Index).
Exportable results (CSV/JSON) and interactive visual components.

Quality metrics:

Silhouette (−1 to 1): higher is better; compares intra-cluster cohesion \(a(i)\) to nearest-cluster separation \(b(i)\).

\[ s(i)=\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}}. \]
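The silhouette formula translates almost line for line into code. The sketch below assumes Euclidean distance, at least two clusters, and the common convention of scoring singleton clusters as 0; the function name is hypothetical.

```python
import math

def silhouette_scores(points, labels):
    """Per-point silhouette s(i) = (b - a) / max(a, b)."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_scores(pts, [0, 0, 1, 1])   # labels follow the two groups
bad = silhouette_scores(pts, [0, 1, 0, 1])    # labels cut across the groups
print(sum(good) / len(good), sum(bad) / len(bad))
```

A well-separated labeling yields scores close to 1, while a labeling that cuts across the true groups yields negative scores.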

6. Output Explanation

Data-level: cluster IDs and centroids; cells per cluster; intra-variance; quality metrics; parameter logs (algorithm, seed, kNN size).
Visualization-level: UMAP/t-SNE with color-coded clusters; cluster-distance heatmaps; mean-expression heatmaps; interactive selection and metadata inspection.

Clustered embedding view — example page 1

Cluster statistics and controls — example page 2

7. Module Features and Highlights

Multi-Algorithm Integration: K-Means, Louvain/Leiden, GMM, Spectral.
Topology Preservation: graph-based modeling maintains neighborhood structure.
High Biological Interpretability: aligns with metadata for cell-type annotation.
Automatic Parameter Optimization: internal metrics for self-tuning.
Scalability: efficient for large datasets with sparse/GPU acceleration.
Seamless Integration: works with Embedding, Enrichment, Proportion modules.
Interactive Visualization: instant sync between labels and graphics.

Enrichment Analysis Module

1. Module Purpose

The Enrichment Analysis module bridges data-driven clustering and biological mechanism discovery. It tests specific clusters or DEG sets to find significantly enriched biological processes, molecular functions, pathways, and regulatory programs.

Main objectives:

Reveal functional characteristics and active pathways of each cluster.
Identify key processes (e.g., metabolism, immune response, differentiation).
Link gene expression changes to regulatory/signaling mechanisms.
Provide pathway-level input for mechanism analysis with interpretable visuals.

2. Assumptions

Functional modularity: genes act in modules (pathways/complexes) rather than in isolation.
Expression–function correlation: up/down-regulated genes imply activation/inhibition of functions.
Approximate independence: gene-level statistics are sufficiently independent for set-level inference.
Annotation reliability: curated databases (GO, KEGG, Reactome, MSigDB, etc.) are assumed to provide accurate gene-set annotations.
Sample representativeness: each analyzed cluster contains enough cells for robust statistics.

3. Mathematical Model

The core principle is to test whether a predefined functional gene set \(S\) is statistically over-represented in a target set \(T\).

Let: \(M\) = total background genes; \(M_S\) = genes in set \(S\); \(N\) = genes in target \(T\); \(k\) = overlap of \(S\) and \(T\).

Methods:

(1) Hypergeometric Test — for discrete sets (e.g., up-regulated DEGs).

\[ C(n,k) \;=\; \frac{n!}{\,k!\,(n-k)!\,}. \]

\[ P \;=\; 1 \;-\; \sum_{i=0}^{\,k-1} \frac{\,C(M_S,\,i)\; C(M-M_S,\,N-i)\,}{\,C(M,\,N)\,}\, . \]

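If \(P<0.05\), the pathway or set is considered significantly enriched for a single test (assuming sampling from the background universe); when many gene sets are tested, the FDR-adjusted threshold used in the processing flow applies.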
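The upper-tail probability can be evaluated exactly with integer binomial coefficients. The sketch below uses the same symbols as the formula (\(M\), \(M_S\), \(N\), \(k\)); the function name and the toy numbers are illustrative only.

```python
from math import comb

def hypergeom_enrichment_p(M, M_S, N, k):
    """P(overlap >= k) when drawing N genes from M, of which M_S belong to the set."""
    total = comb(M, N)
    # Sum of the terms for overlaps 0 .. k-1 (the lower tail).
    lower = sum(comb(M_S, i) * comb(M - M_S, N - i) for i in range(k))
    # Algebraically equal to 1 - lower/total, but computed on integers first
    # to avoid floating-point cancellation for very small P.
    return (total - lower) / total

# Toy example: 20 background genes, 5 in the set, 5 in the target, overlap of 3.
p_value = hypergeom_enrichment_p(M=20, M_S=5, N=5, k=3)
print(round(p_value, 5))
```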

4. Data Sources

Input:
- DE results (gene, logFC, P-value).
- Cluster labels from the Clustering module.
- Functional databases: GO, KEGG, Reactome, MSigDB, WikiPathways, TRANSFAC, TRRUST.
- Optional user-defined sets (GMT/CSV).
Output:
- Significant pathways/functions per cluster.
- Statistics (ES, NES, P-value, FDR) with visual and tabular summaries.

5. Implementation

5.1 Data Input

Load cluster information from the Clustering module.
Import DEG lists (logFC, P-value).
Load gene-set libraries (GO, KEGG, MSigDB, etc.).
Optionally accept custom sets.

5.2 Processing Flow

Preparation: deduplicate/normalize genes (HGNC/ENSEMBL); define background.
Computation: ORA (hypergeometric) for discrete lists; GSEA for ranked lists; GSVA for cell/sample-level activities.
Statistics: compute P/FDR and ES/NES; filter by \(FDR<0.05\).
Integration: map enriched pathways to clusters; auto-annotate cluster functions.
Visualization: bubble/bar plots, GSEA curves, heatmaps; interactive drill-down.
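The statistics step mentions FDR without naming a procedure. As one common choice, the sketch below implements the Benjamini-Hochberg adjustment; treating BH as the method used here is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw_p)
significant = [q < 0.05 for q in fdr]
print(fdr, significant)
```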

5.3 Output Results

Enrichment table (pathway, category, P, FDR, gene count, ES/NES).
Key gene lists within enriched pathways.
Visual outputs: bubble plot, top-bar plot, GSEA curve, heatmap.
Export formats: CSV, Excel, JSON, PDF.

6. Output Explanation

Data-level: pathway statistics (P, FDR, ES/NES); contributing genes; cluster–pathway matrix.
Visualization-level: bubble plot (x=enrichment score; y=pathway; size=gene count), top-bar plot, GSEA trend, heatmap; optional network to show gene overlap.
Statistical summary: number of significant pathways, direction (up/down), mean FDR; auto-generated report for publication/export.

Enrichment analysis visualization example

7. Module Features and Highlights

Multi-algorithm integration: ORA, GSEA, GSVA, PAGE.
Comprehensive databases: GO, KEGG, Reactome, MSigDB.
Hierarchical analysis: single-cluster, multi-cluster, global comparisons.
Statistical rigor: FDR correction and permutation testing.
High interpretability: connects gene-level changes to pathway-level meaning.
Extensible: custom gene sets and cross-species annotation.
Dynamic visualization: interactive and filterable outputs.
Seamless integration with Clustering, Mechanism, and Proportion modules.

Proportion Analysis Module

1. Module Purpose

The Proportion Analysis module quantifies how cell population fractions change across experimental conditions (e.g., control vs. treatment). Based on clustering results, it computes per-cluster fractions per sample, performs statistical testing, and visualizes composition shifts.

Main objectives:

Quantify cluster proportions under different conditions.
Detect statistically significant changes in composition.
Reveal structural remodeling under perturbations.
Provide bar/stacked/heatmap views and supply results to downstream modules.

2. Assumptions

Label accuracy: clustering labels are biologically meaningful.
Sample comparability: comparable depth/size so that proportions reflect biology.
Independence: cell assignments across samples are treated as independent observations.
Stability: replicates show similar proportions without perturbation.
Closed population: total counts are fixed or normalized across samples.

3. Mathematical Model

The objective is to determine whether cell-type proportions differ significantly between conditions. Let:

\(N_{ij}\): number of cells in cluster \(j\) for sample \(i\);
\(n_i=\sum_j N_{ij}\): total cells in sample \(i\);
Cell proportion: \(\displaystyle p_{ij}=\frac{N_{ij}}{n_i}\).
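The definitions of \(N_{ij}\), \(n_i\), and \(p_{ij}\) can be computed from two parallel label lists, one sample label and one cluster label per cell. The names and toy labels below are illustrative assumptions.

```python
from collections import Counter, defaultdict

def cluster_proportions(sample_labels, cluster_labels):
    """Return {sample: {cluster: p_ij}} with p_ij = N_ij / n_i."""
    counts = defaultdict(Counter)
    for sample, cluster in zip(sample_labels, cluster_labels):
        counts[sample][cluster] += 1          # N_ij
    clusters = sorted(set(cluster_labels))
    table = {}
    for sample, counter in counts.items():
        n_i = sum(counter.values())           # total cells in sample i
        table[sample] = {c: counter[c] / n_i for c in clusters}
    return table

samples = ["ctrl"] * 4 + ["treat"] * 4
clusters = ["A", "A", "B", "C", "A", "B", "B", "B"]
props = cluster_proportions(samples, clusters)
print(props)
```

Clusters absent from a sample get a proportion of 0, so every row of the table covers the same set of clusters and each row sums to 1.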

Composition illustration (mock): stacked fractions of Clusters A, B, and C for the Control and Treatment groups.

Statistical Tests

(1) Proportion Difference Test (two groups) — Chi-square test for contingency counts.

\[ \chi^{2} \;=\; \sum_{j=1}^{k} \frac{(O_j - E_j)^{2}}{E_j}, \]

where \(O_j\) is the observed count and \(E_j\) is the expected count under the null hypothesis. If \(P<0.05\), that cell-type proportion differs significantly.
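As a worked sketch of the chi-square statistic, the code below computes expected counts from the row and column totals of a groups × clusters table. It assumes all margins are non-zero, and the closed-form p-value shown is valid only for one degree of freedom; for other cases a chi-square distribution function from a statistics library would be needed. Function names and counts are illustrative.

```python
import math

def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand  # E under independence
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof

def chi_square_p_one_dof(stat):
    """Upper-tail p-value for exactly one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Rows: control, treatment. Columns: cells in the cluster, cells in all other clusters.
counts = [[30, 10], [20, 20]]
stat, dof = chi_square_statistic(counts)
p_one = chi_square_p_one_dof(stat)
print(round(stat, 4), dof, round(p_one, 4))
```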

Fisher’s exact test — for small counts (2×2 table). The exact tail probability is computed from the hypergeometric distribution.

\[ P \;=\; \sum_{\text{tables } t \ \text{as or more extreme}} \frac{\,C(n_{1\cdot},\,t_{11})\,C(n_{2\cdot},\,t_{21})\,}{\,C(n_{\cdot\cdot},\,n_{\cdot 1})\,}, \] using \(C(n,k)=\frac{n!}{k!\,(n-k)!}\) for combinations.
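Fisher's exact test for a 2×2 table can be computed directly from the hypergeometric probabilities. In the sketch below, "as or more extreme" is taken to mean tables whose probability does not exceed that of the observed table, which is one common two-sided convention and an assumption of this example; the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value for the table [[a, b], [c, d]] with fixed margins."""
    r1, r2 = a + b, c + d      # row totals n_1. and n_2.
    c1 = a + c                 # first column total n_.1
    n = r1 + r2                # grand total n_..
    denom = comb(n, c1)

    def prob(t11):
        # Probability of the table whose top-left entry is t11.
        return comb(r1, t11) * comb(r2, c1 - t11) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)  # feasible range of t11
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(t) for t in range(lo, hi + 1) if prob(t) <= p_obs * (1 + 1e-9))

p_fisher = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_fisher, 4))
```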

4. Data Sources

Input data:
- Cluster labels (adata.obs['cluster']).
- Condition labels (adata.obs['condition']).
- Per-sample cell count matrices; sample metadata.
Output data:
- Cell proportion table for each cluster across groups.
- Statistical results (P, FDR, fold-change).
- Visualizations (bar, stacked, heatmap).

5. Implementation

5.1 Data Input

Load cluster and condition labels; allow user-defined grouping (tissue, timepoint, etc.).
Normalize counts per sample to ensure comparability.

5.2 Processing Flow

Aggregation & proportions: count \(N_{ij}\); compute \(p_{ij}=N_{ij}/n_i\); form a sample×cluster proportion matrix.
Testing & significance: Chi-square / Fisher for two groups; ANOVA or Kruskal–Wallis for multiple groups; compute \(P\) and adjust FDR; flag \(FDR<0.05\).
Visualization & reporting: bar, stacked, heatmap, volcano/bubble; annotate significant clusters; export tables.
Integration: link proportion shifts with enrichment/mechanism results.

5.3 Output Results

Proportion matrix across conditions.
Summary table (P, FDR, \(\log_{2}\)FC).
Visual outputs: bar, stacked, heatmap, volcano, bubble.
Exportable files (CSV, Excel, JSON).

Proportion analysis — visualization example

6. Output Explanation

Data-level: cluster×condition matrix; mean, sd, \(\log_{2}\)FC, P, FDR; markers for increases/decreases.
Visualization: bar (per group), stacked (global composition), heatmap (sample comparison), volcano/bubble (effect vs. significance).
Statistical report: summary of shifts; method details; multiple-testing correction summary.

7. Features and Highlights

Automated quantification across conditions.
Comprehensive testing (χ²/Fisher; ANOVA/Kruskal–Wallis).
High interpretability through linkage with clustering and enrichment.
Rich but clean visualization options.
Statistical rigor with FDR control.
Extensible to multi-condition/time-series/spatial data.
Interactive exploration for quick insight.
Seamless integration with Embedding, Clustering, Enrichment modules.

8. Summary and Resources

The Virtual Cell Atlas Explorer (VCAExplorer) forms an end-to-end, fault-tolerant, and interactive framework for liver-related single-cell research, integrating LaminDB connectivity, filtering, embedding, clustering, enrichment, and proportion analysis into a coherent pipeline for reproducible biological insight.

Scientific Contribution

Data-level integration: Harmonizes diverse single-cell data formats (.h5ad, .zarr, .loom, 10x .mtx) for unified downstream analysis.
Algorithmic innovation: Implements multi-stage fallback mechanisms to ensure successful dimensionality reduction and clustering even on noisy datasets.
Functional insight: Bridges molecular-level data (gene expression) with system-level phenomena (cell-type composition and pathway activation).
Reproducibility: Caching and metadata tracking support reproducible analyses across sessions and users.
Extensibility: Each module supports modular expansion (e.g., PHATE, scVelo, Harmony), maintaining long-term scalability.

Impact across pipeline stages (scale: 0 = none to 5 = strong), shown for Ingestion, Embedding, Clustering, Enrichment, and Proportion. Data integration is strongest in ingestion and embedding.

Reproducibility checklist: cache saved (.h5ad + JSON), run metadata tracked, user config persisted.

Program Architecture Overview

Data Source: hf:// · LaminDB · S3/GCS · Local
Data/Filter: metadata subsetting & sync
Embedding Visualization: PCA fallback ready
Clustering: accepts PCA fallback input
Enrichment: ORA · GSEA · GSVA
Proportion: per-group fractions
Mechanism Analysis: integrative reasoning

Program Architecture layers: UI · API · Core Analytics · Storage & Cache · Cloud

Design Features

Fully automated preprocessing, normalization, and metadata alignment.
Interactive Streamlit-based Web UI with multi-format visualization.
API layer for developers to embed VCAExplorer functions in larger workflows.
Cloud-compatible architecture supporting LaminDB, Hugging Face, and S3/GCS storage.

Error Handling

Smart auto-repair of corrupted or misaligned data matrices.
Automatic fallback to simpler embeddings (PCA) when advanced algorithms fail.
Detailed logs and metadata for reproducibility and debugging.
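
The fallback behaviour described above (try the advanced method, drop to a simpler one on failure, and log what happened) can be sketched generically. The helper `run_with_fallback`, the logger name, and the two stand-in functions are assumptions for illustration; in practice the callables would wrap the real embedding routines.

```python
import logging
from typing import Any, Callable, List, Sequence, Tuple

logger = logging.getLogger("fallback_demo")  # hypothetical logger name


def run_with_fallback(stages: Sequence[Tuple[str, Callable[[], Any]]]):
    """Try each (name, callable) in order; return the first result that succeeds."""
    failures: List[Tuple[str, str]] = []
    for name, compute in stages:
        try:
            result = compute()
        except Exception as exc:
            # Record the failure so it can be stored with the run metadata
            logger.warning("%s failed (%s); trying next method", name, exc)
            failures.append((name, repr(exc)))
        else:
            return name, result, failures
    raise RuntimeError(f"all methods failed: {failures}")


def failing_umap():
    """Stand-in for an advanced embedding that breaks on noisy input."""
    raise ValueError("neighbor graph construction failed")


def simple_pca():
    """Stand-in for the simpler embedding used as the last resort."""
    return [[0.0, 1.0], [1.0, 0.0]]


method, coords, failures = run_with_fallback(
    [("umap", failing_umap), ("pca", simple_pca)]
)
print(method, failures)
```

Returning the list of failures alongside the result keeps the fallback visible to the user, so a PCA plot is never mistaken for the embedding that was originally requested.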

Resources and Availability

Resource Type	Description
Repository	Available on open-source hosting platforms (e.g., GitHub / Hugging Face). Contains full codebase, documentation, and dependencies (`requirements.txt`, `pyproject.toml`).
Documentation	Detailed bilingual tutorials (English/Chinese) covering installation, workflow, and parameter configuration.
Databases	Integrated with LaminDB, Hugging Face datasets, and public single-cell resources (e.g., Human Cell Atlas, GEO).
Supported Formats	.h5ad, .zarr, .loom, .csv, .mtx, .zip archives (auto-detection and decompression supported).
Environment	Python ≥ 3.11, compatible with Windows, Linux, and macOS.
Visualization Libraries	Matplotlib, Seaborn, Plotly (3D and interactive support).
External APIs	LaminDB API, Hugging Face Hub, `fsspec` cloud storage connectors.
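
Format auto-detection for the supported formats in the table above could, in its simplest form, key off the file suffix. The sketch below is one possible approach, not the tool's actual logic: `detect_format`, the mapping, and the example URIs are hypothetical, and a real implementation would likely also inspect file contents and archive members.

```python
from pathlib import PurePosixPath

# Suffix-to-format mapping built from the "Supported Formats" row (assumed labels)
SUFFIX_TO_FORMAT = {
    ".h5ad": "h5ad",
    ".zarr": "zarr",
    ".loom": "loom",
    ".csv": "csv",
    ".mtx": "mtx",
    ".zip": "zip",
}


def detect_format(uri: str) -> str:
    """Guess the data format from the last suffix of a local path or remote URI."""
    # Strip a trailing slash so directory-style stores such as .zarr still match
    suffix = PurePosixPath(uri.rstrip("/")).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"unsupported format: {uri!r}") from None


# Example URIs are invented for illustration
for uri in ["hf://datasets/org/repo/liver.h5ad", "s3://bucket/atlas.zarr/", "data/MATRIX.MTX"]:
    print(uri, "->", detect_format(uri))
```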

Future Directions

Direction	Description
Mechanism Analysis Module	Integration of transcription factor (TF) networks, ligand–receptor interactions, and gene regulatory modeling.
Multi-omics Integration	Expansion to scATAC-seq and spatial transcriptomics for joint inference of chromatin–transcriptome coupling.
AI-driven Analysis	Machine learning modules for automated feature extraction and biomarker discovery.
Collaborative Research Platform	Cloud dashboards and database-linked annotations to enhance team sharing and review.

Conclusion

VCAExplorer delivers a comprehensive, transparent, and modular platform for single-cell liver research. It combines computational rigor with biological interpretability, helping scientists move from raw transcriptomic data to mechanistic understanding and therapeutic insight.

Through its open-source design, database connectivity, and reproducible architecture, the system sets a benchmark for integrated bioinformatics workflows in precision medicine and liver-targeted drug discovery.

Software · Desktop App

Demo (Desktop): A dedicated page to preview and explain the browser version of our software.

