Description | SYPHU-CHINA - iGEM 2025

Description

1. Overview

This software connects ATRA exposure to single-cell outcomes by combining robust statistics and practical modeling. Starting from scRNA-seq matrices, it cleans and normalizes data, finds differential genes, prioritizes targets with machine learning, simulates in-silico perturbations that mimic ATRA regulation, and explains mechanisms via enrichment analysis. Results are presented with compact figures and exportable tables; the same pipeline can be driven through a simple REST API.

DEG: Welch / Mann–Whitney · ML: LASSO / Random Forest · GSEA + KEGG · UMAP / PCA · REST API · CSV/TSV/XLSX

Inputs. A count matrix (genes × cells) and minimal metadata (condition, optional cell types, run/batch). Outputs. A curated DEG table (p/q, log2FC), target rankings with feature importance, dose–response parameters for key readouts, mechanism summaries (NES/FDR/leading edge), and publication-ready charts (PNG/SVG) plus machine-readable tables (CSV/XLSX).

Single-pass workflow. QC (min genes/cell, max MT%) → normalization + log transform → optional batch correction → PCA/UMAP → group contrasts with FDR control → sparse & non-linear models to rank targets → simulate ATRA-like up/down changes at gene level → summarize mechanisms with GSEA/KEGG and concise visuals. Every step records parameters and provenance for reproducibility.
Key formulas.
\[ \text{log2FC}(g) = \log_2\!\Big(\frac{\bar x_{g,\mathrm{treated}}+\varepsilon}{\bar x_{g,\mathrm{control}}+\varepsilon}\Big) \tag{1} \]
\[ E(c) = E_{\min} + \frac{E_{\max}-E_{\min}}{1+\left(\frac{EC_{50}}{c}\right)^{n}} \tag{2} \]
\[ \text{DEG} = \big\{ g \;|\; q_g \le 0.05 \;\land\; |\text{log2FC}(g)| \ge 1.00 \big\} \tag{3} \]
\[ S_g = 0.40\,\mathrm{Importance}_g + 0.40\,E_{g,\text{hepatoma}} - 0.20\,E_{g,\text{normal}} \tag{4} \]
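As a rough illustration of Eqs. (1), (3), and (4), the sketch below evaluates them with NumPy on a toy genes × cells matrix. The function names, the toy values, and the choice of pseudocount are assumptions made for this example; they are not taken from the released pipeline.

```python
import numpy as np

def log2fc(treated, control, eps=1e-9):
    """Eq. (1): log2 ratio of group means with pseudocount eps.
    treated, control: arrays of shape (genes, cells)."""
    return np.log2((treated.mean(axis=1) + eps) / (control.mean(axis=1) + eps))

def deg_mask(q, lfc, q_max=0.05, lfc_min=1.0):
    """Eq. (3): keep genes with q <= q_max and |log2FC| >= lfc_min."""
    return (np.asarray(q) <= q_max) & (np.abs(lfc) >= lfc_min)

def target_score(importance, e_hepatoma, e_normal, w=(0.40, 0.40, 0.20)):
    """Eq. (4): weighted importance plus hepatoma term minus normal term."""
    return w[0] * importance + w[1] * e_hepatoma - w[2] * e_normal

# Toy data: 3 genes x 2 cells per group (illustrative values only)
treated = np.array([[8.0, 8.0], [1.0, 1.0], [2.0, 2.0]])
control = np.array([[2.0, 2.0], [1.0, 1.0], [8.0, 8.0]])
lfc = log2fc(treated, control, eps=0.0)
q = np.array([0.01, 0.01, 0.20])          # hypothetical BH-adjusted p-values
is_deg = deg_mask(q, lfc)
score = target_score(np.array([1.0, 0.5, 0.0]),
                     np.array([1.0, 0.0, 0.0]),
                     np.array([0.0, 0.0, 1.0]))
```

Only the first gene passes both cutoffs here: the second has no fold change and the third fails the q threshold.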
Thresholds. The significance and effect-size cutoffs in Eq. (3) are adjustable parameters.

Heatmap: clustered row-z scores with a clear legend; ideal for multi-gene patterns.

Target score. The weights on importance and expression deltas in Eq. (4) can be tuned.

Perturbation rule. A down-modulation factor α is applied to the target gene in hepatoma cells only; Eq. (5) gives the general rule and Eq. (5') the case α = 0.50.

Perturbation equations.
\[ X^{\text{post}}_{g,c} = \begin{cases} \alpha\,X^{\text{pre}}_{g,c}, & \text{if } c \in \text{hepatoma} \ \land\ g=g_{\text{target}} \\ X^{\text{pre}}_{g,c}, & \text{otherwise} \end{cases} \tag{5} \]
\[ X^{\text{post}}_{g,c} = \begin{cases} 0.50\,X^{\text{pre}}_{g,c}, & \text{if } c \in \text{hepatoma} \ \land\ g=g_{\text{target}} \\ X^{\text{pre}}_{g,c}, & \text{otherwise} \end{cases} \tag{5'} \]
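A minimal sketch of Eq. (5), assuming a dense genes × cells array and plain label arrays. The function name `perturb_target`, the label `"hepatoma"`, and the gene names `G1`/`G2` are hypothetical and used only for illustration.

```python
import numpy as np

def perturb_target(X_pre, gene_names, cell_labels, target, alpha=0.5, group="hepatoma"):
    """Eq. (5): scale the target gene by alpha in cells of `group`; leave the rest unchanged.
    X_pre is genes x cells, matching the indices (g, c) in the equation."""
    X_post = np.array(X_pre, dtype=float)        # copy so X_pre is not modified
    g = np.asarray(gene_names) == target         # row mask: the target gene
    c = np.asarray(cell_labels) == group         # column mask: hepatoma cells
    X_post[np.ix_(g, c)] *= alpha
    return X_post

X_pre = np.array([[4.0, 4.0, 4.0],
                  [2.0, 2.0, 2.0]])
X_post = perturb_target(X_pre, ["G1", "G2"], ["hepatoma", "normal", "hepatoma"],
                        target="G1", alpha=0.5)
```

With α = 0.5 this reproduces Eq. (5'): only the `G1` entries in hepatoma columns are halved.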
Dose–response.

The curve in Eq. (2) is controlled by E_min, E_max, EC50, and n. In the full pipeline, bind these to fitted values/CI per gene or pathway.
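To make Eq. (2) concrete, the following sketch evaluates the Hill curve at a few doses. The parameter values are arbitrary placeholders, not fitted estimates.

```python
import numpy as np

def hill(c, e_min, e_max, ec50, n):
    """Eq. (2): E(c) = E_min + (E_max - E_min) / (1 + (EC50 / c)^n), for c > 0."""
    c = np.asarray(c, dtype=float)
    return e_min + (e_max - e_min) / (1.0 + (ec50 / c) ** n)

doses = np.array([0.1, 1.0, 10.0])                         # placeholder doses
resp = hill(doses, e_min=0.0, e_max=1.0, ec50=1.0, n=2.0)  # placeholder parameters
```

At c = EC50 the response sits exactly halfway between E_min and E_max, which is a quick sanity check for any fitted curve.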

Compact outputs you can expect.
Item | What it shows
DEG table | gene, log2FC, p, q, direction, comments
Target ranking | importance, model votes, confidence note
Dose–response | EC50, Emax, Hill n, fit quality
Mechanisms | NES, FDR, leading edge genes
Figures | heatmap / UMAP overlays / violin plots

Provenance is captured in a small metadata.json (QC thresholds, normalization method, batch correction settings, versions).
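One way such a provenance file could be assembled is sketched below. The key names and the helper `build_metadata` are assumptions for illustration; the actual schema of metadata.json is not specified here. The QC values mirror the Scanpy defaults used later in this document.

```python
import json
import platform

def build_metadata(qc, normalization, batch_correction, versions):
    """Collect run parameters into one JSON-serializable dict (hypothetical schema)."""
    return {
        "qc": qc,
        "normalization": normalization,
        "batch_correction": batch_correction,
        "versions": versions,
    }

meta = build_metadata(
    qc={"min_genes": 200, "min_cells": 3, "max_mito_fraction": 0.2},
    normalization="library-size + log1p",
    batch_correction={"method": "none"},
    versions={"python": platform.python_version()},
)
text = json.dumps(meta, indent=2, sort_keys=True)
# To persist: open("metadata.json", "w").write(text)
```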

More on preprocessing, statistics, and compatibility
  1. Preprocessing. Doublet screening if applicable; library-size normalization, log1p, z-scaling for HVGs; batch correction via ComBat/Harmony/scVI depending on design.
  2. DEG. Welch’s t-test or Mann–Whitney; multiplicity control with Benjamini–Hochberg; typical cutoffs q ≤ 0.05, |log2FC| ≥ 1 (matching Eq. (3)) but exposed as parameters.
  3. Target models. LASSO emphasizes sparsity; Random Forest captures non-linear interactions; cross-validation and lightweight hyper-parameter tuning to stabilize rankings.
  4. Mechanisms. ORA/GSEA over KEGG/Reactome; report NES/FDR and show pathway localization on UMAP via ssGSEA/AUCell scores.
  5. Formats & API. Input CSV/TSV/XLSX; outputs as PNG/SVG and CSV/XLSX; REST endpoints cover upload → run → poll → fetch.

2. Data & Methods

What we contributed.
  • Clean QC pack: dataset-aware HVG caps, adaptive MT% filter, and replicate-stratified sanity checks.
  • Transparent target ranking: a fixed, documented combination of sparse (LASSO) and non-linear (RF) importances with exportable rationales.
  • ATRA-consistent perturbation module: single-switch down-regulation in hepatoma cells, aggregated to pathway scores with a calibrated sensitivity.
  • CLEAR GSEA reporting: baseline-aligned curve, explicit leading-edge, and per-sample localization; figures and tables rendered once for the report.
  • Reproducible packaging: every job emits a metadata.json, version pins, and a compact REST result bundle.

Preprocess & QC

QC · Normalization · HVG · Batch

We filter low-quality cells/genes, normalize by library size, apply log1p, select HVGs, and (when required) correct batches. The thresholds below are our defaults for this release.

Cells: 72% · Genes: 85%

Numbers reflect our benchmark runs and are fixed here for clarity.

Differential Analysis

We report gene-level changes between treated vs. control with multiple-testing control. Exact tests follow established practice; our contribution is the report contract: stable thresholds, audit fields, and exportable CSVs used downstream.

  • Defaults: FDR ≤ 0.05; |log2FC| ≥ 1.0 (dataset-tunable, but fixed for the report).
  • Outputs: ranked table with effect size, FDR, and provenance columns.
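The per-gene test plus multiple-testing control can be sketched as follows, assuming dense genes × cells arrays for the two groups. `bh_adjust` and `deg_pvalues` are illustrative names, and the Benjamini–Hochberg step is written out by hand rather than taken from the pipeline.

```python
import numpy as np
from scipy import stats

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest p downwards, then cap at 1
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    return q

def deg_pvalues(treated, control, method="welch"):
    """Per-gene two-sided test on genes x cells matrices; returns (p, q)."""
    p = np.empty(treated.shape[0])
    for i in range(treated.shape[0]):
        if method == "welch":
            p[i] = stats.ttest_ind(treated[i], control[i], equal_var=False).pvalue
        else:
            p[i] = stats.mannwhitneyu(treated[i], control[i],
                                      alternative="two-sided").pvalue
    return p, bh_adjust(p)

q_demo = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

The resulting q-values feed directly into the FDR ≤ 0.05 rule above; effect size is checked separately.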

Target Prioritization

We combine LASSO (interpretability) and Random Forest (non-linearity) into a single fixed score. We also export the per-gene rationale (coefficients, split gains) to keep selection auditable.

G1 56% · G2 42% · G3 34% · G4 24% · G5 17%

Exact values come from the run artifacts; here we show the fixed layout used in reports.

In-silico Perturbation

We simulate an ATRA-like down-regulation on target genes in hepatoma cells and summarize the expected pathway movement. This module is deterministic in the report (single α, fixed sensitivity), so results remain comparable across datasets.

  • Scope: hepatoma cells only; normal compartments are left unchanged for contrast.
  • Output: per-pathway delta table and PNG overlays for quick review.

Mechanism Enrichment (CLEAR GSEA)

We keep enrichment simple and readable: a baseline-aligned running-sum curve, shaded leading-edge, and a small summary table. No sliders or tooltips—just the figure used in the paper/report.

[Figure: CLEAR GSEA running-sum curve with the peak marked, the leading edge shaded, and the RS = 0 baseline.]
Pathway | NES | FDR | Leading edge size
Retinoid signaling | +2.1 | 0.004 | 34
Cell-cycle down | −1.8 | 0.012 | 27
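For readers unfamiliar with the running-sum curve, here is a minimal sketch of the standard weighted GSEA statistic. It assumes a gene list already sorted by decreasing score and at least one set member with a non-zero score. NES and FDR require permutation nulls and are not computed here; the function name and toy genes are illustrative.

```python
import numpy as np

def running_sum(ranked_genes, scores, gene_set, p=1.0):
    """Weighted running sum over a ranked gene list.
    Returns (running sum, enrichment score, leading-edge genes)."""
    in_set = np.isin(ranked_genes, list(gene_set))
    w = np.abs(np.asarray(scores, dtype=float)) ** p
    hit = np.where(in_set, w, 0.0)
    hit = hit / hit.sum()                                  # steps up at set members
    miss = np.where(in_set, 0.0, 1.0 / (~in_set).sum())    # steps down elsewhere
    rs = np.cumsum(hit - miss)                             # starts from the RS = 0 baseline
    peak = int(np.argmax(np.abs(rs)))
    es = float(rs[peak])
    # leading edge: members up to the peak (positive ES) or from the peak on (negative ES)
    idx = np.arange(len(rs))
    edge = in_set & (idx <= peak if es >= 0 else idx >= peak)
    return rs, es, [g for g, e in zip(ranked_genes, edge) if e]

genes = ["A", "B", "C", "D", "E", "F"]
rs, es, leading = running_sum(genes, [3, 2, 1, -1, -2, -3], {"A", "B"})
```

In this toy case both set members sit at the top of the ranking, so the curve peaks after the second gene and the leading edge is the whole set.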

3. Outputs & Reporting

What you get from one run: compact figures for quick review, machine-readable tables for reuse, and a small metadata.json so every result is reproducible. Below we keep it simple—no sliders or toy tooltips.

Our contributions (what’s different from a vanilla RNA-seq pipeline).
  • Clear, fixed report contract. Each figure/table has a defined schema and stable defaults, so results are comparable across datasets.
  • Interpretable target ranking. A single documented score combining sparse (LASSO) and non-linear (RF) evidence, plus expression context.
  • ATRA-consistent preview. Deterministic in-silico perturbation on hepatoma cells summarized to pathway deltas for wet-lab planning.
  • CLEAR GSEA visuals. Baseline-aligned running-sum and shaded leading edge; the exact numbers live in the tables.
  • Provenance by design. Every bundle ships the parameters and versions that created it—no hidden knobs.
DEG decision · Target score · Dose–response · Pathway/NES · Reports + Metadata

DEG decision — rule on the figure, numbers in the table

We apply a transparent rule balancing FDR and effect size, then render a minimal volcano that visually encodes the decision. No interactive widgets here—the figure you see is the one we store.

[Figure: minimal volcano plot; legend: passes rule, thresholds.]

Defaults in this release: FDR ≤ 0.05, |log2FC| ≥ 1.0. Exact values are in deg.csv.

Target ranking — one interpretable score

We publish a single score per gene that blends model evidence and expression context. The weights are part of the metadata so reviewers can audit or re-weight later.

LASSO 0.40 · RF 0.40 · Hep ↑ 0.15 · Normal ↓ 0.05

Weights shown are our report defaults; the exact vector is stored per run in metadata.json.

Dose–response — fixed-format figure for reports

For top targets we fit a logistic curve and export one consistent figure per target. Below is the static template we use (no sliders).

[Figure: static dose–response template with EC₅₀ marked.]

Numerical estimates and CIs are saved in a compact JSON next to the PNG/SVG.

Reports & bundles — small and reproducible

Each run produces a predictable bundle: figures, tables, and the metadata that regenerates them exactly.

deg.csv · targets.csv · pathways.csv · metadata.json · report.zip
  • Tables: DEG, target ranking, enrichment summaries.
  • Figures: volcano, violin/box, UMAP overlays, CLEAR GSEA.
  • Provenance: thresholds, weights, versions, seeds.

4. ATRA Targeted Delivery System Single-Cell RNA Analysis Tool Tutorials

1. Accessing the Homepage

  • Open a browser (recommended: Chrome / Firefox / Edge) and navigate to the software's Web UI homepage.
  • The homepage displays entry points for primary functional modules:
    a) ATRA Data Processing
    b) Virtual Cell Generation
    c) Pharmacodynamic Analysis
    d) Mechanism Interpretation
  • Users may upload their own data or use the system's sample data for demonstration.

2. ATRA Data Processing Module

Function: Preprocess and perform differential gene analysis on single-cell RNA sequencing (scRNA-seq) data

Steps:

  • a) Click ATRA Data Processing in the navigation bar.
  • b) Upload data files (supports CSV/TSV/Excel formats; rows represent genes, columns represent cells).
  • c) Configure data processing parameters:
    • QC Threshold: Minimum gene count, mitochondrial gene proportion filtering.
    • Normalization Method: TPM / RPKM / CPM.
    • Batch Effect Correction: Enable PCA/ComBat.
  • d) Click Run Analysis. The system will output:
    • Differential gene table (includes logFC, p-value, significance flag).
    • Heatmap (expression distribution of differential genes across samples).
    • PCA/UMAP plot (distribution differences of cells before and after ATRA treatment).
[Figure: PCA/UMAP plot showing distribution differences of cells before and after ATRA treatment (treated vs. control).]

3. Virtual Cell Generation Module

Function: Simulates transcriptional changes in cells post-ATRA treatment using machine learning and mathematical modeling

Steps:

  • a) Click Virtual Cell Generation in the navigation bar.
  • b) Input treatment conditions:
    o Drug concentration (µM): e.g., 0.5, 1.0, 2.0.
    o Target gene downregulation ratio (e.g., 50%).
    o Select simulation method: Random Forest / LASSO / Logistic Regression.
  • c) Click Generate Virtual Cells.
  • d) System Output:
    o Position of “virtual cells” in the expression matrix.
    o PCA/UMAP projection plots showing relative positions of virtual and real cells.
    o Simulated expression curves for differentially regulated genes (dose-dependent).
[Figure: simulated dose-dependent expression curves for differentially regulated genes, with virtual vs. real overlay.]

4. Pharmacodynamic Analysis Module

Function: Perform pharmacodynamic modeling and response prediction for differentially expressed genes and target genes

Procedure:

  • a) Click Pharmacodynamic Analysis in the navigation bar.
  • b) Select the gene set for analysis (can be imported from DEG results).
  • c) Set simulation parameters:
    o Dose gradient (e.g., 0.1, 0.5, 1.0, 5.0 µM).
    o Effect Type: Upregulation / Downregulation.
  • d) Click Run Pharmacodynamic Analysis.
  • e) System Output:
    o Dose-Response Curve.
    o Population Sensitivity Distribution (Violin Plot/Box Plot).
    o Gene Expression Heatmap Before and After Drug Treatment.
[Figure: gene expression heatmaps before vs. after drug treatment (panels 1–3).]

5. Mechanism Analysis Module

Function: Perform enrichment analysis on differentially expressed gene sets to reveal ATRA's mechanism of action

Procedure:

  • a) Click Mechanism Analysis in the navigation bar.
  • b) Import the differentially expressed gene list (from the ATRA Data Processing Module).
  • c) Select enrichment analysis methods:
    o GSEA (Gene Set Enrichment Analysis).
    o KEGG pathway enrichment analysis.
    o Reactome pathway analysis.
  • d) Set parameters:
    o Significance threshold (p<0.05, FDR<0.25).
    o Gene set database (KEGG, MSigDB, Reactome).
  • e) Click Run Mechanism Analysis.
  • f) System outputs:
    o Ranked enrichment pathway list.
    o Bar chart (pathway significance comparison).
    o Network diagram (gene-pathway relationships).
[Figure: network diagram showing gene–pathway relationships for enriched pathways.]

6. Final Output

  • After completing the above four modules, users can:
    • Generate a complete report (PDF/Word) with one click, including:
    o Data processing workflow
    o Differentially expressed gene list
    o Virtual cell simulation results
    o Pharmacodynamic analysis curves
    o Pathway mechanism analysis
  • Export interactive charts (PNG/SVG) and gene expression matrices (CSV/Excel) for further research or publication.
Step | Module | Input | Main operation | Output
1 | Visit the home page | None (browser access) | Open the Web UI and select a module | Functional navigation interface
2 | ATRA Data Processing Module | scRNA-seq data file (CSV/TSV/Excel) | QC filtering, normalization, batch effect correction, differential gene analysis (DEG) | Differential gene table, heatmap, PCA/UMAP plot
3 | Virtual Cell Generation Module | Differential gene data, simulation conditions (concentration, target downregulation ratio) | Machine learning modeling to generate virtual cell expression matrices | Virtual cell projection plots (PCA/UMAP), simulated gene curves
4 | Pharmacodynamic Analysis Module | Selected gene list, dose gradient parameters | Construct a dose-response model and analyze drug sensitivity | Dose-response curve, sensitivity distribution, violin plot, heatmap
5 | Mechanism Analysis Module | List of differential genes | GSEA, KEGG, Reactome pathway enrichment analysis | Enrichment pathway table, bar chart, network diagram
6 | Final output | Results of each of the above modules | Automatically generate reports | PDF/Word reports, chart exports, table results

5. Detailed Explanation of the ATRA Data Processing Module

QC · Normalization · Batch correction · DEG / Statistics

1) Module Function

The module is the system’s first step. It cleans, standardizes, and performs differential gene analysis (DEG) on scRNA-seq inputs, producing high-quality data for virtual cell generation, pharmacodynamic analysis, and mechanism interpretation.

  • Remove noise and low-quality cells to ensure reliability.
  • Normalize cells and batches onto a common scale for comparability.
  • Identify DEGs before/after ATRA and between HCC vs. normal liver cells.

2) Module Assumptions

  • Cellular heterogeneity: HCC vs. normal hepatocytes show meaningful transcriptional differences detectable by scRNA-seq.
  • Drug action: ATRA modulates genes by up/down regulation with statistically significant shifts.
  • Independence: Cells are treated as independent samples for DEG testing.
  • Noise & batch effects: Present but mitigated by normalization, dimensionality reduction, and batch correction.

3) Mathematical Model

Data matrix. Let \(X \in \mathbb{R}^{m\times n}\) be raw expression (genes × cells):

\[ X \in \mathbb{R}^{m \times n},\quad X_{i j}\ \text{is expression of gene } g_i \text{ in cell } c_j. \tag{5.1} \]

Rows are genes \((g_i)\); columns are cells \((c_j)\).

Standardization. Log-CPM (library-size normalized then log-transform):

\[ \tilde X_{i j} = \log_2\!\left(\frac{X_{i j}}{\sum_{k=1}^{m} X_{k j}} \times 10^{6} + 1\right). \tag{5.2} \]

This addresses sequencing-depth variation and non-linear scaling.
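Eq. (5.2) translates almost directly into NumPy. The sketch below assumes a dense genes × cells count matrix in which every cell has at least one count (empty cells are expected to be removed by QC first); `log_cpm` is an illustrative name.

```python
import numpy as np

def log_cpm(X):
    """Eq. (5.2): X is genes x cells raw counts; returns log2(CPM + 1)."""
    X = np.asarray(X, dtype=float)
    lib = X.sum(axis=0, keepdims=True)      # library size per cell (column sums)
    return np.log2(X / lib * 1e6 + 1.0)

counts = np.array([[1, 30],
                   [3, 10]])
X_tilde = log_cpm(counts)
```

Both cells end up on the same per-million scale even though the second was sequenced ten times deeper.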

DE testing. Group means, effect size, and hypothesis test with FDR control:

\[ \mu_{T}(g_i) = \frac{1}{|T|} \sum_{j \in T} \tilde X_{i j},\quad \mu_{C}(g_i) = \frac{1}{|C|} \sum_{j \in C} \tilde X_{i j}. \tag{5.3} \]
\[ \mathrm{log2FC}(g_i) = \log_2\!\left(\frac{\mu_{T}(g_i)+\varepsilon}{\mu_{C}(g_i)+\varepsilon}\right),\ \varepsilon>0. \tag{5.4} \]
\[ \begin{aligned} &H_0: \mu_{T}(g_i) = \mu_{C}(g_i),\\ &H_1: \mu_{T}(g_i) \ne \mu_{C}(g_i), \end{aligned} \qquad \text{test via Wilcoxon/Mann–Whitney or } t\text{-test; control FDR (e.g., BH).} \tag{5.5} \]

Screening rule. A gene is called a DEG if both hold:

\[ \big|\mathrm{log2FC}(g_i)\big| \ge \theta \quad \text{and} \quad p_{\text{adj}}(g_i) < \alpha, \qquad (\theta = 1,\ \alpha = 0.05\ \text{typ.}) \tag{5.6} \]

4) Data Sources

  • Experimental: scRNA-seq from HCC lines (HepG2/Huh7) and normal hepatocytes; ATRA at varying concentrations with post-treatment profiles.
  • Public: GEO, TCGA, Single Cell Portal / Human Cell Atlas for controls or model references.

5) Implementation

Input formats: CSV/TSV/Excel (genes × cells), and AnnData (.h5ad) compatible with Scanpy/Seurat.

Required metadata: expression matrix + grouping labels (ATRA vs. control; HCC vs. normal).

Workflow — QC. Remove low-quality cells/genes; limit mitochondrial proportion.

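import scanpy as sc
# adata: AnnData of raw counts (cells × genes), loaded beforehand
sc.pp.filter_cells(adata, min_genes=200)   # drop cells with fewer than 200 detected genes
sc.pp.filter_genes(adata, min_cells=3)     # drop genes detected in fewer than 3 cells
adata.var['mt'] = adata.var_names.str.startswith('MT-')  # human mitochondrial genes
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
adata.obs['percent_mito'] = adata.obs['pct_counts_mt'] / 100  # fraction in [0, 1]
adata = adata[adata.obs['percent_mito'] < 0.2].copy()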

Workflow — normalization. CPM/TPM; log; scaling.

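sc.pp.normalize_total(adata, target_sum=1e4)  # library-size normalization; use target_sum=1e6 for CPM as in Eq. (5.2)
sc.pp.log1p(adata)                            # natural log(1 + x); Eq. (5.2) uses log2
sc.pp.scale(adata, max_value=10)              # z-score each gene, clip values at 10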

Workflow — batch correction. PCA + ComBat; or Harmony/Scanorama.

sc.tl.pca(adata, svd_solver='arpack')
# then apply Harmony / Scanorama if needed

Workflow — DEG testing. T vs C; HCC vs normal; Wilcoxon/t-test/logistic regression.

sc.tl.rank_genes_groups(adata, groupby='treatment', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)

Screening criteria. Adjusted p (FDR) < 0.05 and |log2FC| ≥ 1, as in Eq. (5.6).

6) Output Overview

  1. DEG table: gene, log2FC, p-value, FDR, and significance marks.
  2. Visualizations: PCA/UMAP (pre/post ATRA), heatmaps, violins for key targets.
  3. Standardized matrix: QC’d and batch-corrected for downstream modules.

DEG Significance Illustration

Annotates significance with * / ** / *** right after p/FDR thresholds.
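A small sketch of how such markers can be derived from an adjusted p-value. The cutoffs 0.05 / 0.01 / 0.001 are the conventional choice and an assumption here; the thresholds actually used by the tool are not stated on this page.

```python
def significance_stars(p_adj, cutoffs=(0.05, 0.01, 0.001)):
    """Return '', '*', '**', or '***' by counting how many cutoffs p_adj falls below."""
    return "*" * sum(p_adj < c for c in cutoffs)

marks = [significance_stars(p) for p in (0.2, 0.03, 0.005, 0.0001)]
```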

[Figure: example DEG table with significance markers (* / ** / ***).]
Table columns: Gene, FoldChange, PValue, AdjustedPValue.

6. Virtual Cell Generation Module Detailed Explanation

In-silico simulation · Dose/time/down-regulation · PCA/UMAP · Dose-response

1) Purpose

The Virtual Cell Generation Module simulates cellular transcriptomic responses to varying ATRA treatment conditions (dose/time/downregulation ratio) based on processed single-cell RNA-seq data, generating “virtual cells” or “virtual populations.” Its primary applications include:

  • Conducting in-silico drug intervention simulations when experimental coverage is insufficient;
  • Expanding training datasets to enhance the robustness of downstream models (DEG, enrichment, classification);
  • Comparing effects of different dosages or downregulation strategies on cell populations/single cells;
  • Providing visual (PCA/UMAP) and statistical (dose-response curves, population sensitivity distributions) outputs for decision-making.

The module interfaces directly with the ATRA data processing module: input is QC/normalized/batch-corrected AnnData (or equivalent matrices).

2) Module Assumptions

  • Response modelability: ATRA’s impact on gene expression can be approximated by statistical/mechanistic models (multiplicative scaling, Hill curves, or learned mappings).
  • Local perturbation: Drug effects primarily alter a subset of key target genes, expandable to multi-gene coupling.
  • Stability: Under identical treatments, populations are reproducible and can be Monte-Carlo simulated for uncertainty.
  • Data availability: There exists at least one dataset or credible target/effect prior for calibration or supervised learning.

3) Mathematical Model

A. Per-Gene Multiplicative Perturbation

For gene \(g\) and cell \(c\), let the original expression be \(X^{\text{pre}}_{g,c}\). The virtual expression:

\[ X^{\text{post}}_{g,c} \;=\; s_g \cdot X^{\text{pre}}_{g,c}, \qquad s_g \in (0,\infty). \tag{6.1} \]

The perturbation coefficient can be constant or stochastic:

\[ s_g = 1 - \alpha_g \quad\text{(constant down-regulation)},\qquad s_g \sim \mathrm{LogNormal}\!\big(\log(1-\alpha_g),\,\sigma^2\big)\ \text{(randomized)}. \tag{6.2} \]

To preserve count statistics, optional resampling:

\[ X'^{\text{post}}_{g,c}\ \sim\ \mathrm{Poisson}\!\big(\lambda=X^{\text{post}}_{g,c}\big) \quad\text{or}\quad \mathrm{NegBin}\!\big(\mu=X^{\text{post}}_{g,c},\,r\big). \tag{6.3} \]
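The resampling step in Eq. (6.3) could look like the sketch below. It assumes the perturbed values are on a count-like scale and that (μ, r) denotes the mean/dispersion parameterization with variance μ + μ²/r; the function name and the default r are placeholders.

```python
import numpy as np

def resample_counts(X_post, dist="poisson", r=10.0, seed=0):
    """Eq. (6.3): redraw integer counts around the perturbed means."""
    rng = np.random.default_rng(seed)
    mu = np.clip(np.asarray(X_post, dtype=float), 0.0, None)
    if dist == "poisson":
        return rng.poisson(lam=mu)
    # NumPy's negative_binomial(n, p) has mean n(1 - p)/p, so p = r / (r + mu) gives mean mu
    return rng.negative_binomial(n=r, p=r / (r + mu))

means = np.full((1, 20000), 5.0)              # toy perturbed means
pois = resample_counts(means, "poisson")
nb = resample_counts(means, "negbin", r=10.0)
```

The negative binomial option adds overdispersion on top of the Poisson noise, which is usually closer to what scRNA-seq counts look like.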

B. Dose-Response (Hill/Logistic) Model (for multi-dose data)

For each gene \(g\), the dose-response:

\[ E_g(d) \;=\; E_{\min,g} + \frac{E_{\max,g}-E_{\min,g}}{1+\left(\frac{EC_{50,g}}{d}\right)^{n_g}}, \tag{6.4} \]

Virtual expression as multiplicative or additive modulation about a baseline \(\bar X_{g,c}\):

\[ X^{\text{post}}_{g,c}(d) \;=\; \bar X_{g,c} \cdot \big(1 + \kappa_g\,E_g(d)\big) \quad\text{or}\quad \bar X_{g,c} + \kappa_g\,E_g(d) + \varepsilon_{g,c}, \tag{6.5} \]

where \(d\) is drug concentration, \(\kappa_g\) scales gene-specific sensitivity, and \(\varepsilon_{g,c}\) captures individual variation.

C. Latent-Space Perturbation

Map each cell to latent space \(z=f(x)\) (PCA/scVI/autoencoder). Define population shift:

\[ \Delta \;=\; \bar z_{\text{treated}} - \bar z_{\text{control}}. \tag{6.6} \]

Shift and sample a virtual cell, then decode back:

\[ z^{\text{post}} \;=\; z^{\text{pre}} + \lambda\,\Delta + \eta,\qquad X^{\text{post}} \;=\; \mathrm{Decoder}\!\left(z^{\text{post}}\right)\ \ \text{or}\ \ W\,z^{\text{post}}. \tag{6.7} \]

Latent perturbation better preserves covariance structures; multiplicative is simple/transparent; Hill fits enable pharmacologic interpretation (EC50/Emax/n).
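Eqs. (6.6) and (6.7) can be sketched with a plain PCA encoder and the linear decoder \(W z\). This is an illustration under simplifying assumptions (dense cells × genes matrix, PCA via SVD, Gaussian noise for \(\eta\)); scVI or an autoencoder would replace the encode/decode steps. All names are hypothetical.

```python
import numpy as np

def latent_shift(X, treated_mask, lam=1.0, k=2, noise_sd=0.0, seed=0):
    """Shift every cell by lam * Delta in a k-dimensional PCA space, then decode.
    X: cells x genes; treated_mask: boolean array marking treated cells."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    treated_mask = np.asarray(treated_mask, dtype=bool)
    mean = X.mean(axis=0)
    # PCA via SVD: rows of Vt[:k] are the loadings W, so z = (x - mean) W^T
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:k]
    Z = (X - mean) @ W.T
    delta = Z[treated_mask].mean(axis=0) - Z[~treated_mask].mean(axis=0)  # Eq. (6.6)
    Z_post = Z + lam * delta + rng.normal(0.0, noise_sd, size=Z.shape)    # Eq. (6.7)
    return Z_post @ W + mean                                              # linear decoder

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(6, 3))
mask = np.array([True, True, True, False, False, False])
X_same = latent_shift(X_demo, mask, lam=0.0, k=3)
X_moved = latent_shift(X_demo, mask, lam=1.0, k=3)
```

With λ = 1 and no noise, the control cells are moved so that their decoded mean matches the treated mean, while within-group covariance is left as it was.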

4) Data Sources

  • Required: AnnData preprocessed by the ATRA module (with adata.X or adata.layers['normalized']), and adata.obs including treatment, cell_type.
  • Optional: Multi-dose datasets for Hill fitting; similar public scRNA-seq (e.g., GEO) for calibration; predefined target lists or priors.

5) Implementation


5.1 Input

  • adata: AnnData (QC/normalized/batch-corrected); raw counts require normalization flags.
  • targets: Candidate target gene list or internally screened (RF/LASSO).
  • mode: 'multiplicative' / 'dose_response' / 'latent'.
  • params: alpha (down-regulation ratio), dose, Emax/EC50/n, etc. (per-gene or shared).
  • n_replicates: Monte-Carlo replicates per real cell.
  • random_state: Random seed.

5.2 Processing Pipeline

  1. Input validation (targets ∈ adata.var_names; layer availability).
  2. Select population (e.g., cell_type == 'HepG2').
  3. Construct coefficients:
    • multiplicative: \(s=1-\alpha\) or \(s\sim\mathrm{LogNormal}\).
    • dose_response: fit per-gene Hill or extrapolate shared parameters.
    • latent: compute \(\Delta\); choose displacement \(\lambda(d)\).
  4. Apply perturbations → write to adata.layers['perturbed'] (or replicas).
  5. Post-process: optional re-normalize/log1p; recompute PCA/UMAP/DEG.
  6. Repeat simulations to obtain distributions/CI.
  7. Save: adata, plots, DEG tables, dose-response params, adata.uns['perturbation_meta'].
Python · simulate_virtual_cells() Stores meta in adata.uns['perturbation_meta']
def simulate_virtual_cells(adata, targets, mode="multiplicative", params=None,
                           n_replicates=1, random_state=42, layer_in="normalized",
                           layer_out="perturbed"):
    # Note: n_replicates is accepted but not used in this snippet.
    import datetime
    import numpy as np
    from scipy import sparse

    params = params or {}  # tolerate params=None
    rng = np.random.default_rng(random_state)
    # AnnData is cells × genes; take a dense float copy so the input layer stays untouched
    X = adata.layers[layer_in] if layer_in in adata.layers else adata.X
    X = (X.toarray() if sparse.issparse(X) else np.asarray(X)).astype(float)
    gene_mask = np.asarray(adata.var_names.isin(targets))  # target genes (columns)

    if mode == "multiplicative":
        alpha = float(params.get("alpha", 0.5))
        sigma = float(params.get("sigma", 0.0))
        s = 1.0 - alpha  # constant down-regulation, Eq. (6.2)
        if sigma > 0:
            # one stochastic factor per target gene, Eq. (6.2)
            s = rng.lognormal(mean=np.log(max(1e-6, 1.0 - alpha)), sigma=sigma,
                              size=int(gene_mask.sum()))
        X[:, gene_mask] = X[:, gene_mask] * s

    elif mode == "dose_response":
        dose = max(float(params.get("dose", 1.0)), 1e-9)  # avoid division by zero
        Emin = float(params.get("Emin", 0.0))
        Emax = float(params.get("Emax", 1.0))
        EC50 = float(params.get("EC50", 1.0))
        n = float(params.get("n", 1.0))
        kappa = float(params.get("kappa", 1.0))
        E = Emin + (Emax - Emin) / (1 + (EC50 / dose)**n)  # Hill curve, Eq. (6.4)
        X[:, gene_mask] = X[:, gene_mask] * (1 + kappa * E)  # multiplicative form of Eq. (6.5)

    elif mode == "latent":
        import scanpy as sc
        if "X_pca" not in adata.obsm:
            sc.tl.pca(adata)
        lam = float(params.get("lambda", 1.0))
        tr = (adata.obs["treatment"] == "treated").values
        ct = (adata.obs["treatment"] == "control").values
        Delta = adata.obsm["X_pca"][tr].mean(0) - adata.obsm["X_pca"][ct].mean(0)
        # deterministic shift; the noise term of Eq. (6.7) is omitted here
        adata.obsm["X_pca_perturbed"] = adata.obsm["X_pca"] + lam * Delta

    else:
        raise ValueError(f"unknown mode: {mode!r}")

    adata.layers[layer_out] = X
    adata.obs["is_virtual"] = True
    adata.obs["perturb_id"] = params.get("tag", "sim_001")
    adata.uns["perturbation_meta"] = {
        "alpha": params.get("alpha"),
        "dose": params.get("dose"),
        "mode": mode,
        "random_state": random_state,
        "timestamp": datetime.datetime.now().isoformat()
    }
    return adata

6) Outputs

  1. Virtual expression matrix layer: adata.layers['perturbed'] (or perturbed_rep# for multiple runs).
  2. Virtual cell annotation: adata.obs['is_virtual'] (True/False), adata.obs['perturb_id'].
  3. Low-dimensional embeddings: adata.obsm['X_pca_perturbed'], adata.obsm['X_umap_perturbed'] (if recomputed).
  4. Virtual vs Original DEG: CSV/TSV (logFC, p-value, FDR) via sc.tl.rank_genes_groups(..., layer='perturbed').
  5. Dose-response params & plots: per-gene Emax/EC50/n and fitted curves (PNG/SVG).
  6. Statistical summary: means, variances, sensitivity scores; uncertainty via replicates/CI.
  7. Simulation metadata: adata.uns['perturbation_meta'] stores parameters (alpha/dose/mode/random_state/timestamp).
[Figure: UMAP clustering distribution of virtual vs original cells.]

7) Output Result Explanation

  • adata.layers['perturbed']: ready for downstream (normalize → log1p → PCA → DEG); note annotations refer to perturbed data.
  • DEG table: compares perturbed vs control to verify pathway shifts; can feed into mechanism analysis.
  • PCA/UMAP: visual shift of virtual cells highlights population-level drug effects.
  • Dose-response curve: estimates EC50/Emax for pharmacological discussion.
  • Replicates: average curves with CI reflect uncertainty.

8) Features & Highlights

  • Multi-strategy support: multiplicative, Hill fitting, and latent perturbation cover sparse→rich data scenarios.
  • Reproducibility: all parameters/seeds stored in adata.uns['perturbation_meta'] for exact replication.
  • Fidelity vs flexibility: multiplicative is simple/transparent; latent preserves covariance; Hill is pharmacologically interpretable.
  • Downstream integration: seamless with Scanpy pipelines (PCA/UMAP/DEG/GSEA).
  • Parallel/batch simulation: supports multi-replicate/multi-dose grid for response surfaces.
  • Uncertainty quantification: Monte-Carlo replication with CI for design decisions.

7. Drug Efficacy Analysis Module Detailed Explanation

Dose–response · EC50 / Emax / Hill n · AUC / Sensitivity · GLM / Mixed-effects

1) Purpose

The Drug Efficacy Analysis Module quantitatively describes pharmacological effects of ATRA at the single-cell/population level, assessing how concentration, duration, or intervention strategies impact genes or cell states. Objectives:

  • Fit dose–response curves per gene/pathway/population to estimate EC50/IC50, Emax, and Hill coefficient.
  • Compute AUC and max/min response for efficacy ranking and target prioritization.
  • Identify sensitive/resistant subpopulations and quantify their proportions (sensitivity distribution).
  • Run statistical tests (between-dose significance, trends, population differences) with multiple-testing correction.
  • Provide visualizations: dose–response plots, heatmaps, sensitivity distributions, violin plots, and dose-gradient trajectories (PCA/UMAP). Support uncertainty quantification (bootstrap/Bayesian CI) and mixed-effects to account for replicate/batch/individual effects.

2) Module Assumptions

  • Dose-dependency: ATRA effects are dose-dependent and follow fittable curves (Hill/Logistic) within observed ranges.
  • Reproducibility: Populations are reproducible under identical treatments; variability estimable via replicates.
  • Independence & nesting: Single-cell data are nested (cells within samples/batches); mixed-effects handle correlations.
  • Distribution: Transformed data meet model assumptions (log1p/CPM/GLM); counts via Poisson/NegBin GLM.
  • Measurable effect: Effects are detectable at the transcriptional level with sufficient sample size.

3) Mathematical Models

3.1 Single-Gene Dose–Response (Hill / 4PL)

For gene \(g\) vs. dose \(d\) (or log-dose), model mean expression:

\[ y_g(d) \;=\; E_{0,g} \;+\; \frac{E_{\max,g}-E_{0,g}}{1+\left(\frac{EC_{50,g}}{d}\right)^{n_g}}, \tag{7.1} \]

Parameters \(E_{0,g}\) (baseline), \(E_{\max,g}\) (max effect; negative can indicate down-regulation), \(EC_{50,g}\) (potency), \(n_g\) (steepness) are estimated via least squares / MLE.

3.2 GLMM (Generalized Linear Mixed-Effects)

For single-cell counts \(y_{i j k g}\) of gene \(g\) (sample \(i\), cell \(j\), measurement \(k\)):

\[ y_{i j k g} \sim \mathrm{NegBin}\!\big(\mu_{i j k g}, \phi_g\big), \qquad h(\mu_{i j k g}) = \beta_{0g} + \beta_{1g}\,\log d_{i j k} + u_{i g} + v_{i j g}, \tag{7.2} \]

with link function \(h(\cdot)\) (log for counts; identity/log for continuous) and random effects \(u, v\) capturing individual/batch/cell-level variation. The link is written \(h\) to keep it distinct from the gene index \(g\).

3.3 AUC & Sensitivity Score

Within dose range \([d_{\min}, d_{\max}]\):

\[ \mathrm{AUC}_g \;=\; \int_{d_{\min}}^{d_{\max}} y_g(d)\, \mathrm{d}d, \qquad \mathrm{nAUC}_g = \frac{\mathrm{AUC}_g}{d_{\max}-d_{\min}}. \tag{7.3} \]

Sensitivity fraction over cells with response above threshold \(\tau\):

\[ S(d) \;=\; \frac{1}{N}\sum_{c=1}^{N} \mathbf{1}\!\big[r_c(d) \ge \tau\big]. \tag{7.4} \]

3.4 Single-Cell Response Distribution & Clustering

For cell \(c\), define response vector \(r_c = \{ x_{c,g_1}(d)-x_{c,g_1}(0),\ldots\}\); cluster \(r_c\) to reveal sensitive/resistant groups; compute \(S(d)\) as above.
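Eqs. (7.3) and (7.4) reduce to a few lines once the curve has been evaluated on a dose grid. The sketch below uses the trapezoidal rule for the integral and assumes each cell's response has already been summarized to a single number (for example a mean over target genes); both choices are assumptions of this example.

```python
import numpy as np

def auc_and_nauc(doses, y):
    """Eq. (7.3): trapezoidal AUC over sorted doses, and AUC normalized by dose range."""
    d = np.asarray(doses, dtype=float)
    y = np.asarray(y, dtype=float)
    auc = float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(d)))
    return auc, auc / (d[-1] - d[0])

def sensitivity_fraction(r, tau):
    """Eq. (7.4): share of cells whose response r_c(d) is at least tau."""
    return float(np.mean(np.asarray(r, dtype=float) >= tau))

auc, nauc = auc_and_nauc([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
frac = sensitivity_fraction([0.1, 0.5, 0.9, 1.2], tau=0.5)
```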

In the dose–response plot the x-axis is log-dose (0.1–10). Setting Emax < E0 gives a down-regulation curve, and a larger n gives a steeper transition.

4) Data Sources

  • Required: Preprocessed AnnData (from ATRA module or virtual cell module) with:
    • Multi-dose data (if available) or control + single dose;
    • adata.obs includes dose/treatment/replicate/batch/cell_type;
    • adata.layers may include perturbed layers or raw counts for GLM.
  • Optional: Public dose–response data; PK/PD priors for constraints or Bayesian priors.
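
The input requirements above can be checked before any fitting. The sketch below works on any `adata.obs`-like pandas DataFrame; the helper name `check_obs`, the two column tuples, and the "at least four distinct doses" rule (one per Hill parameter) are assumptions made for illustration, not part of the module's API.

```python
import pandas as pd

REQUIRED_OBS = ("dose", "treatment")               # columns the input check expects
RECOMMENDED_OBS = ("replicate", "batch", "cell_type")

def check_obs(obs, required=REQUIRED_OBS, recommended=RECOMMENDED_OBS):
    """Validate an adata.obs-like DataFrame and summarize its dose design."""
    missing = [c for c in required if c not in obs.columns]
    if missing:
        raise ValueError(f"obs is missing required columns: {missing}")
    doses = pd.to_numeric(obs["dose"], errors="raise")  # non-numeric doses raise here
    n_levels = int(doses.nunique())
    return {
        "n_dose_levels": n_levels,
        # Assumption: a four-parameter Hill fit needs at least four distinct doses
        "enough_for_hill": n_levels >= 4,
        "missing_recommended": [c for c in recommended if c not in obs.columns],
    }

# Toy obs table: control + single dose, no replicate or cell-type annotation
obs = pd.DataFrame({
    "dose": [0, 0, 1, 1],
    "treatment": ["ctrl", "ctrl", "ATRA", "ATRA"],
    "batch": ["b1", "b2", "b1", "b2"],
})
report = check_obs(obs)
print(report)
```

With only control plus one dose, a report like this suggests falling back to a two-group contrast rather than a Hill fit.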

5) Implementation

5.1 Data Input

Scanpy · Read & check columns
import scanpy as sc
adata = sc.read_h5ad('preprocessed_atra.h5ad')
print(adata.obs.columns)  # Must include 'dose' and 'treatment'

Required arguments: adata (dose/treatment in obs), genes (target list or auto-selected), and analysis params: dose_range, fit_model ('hill'/'logistic'/'glm'/'mixed'), bootstrap_reps, alpha, random_state.

5.2 Processing Flow

  1. Select granularity: per gene vs. gene set/pathway.
  2. Aggregate per dose (population means) or fit GLMM on single-cell data.
  3. Fit model:
    • hill: non-linear least squares per gene across doses;
    • glm: NegBin regression for counts;
    • mixed: mixed-effects for nested structure.
  4. Estimate parameters & uncertainty (bootstrap/SE/CIs; Bayesian: posteriors).
  5. Compute derived metrics: AUC, Emax, EC50, slope, sensitivity fraction.
  6. Multiple-testing correction: Benjamini–Hochberg (FDR).
  7. Pattern discovery: cluster dose–response shapes.
  8. Visualization: curves, heatmaps, violins, dose-colored PCA/UMAP.
  9. Persist outputs: parameter tables, plots, updated adata.uns.
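
Step 7 (pattern discovery) can be sketched as hierarchical clustering of per-gene curves after z-scoring each curve, so that genes group by shape rather than by absolute expression. This is one possible approach under that assumption; `cluster_shapes` and the toy curves are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_shapes(curves, n_clusters=2):
    """Cluster dose-response shapes; curves is [genes, doses] of mean expression."""
    curves = np.asarray(curves, dtype=float)
    centered = curves - curves.mean(axis=1, keepdims=True)
    scale = curves.std(axis=1, keepdims=True)
    # z-score each gene's curve; flat curves (std = 0) stay at zero
    z = centered / np.where(scale > 0, scale, 1.0)
    tree = linkage(z, method="average", metric="euclidean")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Toy curves over four doses: two increasing genes, two decreasing genes
curves = np.array([[1, 2, 3, 4],
                   [10, 20, 30, 40],
                   [4, 3, 2, 1],
                   [8, 6, 4, 2]])
labels = cluster_shapes(curves, n_clusters=2)
print(labels)
```

Because each curve is z-scored, genes with very different magnitudes but the same trend fall into the same cluster.
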
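Python · Fit Hill per gene (curve_fit)
import numpy as np
from scipy.optimize import curve_fit

def hill_fn(d, E0, Emax, EC50, n):
    # Eq. (7.1); doses are clipped away from zero so EC50/d stays finite
    return E0 + (Emax - E0) / (1 + (EC50 / np.maximum(d, 1e-9))**n)

def fit_hill(doses, expr):
    # doses: array shape [D] on the raw (not log) scale; expr: mean expr per dose for one gene [D]
    doses, expr = np.asarray(doses, float), np.asarray(expr, float)
    order = np.argsort(doses)
    pos = doses[doses > 0]
    # Initial guesses: response at the lowest/highest dose (covers up- and down-regulation);
    # EC50 starts at the median positive dose so it stays inside the bounds
    p0 = [expr[order[0]], expr[order[-1]], np.median(pos) if pos.size else 1.0, 1.0]
    bounds = ([-np.inf, -np.inf, 1e-9, 0.1], [np.inf, np.inf, np.inf, 5.0])
    popt, pcov = curve_fit(hill_fn, doses, expr, p0=p0, bounds=bounds, maxfev=10000)
    return popt, pcov  # E0, Emax, EC50, n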
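Python · AUC / nAUC
import numpy as np
from scipy.integrate import trapezoid

def auc_from_fit(d_grid, params):
    # d_grid: increasing dose grid spanning [d_min, d_max] (Eq. 7.3)
    d_grid = np.asarray(d_grid, float)
    y = hill_fn(d_grid, *params)
    auc = trapezoid(y, d_grid)  # np.trapz is deprecated as of NumPy 2.0
    nAUC = auc / (d_grid.max() - d_grid.min())
    return auc, nAUC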
Statsmodels · GLM (NegBin)
import numpy as np
import statsmodels.api as sm

def fit_glm_nb(counts, doses, alpha=1.0):
    # counts: raw count vector; doses: raw dose vector (the log is taken below, so do not pre-log)
    # alpha: NegBin dispersion, held fixed by the GLM family (statsmodels default is 1.0)
    X = sm.add_constant(np.log(np.maximum(doses, 1e-9)))
    model = sm.GLM(counts, X, family=sm.families.NegativeBinomial(alpha=alpha))
    res = model.fit()
    return res.params, res.bse, res.pvalues
Bootstrap · EC50 / Emax CI
def bootstrap_ci(doses, expr, B=200, q=(0.025,0.975), rng=None):
    rng = np.random.default_rng() if rng is None else rng
    doses, expr = np.asarray(doses), np.asarray(expr)  # fancy indexing below needs arrays
    pars = []
    for _ in range(B):
        idx = rng.choice(len(doses), len(doses), replace=True)
        try:
            p, _cov = fit_hill(doses[idx], expr[idx])
        except (RuntimeError, ValueError):
            continue  # degenerate resample or non-converged fit; count it as a failed fit
        pars.append(p)
    pars = np.array(pars)
    lo = np.quantile(pars, q[0], axis=0)
    hi = np.quantile(pars, q[1], axis=0)
    return lo, hi  # element-wise CI for [E0,Emax,EC50,n]
Persist meta → adata.uns['pharmacology_meta']
import platform, datetime
def write_pharm_meta(adata, method="hill", seed=42, extra=None):
    meta = dict(method=method, random_seed=seed,
                time=datetime.datetime.now().isoformat(),
                python=platform.python_version())
    if extra: meta.update(extra)
    adata.uns['pharmacology_meta'] = meta
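
Step 6 of the processing flow (Benjamini–Hochberg) can be written in a few lines of NumPy. This is a minimal sketch of the standard step-up procedure; the function name `bh_fdr` is illustrative, and a library routine such as `statsmodels.stats.multitest.multipletests(method='fdr_bh')` could be used instead.

```python
import numpy as np

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # p_(i) * m / i for the i-th smallest p-value
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

# Toy per-gene p-values for the dose effect
pvals = [0.01, 0.04, 0.03, 0.005]
qvals = bh_fdr(pvals)
print(qvals)
```

Genes would then be kept when their q-value is at or below the chosen `alpha`.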

5.3 Outputs

  1. Parameter table per gene/pathway: gene,E0,Emax,EC50,n,...,pvalue_dose,FDR.
  2. Uncertainty intervals: bootstrap or MLE covariance (e.g., EC50_CI_low/high).
  3. AUC/nAUC table for ranking.
  4. Sensitivity distribution: fraction-sensitive by cell or cluster.
  5. Visualization plots: dose–response curves, population heatmap (genes×doses), violin plots, UMAP/PCA with dose color bars.
  6. DEG across doses: pairwise DEG; volcano plots.
  7. Diagnostics: R², residuals, failure logs.
  8. Meta-information and reproducibility record: fitting method, random seed, fitting time, software version, etc., written to adata.uns['pharmacology_meta'].
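
The diagnostics in item 7 can be sketched as below: R², the largest absolute residual, and a flag for an EC50 estimate outside the tested dose range. The function name, the returned keys, and the in-range rule are assumptions chosen for illustration.

```python
import numpy as np

def fit_diagnostics(y_obs, y_fit, ec50, dose_min, dose_max):
    """Summarize goodness of fit for one fitted dose-response curve."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_fit = np.asarray(y_fit, dtype=float)
    resid = y_obs - y_fit
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_obs - y_obs.mean()) ** 2))
    # R^2 is undefined when the observations are constant
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {
        "r2": r2,
        "max_abs_resid": float(np.max(np.abs(resid))),
        # Assumption: an EC50 outside the tested doses is an extrapolation, so flag it
        "ec50_in_range": bool(dose_min <= ec50 <= dose_max),
    }

# Toy example: slightly noisy observations against fitted values
diag = fit_diagnostics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.0, 4.0],
                       ec50=100.0, dose_min=0.1, dose_max=10.0)
print(diag)
```

Rows with low R² or `ec50_in_range` set to False are natural candidates for the failure log.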
Dose–response visualization 1
Dose–response visualization 2
Sensitivity distribution / heatmap
Diagnostics and parameter panels

6) Output Interpretation

  • EC50 / IC50: dose at half-max effect; lower means higher potency; report CIs.
  • Emax: maximal effect (direction & magnitude) for endpoint.
  • Hill n: curve steepness; \(n>1\) suggests cooperative-like behavior (interpret with caution).
  • AUC: overall effect magnitude across dose range; useful for ranking across targets/pathways.
  • p-value / FDR: dose effect significance; use for candidate screens.
  • Single-cell sensitivity: percentage of sensitive cells reveals heterogeneity.
  • Diagnostics: flag large residuals, failed fits, unstable EC50; recommend validation.

7) Module Features & Highlights

  • Multi-level modeling: gene, pathway, cluster, single-cell mixed-effects.
  • Flexible models: Hill, logistic, GLM (NegBin), mixed-effects, optional Bayesian.
  • Uncertainty quantification: bootstrap CIs by default.
  • Integrated output: parameter tables, AUC, visualizations, and metadata for reports.
  • Virtual-cell integration: analyze simulated data to compare with experiments.
  • Automation & interactivity: batch fitting across many genes with per-gene drill-down.

8) Best Practices and Considerations

  1. Preprocess first (normalized, batch-corrected, clear dose info).
  2. Choose granularity wisely (means per dose vs. GLMM at single-cell scale).
  3. Use replicates and bootstrap/MCMC to get robust CIs.
  4. Control multiple testing via FDR when screening many genes.
  5. Use NegBin GLM (or DESeq2 in R) for raw counts.
  6. Inspect residuals and extreme/unstable EC50; flag as unreliable.
  7. Retain metadata: method, seed, failed fits, inits → adata.uns['pharmacology_meta'].

8. Mechanism Analysis Module Detailed Explanation

ORA / GSEA ssGSEA / AUCell / score_genes GRN inference Pathway–Network integration

1) Purpose

The Mechanism Analysis Module reveals molecular mechanisms and regulatory networks underlying ATRA action using DEG and efficacy results. Scope:

  • Identify significantly affected pathways and functions (GO/KEGG/Reactome).
  • Quantify pathway activity at single-cell/subpopulation level (ssGSEA, AUCell, score_genes).
  • Infer gene regulatory networks (GRNs) to find key TFs and upstream regulators.
  • Build a three-tier Gene → Pathway → Drug network to discover mediators and upstream/downstream signaling.
  • Provide pathway/network visualizations (bar/dot plots, heatmaps, GRN diagrams, trajectory plots).
  • Output prioritized candidates (TFs, pathways, key nodes) for validation.

2) Module Assumptions

  • DEGs reflect biological effects: ATRA-induced DEGs carry pathway-level signals.
  • Pathway annotations available: KEGG/Reactome/MSigDB/GO provide coverage and interpretability.
  • Regulatory relations are inferable: Co-expression and perturbation clues allow partial GRN inference (data/noise dependent).
  • Upstream dependence: Results at this stage are only as reliable as the upstream DEG/efficacy results they build on.

3) Mathematical Models

3.1 ORA (Hypergeometric) & GSEA

Given background size \(N\), pathway gene count \(K\), DEG set size \(n\), overlap \(k\), ORA p-value:

\[ p_{\text{ORA}} \;=\; \sum_{i=k}^{\min(K,n)} \frac{\binom{K}{i}\binom{N-K}{\,n-i\,}}{\binom{N}{n}}. \tag{8.1} \]
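
As a small worked check of Eq. (8.1), the sum can be evaluated term by term and compared with SciPy's hypergeometric survival function. The counts below are made-up toy values and `ora_pvalue_explicit` is an illustrative name.

```python
from math import comb
from scipy.stats import hypergeom

def ora_pvalue_explicit(N, K, n, k):
    """Evaluate Eq. (8.1) directly: P(overlap >= k)."""
    upper = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)

# Toy counts: background 50 genes, pathway of 10, 8 DEGs, 3 of them in the pathway
N, K, n, k = 50, 10, 8, 3
p_explicit = ora_pvalue_explicit(N, K, n, k)
# SciPy's argument order is (population size, successes in population, draws)
p_scipy = float(hypergeom(N, K, n).sf(k - 1))
print(p_explicit, p_scipy)
```

Both routes give the same tail probability; the SciPy form is preferable for realistic background sizes.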

GSEA ranks genes (e.g., by logFC/statistic) and computes an enrichment score (ES) from a weighted running-sum; significance is assessed via permutations to obtain FDR.
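
The weighted running sum behind the enrichment score can be sketched as follows: hits step up in proportion to \(|r|^{p}\), misses step down uniformly, and ES is the largest deviation from zero. This is a simplified illustration with hypothetical names (`enrichment_score`, `weight`); it omits the permutation step that yields NES and FDR.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes and scores must already be sorted by the ranking statistic,
    highest first.
    """
    scores = np.asarray(scores, dtype=float)
    in_set = np.array([g in gene_set for g in ranked_genes])
    n_miss = int((~in_set).sum())
    hit_w = np.where(in_set, np.abs(scores) ** weight, 0.0)
    if hit_w.sum() == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes "
                         "with non-zero scores")
    # Hits add their normalized weight; misses subtract 1 / (number of misses)
    step = np.where(in_set, hit_w / hit_w.sum(), -1.0 / n_miss)
    running = np.cumsum(step)
    es = running[np.argmax(np.abs(running))]
    return float(es), running

# Toy ranked list: the set sits at the top, so ES should reach its maximum
genes = ["A", "B", "C", "D", "E", "F"]
stat = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top, run_top = enrichment_score(genes, stat, {"A", "B"})
es_bottom, _ = enrichment_score(genes, stat, {"E", "F"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list gives a positive ES and one concentrated at the bottom gives a negative ES; the running sum returns to zero at the end of the list.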

3.2 Single-Cell Pathway Activity

Per-cell activity \(A_{c,S}\) for set \(S\) (e.g., ssGSEA/AUCell/score_genes). A simple score_genes form (target minus reference mean):

\[ A_{c,S} \;=\; \frac{1}{|S|}\sum_{g\in S} x_{c,g} \;-\; \frac{1}{|S^\star|}\sum_{g\in S^\star} x_{c,g}. \tag{8.2} \]
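
Eq. (8.2) translates directly into NumPy when the reference set \(S^\star\) is supplied explicitly. That explicit reference is an assumption of this sketch (Scanpy's `score_genes` instead samples reference genes from expression bins), and `set_score` is a hypothetical helper name.

```python
import numpy as np

def set_score(X, gene_names, target, reference):
    """Per-cell score: mean over target genes minus mean over reference genes.

    X is a [cells, genes] expression matrix whose columns follow gene_names.
    """
    idx = {g: i for i, g in enumerate(gene_names)}
    t = [idx[g] for g in target if g in idx]        # genes absent from X are skipped
    r = [idx[g] for g in reference if g in idx]
    if not t or not r:
        raise ValueError("target and reference sets must each match at least one gene")
    X = np.asarray(X, dtype=float)
    return X[:, t].mean(axis=1) - X[:, r].mean(axis=1)

# Toy matrix: 2 cells x 4 genes
gene_names = ["g1", "g2", "g3", "g4"]
X = np.array([[2.0, 4.0, 1.0, 1.0],
              [0.0, 0.0, 3.0, 5.0]])
scores = set_score(X, gene_names, target=["g1", "g2"], reference=["g3", "g4"])
print(scores)
```

A positive score means the set is expressed above the reference in that cell, and a negative score means below.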

3.3 GRN Inference

Estimate graph \(G=(V,E)\) with edge weight \(w_{ij}\). One view models gene \(i\) as a function of others:

\[ x_i \;\approx\; f_i\!\big(x_{-i};\,\Theta_i\big),\qquad w_{ji} \;\propto\; \text{importance of } x_j \text{ in predicting } x_i. \tag{8.3} \]

Approaches include correlation/partial correlation, mutual information, and tree ensembles (e.g., GENIE3 / RandomForest); perturbation data aids directionality.

4) Data Sources

  • Inputs: DEG table (gene, logFC, pval, FDR), preprocessed AnnData (expression + cell annotations; optional layers['perturbed']), dose–response params (EC50/Emax).
  • Annotations: KEGG, Reactome, MSigDB, GO; TF-target (TRANSFAC, JASPAR, TRRUST, DoRothEA); PPIs (STRING, BioGRID).

5) Implementation

Inputs · DEG & AnnData
import scanpy as sc, pandas as pd
adata = sc.read_h5ad('preprocessed_atra.h5ad')   # contains normalized layer / obs annotations
deg_df = pd.read_csv('deg_results.csv')          # columns: gene, logFC, pval, FDR
background_genes = set(adata.var_names)          # N = len(background_genes)
ORA · Hypergeometric test
from scipy.stats import hypergeom
def ora_pvalue(K, N, n, k):
    # pathway size K, background N, DEG size n, overlap k
    rv = hypergeom(N, K, n)
    # P(X >= k) = sf(k-1)
    return float(rv.sf(k-1))
GSEA · Rank-based (gseapy)
import gseapy as gp
# ranked list: gene -> statistic (e.g., logFC)
rnk = deg_df[['gene','logFC']].sort_values('logFC', ascending=False)
pre_res = gp.prerank(rnk=rnk, gene_sets='kegg.gmt', min_size=10, max_size=500, permutation_num=1000)
gsea_results = pre_res.res2d  # NES, pval, FDR, leadingEdge
Single-cell activity · score_genes
import scanpy as sc
target_genes = [...]         # from enrichment (e.g., KEGG_FATTY_ACID_MET)
ref_genes = [...]            # optional: pass as gene_pool=ref_genes to restrict reference sampling
# ctrl_size is documented as an int: the number of reference genes sampled per expression bin (default 50)
sc.tl.score_genes(adata, gene_list=target_genes, ctrl_size=50, score_name='score_fa', use_raw=False)
# UMAP coloring:
# sc.pl.umap(adata, color=['score_fa'])
GRN · RF importance (GENIE3-like)
import numpy as np, pandas as pd
from sklearn.ensemble import RandomForestRegressor
def infer_grn(X_df, genes=None, n_trees=500, random_state=42):
    # X_df: genes x cells DataFrame (index = gene names); transpose first if yours is cells x genes
    genes = list(X_df.index) if genes is None else list(genes)
    edges = []
    for gi, tgt in enumerate(genes):
        y = X_df.loc[tgt].values
        X = X_df.drop(index=tgt).T.values
        feat_names = X_df.drop(index=tgt).index
        rf = RandomForestRegressor(n_estimators=n_trees, random_state=random_state, n_jobs=-1)
        rf.fit(X, y)
        imp = rf.feature_importances_
        for s, w in zip(feat_names, imp):
            edges.append((s, tgt, float(w)))
    grn_edges = pd.DataFrame(edges, columns=['source','target','weight'])
    return grn_edges.sort_values('weight', ascending=False)
Integration · Gene → Pathway → Drug
# Map GRN key nodes to enriched pathways, compute centralities, and assemble a 3-layer view
import networkx as nx
def integrate_network(grn_edges, enriched_df, gene2path):
    G = nx.DiGraph()
    for s,t,w in grn_edges.itertuples(index=False):
        G.add_edge(s,t,weight=w,layer='gene-gene')
    # attach pathways
    for gene, paths in gene2path.items():
        for p in paths:
            G.add_edge(gene, p, weight=1.0, layer='gene-pathway')
    # drug layer (ATRA)
    G.add_node('ATRA', layer='drug')
    for p in set().union(*gene2path.values()):
        G.add_edge('ATRA', p, weight=1.0, layer='drug-pathway')
    cen = nx.pagerank(G, alpha=0.85, weight='weight')
    return G, cen
Persist outputs & meta → adata.uns['mechanism_meta']
import datetime, platform
def write_mechanism_meta(adata, method='gsea+ora+grn', seed=42, versions=None, extra=None):
    meta = dict(method=method, random_seed=seed,
                time=datetime.datetime.now().isoformat(),
                python=platform.python_version())
    if versions: meta.update(versions)
    if extra: meta.update(extra)
    adata.uns['mechanism_meta'] = meta

6) Output Description (Outputs & Interpretation)

  1. Enrichment result tables (CSV):
    • gsea_results.csv: pathway, NES, pval, FDR, leadingEdgeGenes.
    • ora_results.csv: ORA statistics and significance.
  2. Single-cell pathway scores in AnnData: adata.obs['score_<pathway>'].
  3. GRN edge table (CSV): source,target,weight,method, and an optional pvalue column; optional node centralities.
  4. Network & pathway figures: dose–response-relevant pathways, pathway dot/bar plots, GRN diagrams, pathway scores on UMAP, gene–pathway heatmaps.
  5. Candidate TF list: ranked by TF-target overlap and centrality.
  6. Integrated report (PDF/HTML): enrichment plots, pathway projections, representative networks, summaries, and suggested validations.
  7. Interpretation:
    • High NES / low FDR → priority pathways.
    • UMAP localization of high pathway scores → sensitive subpopulations.
    • High-centrality GRN nodes (often TFs) → upstream regulators for validation.
    • Agreement between virtual and real perturbations → stronger causal credibility.

Meta-information and reproducibility record: fitting method, random seed, fitting time, software versions, etc., are recorded in adata.uns['mechanism_meta'].

Average expression heatmap for enriched pathways.

7) Module Features & Highlights

  • Multi-scale analysis from genes → networks → pathways, not just lists but structured mechanisms.
  • Single-cell pathway scoring exposes subpopulation heterogeneity.
  • Perturbation-aware causal clues leverage virtual/real interventions to suggest directionality.
  • Interactive-ready outputs for network exploration and pathway filtering.
  • Integrated prioritization of TF/target/pathway candidates for downstream CRISPR/drug validation.

9. Summary and Resources

1) ATRA Analysis Tool — Contributions and Innovations (Summary)

Overall Positioning. This tool is an end-to-end platform for single-cell RNA-seq analysis of liver-targeted ATRA delivery. It spans the DBTL (Design–Build–Test–Learn) cycle: data cleaning, differential analysis, virtual cell generation, efficacy modeling, mechanism elucidation, and target prioritization. It integrates bioinformatics with reproducible in-silico interventions (virtual cells) to support drug delivery and target validation.

  • Integrated DBTL support: Links data processing → virtual experiments → pharmacodynamic modeling → mechanism analysis into an automated closed loop, enabling iterative optimization of drug design and protocols.
  • Virtual Cell pipeline: Generates reproducible “virtual treatment” data for multi-dose/strategy scenarios when wet-lab coverage is limited, reducing cost while enabling sensitivity analysis and prioritization.
  • Multi-scale efficacy modeling: Covers gene → pathway → population levels with single-gene Hill curves, population AUC, and single-cell GLMM.
  • Mechanism-driven prioritization: Integrates DEG, GSEA, GRN, and EC50/AUC results into candidate lists that directly guide CRISPR/validation experiments.
  • Engineering & reproducibility: Writes perturbations to AnnData.layers and logs adata.uns['perturbation_meta'] for traceability.
  • Modular + visual output: Web UI + API, interactive networks, and one-click report export for collaboration.

2) Source Repositories & Resource Links

Scanpy — single-cell analysis (GitHub)

Preprocessing, PCA/UMAP, DEG, visualization.

AnnData — standardized .h5ad format (GitHub)

Layers and obs/var annotations for reproducible pipelines.

scvi-tools — deep generative/latent modeling (GitHub)

Latent-space perturbation/decoding (scVI/scANVI/totalVI).

GSEApy — GSEA/ssGSEA/Enrichr wrapper (GitHub)

Run GSEA/ssGSEA directly on DEG/ranked lists; access KEGG/Reactome/MSigDB.

Arboreto / GENIE3 / GRN tools (GitHub)

GRN inference via RF/GBM importance; integrates with pySCENIC.

GEO (NCBI) — public expression data

Search/download datasets for training/calibration/controls.

Human Cell Atlas (HCA) — reference atlas

Useful as controls or for annotation references.

STRING / Reactome / KEGG / MSigDB — PPI & pathway sets

Protein interactions and curated pathways (consumed via GSEApy or APIs).

Scanpy Documentation — tutorials & API

Official guides for preprocessing, clustering, visualization, DEG.

scvi-tools Documentation — variational modeling examples

End-to-end workflows for latent modeling and decoding.

GSEApy Documentation — prerank / ssgsea examples

Reference for enrichment pipelines and parameters.

Install · Selected libraries
pip install scanpy anndata gseapy scvi-tools arboreto

3) Extended Applications and Future Directions

1) Package the virtual cell module

Create a Python package (virtualcells) with CLI/API; include unit tests and random seed logging.

  • Benefit: Team members can call functions independently and run CI.
  • Keys: Functional interface (mode), sparse/dense support, reproducibility metadata.

2) Automated report templates

Generate PDF/HTML with PCA/UMAP, DEG, dose–response, GSEA via Jinja2 or nbconvert.

  • Outcome: Single-run reports for meetings/submissions.
Cite & Dependencies · Quick copy
# Dependencies (example)
pip install scanpy anndata gseapy scvi-tools arboreto statsmodels pymc

# BibTeX (example skeletons to complete in paper)
@article{scanpy18, title={Scanpy}, journal={Genome Biol}, year={2018}}
@article{scvitools20, title={scvi-tools}, journal={Nat Biotechnol}, year={2022}}
@article{gseapy, title={GSEApy}, journal={Bioinformatics}, year={2023}}

Source Code & Repository

GitLab · Public repository: 2025 / software-tools / syphu-china (browse code, commits, issues, and CI).
Clone (HTTPS)
git clone https://gitlab.igem.org/2025/software-tools/syphu-china.git
cd syphu-china
Setup · Recommended
# (Optional) create env
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate

# install core deps (align with docs)
pip install -U pip setuptools wheel
pip install scanpy anndata gseapy scvi-tools arboreto statsmodels networkx

# run tests / examples (adjust to repo)
# pytest -q
# python examples/quickstart.py
Environment (example)
# export these only if your repo uses them
export ATLAS_CACHE=./cache
export DATA_DIR=./data
export CUDA_VISIBLE_DEVICES=0

Note: If the repository uses SSH, replace the clone URL accordingly. Keep dependencies in sync with the repo’s requirements.txt or pyproject.toml.

Software · Source Code

Repository: https://gitlab.igem.org/2025/software-tools/syphu-china

Software · Web App

Demo (Web): A dedicated page to preview and explain the browser version of our software.

Software · Archive (Zenodo)

DOI: https://doi.org/10.5281/zenodo.17259225
This Zenodo record contains our software release and associated datasets.