Our Model TriScale

Engineering peroxisomal protein targeting and rewiring central carbon metabolism in Yarrowia lipolytica poses a formidable design challenge: the import efficiency of linker–signal peptides, the catalytic bottleneck imposed by pathway rate-limiting enzymes, and the nonlinear dynamics of growth under complex carbon sources are deeply intertwined. Trial-and-error approaches alone cannot efficiently resolve this multi-layered problem.

To bridge this gap, we established a hierarchical in silico pipeline that acts as a digital twin of our PICasSO-inspired engineering strategy, enabling rational design and predictive prototyping.

Integration and Impact

By fusing molecular docking, AI-guided enzyme redesign, and machine learning–based systems modeling into a unified framework, our model enables in silico prototyping of peroxisomal targeting and metabolic rewiring. This model-driven strategy transforms the traditional Design–Build–Test–Learn (DBTL) cycle into a predictive, knowledge-guided process, substantially reducing experimental overhead while accelerating the path toward high-yield, industrially robust strains.

I. Linker–Signal Peptide System Docking

Signal-peptide & linker selection · Molecular docking · Energy analysis

1.1 Background

Peroxisomal protein import relies on peroxisomal targeting signals (PTS) and their cytosolic receptors.
Two major types exist: PTS1, a canonical C-terminal tripeptide such as –SKL¹, and PTS2, a nonapeptide motif near the N-terminus.
Among these, PTS1 is the most conserved and widely used signal across eukaryotes.

The cytosolic receptor PEX5 specifically recognizes PTS1-containing proteins, forming a transient receptor–cargo complex.
This complex docks at the peroxisomal membrane through interactions with PEX13 and PEX14, which together constitute the import machinery.
After cargo translocation, PEX5 is recycled to the cytosol for subsequent transport cycles.

To engineer efficient peroxisomal targeting in Yarrowia lipolytica, we incorporated the classical PTS1 (–SKL) signal at the C-terminus of peptides.
Beyond the minimal SKL motif, we designed two extended variants—GGGGSSKL and TYWIRFSKL—to systematically evaluate how linker length and composition affect PEX5 recognition.
The design rationale was that a flexible linker (GGGGS) could increase accessibility of the SKL motif, while an aromatic-rich segment (TYWIRF) might provide additional stabilizing interactions.

These three peptide designs were fused to the eight key enzymes of the MVA pathway (Erg8, Erg9, Erg10, Erg12, Erg13, Erg20, Hmgr and Idi), providing the foundation for subsequent molecular docking analysis.

An animation illustrates this process: enzyme–SKL forms a complex with PEX5, which then docks with PEX14 and PEX13; IDRs of PEX13 undergo liquid–liquid phase separation to generate a transient liquid cavity that facilitates import.

Figure 2. Molecular Mechanism of PEX5-Mediated Peroxisomal Import

1.2 Molecular Docking

1.2.1 Energy Terms and Scoring Principles

The HADDOCK docking protocol is fundamentally a physics-based and restraint-driven platform. It evaluates docking poses by combining several key energy terms:

van der Waals interactions: Describing repulsion and attraction between atoms, preventing steric clashes while capturing hydrophobic stabilization.

$$ E_{\text{vdW}} = \sum_{i<j} \varepsilon_{ij} \!\left[ \left( \frac{r_{\min,ij}}{r_{ij}} \right)^{12} - 2 \left( \frac{r_{\min,ij}}{r_{ij}} \right)^6 \right] $$

Electrostatics: Coulombic interactions between charged residues, essential for PEX5 recognition of signal peptides.

$$ E_{\text{elec}} = \sum_{i<j} \frac{k_e\,q_i q_j}{\varepsilon(r_{ij})\,r_{ij}}, \quad k_e \approx 332\ \text{kcal·Å·mol}^{-1}\text{·e}^{-2} $$

Desolvation energy: Reflecting the free-energy change of amino acid residues when buried at the interface. Hydrophobic burial is favorable, while polar residues require compensating hydrogen bonds or salt bridges.

$$ E_{\text{desolv}} = \sum_{i,j} S_i V_j \exp\left(-\frac{r_{ij}^2}{2\sigma^2}\right) $$

Ambiguous Interaction Restraints (AIRs): Experimentally derived or hypothesized distance restraints that bias docking toward biologically relevant solutions.

$$ E_{\text{air}} = \sum_k W_k \big(\langle d_k^{-6}\rangle^{-1/6} - d_{0,k}\big)^2 $$

At different docking stages (rigid body, semi-flexible refinement, water refinement), these energy terms are combined with varying weights into the final HADDOCK score. Additionally, Buried Surface Area (BSA) is used to quantify the size and compactness of the interface.

$$ \mathrm{Score}=w_{vdW}E_{vdW}+w_{elec}E_{elec}+w_{desolv}E_{desolv}+w_{air}E_{air}-w_{BSA}\cdot\mathrm{BSA} $$

(Optional reporting term: − 0.01 × BSA may be included to account for interface compactness.)

Thus, HADDOCK scoring integrates both atomic-level physical interactions and experimental restraints, rather than relying on geometry alone.

1.2.2 Structure Preparation

To systematically investigate peroxisomal targeting, we collected structural information for the eight key enzymes of the Yarrowia lipolytica MVA pathway (Erg8, Erg9, Erg10, Erg13, Erg12, Erg20, Hmgr, Idi). Experimental structures were retrieved from the PDB database, and missing structures were completed by homology modeling.

We designed three C-terminal signal peptide tags: the canonical SKL motif, the flexible extended GGGGSSKL, and the aromatic-rich TYWIRFSKL. These tags were fused to the eight enzymes, and complete inputs were generated with AlphaFold predictions.

1.2.3 Docking Workflow

Docking simulations were performed with HADDOCK 2.4⁸ for PEX5 against each enzyme–peptide fusion. In total, 24 independent jobs (3 peptides × 8 enzymes) were executed. Each job produced ~300 poses at the rigid-body stage (it0), yielding ~14,400 initial poses overall. After semi-flexible refinement (it1) and water refinement, 2,880 water-refined models were retained, consistent with the typical ~1:5 (it1:it0) selection ratio.

$$ \mathrm{Score}_{it0}=0.01E_{vdw}+1.0E_{elec}+1.0E_{desolv}+0.01E_{air}-0.01\,\mathrm{BSA} $$

$$ \mathrm{Score}_{it1}=1.0E_{vdw}+1.0E_{elec}+1.0E_{desolv}+0.01E_{air}-0.01\,\mathrm{BSA} $$

$$ \mathrm{Score}_{water}=1.0E_{vdw}+0.2E_{elec}+1.0E_{desolv}+0.1E_{air} $$

Clustering by interface RMSD reduced the dataset to ~192 representative complexes. Based on HADDOCK scores, buried surface area (BSA), and conformational diversity, 24 best-performing complexes were selected for downstream energy analysis and visualization.

1.2.4Definition of Active Residues

To focus docking on biologically relevant regions, active residues were explicitly defined. For PEX5, the TPR repeat hot-spot residues (364, 398, 438, 469, 476, 501, 507, 585) were selected. For the MVA enzymes, the C-terminal linker + SKL regions were defined as active sites, reproducing the natural recognition of PTS1 signals by PEX5.

1.2.5 Scoring and Visualization

Docking results were evaluated with the HADDOCK scoring function, which integrates van der Waals, electrostatics, desolvation, and restraint energies, combined with BSA to quantify interface size (Figure 1).

Representative complexes were visualized in PyMOL and ChimeraX to illustrate hydrogen-bond networks and structural overlays. The results confirmed the plausibility of docking poses and highlighted distinct recognition patterns between the minimal and extended signal peptides at the PEX5 interface.

1.3 Energy Analysis

To comprehensively evaluate docking outcomes, we analyzed both global scores and underlying energetic contributions.

Boxplot comparisons across all peptide–enzyme complexes revealed a clear hierarchy: the extended TYWIRFSKL consistently achieved the most favorable HADDOCK scores, the canonical SKL reproducibly ranked lowest, and GGGGSSKL showed intermediate values. This ordering was robust across replicates and reflected a systematic advantage for the aromatic-rich extension.

When HADDOCK scores were resolved for each enzyme partner (barplot analysis), the same pattern emerged: TYWIRFSKL outperformed SKL in nearly every case, often with large margins, whereas GGGGSSKL tracked between the two. These results demonstrate that the effect is not restricted to a single target but represents a general enhancement of binding across the enzyme panel.

Global HADDOCK score distribution (boxplot)

Figure 4a. Global HADDOCK score distribution across peptide–enzyme complexes (boxplot).

Enzyme-wise HADDOCK score comparison (barplot)

Figure 4b. Enzyme-wise HADDOCK score comparison across target enzymes (barplot).

We next examined the energetic basis for these differences. Donut plots visualizing HADDOCK scoring weights highlighted distinct stabilization mechanisms. For TYWIRFSKL, improved binding arose primarily from enhanced electrostatics and van der Waals packing, which favor tighter and more specific interfaces. By contrast, SKL complexes relied disproportionately on desolvation contributions, reflecting weaker direct interactions and more solvent-mediated stabilization. GGGGSSKL again showed an intermediate profile, suggesting that the flexible linker partially compensates but cannot substitute for the aromatic contacts.

Figure 5. Energetic decomposition showing relative contributions of electrostatics, vdW, and desolvation terms (donut plots).

Finally, we integrated all docking metrics—including HADDOCK score, vdW, electrostatics, desolvation, buried surface area (BSA), and Z-score—into a standardized heatmap to compare complexes on a unified scale. To enable cross-metric comparison, terms where “lower is better” were inverted, and each column was z-score normalized. This approach eliminated differences in physical units and revealed the relative performance landscape.

$$ Z = \frac{x - \mu}{\sigma} $$

Figure 6. Standardized heatmap of HADDOCK-derived energetic and interfacial metrics.

In this analysis, TYWIRFSKL complexes consistently clustered toward favorable energetic and interfacial signatures, with larger BSA and stronger vdW/electrostatics contributions. In contrast, SKL complexes grouped together at the unfavorable end of the spectrum, with lower BSA and weaker interactions.

Together, these multi-layered analyses converge on the conclusion that aromatic extension of the SKL motif enhances recognition, increases buried surface area, and stabilizes peptide–enzyme interactions through stronger packing and electrostatics, providing a mechanistic rationale for its superior docking performance.

In summary, TYWIRFSKL was identified as the optimal linker–signal peptide system capable of efficiently binding to PEX5 and mediating peroxisomal import, offering clear guidance for experimental design and validation in the wet-lab phase.

1.4 Limitations and Future Perspectives

Despite providing valuable insights into peptide–enzyme recognition, this study has several limitations.

First, HADDOCK predictions are highly dependent on the quality of input structures.
Homology-modeled enzymes and AlphaFold-predicted peptide fusions may carry local inaccuracies, and the definition of active residues remains partly subjective—both factors potentially bias docking outcomes and cluster rankings.

Second, docking represents molecular interactions in a static framework.
Conformational flexibility, induced-fit effects, and large-scale domain motions prevalent in the cellular environment cannot be fully captured.
Moreover, macromolecular crowding, ionic strength, and post-translational modifications—critical determinants of peroxisomal import—were not considered in the present computational setup.

Third, the HADDOCK scoring function is based on empirical weighting.
While the composite score provides a useful trend indicator, it may not accurately reflect the true energetic balance under physiological conditions.
The dominance of electrostatic and van der Waals contributions observed here warrants validation through more rigorous, force-field–based simulations.

Finally, the functional relevance of docking preferences remains to be experimentally verified.
In vitro binding assays and in vivo peroxisomal import experiments will be required to confirm whether the enhanced affinity of TYWIRFSKL indeed translates into improved targeting efficiency.

Future work will address these limitations by:

refining top-performing complexes with long-timescale molecular dynamics simulations under near-physiological conditions;
incorporating explicit solvent and macromolecular crowding models to capture cellular complexity;
coupling computational predictions with biochemical and cellular assays to validate binding preferences and import outcomes.

II. Protein Design and Optimization

Structure prediction · Active-site engineering · Stability refinement

2.1 Background

3-Hydroxy-3-methylglutaryl-CoA reductase (HMGR) is the rate-limiting enzyme of the mevalonate (MVA) pathway, catalyzing the reduction of HMG-CoA to mevalonate — a crucial control point for isoprenoid, sterol, and lipid biosynthesis. The structural stability and catalytic efficiency of HMGR therefore exert a decisive influence on downstream metabolic flux and product accumulation.

In Yarrowia lipolytica, HMGR is a membrane-associated enzyme composed of an N-terminal transmembrane and sterol-binding domain and a C-terminal cytosolic catalytic domain. Because full-length HMGR is difficult to express and fold in heterologous systems, we employed a truncated soluble variant (tHMGR) that lacks the sterol-binding and transmembrane regions but retains the core catalytic domain. This truncation preserves enzymatic activity while significantly improving solubility and expression stability.

To enhance thermodynamic stability without compromising catalytic function, we developed a structure-guided protein design framework integrating ProteinMPNN, Frustratometer2, and Rosetta FastRelax, enabling systematic sequence exploration and energy refinement of tHMGR — laying a computational foundation for rational enzyme optimization in the MVA pathway.

2.2 Theoretical Principles

2.2.1 ProteinMPNN

ProteinMPNN⁹ is a graph-based sequence design model built upon message passing neural networks (MPNNs).
The protein backbone is represented as a residue graph, where nodes correspond to residues and edges encode geometric relationships.
By iteratively passing messages between nodes, the model learns spatial dependencies that govern residue identity.

Trained on large-scale protein structure–sequence datasets, ProteinMPNN can rapidly sample diverse and stable sequences conditioned on a fixed backbone, efficiently balancing sequence diversity and thermodynamic stability.

2.2.2Frustratometer2

Frustratometer2¹¹ quantitatively characterizes the frustration of a protein’s energy landscape — a measure of energetic conflict among local interactions.
Using molecular mechanics potentials and perturbation analysis, it classifies interactions as minimally, neutral, or highly frustrated, thereby evaluating both local and global energetic smoothness.

Reducing local frustration typically enhances folding cooperativity and thermodynamic stability.

2.2.3 Rosetta FastRelax

Rosetta¹³ FastRelax is an all-atom refinement protocol within the Rosetta suite.
By iteratively minimizing energy and repacking side-chains while maintaining the global fold, it eliminates steric clashes and achieves energetically favorable conformations.

FastRelax is typically employed as a post-design refinement step to further optimize candidate structures, ensuring physical plausibility and thermodynamic robustness.

2.3 Design Workflow

To systematically explore the sequence landscape of tHMGR, we established a three-step design–optimization workflow (Figure 7):

(1) Sequence Sampling:
ProteinMPNN generates approximately 100 candidate sequences conditioned on the tHMGR backbone.
(2) Energy Evaluation:
Each sequence is analyzed using Frustratometer2 to quantify local and global frustration, selecting those with smooth energy landscapes.
(3) Structure Relaxation:
Selected candidates are refined by Rosetta FastRelax to remove steric conflicts and minimize total energy at the atomic level.

This pipeline enables iterative coupling of generative modeling and physics-based refinement, ensuring both sequence diversity and structural stability in tHMGR design.

Figure 7. Integrated pipeline for sequence design and energy optimization of tHMGR.

2.4 Results and Validation

The workflow successfully generated a set of structurally compatible and energetically optimized tHMGR variants.
All ProteinMPNN-designed sequences aligned well with the target backbone, confirming the feasibility of conditional sequence generation.

Frustratometer2 analysis revealed that several designed variants exhibited a slight reduction (~2%) in highly frustrated regions and a corresponding increase in minimally frustrated contacts compared with the wild-type template.

After Rosetta FastRelax refinement, the total energy decreased by approximately 15–25%, and the local energy landscape in the core folding regions became smoother .
During the relaxation process, the backbone RMSD gradually converged from 14.52 Å to 1.34 Å (calculated over 2796 atoms).
The RMSD values for the five relaxation cycles were 14.52, 5.10, 2.75, 2.08, and 1.67 Å respectively , illustrating the progressive elimination of energetic strain through iterative minimization and side-chain repacking.

The final RMSD of ~1.3 Å indicates that structural refinement mainly affected local regions—particularly side chains and flexible loops—while maintaining the global fold.
In the interactive NGLView visualization, the pre-relaxation model (grey) and post-relaxation model (cyan) clearly demonstrate local rearrangements and reduced steric clashes, confirming the effectiveness of the FastRelax refinement.

Figure 8. RMSD convergence during Rosetta FastRelax refinement of tHMGR variants.

Top-ranked variants were subsequently selected for experimental validation of folding stability and catalytic activity.

2.5 Biological Interpretation

From an energy-landscape perspective, the reduction of frustration indicates fewer energetic barriers along the folding pathway, resulting in a smoother folding funnel and more stable energy minima.
During the design process, the core catalytic region of tHMGR was deliberately preserved; all mutations were restricted to peripheral or non-conserved sites to maintain the geometry of catalytic residues and cofactor-binding motifs.
This rational constraint ensured that structural optimization did not compromise enzymatic activity, achieving a balance between stability enhancement and functional retention.
Interestingly, several low-frustration regions were located near the catalytic pocket, suggesting that local stabilization may further fine-tune the catalytic microenvironment and improve functional efficiency.

2.6 Limitations and Future Work

Although the integrated framework performed effectively for rational optimization of tHMGR, several limitations remain.
First, ProteinMPNN operates on static backbones and does not capture conformational flexibility.
Second, Frustratometer2 relies on classical potentials lacking quantum and solvent effects.
Third, FastRelax focuses on energy minimization without explicitly constraining catalytic kinetics.

Future extensions include:
1 Incorporating molecular dynamics simulations to represent conformational dynamics;
2 Developing multi-objective optimization strategies to balance thermodynamic stability and catalytic activity;
3 Integrating experimental feedback loops to achieve closed-cycle computational–experimental co-design.

III. Machine Learning Growth Modeling

Gompertz fitting · Parameter prediction · Environmental restraint

3.1 Background

Understanding the quantitative relationship between environmental factors and microbial growth dynamics is essential for optimizing bioprocesses and designing predictive cultivation strategies.
In this study, Yarrowia lipolytica was chosen as a model organism owing to its exceptional environmental tolerance and remarkable lipid-producing capacity, making it a versatile chassis for metabolic engineering.

Traditionally, microbial growth is described using the four-parameter Gompertz equation, which captures the characteristic sigmoidal trajectory of a batch culture.
Its parameters—initial biomass (A), carrying capacity (K), maximum specific growth rate (μₘₐₓ), and lag-phase duration (λ)—offer direct physiological interpretation, reflecting cell density, growth potential, metabolic activity, and adaptation time, respectively.

However, this classical fitting approach is inherently phenomenological and condition-specific: each growth curve is fitted independently under a fixed environment, and the obtained parameters cannot be generalized to new combinations of pH or temperature.
To overcome this limitation, we established a hybrid modelling framework that integrates mechanistic growth equations with machine-learning regression, enabling the model to learn the mapping between culture conditions (pH, T) and Gompertz parameters.

This combined approach allows not only quantitative prediction of unseen conditions but also rational optimization of environmental variables—bridging the gap between empirical growth curves and data-driven bioprocess design.

3.2 Data Acquisition

To construct a comprehensive dataset linking environmental variables to growth behavior, we continuously monitored OD₆₀₀ of Yarrowia lipolytica cultivated under systematically varied conditions.
Each culture was measured in triplicate across eleven time points spanning 0–80 h. At every time point, OD₆₀₀ was recorded both before and after centrifugation, and the difference was taken as the effective OD, thereby eliminating background interference from medium turbidity and yielding cleaner growth profiles.

The experimental design combined two environmental factors—pH and temperature—with twelve distinct carbon sources, including oleic acid, glycerol, and recycled cooking oil.
Sixteen pH–temperature combinations were tested in total: twelve evenly spaced factorial conditions for model training and four intermediate, non-equidistant points for testing generalization to unseen environments.

Altogether, the dataset contained approximately 6,000 high-resolution measurements, providing a robust quantitative foundation for Gompertz parameter fitting and machine-learning regression.

3.3 Modelling and Prediction

To transform the experimental OD₆₀₀ trajectories into interpretable kinetic descriptors, we first applied the Gompertz²² equation to fit all growth curves using non-linear least squares

Figure 9. Gompertz model fitting of Y. lipolytica growth under varied carbon-source conditions.The carbon sources in the figure include: Oleic acid + glycerol (1:1, 1:3, 1:5), Linoleic acid + glycerol (1:1, 1:3, 1:5), Alpha-linolenic acid + glycerol (1:1, 1:3, 1:5), Glucose (YPD), and waste cooking oil from the school canteen (Lipid, Lipid-fatase).

This classical model captures the sigmoidal microbial growth pattern and provides four biologically meaningful parameters: the initial biomass (A), the carrying capacity (K), the maximum specific growth rate (μₘₐₓ), and the lag-phase duration (λ).

$$ y(t) = A + (K - A) \cdot \exp\left[-\exp\left(\frac{\mu_{\max} \cdot e}{K - A}(\lambda - t) + 1\right)\right] $$

Each fit produced smooth trajectories and 95 % confidence intervals, providing robust parameter estimates.

Figure 10. Gompertz model fitting with 95% confidence intervals across multiple carbon sources.

The fitted parameters were then reformulated as a multi-output regression task, with environmental conditions (pH, temperature, substrate) serving as inputs and the Gompertz parameter vector as outputs.
Model training minimized the L2 loss function , ensuring consistency between predicted and fitted parameters.

$$ L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \| \theta_i - f_\theta(x_i) \|_2^2 $$

For decision-tree models²³, node splits were determined by maximizing variance reduction .
Two ensemble-tree algorithms were benchmarked:

$$ \Delta V = \mathrm{Var}(S) - \frac{|S_L|}{|S|}\mathrm{Var}(S_L) - \frac{|S_R|}{|S|}\mathrm{Var}(S_R) $$

Random Forest — aggregates multiple decision trees through a bagging strategy to reduce variance and improve robustness;
XGBoost — applies boosting to iteratively fit residuals while optimizing a regularized objective . The regularization term penalizes overly complex tree structures and large weights, thereby enhancing generalization.

$$ \mathrm{Obj} = \sum_i \frac{1}{2} \| \theta_i - \hat{\theta}_i \|_2^2 + \sum_k \Omega(f_k) $$

$$ \Omega(f) = \gamma T + \frac{\lambda}{2} \sum_{j=1}^{T} w_j^2 $$

Training was conducted under five-fold cross-validation, with predictive performance evaluated using the coefficient of determination (R²) and root-mean-square error (RMSE).
To ensure biological plausibility, we incorporated an empirical constraint based on the 10%→90% rise width of the Gompertz curve .

$$ \Delta t_{10\to90} \approx \frac{3.0844e}{\mu_{\max}(K - A)} $$

In specific scenarios, μₘₐₓ was recalculated from this constraint , further improving the physiological interpretability of predictions.

$$ \mu_{\max}^{\mathrm{new}} = \frac{3.0844}{20} \cdot \frac{K - A}{e} $$

In summary, this tri-layer modelling framework — mechanistic fitting, data-driven regression, and empirical constraint — ensures both interpretability and generalization, providing a robust tool for predictive modelling and optimization of microbial growth curves.
This hybrid approach bridges mechanistic insight with data-driven flexibility, enabling rational environmental optimization in bioprocess design.

Figure 11. Environmental adjustment of lipid growth curve (pH 6.0, 30 °C).

3.4 Results and Model Verification

After completing curve fitting and machine learning training, we systematically evaluated the overall performance and extrapolation capability of the model. All 144 growth curves (12 environmental conditions × 12 substrates) were successfully fitted without any failure, demonstrating the strong adaptability and numerical stability of the Gompertz equation within this system. The fitted curves closely overlapped with the experimental measurements, accurately capturing the initiation, exponential, and stationary phases of the typical sigmoidal growth pattern, confirming that the model can robustly describe microbial growth under diverse conditions.

Figure 12. System evaluation of growth fitting and model prediction.

To assess the model’s generalization ability, fivefold cross-validation was conducted on the training dataset, and an independent test set was constructed for unseen conditions. The training dataset included combinations of pH 4.5, 5.5, 6.5, and 7.5 with temperatures of 24 °C, 28 °C, and 32 °C, while the test dataset comprised four new conditions (pH 6.0 and 6.8 × 26 °C and 30 °C). Cross-validation results showed that the Random Forest model achieved the best overall accuracy, with an average RMSE of approximately 0.0609 and R² of 0.93, whereas XGBoost yielded slightly higher error with a normalized RMSE of around 0.12. The predicted values of the four Gompertz parameters (A, K, μₘₐₓ, and λ) exhibited strong linear correlation with the fitted results, indicating that the model successfully learned the nonlinear mapping between environmental variables and growth kinetics.

Figure 13. Predicted vs. true values for Gompertz parameters..

In the independent test set, the predicted growth curves almost completely overlapped with the experimentally fitted ones. The onset of the exponential phase and the stationary plateau were consistent across curves, with deviations in onset time and maximum specific growth rate remaining within ±5 %.

$$ L = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$

These results indicate that the learned mapping relationships can reliably extend to unseen environmental conditions, demonstrating excellent extrapolation performance.

Figure 14. Predicted vs. actual growth curve (unseen condition, 26 °C).

The reliability of the model was further confirmed through multiple validation strategies. Leave-one-condition-out analysis showed that removing any single condition still yielded predictions with an average R² above 0.9. Residual analysis revealed a normal distribution of prediction errors across different pH and temperature combinations, without systematic bias. Moreover, bootstrap resampling demonstrated that 90 % prediction intervals successfully encompassed over 90 % of the true parameter values, confirming the robustness and statistical confidence of the model.

The combined “prediction + constraint” framework preserves biophysical interpretability while enabling accurate extrapolation to unseen environmental conditions. The model exhibits high stability, generalization, and physiological plausibility, providing a reliable quantitative foundation for subsequent metabolic modeling and cultivation optimization.

In summary, pH 6.0, 28 °C, and a 1:1 glycerol-to-oleic acid ratio were identified as the optimal cultivation conditions for Yarrowia lipolytica, providing clear guidance for subsequent wet-lab experiments.

3.5 Biological Interpretation

The four kinetic parameters (A, K, μₘₐₓ, λ) extracted from the model provide clear insights into how Yarrowia lipolytica responds to different environmental conditions. Both pH and temperature exhibited strong nonlinear effects on growth dynamics. At pH 5.5–6.5 and 28 °C, the yeast achieved its highest maximum specific growth rate (μₘₐₓ) and the shortest lag-phase duration (λ), indicating this region represents the physiological optimum where metabolic activity and growth are most synchronized. This trend is consistent with previous reports on the optimal pH range for lipid accumulation, suggesting that the conditions favoring rapid population growth also coincide with enhanced biosynthetic activity.

When the environment deviated from this range (pH < 5 or > 7), μₘₐₓ decreased sharply while λ increased, indicating that both acidic and alkaline stresses delayed metabolic activation and slowed biomass expansion. Under alkaline conditions in particular, the model predicted that λ could increase by 1.5–2 fold, implying that membrane integrity and enzyme activity are compromised under pH stress.

Temperature exerted a clear influence on the carrying capacity (K). Above 28 °C, K gradually declined, suggesting that elevated temperatures induced early stationary phases and reduced total biomass yield. Conversely, between 24–28 °C, K remained stable at high levels, implying that moderate temperatures promote prolonged and balanced population growth.

Table 3-1. Model predictions versus actual Gompertz parameters on independent test set

Group	Target	pH	Temperature	A_pd	K_pd	μₘₐₓ_pd	λ_pd	A_ac	K_ac	μₘₐₓ_ac	λ_ac
1	Oleic_1:1	6.0	26.0	0.013	0.801	0.048	8.932	0.012	0.719	0.048	8.955
1	Oleic_1:3	6.0	26.0	0.028	0.507	0.072	8.471	0.027	0.451	0.072	8.558
1	Oleic_1:5	6.0	26.0	0.024	0.566	0.047	5.003	0.039	0.503	0.046	5.000
1	Linoleic_1:1	6.0	26.0	0.021	0.267	0.100	6.653	0.019	0.210	0.100	6.775
1	Linoleic_1:3	6.0	26.0	0.041	0.346	0.075	5.006	0.039	0.306	0.074	5.000
1	Linoleic_1:5	6.0	26.0	0.030	0.435	0.073	7.381	0.020	0.385	0.073	7.495
1	Alpha-linolenic_1:1	6.0	26.0	0.018	0.234	0.100	6.450	0.014	0.179	0.100	6.475
1	Alpha-linolenic_1:3	6.0	26.0	0.029	0.125	0.100	7.265	0.024	0.103	0.100	7.264
1	Alpha-linolenic_1:5	6.0	26.0	0.023	0.140	0.100	6.397	0.020	0.281	0.100	6.562
1	YPD	6.0	26.0	0.014	0.186	0.100	5.896	0.013	0.145	0.100	5.995
1	Lipid	6.0	26.0	0.049	0.552	0.100	10.861	0.044	0.496	0.100	10.875
1	Lipid-fatase	6.0	26.0	0.000	0.476	0.049	6.881	0.000	0.418	0.049	6.908
2	Oleic_1:1	6.0	30.0	0.018	0.818	0.048	8.920	0.014	0.799	0.048	8.862
2	Oleic_1:3	6.0	30.0	0.029	0.513	0.072	8.487	0.032	0.501	0.071	8.482
2	Oleic_1:5	6.0	30.0	0.042	0.568	0.047	5.005	0.049	0.558	0.046	5.000
2	Linoleic_1:1	6.0	30.0	0.021	0.258	0.100	6.667	0.022	0.233	0.100	6.720
2	Linoleic_1:3	6.0	30.0	0.042	0.347	0.075	5.006	0.047	0.338	0.075	5.000
2	Linoleic_1:5	6.0	30.0	0.020	0.435	0.073	7.399	0.022	0.427	0.071	7.246
2	Alpha-linolenic_1:1	6.0	30.0	0.017	0.224	0.100	6.452	0.015	0.199	0.100	6.476
2	Alpha-linolenic_1:3	6.0	30.0	0.029	0.122	0.100	7.270	0.028	0.115	0.100	7.233
2	Alpha-linolenic_1:5	6.0	30.0	0.032	0.328	0.100	6.426	0.039	0.312	0.100	6.439
2	YPD	6.0	30.0	0.014	0.178	0.100	5.912	0.014	0.161	0.100	5.859
2	Lipid	6.0	30.0	0.059	0.555	0.100	10.858	0.052	0.551	0.100	10.920
2	Lipid-fatase	6.0	30.0	0.000	0.477	0.048	6.840	0.000	0.465	0.047	6.592
3	Oleic_1:1	6.8	26.0	0.013	0.824	0.048	8.922	0.010	0.613	0.048	8.943
3	Oleic_1:3	6.8	26.0	0.029	0.519	0.072	8.478	0.023	0.386	0.072	8.549
3	Oleic_1:5	6.8	26.0	0.043	0.574	0.047	5.004	0.033	0.430	0.046	5.000
3	Linoleic_1:1	6.8	26.0	0.021	0.259	0.100	6.664	0.016	0.179	0.100	6.808
3	Linoleic_1:3	6.8	26.0	0.042	0.351	0.075	5.006	0.034	0.261	0.073	5.000
3	Linoleic_1:5	6.8	26.0	0.021	0.443	0.073	7.386	0.017	0.328	0.074	7.548
3	Alpha-linolenic_1:1	6.8	26.0	0.027	0.227	0.100	6.450	0.012	0.152	0.100	6.508
3	Alpha-linolenic_1:3	6.8	26.0	0.029	0.123	0.100	7.269	0.021	0.088	0.100	7.288
3	Alpha-linolenic_1:5	6.8	26.0	0.036	0.329	0.100	6.410	0.030	0.240	0.100	6.668
3	YPD	6.8	26.0	0.013	0.180	0.100	5.899	0.011	0.124	0.100	5.800
3	Lipid	6.8	26.0	0.030	0.566	0.100	10.859	0.038	0.423	0.100	10.868
3	Lipid-fatase	6.8	26.0	0.000	0.484	0.048	6.845	0.002	0.355	0.051	7.187
4	Oleic_1:1	6.8	30.0	0.031	0.828	0.048	9.012	0.028	0.907	0.048	8.950
4	Oleic_1:3	6.8	30.0	0.029	0.520	0.072	8.947	0.027	0.453	0.072	8.950
4	Oleic_1:5	6.8	30.0	0.043	0.576	0.047	5.007	0.041	0.539	0.047	5.000
4	Linoleic_1:1	6.8	30.0	0.022	0.259	0.100	6.675	0.022	0.229	0.100	6.784
4	Linoleic_1:3	6.8	30.0	0.041	0.352	0.075	5.010	0.040	0.331	0.074	5.000
4	Linoleic_1:5	6.8	30.0	0.022	0.446	0.073	7.391	0.020	0.402	0.073	7.472
4	Alpha-linolenic_1:1	6.8	30.0	0.018	0.228	0.100	6.458	0.016	0.186	0.100	6.489
4	Alpha-linolenic_1:3	6.8	30.0	0.029	0.124	0.100	7.278	0.025	0.107	0.100	7.266
4	Alpha-linolenic_1:5	6.8	30.0	0.035	0.331	0.100	6.436	0.032	0.289	0.100	6.459
4	YPD	6.8	30.0	0.014	0.183	0.100	5.915	0.013	0.151	0.100	5.904
4	Lipid	6.8	30.0	0.061	0.557	0.100	10.863	0.055	0.522	0.100	10.881
4	Lipid-fatase	6.8	30.0	0.000	0.479	0.048	6.842	0.000	0.442	0.047	6.701

RMSE: 0.0609 R²: 0.9333
The model shows close agreement between predicted and actual Gompertz parameters across unseen pH–temperature conditions.

Together, these findings demonstrate that the model successfully captured the multidimensional physiological responses of Y. lipolytica to environmental variation. The inverse relationship between μₘₐₓ and λ reflects the trade-off between metabolic activation and adaptation, while the decline of K at higher temperatures highlights energy and stability constraints under heat stress. Overall, this interpretation confirms the biological validity of the model and provides quantitative guidance for optimizing culture conditions.

3.6 Limitations and Future Work

Although the hybrid modelling framework developed in this study demonstrated high accuracy and interpretability in predicting and optimizing microbial growth, several limitations remain.
First, the Gompertz equation assumes a single sigmoidal phase and does not account for substrate depletion, nutrient limitation, or metabolic reprogramming.
As a result, its descriptive capacity may decline when Yarrowia lipolytica undergoes secondary growth or stress-induced recovery under complex substrate conditions.
Second, the machine-learning module currently relies on static environmental inputs (pH, temperature, substrate type), without incorporating temporal features or process factors such as dissolved oxygen or C/N ratio.
This allows the model to capture global growth trends but limits its ability to represent short-term transitions or regulatory responses.

Future extensions will focus on two directions:
(1) introducing dynamic modelling approaches, such as sequence-to-sequence architectures or neural ordinary differential equations (Neural ODEs), to describe multi-phase growth and stress-response processes;
and (2) expanding the dataset to include multiple carbon sources and nutrient regimes, enabling the construction of a multidimensional predictive model that learns substrate adaptability and growth stability simultaneously.

Overall, this framework establishes a generalizable approach for predictive and interpretable growth modelling in Y. lipolytica, which can be further enhanced by integrating dynamic information and environmental complexity.

Click to learn more about our software-tools

References

Gould SJ, Keller GA, Hosken N, Wilkinson J, Subramani S. A conserved tripeptide sorts proteins to peroxisomes. J Cell Biol. 1989;108(5):1657–1664.
Brocard C, Hartig A. Peroxisome targeting signal 1: Is it really a simple tripeptide? Biochim Biophys Acta. 2006;1763(12):1565–1573.
Lametschwandtner G, Brocard C, Fransen M, Van Veldhoven PP, Berger J, Hartig A. The difference in recognition of PTS1 and PTS2 signals is conserved between yeast and mammals. EMBO J. 1998;17(21):5948–5958.
Neuberger G, Maurer-Stroh S, Eisenhaber B, Hartig A, Eisenhaber F. Prediction of peroxisomal targeting signal 1–containing proteins from amino acid sequence. J Mol Biol. 2003;328(3):581–592.
Dominguez C, Boelens R, Bonvin AMJJ. HADDOCK: A protein–protein docking approach based on biochemical or biophysical information. J Am Chem Soc. 2003;125(7):1731–1737.
van Zundert GCP, Rodrigues JPG, Trellet M, Schmitz C, Kastritis PL, Karaca E, et al. The HADDOCK2.4 web server: user-friendly integrative modeling of biomolecular complexes. J Mol Biol. 2016;428(4):720–725.
Rodrigues JPG, Trellet M, Schmitz C, Kastritis PL, Karaca E, Melquiond ASJ, Bonvin AMJJ. Clustering biomolecular complexes by residue contacts similarity. Proteins. 2012;80(7):1810–1817.
van Zundert GCP, Bonvin AMJJ. Modeling protein–protein complexes using the HADDOCK webserver. Methods Mol Biol. 2014;1137:163–179.
Dauparas J, Anishchenko I, Bennett N, et al. Robust deep learning based protein sequence design using ProteinMPNN. Science. 2022;378(6615):49–56.
Ferreiro DU, Hegler JA, Komives EA, Wolynes PG. Localizing frustration in native proteins and protein assemblies. Proc Natl Acad Sci USA. 2007;104(50):19819–19824.
Parra RG, Schafer NP, Radusky LG, Tsai MY, Guzovsky AB, Wolynes PG, Ferreiro DU. Protein Frustratometer 2: a tool to localize energetic frustration in protein molecules, now with electrostatics. Nucleic Acids Res. 2016;44(W1):W356–W360.
Tyka MD, Keedy DA, André I, Dimaio F, Song Y, Richardson DC, Richardson JS, Baker D. Alternate states of proteins revealed by detailed energy landscape mapping. J Mol Biol. 2011;405(2):607–618.
Leaver-Fay A, Tyka M, Lewis SM, Lange OF, Thompson J, Jacak R, Kaufman K, Renfrew PD, Smith CA, Sheffler W, et al. ROSETTA3: An object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol. 2011;487:545–574.
Istvan ES, Deisenhofer J. Structural mechanism for statin inhibition of HMG-CoA reductase. Science. 2001;292(5519):1160–1164.
Miziorko HM. Enzymes of the mevalonate pathway of isoprenoid biosynthesis. Arch Biochem Biophys. 2011;505(2):131–143.
Liu Z, Gao Y, Chen J, Imanaka T, Bao J, Hua Q. Metabolic engineering of Yarrowia lipolytica for the production of terpenoids. Metab Eng. 2019;57:151–161.
Rodriguez GM, Hussain MS, Gambill L, Gao D, Yaguchi A, Blenner M. Engineering Yarrowia lipolytica to produce fuels and chemicals from xylose. Biotechnol Bioeng. 2016;113(11):2528–2538.
Zwietering MH, Jongenburger I, Rombouts FM, van ’t Riet K. Modeling of the bacterial growth curve. Appl Environ Microbiol. 1990;56(6):1875–81.
Gibson AM, Bratchell N, Roberts TA. The effect of sodium chloride and temperature on the rate and extent of growth of Clostridium botulinum type A. J Appl Bacteriol. 1988;65(2):95–108.
Ratkowsky DA, Olley J, McMeekin TA, Ball A. Relationship between temperature and growth rate of bacterial cultures. J Bacteriol. 1982;149(1):1–5.
Gupta S, Mozaffar H, Syed A, et al. Machine learning models for microbial growth prediction under environmental stress. Biotechnol Bioeng. 2020;117(12):3727–38.
Buchanan RL, Whiting RC, Damert WC. When is simple good enough: a comparison of the Gompertz, logistic, and other empirical models for fitting bacterial growth curves. Food Microbiol. 1997;14(4):313–26.
Desai M, Udupa S, Chatterjee S, Sarma SJ. A hybrid machine learning framework for predictive bioprocess modeling using ensemble tree algorithms. Comput Chem Eng. 2022;157:107675.
Rajamanickam S, Misra BB. Machine learning-based prediction of microbial growth under multiple environmental conditions. Front Microbiol. 2021;12:713517.
Yang H, He F, Liu J, et al. Integration of mechanistic models and machine learning improves prediction accuracy in microbial fermentation. Biotechnol J. 2023;18(4):2200604.
Larroude M, Rossignol T, Nicaud JM, Ledesma-Amaro R. Synthetic biology tools for engineering Yarrowia lipolytica. Biotechnol Adv. 2018;36(8):2150–64.
Qiao K, Abidi SHI, Liu H, Zhang H, Chakraborty S, Zhang L, Tang Y. Engineering lipid overproduction in the oleaginous yeast Yarrowia lipolytica. Metab Eng. 2015;29:56–65.