Results | Japan scinet

Model1-1: Prediction of unknown-enzyme-expression

1 Overview of Tom expression prediction

The biosensor used in this project responds to ethylene oxide but not to ethylene itself. Therefore, it was necessary to introduce an additional enzyme to enable the conversion of ethylene into ethylene oxide. Since the expression level of the enzyme in E. coli was unknown, it was computationally predicted using Python.

2 Introduction

In this project, a literature review suggested that specific enzyme modifications could introduce new functional properties. Based on this finding, and anticipating that experimental verification would be difficult, sequence data were constructed from publicly available information, and machine learning was applied to predict their expression characteristics.

3 Dataset collection

To develop the expression-level prediction program in Python, gene sequences with known expression levels in E. coli were used as training data. These datasets were constructed based on protein expression information and corresponding DNA sequences obtained from public databases.

4 Explanation of Python code for expression-prediction

To extract structural and sequence-based information from the input DNA data, computational analysis was conducted to generate feature representations, which were then integrated with expression data for model training. A machine learning regression model was employed to predict expression levels, followed by evaluation using both known and unknown sequences to assess model performance.
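The feature-generation step can be illustrated with a minimal sketch. The actual features used in this project are not listed here, so the choice below (sequence length, GC content, and relative codon frequencies) and the function name `sequence_features` are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order so every sequence yields the same feature keys.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def sequence_features(dna: str) -> dict[str, float]:
    """Return length, GC content, and relative codon frequencies for a coding sequence.

    Hypothetical feature set: the project's real features are not specified.
    """
    seq = "".join(dna.upper().split())
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")

    features = {
        "length": float(len(seq)),
        "gc_content": (seq.count("G") + seq.count("C")) / len(seq),
    }

    # Count in-frame codons; a trailing partial codon is ignored.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = max(len(codons), 1)
    for codon in CODONS:
        features[f"codon_{codon}"] = counts[codon] / total
    return features


example = sequence_features("ATGGCGAAATAA")
print(len(example), example["gc_content"], example["codon_ATG"])
```

With this particular feature set each sequence maps to 66 numbers, so a training set of a few dozen genes has fewer samples than features, which is the situation in which a regression model can easily fit noise.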

5 Results of the prediction

For comparison of model performance, simulations were conducted in two stages, using 40 and 95 training samples, respectively. The results are summarized in Tables 1–4. Furthermore, graphs of feature importance and predicted vs. observed values, which served as evaluation indicators of model performance, are presented in Graphs 1–4. From these results, it was suggested that, at both stages of prediction, the target nucleotide sequences could achieve a certain level of protein expression in E. coli.

Table 1 (Machine Learning Data: 40 Samples)
Gene (Unknown Expression in E. coli)	Predicted Expression (ppm)
TomA0	9056.29
TomA1	5873.83
TomA2	7447.67
TomA3 (V106F)	7194.09
TomA3 (A113F)	7194.94
TomA4	10205.76
TomA5	7149.04

Test MSE: 13,987,415.769112501

Table 2 (Machine Learning Data: 95 Samples)
Gene (Unknown Expression in E. coli)	Predicted Expression (ppm)
TomA0	3988.56
TomA1	2793.93
TomA2	3686.44
TomA3 (V106F)	2486.01
TomA3 (A113F)	2487.43
TomA4	3965.34
TomA5	2437.71

Test MSE: 2,875,357.8509500003

Table 3 (Machine Learning Data: 40 Samples)
Gene (Known Expression in E. coli)	Observed Expression (ppm)	Predicted Expression (ppm)
rplF	4110	5201.77
rpsP	4070	5991.55

Test MSE: 13,987,415.769112501

Table 4 (Machine Learning Data: 95 Samples)
Gene (Known Expression in E. coli)	Observed Expression (ppm)	Predicted Expression (ppm)
rplF	4110	3354.16
rpsP	4070	4071.46

Test MSE: 2,875,357.8509500003
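As a worked check on Tables 3 and 4, the squared error on the two known genes alone can be computed directly from the tabulated values. This is a sketch: the helper name `mean_squared_error` is ours, and the result is not the same quantity as the reported Test MSE, which was presumably computed on a larger held-out test set.

```python
def mean_squared_error(observed, predicted):
    """Mean of squared differences between paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Values copied from Tables 3 and 4 (ppm), in the order rplF, rpsP.
observed = [4110.0, 4070.0]
predicted_40 = [5201.77, 5991.55]  # model trained on 40 samples
predicted_95 = [3354.16, 4071.46]  # model trained on 95 samples

mse_40 = mean_squared_error(observed, predicted_40)
mse_95 = mean_squared_error(observed, predicted_95)
print(f"40 samples: {mse_40:.1f}  95 samples: {mse_95:.1f}")
```

On these two genes the error of the 95-sample model is roughly an order of magnitude smaller than that of the 40-sample model, in the same direction as the change in Test MSE.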

Graph 1: Bar graph of feature importance (40-sample dataset)

Graph 2: Bar graph of feature importance (95-sample dataset)

Graph 3: Scatter plot of observed vs. predicted values (40-sample dataset)

Graph 4: Scatter plot of observed vs. predicted values (95-sample dataset)

6 Discussion of the results

In the dataset with 40 samples, the predicted expression levels of known genes differed considerably from the observed values, resulting in a high MSE.

This discrepancy may be due to the following factors:

1. When constructing the dataset, the DNA sequences were input in descending order of expression level, which could have biased the model toward overestimation (Predicted > Observed).

2. Since the feature dimensions exceeded the number of samples, the model may have overfitted noise or random patterns.
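One common way to guard against the ordering effect in the first point is to shuffle the records before splitting them into training and test sets. The sketch below is illustrative only: `shuffled_split` is a hypothetical helper, and the records are placeholders, not the project's data.

```python
import random


def shuffled_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records, then split into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records sorted by descending expression, mimicking the input order
# described above. Names and values are invented for illustration.
records = [(f"gene_{i}", 10000.0 - 100.0 * i) for i in range(40)]
train, test = shuffled_split(records)
print(len(train), len(test))
```

Without shuffling, taking the last records as the test set would evaluate the model only on the lowest-expression genes, a range it may not have been trained on.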

To improve the model’s accuracy, we added 57 more DNA–expression pairs to the dataset and excluded the two highest-expression outliers.

After retraining, the Test MSE decreased, and the scatter plots showed values closer to the Predicted = Observed line, indicating improved model performance and reduced overfitting.

Model1-2: Simulation of enzyme kinetics

From the results of Model1-1, the TOM mutants A113F and V106F were predicted to be expressed in E. coli. Based on these results, the concentration changes of the substrate C₂H₄ (ethylene) and the product C₂H₄O (ethylene oxide) in the TOM-catalyzed reaction were plotted to evaluate the functionality of the mutants.

The equations used for the simulation and the list of parameters that compose those equations are shown below.

Table: Parameters Used in the Enzyme Kinetic Model
Parameter	Description
$$V_{max}$$	Maximum velocity of the reaction (mM/min)
$$K_{ia}$$	Constant for NADH (mM)
$$K_b$$	Constant for O₂ (mM)
$$K_c$$	Constant for C₂H₄ (mM)
$$K_{bc}$$	Cross-term of O₂ and C₂H₄ (mM)
$$V_{\max}^{\ast}$$	Maximum velocity in the approximated first-order reaction equation (mM/min)
$$K_{m}^{\ast}$$	Apparent Michaelis constant in the approximated first-order reaction equation (mM)

$$ \frac{d[\mathrm{C_2H_4O}]}{dt} = v([C]) = \frac{V_{\max}^{\ast}[C]}{K_m^{\ast} + [C]} $$
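The behavior of this rate law over time can be made explicit with a short derivation. It is a sketch under assumptions not stated above: that $[C]$ denotes the ethylene concentration, that ethylene is converted to ethylene oxide with 1:1 stoichiometry, and that there are no side reactions or enzyme inactivation.

$$ \frac{d[C]}{dt} = -\frac{V_{\max}^{\ast}[C]}{K_m^{\ast} + [C]}, \qquad [\mathrm{C_2H_4O}](t) = [C]_0 - [C](t) $$

Separating variables and integrating from $[C]_0$ at $t = 0$ gives the implicit solution

$$ K_m^{\ast}\ln\frac{[C]_0}{[C](t)} + [C]_0 - [C](t) = V_{\max}^{\ast}\,t $$

When $[C] \ll K_m^{\ast}$, the rate reduces to $(V_{\max}^{\ast}/K_m^{\ast})[C]$, which is first-order decay. When $[C] \gg K_m^{\ast}$, the rate approaches the constant $V_{\max}^{\ast}$ and the concentrations change linearly in time.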

For a detailed explanation, please refer to the page of “mode1” conducted by the Dry team.

Graph 1. Changes in the concentrations of C₂H₄ and C₂H₄O when A113F was expressed.

The data used for plotting Graph 1 are shown in the following table.

Table 8. Parameters Used for A113F Mutation
Parameter	Initial value
$ V_{\max}^{\ast} $	0.019 (mM/min)
$ K_{m}^{\ast} $	0.5 (mM)
$ [\mathrm{C_2H_4}] $	0.7 (mM)
$ [\mathrm{C_2H_4O}] $	0.0 (mM)

Graph 2. Changes in the concentrations of C₂H₄ and C₂H₄O when V106F was expressed.

The data used for plotting Graph 2 are shown in the following table.

Table 9. Parameters Used for V106F Mutation
Parameter	Initial value
$ V_{\max}^{\ast} $	0.0007 (mM/min)
$ K_{m}^{\ast} $	0.5 (mM)
$ [\mathrm{C_2H_4}] $	0.7 (mM)
$ [\mathrm{C_2H_4O}] $	0.0 (mM)

From these two results, it can be observed that the graph of A113F follows the curve expected from the first-order enzyme reaction equation, while that of V106F appears linear. We assumed that this is because the reaction rate of V106F is much lower than that of A113F, so that only the initial part of the curve is visible. We therefore extended the reaction time for V106F alone to 10,000 minutes and replotted the graph. The result is shown in Graph 3 below.

Graph 3. Changes in the concentrations of C₂H₄ and C₂H₄O when V106F was expressed (reaction time extended to 10,000 minutes).

From Graph 3, it can be seen that the reaction graph of V106F also follows the curve based on the first-order enzyme reaction equation when the simulation time is extended. Considering the total time required for the enzyme reaction to complete, it is suggested that introducing the gene expressing A113F into E. coli would be more appropriate for carrying out the project.
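The difference in reaction time between the two mutants can be estimated numerically from the parameters in Tables 8 and 9. The following is a minimal sketch, not the Dry team's code: it assumes ethylene is consumed at the rate $v([C])$ with 1:1 conversion to ethylene oxide, and it uses a hand-written fourth-order Runge-Kutta step. The function names and the 90 % conversion threshold are our own choices.

```python
def mm_rate(c, v_max, k_m):
    """Rate v([C]) = Vmax* [C] / (Km* + [C]) in mM/min."""
    return v_max * c / (k_m + c)


def time_to_conversion(v_max, k_m, c0, fraction=0.9, dt=0.05):
    """Integrate d[C]/dt = -v([C]) with RK4 and return the time (min)
    at which the given fraction of the initial ethylene has been consumed."""
    if v_max <= 0 or k_m <= 0 or c0 <= 0 or not 0.0 < fraction < 1.0:
        raise ValueError("parameters must be positive and 0 < fraction < 1")
    target = c0 * (1.0 - fraction)
    c, t = c0, 0.0
    while c > target:
        k1 = -mm_rate(c, v_max, k_m)
        k2 = -mm_rate(c + 0.5 * dt * k1, v_max, k_m)
        k3 = -mm_rate(c + 0.5 * dt * k2, v_max, k_m)
        k4 = -mm_rate(c + dt * k3, v_max, k_m)
        c += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += dt
    return t


# Parameters from Tables 8 and 9: Vmax* (mM/min), Km* (mM), initial [C2H4] (mM).
t_a113f = time_to_conversion(v_max=0.019, k_m=0.5, c0=0.7)
t_v106f = time_to_conversion(v_max=0.0007, k_m=0.5, c0=0.7)
print(f"A113F: {t_a113f:.0f} min, V106F: {t_v106f:.0f} min")
```

Under these assumptions, 90 % conversion takes on the order of 10² minutes for A113F and a few thousand minutes for V106F, which is consistent with the need to extend the V106F simulation to 10,000 minutes before its curvature becomes visible.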

Ethylene can be oxidized to ethylene oxide using an engineered toluene monooxygenase. The mechanism of oxygen activation by P450 enzymes also provides useful insights.

Model2: Co-culture Simulation

In the dry simulation of the Tom monooxygenase system, the main Monod parameters were first estimated from single-culture data, yielding a maximum specific growth rate of $\mu_{\max} = 0.795\,\mathrm{h^{-1}}$, a half-saturation constant of $K_S = 0.0404\,\mathrm{g\,L^{-1}}$ and a biomass yield of $Y_{X/S} = 1.72\times10^9\,\mathrm{cells\,mL^{-1}(g\,L^{-1})^{-1}}$. Based on these parameters, a co-culture model of the monooxygenase-expressing strain and the sensor strain was constructed to analyze the time evolution of ethylene oxide concentration and sensor output $y'$ following ethylene addition.
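To show how these three parameters determine growth, the sketch below simulates a single strain in batch culture with the standard Monod equations, $dX/dt = \mu X$ and $dS/dt = -\mu X / Y_{X/S}$ with $\mu = \mu_{\max} S / (K_S + S)$. This is an illustration only, not the co-culture model itself: the single-strain mass balance, the explicit Euler scheme, and the initial values are assumptions.

```python
MU_MAX = 0.795   # maximum specific growth rate (1/h)
K_S = 0.0404     # half-saturation constant (g/L)
Y_XS = 1.72e9    # biomass yield (cells/mL per g/L of substrate)


def simulate_monod(x0, s0, hours=24.0, dt=0.001):
    """Explicit-Euler integration of single-strain Monod growth in batch culture.

    x0: initial cell density (cells/mL); s0: initial substrate (g/L).
    Returns the final (cell density, substrate concentration).
    """
    x, s = x0, s0
    for _ in range(int(round(hours / dt))):
        mu = MU_MAX * s / (K_S + s)
        dx = min(mu * x * dt, Y_XS * s)  # cannot use more substrate than remains
        x += dx
        s = max(s - dx / Y_XS, 0.0)
    return x, s


# Illustrative initial values (assumed): 8.0e8 cells/mL and 4.0 g/L substrate.
x_final, s_final = simulate_monod(x0=8.0e8, s0=4.0)
print(f"final density: {x_final:.3e} cells/mL, residual substrate: {s_final:.3e} g/L")
```

Because every unit of substrate consumed yields $Y_{X/S}$ cells, the final density approaches $X_0 + Y_{X/S} S_0$ once the substrate is exhausted, which gives a quick sanity check on any implementation.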

For initial condition optimization, the substrate concentration was varied within $S_0 = 0.5\text{–}10.0\,\mathrm{g\,L^{-1}}$, and the initial cell densities of the producer ($X_{0,1}$) and sensor strain ($X_{0,2}$) were explored within $8.0\times10^8\text{–}1.6\times10^{10}\,\mathrm{cells\,mL^{-1}}$. Bayesian optimization was performed with the maximum $y'$ value as the objective function. The results indicated that $y'$ was maximized when the sensor strain had a slightly higher density than the producer strain (approximately a 2:1 ratio) and when the substrate concentration was moderate ($S_0 \approx 4.0\,\mathrm{g\,L^{-1}}$).
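The structure of such a search over initial conditions can be sketched as follows. Note the substitutions: this uses plain random search rather than Bayesian optimization, and `placeholder_objective` is an arbitrary stand-in for the co-culture simulation that returns the maximum $y'$. Only the search ranges are taken from the text; the function and parameter names are hypothetical.

```python
import math
import random

S0_RANGE = (0.5, 10.0)      # substrate concentration (g/L)
X0_RANGE = (8.0e8, 1.6e10)  # initial cell density (cells/mL)


def random_search(objective, n_trials=200, seed=0):
    """Sample initial conditions at random and keep the best-scoring candidate.

    Densities are sampled log-uniformly because the range spans more than
    one order of magnitude. Returns (best_score, best_candidate).
    """
    rng = random.Random(seed)
    log_lo, log_hi = math.log10(X0_RANGE[0]), math.log10(X0_RANGE[1])
    best_score, best_candidate = -math.inf, None
    for _ in range(n_trials):
        candidate = {
            "s0": rng.uniform(*S0_RANGE),
            "x0_producer": 10 ** rng.uniform(log_lo, log_hi),
            "x0_sensor": 10 ** rng.uniform(log_lo, log_hi),
        }
        score = objective(**candidate)
        if score > best_score:
            best_score, best_candidate = score, candidate
    return best_score, best_candidate


def placeholder_objective(s0, x0_producer, x0_sensor):
    """Arbitrary smooth stand-in; NOT the co-culture model used in this project."""
    return -((s0 - 5.0) ** 2) - math.log10(x0_sensor / x0_producer) ** 2


best_score, best = random_search(placeholder_objective)
print(best_score, best)
```

In practice the placeholder would be replaced by a function that runs the co-culture simulation for the given initial conditions and returns the peak sensor output. A Bayesian optimizer differs from this sketch in that it chooses each new candidate using a surrogate model of the previous results instead of sampling blindly.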

Furthermore, increasing the Tom-producing strain density accelerated ethylene oxide production but caused stronger growth inhibition in the sensor strain, delaying the response peak. Conversely, when the producer strain density was lower, the production rate decreased and the sensor output weakened, though the system exhibited more stable behavior.

These findings demonstrated that the ratio between the two strains governs both the amplitude and timing of the sensor response.

This analysis quantitatively showed that tuning the balance of cell ratios enables control over the response strength, providing a rational design guideline for optimizing ethylene-detecting biosensors.

Model 3: Evaluation of the Antimicrobial Activity of Nisin

We evaluated nisin’s antimicrobial performance under varying concentration, pH, and temperature conditions using a multivariate model fitted by least-squares minimization. The analysis identified the optimal conditions for nisin’s activity within our biosensor framework.

Results show that nisin is most potent in acidic environments (around pH 5), performs better under refrigeration (~7 °C), and requires relatively high concentrations ($10^2$–$10^3$ µg/mL) for strong bacterial inactivation. These findings suggest that nisin-based biosensors are well-suited for protecting acidic fruits during cold storage, offering practical guidance for reducing foodborne risks such as Listeria monocytogenes contamination.
