Machine Learning Predictive Model

Predicting the ammonium concentration in soil using a Machine Learning model

The rich microbial environment of the soil is a very dynamic and complex system, consisting of numerous interacting factors that affect both the ecosystem and one another, such as pH, oxygen content, nitrate concentration, temperature, ammonium concentration, carbon content, C/N ratio, particle size, soil type, humidity and precipitation, latitude, longitude, etc.

We approximated the basal Dissimilatory Nitrate Reduction to Ammonium (DNRA) rate based on Michaelis-Menten kinetics, using a gross average of the cell population in a given mass of soil. The increased DNRA rate was estimated using maximum rates from our ODE model.


Introduction

DNRA in soil is highly interconnected with several soil and environmental factors. Since these factors are very hard to model individually, and their individual and combined effects on the rate of DNRA are very complex, we decided to use Machine Learning to find patterns across bulk datasets collected from around the world under different conditions.

The goal of the model is to predict the ammonium concentration that will occur in the soil when we introduce our modified bacterium into the ecosystem.

Why Use Machine Learning?

What is a Machine Learning Model?

A Machine Learning model is a mathematical representation of a real-world process that is trained on data to make predictions or decisions without being explicitly programmed for the specific task. Essentially, it manipulates data and learns from already known outputs to predict the outputs for inputs we have not seen before. Although this is a very crude simplification, it can be thought of as a glorified form of curve-fitting.

Why ML?

Although the concept of ML has become a buzzword today, going past the hype, ML models are a great way to predict complex non-linear relationships, like the interplay of factors that influence the soil ecosystem. These are complex environments, with many hundreds of variables affecting the outcome and each other, for which it is very difficult to build a traditional mathematical model.

ML models provide the right balance of handling complexity without requiring much heavy lifting from the user's end, while also letting the user decide the architecture for the model. This is why we decided to use ML models for our project.

XGBoost Models

XGBoost models are a type of ensemble model, which combines multiple weak learners to create a strong learner. The weak learners are decision trees (think of a decision tree as a series of if-else questions, or as I like to call it: your friend going down a slippery slope fallacy in a lot of parallel universes), which are simple models that can be trained on small datasets. To simplify: a neural network stacks layers of neurons, while XGBoost sequentially adds decision trees, each new tree correcting the errors of the ones before it.

Compared to large neural networks, XGBoost models are less prone to overfitting and can be trained effectively on smaller datasets.
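As a rough illustration of the boosting idea (this is not the model we actually used, which is configured below), each new tree is fit to the residuals of the ensemble built so far. The sketch below uses plain scikit-learn decision trees and a squared-error loss:

```python
# Minimal sketch of gradient boosting with decision trees (squared-error loss).
# Illustrative only; the actual model uses the XGBoost library.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Fit a sequence of shallow trees, each on the residuals of the ensemble so far."""
    base = float(np.mean(y))                      # start from the mean of the targets
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(tree.predict(X) for tree in trees)
```

XGBoost builds on this basic idea with regularization, row and column subsampling, and a more efficient tree-construction routine.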

Architecture

Let us first look at the ammonium data, since that is what we are trying to predict. The dataset contains 902 rows, and the distribution of ammonium concentrations is highly skewed. To address this skewness, we applied a log(1+x) transformation, which stabilizes variance and improves model learning on positively skewed data.

Figure: Log(1+x) Transformation of Ammonium Data
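As a quick, self-contained illustration of the transform and its inverse (the values below are made up):

```python
# Log(1+x) transform of a skewed ammonium target, and its inverse.
import numpy as np

ammonium = np.array([0.0, 1.2, 4.5, 30.0, 120.0])   # illustrative values, mg/kg
ammonium_log = np.log1p(ammonium)                    # log(1 + x), well-defined at x = 0
recovered = np.expm1(ammonium_log)                   # expm1 undoes log1p
```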

We used an XGBoost Regressor with hyperparameters tuned through a previous search. The chosen configuration emphasized balanced learning with moderate regularization:

| Parameter | Description | Value |
| --- | --- | --- |
| max_depth | Maximum depth of each decision tree; controls model complexity. | 7 |
| n_estimators | Number of boosting rounds (trees) used during training. | 500 |
| learning_rate | Shrinkage step applied after each boosting round to improve generalization. | 0.01 |
| colsample_bytree | Fraction of features randomly sampled for each tree. | 0.7 |
| subsample | Fraction of training samples used for growing each tree; helps reduce overfitting. | 0.7 |
| objective | Specifies the learning task and loss function; here, standard squared-error regression. | reg:squarederror |
| random_state | Seed for reproducibility. | 42 |
| n_jobs | Number of parallel threads used during training. | -1 (uses all cores) |
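A sketch of the training setup with these hyperparameters; the file name, feature selection, and 80/20 split are illustrative assumptions, not details from the original write-up:

```python
# Train an XGBoost regressor on the log(1+x)-transformed ammonium target.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

df = pd.read_csv("soil_data.csv")              # hypothetical file name
X = df.drop(columns=["ammonium"])              # remaining soil/environmental features
y = np.log1p(df["ammonium"])                   # log(1+x)-transformed target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42       # assumed split
)

model = XGBRegressor(
    max_depth=7,
    n_estimators=500,
    learning_rate=0.01,
    colsample_bytree=0.7,
    subsample=0.7,
    objective="reg:squarederror",
    random_state=42,
    n_jobs=-1,
)
model.fit(X_train, y_train)
```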

Since the model predicts ammonium in log-space, we applied a smearing estimate (a standard bias correction technique) when converting predictions back to the original scale. This improves the accuracy of back-transformed predictions, especially when residuals are not perfectly normally distributed.
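One standard form of this correction is Duan's smearing estimator; the sketch below reuses the names from the training sketch above and assumes the correction factor is computed from the training residuals (the exact variant used is not stated here):

```python
# Duan's smearing correction for back-transforming log(1+x) predictions.
import numpy as np

train_residuals = y_train - model.predict(X_train)   # residuals in log-space
smear_factor = np.mean(np.exp(train_residuals))      # smearing (bias-correction) factor

pred_log = model.predict(X_test)
# A naive back-transform would be np.expm1(pred_log); multiplying by the
# smearing factor corrects the retransformation bias when residuals are
# not exactly normally distributed.
pred_original = np.exp(pred_log) * smear_factor - 1.0
```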

Results

We evaluated the model on the test set using standard regression metrics on the original ammonium scale:

| Metric | Description | Value |
| --- | --- | --- |
| R² Score | Proportion of variance in ammonium concentrations explained by the model. | 0.5924 |
| RMSE | Root Mean Squared Error: the model's prediction error in mg/kg on the original scale. | 9.37 |
| MAE | Mean Absolute Error: average absolute difference between predicted and actual ammonium values, in mg/kg. | 4.54 |

Figure: Prediction vs. Reality Graph
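These metrics can be reproduced on the back-transformed predictions roughly as follows, continuing the names from the sketches above:

```python
# Evaluate on the original ammonium scale using the smeared predictions.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_test_original = np.expm1(y_test)     # undo log(1+x) on the held-out true values

r2 = r2_score(y_test_original, pred_original)
rmse = np.sqrt(mean_squared_error(y_test_original, pred_original))
mae = mean_absolute_error(y_test_original, pred_original)

print(f"R2 = {r2:.4f}, RMSE = {rmse:.2f} mg/kg, MAE = {mae:.2f} mg/kg")
```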

The model’s results are pretty good overall, especially considering how complex and varied the soil data is. Explaining almost 60% of the variation in ammonium levels means the model is picking up on many important patterns, even though it’s not perfect. The error values show that it can give a fairly accurate estimate, but there’s still some natural variability it can’t fully predict — which makes sense given the mix of different soils and conditions in the dataset. In simple terms, the model is reliable enough to guide experiments and highlight key factors, even if it’s not meant for exact field predictions.

The small size of the dataset, with only 902 samples, likely limits the model's performance.

Synthetic Data

We had a small dataset for predicting the ammonium concentration in soil without the presence of our engineered bacteria, but we had no data for predicting ammonium concentration in soil with the presence of our engineered bacteria. This is a huge problem, since we have essentially nothing to go on.

To solve this, we used synthetic data to augment our dataset. Synthetic data is artificially generated data that mimics the statistical properties of real data.

We generated synthetic data by simulating the effect of our engineered bacteria on ammonium levels. We assumed that the presence of our bacteria would increase ammonium concentration by a certain percentage, based on literature values and preliminary lab results.

To estimate the increase factor, we assumed the rate of the entire process to be equal to the rate of the first step, nitrate reduction (i.e., we treated nitrate reduction as rate-limiting).

We used the maximum DNRA rate obtained from our ODE model to estimate the new rate of DNRA.

We characterized the basal DNRA rate using the Michaelis-Menten kinetics of nitrate reduction by the Nap enzyme, as determined by the previous iGEM team Cattlelyst (Wageningen UR, 2021).

By these approximations, we obtained roughly a 10x increase in rate. The basal rate of DNRA is given by Michaelis-Menten kinetics, sketched below.
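In its standard form, the basal rate can be written as follows (the symbols are generic: the specific $V_{\max}$ and $K_M$ values come from the Cattlelyst Nap characterization and the gross average cell population per mass of soil, and are not reproduced here):

$$
r_{\mathrm{DNRA,\,basal}} = \frac{V_{\max}\,[\mathrm{NO_3^-}]}{K_M + [\mathrm{NO_3^-}]}
$$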

We then applied this increase to our existing dataset to create a new synthetic dataset that represents soil conditions with our engineered bacteria.
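A minimal sketch of that augmentation step; exactly which quantities the increase was applied to is an assumption here (the ammonium column is used for illustration), and the actual procedure may differ:

```python
# Create a synthetic "with engineered bacteria" dataset by applying the
# estimated increase to the existing dataset. Mapping the 10x DNRA rate
# increase onto the ammonium column is an illustrative assumption.
import pandas as pd

increase_factor = 10.0                        # DNRA rate increase from the ODE model

df_basal = pd.read_csv("soil_data.csv")       # hypothetical file name
df_bacteria = df_basal.copy()
df_bacteria["ammonium"] = increase_factor * df_bacteria["ammonium"]
df_bacteria.to_csv("soil_data_with_bacteria.csv", index=False)
```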

Results of Synthetic Data Modelling

The mean predicted ammonium at the basal level was 7.39 mg N/kg soil, and after application of the bacteria it was 23.24 mg N/kg soil, a predicted increase of 15.86 mg N/kg soil (+214.74%).
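The headline numbers follow from a simple comparison of the two sets of predictions; in the sketch below, `pred_basal` and `pred_bacteria` are placeholders for the model's back-transformed predictions on the original and synthetic datasets:

```python
# Compare mean predicted ammonium with and without the engineered bacteria.
import numpy as np

mean_basal = float(np.mean(pred_basal))        # reported as 7.39 mg N/kg soil
mean_bacteria = float(np.mean(pred_bacteria))  # reported as 23.24 mg N/kg soil

absolute_increase = mean_bacteria - mean_basal                # ~15.86 mg N/kg soil
percent_increase = 100.0 * absolute_increase / mean_basal     # ~214.7 %
```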

References

  1. Tathanda, Gautham. “Model | IISc-Bengaluru - IGEM 2025.” Igem.wiki, 2025, 2025.igem.wiki/iisc-bengaluru/model
  2. iGEM wiki WUR 2021. “Team:Wageningen UR/Model/Nitrogen - 2021.Igem.org.” Igem.org, 2021, 2021.igem.org/Team:Wageningen_UR/Model/Nitrogen