Introduction
During our experiments we recognised that having a readily accessible optimisation model would streamline decisions on media composition, incubation times, and the upkeep of both bioreactors and shake flasks. A review of scientific literature and past iGEM contributions suggested that a generalised, no-code batch Bayesian optimisation workflow would offer a distinctive capability to the community. This led us to develop BioKernel, our no-code interface for experiment optimisation (explore the model).
Synthetic biology projects consistently face a fundamental challenge: how to achieve optimal system performance when experimental resources are severely constrained. During our project development, we confronted this reality directly. Our goal was to metabolically engineer our chassis into a high-performance production system, yet cloning complexity, protracted growth cycles, and limited lab infrastructure meant we could conduct only a handful of DBTL cycles before the project freeze. With access to just a few large shake flasks and a single bioreactor, yet dozens of strain modifications and culture conditions to explore, we needed a rigorous approach to extract maximum information from minimal experiments.
This challenge extends far beyond our immediate project. Biological optimisation problems are fundamentally difficult: they involve expensive-to-evaluate objective functions, experimental noise that is often heteroscedastic (non-constant across conditions), and high-dimensional design spaces[2]. Traditional approaches like exhaustive screening or one-factor-at-a-time experimentation are prohibitively resource-intensive. While Bayesian optimisation has emerged as a powerful solution for such scenarios, existing implementations often lack accessibility for experimental biologists or the flexibility to handle the specific complexities of biological data.
We developed BioKernel, a no-code Bayesian optimisation framework specifically designed to guide biological experimental campaigns toward optimal outcomes with minimal resource expenditure. Our software addresses key limitations of existing tools through several critical innovations that are, to the best of our knowledge, novel to iGEM:
- Modular kernel architecture — enabling users to select or combine covariance functions appropriate for their biological system.
- Flexible acquisition function selection — Expected Improvement, Upper Confidence Bound, Probability of Improvement, etc., to balance exploration and exploitation based on experimental goals.
- Heteroscedastic noise modelling — accurately captures the non-constant measurement uncertainty inherent in biological systems.
- Support for variable batch sizes and technical replicates — recognises practical laboratory workflows and provides flexibility.
These features transform Bayesian optimisation from a theory-heavy tool into a practical laboratory companion, enabling researchers to intelligently navigate complex parameter spaces and identify high-performing conditions with dramatically fewer experiments than conventional approaches.
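To make this modularity concrete, the sketch below shows how a covariance function for a noisy biological response can be composed from reusable parts. It uses scikit-learn's kernel algebra purely for illustration; it is not BioKernel's internal API.

```python
# Illustrative only: composing a GP kernel from modular parts (scikit-learn).
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# A scaled Matern term models the (possibly rough) biological response surface;
# an additive WhiteKernel absorbs a baseline level of measurement noise.
kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=2.5) \
    + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
```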
Validation Strategy and Broader Applicability
To validate our framework, we pursued two complementary approaches. First, we applied our software retrospectively to published datasets from metabolic engineering studies, demonstrating that our approach identifies optimal conditions with substantially fewer experiments than were originally required.
Second, we designed a comprehensive experimental proof-of-concept: optimising astaxanthin production via a heterologous 10-step enzymatic pathway integrated into the Marionette-wild E. coli strain[5]. This strain possesses a genomically integrated array of twelve orthogonal, highly sensitive inducible transcription factors, providing a twelve-dimensional optimisation landscape ideal for demonstrating our software's capabilities. By systematically varying inducer concentrations across the pathway, we aim to verify that Bayesian optimisation can guide this complex, multi-step enzymatic process to a strong optimum using far fewer experiments than conventional screening methods.
We opted to use astaxanthin as it is readily quantified spectrophotometrically[23], reducing the time needed to evaluate each batch.
Inducible promoters are not economically feasible for industrial-scale transcriptional control. We thus propose utilising the framework from [6], which offers a way to find a constitutive "match" for the expression levels corresponding to an optimum reached in an experimental campaign. This way, expensive inducers within the Marionette array, such as naringenin[5], are only necessary for initial screening campaigns.
While parts delivery delays prevented completion of our full experimental validation within the competition timeline, the successful retrospective optimisation of published datasets serves as a compelling proof of concept. Our BO policy took an average of 19 unique points investigated to converge close to the optimum (within 10% of the total possible normalised Euclidean distance), as opposed to the 83 taken by the grid search employed in the paper[21]. These results demonstrate that our framework can effectively optimise cellular "black box" functions, systems where the relationship between inputs and outputs is unknown, using substantially fewer experimental iterations than traditional approaches.
Results
Though we could not perform optimisation batches using our Marionette strain due to an unforeseen, significant delay in the order of parts needed to clone the astaxanthin pathway, we validated our package on empirical data.
To trial the effectiveness of our package despite this delay, we took a dataset from an optimisation experiment applying four-dimensional transcriptional control to limonene production in Marionette-wild Escherichia coli [21]. Though this represents a significantly more tractable optimisation problem than the astaxanthin pathway we chose for our in-lab validation, it serves as strong validation that BioKernel can find an optimum far faster than the exhaustive combinatorial search.
As the dataset from this paper is relatively sparse, we fitted a Gaussian process with a scaled RBF kernel plus an additive white-noise kernel to their data, creating a surface approximating the actual optimisation landscape of the four-dimensional input space. The procedure also included training a mixed model of Random Forest (RF) and K-Nearest Neighbours (KNN) on the 83 unique parameter combinations.
The same procedure was then applied to the noise: instead of using the means calculated from the experimental data, we estimated the noise by calculating the standard deviation at each point. This was used to build a heteroscedastic noise meshgrid, which supplied the standard-deviation parameter for random sampling from a normal distribution.
This surface then became our test set for BioKernel. The following optimisation was performed using our package, with options set to use a Matern kernel with a gamma noise prior.
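The sketch below reproduces the spirit of this surface construction with synthetic stand-in data (the RF/KNN blending step is omitted for brevity, and the array names are ours, not the paper's):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)
# Stand-ins for the paper's data: 83 four-dimensional inducer combinations
# with a per-point replicate mean and standard deviation.
X = rng.uniform(0, 1, size=(83, 4))
y_mean = np.sin(X.sum(axis=1)) + 0.1 * rng.normal(size=83)
y_std = 0.05 + 0.1 * X[:, 0]        # heteroscedastic: noise varies with input

kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0] * 4) + WhiteKernel(0.1)
mean_surface = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y_mean)
noise_surface = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y_std)

def noisy_query(x):
    """Sample a synthetic 'experimental' observation at input x."""
    mu = mean_surface.predict(x.reshape(1, -1))[0]
    sigma = max(noise_surface.predict(x.reshape(1, -1))[0], 0.0)
    return rng.normal(mu, sigma)
```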
The BioKernel algorithm converged to within 10% normalised Euclidean distance of the optimum using just 22% of the unique points investigated in the paper[21]. It thus took BioKernel an average of 18 points to converge close to the optimum (10% of the total possible normalised Euclidean distance), as opposed to the 83 taken by the adapted grid search used in the paper[21].
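For clarity, the convergence criterion can be expressed as follows. The exact normalisation here is our assumed reading: distances are scaled per axis, then rescaled so that 1.0 corresponds to opposite corners of the design space.

```python
import numpy as np

def normalised_distance(x, x_opt, lower, upper):
    """Euclidean distance after scaling each axis to [0, 1], rescaled so that
    1.0 corresponds to opposite corners of the design space."""
    x, x_opt = np.asarray(x, float), np.asarray(x_opt, float)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    return np.linalg.norm((x - x_opt) / (upper - lower)) / np.sqrt(len(lower))

# Example: a candidate right at the 10% convergence threshold (prints True).
print(normalised_distance([0.9, 0.8, 0.1, 0.5], [1.0, 0.9, 0.2, 0.4],
                          lower=[0, 0, 0, 0], upper=[1, 1, 1, 1]) <= 0.10)
```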
What is Bayesian Optimisation?
Bayesian Optimisation (BO) is a sample-efficient, sequential strategy for global optimisation of black-box functions[1]. In essence, it enables the identification of input parameter combinations that yield an optimal output while making minimal assumptions about the objective function. Crucially, BO does not require the function to be differentiable. This is a significant advantage in synthetic biology, where response landscapes are frequently rugged, discontinuous, or stochastic due to complex molecular interactions, making gradient-based optimisation methods inapplicable[2]. By assuming only continuity[1], BO is well-suited to navigate these complex, unpredictable biological systems where the underlying mechanisms are often intractable.
In biological research, the experimental landscape is often complex and high-dimensional. Traditional methods like grid search become intractable due to the "curse of dimensionality," where the number of experiments required grows exponentially with the number of parameters. Simpler algorithms, such as one-factor-at-a-time searches (a form of coordinate ascent), can easily get trapped in local optima when the system does not behave as expected, thus requiring an arbitrary number of restarts to discover the global optimum. BO is engineered to navigate these challenges, performing effectively in scenarios with up to 20 input dimensions, and can handle more with some tuning[4].
The power of BO stems from three core components[1],[3]:
- Bayesian inference to update beliefs based on evidence.
- A Gaussian Process (GP) to create a probabilistic model of the function.
- An acquisition function to intelligently balance the exploration-exploitation trade-off.
For comprehensive mathematical background, see Roman Garnett's Bayesian Optimization (2023). [1]
The Bayesian Approach: Learning from Data
True to its name, BO is founded on Bayesian statistics. Unlike frequentist methods that provide single-point estimates, the Bayesian approach models the entire probability distribution of possible outcomes. This method preserves information by propagating the complete underlying distributions through calculations, which is critical when dealing with costly and often noisy biological data. A key feature is the ability to incorporate prior knowledge (a "prior") into the model, which is then updated with new experimental data to form a more informed distribution (a "posterior"). This iterative updating is ideal for lab-in-a-loop biological research, where each data point is expensive to acquire and system noise can be unpredictable (heteroscedastic)[1],[3].
The Gaussian Process: A Probabilistic Map of the Landscape
The second component, the Gaussian Process (GP), serves as a probabilistic surrogate model for the black-box function. A GP defines a distribution over functions; for any set of input parameters, it returns a Gaussian distribution of the expected output, characterised by a mean and a variance. This provides not just a prediction but also a measure of uncertainty for that prediction. Central to the GP is the covariance function, or kernel, which encodes assumptions about the function's smoothness and shape. The kernel defines how related the outputs are for different inputs, allowing the GP to generalise from observed data to unexplored regions of the parameter space. A well-chosen kernel is crucial for balancing the risks of overfitting (mistaking noise for a real trend) and underfitting (missing a genuine trend in the data), a common challenge with inherently noisy biological datasets[1],[3].
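For reference, these are the standard GP predictive equations under Gaussian observation noise, assuming a zero-mean prior[1]. Given inputs $X = \{x_1, \dots, x_n\}$ with observations $\mathbf{y}$, the posterior mean and variance at a new point $x_*$ are:

$$
\mu(x_*) = \mathbf{k}_*^{\top}\left(K + \sigma_n^2 I\right)^{-1}\mathbf{y},
\qquad
\sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^{\top}\left(K + \sigma_n^2 I\right)^{-1}\mathbf{k}_*,
$$

where $K_{ij} = k(x_i, x_j)$, $(\mathbf{k}_*)_i = k(x_i, x_*)$, and $\sigma_n^2$ is the observation-noise variance (which a heteroscedastic model allows to vary from point to point).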
The Acquisition Function: Balancing Exploration and Exploitation
The GP model, with its predictions of mean and variance, guides the search for the next set of parameters to test experimentally. This guidance is formalised by the acquisition function. This function calculates the expected "utility" of evaluating each point in the parameter space, effectively balancing the trade-off between exploring uncertain regions and exploiting areas known to yield good results.
- Exploitation involves sampling in regions where the GP predicts a high mean value, refining our knowledge around known optima.
- Exploration involves sampling in regions where the GP predicts high variance, reducing uncertainty in under-sampled areas of the parameter space.
The next experimental point is chosen by finding the parameters that maximise the acquisition function. This dynamic approach ensures that the search efficiently converges toward the global optimum by focusing resources on promising regions while avoiding wasteful experiments in poorly performing ones. Common acquisition functions include Probability of Improvement (PI), Expected Improvement (EI), and Upper Confidence Bound (UCB)[1],[3].
This trade-off can be further tuned by adopting a risk-averse or risk-seeking policy, often by adjusting a parameter within the acquisition function. A risk-averse strategy prioritises regions promising a certain but possibly smaller improvement, which is useful when the cost of a failed experiment is high. Conversely, a risk-seeking strategy favours more uncertain regions that might yield a larger overall improvement. Adjusting this parameter therefore shifts the policy towards exploitation (risk-averse) or exploration (risk-seeking).
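Concretely, for maximisation the three acquisition functions above take the following standard forms[1],[3], with $\beta$ and $\xi$ playing the role of the risk parameter just described (larger values favour exploration):

$$
\mathrm{UCB}(x) = \mu(x) + \beta\,\sigma(x),
\qquad
\mathrm{PI}(x) = \Phi\!\left(\frac{\mu(x) - f^{+} - \xi}{\sigma(x)}\right),
$$

$$
\mathrm{EI}(x) = \left(\mu(x) - f^{+} - \xi\right)\Phi(z) + \sigma(x)\,\phi(z),
\qquad
z = \frac{\mu(x) - f^{+} - \xi}{\sigma(x)},
$$

where $f^{+}$ is the best value observed so far and $\Phi$, $\phi$ are the standard normal CDF and PDF.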
The Optimisation Workflow
By integrating these concepts, BO can identify an optimal set of parameters with a minimal number of experimental iterations. The typical workflow is as follows:
- Initialisation: Begin with a small set of initial data points, either from prior knowledge or quasi-random sampling (e.g. a Sobol sequence).
- Model Fitting: Fit a Gaussian Process surrogate model to the existing data.
- Acquisition: Use the acquisition function to choose the next experiment.
- Experimentation: Run the experiment and record the outputs.
- Update: Add the new data point to the dataset and repeat from step 2 until an optimal solution is found or the experimental budget is exhausted.
This process can also be adapted for batch optimisation, where multiple points are suggested for parallel evaluation in each cycle. While this can slightly reduce sample efficiency, it significantly accelerates the discovery process when multiple experiments can be run concurrently[3].
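For readers who want to see the loop in code, below is a self-contained toy version: sequential rather than batch, with a generic scikit-learn GP and a hand-rolled Expected Improvement, not BioKernel's own implementation. The run_experiment function is a hypothetical stand-in for a wet-lab assay.

```python
import numpy as np
from scipy.stats import norm, qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def run_experiment(x):
    """Hypothetical stand-in for a wet-lab measurement (noisy 2-D peak)."""
    return -np.sum((x - 0.7) ** 2) + 0.01 * np.random.default_rng().normal()

def expected_improvement(mu, sigma, best, xi=0.01):
    z = (mu - best - xi) / np.maximum(sigma, 1e-9)
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(1)
X = qmc.Sobol(d=2, scramble=True, seed=1).random_base2(m=3)  # 1. Sobol initialisation
y = np.array([run_experiment(x) for x in X])

for _ in range(20):                                          # experimental budget
    gp = GaussianProcessRegressor(Matern(nu=2.5), normalize_y=True).fit(X, y)  # 2. fit
    cand = rng.uniform(0, 1, size=(2000, 2))                 # 3. score candidates
    mu, sigma = gp.predict(cand, return_std=True)
    x_next = cand[np.argmax(expected_improvement(mu, sigma, y.max()))]
    y_next = run_experiment(x_next)                          # 4. 'run' the experiment
    X, y = np.vstack([X, x_next]), np.append(y, y_next)      # 5. update and repeat

print("best conditions found:", X[np.argmax(y)], "with output", y.max())
```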
Why is Bayesian Optimisation Under-Employed in Biology?
In the past decade BO has received a significant boost in attention. This can be attributed to the massive leap in popularity of machine learning (ML) and the need for computationally expensive hyperparameter tuning[9]. Almost all conventional ML algorithms (linear regression, decision trees, random forests, etc.) have hyperparameters that control how the model is built. For example, the number of unique leaves and branches has a significant impact on how well a decision tree generalises[10]. Although rules of thumb for such situations exist, they often yield non-ideal model performance. The cost of retraining multiple models with varied parameters and choosing the best performer is often much lower than the gains forfeited by deploying a subpar model in real-life scenarios[3]. Consequently, multiple BO libraries have been developed for Python (the de facto language for ML). These range from simple BO routines in scikit-optimize, through plug-and-play hyperparameter optimisers such as Optuna[11], to researcher-oriented low-level packages such as BoTorch (built on the popular PyTorch library)[12] and Ax[13]. However, since all of these were developed with ML scientists and software developers in mind, they require strong programming knowledge to use effectively. Unfortunately, such a skillset is often neglected within the biological sciences outside of bioinformaticians and PhD-level researchers.
Nevertheless, suitably modified BO has been repeatedly and successfully applied in the natural sciences. Various BO implementations (often using a kernel assuming some level of white noise) have been used in academia and in the corporate sector, with Ax and BoTorch both being co-developed by large companies like Meta[13]. These applications are most prominent where the signal-to-noise ratio can be sufficiently improved through technical replicates, for instance in materials science[14],[15] or chemistry research[16],[17]. There are even claims within the research community that technical replicates are not needed for BO workflows, since custom models are able to accommodate some level of noise[7].
We postulate that the following reasons explain why BO has been under-appreciated by the biological sciences outside of media optimisation and research-level exploration of large genetic combinatorial libraries[18],[20].
One reason is that, by itself, BO is unintuitive and can be mistaken for a random search. This is a misconception: non-Thompson-sampling implementations are completely deterministic, and even random search algorithms actually outperform the more conventional grid search[3].
The second issue pertains to the high, often heteroscedastic (non-constant across the input space) noise levels that come with wet-lab experimentation. Technical replicates are an inherent part of the field and provide information not just about the mean but also about the variance associated with replication. This is a significant limitation for the simplest BO models, since they treat the mean of technical replicates as a noiseless observation (the ground truth)[1]. Consequently, the researcher is required to familiarise themselves with research-level implementations that were often designed to solve one specific problem and cannot be described as beginner-friendly.
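One pragmatic remedy, sketched below under assumed data shapes (and simpler than BioKernel's full heteroscedastic treatment), is to hand the replicate variance of each mean to the GP as a per-point noise term, scikit-learn's alpha argument, so that replicate means are no longer treated as noiseless ground truth:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))
# Three technical replicates per condition, with noise that grows with x.
replicates = np.sin(3 * X) + (0.02 + 0.3 * X) * rng.normal(size=(20, 3))

y_mean = replicates.mean(axis=1)
var_of_mean = replicates.var(axis=1, ddof=1) / replicates.shape[1]

# Per-point variances on the kernel diagonal make the GP trust noisy means
# less, rather than treating every mean as an exact, noiseless observation.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=var_of_mean)
gp.fit(X, y_mean)
```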
Our dry-lab software project aims to address two main facets of this issue. First, it acts as an interactive playground that allows an individual to convince themselves of the merits of BO. This is a first stepping stone for anyone who would then go on to implement their own BO workflow to tackle a specific real-world problem. Second, it has all the functions necessary to run BO as part of an experiment without having to write code. Once deployed, the applet supports rapid access and result analysis while in the lab. This is enhanced by its core modular functionality, meaning that more engaged users can upload custom acquisition functions and kernels to suit their specific needs.
Outlook
The software, BioKernel, can be downloaded from: https://gitlab.igem.org/2025/software-tools/imperial
Beyond what we managed to add ahead of the freeze, we are actively enhancing this package with greater functionality. We currently plan to add the following elements:
- A larger set of kernels, also suitable to exploring mixed-integer spaces (i.e. with continuous and discrete input features).
- A flowchart guiding less technically apt users to the correct choices of acquisition functions and kernels based on simulations and real-world data.
- Optimising for 'just-in-time' induction: some literature[21] indicates that there may be unrealised gains in offsetting the induction of some elements of a metabolic pathway, so we are presently testing ways to optimise induction timing jointly with inducer concentrations.
- Mechanistic surrogate models: the current iteration does not incorporate surrogate models such as kinetic models, which could further reduce iterations through a mechanistically informed prior; straightforward heuristics for implementing this are still lacking.
We are open to collaborating with future iGEMers to make this a robust and usable package for the community at large.
Best model award consideration
We think that our modelling contribution, BioKernel, aligns strongly with the Best Model Award criteria, for the reasons below.
How impressive is the modelling?
- No-code implementation of Bayesian optimisation: Makes a mathematically advanced framework accessible to all experimental biologists, a novel contribution for iGEM and the community at large.
- Built for real lab conditions: Explicitly handles heteroscedastic noise, technical replicates, and parallel batches; necessary for modelling real biological data accurately.
- Modular & extensible: Users can swap kernels and acquisition functions to tailor the model to diverse biological problems, ensuring broad applicability.
Did the model help the team understand a part, device, or system?
- Visualised performance landscape: Transformed sparse data points into a predictive landscape using active subspace visualisation, revealing important interactions.
- Quantified system properties: Modelled the specific noise profile and sensitivities of the real-world limonene production system, showing quantitative understanding.
- Proven understanding via prediction: Successfully found the known optimum (within 10% difference) in a parameter space interpolated from a real dataset, definitively proving its ability to understand the system's behaviour.
Did the team use measurements of a part, device, or system to develop the model?
- Validated on real experimental data: The model was tested and proven effective using measurements from a published microbial limonene production study in addition to simulated data.
- Demonstrated high data efficiency: Reached the known optimum using 22% of the original study's experiments, showing it learns effectively from a small number of real measurements.
Does the modelling approach provide a good example for others?
- Solves a universal iGEM problem: Addresses the critical need for resource-efficient experimentation that nearly every team faces.
- Accessible & reusable tool: As documented, no-code software, it is a ready-to-use solution that future teams can immediately apply, adapt, and contribute to, accelerating their own DBTL cycles. Users can add their own kernels and acquisition functions in a plug-and-play manner.
References
- R. Garnett, Bayesian Optimization. Cambridge University Press, 2023.
- C. Merzbacher, O. Mac Aodha, and D. A. Oyarzún, “Bayesian Optimization for Design of Multiscale Biological Circuits,” ACS Synthetic Biology, vol. 12, no. 7, pp. 2073–2082, Jun. 2023, doi: 10.1021/acssynbio.3c00120.
- Q. Nguyen, Bayesian Optimization in Action. Simon and Schuster, 2024.
- C. Hvarfner, E. O. Hellsten, and L. Nardi, “Vanilla Bayesian Optimization Performs Great in High Dimensions,” arXiv, Feb. 2024, doi: 10.48550/arxiv.2402.02229.
- A. J. Meyer, T. H. Segall-Shapiro, E. Glassey, J. Zhang, and C. A. Voigt, “Escherichia coli ‘Marionette’ strains with 12 highly optimized small-molecule sensors,” Nature Chemical Biology, vol. 15, no. 2, pp. 196–204, Feb. 2019, doi: 10.1038/s41589-018-0168-3.
- A. Ghodasara and C. A. Voigt, “Balancing gene expression without library construction via a reusable sRNA pool,” Nucleic Acids Research, vol. 45, no. 13, pp. 8116–8127, Jun. 2017, doi: 10.1093/nar/gkx530.
- M. Siska, E. Pajak, K. Rosenthal, A. del Rio Chanona, E. von Lieres, and L. M. Helleckes, “A Guide to Bayesian Optimization in Bioprocess Engineering,” arXiv, 2025. https://arxiv.org/abs/2508.10642 (accessed Oct. 08, 2025).
- Jong Hyun Park et al., “Design of Four Small-Molecule-Inducible Systems in the Yeast Chromosome, Applied to Optimize Terpene Biosynthesis,” ACS Synthetic Biology, vol. 12, no. 4, pp. 1119–1132, Mar. 2023, doi: 10.1021/acssynbio.2c00607.
- A. H. Victoria and G. Maragatham, “Automatic tuning of hyperparameters using Bayesian optimization,” Evolving Systems, vol. 12, May 2020, doi: 10.1007/s12530-020-09345-2.
- J. Wu, X.-Y. Chen, H. Zhang, L.-D. Xiong, H. Lei, and S.-H. Deng, “Hyperparameter Optimization for Machine Learning Models Based on Bayesian Optimization,” Journal of Electronic Science and Technology, vol. 17, no. 1, pp. 26–40, Mar. 2019, doi: 10.11989/JEST.1674-862X.80904120.
- T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A Next-generation Hyperparameter Optimization Framework,” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Jul. 2019, doi: 10.1145/3292500.3330701.
- Maximilian Balandat et al., “BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization,” Neural Information Processing Systems, vol. 33, pp. 21524–21538, Jan. 2020.
- M. Olson et al., “Ax: A Platform for Adaptive Experimentation,” Openreview.net, 2025. https://openreview.net/forum?id=U1f6wHtG1g (accessed Oct. 08, 2025).
- C. Li et al., “Rapid Bayesian optimisation for synthesis of short polymer fiber materials,” Scientific Reports, vol. 7, no. 1, Jul. 2017, doi: 10.1038/s41598-017-05723-0.
- P. I. Frazier and J. Wang, “Bayesian Optimization for Materials Design,” Springer series in materials science, vol. 225, pp. 45–75, Dec. 2015, doi: 10.1007/978-3-319-23871-5_3.
- B. J. Shields et al., “Bayesian reaction optimization as a tool for chemical synthesis,” Nature, vol. 590, no. 7844, pp. 89–96, Feb. 2021, doi: 10.1038/s41586-021-03213-y.
- J. Guo, Bojana Ranković, and P. Schwaller, “Bayesian Optimization for Chemical Reactions,” CHIMIA International Journal for Chemistry, vol. 77, no. 1/2, pp. 31–31, Feb. 2023, doi: 10.2533/chimia.2023.31.
- H. Narayanan et al., “Accelerating cell culture media development using Bayesian optimization-based iterative experimental design,” Nature Communications, vol. 16, no. 1, p. 6055, Jan. 2025, doi: 10.1038/s41467-025-61113-5.
- S. G. Baird, A. R. Falkowski, and T. D. Sparks, “Honegumi: An Interface for Accelerating the Adoption of Bayesian Optimization in the Experimental Sciences,” arXiv, Feb. 2025, doi: 10.48550/arxiv.2502.06815.
- M. HamediRad, R. Chao, S. Weisberg, J. Lian, S. Sinha, and H. Zhao, “Towards a fully automated algorithm driven platform for biosystems design,” Nature Communications, vol. 10, no. 1, Nov. 2019, doi: 10.1038/s41467-019-13189-z.
- J. Shin, E. J. South, and M. J. Dunlop, “Transcriptional Tuning of Mevalonate Pathway Enzymes to Identify the Impact on Limonene Production in Escherichia coli,” ACS omega, vol. 7, no. 22, pp. 18331–18338, May 2022, doi: 10.1021/acsomega.2c00483.
- Y. Ma et al., “Flux optimization using multiple promoters in Halomonas bluephagenesis as a model chassis of the next generation industrial biotechnology,” Metabolic Engineering, vol. 81, pp. 249–261, Dec. 2023, doi: 10.1016/j.ymben.2023.12.011.
- P. Casella, A. Iovine, S. Mehariya, T. Marino, D. Musmarra, and A. Molino, “Smart Method for Carotenoids Characterization in Haematococcus pluvialis Red Phase and Evaluation of Astaxanthin Thermal Stability,” Antioxidants, vol. 9, no. 5, p. 422, May 2020, doi: 10.3390/antiox9050422.