Overview

Model Framework Overview

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

The Dry Lab’s workflow has always been closely aligned with the entire project lifecycle, which can be broadly divided into three phases: the early phase, the mid-phase, and the late phase. In the early phase , our focus was on providing background information to support project conception. In the mid-phase , we offered theoretical support to guide project implementation. Finally, in the late phase , we delivered data-driven insights to facilitate real-world application of the project.

During the early phase, we sought to understand the pollution and waste directly from the industrial folic acid synthesis —thereby highlighting the unique advantages of plant-based folic acid biosynthesis—and to address the relative lack of quantitative data on the environmental impact of current industrial production routes. To this end, we developed a “PINN Differential Equations Model for Industrial Folic Acid Synthesis” grounded in chemical reaction equations, leveraging Physics-Informed Neural Networks (PINNs) to assist in solving the system.

In the mid-phase, to better support project, we tackled two critical challenges: “how to select candidate genes” and “how to improve information collection.” We addressed these by proposing, respectively, an “AI-Driven Gene Selection Model” and a “Domain-specific LLM with SFT and RAG”

The AI-Driven Gene Selection Model began with the application of a phylogenetic tree algorithm tailored to our project needs, which pre-screened a set of soybean genes with high homology. We then used AlphaFold3 to predict the 3D structures of these candidate genes, followed by Interformer —a Graph-Transformer-based protein–small molecule docking model—to predict binding interactions. Finally, we performed molecular dynamics simulations using GROMACS to validate the stability of these interactions under dynamic conditions.

The Domain-specific LLM with SFT and RAG consisted of two main components: dataset construction and model construction. For dataset construction, we employed two strategies: (1) direct web crawling via automated scrapers, and (2) manual distillation from high-performing domain-specific models and state-of-the-art (SOTA) general-purpose models. In model construction, we implemented three key steps: (1) Supervised Fine-Tuning (SFT) , where we fine-tuned Qwen3 models of varying parameter scales using LLaMA Factory; (2) Retrieval-Augmented Generation (RAG) , for which we built a vector database using Qwen3-embedding-0.7B and employed Qwen3-Reranker to re-rank retrieval results, thereby completing the RAG pipeline.

In the late phase, recognizing that high-folate soybeans are typically consumed as soy-based food products, we developed a “Thermodynamic Modeling of Active Folate during High-Folate Soybean Processing” to explore optimal processing parameters that preserve folate content during food preparation.

Overall, our modeling framework comprises three interconnected components: macro-scale models, micro-scale models and domain-specific large language models (LLMs) . These components exhibit mutually reinforcing relationships : the macro- and micro-scale models provide complementary theoretical foundations and co-evolve throughout the project; the domain-specific LLM enhances team members’ understanding of both macro- and micro-scale models ; and, in turn, the macro- and micro-scale models generate rich, structured data that further enriches and refines the domain-specific LLM.

Model1: PINN Differential Equations Model for Industrial Folic Acid Synthesis

1.1 Background and Objectives

Figure1.0： Overview of PINN Differential Equations Model for Industrial Folic Acid Synthesis

Folic acid, an essential B-vitamin required by the human body, is widely used in pharmaceuticals and food industries. However, current industrial synthesis processes face challenges such as high raw material consumption, significant waste generation, and environmental pollution.

To quantitatively evaluate these issues and support process optimization, we have developed a comprehensive differential equation model simulating the entire folic acid production pathway: "preparation of 2,4,5-triamino-6-hydroxypyrimidine sulfate → folic acid synthesis → purification." This model converts chemical reaction kinetics into systems of ordinary differential equations (ODEs), which are solved via Physics-Informed Neural Networks (PINNs).The approach enables precise prediction of concentration dynamics, yield, and byproduct formation throughout the process.

1.2 Model Construction

Industrial folic acid synthesis consists of two core stages:

(1) The preparation of 2,4,5-triamino-6-hydroxypyrimidine sulfate.

(2)The subsequent synthesis of folic acid based on this pyrimidine sulfate intermediate.

Below, we construct differential equations describing the reaction kinetics for each stage.

1.2.1 Equation Construction

1.2.1.1 Synthesis of 2,4,5-Triamino-6-Hydroxypyrimidine Sulfate

This stage starts with methyl cyanoacetate as the initial raw material, and the material undergoes multiple steps to produce 2,4,5-triamino-6-hydroxypyrimidine sulfate. The core reaction equation is shown in Figure 1.1.

Figure 1.1：Equation for the synthesis of 2,4,5-triamino-6-hydroxypyrimidine sulfate

The corresponding differential equations are as follows:

（1）Concentration change rate of Methyl Cyanoacetate（A）

\frac{d A}{d t} = - k_{1} A

Methyl cyanoacetate (A) reacts with guanidine nitrate in the presence of methanol sodium to produce 2,4-diamino-6-hydroxypyrimidine (D). The rate constant k1=0.85/h is derived from the experimental fact that the reaction is complete within 3 hours, with a conversion rate of A exceeding 95%, consistent with the kinetic characteristics of k1=0.85.

（2）Concentration change rate of 2,4-Diamino-6-Hydroxypyrimidine (D)

\frac{d D}{d t} = k_{1} A - k_{2} D - k_{s i d e} D

Under acidic conditions, D reacts with sodium nitrite to form 2,4-diamino-5-isonitroso-6-oxopyrimidine (H). The rate constant k2=5.2/h is consistent with the experimental results of the procedure: "cooling to -10°C with ice-salt water before adding hydrochloric acid, with the reaction self-heating to 50°C": the low temperature suppresses side reactions, while 50°C accelerates nitrosation, and the value of k2 reflects the reaction efficiency at this temperature. The side reaction rate k_side Dorresponds to the nitrosation side reactions of D (e.g., over-nitrosation generating nitro impurities), with k_side=0.15/h, explaining the experimental practice of "controlling the amount of sodium nitrite to 40 g" to avoid excessive consumption of D due to side reactions.

（3）Concentration change rate of 2,4-Diamino-5-Isonitroso-6-Oxopyrimidine (H)

\frac{d H}{d t} = k_{2} D - k_{3} H - k_{s i d e} H

H reacts with excess iron powder and hydrochloric acid to produce 2,4,5-triamino-6-hydroxypyrimidine (M). The rate constant k3=0.65/h matches the experimental result that H is fully reduced within 1.5 hours with an excess of iron powder, ensuring complete reduction. The side reaction rate k_side Hcorresponds to the over-reduction of H (e.g., to inactive aminopyrimidine), with k_side=0.15/h, explaining the experimental practice of filtering while hot to avoid over-reduction of H upon cooling.

（4）Concentration change rate of 2,4,5-Triamino-6-Hydroxypyrimidine (M)

\frac{d M}{d t} = k_{3} H - k_{4} M - k_{s i d e} M

M reacts with concentrated sulfuric acid to produce 2,4,5-triamino-6-hydroxypyrimidine sulfate (Pyr). The rate constant k4=5.2/h matches the experimental process of adding concentrated sulfuric acid at 40-50°C and stirring for 0.5-1 hour. The mild temperature prevents decomposition of M, reflecting the rapid and efficient nature of the salt formation reaction. The side reaction rate k_side Mcorresponds to the oxidation of M (M contains multiple amines, which are prone to oxidation by air to form dark-colored impurities), with k_side=0.15/h, explaining the experimental practice of cooling to below 0°C after salt formation to minimize oxidation loss of M.

（5）Concentration change rate of 2,4,5-Triamino-6-Hydroxypyrimidine Sulfate (Pyr)

\frac{d P y r}{d t} = k_{4} M \cdot 0.95

The factor 0.95 is a correction coefficient for salt formation efficiency. After salt formation, the product is filtered and washed with water until the pH reaches 6-7. The water washing process results in a loss of about 5% of Pyr. This correction factor ensures that the calculated results match the experimental yield of 75%.

（6）Concentration change rate of by-products: Hydrolysis/Oxidation Impurities (S)

\frac{d S}{d t} = k_{s i d e} \cdot ((D + H + M))

This equation accounts for the total generation of by-products from the side reactions of all intermediates, including excessive nitrosation of D, over-reduction of H, and oxidation of M. The accumulation of S directly determines the pollutant concentration in the wastewater from washing, providing the basis for subsequent "pollution calculation" in the code.

Figure 1.2：Kinetics of 2,4,5-triamino-6-hydroxypyrimidine sulfate synthesis

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

1.2.1.2 Folic Acid Synthesis Reaction Equation

After obtaining 2,4,5-triamino-6-hydroxypyrimidine sulfate, it further reacts with potassium carbonate, trichloroacetyl chloride, and N-(4-aminobenzoyl)-L-glutamic acid to produce crude folic acid. The core reaction equation is shown in Figure 1.3.

Figure 1.3：Folic acid synthesis equation

The corresponding differential equations and variable explanations are as follows:

（1）Concentration change rate of 2,4,5-Triamino-6-Hydroxypyrimidine Sulfate（C_pyr,sulf）

\frac{d C_{p y r, s u l f}}{d t} = - r_{1}

Rate expression:

r_{1} = k_{1} \cdot C_{p y r, s u l f} \cdot \min ((1, \frac{C_{k 2 c o 3}}{C_{p y r, s u l f}}))

Pyrimidine sulfate reacts with an equimolar amount of potassium carbonate to produce free pyrimidine, water, and CO₂. Based on the experimental fact that the reaction is complete within 30 minutes at 40°C, after temperature correction, the rate constant k1 is 0.09 min⁻¹, ensuring that the conversion rate of C_pyr,sulf exceeds 99% within 30 minutes. The term min(1, C_k2co3/C_pyr,sulf) is a correction factor for excess potassium carbonate. In the literature, "an equimolar amount of potassium carbonate" corresponds to a ratio of 1.0. If potassium carbonate is insufficient, the pre-treatment rate is proportionally reduced to avoid residual C_pyr,sulf.

（2）Concentration change rate of Potassium Carbonate（C_k2co3）

\frac{d C_{k 2 c o 3}}{d t} = - r_{1}

The consumption of potassium carbonate is synchronized with that of pyrimidine sulfate in the pre-treatment process, consistent with the experimental requirement of "an equimolar amount of potassium carbonate" to ensure no residual potassium carbonate after pre-treatment. According to the literature, the amount of potassium carbonate used is 2.31 g, with a molar mass of 138.2 g/mol, corresponding to an amount of substance of 2.31/138.2≈0.0167mol. The concentration of potassium carbonate is 0.0167/0.5=0.0334mol/L.

（3）Concentration change rate of Trichloroacetyl Chloride（C_trichlor）

\frac{d C_{t r i c h l o r}}{d t} = - ((_{r 2} + r_{s i d e}))

Free pyrimidine reacts with trichloroacetyl chloride to form an intermediate condensation product（C_inter）. The rate expression is r₂=k₂⋅C_pyr,free⋅C_trichlor. The rate constant k₂=1.2 L/(mol⋅min) requires triple correction:

pH correction: Using $e^{(- 0.5 \cdot {((p H - 3))}^{2})}$ to correct, the rate constant k₂ increases by 1.1 times at pH 2.5-2.8.
Temperature correction: Using the Arrhenius formula $e^{(\frac{E a}{R} \cdot ((\frac{1}{313.15} - \frac{1}{T + 273.15})))}$ to correct, the rate constant k₂ increases by 1.3 times at 40°C compared to 30°C.
Phase-transfer catalyst correction: Using $1 + 0.8 \cdot C_{c a t a l y s t}$ to correct, the rate constant k₂ increases by 1.12 times at a catalyst concentration of 0.15 g/L.

r_side corresponds to the side reactions of trichloroacetyl chloride mentioned in the literature: 60% of r_side is the hydrolysis of trichloroacetyl chloride to trichloroethanol and formic acid; 40% is the formation of dimer impurities from the reaction of free pyrimidine with trichloroacetyl chloride.
The rate expression is r_side=k_side⋅C_pyr,free⋅C_trichlor，k_side=0.2 L/(mol⋅min), explaining the literature's use of 2.6 times excess trichloroacetyl chloride to offset the consumption of side reactions and ensure sufficient condensation reaction.

（4）Concentration change rate of Glutamic Acid（C_glut）

\frac{d C_{g l u t}}{d t} = - r_{3}

The intermediate condensation product （C_inter）reacts with glutamic acid to produce folic acid and hydrochloric acid. The rate expression is r₃=k₃⋅C_inter⋅C_glut（k₃=0.5 L/(mol⋅min)）, also requiring triple correction:

pH correction: Using $1 + 0.15 \cdot \max ((0, 7 - p H))$ to correct, the rate constant k₃ increases by 1.08 times at this pH.
Temperature correction: Similar to the condensation reaction, the rate constant k₃ is 1.1 times lower at 40°C compared to 50°C to avoid decomposition of folic acid at high temperatures.
Catalyst correction: Using $1 + 0.6 \cdot C_{c a t a l y s t}$ to correct, the rate constant k₃ increases by 1.09 times due to the promotion of glutamic acid dissolution by tetrabutylammonium bromide.

（5）Concentration change rate of Free Pyrimidine（C_pyr,free）

\frac{d C_{p y r, f r e e}}{d t} = r_{1} - r_{2} - r_{s i d e}

This reflects the experimental procedure of "immediate addition of trichloroacetyl chloride after pre-treatment." If there is a delay, C_pyr,free will be excessively consumed by r_side, leading to a decrease in yield. The code enforces that C_pyr,free is quickly involved in the main reaction after generation.

（6）Concentration change rate of Folic Acid（C_folic）

\frac{d C_{f o l i c}}{d t} = r_{3}

The accumulation of C_folic is entirely determined by r₃.

（7）Concentration change rate of By-products: Dimers/Hydrolysis Products（C_byproduct）

\frac{d C_{b y p r o d u c t}}{d t} = r_{s i d e}

These come from the side reactions of trichloroacetyl chloride and the dimerization of free pyrimidine.

Figure 1.4：Product formation process in folic acid synthesis

1.2.2 Neural Network Solution

The aforementioned differential equation systems for the two stages exhibit complex characteristics such as multi-variable coupling and side reaction interference. This model innovatively integrates PINNs to achieve efficient solution of these systems. Two dedicated networks are constructed to solve the concentration changes in the two stages respectively, enabling a 'data-driven + physics-constrained' approach.

The key to PINNs is to automatically compute the derivative $\frac{d C}{d t}$ ，and compare it with the theoretical differential equation to obtain the residual. The total loss = data fitting loss (MSE) + physical residual loss (MSE). The data fitting loss measures the deviation between the network output and the data generated by traditional numerical methods, ensuring that the network learns the basic trend of the reaction. The physical residual loss measures the deviation between the network output and the "reaction differential equation," ensuring that the network satisfies physical laws. Since concentration prediction is a continuous value regression problem, MSE is the most commonly used choice.

The Adam optimizer is used to train for 500 epochs to balance data fitting and physical constraints. Adam is one of the most commonly used optimizers in deep learning, combining the advantages of momentum and adaptive learning rate (RMSprop), especially suitable for PINN optimization: it can accumulate historical gradient directions to accelerate convergence and allocate different learning rates to different parameters to solve the problem of "data loss and residual loss gradient magnitude differences" in PINN.

Figure 1.5：Heatmap of final total loss of different numbers of layers and neurons

Figure 1.6：Influence of number of neurons on loss with fixed number of layers

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

Figure 1.7：Relationship between training time and network configuration

After testing, the Pyrimidine PINN selects 3 hidden layers with 128 neurons. The loss curve decreases slowly after 128 neurons, and 3 layers can achieve a precision close to 4 layers at 128 neurons. The training time for 128 neurons is significantly lower than that for 256 or 512 neurons. The 3-layer 128 configuration achieves a balance of "sufficient precision and prioritizing efficiency." For FolicAcidPINN, the total loss is much lower at 512 neurons than at other configurations. The loss curve for 3 layers is close to that for 4 layers at 512 neurons, and increasing the number of layers has a limited effect on reducing the loss. The training time for 3 layers with 512 neurons is about 20% less than that for 4 layers with 512 neurons. Therefore, 3 hidden layers + 512 neurons are selected.

Different network structures are designed according to the input and output requirements of the two stages:

PyrimidinePINN (Pyrimidine synthesis): The input layer includes 7 dimensions (reaction time + 5 key process parameters + initial concentration),The hidden layers consist of 3 layers, each using the Tanh activation function and containing 128 neurons, and the output layer has 6 dimensions (concentrations of each component).

FolicAcidPINN (Folic acid synthesis): The input layer includes 12 dimensions (reaction time + 11 process parameters), the hidden layer also consists of 3 layers with the Tanh activation function (512 neurons per layer), and the output layer has 11 dimensions.

1.3 Results Analysis

1.3.1 Parameter Sensitivity Analysis

A robustness analysis is designed to evaluate the stability of the entire folic acid synthesis process against parameter perturbations.. The core of this analysis is to apply random perturbations to key process parameters, simulate multiple experiments, and statistically analyze the fluctuations of core indicators to ultimately determine the reliability of the process. The coefficient of variation (Coefficient of Variation, CV) is a statistical measure used to quantify the relative dispersion of data, eliminating the influence of different data scales and mean values to enable fair comparison of dispersion among different datasets. The CV is essentially the "ratio of the standard deviation of the data to the mean value," usually expressed as a percentage (%), with the formula:

C V = \frac{S D}{M e a n} \times 100 %

By dividing by the mean value of the indicator, the differences in the mean values of different indicators are eliminated, allowing comparison of the dispersion of indicators with different scales. The CV reflects the relative degree of fluctuation rather than the absolute fluctuation. A smaller CV indicates that the indicator fluctuates less under perturbation, meaning that the indicator is less sensitive to perturbation and the process is more stable.

After 500 Monte Carlo simulations, the coefficient of variation (CV) for key parameters is shown in Figure 1.8.

Figure 1.8：coefficient of variation of key indicators

The CVs for the yield and purity of folic acid are 2% and 2.2%, respectively, while those for the yield and purity of pyrimidine are 2.8% and 1.9%, all less than 5%, indicating excellent system stability and strong anti-interference ability. The CV for total waste is 9.5%, which is between 5% and 10%, indicating good system stability.

From another perspective, to identify key process parameters that significantly affect the yield of folic acid and simplify model complexity, a sensitivity analysis of the rate constants and initial concentrations in the aforementioned differential equations is conducted. The results are summarized in Figure 1.9.

Figure 1.9：Sensitivity analysis of folic acid synthesis parameters

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

Only five parameters significantly affect the yield of folic acid: the rate constants k2 and k3 for folic acid synthesis, the side reaction rate constant k_side for folic acid synthesis, the initial concentration of glutamic acid, and the initial concentration of potassium carbonate. Most other parameters have minimal impact on the final yield of folic acid and can be fixed to simplify the model.

1.3.2 Model Results (Visualization)

The final yield of folic acid is 3.84 g, with a purity of 99.0% and a yield of 63.4%.

Figure 1.10： Distribution of liquid pollutants

Figure 1.11：Raw material utilization radar chart

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

1.3.3 Conclusion

The results of the model have greatly helped our team to gain a deeper understanding of the pollution and wastage associated with current folic acid synthesis, providing necessary data and theoretical support for project initiation. The model reveals significant resource wastage in industrial synthesis, with low raw material conversion efficiency. For example, the utilization rate of trichloroacetyl chloride is only 42.3%, and that of methyl cyanoacetate is 63.6%, with a large amount of raw material not being converted into the target product. High levels of by-products and waste are also observed, with an iron(II) chloride emission of 5.67 g and a total waste volume of 27.25 g, which is seven times the yield of folic acid.

In contrast, cultivating high-folate soybeans offers significant advantages: it relies on natural photosynthesis and biosynthesis processes to directly obtain folate and related substances from natural crops, without the need for complex chemical raw material preparation and multi-step synthesis reactions. The raw material acquisition is more direct and natural, and agricultural cultivation methods produce almost no industrial waste or pollutants associated with chemical synthesis processes, resulting in more efficient resource utilization and better alignment with green and sustainable development requirements.

Additionally, during the model construction process, we effectively utilized PINNs, a deep learning neural network, to solve the differential equations and explored the optimal configuration of PINNs under different hyperparameter settings in terms of computational precision and solution accuracy. For our specific problem, the optimal configurations are as follows: Pyrimidine PINN with 3 hidden layers and 128 neurons; FolicAcidPINN with 3 hidden layers and 512 neurons. We hope this provides some guidance for future iGEM participants or other experts facing difficulties in solving differential equations.

1.4 Contribution

1.The introduction of Physics-Informed Neural Networks (PINNs) to solve differential equations breaks through the limitations of traditional numerical methods. By incorporating physical laws as constraints into neural network training, it combines the flexibility of data-driven approaches with the reliability of model-driven methods, offering a new paradigm for simulating complex chemical processes.

2.The model comprehensively integrates practical factors in industrial production, such as "multi-variable interference," "stage-wise regulation," and "side reaction competition." It accurately characterizes the non-linear effects of temperature, pH, and catalysts on reaction rates in industrial settings, ensuring that the simulation results closely match real production scenarios.

Model2: AI-driven Gene Selection Model

Figure 2.1 Overview of the AI-driven gene selection model

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

2.1 Background and Objectives

In the early stages of the project, through literature research, we discovered that the high folic acid synthesis in Arabidopsis thaliana has been successfully achieved. Therefore, we aimed to identify suitable soybean genes for folate increment expression by referring to the genes involved in the high folate synthesis pathway of Arabidopsis thaliana. After conducting literature research and consulting experts, the wet lab team initially determined four types of enzymes:

ADCS: Aminodeoxychorismate synthase(Arabidopsis gene: AT2G28880)

DHFR: Dihydrofolate reductase(Arabidopsis gene: AT2G16370)

GCH1: GTP cyclohydrolase 1(Arabidopsis gene: AT3G07270)

HPPK：6-Hydroxymethyl-7,8-Dihydropterin(Arabidopsis gene: AT4G30000)

Our objective was to first screen a subset of genes based on sequence similarity through phylogenetic tree analysis. Subsequently, we aimed to further refine the selection of candidate target genes by constructing protein structure prediction models, protein-small molecule docking models, and molecular dynamics simulation models at the protein level. This approach aimed to reduce the experimental costs for the wet lab team in conducting gene selection pre-experiments. Additionally, it provides a foundation and reference for other teams interested in developing high folate pathway expression in the future. Ultimately, the folate expression levels obtained from the wet lab experiments were used to validate the effectiveness of the dry lab models.

2.2 Preliminary Screening of Target Genes

2.2.1 Background and Objectives

To accurately identify the orthologous genes in a complex genome that has undergone multiple gene duplications, such as soybean, constructing a phylogenetic tree is one of the most effective and intuitive methods. A phylogenetic tree is a tree-like diagram that represents the evolutionary relationships between species or genes. In this study, its value lies in distinguishing between orthologous and paralogous genes.

Orthologous genes: Genes that originated from the same ancestral gene and were separated into different species due to speciation events. They typically retain the same biological function. Paralogous genes: Genes that originated from the same ancestral gene but were duplicated within the same genome, potentially leading to functional divergence.

A simple BLAST search can only identify genes with sequence similarity but cannot differentiate whether the gene is the desired ortholog or a paralog with potentially altered function. Therefore, we introduced the phylogenetic tree model. In a phylogenetic tree, orthologous genes form a highly supported clade comprising genes from different species.

Thus, our goal was to use the known Arabidopsis thaliana genes as "probes" to identify the most likely orthologous genes in the soybean genome that are located within the same small branch as the Arabidopsis thaliana genes.

2.2.2 Phylogenetic Tree Evaluation

We provided a multiple sequence alignment file containing folate synthase proteins from Arabidopsis thaliana, soybean, and other closely related species. The phylogenetic analysis was conducted using the IQ-TREE software (v1.6.12).

We employed the -m TEST parameter to invoke the ModelFinder program, which automatically tested 20 different amino acid substitution matrices, each with eight combinations: basic,+I, +G4, +I+G4, +F, +F+I, +F+G4, +F+I+G4.

Based on the Bayesian Information Criterion (BIC), the optimal model was determined to be JTT (Jones-Taylor-Thornton amino acid substitution matrix) +G4 (Gamma-distributed rate heterogeneity with four rate categories). This model combination provided a statistically optimal framework for accurately inferring orthologous relationships.

To assess the reliability of the constructed phylogenetic tree, we conducted a dual test with branch support settings: -bb 1000 (Ultrafast Bootstrap with 1000 replicates) and -alrt 1000 (SH-aLRT test with 1000 replicates).

2.2.3 Results Analysis

The results analysis is exemplified using the Arabidopsis thaliana gene AT2G28880, which corresponds to the enzyme ADCS.

We located the known Arabidopsis thaliana gene (AT2G28880) on the tree and identified the soybean genes within the same branch. We focused on the support values of this branch: if the branch had both high UFBoot values (≥95%) and high SH-aLRT values (≥80%), we could confidently conclude that the soybean gene was the orthologous gene we were seeking.

Figure 2.2 Phylogenetic tree results of ADCS enzyme evolution

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

From the phylogenetic tree results, it can be observed that AT2G28880 is more closely related to Glyma.10G211600 (Glyma.10G211600.1) and Glyma.20G179400 (Glyma.20G179400.2), with high branch support values. Therefore, Glyma.10G211600 (Glyma.10G211600.1) and Glyma.20G179400 (Glyma.20G179400.2) were selected as the candidate soybean genes.

Following the same analytical approach, we identified a total of 13 candidate target genes across four categories, as shown in the table below:

NO.	Type	Gene Name
1	ADCS	Glyma.10G211600
2	ADCS	Glyma.20G179400
3	DHFR	Glyma.06G010600
4	DHFR	Glyma.11G115200
5	DHFR	Glyma.04G010800
6	DHFR	Glyma.12G041000
7	GCH1	Glyma.08G347400
8	GCH1	Glyma.16G089300
9	GCH1	Glyma.18G158900
10	GCH1	Glyma.03G085100
11	GCH1	Glyma.03G085500
12	HPPK	Glyma.02G120400
13	HPPK	Glyma.01G062200

Table2.1：Preliminary screening results of candidate genes

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

Other Phylogenetic Tree Results

DHFR

Figure 2.3 Phylogenetic tree results of DHFR enzyme

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

GCH1

Figure 2.4 Phylogenetic tree results of GCH1 enzyme

HPPK

Figure 2.5 Phylogenetic tree results of HPPK enzyme

2.3 Protein Structure Simulation Model

2.3.1 Background and Objectives

After the preliminary screening using the phylogenetic tree model, we still had a considerable number of candidate genes to choose from. To make a more informed decision in selecting the target genes for our project and to save time and effort for the wet lab, we needed to translate the genetic material into proteins to further analyze the merits of different candidate genes. However, the protein structures corresponding to most candidate genes were not well-defined. Therefore, we planned to use the AlphaFold 3 model, based on Graph Neural Networks (GNN), to predict and model the protein structures of the candidate genes for subsequent downstream tasks.

The objective of this model was to use the AlphaFold 3 model to predict the protein structures of all (13 in total across four categories) candidate genes and select the protein structure with the highest overall score for inclusion in the protein-small molecule docking model.

2.3.2 Model Construction

We first obtained the amino acid sequences of the target proteins in FASTA format.

Subsequently, we utilized the widely recognized AlphaFold 3 deep learning model in the academic community to perform the calculations. This model can accurately predict the three-dimensional spatial structure formed by the folding of the amino acid sequence based on a graph neural network.

The output of this process was multiple PDB (Protein Data Bank) files containing atomic coordinates, along with scores for different prediction results. Among the many scoring dimensions, we primarily focused on the prediction structure with the highest ranking score for inclusion in our study.

2.3.3 Results Analysis

Table2.2 : Protein Structure Prediction Score Table

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

AlphaFold3 Protein Structure Visualization Results

ADCS

Figure2.6：Structure prediction result of Glyma.10G211600

Figure2.7 Three-dimensional structure prediction of Glyma.10G211600

Taking the structure prediction result of Glyma.10G211600 as an example, we first examined the PAE (Predicted Aligned Error) plot. The green color indicates low error (<5 Å). Overall, the predicted structure showed high confidence, with most regions having errors below 5 Å and near-zero errors along the diagonal (residue self-alignment), indicating good prediction quality.

Next, we analyzed the AlphaFold 3 output result file (summary_confidences_0). The iptm value is typically used to assess the quality of multi-chain protein complexes. Since our input proteins were all single-chain, the absence of iptm was reasonable. The pae_min value was low (close to 0), indicating high precision in structure alignment. The pTM value of 0.8 and the ranking score of 0.91 were both high, indicating a high-quality prediction.

Using the same approach, we analyzed the protein prediction results of the other 12 genes. Overall, the structure predictions were of high quality and confidence, and thus all were included in the subsequent protein-small molecule docking model analysis.

Other 3D Visualization Results

DHFR

Figure2.8：Structure prediction result of Glyma.06G010600(DHFR)

Figure2.9：Three-dimensional structure prediction of Glyma.06G010600(DHFR)

GCH1

Figure2.10：Structure prediction result of Glyma.18G158900(GCH1)

Figure2.11：Three-dimensional structure prediction of Glyma.18G158900(GCH1)

HPPK

Figure2.12：Structure prediction result of Glyma.02G120400(HPPK)

Figure2.13：Three-dimensional structure prediction of Glyma.02G120400(HPPK)

2.4 Protein-Small Molecule Docking Model

2.4.1 Model Background and Objectives

With the high-confidence predicted structures of the proteins corresponding to the candidate genes, we aimed to further understand the binding interactions between the proteins and their corresponding ligands. To achieve this, we planned to use the state-of-the-art (SOTA) Interformer model, which is based on the Graph Transformer architecture.Interformer is a unified model designed to capture non-covalent interactions using an interaction-aware mixture density network.

Despite the relatively high learning and training costs associated with deploying and using this protein-small molecule docking model, we chose to use it to pursue better docking results, provide more precise support for the wet lab experiments, and offer insights to future iGEM teams interested in using Graph Transformer-based docking models in their experimental design and model usage.

2.4.2 Model Construction

Model Deployment

The deployment process consisted of four main steps: cloning the project to the local environment using git, configuring the environment with Conda, installing PLIP (to avoid dependency errors with openbabel), and compiling the docking sampling program.

The deployment code can be found in the following MarkDown file（in iGEM Team Gitlab）:
Interformer Deployment Guide

Model Construction (Single Docking Process)

Figure 2.14：Flowchart of single-protein small-molecule docking using Interformer

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

The complete code file and detailed description can be found in the following MarkDown file（in iGEM Team Gitlab）:
Interformer Molecular Docking and Affinity Prediction – A Step-by-Step Tutorial

Model Construction (Batch Processing of Multiple Docking Files)

During the actual prediction process, we encountered difficulties with the lack of automated batch execution of the docking program.

Considering that other teams might also want to perform batch processing of Interformer docking modeling in the future, we developed a script to help us conduct protein-small molecule docking in a simple, elegant manner.

We hope this script will also assist future iGEM participants or synthetic biologists in their research.

The core idea of this script is illustrated in the figure below:

Figure 2.15：Flowchart of Automated Protein-Ligad Docking Workflow

The script file and a more detailed documentation can be found at the following GitLAB link:

README（in iGEM Team Gitlab）
Batch Processing Interformer Protein Docking Pipeline

script file（in iGEM Team Gitlab）
run_all_workflows

2.4.3 Results Analysis

Table 2.3: Interformer docking results

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

Interformer is a blind docking model based on Graph Transformer and interaction-aware mixture density network (MDN). After inputting the ligand's sdf file and the protein's pdb file into the model, it generates 21 different binding sites for the ligand-protein complex. We selected the optimal binding site for further study.

Additionally, two main evaluation metrics are produced: RMSD and Energy. Root Mean Square Deviation (RMSD) measures the average displacement of atoms and is commonly used to assess the overall difference between the simulated protein structure and the initial structure. A higher RMSD value indicates a greater deviation from the initial state. Energy , measured in joules (J), indicates the binding affinity; a lower value suggests a tighter binding and a more stable overall structure, which may perform better in enzymatic reactions.

Overall, the Interformer model predicted that the proteins corresponding to the different target genes bind very tightly with their respective ligands, and they are expected to function effectively. However, these results are not dynamic.

Next, we will use Gromacs to conduct molecular dynamics simulations to analyze the binding interactions from a dynamic perspective.

2.5 Molecular Dynamics Simulation Model

2.5.1 Background and Objectives

To further understand the binding interactions between the ligand and the protein in a dynamic context and to provide a more in-depth theoretical basis for selecting the target genes, we planned to conduct molecular dynamics simulations using Gromacs.

2.5.2 Model Construction

Figure 2.16：Flowchart of molecular dynamics simulation

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

The main workflow of the dynamics simulation is as follows: ① Structure preprocessing: Ensure that the input PDB file does not contain water molecules and has no missing residues; ② Topology generation: Create GROMACS topology files for both the protein and the small molecule; ③ System assembly: Combine the protein and small molecule, construct a water box, and add ions to neutralize the system; ④ Energy minimization: Eliminate unreasonable contacts; ⑤ Equilibration phase: First perform NVT (constant temperature and volume) and then NPT (constant temperature and pressure); ⑥ Production molecular dynamics simulation: Generate trajectories for analysis.

Our parameter settings are as follows（in iGEM Team Gitlab）:
md.mdp

The description and detailed simulation codecan be found in the following MarkDown file.

Description（in iGEM Team Gitlab）
GROMACS Molecular Dynamics Simulation Pipeline for Protein–Ligand Complex

Simulation code（in iGEM Team Gitlab）
GROMACS Molecular-Ligand Dynamics Simulation Commands

2.5.3 Results Analysis

2.5.3.1 Binding Stability Analysis

ADCS

Figure 2.17：RMSD and RMSF results for ADCS

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

From the RMSD analysis: The two curves rapidly rise in the first ~1000 ps, indicating that the system is "relaxing" (i.e., adjusting to a more stable conformation). Between 1000 and 5000 ps, the RMSD levels off with fluctuations, indicating that the system has reached an equilibration state. The red line (Glyma.20G179400) has a final RMSD value of approximately 0.8 Å, while the black line (Glyma.10G211600) has an RMSD of about 1.2 Å.

From the RMSF analysis: A significant peak appears around residues ~150-200 (especially in the black line), suggesting high flexibility in this region, which may indicate a long loop or a functional regulatory area (e.g., substrate binding entrance). Another notable peak is observed around residues ~300-400, but the red line shows a smaller increase than the black line, indicating that Glyma.20G179400 is more stable in this region.

Overall, Glyma.20G179400 exhibits a more stable conformation with lower flexibility, suggesting stronger structural rigidity or better folding stability. In contrast, Glyma.10G211600 is more flexible and may undergo conformational changes more easily, possibly participating in dynamic regulation (e.g., conformational switching). However, both candidates maintain their basic topological structures, confirming their feasibility. Therefore, we recommend that the wet lab team conduct pre-experiments on both genes and determine the target gene based on folate expression levels.

Using the same approach , we analyzed the topological stability of protein-small molecule complexes under dynamic conditions from the perspectives of RMSD and RMSF, primarily considering the stability of the final topological structure as the basis for selecting candidate target genes for the wet lab experiments.

DHFR

Figure 2.18：RMSD and RMSF results for DHFR

GCH1

Figure 2.19：RMSD and RMSF results for CGH1

HPPK

Figure 2.20：RMSD and RMSF results for HPPK

2.5.3.2 Binding Site Analysis

ADCS

Figure 2.21：Molecular dynamics simulation of the predicted protein–small molecule complex for Glyma.20G179400(ADCS)

Figure 2.22：Binding site analysis from molecular dynamics simulation of the protein–small molecule complex for Glyma.20G179400 (ADCS)

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

Due to time and computational limitations, we selected one candidate gene from the ADCS category, Glyma.20G179400, to represent the binding situation of the enzyme ADCS with its small molecule ligand chorismate.

Overall observations: ① The chorismate molecule is embedded in the protein's active pocket in a ring-like structure; ② Multiple amino acid residues surround the substrate, forming a highly specific binding site; ③ Various non-covalent interactions exist, including hydrogen bonds, salt bridges, π-cation interactions, hydrophobic interactions, and halogen bonds; ④ There are multiple charged residues near the active center (such as Asp, Glu, Arg, Lys), suggesting possible roles in catalysis or stabilization of transition states.

Based on these observations, we identified several potential directions for protein structure evolution:

①Asp726 → Glu726：Currently, Asp726 forms a strong hydrogen bond with the C3--OH of chorismate through its OD2. It may participate in deprotonation reactions (acting as a catalytic base). Glu has a longer side chain that could fine-tune the position of the carboxyl oxygen to better match the orientation of C3--OH. However, this substitution might introduce steric hindrance or alter the pKa, which would need to be verified through MD simulations.

②Lys830 → Arg830：Currently, Lys830 forms a hydrogen bond (or electrostatic attraction) with the C4=O of chorismate. Located at the exit of the active pocket, it may be involved in regulating product release. Arg has a stronger positive charge and multiple hydrogen bonding capabilities due to its guanidinium group, which can more firmly anchor the C4=O. Additionally, the planar structure of Arg may form a cation-π interaction with the conjugated system of chorismate, further stabilizing the binding. This could enhance electrostatic stabilization.

DHFR

Figure 2.23：Molecular dynamics simulation of the predicted protein–small molecule complex for Glyma.06G010600(DHFR)

Figure 2.24：Binding site analysis from molecular dynamics simulation of the predicted protein–small molecule complex for Glyma.06G010600(DHFR)

We selected one candidate gene from the DHFR category, Glyma.06G010600, to represent the binding situation of the enzyme DHFR with its small molecule ligand dihydrofolic acid (DHF).

Overall observations: ① The DHF molecule has an "L" shape, including a pteridine ring, a glutamate tail, and a connecting bridge (p-aminobenzoic acid, PABA); ② The pteridine ring is embedded in the hydrophobic cavity of the active center, the PABA part interacts with polar residues, and the glutamate tail forms a salt bridge with positively charged residues. ③ The figure annotates multiple H-bonds, π-π interactions, salt bridges, hydrophobic interactions, and solvent exposure sites, indicating a highly ordered binding interface.

Based on these observations, we identified several potential directions for protein structure evolution:

①Ser185 → Thr185: Currently, Ser185 forms a double hydrogen bond with the C8--NH₂ and C6--OH of DHF. Thr has an additional methyl group that enhances hydrophobicity while retaining the OH group. The methyl group can slightly "push" DHF, making it fit better into the catalytic site (entropic preorganization), thereby achieving better hydrogen bond geometry (due to restricted side chain rotation)

②Glu246 → Gln246: Currently, Glu246 is located near the Met20 loop (which opens and closes during the DHFR catalytic cycle); its negative charge may repel Arg244, affecting its orientation. Gln246 can eliminate the negative charge, reducing electrostatic repulsion with Arg244. It retains polarity and maintains the local hydrogen bond network. This could lead to more stable closure of the Met20 loop → increased lifetime of the catalytic intermediate.

GCH1

Figure 2.25：Molecular dynamics simulation of the predicted protein–small molecule complex for Glyma.18G158900(CGH1)

Figure 2.26：Binding site analysis from molecular dynamics simulation of the predicted protein–small molecule complex for Glyma.18G158900(CGH1)

We selected one candidate gene from the GCH1 category, Glyma.18G158900, to represent the binding situation of the enzyme GCH1 with its small molecule ligand guanosine triphosphate (GTP).

Overall observations: ① Stabilization of the phosphate group: Achieved through a network of polar residues (such as Ser, Thr, His) and water molecules to stabilize the high-energy phosphate bond; ② Recognition of the purine ring: Mainly relies on π-π stacking interactions provided by aromatic amino acid side chains (Phe, Trp) and some key hydrogen bond donors (Pro); ③ Positioning of the sugar ring: Achieved through specific hydrogen bond patterns and hydrophobic interactions.

Based on these observations, we identified a potential direction for protein structure evolution:

Cys115 → Ser115: Cys115 has a weak S···O interaction with the ribose O3', but its thiol group has low polarity and poor hydrogen bonding capability, which may limit the rigidity of ribose orientation. Introducing Ser115 retains a similar size, but the --OH group can form stronger hydrogen bonds (both donor and acceptor), thereby enhancing substrate specificity and improving ribose recognition.

HPPK

Figure 2.27：Molecular dynamics simulation of the predicted protein–small molecule complex for Glyma.02G120400(HPPK)

Figure 2.28：Binding site analysis from molecular dynamics simulation of the predicted protein–small molecule complex for Glyma.02G120400(HPPK)

We selected one candidate gene from the HPPK category, Glyma.02G120400, to represent the binding situation of the enzyme HPPK with its small molecule ligand 6-hydroxymethyl-7,8-dihydropterin (HMDHP).

Overall observations: ① HMDHP is located in a deep and narrow hydrophobic pocket formed by multiple α-helices and β-sheets. The protein backbone forms a "clamp" - like structure that tightly envelops the substrate, especially the region of Cys85--Trp409--Ala408--Arg410, which constitutes the core recognition interface. ② The pocket has a distinct polar-nonpolar partition: Polar region: Near the N-terminal amino (NH₂) and hydroxyl (OH) groups, dominated by polar residues such as Arg410, His449, and Glu402; Nonpolar region: Surrounding the pterin ring and methyl side chain, composed of hydrophobic residues such as Val398, Ile447, Leu446, and Met225.

Based on these observations, we identified a potential direction for protein structure evolution:

Arg410 → Lys410: Currently, Arg410 forms a major hydrogen bond/salt bridge with the N1--NH₂ of the substrate, serving as an important anchor point. Mutating Arg410 to Lys would retain the positive charge characteristic while reducing the steric hindrance of the guanidinium group, potentially enhancing binding affinity or providing more flexible modes of interaction.

2.6 Results Analysis

2.6.1 Target Gene Selection Results

The most recommended list of genes is presented below:

Table 2.4：Recommended genes for expression analysis—gene names and types

Integrating the binding energy from Interformer and the RMSD and RMSF results from the molecular dynamics simulations, we ultimately selected the following genes for the wet lab team to conduct pre-experiments and to validate the modeling effects.

2.6.2 Wet Lab Results Validation

Table 2.5：Wet-lab sequencing results of gene expression levels across different tissues and developmental stages

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

To identify the final target genes from a broader range and to validate the reliability of the dry lab modeling, the wet lab conducted expression tests for 11 candidate genes across different developmental stages of soybean. In addition to the 9 candidate genes recommended by the dry lab, two extra genes were included.

Ultimately, through Flate expression tests, the wet lab measured the expression levels of the candidate genes in various parts of the soybean plants at different stages. The following genes were finally selected as our target genes:

Glyma.10G211600(ADCS), Glyma.06G010600(DHFR), Glyma.18G158900(CGH1), Glyma.02G120400(HPPK)

All four of these genes were included in the list of nine suitable candidates recommended for expression testing by the modeling, thereby confirming the reliability of the model.

Although the results did not save time and effort for the wet lab in this instance, we believe that thismodeling-based candidate gene screening approach will help future iGEM participants and other researchers save time, effort, and resources by narrowing down the targets for gene expression pre - experiments.

2.7 Contributions

①We provided a novel AI - driven model for selecting homologous genes. Building on the initial selection based on sequence similarity, we further analyzed the feasibility of candidate genes from the perspective of their protein structures. This offers a new, efficient, and accurate AI - based solution for similar studies that require candidate gene selection.

②We introduced the emerging Graph - Transformer - based blind docking model Interformer for the first time and validated its high efficiency and accuracy. Unlike traditional docking models that require the determination of pocket space, this model simplifies the docking process through its deep learning - based algorithm. The wet lab results demonstrate that most of the predicted docking outcomes are valid. Additionally, we developed a batch - processing script for the Interformer docking model. We believe this experience can help future iGEM participants and experts simplify operations and save time and effort when conducting protein - small molecule docking in bulk.

③For the four types of enzymes we selected, we conducted in - depth analyses of the binding sites of the proteins with their substrates for the four candidate genes. We hope this can provide a foundation for future iGEM participants or professionals studying folate synthesis pathways or similar metabolic pathways, offering insights into protein - directed evolution.

Model3: Domain-specific LLM with SFT and RAG

3.1 Background and Objectives

As the iGEM competition expands year by year, an increasing number of outstanding experimental protocols have accumulated. Midway through our project, we found that surveying these past, excellent experimental plans was becoming more and more cumbersome. Traditionally, project retrieval relies on checking each team’s wiki one by one. After communicating with several teams, we learned that other iGEMers encountered major obstacles when trying to locate high-quality experimental designs, project inspiration, or any other materials relevant to judging as well.

To solve the problem of searching iGEM experimental designs or project ideas in the same or similar direction, to save the time and energy our own members and other iGEMers spend on searching, and to help our team better understand the core concepts and experimental plans of this project, our goal is to design a domain-specific, fine-tuned Qwen3 large language model (Base & Thinking) for iGEM experimental-plan design that employs a dual strategy of SFT and RAG.

During this design process we also established construction schemes with multiple parameter scales and different model types, as well as ablation experiments on whether to use SFT/RAG/SFT+RAG. We carried out model tests across different parameter scales and model types to identify the parameter size and Fine-Tuned strategy best suited to this project, hoping that these experiences will inspire future iGEMers or professionals to develop fine-tuned LLM for other vertical domains.

3.2 Model Construction

3.2.1 Dataset Preparation

3.2.1.1 Background and Objectives

Dataset quality directly affects the efficacy of model fine-tuning; only high-quality data can yield good fine-tuning results. To enrich the data and give our own members and other iGEMers a better experience, we chose to crawl the wikis of past iGEM teams as our data source.

The goal of this stage is to collect, for the past five years, each team’s WikiKI description, experimental , model and Plant page, together with authoritative synthetic-biology journals and existing domain-specific large language models in synthetic or general biology. These materials are then further human-distilled, with QA pairs generated from iGEM web pages accounting for 85 %, authoritative journals 10 %, and manual distillation 5 %.

3.2.1.2 Data Collection

The iGEM web-page dataset was built in two steps: ① team-name collection and ② team-wiki-page acquisition.

For team-name collection: we started from iGEM’s official award list and, by exploiting its distinctive tags, extracted the names of all gold-medal teams. A manual spot-check of 10 % of the crawl gave 100 % accuracy.

For wiki-page acquisition: because iGEM score pages follow a fixed URL pattern, three nested loops (year / team name / desired wiki section) were sufficient to pull the required pages. A self-written script then converted each page to Markdown and fed it into the open-source dataset builder EasyDataset. EasyDataset splits the text into chunks via similarity algorithms and sends each chunk to a strong LLM to generate question–answer pairs

The gold-team-name crawler and the wiki-section crawler are given below:

The gold-team-name crawler（in iGEM Team Gitlab）
Scrape the names of gold medal-winning teams

wiki-section crawler（in iGEM Team Gitlab）
Scrape the target pages of award-winning teams

Journal Dataset Construction : Through an HP (Human Practices) activity that involved expert interviews, we selected 10 high-impact synthetic-biology review articles (their high information density fixed the journal count at 10 to keep the dataset balanced). Each PDF was converted to Markdown with the open-source tool MinerU and fed into EasyDataset. Because reviews are denser than wiki pages, we shortened the chunk-splitting length so the text was divided more finely before Q–A pairs were generated.

Human-distilled DSLLM Dataset : Many fine-tuned biology LLMs already exist, trained on larger parameters and richer data than ours. To add extra dimensions, domain experts crafted valuable synthetic-biology questions, ran them through these external fine-tuned models, and collected the (question, model answer) pairs—distilling them manually into additional Q–A pairs for our dataset（in iGEM Team Gitlab）.
Qwen3 Dataset

3.2.2 Supervised Fine-Tuning

3.2.2.1 Background and Objectives

SFT (Supervised Fine-Tuning) is a common method of training an already pre-trained model on a specific task to improve its performance on that task. Before starting the fine-tuning work, two things need to be confirmed: ① the choice of the base model ; ② the fine-tuning strategy .

By investigating the score performance of various open-source large models and considering both subjective and objective factors, we decided to choose Qwen3 as our base model. It offers multiple parameter options , supports both non-chain-of-thought and chain-of-thought modes, and has achieved excellent results on various open-source model leaderboards.

Regarding the fine-tuning strategy , taking into account the team's budget and the overall size of the dataset, we decided to use 4-bit quantized QLoRA for partial parameter fine-tuning.

The goal of SFT fine-tuning the LLM is to integrate the professional knowledge in the dataset into the weight matrix of the LLM, making the resulting LLM more professional in its processing. It can provide more professional answers to vertical domain questions, reduce model hallucinations, and enable team members and iGEMers to receive more in-depth and accurate answers to their questions about experimental design.

3.2.2.2 Model Training

The core relies on the fine-tuning operations provided by the open-source framework LLaMAFactory .

The basic steps are ① Environment installation → ② Launch WebUI → ③ Configure training parameters → ④ Train-Export-Test .

Environment installation
Detailed steps and code are given below（in iGEM Team Gitlab）:
LLaMA-Factory Environment Setup Guide

Fonfigure training parameters

Figure 3.1：LLaMa Parameter Design Diagram 1

①Select the type of model to be fine-tuned

② Select the model path; if LLaMA-Factory does not detect the model locally, it will be downloaded from ModelScope

③ Adopt the QLoRA quantized fine-tuning method,with the quantization level set to 4

④Select the dataset (note: the dataset must first be imported into the LLaMA-Factory folder and declared in the data_info.json file)

⑤Number of training epochs set to 4

Figure 3.2 ：LLaMa Parameter Design Diagram 2

① Warm-up steps set to 4 to prevent sharp loss fluctuations or training divergence caused by an overly large initial learning rate.

② An Instruct model is used here, so thinking mode is not enabled; if a Thinking model is chosen and chain-of-thought answers are desired, this option must be selected.

③ Scaling coefficient set to 256

Click “Start” at the bottom to begin training. After training starts, the training log and progress are displayed, and the loss function is shown on the right.

Figure 3.3： Schematic diagram of the LLaMa webpage panel UI in operation

Animated Figure 3.4： Diagram of the LLaMaFactory launch operations

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

Animated Figure 3.5： LLaMaFactory Parameter Design

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

3.2.3 RAG

3.2.3.1 Background and Objectives

RAG (Retrieval-Augmented Generation), or retrieval-enhanced generation, is currently one of the most important and popular architectural paradigms in the field of artificial intelligence, especially in large language model (LLM) applications.

An RAG system consists of two main stages: retrieval and augmented generation. Through these two stages, the model can, on the basis of its powerful LLM capabilities, expand access to domain-specific databases or internal knowledge bases to make up for the general model's shortcomings in vertical domains.

The goal of introducing RAG technology is to make the large model's answers more professional in vertical domains, to further reduce the occurrence of hallucinations, and to improve the quality of information obtained by team members and future iGEMers. At the same time, it also further explores the effects of the new model after the deep integration of SFT and RAG.

3.2.3.2 Model Construction

Figure3.6： RAG Framework Diagram

The RAG pipeline we built includes the following steps: environment preparation, loading the base LLM, loading the embedding model, loading the reranker model, building the vector database, splitting documents, reranking, assembling the full RAG chain and starting the query.

Detailed code files can be found below (#^.^#).

RAG Environment Configuration Guide（in iGEM Team Gitlab）： RAG Environment Configuration Guide

Full RAG Construction Code（in iGEM Team Gitlab）: RAG

Animated Figure 3.7： Diagram of RAG construction operations

You can view Animated image details by: ① right-clicking the image → ② selecting "Open image in new tab".

3.3 Results Analysis

3.3.1 Evaluation Plan

Since this LLM is grounded exclusively in past gold-medal iGEM teams and authoritative review journals, its knowledge domain is extremely vertical; no open-source benchmark can objectively rate its performance. We therefore evaluate quality subjectively by collecting user feedback, organized into two parts: ① Question Design ;②Scoring Criteria.

Question Design: based on typical user goals, we prepared two categories:

(1) synthetic-biology experimental-protocol design (primary)

(2) iGEM competition-related queries

The detailed question list is as follows.

Table 3.1： List of questions

Scoring Criteria: the total score is 100 points. Each dimension is assigned a different weight according to question type, so that the model’s capability in handling different kinds of queries is reflected fairly. Based on a literature survey, we devised the detailed scoring criteria shown in the table below.

Table 3.2： LLM evaluation table

3.3.2 Analysis of Results from Different Fine-Tuning Strategies

Figure 3.8： Violin plot of different fine-tuning strategies distributed across five dimensions

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

Figure 3.9： Violin plot of different fine-tuning strategies distributed by total score

Scientific Accuracy

In the dimension of scientific accuracy, Qwen3 RAG+SFT achieved the highest median score of ≈8.5. Its distribution is tightly concentrated between 8 and 9 with a narrow IQR, indicating stable, low-variance performance; almost no outliers fall below 7, demonstrating strong factual consistency. Qwen3 SFT yielded a median of ≈7.0, a much wider spread, a lower quartile close to 6, and multiple outliers below 5, revealing noticeable scientific errors or hallucinations. Qwen3 RAG alone registered a median of ≈8.0, better than SFT but slightly right-skewed and less dense in the high-score region than RAG+SFT.

Feasibility

For feasibility, Qwen3 RAG was the most robust: median ≈8.5, extremely narrow IQR, nearly all samples lying in the 8–9 band with no significant low-score outliers, showing high reliability in “can the task be accomplished?”. Qwen3 RAG+SFT had a median slightly lower at ≈8.0, a somewhat wider spread, lower quartile around 7, and occasional scores down to 5–6, suggesting that in complex or ambiguous tasks the hybrid strategy can fail when retrieval and generation are not well aligned. Qwen3 SFT produced a median of ≈7.5 and a bimodal trend: some samples performed well (8–9), but a substantial proportion fell below 6, giving high overall variance.

Completeness

On completeness, Qwen3 RAG again showed high consistency: median ≈8.5, a sharp peak with >90 % of ratings in the 8–9 interval and almost no low-score tail, confirming that reliance on an external knowledge base effectively guarantees comprehensive information coverage. Qwen3 RAG+SFT had a median of ≈8.0 and a slightly wider spread, with some scores around 7, implying that in certain contexts the generation module did not fully integrate retrieved content, leading to omissions. Qwen3 SFT attained a median of only ≈7.0, a visibly left-skewed distribution, lower quartile near 6, and multiple outliers below 5, indicating that without external knowledge it tends to miss key details.

iGEM Compatibility

Regarding adaptation to the iGEM synthetic-biology competition, Qwen3 RAG+SFT was the strongest: median ≈8.5, highly concentrated distribution, pronounced density peak, and very few samples below 7, demonstrating an effective combination of domain knowledge and language modeling that meets iGEM project requirements. Qwen3 RAG alone gave a median of ≈8.0, symmetric but with a thinner high-score tail, probably limited by less flexible language expression. Qwen3 SFT showed a median of ≈7.5 and a broad spread: some samples scored high (close to 9), but a considerable fraction fell below 7, reflecting limited generalization on specialized tasks.

Clarity of Expression

Clarity was the only dimension where Qwen3 SFT clearly led: median ≈8.0, compact distribution with a pronounced density peak in the 8–9 band, yielding fluent and logically coherent text. Qwen3 RAG+SFT had a slightly lower median of ≈7.5 and a wider spread; occasional expression breaks caused by awkward stitching of retrieved fragments concentrated scores in the 7–8 range. Qwen3 RAG performed weakest, median ≈7.0, right-skewed distribution, and numerous samples below 6; typical issues included stiff sentences, repetition, or visible splice marks that harmed readability.

Total Score

Qwen3 SFT: median ≈8.0, bimodal trend—one cluster in 7–9, another with many samples below 6, narrow mid-region, elongated tails, several outliers <5, indicating pronounced polarization and occasionally very poor answers.

Qwen3 RAG+SFT: median ≈8.5, near-bell shape concentrated in 8–10, slightly peaked top, widest central region, showing most samples at medium-high level; only a few scores <7 and no extreme lows.

Qwen3 RAG: median ≈7.5, wide base narrowing upward, “inverted-triangle” pattern, main mass 6–8,sparse upper portion, very few >9, no extreme lows but limited ceiling.

Conclusion

Across the five sub-dimensions and the weighted total, RAG+SFT dominates most key metrics, especially scientific accuracy and domain compatibility, making it the only strategy that reliably produces scores above 9. SFT, despite better fluency, suffers obvious shortcomings in core knowledge dimensions that drag down its total. RAG alone is constrained by expression issues that cap its maximum achievable score.

3.3.3 Analysis of Results from Different Parameter Sizes

Figure 3.10： Violin plot of models with different parameter sizes distributed across five dimensions

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

Figure 3.11： Violin plot of models with different parameter sizes distributed by total score

Scientific Accuracy

Qwen3 4B and Qwen3 8B show the highest medians (≈9) with tight distributions. Qwen3 1.7B has a slightly lower median but wider spread, indicating some samples perform well. Qwen3 0.6B and Qwen3 14B are weaker, especially 14B, which has many low-score points. Conclusion: 4B and 8B are best for scientific accuracy; 14B may suffer from over-fitting or reduced generalization.

Feasibility

All models cluster around 8–9 points. Qwen3 4B and Qwen3 8B have the narrowest, most consistent distributions. Qwen3 14B is wider with clear low-score points (<5), implying occasional poor feasibility. Conclusion: 4B and 8B are most stable; 14B is powerful but erratic.

Completeness

Qwen3 4B and Qwen3 8B again peak near 9 with tight spreads. Qwen3 0.6B and 1.7B are markedly lower, 1.7B showing many low scores. Qwen3 14B has a high median but a low floor, risking incomplete outputs. Conclusion: 4B and 8B are best for completeness; small models lose information.

iGEM Fitness

Qwen3 4B has the highest median (≈9) and most peaked distribution. Qwen3 8B is strong but slightly behind. Qwen3 0.6B and 1.7B medians sit below 8. Qwen3 14B is wide and unstable despite an acceptable median. Conclusion: 4B is the most iGEM-fit model, likely task-optimized.

Clarity of Expression

Qwen3 4B and Qwen3 8B medians top ≈9 with tight spreads. Qwen3 1.7B is decent yet has many low-score points. Qwen3 0.6B is wide and low, often incoherent. Qwen3 14B remains high-median but produces more vague phrasing. Conclusion: 4B and 8B are clearest; small models organize language poorly.

Total Score

Qwen3 4B is the only “spike” distribution: narrow box, highest median (≈8.4), no extreme lows, dense 8–9 scores—stable and reliable. Qwen3 1.7B and 0.6B are small yet promising; 1.7B’s median edges above 0.6B and reaches near-10 samples. Qwen3 8B and 14B fall into the “large-model trap”: bigger parameters, same median (~8.0), wider spread, more 5–6 scores.

Conclusion

Qwen3 4B posts the highest composite score (median ≈8.4), tightest spread, strongest stability, noweak dimension—optimal for broad deployment.
Qwen3 8B and 14B illustrate that larger parameter counts do not guarantee better performance; added complexity breeds instability—a “scale paradox ” demanding careful fine-tuning or system-level optimization.
Qwen3 1.7B offers high cost-effectiveness, ideal for edge or mobile scenarios; paired with RAG or prompt engineering it can climb further.

Qwen3 0.6B suits only ultra-simple tasks, fluctuates wildly, and is unfit for complex reasoning.

3.4 Exploration of the Fine-Tuned Qwen3 Thinking-14B Model

3.4.1 Model Training

Figure 3.12： Schematic diagram of fine-tuning training parameter modifications for the Qwen3-Thinking model

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

Similar to the basic operation of SFT, the following differences need to be noted:

① Regarding the dataset: a dataset with a chain of thought is required (in EasyDataset, select a model with a chain of thought to generate the Q&A pairs).

② Choose a model that supports the thinking mode (with the suffix “thinking”, e.g., Qwen3-14B-thinking).

③ Llama-factory parameter selection: in the second step of the fine-tuning process shown below, check “enable thinking”.

3.4.2 Model Usage Results

Animated Figure 3.13： Effectiveness illustration of the Qwen3-Thinking model usage

You can view Animated image details by: ① right-clicking the image → ② selecting "Open image in new tab".

After a quick 10-question test , we found that the reasoning capability doesn’t fundamentally change the model’s output. Moreover, preparing and training on a “thinking” dataset is computationally and financially expensive .

Therefore, for small-scale, domain-specific tasks, fine-tuning a reasoning model is usually NOT cost-effective . Nevertheless, reasoning models do make the black-box nature of large models more transparent and can reduce hallucinations to some extent .

3.5 Contribution

① By additionally introducing an artificial-distillation step during dataset construction, we enriched the dataset variety and improved model performance, offering future iGEMers the insight that model development can be driven by greater data richness. Additionally, we have released the dataset used for our model training at the following link（in iGEM Team Gitlab）.
Qwen3 Dataset

②Under the small-domain setting of iGEM experimental-design and competition, we systematically explored the optimal configuration of large models along the two axes—parameter count and fine-tuning strategy identified “ Qwen3-4B + RAG + SFT ” as the best balance between computational cost and answer quality, providing a reference for follow-up domain-specific model work that must trade off performance and efficiency.

③We also probed the use of Thinking-type models in the iGEM context, showing future teams that refining reasoning models is another promising direction for building deeper, domain-tailored iGEM large models.

Model4: Thermodynamic Modeling of High-Folate Soybean Application Conditions

4.1 Background and Objectives

Soybeans usually often enter the consumer market in the form of soy products. During the processing of soybeans into soy products, there is a step that affects folate content, namely heating .

Currently, the thermal stability of the common folate structure, folic acid, has been extensively studied. However, for the active folate 5-methyltetrahydrofolic acid involved in this project, research is not as in-depth . There are reports indicating that the structure of active folate is generally unstable, but there is no specific research on its thermal stability under common heating conditions (80°C to 100°C).

The goal of this model is to further understand the stability of active folate under heating treatment through molecular dynamics modeling, which will help optimize the processing conditions for high-folate soybeans, thereby facilitating the better promotion of the project and filling the research gap on the thermal stability of 5-methyltetrahydrofolic acid under common heating conditions.

4.2 Model Construction

To explore the behavior of 5-methyltetrahydrofolic acid under common processing temperatures from a structural perspective, we employed Gromacs to build a small-molecule dynamics simulation model. We observed its structural state over a 60ns simulation period at temperatures of 353K and 363K. The main steps of the dynamics simulation are as follows:

Structure preprocessing: Input the ligand file;
Topology generation: The topology of the small molecule is generated using SwissParam;
System assembly: Construct a water box and add ions to neutralize the system;
Energy minimization: Eliminate unreasonable contacts;
Equilibration phase: First NVT (constant temperature and volume), then NPT (constant temperature and pressure);
Production molecular dynamics simulation: Generate trajectories for analysis.

The detailed simulation code is shown below（in iGEM Team Gitlab）:
Small-molecule ligand molecular dynamics simulation code

4.3 Results Analysis

Figure 4.1: RMSD results at 353K and 363K

The Root Mean Square Deviation (RMSD) (nm) indicates the deviation of the folate small molecule (LIG) from its initial conformation. A lower RMSD suggests a more stable structure; larger fluctuations indicate more conformational changes, which may imply higher flexibility or instability.Overall, the red line (353K) is relatively stable with small fluctuations. The green line (363K) shows significantly increased fluctuations, especially with multiple peaks in the middle and later stages.

Under the condition of 353K , after approximately 2000 ps, the RMSD enters a stable plateau, maintaining within the range of ~0.22–0.24 nm. Although there are minor oscillations, there is no obvious drift or continuous upward trend, indicating that the system has reached conformational equilibrium.

This indicates that folate exhibits good conformational stability at 353K, with no significant structural disintegration or conformational rearrangement. Under this temperature, molecular motion is within a controllable range, and hydrogen bond networks and hydrophobic interactions are relatively intact.

Under the condition of 363K , there is already high fluctuation in the initial stage (<10,000 ps). Although it briefly tends to stabilize, there are multiple peaks >0.4 nm between 30,000 and 40,000 ps. Finally, at 55,000 ps, a significant jump occurs, with the RMSD approaching 0.45 nm, much higher than the level at 353K.

This indicates that folate exhibits poor thermal stability at 363K, with frequent large-scale conformational changes. There is non-equilibrium behavior, which may be close to or entering a " thermal perturbation" state, making it unsuitable for precise structural and functional analysis.

Based on the above analysis ,the processing conditions for high-folate soybeans should avoid prolonged heat treatment above 353K (80°C). Instead, it is recommended to use low-temperature processing technologies such as cold pressing for oil (to preserve folate in the meal),low-temperature spray drying (for soy flour and protein powder),high-pressure processing (HPP),or pulsed electric field (PEF) non-thermal sterilization technologies.These methods can help retain the active folate nutrients in high-folate soybeans as much as possible. This provides more theoretical support for the actual commercialization of the project.

Animated Figure 4.2: 3D visualization of molecular dynamics simulation at 353K

You can view image details by: ① right-clicking the image → ② selecting "Open image in new tab".

Animated Figure 4.3: 3D visualization of molecular dynamics simulation at 363K

4.4 Contributions

This model analyzes the thermodynamic properties of 5-methyltetrahydrofolic acid from the perspective of molecular dynamics , concluding that it remains relatively stable around 80°C. This can provide certain technological guidance for future projects that produce high-folate foods using high-folate crops (with 5-methyltetrahydrofolic acid as the key component).