Models Overview
To optimize our multi-step detection platform, we developed ordinary differential equation models for proximity-dependent ligation (PDL), recombinase polymerase amplification (RPA), CRISPR-Cas12a (Cas12a), and CRISPR-interference (CRISPRi) using MATLAB SimBiology. The models employed mass action kinetics for molecular interactions, Michaelis-Menten equations for enzyme catalysis, and Hill equations for cooperative binding. With guidance from Dr. Mark Styczynski at the Georgia Institute of Technology, we refined model parameters through iterative experimental validation.
Each model served distinct purposes: PDL simulations identified optimal aptamer concentrations for sufficient double-stranded (dsDNA) generation (see PDL Modeling);RPA modeling predicted amplification kinetics and guided primer optimization (see RPA Modeling);Cas12a models characterized trans-cleavage dynamics and fluorescence timescales (see Cas12a Modeling); and CRISPRi simulations optimized repression parameters for individual targets Bb0250 and Bb0841, and multiplexed repression scenarios (see CRISPRi Modeling). This modeling framework enabled systematic optimization across all detection stages and provided predictive capabilities for future refinement. Current methods for tracking Lyme disease lack predictive capability for identifying future outbreak hotspots. We developed geospatial models integrating historical case trends, socioeconomic factors, and environmental variables to predict exposure risk and case prevalence at the county and subnational level. Using multiple linear regression (LSR), we generated risk intensity heatmaps for U.S. counties and European regions, visualizing current spatial risk patterns. To forecast future disease burden, we implemented an XGBoost model predicting case counts through 2050 at decadal intervals. Cross-validation and hold-out testing confirmed model reliability. These predictive tools enable healthcare providers to target allocation of LANCET diagnostic and therapeutic resources to regions facing the greatest Lyme disease burden (see HP Modeling).
Judging
- 
What kind of modeling is being done and what information will it provide? 
 Lambert iGEM created multiple sets of mathematical models using Ordinary Differential Equations (ODEs) to predict the kinetic changes of several important wetlab reactions. These models describe how the concentrations of enzymes, substrates, and products change over time by applying principles of Michaelis-Menten equations, mass-action kinetics, and Hill equations to represent different rates of biochemical reactions (see PDL Model Development);(see RPA Model Development);(see Cas12a Model Development);(see CRISPRi Model Development). Lambert iGEM also utilized a least square regression (LSR) to calculate the risk intensity of Lyme disease at the county level and created a deep learning XG Boost model to predict future cases in 2025, 2030, 2040, and 2050 (see HP Model Development).
- 
What kind of data was used to build/assess the model? 
 o build the models, we used data from different wetlab reaction concentrations found in their respective wetlab design manuals and literature. Parameter values used in the models were gathered from relevant research papers or calculated from our own wetlab reaction values. In particular, we compared these parameter values to our experimental data, allowing us to fit the parameters to be more specific to our wet lab conditions (see PDL Model);(see RPA Model);(see Cas12a Model);(see CRISPRi Model Development). For the deep learning model, we utilized open-source government datasets of historical Lyme disease trends, environmental, and socioeconomic data, scaled at the United States county level and European subnational level (see HP Model).
- 
What assumptions were made and why? 
 Our ODE models relied on several key assumptions to ensure computational efficiency and compatibility within MATLAB. First, we assumed rate constants to remain stable throughout our reactions. This assumption is justified because our in vitro experimentation occurs under controlled conditions without significant environmental fluctuations that could alter kinetic parameters. Second, the model assumes precise knowledge of all initial species concentrations, which were determined experimentally or obtained from past literature. Third, we assume that polymerase enzymes, such as RNA and DNA polymerases, maintain consistent synthesis rates regardless of target sequences. This simplification allows computationally feasible modeling while capturing the essential reaction dynamics, though sequence-dependent effects on polymerase processivity are not explicitly represented (see PDL Model Assumptions);(see RPA Model Assumptions);(see Cas12a Model Assumptions);(see CRISPRi Model Assumptions). Our LSR heatmap assumes socio-environmental factors remain stable over time and that linear relationships exist between standardized predictors and per-capita cases, enabling interpretable risk assessment. The XGBoost model assumes 1 to 2 year lag features adequately capture disease dynamics, that optimal hyperparameters and patterns learned from 2010 to 2023 data generalize to future decades, and that the CDC’s underreporting factor remains constant. These assumptions balance model tractability with predictive capability while acknowledging potential deviations in future conditions (see Heatmap Model Assumptions);(see XGBoost Model Assumptions).
- 
How did the model results affect the project design and development? 
 The model identified specific reagent concentration limitations that prevented certain assays from achieving sufficient product yields. We adjusted experimental conditions for PDL based on our model’s predictions, specifically increasing aptamer concentrations which resulted in drastically improved amplification outcomes. Additionally, our models provided predictive baselines for RPA, Cas12a, and CRISPRi experiments, which guided wetlab experimental design. After running our experiments, we used an Analysis of Variance model coupled with a Tukey Post-Hoc test to verify the statistical significance of our results(see PDL Model Results);(see RPA Model Results);(see Cas12a Model Results);(see CRISPRi Model Results). This modeling-experimental feedback loop enabled systematic optimization by aligning theoretical predictions with practical constraints. Our predictive models will directly inform LANCET implementation strategy. By identifying counties at highest current risk and forecasting future hotspots through 2050, we can target deployment to parks and public health agencies in high-burden regions. These risk assessments will guide decisions about where LANCET diagnostics would have maximum impact, enabling proactive resource allocation to areas with growing Lyme disease prevalence before outbreaks escalate (see HP Model).
PDL
Overview
To develop a rapid blood-based diagnostic for Lyme Disease, Lambert iGEM is using a PDL assay that detects bacterial protein biomarkers using high-affinity aptamers. This reaction follows 4 main steps:
- Binding: Two single stranded DNA (ssDNA) aptamers selectively bind to the surface of the CspZ target protein (Guérin et al., 2025)
- Hybridization: A complementary ssDNA bridge simultaneously binds to both aptamers (Fredriksson et al., 2002)
- Ligation: T4 DNA ligase joins together the two aptamer constructs and ssDNA bridge sequence (Fredriksson et al., 2002)
- Extension: T4 DNA polymerase initiates DNA synthesis, producing a double stranded final product (TakaraBio, 2025)
Development
Because the CspZ protein remains even after B. Burgdorferi leaves the bloodstream, we needed to determine how long the protein remains detectable for our assay. To address this, we developed a deterministic kinetic model to track protein concentration over time. We used MATLAB to predict protein concentration over 300 days using mass action kinetics, which describe the rate of molecular interactions in solution. To optimize the reaction process of the PDL pathway, our team utilized various ODEs to simulate the progression of the assay. Our model uses initial target protein and aptamer concentration into account to predict DNA product yield, guiding data analysis and informing wetlab workflow. After using our model to guide experimentation, we refined model parameters based on experimental results, improving its predictive capability for future research.
Signalling Pathway
By utilizing MATLAB’s built-in SimBiology software, we were able to map out the full PDL pathway using the Law of Mass Action and Michaelis-Menten equations (see Fig. 1). Michaelis-Menten equations are used to map out reactions involving enzyme kinetics, and Mass-Action kinetics simulate basic collisions of reactants.

Biochemical Reactions
The PDL model contains 6 core biochemical reactions as shown below (see Table 1):
| # | Reaction | Description | 
|---|---|---|
| 1 | P + A1 ↔ [P-A1] | Aptamer 1 binds to the CspZ protein, creating an Aptamer 1–Protein complex. | 
| 2 | [P-A1] + A2 ↔ [P-A] | Aptamer 2 binds to the Aptamer1–Protein complex, creating a Protein–Aptamer complex. | 
| 3 | [P-A] + B ↔ [P-A-B] | The Bridge binds to the Protein–Aptamer complex, creating a Protein–Aptamer–Bridge complex. | 
| 4 | L + [P-A-B] → X + [Free-P] + [Free-L] | Ligase connects the Aptamers and Bridge sequences together on the Protein–Aptamer–Bridge complex, creating the PDL Product. | 
| 5 | P → null | The CspZ protein degrades over time. | 
| 6 | X rharr; null | The PDL Product degrades over time. | 
Initial Values of Species
To simulate our reactions, we initialized each component of our model by using initial values for each species that were derived from the wetlab protocols for PDL (see Table 2; see Experiments).
| # | Name | Value | Units | 
|---|---|---|---|
| 1 | P | 2.1000e6 | molarity | 
| 2 | A1 | 2.0000e-8 | molarity | 
| 3 | P-A1 | 0 | molarity | 
| 4 | A2 | 2.0000e-8 | molarity | 
| 5 | P-A | 0 | molarity | 
| 6 | L | 8.3000e-15 | molarity | 
| 7 | Free-P | 0 | molarity | 
| 8 | Free-L | 0 | molarity | 
| 9 | B | 4.0000e-7 | molarity | 
| 10 | P-A-B | 0 | molarity | 
| 11 | X | 0 | molarity | 
Parameters
Rate constants and degradation rates were inputted into our Mass Action and Michaelis-Menten equations. These parameters were either sourced from existing literature or estimated through experimentation when values could not be found. The value for each parameter is listed below (see Table 3):
| # | Name | Reaction | Value | Units | Resource | 
|---|---|---|---|---|---|
| 1 | kf_PA1 | Forward rate constant for Aptamer 1 binding to Protein | 1.0000e6 | liter/mole/second | estimated | 
| 2 | kf_PA | Forward rate constant for Aptamer 2 binding to the [P-A1] complex | 1.0000e6 | liter/mole/second | estimated | 
| 3 | k_P | First-order degradation rate of the CspZ protein | 1.2000e-7 | 1/second | estimated | 
| 4 | kf_X | First-order degradation rate of PDL product | 1.0000e-5 | 1/second | estimated | 
| 5 | Vm_PAML | Maximum velocity of Ligase-catalyzed product formation | 3.3000e-15 | molarity/second | (Alkhamis et al. 3) | 
| 6 | Km_PABL | Michaelis-Menten constant for Ligase-catalyzed reaction | 2.5000e-9 | molarity | estimated | 
| 7 | kf_PAB | Forward rate constant for Bridge binding to Protein-Aptamer complex | 1.0000e6 | liter/mole/second | estimated | 
| 8 | kr_PA1 | Reverse rate constant for dissociation of Aptamer 1-Protein complex | 0.2082 | 1/second | (Alkhamis et al. 3) | 
| 9 | kr_PA | Reverse rate constant for dissociation of Aptamer 2 from Protein-Aptamer complex | 0.2082 | 1/second | (Alkhamis et al. 3) | 
Ordinary Differential Equations
We incorporated the Mass Action Kinetic Laws and Michaelis-Menten Equations to represent the reactions for the PDL model. We generated 11 ordinary differential equations for the full ligation pathway, as shown below (see Table 4):
| # | Ordinary Differential Equations | 
|---|---|
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | 
Assumptions
Our PDL ODE model assumes that all rate constants and parameters remain constant, disregarding external environmental factors. Thus, the model replicates only uses the aptamer-protein binding process to simulate dsDNA product formation. The model also assumes that the total initial quantities of CspZ protein, aptamers, ligase, and bridge are known.
Results
CspZ Protein Degradation
Using MATLAB, we modeled CspZ protein concentration over time. We estimated a half-life of 14 days based on published values for bacterial outer-surface proteins and simulated the corresponding degradation curve (See Fig. 2).

Our model predicted protein concentration in nanomolar (nM) units over time. To compare with our PDL assay threshold of 24,000 molecules/microliter, we converted the curve to equivalent molecular units (see Fig. 3). Based on our model predictions and reported PDL thresholds in literature, we anticipated assay viability up to 86 days post-infection (pi). Experimental validation, however, revealed sustained detectability to at least 250 days pi (see PDL Results)

PDL
After developing a complete set of ODEs, we utilized Simbiology’s Model Analyzer to simulate the quantity of DNA product over 75 minutes with 20 picomolar of aptamer initially (see Fig. 4). The PDL product increases linearly over time, showing an end result of ~0.6 copies.

Because our model predicted fewer than one copy of dsDNA product, indicating limited assay sensitivity, we performed a sensitivity analysis to identify critical concentrations (Fig. 5). This analysis revealed that dsDNA yield was most sensitive to aptamer concentration.

Based on this insight, we tested increasing aptamer concentrations in our model and found that raising the concentration from 20 pM to 20 nM would yield 1.01 × 10^-14 M dsDNA (~800 copies), sufficient for downstream amplification (See Fig. 6).

Experimental results confirmed that higher aptamer concentrations successfully produced detectable PCR products (See Fig. 7), demonstrating that our initial conditions had generated insufficient PDL product for downstream amplification, consistent with ODE model predictions.

RPA
Overview
RPA is used in Lambert iGEM’s diagnostic system to rapidly amplify the DNA produced by PDL, which helps with detection of Lyme-associated biomarkers. We modeled the RPA reaction to evaluate how amplified DNA yield can change over time according to varying template concentration and limited biochemical resources. This takes place with four main steps:
- Binding: Recombinase proteins bind to forward and backward primers, creating a recombinase-primer complex.
- Strand Displacement: The recombinase–primer complex locates and invades the double stranded DNA, the primers displace their respective strands and pair with their complementary targets. SSBPs (single-stranded binding proteins) stabilize the displaced strands.
- Extension: DNA polymerase extends the bound primer, synthesizing new DNA without the need for thermal cycling.
- Amplification: The newly formed DNA strands act as templates for additional recombinase–primer complexes. This process repeats, producing exponential amplification of the target sequence in minutes.
Development
To understand the reaction process of the RPA pathway, our team utilized a variety of ODEs to simulate the progression of the system. Our model takes the initial PDL dsDNA product concentration, primer concentrations, and recombinase protein concentration to predict the amount of amplified DNA over the entire duration of the RPA reaction (See Fig. 8). Through iterative cycles of model-guided experimentation and parameter refinement based on experimental data, we improved the model’s predictive accuracy for future applications.
Signalling Pathway
By utilizing MATLAB’s built-in SimBiology software, we were able to map out the full RPA pathway using the Law of Mass Action and Michaelis-Menten equations. Michaelis-Menten equations are used to map out reactions involving enzyme kinetics, while Mass Action Kinetics are used to diagram the collision of reactants.

Biochemical Reactions
The model contains a total of six biochemical reactions for the RPA pathway (see Table 5):
| # | Reaction | Description | 
|---|---|---|
| 1 | Rec + Prime → RecP | Recombinase protein and Primer form Recombinase Primer Complex. | 
| 2 | RecP + PDL_product → DNA_open | Recombinase Primer Complex binds to the PDL Product, creating an open strand of DNA. | 
| 3 | DNA_open + SSB → DNA_stab | SSBs bind to the open strand of DNA and stabilize the DNA. | 
| 4 | DNA_stab + DNA_poly → DNA_amp + PDL_product + PDL_product_copy | DNA Polymerase binds to the stabilized strand of DNA, generating amplified DNA from the PDL Product. | 
| 5 | PDL_product_copy → PDL_product | The PDL Product becomes the new template, and gets amplified further. | 
| 6 | DNA_amp → null | Amplified DNA degrades over time. | 
Initial Values of Species
In order to simulate our reactions, our team needed to input quantities to initialize each component in our model (See Table 6). The initial value of each species was obtained from RPA wetlab protocols (see Experiments; (see PDL).
| # | Name | Value | Units | 
|---|---|---|---|
| 1 | rpa_system | 1 | liter | 
| 2 | Rec | 1.0000e-4 | molarity | 
| 3 | Prime | 1.0000e-4 | molarity | 
| 4 | RecP | 1.0000e-4 | molarity | 
| 5 | PDL_product | 3.5200e-16 | molarity | 
| 6 | DNA_open | 0 | molarity | 
| 7 | SSB | 3.0000e-6 | molarity | 
| 8 | DNA_stab | 0 | molarity | 
| 9 | DNA_poly | 1.5000e-6 | molarity | 
| 10 | DNA_amp | 0 | molarity | 
| 11 | PDL_product_copy | 0 | molarity | 
Parameters
Rate constants and degradation rates are required as input in our Mass Action Laws and Michaelis-Menten Equations. These values were derived from literature or estimated if the values could not be found. The value of each parameter is as shown (see Table 7):
| Name | Reaction | Value | Units | Resource | 
|---|---|---|---|---|
| k_RecP | Rate at which recombinase proteins and primers form a recombinase primer complex. | 1.000000e6 | liter/mole/second | (K.M. Juma, T. Takita, K. Ito et al., 197) | 
| k_prime | Rate at which recombinase primer complex binds to the PDL product, creating an open strand of DNA. | 1.000000e7 | liter/mole/second | (K.M. Juma, T. Takita, K. Ito et al., 197) | 
| k_ssb | Rate at which SSB proteins bind to the open strand of DNA and stabilize the DNA. | 4.000000e7 | liter/mole/second | (K.M. Juma, T. Takita, K. Ito et al., 197) | 
| Vm_poly | Maximal velocity of DNA Polymerase binding to the stabilized strand of DNA, generating amplified DNA from the PDL Product. | 6.670e-7 | molarity/second | (K.M. Juma, T. Takita, K. Ito et al., 198) | 
| Km_poly | Michaelis-Menten constant for DNA polymerase binding | 1.4500e-6 | molarity | (K.M. Juma, T. Takita, K. Ito et al., 198) | 
| k_dna_deg | Rate at which DNA degrades | 0.00001 | 1/second | (Bhoyar et al., 2024) | 
ODEs
We used the Mass Action Kinetic Laws and Michaelis-Menten Equations to represent different reactions for the RPA model. We generated five ordinary differential equations for the full RPA system (see Table 8):
| # | Ordinary Differential Equation | 
|---|---|
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | 
Assumptions
Our RPA ODE model assumes rate constants as continuous, as there are no environmental factors that can influence the reaction. Therefore, the model is able to isolate the RPA process to simulate the amplification rate over time. The model also keeps the initial concentrations of PDL product, primers, recombinase proteins, and polymerase constant over time.
Results
RPA
We used MATLAB’s SimBiology software to construct an ODE-based model of the RPA pathway, predicting the DNA amplification dynamics from the PDL dsDNA product. Our model ran with an initial concentration of 1.01E-14 molarity (M), over 25 minutes. Our graph shows an exponential increase of PDL product over time (See Fig. 9).

After creating our initial model, we consulted Dr. Styzynski from Georgia Institute of Technology, and after examining our model, he noted that our output was off by a factor of two. After accounting for this, we were able to achieve a final concentration of 3.18E-12M (see Fig. 10), which correlates to 1.92E+12 copies.

Densities for our experimental RPA reactions were quantified using the ImageJ software, normalized to the negative controls and standardized against the DNA ladder. By using bands on the ladder of known nanogram weight, the standardized integrated densities of samples can be converted to copies of DNA produced (see Fig. 11).

Experimental validation using the RPA target construct and 2-day post-infection PDL samples showed that DNA yields remained within the model’s predicted range, though consistently toward the lower boundary (see Fig. 11). The model’s tendency to predict higher yields is expected given that it represents idealized reaction conditions, whereas experimental systems are subject to inefficiencies and variability not captured in the simplified kinetic framework.
Statistical Analysis
ANOVA
In addition to our ODE model, we conducted a significance test using RStudio – a statistical analysis platform built on R – to validate the statistical significance of our results with RPA. Specifically, we performed an Analysis of Variance (ANOVA) model to compare the mean signal intensity of standardized integrated densities across PDL protein concentrations corresponding to 2, 25, 50, 75, 100, and 150 days pi after amplification with PCR and RPA.
The critical statistical values for our ANOVA analysis include the following (Bevans, 2020): F value: the test statistic from an F test; the greater the F value, the greater the likelihood that the variation between groups is caused by the independent variable(s) and not from random chance Pr(>F): the p value of the F statistic; a measure of the significance of the F value and how likely it is that the resultant F value could occur due to random chance under the null hypothesis Residual variance: the sum of the squares of the residuals of the data set; a smaller residual variance indicates that the independent variables explain more of the variation
To begin, we ran two separate one-way ANOVAs to model signal intensity as a function of the method of amplification and as a function of the time passed pi. Based on the results above, the p-values for both one-way ANOVAs are < 0.01, showing a likelihood that method of amplification and time elapsed have a real impact on signal intensity (see Tables 9- 10). If the independent variables mapped were not statistically significant, there would be a 5.29e-7% and 0.055% chance, respectively, for results from differing methods and times to happen by random chance.
| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) | 
|---|---|---|---|---|---|
| Method | 1 | 7.890 | 7.890 | 37.99 | 5.29e-7 | 
| Residuals | 34 | 7.061 | 0.208 | 
| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) | 
|---|---|---|---|---|---|
| Time | 5 | 4.355 | 0.871 | 2.466 | 0.055 | 
| Residuals | 30 | 10.596 | 0.3532 | 
We then ran a two-way ANOVA, which would model signal intensity as a function of both amplification method and time elapsed. Combining the two conditions improved our model’s predictions, as the residual variance not explained by our independent variables decreased from 7.061 and 10.596 to 2.706 (see Table 11). Both the amplification method and time passed were once again statistically significant (p-values < 0.001), showing that both variables likely impact the amount of DNA product generated.
| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) | 
|---|---|---|---|---|---|
| Method | 1 | 7.890 | 7.890 | 84.552 | 4.27e-10 | 
| Time | 5 | 4.355 | 0.871 | 9.335 | 2.20e-5 | 
| Residuals | 29 | 2.706 | 0.093 | ||
| Signif. codes | 0 ‘’ 0.001 ‘**’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1** | 
Finally, we ran an ANOVA analysis that tested whether the independent variables have an interactive effect, where time elapsed may also affect the activity of the amplification methods. The interaction ANOVA showed significant effects of Method (F(1,24) = 5456.1, p < 0.001), Time (F(5,24) = 602.4, p < 0.001), and their interaction (F(5,24) = 369.5, p < 0.001). Because the ‘Method:Time’ variable has a relatively high sum-of-squares value, low p-value, and small residual variance, much of the variation can be explained by the interaction between amplification method and time elapsed (see Table 12).
If the differences were not statistically significant, the data would follow the null hypothesis and there would only be a 2e-16% for the results to occur under random change (see Table 8). Therefore, the interaction between our amplification methods and time elapsed past infection experimentally show a significant difference which is not due to random chance.
| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) | 
|---|---|---|---|---|---|
| Method | 1 | 7.890 | 7.890 | 5456.1 | <2e-16 | 
| Time | 5 | 4.355 | 0.871 | 602.4 | <2e-16 | 
| Method:Time | 5 | 2.671 | 0.534 | 369.5 | <2e-16 | 
| Residuals | 24 | 0.035 | 0.001 | ||
| Signif. codes | 0 ‘’ 0.001 ‘**’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1** | 
To decide which of our four different ANOVA models is the ‘best-fit’ to explain the data, we utilized the Akaike information criteria (AIC) test. The calculated AIC values show that the information value of the interaction ANOVA is the lowest, and has 100% of the AIC weight, meaning it is best fit to explain the variation in the data (see Table 13).
| Model | K | AICc | Delta_AICc | AICcWt | Cum.Wt | LL | 
|---|---|---|---|---|---|---|
| interaction | 13 | -105.29 | 0.00 | 1 | 1 | 73.92 | 
| two.way | 8 | 30.33 | 135.62 | 0 | 1 | -4.50 | 
| one.way.method. | 3 | 50.27 | 155.56 | 0 | 1 | -21.76 | 
| one.way.time | 7 | 76.13 | 181.42 | 0 | 1 | -29.07 | 
Homoscedasticity
To check if our analysis was homoscedastic, an assumption that residual variance is constant across independent variables, we also plotted various diagnostic plots of our model (See Fig. 5).

Our diagnostic plots display the residual variance across the range of the observed data in multiple ways. The Residuals vs. Fitted graph has a roughly constant amount of variance and is mostly centered at zero, meaning there are no major outliers causing variance.
The Q-Q Residuals graph plots a regression between the theoretical residuals of a perfectly homoscedastic model and our actual residuals. Our Q-Q plot is very close to a slope of 1 with slight deviation at the extremes, meaning there is generally a low level of variability.
For the Scale-Location graph and Leverage plot, the mean of the residuals (the red line) is generally constant, meaning the variance of our data set is stable and there are no extreme outliers. From these residual plots we find that our ANOVA model fits the assumption of homoscedasticity.
Tukey HSD Post-Hoc Test
ANOVA analyses are prone to confounding experimental error rates, especially as the number of groups compared increases. This error rate can be up to 97% for a comparison between 12 groups (Bevans, 2020). To control this variability, we ran Tukey’s Honest Significant Difference (HSD) post-hoc test to adjust the p-values by comparing the pairwise difference between each specific mean at a 95% confidence interval.
We then plotted the confidence intervals, which demonstrates no significant internal differences between the PCR groups, as intervals largely overlap zero across all time points (see Fig. 13).

The faceted plot revealed that all PCR groups (a) remain statistically indistinguishable and cluster at low intensities (~0.2–0.3), indicating no significant time effect for PCR (see Fig. 14). RPA groups, however, are separated into distinct subsets with RPA:2 being the highest (b), RPA:25 & RPA:50 intermediate (c), RPA:75 & RPA:100 lower (d), and RPA:150 as the lowest (e) (see Fig. 14).
Figure 14

In addition to the Tukey post-hoc test, we graphed the standardized integrated densities of post-PDL PCR and post-PDL RPA at 2 days pi. The plot clearly showed that the signal intensity for RPA was greater by a factor of ten compared to PCR (see Fig. 15).

Together, the adjusted p-values from the post-hoc Tukey HSD test and graph of post-PDL PCR compared to post-PDL RPA validates our choice of RPA within LANCET’s diagnostic workflow. Our statistical analysis demonstrates that RPA groups are always significantly greater than PCR groups, even at the latest time points. These findings support RPA’s greater sensitivity and stronger amplification efficiency as compared to PCR, experimentally confirming that LANCET is better suited for point-of-care testing as compared to traditional methods of DNA amplification.
Cas12a
Overview
In order to quantify the amount of target DNA following our previous systems, Lambert iGEM implemented the CRISPR-Cas12a system, which uses a guide RNA to bind and cleave a targeted dsDNA sequence (see Fig. 16). After the initial cut, the Cas12a enzyme remains activated and begins indiscriminately cleaving nearby ssDNA. To detect this activity, we introduced reporter molecules: short strands of ssDNA tagged with a fluorophore and a quencher. When cleaved by the activated Cas12a enzyme, the fluorophore separates from the quencher, producing a detectable fluorescent signal that is proportional to the amount of target DNA.

Development
We created a deterministic ODE model to simulate the CRISPR-Cas12a system, with guidance from Dr. Mark Styczynski at the Georgia Institute of Technology on parameter selection and model structure. This model was designed to serve as a baseline for our wetlab results, predicting the fluorescence output based on target DNA concentration. By modeling the activation of the Cas12a enzyme and its trans-cleavage of fluorophore-quencher probes, we created a reference curve that helps guide experimentation. After CRISPR-Cas12a experimentation finished, their results refined our parameters, creating a collaborative data streamline.
Signaling Pathway
We used MATLAB’s SimBiology software to map out the Cas-12a workflow using the Law of Mass Action and Michaelis-Menten equations (see Fig. 17). Michaelis-Menten equations are used to map out reactions involving enzyme kinetics, whereas Mass Action equations are used to diagram the collision of reactants in a solution. The pathway begins with the introduction of the Cas-12a enzyme and crRNA, and ends with the fluorescence output of the FAM-6 fluorophore.

Biochemical Reactions
The model contains a total of three biochemical reactions for the Cas12a pathway (see Table 14) :
| Reaction | Description | |
|---|---|---|
| 1 | cas12a + crRNA ↔ cas_crRNA | The Cas-12 Enzyme, LbCpf1, binds to a crRNA, forming a Cas12a–crRNA complex. | 
| 2 | cas_crRNA + target → byproduct2 + byproduct1 + act_cas12 | The Cas12a-crRNA complex cleaves the target DNA. | 
| 3 | act_cas12 + reporter → fluorophore + quencher | The activated enzyme cleaves the reporter probe, splitting apart the FAM-6 fluorophore and quencher. | 
Initial Values of Species
In order to simulate our reactions, our team needed to input quantities to initialize each component in our model. The initial value of each species was obtained from Cas12a wetlab protocols (see Table 15; see Experiments):
| Name | Value | Units | 
|---|---|---|
| cas12system | 1 | L | 
| cas12a | 2.5e-7 | M | 
| crRNA | 2.5e-7 | M | 
| cas_crRNA | 0 | M | 
| target | 2.5e-7 | M | 
| byproduct1 | 0 | M | 
| byproduct2 | 0 | M | 
| act_cas12a | 0 | M | 
| reporter | 2.5e-7 | M | 
| fluorophore | 0 | M | 
| quencher | 0 | M | 
Parameters
Rate constants and degradation rates are required as input in our Mass Action Laws and Michaelis-Menten Equations. These values were derived from literature or estimated if the values could not be found. The value of each parameter is as shown (see Table 16):
| Name | Reaction | Value | Units | Reference | 
|---|---|---|---|---|
| k_complex | Rate at which the Cas-12 Enzyme binds to a crRNA. | 76000000 | liter/mole/second | (Nalefski et al., 2021) | 
| kr_complex | Rate at which the Cas-12 Enzyme separates from the crRNA. | 0.00024 | 1/second | (Nalefski et al., 2021) | 
| k_cleave | Rate at which Cas12a-crRNA complex cleaves the Target DNA. | 81000000 | liter/mole/second | (Strohkendl et al., 2018) | 
| Vm_act | Maximal velocity at which the activated enzyme cleaves the reporter probe. | 8.45e-8 | molarity/second | (Nalefski et al., 2021) | 
| Km_act | Michaelis-Menten constant for Cas12a trans-cleavage. | 0.0000337 | molarity | (Nalefski et al., 2021) | 
Ordinary Differentials Equations
We used the Mass Action Kinetic Laws and Michaelis-Menten Equations to represent different reactions for the Cas12a model. We generated nine ordinary differential equations for the full Cas12a system (see Table 17).
| Number | ODE Equation | 
|---|---|
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | 
Assumptions
The Cas12a ODE model assumes rate constants as continuous, due to a lack of external environmental factors. Therefore, the model can predict fluorophore release from the reporter probe to capture the resulting fluorescence. The model also assumes that the total quantities of the LbCPf1 Cas-12 enzyme, crRNA, Target DNA, and Reporter Probe are known initially and remain constant over time. Additionally, the model assumes that Cas12a cleavage of target DNA and reporter probes occurs at a constant rate regardless of target DNA or reporter probes.
Results
Cas12a
We developed a complete set of ODEs describing the Cas12a collateral cleavage mechanism and implemented them in MATLAB SimBiology. Using SimBiology’s Model Analyzer, we simulated 6FAM fluorophore concentration over 30 minutes with an initial dsDNA target concentration of M (see Fig. 18). The model captures the sequential reaction mechanism: Cas12a binds and cleaves the target dsDNA, becoming activated and subsequently executing collateral cleavage of the ssDNA reporter probe to separate the fluorophore from its quencher, generating a fluorescent signal.

Experimental vs Theoretical Cas12a
We analyzed experimental fluorescence data from Cas12a trans-cleavage reactions. Fluorescence was detected by the plate reader as reporter probes were cleaved. To compare with model predictions, we converted RFU measurements to 6FAM fluorophore concentrations using our calibration equation (see Fig. 19). The resulting experimental concentration profiles closely matched simulated data (see Table 18), validating our ODE model’s ability to predict Cas12a detection kinetics.



CRISPRi
Overview
In 2024, Lambert iGEM developed a system of deterministic ODEs to model our CRISPRi reaction. CRISPRi can be summarized by the steps below: Protein synthesis: Proteins required for the CRISPRi system are synthesized and translated using the Sigma70 cell-free TXTL kit, which provides all components for in vitro protein expression. Binding: The scaffold region of the sgRNA associates with dCas9 to form a stable dCas9–sgRNA complex that guides the system to its target. Knockdown: The sgRNA–dCas9 complex binds to the complementary DNA sequence of the target gene, reducing fluorescence by inhibiting Green Fluorescent Protein (GFP) expression.
Development
To assist in interpreting and validating CRISPRi reactions, we incorporated an ODE model to simulate the reaction and correlate fluorescence expressed by a target DNA construct to time elapsed. We consulted with Dr. Mark Styczynski from the Georgia Institute of Technology to identify certain parameters and rate constants. Comparison between model predictions and experimental fluorescence data enabled parameter refinement, yielding an improved ODE model with updated kinetic parameters. Using this refined model, we generated sgRNA concentration curves to optimize repression efficiency and extended the framework to incorporate our multiplexing approach, enabling prediction of multi-target CRISPRi performance.
Signaling Pathway
We used MATLAB’s SimBiology software to diagram each reaction using the Law of Mass Action, Michaelis Menten Enzyme Kinetics, and Hill Equations. The Mass Action equations are used to diagram the collision of reactants in a solution, which is needed for all the reactants in our reaction pathways. Michaelis Menten and Hill equations are to properly diagram catalytic enzyme kinetics when utilizing multiple ligand areas. The pathway begins with the introduction of a DNA construct into the system and ends with the production of GFP (see Fig. 20).

Biochemical Reactions
The CRISPRi model contains a total of 16 biochemical reactions for the overall pathway, as shown below (see Table 19):
| # | Reaction | Description | 
|---|---|---|
| 1 | sgDNA_0250 + RNAP → sgRNA_0250 | RNA Polymerase transcribes the guide DNA construct for the 0250 gene into the sgRNA construct. | 
| 2 | RNAP + sgDNA_0841 → sgRNA_0841 | RNA Polymerase transcribes the guide DNA construct for the 0841 gene into the sgRNA construct. | 
| 3 | dCas9_P + TXTL → dCas9 | dCas9 Protein synthesized in the cell-free system. | 
| 4 | sgRNA_0841 + dCas9 → d_sg_0841 | 0841 Guide RNA binds with dCas9 to form sgRNA_dCas9 complex. | 
| 5 | sgRNA_0250 + dCas9 → d_sg_0250 | 0250 Guide RNA binds with dCas9 to form sgRNA_dCas9 complex. | 
| 6 | d_sg_0250 + TC_0250 ↔ TC_d_sg_0250 | 0250 sgRNA_dCas9 complex binds to 0250 Target Construct, inhibiting its GFP expression. | 
| 7 | TC_0841 + d_sg_0841 ↔ TC_d_sg_0841 | 0841 sgRNA_dCas9 complex binds to 0250 Target Construct, inhibiting its GFP expression. | 
| 8 | TC_0250 → GFP | Before inhibition, the 0250 target construct is producing GFP. | 
| 9 | TC_0841 → GFP | Before inhibition, the 0841 target construct is producing GFP. | 
| 10 | sgDNA_0250 → null | DNA degrades over time. | 
| 11 | sgDNA_0841 → null | DNA degrades over time. | 
| 12 | RNAP → null | RNA Polymerase degrades over time. | 
| 13 | dCas9 → null | Deactivated Cas9 degrades over time. | 
| 14 | TC_0250 → null | Gene 0250 Target Construct degrades over time. | 
| 15 | TC_0841 → null | Gene 0841 Target Construct degrades over time. | 
| 16 | GFP → null | GFP Protein degrades over time. | 
Initial Values of Species
In order to simulate our reactions, our team needed to input quantities to initialize each component in our model (see Table 20). The initial value of each species was obtained from CRISPRi wetlab protocols (see Experiments):
| Variable | Species | Initial Value | Units | 
|---|---|---|---|
| sgDNA_0250 | 0250 sgRNA DNA construct | 5.08643e-7 | Molarity | 
| sgDNA_0841 | 0841 sgRNA DNA construct | 5.93846e-7 | Molarity | 
| RNAP | RNA Polymerase | 4e-7 | Molarity | 
| sgRNA_0250 | 0250 sgRNA | 0 | Molarity | 
| sgRNA_0841 | 0841 sgRNA | 0 | Molarity | 
| dCas9_P | dCas9 plasmid | 4e-8 | Molarity | 
| TXTL | TXTL cell-free system | 3e-8 | Molarity | 
| dCas9 | dCas9 protein | 0 | Molarity | 
| d_sg_0250 | dCas9 and sgRNA 0250 complex | 0 | Molarity | 
| d_sg_0841 | dCas9 and sgRNA 0841 complex | 0 | Molarity | 
| TC_0250 | 0250 gene target construct | 8.14000e-8 | Molarity | 
| TC_0841 | 0841 gene target construct | 9.5000e-8 | Molarity | 
| GFP | GFP Protein | 0 | Molarity | 
| TC_d_sg_0250 | 0250 Target Construct, with dCas9 and sgRNA 0250 complex | 0 | Molarity | 
| TC_d_sg_0841 | 0841 Target Construct, with dCas9 and sgRNA 0841 complex | 0 | Molarity | 
Parameters
Rate constants and degradation rates were inputted into our Mass Action equations. These parameters were either sourced from existing literature or estimated when values could not be found. The value for each parameter is listed below (see Table 21):
| Variable | Reaction | Values | Units | Citation | 
|---|---|---|---|---|
| kf_plas | Rate of protein synthesis from dCas9 plasmid to dCas9 protein | 2.6000e-8 | liter/mole/second | (Marshall et al., 2017) | 
| kf_complex | Rate at which guide RNAs bind with dCas9 to form sgRNA_dCas9 complexes | 0.9800 | liter/mole/second | (Marshall et al., 2017) | 
| kf_0250 | Rate at which the 0250 sgRNA_dCas9 complex binds to the 0250 Target Construct, inhibiting its GFP expression | 1.0000e-5 | liter/mole/second | (Marshall, Beisel, & Noireaux, 2020) | 
| kf_0841 | Rate at which the 0841 sgRNA_dCas9 complex binds to the 0841 Target Construct, inhibiting its GFP expression | 1.0000e-9 | liter/mole/second | (Marshall, Beisel, & Noireaux, 2020) | 
| Vm_RNAP | Maximal velocity at which RNA Polymerase transcribes the guide DNA constructs | 6.5000e-4 | molarity/second | (Marshall et al., 2017) | 
| Km_RNAP | Michaelis-Menten constant for RNA Polymerase transcription | 0.0100 | molarity | (Marshall et al., 2017) | 
| Vm_0250 | Maximal velocity at which the 0250 target construct produces GFP | 2.0000e-4 | 1/second | experimentation | 
| Kp_0250 | Michaelis-Menten constant for 0250 target construct production of GFP | 1.0000e-5 | molarity | experimentation | 
| Vm_0841 | Maximal velocity at which the 0841 target construct produces GFP | 9.0000e-5 | 1/second | experimentation | 
| Kp_0841 | Michaelis-Menten constant for 0841 target construct production of GFP | 1.0000e-5 | molarity | experimentation | 
| kr_0250 | Rate at which the 0250 sgRNA_dCas9 complex releases from the 0250 Target Construct | 1.0000e-7 | 1/second | (Marshall, Beisel, & Noireaux, 2020) | 
| kr_0841 | Rate at which the 0841 sgRNA_dCas9 complex releases from the 0841 Target Construct | 1.0000e-7 | 1/second | (Marshall, Beisel, & Noireaux, 2020) | 
| k_dna_deg | Rate at which DNA degrades | 1.0000e-6 | 1/second | (Marshall et al., 2017) | 
| k_protein_deg | Rate at which protein degrades | 3.9000e-7 | 1/second | (Marshall et al., 2017) | 
Ordinary Differential Equations
We used the Mass Action Kinetic Laws, Michaelis-Menten Equations, and Hill equations to represent different reactions for the CRISPRi model. We created 12 ordinary differential equations for the full CRISPRi system (see Table 22):
| Number | ODE Equation | 
|---|---|
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | 
Assumptions
The CRISPRi ODE model assumes rate constants to be continuous, as there are no environmental factors that can influence the reaction in the in vitro cell free mix. Therefore, the model is able to isolate the CRISPRi process to simulate repression activity. The model also assumes the total initial concentrations of dCas9 plasmid, target constructs, sgRNA construct, sigma70 ribosome, Chi6, and RecBCD protein. Additionally, the sigma70 RNA polymerase is expected to synthesize DNA at the same rate regardless of whether the construct being transcribed is a plasmid or linear sequence.
Results
CRISPRi
After creating a complete set of ODEs, we utilized Simbiology’s Model Analyzer to simulate the concentration of GFP over 16 hours with an initial target construct concentration of 81.4 nM for gene 0250 and 95 nM for gene 0841 (See Figs. 21-22). As time passes, the target construct produces GFP, and as the dCas9 enzyme goes downstream, it represses the construct.


Experimental vs Theoretical CRISPRi
We analyzed fluorescence data from our CRISPRi experiments targeting Bb0250 and Bb0841. During the first 5 hours, RFU values remained elevated, indicating active GFP production. Subsequently, a noticeable plateau in RFU becomes evident, correlating with the inhibition of GFP transcription. To evaluate the effectiveness of our wetlab experimentation, we utilized a line-of-best-fit between GFP expression and fluorescence to correlate the reactions, represented by the equation: for our plate reader (See Fig. 23). By converting our relative fluorescence to GFP concentration, the similarity presented by the logarithmic shape of our model and our experimental results proves the accuracy of our wetlab (see Tables 23-24).





Multiplexed CRISPRi
After demonstrating successful repression of individual target genes, we sought to achieve more comprehensive bacterial inhibition through multiplexing—simultaneous repression of both Bb0250 and Bb0841. We constructed a multiplexed CRISPRi model in MATLAB SimBiology incorporating negative regulation pathways for both targets. Using initial target construct concentrations of 81.4 nM for Bb0250 and 95 nM for Bb0841, we simulated GFP expression over 16 hours. The model predicted a plateau in GFP concentration following dCas9-sgRNA complex formation (see Fig. 24), indicating effective dual-target repression.

Experimental vs Theoretical Multiplexed CRISPRi
We analyzed fluorescence data from our multiplexed CRISPRi experiments where Bb0250 and Bb0841 were simultaneously targeted. RFU values remained elevated during the first 5 hours, indicating active GFP production from both target constructs, before plateauing as CRISPRi-mediated repression of both genes took effect. To enable quantitative comparison with our multiplexed ODE model, we converted RFU measurements to GFP concentrations using the relationship: . Both experimental and simulated data for the multiplexed system exhibited similar characteristic logarithmic profiles (see Table 25), demonstrating that our model accurately predicts dual-target CRISPRi kinetics.


HP
Overview
As the incidence of Lyme disease continues to escalate, the absence of predictive tools limits proactive public health responses. To address this gap in disease forecasting, we developed projection models based on machine learning to identify regions of high-risk and inform strategic resource allocation. By integrating various factors our models assess future case prevalence at the county level in the U.S. and subnational level in Europe.
Development
Our model utilizes LSR to estimate county-level Lyme disease risk by integrating socioeconomic, demographic, and environmental features. To forecast future disease burden, we implemented an XGBoost model, which is an ensemble gradient boosting algorithm based on decision trees, to project case counts for 2025, 2030, 2040, and 2050. The model incorporated temporal lag features, which leverage historical case data to capture disease trends, and systematically tuned hyperparameters, adjustable settings that control learning behavior, to maximize performance.
The accuracy of our model was evaluated by Root Mean Square Error (RMSE) to quantify average error of predictions andcross-validating performance across multiple data subsets to ensure reliability and prevent overfitting. These models identify counties at higher risk, providing a data-driven strategy for deploying Lyme disease diagnostics such as LANCET in vulnerable communities (see Deployment). To demonstrate global applicability, we generated an additional heatmap for European subnational regions.
Dataset Development
To develop county-level predictive models, we integrated official public datasets capturing key transmission risk factors from 2010 to 2023. Sources included CDC case counts, U.S. Census demographics and geography, Bureau of Economic Analysis income statistics, U.S. Forest Service land cover estimates, and NOAA climate data. Datasets were merged by county (see Table 26):
| Dataset | Year | Description | 
|---|---|---|
| Lyme Disease Count | 2010–2023 | Reported cases per county per year (CDC) | 
| Population | 2010–2023 | Annual county population (Census) | 
| Personal Income | 2023 | Per capita personal income, USD (BEA) | 
| Vulnerable Risk Age (Under 5 & Over 65) | 2023 | Population risk ages (United States Census) | 
| Forest Cover | 2019 | Acres of forested land (USFS) | 
| Average Temperature (May–July) | 2023 | °F seasonal average (NOAA) | 
| Average Precipitation (May–July) | 2023 | Inches seasonal average (NOAA) | 
| County Area | 2020 | Square miles (Gazetteer) | 
| Latitude / Longitude | 2020 | Coordinates (Gazetteer) | 
Heatmap
To quantify county-level risk intensity, we developed an LSR model combining observed cases with socio-environmental factors. LSR provides interpretable coefficient estimates for each feature while maintaining predictive accuracy, deriving year-specific intensity scores scaled as representative risk percentages.
Assumptions:
Our LSR heatmap model assumes that socio-environmental factors ranging from 2010-2023 remain constant as factors are stable over time. The model standardizes all predictor variables using z-scores and combines them linearly using an LSR. The model also assumes that linear relationships exist between these standardized features and per-capita case rates.
Feature Engineering
Population sizes were normalized and per-capita cases were calculated to standardize comparison across counties. Seven parameters influencing Lyme disease were standardized using z-scores, ensuring equal contribution with zero mean and unit variance.
We applied LSR to estimate linear relationships between standardized parameters and per-capita cases, producing a socio-environmental risk component value for each county. This value was calculated through weighting and normalization, then scaled on a range from one to ten based on intensity distribution.
County-level intensity scores were used to generate interactive heatmaps using the Folium software spanning from 2010 to 2023 (See Fig. 25). Counties were color-coded from lime (low risk) to black (highest risk). Interactive features including a time slider, searchable sidebar, and legend enable yearly risk interpretation and county-specific data access.

XGBoost
To forecast trends beyond current data, we applied XGBoost regression using historical cases alongside socioeconomic and environmental variables, projecting Lyme disease incidence for 2025, 2030, 2040, and 2050. These projections identify emerging high-risk regions and can help inform targeted public health strategies.
Assumptions:
Our XGBoost forecasting model assumes that lag features capturing cases from 1-2 years prior adequately represent time-based disease dynamics. Socio-environmental features and hyperparameters are set constant and optimal across all forecast periods. The model also assumes that the patterns learned from training data can be generalized to future decades and the NIH conversion factor remains constant across all predicted years and regions applied.
Feature Engineering
To help the model recognize patterns that change over time, we created lag features using data from previous years. The variables cases_lag_1 and cases_lag_2 represent the number of cases one and two years earlier, respectively, allowing the model to learn how past outbreaks influence current trends. The variable population_lag_1 captures the population from the previous year to adjust for changes in population size. We also included cases_per_capita_lag_1, which measures the number of cases relative to population, and years_since_start, which tracks long-term trends over the forecast period. Together, these features help the model understand both temporal correlations and population-driven effects on disease incidence.
Model Tuning
We trained XGBoost to forecast county-level cases using historical data from 2010 to 2023. To optimize performance, we fine-tuned the following hyperparameters (See Table 27):
| Hyperparameters | Value | 
|---|---|
| objective | ’reg:squarederror’ | 
| n_estimators | 1000 | 
| learning_rate | 0.05 | 
| max_depth | 5 | 
| subspace | 0.8 | 
| colsample_bytree | 0.8 | 
| random_state | 42 | 
| n_jobs | -1 | 
XGBoost was configured as a regression since case counts are continuous values. We set n_estimators to 1000 with a learning rate of 0.05 to allow for gradual learning and reduce overfitting risk. We set a maximum tree depth of five to balance the model’s complexity to allow for the algorithm’s ability to capture nonlinear interactions between each of the datasets.The stochastic subsampling of 80% of both training and feature instances per tree introduces randomness which reduces the sensitivity to outliers and correlated variables.
Forecasted Case
Using the model, we predicted county-level Lyme disease cases for 2025, 2030, 2040, and 2050. Lyme disease is severely underdiagnosed and underreported, with many cases going unconfirmed due to testing limitations and lack of healthcare access. Our model predicts confirmed case counts based on historical reporting data. We then applied the NIH’s established methodology for estimating total case burden, reflecting the true disease prevalence (see Table 28) (Kugeler et al., 2021).
| Year | Predicted Confirmed Cases | Predicted Total Cases | 
|---|---|---|
| 2025 | 105,681 | 476,000 | 
| 2030 | 176,771 | 707,085 | 
| 2040 | 226,011 | 904,044 | 
| 2050 | 240,473 | 961,892 | 
These predictions indicate substantial increases in Lyme disease burden across the United States over the coming decades. Current case tracking underestimates true incidence due to underreporting and limited testing, making local and national risk assessment challenging. By aggregating county-level forecasts, our XGBoost model accounts for multiple risk factors and identifies emerging hotspots that may be underrepresented in current surveillance. This predictive tool enables proactive public health planning and intervention before outbreaks escalate.
Results
To visualize predicted Lyme disease case distribution, choropleth maps were generated for each forecast year (see Figs. 26-29). Maps are presented in log scale to enhance visibility of differences between high and low case count counties, clearly identifying emerging hotspots.




Logarithmic scaling is particularly beneficial given wide variation in county-level incidence, with some counties reporting very few cases historically. This transformation ensures all counties, not just highest incidence areas, remain visible and interpretable for public health planning and resource allocation.
Impact
These forecasts provide insights into future Lyme disease risk across U.S. counties, enabling officials to allocate resources and monitor high-risk areas. By combining historical trends, socioeconomic factors, environmental data, and machine learning, this tool identifies emerging hotspots in regions with underreporting and limited diagnostic access.
Validation
To validate model accuracy and robustness, we performed a cross-validation and a hold-out validation. In the cross-validation, each year from 2012 to 2021 was sequentially excluded from training data and used as the test set. The model trained on remaining years and was evaluated on the excluded year to assess generalization across temporal contexts. Performance was measured using RMSE, quantifying average difference between predicted and actual case counts (see Table 28).
| Year | Cross Validation RMSE | 
|---|---|
| 2012 | 24.616 | 
| 2013 | 18.203 | 
| 2014 | 23.884 | 
| 2015 | 20.341 | 
| 2016 | 26.535 | 
| 2017 | 15.775 | 
| 2018 | 21.367 | 
| 2019 | 14.100 | 
| 2020 | 24.031 | 
| 2021 | 16.881 | 
| Overall | 20.573 | 
For hold-out validation, we trained on 2010 to 2016 data and tested on 2017 to 2021, simulating conditions where future data is unavailable during development (see Table 29):
| Year | Hold-Out RMSE | 
|---|---|
| 2017 | 16.376 | 
| 2018 | 19.758 | 
| 2019 | 16.273 | 
| 2020 | 30.976 | 
| 2021 | 17.991 | 
| Overall | 20.275 | 
Results indicate consistent prediction accuracy with average RMSE values of approximately 20 cases per county, sufficient for identifying high-risk counties and emerging hotspots. Cross-validation demonstrates strong generalization across years, while hold-out validation confirms accurate performance on unseen future data, validating XGBoost’s suitability for forecasting county-level Lyme disease trends.
###XGBoost Rationale We selected XGBoost for its ability to capture nonlinear relationships while avoiding overfitting through regularization. Unlike standard random forests, which treat all trees equally, XGBoost uses gradient boosting to iteratively improve predictions by assigning greater weight to previously mispredicted samples, enhancing county-level forecasting accuracy.
Global Application
To demonstrate broader applicability beyond the U.S., we adapted our LSRL approach to generate subnational risk visualizations for European countries (see Fig. 30). Using reported incidence and geographic data at the NUTS2 administrative level, we created normalized choropleth maps visualizing risk intensity (see Table 30). This demonstrates how our models can serve as a blueprint for global monitoring and inform international public health planning and diagnostic deployment.

| Dataset | Year | Development | 
|---|---|---|
| Lyme Disease case incidence per 100,000 | 2019 | Reported cases per NUTS 2 region for modeling incidence and risk levels (PubMed, Taylor & Francis, Science Direct) | 
| Forest Cover | 2018 | Observed Forest Cover per Square Kilometer (Eurostat) | 
| Economic Factor | 2020 | Economic Activity my million euros (Eurostat) | 
| Population Density | 2018 | Individuals per km2 (Eurostat) | 
| Vulnerable Ages of Disease(Under 3) | 2018 | Population risk ages (Eurostat) | 
| NUTS2 European Region Coordinates | - | Latitude and Longitude coordinates (Eurostat) | 
Future Plans
We plan to expand model applicability for healthcare providers and researchers by incorporating additional datasets such as wildlife density, healthcare accessibility, and educational attainment levels. Beyond current predictions, we want to scale this approach globally to map risk intensity and forecast cases worldwide, enabling earlier hotspot identification and timely health interventions in regions around the world facing increasing Lyme disease burden.
