To realize our project, we conducted two types of modeling: Viral Transmission and Protein Optimization. The former is a mathematical model of the infection dynamics of avian influenza virus, designed to set key strategic targets by identifying the indicators we should focus on for virus suppression. The latter was devised to assist in the sequence design of our device, COCCO, by introducing new mutations to ensure that the novel fusion protein maintains its expected folding and undergoes the desired structural change upon ligand administration.
Both modeling efforts were not only indispensable steps for the progress of our project, but their content also includes several parts that could be generally useful for the project progression of other iGEM teams. In addition to specifically explaining how this modeling benefited our own project, we will also focus on and highlight several modules that we believe will be useful to other teams.
In conducting our modeling, we paid special attention to the following points:
- That the analysis of the modeling would help us understand the conditions required for COCCO and support our decision-making in the process of devising COCCO's design.
- To sufficiently incorporate literature research on existing models and the insights gained from Human Practice.
- To integrate evaluation by the Wet Lab as much as possible.
- To give special priority to validation through modeling for aspects that cannot be demonstrated by proof-of-concept in the Wet Lab, such as the effect of suppressing virus spread.
As a result, several methods were created that future iGEM teams can reference. To share these with other teams, this document provides procedures in as much detail as possible. This overview will outline each of them.
Viral Transmission
To understand the dynamics of intercellular virus infection in chicken cells with the introduction of COCCO, we constructed and analyzed its mathematical model. We built a model using a system of differential equations, considering COCCO's characteristic of inducing cell death more rapidly than viral proliferation. By mathematically analyzing this model, we found that the system exhibits threshold-like behavior. Furthermore, we ran simulations using data collected from wet lab experiments or scientific papers and showed that COCCO attenuates the spread of infection. We also confirmed that inducing cell death has an advantage over other methods in halting the spread of infection.
This modeling is likely a method that future iGEM teams should reference when utilizing modeling in their projects for the following reasons:
- By building a minimal model to determine the potential success of a project, the objective can be achieved without losing reliability or interpretability.
- By constructing a model that is not overly complex and analyzing it mathematically, the essential properties of the system can be revealed.
- Analyzing the threshold-like behavior of a system can provide a basis for judging whether the system will function effectively.
- By limiting the parameters estimated at one time and estimating them in stages using multiple experimental datasets, a highly reliable parameter estimation can be achieved.
For example, when handling a model of the glycolytic pathway and considering the rate regulation by phosphofructokinase (PFK), the following serves as an example of how the above points can be referenced:
- For the connection between the glycolytic system and the external environment, consider only the inflow and outflow of glucose, pyruvate, and other energy carriers at very simple, constant rates. Additionally, for reactions not controlled by PFK, describe the reaction rates with simple equations, such as a proportional formula or the Michaelis-Menten equation.
- Since the reaction exhibits a threshold-like property due to PFK's regulation, by fitting an appropriate functional form to the reaction rate, the threshold that controls the reaction's progression can be identified.
Protein Optimization
To realize COCCO, we designed a fusion protein of RIG-I and APAF1 (see Design). However, if the wild-type domains are simply linked, the binding affinity between RIG-I and the CARD of APAF1 is lost, making it highly likely that COCCO would not possess the complex functions required by the project. Therefore, in silico protein simulations, we attempted to design a protein with two functions by introducing appropriate amino acid mutations to improve binding affinity: (1) detecting dsRNA and (2) switching a signal ON/OFF using the CARD of different origins. For this purpose, we constructed EVOLVE, a workflow composed of multiple software programs. Furthermore, we conducted experiments in the Wet Lab using the protein designed in silico and verified the binding function of the mutated fusion protein was improved.
Our EVOLVE is useful for other iGEM teams to design and evaluate proteins in silico to improve the binding affinity of proteins that interact with other substances. For this purpose, EVOLVE is designed to be user-friendly in the following ways:
- The input and output of each software component of EVOLVE are in an easy-to-handle format.
- All software is written in Python and can therefore be run on Google Colaboratory, making the environment setup extremely simple.
- The description and usage instructions for all software are clearly documented.
For example, when designing an antibody to detect the conserved region of a viral hemagglutinin, it is necessary to insert an amino acid sequence that sufficiently increases binding affinity. In such cases, the use of our EVOLVE is strongly recommended.
Abstract
COCCO is characterized by its ability to rapidly induce apoptosis immediately after viral entry. This self-elimination of infected cells prevents viral replication and the spread of infection to surrounding cells. To reproduce this characteristic of COCCO – immediate response after infection – we believed it was necessary to construct a mathematical model that could express the time-dependent dynamics of infected cells. In constructing the model, based on feedback from Human Practices, we aimed for a simple model design focused on minimal elements. This approach was taken to complement the knowledge of viral transmission that cannot be obtained from wet lab experiments, while also allowing for the quantitative verification of the COCCO concept.
Through this viral transmission modeling, we achieved the following:
- By mathematically analyzing this model, we understood that a characteristic parameter with a threshold determines whether an infection is established.
- By running simulations using data collected from wet lab experiments or scientific papers, we showed that the spread of viral infection can be attenuated by COCCO.
- We confirmed that controlling the intensity of apoptosis via COCCO is the most effective intracellular control method we can implement to overcome the spread of infection.
Introduction
COCCO is a system where cells infected by a virus immediately die via apoptosis, but it does not directly inhibit the production of virus particles within the cell. Therefore, it is suggested that it is crucial for apoptosis to occur before the infected cell releases virus particles into its surroundings. To address this property, we utilized a model [3] that incorporates the "infection age"—the time elapsed since a virus entered a target cell—into the TIV (Target cell-Infected cell-Virus) model [1][2], which is the simplest description of intercellular infection dynamics.
Furthermore, because COCCO affects viral particle dynamics within an infected cell, there are many phenomena we should not consider for our COCCO simulations, such as the spatial heterogeneity of cells, variations in infection modes, and existing immune mechanisms. A point of caution common in mathematical modeling is that a model including excessive conditions to perfectly replicate a real situation will have its reliability compromised. Taking this into account, we constructed a model that incorporates the minimum necessary conditions for COCCO's environment.
When estimating all parameters in a mathematical model simultaneously, there is a concern that the uncertainty of the estimated parameters increases due to the sheer number of them. To limit this high degree of freedom in parameters, we devised a method to ultimately estimate all parameters by first estimating only a subset of them using reduced equations corresponding to the available experimental data, and repeating this process for multiple reference datasets. This technique is a very powerful estimation process in biological mathematical modeling, which tends to have many equations and associated parameters.
In this part, we will discuss whether the intercellular spread of infection can be terminated in the body of a chicken with COCCO, and the efficacy of accelerating the death rate of infected cells by inducing apoptosis. Supplementary information that is not essential for following the discussion is summarized in the Appendices.
Model
Model Structure
We assume that the process of a virus propagating between cells is a transition of epidemiological states as shown below, and we use the following age-structured TIV model [3] as the mathematical model to describe this.
\begin{align} \frac{dT(t)}{dt} = & ~\lambda - dT(t) - \beta T(t)V(t) \tag{1} \\ \frac{\partial i(t, a)}{\partial t} = & - \frac{\partial i(t, a)}{\partial a} - \delta (a)i(t, a) \tag{2} \\ \frac{dV(t)}{dt} = & \int_0^\infty p(a)i(t, a) da - cV(t) \tag{3} \end{align}Boundary Conditions:
\begin{equation} i(t,0) = \beta T(t)V(t) \tag{4} \end{equation}Initial conditions:
\begin{equation} T(0) = T_{0}, ~i(0, a) = i_{0}(a), ~V(0) = V_{0} \tag{5} \end{equation}The variables for the number of cells and virus particles, and the other parameters included in the partial differential equation above, are defined as follows.
Table 1. The variables in the differential equation system.
| Name | Definition | Unit | Discription |
|---|---|---|---|
| $t$ | Time | h | Time for the entire system. |
| $a$ | Infection age | h | Time that has elapsed since a virus particle entered a target cell |
| $T(t)$ | Number of target cells | cells / mL | A target cell is a cell that is completely uninfected by a virus. |
| $i(t,a)$ | Number of infected cells per unit age of infection | cells / (mL・h) | An infected cell is a cell that has been invaded by a virus. |
| $V(t)$ | Number of virus particles | HA / mL | Infectious virus particles outside of the cells are counted. |
Table 2. The other parameters in the differential equation system.
| Name | Definition | Unit | Discription |
|---|---|---|---|
| λ | Production rate of target cells | cells / (mL・h) | Target cells are produced at a constant rate λ. |
| d | Death rate of target cells | 1 / h | Target cells die at a constant rate d. |
| β | Infection rate | mL / (HA ・h) | Target cells are infected by the virus at a constant rate β per virus particle. |
| δ(a) | Death rate of infected cells | 1 / h | Infected cells die at a rate δ(a) that depends on the infection age a. |
| p(a) | Virus production rate | HA / cells | Infected cells produce virus particles at a rate p(a) that depends on the infection age a |
| $c$ | Virus clearance rate | 1 / h | Virus particles are cleared at a constant rate c |
| $T_{0}$ | Initial number of target cells | cells / mL | Number of target cells at t=0 |
| $i_{0}(a)$ | Initial number of infected cells per unit infection age | cells / (mL・h) | Distribution of infected cells with respect to infection age a at t=0 |
| $V_{0}$ | Initial number of virus particles | HA / mL | Number of infectious extracellular virus particles at t=0 |
$V(t)$ is defined as the number of virus particles , but measured as HA titer.
In our model, the following detailed assumptions are made:
- Stochastic events are averaged out, and only a deterministic model is considered.
- The rates $\lambda$, $d$, $\beta$ and $c$ are treated as time-independent constants, with all individual differences among cells and virus particles averaged out. In this process, the infectivity of a virus particle naturally does not depend on the time since it was released from an infected cell.
- The time dependence of the rates $\delta$ and $p$ is assumed only with respect to the infection age $a$, and they are expressed as the functions $\delta(a)$ and $p(a)$ .
- Spatial heterogeneity among cells is ignored. Specifically, cell-to-cell infection is ignored, and only cell-free infection is considered.
- Since only virus particles with infectivity are considered observable, only infectious virus particles are counted as $V(t)$.
- Once a virus particle enters a target cell, the cell immediately becomes an infected cell, and no other virus particles will enter it again.
- COCCO and existing intracellular/extracellular immune mechanisms are not explicitly distinguished as terms in the mathematical model but are expressed as contributions to the relevant parameters. COCCO is assumed to have the function of increasing $\delta (a)$.
- Viral mutation during the course of the infection is ignored.
Because these assumptions omit various events that can occur in reality, the mathematical model does not faithfully reproduce the real world. However, in order to prevent excessive complexity and maintain the model's reliability, it is necessary to minimize the model's components through such detailed assumptions.
Analysis
Among iGEM projects, mathematical modeling is often used to fix parameters and run specific numerical simulations to see if a system shows the desired behavior. However, this approach alone doesn't reveal sudden changes in the system's behavior from parameter variations, nor does it clarify the parameter conditions needed for a proof-of-concept. Therefore, it's essential to discuss the system's structure and fundamental properties by mathematically analyzing the model.
In this section, we discuss a method to analytically determine whether an epidemic will be eradicated, using the age-structured TIV model mentioned above.
Hereafter, we set $\displaystyle T_{0}=\frac{\lambda}{d}$. This corresponds to the value at the equilibrium state that $T(t)$ reaches in a situation where no infected cells or virus particles exist, i.e., when $i(t, a),~V(t)=0$. Here, as the value that determines whether an infection can be established, $R_{0}$ is defined in the following form (see Appendix A) [4].
\begin{equation} \label{eq:R0} R_{0}=N\cdot \frac{\beta T_{0}}{c} \tag{6} \end{equation}Here, N is defined as follows (see Appendix B) [3].
\begin{equation} \label{eq:N} N=\int_0^\infty p(a)e^{-\int_0^a \delta (s)ds}da \tag{7} \end{equation}$N$ signifies the burst size; that is, the total number of virus particles produced by a single infected cell from the time it's created until it dies [1][2][3]. Also, $\displaystyle \frac{1}{c}$ represents the average lifespan of a virus particle (see Appendix A) [2]. Therefore, the $R_{0}$ above is the basic reproduction number: the number of new infected cells produced by a single infected cell over its lifetime during the initial stage of an infection [4]. The basic reproduction number has a threshold of 1, which governs whether an epidemic is established or fails. If \(R_{0}\lt1\) , the infectious disease dies out, and if $R_{0}\gt1$, the infection spreads (see Appendix B) [4].
Results
Parameter Estimation
To discuss the effectiveness of COCCO, the model must be constructed with parameters that match the cellular and viral conditions to which COCCO is applied. In this section, we describe the method used to estimate the parameters in our model and calculate $R_{0}$. Since COCCO is a system that accelerates the death rate of cells that have taken in virus particles via apoptosis, we assumed in our model that the effect of COCCO appears only in $\delta (a)$. When estimating parameters, we used experimental data from either the Wet Lab or scientific papers. To ensure the target parameters could be accurately estimated, we selected experimental conditions that allowed for assumptions that would reduce the degrees of freedom of the parameters. For the estimation, we used the Python least_squares function, which performs a least-squares method.
To investigate the effect of COCCO on $\delta (a)$, we estimated it using a single-cycle infection experiment conducted in the Wet Lab. The experimental data was obtained by simultaneously infecting HEK293 cells with influenza virus A/Victoria/361/2011(H3N2) and determining the percentage of viable cells at each time point post-infection. Since the only available time points were 6 hpi and 12 hpi, we concluded that we could not estimate the form of $\delta (a)$ as a function of $a$. Therefore $\delta (a)$ was set as a constant. We estimated and compared the parameters for wild-type cells and for a stable cell line expressing PKR-APAF1, respectively. We named these respective parameters $\delta _{wild}$ and $\delta _{COCCO}$.
The value of $\delta$ was estimated to be 0.240 and 0.295 for wild-type cells and PKR-APAF1 expressing cells, respectively. We named these $\delta _{wild}$ and $\delta _{COCCO}$. Since an experiment involving multiple cycles of virus infection could not be performed in the Wet Lab, it was necessary to use data from the literature to estimate the other parameters.
We performed parameter estimation using data from cell infection experiments with influenza virus A/PR/8/34 (H1N1) and HEK293 cells [5][6], which is conducted under conditions similar to our Wet Lab infection experiment. The parameter estimation was performed stepwise: first, $\lambda$ and $d$; second, $\delta(a)$, $p(a)$, and $c$; and third, $\beta$.
δ(a) was set as a constant $\delta$, and p(a) was specifically defined as the following function [3]. The terms $p_{max}$, $a_{1}$, $b_{1}$ are positive constants and are the parameters to be estimated.
\begin{equation} p(a)=p_{max}\left( 1-e^{-b_{1}(a-a_{1})} \right) \tag{8} \end{equation}For the detailed method of parameter estimation, see Appendix C. The results were as follows.
Table 3. Estimated parameters
| Name | Definition | Unit | Value |
|---|---|---|---|
| λ | Production rate of target cells | cells / (mL・h) | $5.64\times 10^{4}$ |
| d | Death rate of target cells | 1 / h | $5.97\times 10^{-3}$ |
| β | Infection rate | mL / (HA ・h) | $6.61\times 10^{-3}$ |
| δ(a) | Death rate of infected cells | 1 / h | $6.18\times 10^{-2}$ |
| p(a) | Virus production rate | HA / cells | \begin{align*} p_{\max}&=2.93\times 10^{2} \\ a_{1}&=5.00 \\ b_{1}&=1.10\times 10^{-3} \end{align*} |
| $c$ | Virus Clearance rate | 1 / h | $5.73\times 10^{3}$ |
From these parameters, when T₀ = 1.00×10⁶, we obtain the following:
\begin{equation} \begin{array}{cc} N=\int_0^\infty p(a)e^{-\int_0^a \delta (s)ds}da = 60.5, & \quad R_{0}=N\cdot \frac{\beta T_{0}}{c} = 69.9 \end{array} \end{equation}Because the virus strain, an important experimental condition, is different, the $\delta _{wild}$ and $\delta _{COCCO}$ estimated from the Wet Lab data cannot be directly compared with the $\delta$ estimated from literature data. Therefore, we consider $\delta'$, which is the death rate when cells express PKR-APAF1 under the experimental conditions in the literature. Assuming that the effect of expressing PKR-APAF1 is the ratio $\frac{\delta_{\text{cocco}}}{\delta_{\text{wild}}}$, then:
\begin{equation} \delta' = \delta \cdot \frac{\delta_{\text{cocco}}}{\delta_{\text{wild}}} \tag{9} \end{equation}Calculating with the parameter values above gives $\delta '=7.60\times 10^{-2}$, $N = 37.4$, and $R_{0}=43.2$
This result does not satisfy the condition for ending the infection, which is $R_{0} \lt 1$. However, because $R_{0}$ was reduced compared to HEK293 wild type cells, it is considered that COCCO certainly has the effect of inhibiting the spread of viral infection. Although the ratio $\frac{\delta_{\text{cocco}}}{\delta_{\text{wild}}}$ is about 1.2, $R_{0}$ is reduced to 61.8% of the original. Thus, $R_{0}$ is greatly affected by only a slight change in $\delta$, which suggests that the spread of infection can be efficiently prevented by inducing the death of infected cells. We will discuss this in detail in the next section.
Sensitivity Analysis
$R_{0}$ is the value that determines whether an infection can be established, and as stated in equations (6) and (7), it is determined by the parameters $\beta$, $c$, $T_{0}$, $\delta (a)$, and $p(a)$. Since COCCO is a system that accelerates the death rate of cells that have taken in virus particles via apoptosis, the effect of COCCO corresponds to an increase in $\delta(a)$. However, methods of suppressing the infection by changing other parameters can also be considered. For example, decreasing $\beta$ corresponds to preventing new cell infections, and increasing c corresponds to promoting the clearance of infectious virus particles. To quantitatively discuss how much increasing $\delta(a)$, as COCCO does, contributes to suppressing $R_{0}$ below 1, it is necessary to perform a parameter sensitivity analysis on the controllable parameters that constitute $R_{0}$. We considered the sensitivity of $R_{0}$ to each parameter—that is, its local sensitivity—and compared their absolute values.
We calculated the local sensitivity of $R_{0}$ with respect to $\beta$, $c$, and $T_{0}$ using the following formula.
\begin{equation} E_{x}:=\frac{x}{R_{0}} \cdot \frac{\partial R_{0}}{\partial x} ~~(x\in \{\beta, c, T_{0}\}) \tag{10} \end{equation}Next, we calculated the functional local sensitivity of $R_{0}$ with respect to $p(a)$ using the following formula (see Appendix C).
\begin{equation} e_{p}(a):=\frac{p(a)}{N} e^{-\int_0^a \delta (s)ds} \tag{11} \end{equation}Furthermore, we calculated the functional local sensitivity of $R_{0}$ with respect to $\delta(a)$ using the following formula (see Appendix C).
\begin{equation} e_{\delta}(a):=\frac{\delta (s)}{N} \int_a^\infty p(u)e^{\int_0^u \delta (s)ds} du \tag{12} \end{equation}$e_{p}(a)$ gives the distribution of the contribution for each infection age $a$, and by integrating this over all a, the local sensitivity of $R_{0}$ to $p(a)$, $E_{p}$, can be obtained.
\begin{equation} E_{p}:=\int_0^{\infty} e_{p}(a)da \tag{13} \end{equation}Similarly, the local sensitivity of $R_{0}$ with respect to $\delta(a)$ over all $a$, $E_{\delta}$, is defined as:
\begin{equation} E_{\delta}:=\int_0^{\infty} e_{\delta}(a)da \tag{14} \end{equation}Here, it is found by analysis that regardless of the values of the parameters, $E_{\beta}=E_{T_{0}}=E_{p}=1, ~E_{c}=-1$. $E_{\delta}$ takes a negative value, but it is not generally -1. If we define $p(a)$ and $\delta(a)$ as in (8) and (9),
\begin{equation} E_{\delta}=-\left( 1+a_{1}\delta +\frac{\delta}{\delta +b_{1}} \right) \tag{15} \end{equation}Since $\delta, ~a_{1}, ~b_{1}\lt0$, we have $E_{\delta} \lt -1$, and regardless of what values the parameters take,
\begin{equation} |E_{x}| \lt |E_{\delta}| \quad (x\in \{\beta, c, T_{0}, p\}) \tag{16} \end{equation}This indicates that by increasing $\delta$ —that is, by inducing the death of infected cells— $R_{0}$ can be reduced more efficiently than by other methods. Substituting the values from Table 3 for $\delta, ~a_{1}, ~b_{1}$, we get $E_{\delta}=-2.29$, which shows that a change in $\delta$ has more than double the effect on $R_{0}$ compared to the other parameters.
Discussion & Conclusion
To understand the effect of COCCO on intercellular virus infection, we analyzed a TIV model that considers the infection age. As a result, the following points were clarified:
- The basic reproduction number $R_{0}$ determines whether an infection is established.
- The introduction of COCCO mitigates the spread of viral infection.
- In combating the spread of infection, promoting the death of virus-infected cells is the most effective method compared to others.
The reason for introducing the infection age in the first place was the importance of inducing apoptosis before an infected cell releases virus particles into its surroundings. In our mathematical model, this means that for the infection to end, the effect of $\delta(a)$ must exceed that of $p(a)$. However, in the Wet Lab data available this time, there were few time points, making it impossible to determine the functional form of $\delta(a)$, so we had no choice but to set $\delta$ as a constant. For this reason, we could not fully obtain the benefit of introducing the infection age. For a more accurate analysis of COCCO's effect, it is necessary to conduct single-cycle and multi-cycle infection experiments and to track virus titers and viable cell counts at frequent time points. In our current calculations, $R_{0}$ did not fall below the threshold of 1 even with the introduction of COCCO, but the results could potentially change with such an accurate analysis.
This method of constructing and analyzing a mathematical model has potential applications for various projects, as follows:
- By building a minimal model to determine the potential success of a project, the objective can be achieved without losing reliability or interpretability.
- By constructing a model that is not overly complex and analyzing it mathematically, the essential properties of the system can be revealed.
- Analyzing the threshold-like behavior of a system can provide a basis for judging whether the system will function effectively.
- By limiting the parameters to be estimated at one time and estimating them in stages using multiple experimental datasets, a highly reliable parameter estimation can be achieved.
Appendix
References
[1] Iwami, S., Nakaoka, S., & Iwanami, S. (2024).
Mathematical
modeling and simulation of viral infection: Quantitative
understanding of
data [in Japanese]. Kyoritsu Shuppan.
[2] Iwami, S., Sato, K., & Takeuchi, Y. (2023). Virus
infection and
ordinary differential equations [in Japanese]. Science Sha.
[3] Nelson, P. W., Gilchrist, M. A., Coombs, D., Hyman, J.
M., &
Perelson, A. S. (2004). An Age-Structured Model of HIV Infection that
Allows for
Variations in the Production Rate of Viral Particles and the Death Rate
of
Productively Infected Cells. Mathematical Biosciences &
Engineering,
1(2), 267–288.
[4] Huang, G., Liu, X., & Takeuchi, Y. (2012). Lyapunov
Functions and
Global Stability for Age-Structured HIV Infection model. SIAM
Journal on Applied
Mathematics, 72(1), 25–38.
[5] Küchler, J., Opitz, P., Jordan, I., Genzel, Y., Benndorf,
D., &
Reichl, U. (2025). Quantification of intracellular influenza A virus
protein
dynamics in different host cells after seed virus adaptation.
Applied
Microbiology and Biotechnology, 109(1).
[6] Ru, A. L., Jacob, D., Transfiguracion, J., Ansorge, S.,
Henry, O., &
Kamen, A. A. (2010). Scalable production of influenza virus in HEK-293
cells for
efficient vaccine manufacturing. Vaccine, 28(21),
3661–3671.
[7] Gonis, A. (2014). Functionals and functional derivatives
of wave
functions and densities. World Journal of Condensed Matter
Physics,
04(03), 179–199.
Abstract
The fusion protein used in COCCO must be able to recognize viral dsRNA and induce apoptosis. To achieve this, we designed a fusion protein in which some domains of RIG-I were replaced (see Design). Based on the structure and function of RIG-I, it is highly likely that a fusion protein created by simply substituting domains would not have the necessary function. Therefore, it was necessary to improve this fusion protein by introducing amino acid mutations in silico.
We achieved the following:
- We integrated multiple software programs to build EVOLVE(Enhanced Variant Optimization via in-silico Ligand and interface EValuation), a workflow for designing optimal mutant proteins.
- We evaluated the performance of the designed mutant protein through modeling and wet lab experiments.
- As protein modeling provided an opportunity to improve the fusion protein, we were able to adopt the RIG-I and APAF1 fusion protein as a component to realize COCCO.
- From a modeling perspective, improvement in the function of the fusion protein could be expected.
The EVOLVE we constructed is in a format that is easy for users to handle during environment setup and execution. Therefore, our EVOLVE will likely be essential for future iGEM teams to meaningfully design proteins with enhanced interaction capabilities in silico.
Introduction
RIG-I is an intracellular innate immune factor that recognizes dsRNA and is composed of CARD1, CARD2, Helicase1 (Hel1), Helicase2 (Hel2), Helicase2i (Hel2i), Pincer, and CTD domains [1][2][3]. The two CARD domains transmit a signal to the downstream protein MAVS [2]. In the absence of dsRNA, CARD2 is bound to Hel2i, and RIG-I does not transmit a signal downstream [1][2]. When dsRNA binds to the CTD and Helicase domains, the structure changes, CARDs are released, and the signal is transmitted downstream.
In COCCO, instead of the two CARDs of RIG-I, a fusion protein is used that links the CARD of APAF1, which transmits a signal to Caspase9 (see Design). If both the CARD of APAF1 and the Hel2i of RIG-I remain wild-type, it's believed that the CARD will be constantly exposed due to the low binding affinity between them. In this state, accidental protein aggregation could potentially induce apoptosis even without dsRNA, so safety must be ensured when introducing it into chickens. To address this issue, it's necessary to introduce amino acid mutations into either the CARD of APAF1 or Hel2i to strengthen the Hel2i-CARD interaction. However, when mutations are introduced into the CARD of APAF1, the CARD-Caspase9 interaction must not be lost. Similarly, when mutations are introduced into Hel2i, the Hel2i-dsRNA interaction must not be lost.
We have constructed a workflow called EVOLVE to design mutant proteins with interaction capabilities enhanced beyond the wild-type. EVOLVE consists of five steps:
- Step1. Mutable Site Selection
- Step2. Mutation Introduction
- Step3: Structure Prediction
- Step4. Visual Inspection
- Step5. Docking & Stability Analysis.
There are two versions of EVOLVE, which differ in the method used in Step 2: Mutation Introduction. A description of EVOLVE will be provided in the following chapters.
In this part, we will discuss the design of the mutant protein via EVOLVE and the evaluation of the resulting fusion protein. Additionally, similarly to the previous part, information that is not necessarily required to be read is compiled in the Appendices.
Workflow
EVOLVE consists of the steps described above. Version 1 was used for designing mutants of Hel2i and CARD of APAF1 , while Version 2 was used for the Hel2i. The main difference between Versions 1 and 2 is the screening method for the Hel2i-CARD of APAF1 binding affinity—the property we wanted the fusion protein to acquire. In Version 1, this binding affinity is evaluated in Step 5: Docking & Stability Analysis. In contrast, Version 2 is designed to increase the binding affinity at the same time the mutation is introduced in Step 2: Mutation Introduction. Version 1 is appropriate for introducing mutations into a wild-type sequence to search for sequences with high binding affinity. On the other hand, version 2 is suitable for improving binding affinity while maintaining a known complex structure. In our modeling, we initially designed a mutant protein using Version 1, but later devised Version 2 because we believed it was better to maintain the existing 3D structure to achieve COCCO's function of dynamically and precisely switching signals.
When building EVOLVE, we were careful to create a workflow that is easy for users to handle. First, the software used in EVOLVE is written in Python, so it can be easily run on Google Colaboratory without a complex environment setup. Google Colaboratory is an environment where Python can be run for free in a Google browser. However, since the GUI for PyMOL [4] is not available on Google Colaboratory, we recommend installing the application.
In addition, the input and output for the software used in each step are either an amino acid sequence or a molecular 3D structure, and only the following two file formats are used:
- FASTA: A text-based file format commonly used for recording sequence information of proteins and nucleic acids.
- PDB: A text-based file format commonly used for recording 3D structural information of proteins and nucleic acids.
We created a user manual so that anyone can use EVOLVE.
Step1. Mutable Site Selection
If all amino acids are targeted for mutation, the number of candidates becomes enormous; if mutations can be introduced at n sites, 20ⁿ combinations exist. In addition, there is also a risk that significant changes may occur in the structure of the domain. Therefore, we targeted only those amino acids thought to be involved in the formation of the Hel2i-CARD of APAF1 complex for mutation. The mutable sites were selected by investigating the amino acid residues with side chains involved in Hel2i-CARD2 complex formation, with reference to the literature [1][3][5][6], and by observing the visualized model using PyMOL.
Step2. Mutation Introduction
Version 1
Even when limiting the mutatable sites, 20 different amino acids can be placed at each site, making the total number of patterns enormous. Therefore, creating candidates randomly is inefficient. Since proteins with unnatural sequences are likely to be ineffective [7], we evaluated a naturalness score. We generated sequences with high naturalness through directed evolution. To score the naturalness of amino acid sequences, we used the protein language model ESM2 [7], and as software to search for high-scoring sequences via directed evolution, we used EvoProtGrad [8].
Version 2
ProteinMPNN [9] is a software that takes the 3D structure of a protein as input and uses machine learning to output an amino acid sequence that is highly likely to adopt that structure. We created mutant sequences using ProteinMPNN as follows. We prepared the complex structure of the domain we wanted to mutate and the domain we wanted it to bind to. For this, we used a structure where the CARD2 in the Hel2i-CARD2 complex was replaced with the CARD of APAF1. We specified the mutatable sites among the amino acids of Hel2i. Then, an amino acid sequence that can stably form this complex structure is generated via machine learning. The top-ranking sequences are selected based on a "likelihood" score.
The likelihood score is calculated as -log(P), where P is the probability that the model assigns to a sequence; a smaller value indicates a better fit.
Step3. Structure Prediction
Using AlphaFold2 [10], we predicted the 3D structures of the generated mutant sequences. The reliability of the predicted structures is evaluated by a score called pLDDT. Since sequences with a low pLDDT score are unlikely to form a stable 3D structure, we excluded them. For execution, we used the software Biopython [11] to extract amino acid sequence information from the FASTA files output in Step 2 and to retrieve the pLDDT scores from the PDB files output by AlphaFold2.
Step4. Visual Inspection
We visualized the predicted 3D structures with PyMOL and excluded sequences that deviated significantly from the original structure, as they could potentially inhibit the complex movements of the fusion protein.
Step5. Docking & Stability Analysis
PyRosetta [12] is software for performing molecular modeling in Python. Using PyRosetta, we docked domains together, predicted the structure of the complex, and scored the stability of the interaction. There are the following two types of docking methods in PyRosetta:
- Rigid Docking: Moves only the relative positions of the molecules while maintaining the 3D structure within each molecule. The computational cost is low.
- Relax Docking: Also flexibly changes the relative atomic positions within each molecule. The computational cost is high, but it allows for more precise docking.
In both docking methods, the molecules are docked in a way that minimizes the total energy of the complex (Docking energy). It is also possible to calculate only the strength of the interaction at the interface from the complex's structure (Interface score). The stability score is an energy calculated by PyRosetta's evaluation function, with its unit in REU (Rosetta Energy Unit). A lower value indicates a more stable complex. In this stage, we docked the Hel2i with its ligands, the CARD of APAF1 or dsRNA, and evaluated their stability with the following procedure:
- (i) We performed Rigid Docking. However, we repeated the docking five times, slightly varying the initial position, and selected only the complex with the most stabilized total energy.
- (ii) We calculated the Interface score of the rigid complex structures created above and kept only the top-ranking results.
- (iii) We performed Relax Docking based on the rigid complex structures.
- (iv) We calculated the Interface score of the relaxed complex structures.
In this procedure, to reduce computational cost, we first performed rigid docking for rough screening, followed by relaxed docking.
Version1
When selecting mutant Hel2i, we first performed Hel2i-CARD of APAF1 docking (i-iv). After keeping only those with a high Interface score for the relaxed complex (iv), we performed dsRNA docking (i-iv) and excluded those with significantly worse dsRNA binding affinity compared to wild-type Hel2i. When selecting the mutant CARD of APAF1, we performed only the Hel2i-CARD of APAF1 docking. The docking of the CARD of APAF1 and the CARD of Caspase9 was not performed because their complex structure has not been uniquely predicted.
Version2
We evaluated the mutant Hel2i-dsRNA binding affinity and excluded any mutants with significantly worse binding affinity compared to the wild-type Hel2i. The method used was the same as in Version 1. Although not used as a selection criterion, We also evaluated the binding affinity between Hel2i and the CARD of APAF1 for comparison with the sequences created in Version 1.
Results
Using EVOLVE, we designed mutant protein domains with strong binding affinity for Hel2i and the APAF1 CARD, which are the domains that constitute the mutant fusion protein used in COCCO. The conditions for each step and the number of proteins remaining after selection during our in silico optimization are summarized in Appendix A. The affinities and sequences for the domains finally selected by EVOLVE, as calculated by PyRosetta's Relax docking, are summarized in Table 1 below. Here, "Affinity" represents the magnitude of the interface score from Relax docking, and its unit is REU (Rosetta Energy Unit).
Table 1. Results of the mutant sequence optimization by EVOLVE.
(A) Optimization results for mutant Hel2i from Version 1.
| No. | Affinity to the CARD of APAF1 | Affinity to the dsRNA | Sequences |
|---|---|---|---|
| WT | 32.59 | 53.01036 | FKYIIAQLMRDTESLAKRICKDLENLSQIQNREFGTQKYEQWIVTVQKACMVFQMPDKDEESRICKALFLYTSHLRKYNDALIISEHARMKDALDYLKDFFSNVRAAGFDEIEQDLTQRFEEKLQELESVSRDPSNEN |
| 95 | 46.87 | 53.61502 | FKYIIAQLMRDTESLAKRICKDLENLSQIQNREFGTQKYEQWIVTVQKACRVFQMPDEEEEKRILKALELYTSHLRKYNDALIISEHARMKDALDYLKDVFSNPRAAGFDEIEQDLTQRFEEKLQELESVSRDPSNEN |
| 68 | 43.82 | 56.22829 | FKYIIAQLMRDTESLAKRICKDLENLSQIQNREFGTQKYEQWIVTVQKACMVFQMPDKSEESRIRKALFLYTSHLRKYNDALIISEHARMKDALDYLKKFFSNVRAAGFDEIEQDLTQRFEEKLQELESVSRDPSNEN |
| 178 | 43.38 | 53.10431 | FKYIIAQLMRDTESLAKRICKDLENLSQIQNREFGTQKYEQWIVTVQKACNVFQMPDDDEESRIVKALRHYTSHLRKYNDALIISEHARMKDALDYLKTEFSNIRAAGFDEIEQDLTQRFEEKLQELESVSRDPSNEN |
| 162 | 37.71 | 54.20295 | FKYIIAQLMRDTESLAKRICKDLENLSQIQNREFGTQKYEQWIVTVQKACMVFQMPDKDEESRICLALFLYTSHLRKYNDALIISEHARMKDALDYLKDQFSNVRAAGFDEIEQDLTQRFEEKLQELESVSRDPSNEN |
| 141 | 37.36 | 54.48599 | FKYIIAQLMRDTESLAKRICKDLENLSQIQNREFGTQKYEQWIVTVQKACEVFQMPDKIEEERISKALHFYTSHLRKYNDALIISEHARMKDALDYLKSLFSNVRAAGFDEIEQDLTQRFEEKLQELESVSRDPSNEN |
(B) Optimization results for mutant CARD of APAF1 from Version 1.
| No. | Affinity to the Hel2i | Sequences |
|---|---|---|
| WT | 32.59 | MDAKARNCLLQHREALEKDIKTSYIMDHMISDGFLTISEEEKVRNEPTQQQRAAMLIKMILKKDNDSYVSFYNALLHEGYKDLAALLHDGIPVVSSS |
| 134 | 51.90 | MDAAARNTLLLHRELNLDDIIVESIMDHMISDGFLTISEEEKVRNEPTQQQRAAMLIKMILKKDNDSYVSFYNALLHEGYKDLAALLHDGIPVVSSS |
| 84 | 51.52 | MDELARNALLFHREHQIPDIAWSSIMDHMISDGFLTISEEEKVRNEPTQQQRAAMLIKMILKKDNDSYVSFYNALLHEGYKDLAALLHDGIPVVSSS |
| 46 | 47.69 | MDKIARNILLMHRELYRTDIEVLYIMDHMISDGFLTISEEEKVRNEPTQQQRAAMLIKMILKKDNDSYVSFYNALLHEGYKDLAALLHDGIPVVSSS |
| 108 | 46.10 | MDEKARNWLLAHREDIDEDIIAESIMDHMISDGFLTISEEEKVRNEPTQQQRAAMLIKMILKKDNDSYVSFYNALLHEGYKDLAALLHDGIPVVSSS |
| 0 | 44.54 | MDQLARNVLLTHREININDIQVEGIMDHMISDGFLTISEEEKVRNEPTQQQRAAMLIKMILKKDNDSYVSFYNALLHEGYKDLAALLHDGIPVVSSS |
(C) Optimization results for mutant Hel2i from Version 2.
| No. | Affinity to the CARD of APAF1 | Affinity to the dsRNA | Sequences |
|---|---|---|---|
| WT | 32.59 | 53.01036 | FKYIIAQLMRDTESLAKRICKDLENLSQIQNREFGTQKYEQWIVTVQKACMVFQMPDKDEESRICKALFLYTSHLRKYNDALIISEHARMKDALDYLKDFFSNVRAAGFDEIEQDLTQRFEEKLQELESVSRDPSNEN |
| 38 | 24.66 | 53.52874 | FKYIIAQLMRDTESLAKRICKDLENLSQIQNREFGTQKYEQWIVTVQEACKVFQMPDKEEEKRICEALNKYTSHLRKYNDALIISEHARMKDALDYLKEYFSNVRAAGFDEIEQDLTQRFEEKLQELESVSRDPSNEN |
| 552 | 20.24 | 52.39647 | FKYIIAQLMRDTESLAKRICKDLENLSQIQNREFGTQKYEQWIVTVQKACEVFQMPDKEEEKRICDALKKYTSHLRKYNDALIISEHARMKDALDYLKKYFSNVRAAGFDEIEQDLTQRFEEKLQELESVSRDPSNEN |
Results of the mutant sequence optimization selected by EVOLVE. "No." is the number given to the generated sequence in Step 2: Mutation Introduction, "WT" is the wild-type protein domains, the "Affinity" value represents the magnitude of the Interface score from Relax docking, the pink highlight in the sequence indicates mutable amino acids, and the red letters indicate mutated amino acids.
Additionally, regarding the performance of these protein domains according to Wet Lab results, the experimental results confirming whether mutant Hel2i designed by EVOLVE established binding with both RIG-I and APAF1 CARDs were as follows (see Result).
Discussion & Conclusion
For the success of our project, we considered the following two requirements for the mutant protein:
- The binding affinity between the Hel2i of RIG-I domain and the CARD of APAF1 is sufficiently high, ensuring the signal is OFF in the absence of dsRNA.
- The inherent binding affinity of the Hel2i of RIG-I domain for viral dsRNA is not lost by the mutation, allowing it to maintain binding with dsRNA in its presence.
The results from designing proteins that meets these conditions by using our workflow EVOLVE suggest the following:
- From the results of evaluating protein function by modeling, it can be seen that mutant Hel2i and mutant CARD of APAF1, which enhanced the interaction ability necessary for the realization of COCCO, could each be designed in silico.
- Despite these good results, according to Wet Lab experiments, none of the mutant Hel2i generated in Version 1 showed binding affinity with the CARD of APAF1.
From a protein modeling perspective, a possible reason for this outcome is that the parameters we set in each step of EVOLVE may have reduced the search freedom for the optimal sequence (see Appendix A, Manual). These parameters include:
- Step 1: The number of amino acids designated as mutable.
- Step 2: The number of simulation trials, the number of amino acid sequences to generate, the maximum number of mutations, etc.
- Step 3: The percentage of Cα atoms showing a high pLDDT.
- Step 5: Settings for the docking simulation, such as initial position, initial perturbation, and the number of simulation trials.
If these parameter settings can be improved, it should be possible to design promising proteins through EVOLVE. However, since the parameters are determined by the user's computational cost and the number of protein types that can be experimentally verified in the Wet Lab, the allocation of these costs is another important consideration when using EVOLVE.
Keeping the above in mind, EVOLVE is extremely useful for designing, in silico, mutant proteins with binding affinities enhanced beyond the wild-type. Furthermore, this workflow makes it possible to design mutations that enable interactions between proteins that would not normally bind. In particular, when future iGEMers design ligand detection systems, Version 1 can be used to enhance existing binding affinity and improve detection sensitivity, while Version 2 can be used to develop new detection systems that can interact with arbitrary proteins. EVOLVE is a significant approach that greatly expands the possibilities of protein design.
Appendix
In this section, we introduce the specific methods and setting of selection criteria when we actually used EVOLVE for the project. Version 1 was used for variant design of Hel2i and CARD of APAF1, and Version 2 was used for variant design of Hel2i. Furthermore, it is recommended that other iGEMers who refer to EVOLVE change the detailed selection criteria according to conditions such as the properties of the protein they want to create, computational resources, and the number of proteins that can be experimented with in Wet Lab.
Table 2. Detailed settings of EVOLVE and transition of protein numbers
(A) Hel2i mutant design by Version 1.
| Step | Specific Criteria Setting | Remaining Proteins |
|---|---|---|
| Mutable Site Selection | $20^{138}$→$20^{12}$ | |
| Mutation Introduction | Generated 200 sequences | $20^{12}$→200 |
| Structure Prediction | Among $C_{\alpha}$ carbons, those with pLDDT $\lt$ 50 are less than 30%, and those with pLDD10T $\gt$ 70 or more are 80% or more | 200→200 |
| Visual Inspection | 200→200 | |
| Docking & Stability Analysis | Interface score of Rigid Docking complex with CARD of APAF1 is less than Interface score of wild-type Hel2i (-14.1) | 200→73 |
| Docking & Stability Analysis | Top 5 in descending order of Interface score of Relax Docking complex with CARD of APAF1 | 73→5 |
| Docking & Stability Analysis | Interface score of Relax Docking complex with dsRNA is less than Interface score of wild-type Hel2i (-53.0) + 2 | 5→5 |
(B)CARD mutant design of APAF1 by Version 1.
| Step | Specific Criteria Setting | Remaining Proteins |
|---|---|---|
| Mutable Site Selection | $20^{97}$→$20^{12}$ | |
| Mutation Introduction | Generated 300 sequences | $20^{12}$→300 |
| Structure Prediction | Among $C_{\alpha}$ carbons, those with pLDDT $\lt$ 50 are less than 30%, and those with pLDDT $\gt$ 70 or more are 80% or more | 300→208 |
| Visual Inspection | 208→202 | |
| Docking & Stability Analysis | Interface score of Rigid Docking complex with Hel2i is less than Interface score of wild-type Hel2i (-14.1) - 5 | 202→57 |
| Docking & Stability Analysis | Top 5 in descending order of Interface score of Relax Docking complex with Hel2i | 57→5 |
(C)Hel2i mutant design by Version 2.
| Step | Specific Criteria Setting | Remaining Proteins |
|---|---|---|
| Mutable Site Selection | $20^{138}$→$20^{12}$ | |
| Mutation Introduction | Among the generated 1024 sequences, 2 with high scores | $20^{12}$→$10^{24}$→2 |
| Structure Prediction | Among $C_{\alpha}$ carbons, those with pLDDT $\lt$ 50 are less than 30%, and those with pLDDT $\gt$ 70 or more are 80% or more | 2→2 |
| Visual Inspection | 2→2 | |
| Docking & Stability Analysis | Interface score of Relax Docking complex with dsRNA is less than Interface score of wild-type Hel2i (-53.0) + 2 | 2→2 |
It shows the detailed criteria that we independently established at each step when we executed EVOLVE, and the transition of the number of proteins that were selected and remained there. The detailed criteria are our example and should be appropriately set by the user.
References
[1] Kowalinski, E., Lunardi, T., McCarthy, A. A., Louber, J.,
Brunel,
J., Grigorov, B., Gerlier, D., & Cusack, S. (2011). Structural basis for
the
activation of innate immune pattern-recognition receptor RIG-I by viral
RNA.
Cell, 147(2), 423–435.
[2] Thoresen, D., Wang, W., Galls, D., Guo, R., Xu, L., &
Pyle, A. M.
(2021). The molecular mechanism of RIG-I activation and signaling.
Immunological Reviews, 304(1), 154–168.
[3] Wang, W., & Pyle, A. M. (2022). The RIG-I receptor adopts
two
different conformations for distinguishing host from viral RNA ligands.
Molecular Cell, 82(21), 4131–4144.e6.
[4] The PyMOL Molecular Graphics System, Version 3.1
Schrödinger,
LLC.
[5] Jiang, F., Ramanathan, A., Miller, M. T., Tang, G., Gale,
M.,
Patel, S. S., & Marcotrigiano, J. (2011). Structural basis of RNA
recognition
and activation by innate immune receptor RIG-I. Nature,
479(7373), 423–427.
[6] Louber, J., Brunel, J., Uchikawa, E., Cusack, S., &
Gerlier, D.
(2015). Kinetic discrimination of self/non-self RNA by the ATPase
activity of
RIG-I and MDA5. BMC Biology, 13(1).
[7] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W.,
Smetanin,
N., Verkuil, R., Kabeli, O., Shmueli, Y., Costa, A. D. S.,
Fazel-Zarandi, M.,
Sercu, T., Candido, S., & Rives, A. (2023). Evolutionary-scale
prediction of
atomic-level protein structure with a language model. Science,
379(6637), 1123–1130.
[8] Emami, P., Perreault, A., Law, J., Biagioni, D., & St
John, P.
(2023). Plug & play directed evolution of proteins with gradient-based
discrete
MCMC. Machine Learning Science and Technology, 4(2),
025014.
[9] Dauparas, J., Anishchenko, I., Bennett, N., Bai, H.,
Ragotte, R.
J., Milles, L. F., Wicky, B. I. M., Courbet, A., De Haas, R. J., Bethel,
N.,
Leung, P. J. Y., Huddy, T. F., Pellock, S., Tischer, D., Chan, F.,
Koepnick, B.,
Nguyen, H., Kang, A., Sankaran, B., . . . Baker, D. (2022). Robust deep
learning–based protein sequence design using ProteinMPNN.
Science,
378(6615), 49–56.
[10] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov,
M.,
Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko,
A.,
Bridgland, A., Meyer, C., Kohl, S. a. A., Ballard, A. J., Cowie, A.,
Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., . . . Hassabis, D.
(2021).
Highly accurate protein structure prediction with AlphaFold.
Nature,
596(7873), 583–589.
[11] Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A.,
Cox,
C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski,
B., & De
Hoon, M. J. L. (2009). Biopython: freely available Python tools for
computational molecular biology and bioinformatics.
Bioinformatics,
25(11), 1422–1423.
[12] Bender, B. J., Cisneros, A. III, Duran, A. M., Finn, J.
A., Fu,
D., Lokits, A. D., Mueller, B. K., Sangha, A. K., Sauer, M. F., Sevy, A.
M.,
Sliwoski, G., Sheehan, J. H., DiMaio, F., Meiler, J., & Moretti, R.
(2016).
Protocols for molecular modeling with Rosetta3 and
RosettaScripts.
Biochemistry, 55(34), 4748–4763.