Software

Introduction

The objective of the software team this year is to incorporate in silico modelling into our experimental loop and design process. Software development started late in the FloraGuard project timeline, so a key goal was to build software that would still be useful for next year’s competition, regardless of whether NEU iGEM continues developing FloraGuard or chooses a new project.

To this end, the software team has built a pipeline that generates fusion proteins by rearranging relevant protein domains sourced from our literature search, then bulk-predicts the structures of the fusion proteins bound to cellulose, along with their binding affinity (Kd), using Boltz-2. As long as next year’s team is working with proteins, this pipeline should either be directly useful or provide reusable code.

Motivation

When we first started exploring the idea of FloraGuard, we conducted a literature search to find which protein flame retardants are currently in use. The issue we ran into when designing our fusion proteins was the sheer number of possible constructs.

The relative ordering of the domains within a fusion protein, which in many cases (including FloraGuard’s) is not specifically designed, can greatly affect the protein’s performance. The number of potential fusions grows factorially with the number of domains that are allowed to change position (that is, those that are not spacers or tags), and polynomially with the total number of domains (fixed and movable).

This growth makes it impossible to screen every possibility in the lab, given both time and resource constraints. Virtual screening is therefore an attractive way to narrow down the candidates before lab work.
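To make the growth concrete, the sketch below counts orderings under two assumed readings of a "fixed" domain: one where fixed domains keep their exact slot, and one where they only keep their order relative to each other. The function names and the five-domain layout are hypothetical and are not part of the team's pipeline.

```python
from itertools import permutations
from math import factorial


def count_fixed_slots(n_movable: int) -> int:
    """Orderings when fixed domains keep their exact slots: k!."""
    return factorial(n_movable)


def count_fixed_relative_order(n_total: int, n_movable: int) -> int:
    """Orderings when fixed domains only keep their order relative to each other.

    Choosing slots and an order for the k movable domains gives n! / (n - k)!.
    For a constant k this is a polynomial of degree k in n.
    """
    return factorial(n_total) // factorial(n_total - n_movable)


def enumerate_fixed_slots(domains):
    """Yield every ordering of (name, is_fixed) pairs with fixed slots held in place."""
    movable_slots = [i for i, (_, fixed) in enumerate(domains) if not fixed]
    movable_names = [domains[i][0] for i in movable_slots]
    for perm in permutations(movable_names):
        order = [name for name, _ in domains]
        for slot, name in zip(movable_slots, perm):
            order[slot] = name
        yield tuple(order)


# Hypothetical five-domain layout with two movable domains (A and B).
layout = [("Tag", True), ("Spacer", True), ("A", False), ("Linker", True), ("B", False)]
orderings = list(enumerate_fixed_slots(layout))
print(len(orderings), count_fixed_slots(2), count_fixed_relative_order(5, 2))
```

Either count is then multiplied by the number of candidate sequences available for each domain, which is why the library grows so quickly.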

Pipeline

The code lets the user create fusion protein sequences by specifying the data source for the amino acid sequence of each domain or region. Three kinds of data source are supported: a single FASTA file, a folder containing FASTA files, or a string passed directly in Python. Each component of the fusion protein can also be marked as having a fixed or non-fixed position, which allows us to rearrange the relative ordering.

Here is an example of how this was used for our project:


# FusionComponent, FusionBuilder, and the source classes come from the team's pipeline code.
HisTag = FusionComponent("6xHisTag", SingleStringSource("HisTag", "MHHHHHH"), position_fixed=True)
Spacer = FusionComponent("Spacer", SingleStringSource("Spacer", "GSSASS"), position_fixed=True)
CBD = FusionComponent(
    "Cellulose Binding Domain",
    SingleStringSource(
        "CBD",
        "SSIINPTSATFDKNVTKQADVKTTMTLNGNTFKTIT"
        "DANGTALNASTDYSVSGNDVTISKAYLAKQSVGTTTLNFNFSAGNPQKLVI",
    ),
    position_fixed=False,
)
Linker = FusionComponent("Linker", FolderFastaSource("Sequences/Linkers"), position_fixed=True)
PFR = FusionComponent("Protein Flame Retardant", FolderFastaSource("Sequences/PFRs"), position_fixed=False)

# Add components in N->C order
fusion = FusionBuilder(outputdir="Sequences/Fusion_Proteins")
(
    fusion.add_component(HisTag)
    .add_component(Spacer)
    .add_component(CBD)
    .add_component(Linker)
    .add_component(PFR)
)

# Generate the fusion protein fastas
fusion.permute_fusions(preserve_relative_ordering=False)
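As a rough illustration of what generating the fusion FASTA files involves, the sketch below joins one candidate sequence per component in N->C order and writes one FASTA record per combination. This is not the team's FusionBuilder: the `write_fusions` helper, the file naming, and the two toy linker sequences are all assumptions made for the example.

```python
from itertools import product
from pathlib import Path
import tempfile


def write_fusions(components, output_dir):
    """Write one FASTA file per combination of candidate sequences.

    components: list of (component_name, {candidate_name: sequence}) in N->C order.
    Returns the list of written paths.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    candidate_lists = [list(cands.items()) for _, cands in components]
    paths = []
    for combo in product(*candidate_lists):
        name = "_".join(cand_name for cand_name, _ in combo)
        sequence = "".join(seq for _, seq in combo)
        path = out / f"{name}.fasta"
        path.write_text(f">{name}\n{sequence}\n")
        paths.append(path)
    return paths


# Toy inputs: two hypothetical linkers, every other component has one candidate.
components = [
    ("Tag", {"HisTag": "MHHHHHH"}),
    ("Spacer", {"Spacer": "GSSASS"}),
    ("Linker", {"L1": "GGGGS", "L2": "EAAAK"}),
]
with tempfile.TemporaryDirectory() as tmp:
    written = write_fusions(components, tmp)
    names = [p.name for p in written]
    first_record = written[0].read_text()
print(names)
```

Combining this with a reordering step gives the full library: every allowed ordering, times every combination of candidate sequences.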

After the fusion proteins have been generated as FASTA files and stored in a folder, running Boltz-2 on them is relatively simple. Modal, a serverless cloud computing platform, was used to run the Boltz-2 model for ease and speed of development.

Below is the code that the pipeline used to run the Modal app after generating the fusion proteins and storing them in “Sequences/Fusion_Proteins”.


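import subprocess

# SMILES string used as the cellulose ligand
cellulose = 'O[C@H]1[C@H](O)C(O)C(O[C@H]2O[C@H](CO)[C@H](O)[C@H](O)[C@H]2O)O[C@H]1CO'
# Folder of fusion protein FASTA files produced by the previous step
fasta_folder = "Sequences/Fusion_Proteins"

# Launch the Modal app that runs Boltz-2 on every protein against the ligand
subprocess.run([
    "modal", "run", "src/boltz_command.py",
    "--proteins", fasta_folder,
    "--ligand", cellulose
])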
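The contents of src/boltz_command.py are not shown here. As a hedged sketch of what each prediction needs, the snippet below builds an input file in the YAML schema that Boltz documents for affinity prediction: a protein chain, a SMILES ligand, and an `affinity` property naming the ligand as the binder. The helper name `boltz_yaml`, the chain IDs, and the idea that one such file is produced per FASTA are assumptions, not a description of the team's script.

```python
def boltz_yaml(protein_sequence: str, ligand_smiles: str) -> str:
    """Return a Boltz YAML input requesting structure and affinity prediction."""
    # SMILES strings contain characters such as '[' and '@', so quote them for YAML.
    return (
        "version: 1\n"
        "sequences:\n"
        "  - protein:\n"
        "      id: A\n"
        f"      sequence: {protein_sequence}\n"
        "  - ligand:\n"
        "      id: B\n"
        f"      smiles: '{ligand_smiles}'\n"
        "properties:\n"
        "  - affinity:\n"
        "      binder: B\n"
    )


# Toy protein (His tag plus spacer) and a placeholder ligand (ethanol), for illustration only.
yaml_text = boltz_yaml("MHHHHHHGSSASS", "CCO")
print(yaml_text)
```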

Compare and Contrast Binding Software

Many docking programs could have been used in conjunction with protein folding models to generate plausible docked structures. Boltz-2 was chosen over these because it was trained on Kd, Ki, and IC50 values, which are much closer to biologically relevant quantities than the output of a docking scoring function.

The other reason was development speed: it was simpler to retrieve the docked structure directly from Boltz-2 in one command than to first predict the structures and then dock them separately.

This choice is not final and may change with further research and with the team’s future project goals.

Results and Drawbacks

The structures generated by Boltz-2 show significant off-site binding in the fusion proteins. The predicted Kd values are also very high (indicating weak binding), while the binding probability is low. The exact reason for these results is still being investigated, but the likely explanation is that either the binding site on the cellulose binding domain is obstructed, or the PFRs themselves have hydrophobic regions that can bind to the cellulose. Either case raises concerns about our current fusion design.

Figure 1. 3D Model of FloraGuard Protein Complex. Cellulose Binding Domain (Green), Flame Retardant Protein (Purple).

Plans for the Future

Developing a viable, effective, environmentally responsible, and safe cellulose-binding protein flame retardant is a highly multifaceted optimization problem, and one that the software team is committed to tackling.

There is a lot of room to build on the work that has been done so far. The ultimate goal of the software team, with respect to supporting the wet lab, is to be able to produce a short list of fusion proteins that are worth investing time and money in experimenting with.

Performing virtual screening of a library of fusion proteins before validating them in the lab allows NEU iGEM to save money and time and to explore more chemical space, leading to better candidate proteins.
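One way such a short list could be drawn from the predictions is sketched below. It assumes the results are available as the per-input `affinity_*.json` files that Boltz writes, with the `affinity_pred_value` and `affinity_probability_binary` keys, and that a lower `affinity_pred_value` means tighter predicted binding. How the team's Modal app stores its outputs is not described here, and the `shortlist` helper, the 0.5 probability cutoff, and the example numbers are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def shortlist(results_dir, min_probability=0.5, top_n=10):
    """Rank candidates from affinity_*.json files, strongest predicted binder first."""
    rows = []
    for path in Path(results_dir).rglob("affinity_*.json"):
        data = json.loads(path.read_text())
        rows.append({
            "name": path.stem.removeprefix("affinity_"),
            "value": data["affinity_pred_value"],
            "probability": data["affinity_probability_binary"],
        })
    # Keep likely binders, then sort by predicted affinity (lower is assumed tighter).
    kept = [r for r in rows if r["probability"] >= min_probability]
    return sorted(kept, key=lambda r: r["value"])[:top_n]


# Made-up numbers, only to exercise the function.
fake = {
    "fusion_a": {"affinity_pred_value": 1.2, "affinity_probability_binary": 0.8},
    "fusion_b": {"affinity_pred_value": 0.4, "affinity_probability_binary": 0.6},
    "fusion_c": {"affinity_pred_value": -0.5, "affinity_probability_binary": 0.1},
}
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in fake.items():
        Path(tmp, f"affinity_{name}.json").write_text(json.dumps(payload))
    ranked = [r["name"] for r in shortlist(tmp)]
print(ranked)
```

Filtering on binding probability before sorting keeps a candidate with a favorable affinity value but a low chance of binding at all (fusion_c here) off the list.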

We are currently looking for metrics to use that can help us reasonably evaluate proteins in silico, and perform virtual directed evolution experiments. One of these metrics (or potentially several) is a measure of a PFR’s flame retardancy. This requires a deeper understanding of the exact mechanisms that explain how these PFRs work, and how to select these features from a given protein.

There are also plans to explore the possibility of predicting some properties related to the concerns of the larger Boston community, namely biodegradability and stability over time.
