Overview
Our software was created to model the fragments resulting from an enzyme’s cleavage of gluten and to determine which (if any) IgE epitopes the fragments matched. This allowed the team to model the expected functionality of different enzymes in silico, helping the wet lab determine how enzymes of interest were likely to perform alone or in combination with other enzymes.
To do this, the dry lab team created three programs: the FASTA Ripper, the Enzyme Epitope Predictor, and the File Splicer. These scripts worked together to deliver the desired data in an easily accessible way.


Program Summaries and Quick Instructions
FASTA Ripper
- Function: Splits a multi-FASTA file into individual FASTA files.
- Input: multi-FASTA file in .fasta or .txt format.
- Output: Folder of individual FASTA sequences in .fasta format.
- Why: The gluten database was a single file containing all possible sequences and epitopes. This made running a program on it time-consuming and made selecting sequences of interest harder than it would be with individual files. This script was created to split the larger FASTA file into individual sequences.
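The splitting step described above can be illustrated with a short base R sketch. This is not the FASTA Ripper's actual code: the function name `split_multi_fasta`, the output file naming scheme, and the demo records are all assumptions made for illustration.

```r
# Minimal sketch (hypothetical helper, not the FASTA Ripper source):
# split a multi-FASTA file into one .fasta file per record.
split_multi_fasta <- function(input_path, output_dir) {
  lines <- readLines(input_path, warn = FALSE)
  lines <- lines[nzchar(trimws(lines))]          # drop blank lines
  is_header <- startsWith(lines, ">")
  if (!any(is_header)) stop("No FASTA headers ('>') found in ", input_path)

  # Every header starts a new record; cumsum gives each line its record index.
  record_id <- cumsum(is_header)
  keep <- record_id > 0                          # ignore text before the first header
  records <- split(lines[keep], record_id[keep])

  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  paths <- character(length(records))
  for (i in seq_along(records)) {
    header <- sub("^>", "", records[[i]][1])
    # Assumed naming scheme: record number plus a file-system-safe header.
    safe <- substr(gsub("[^A-Za-z0-9._-]+", "_", header), 1, 80)
    paths[i] <- file.path(output_dir, sprintf("%03d_%s.fasta", i, safe))
    writeLines(records[[i]], paths[i])
  }
  invisible(paths)
}

# Demo with two made-up records written to a temporary file.
demo_in <- tempfile(fileext = ".fasta")
writeLines(c(">seq1 alpha-gliadin", "MKTFLIL", "ALLAIV",
             ">seq2 glutenin", "PQQPFP"), demo_in)
demo_out <- file.path(tempdir(), "ripped_demo")
ripped <- split_multi_fasta(demo_in, demo_out)
print(basename(ripped))
```

Multi-line sequences are kept as they appear in the input, so each output file is a valid single-record FASTA file.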
Enzyme Epitope Predictor
- Function: Determines the results of enzyme cleavage and identifies any epitopes that match the resulting fragments.
- Input:
  - Epitopes: Epitope CSV (the file must be in .csv format, not .xlsx)
  - Folder of Sequences: A folder of FASTA sequences in .fasta format.
  - Output Folder: A folder for the output to be written to (we suggest that it be empty)
- Output:
  - Shiny App: Summary CSV for all enzymes and the positive control, with the sequences displayed in the Shiny app window.
  - R-script: Summary CSV for all enzymes and the positive control, plus the FASTA sequences in .fasta format marked fail, negative or success, all written to the selected output folder.
- Why: This program was the cornerstone of our modeling. It cleaves the protein sequences based on the selected enzymes and matches the resulting fragments to a list of provided epitopes. The output of this program informed our selection of enzymes.
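The cleave-then-match idea can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the Enzyme Epitope Predictor's code: the helpers `cleave` and `match_epitopes`, the rule "cut after every proline", the demo sequence, and the epitopes with ID codes 101 and 102 are all hypothetical. The X wildcard follows the epitope convention described for the input CSV.

```r
# Cut a sequence after every match of `cut_after` (a regex for the residue(s)
# that precede the cut site). The rule is illustrative, not a real enzyme model.
cleave <- function(sequence, cut_after) {
  hits <- gregexpr(cut_after, sequence, perl = TRUE)[[1]]
  if (hits[1] == -1) return(sequence)            # no cut site: sequence stays whole
  ends <- unique(c(hits + attr(hits, "match.length") - 1, nchar(sequence)))
  starts <- c(1, head(ends, -1) + 1)
  substring(sequence, starts, ends)
}

# Match fragments against epitopes; X in an epitope stands for any residue.
# `epitopes` is a named character vector whose names are the ID codes.
match_epitopes <- function(fragments, epitopes) {
  patterns <- gsub("X", "[A-Z]", toupper(epitopes), fixed = TRUE)
  codes <- vapply(fragments, function(frag) {
    hit <- vapply(patterns, function(p) grepl(p, frag, perl = TRUE), logical(1))
    paste(names(epitopes)[hit], collapse = ";")  # multiple codes joined by ';'
  }, character(1), USE.NAMES = FALSE)
  data.frame(fragment = seq_along(fragments), sequence = fragments,
             epitopes = codes, stringsAsFactors = FALSE)
}

demo_seq <- "QQPFPQPQLPY"                        # made-up gluten-like peptide
demo_epitopes <- c("101" = "QQPFP", "102" = "QXP")  # hypothetical epitopes

frags <- cleave(demo_seq, "P")                   # assumed rule: cut after every P
result <- match_epitopes(frags, demo_epitopes)
intact <- match_epitopes(demo_seq, demo_epitopes)  # uncut sequence for comparison
print(result)
print(intact)
```

In this toy case the uncut peptide matches both epitopes, while after cleavage epitope 101 no longer fits inside any single fragment and epitope 102 still matches two fragments.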
File Splicer
- Function: Divides out successful sequences from failed ones.
- Input: Folder of Enzyme Epitope Predictor Results
- Output: Folder of failed sequences (data missing epitopes) and a folder of successful sequences (those that have epitopes)
- Why: This program made the sequences that matched epitopes easy to obtain and reassess, and made sorting the results simple.
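A rough sketch of this kind of sorting is shown below. It assumes that the success or fail tag appears in each FASTA file name, which is a guess about the output format rather than a description of the File Splicer itself; the function name `sort_results` and the demo file names are hypothetical.

```r
# Hypothetical sketch: copy tagged .fasta files into "success" and "failed"
# folders. Assumes the tag is part of the file name (an assumption).
sort_results <- function(results_dir, output_dir) {
  files <- list.files(results_dir, pattern = "\\.fasta$", full.names = TRUE)
  tags <- c(success = "success", failed = "fail")  # folder name = tag to look for
  sorted <- list()
  for (bucket in names(tags)) {
    target <- file.path(output_dir, bucket)
    dir.create(target, showWarnings = FALSE, recursive = TRUE)
    hits <- files[grepl(tags[[bucket]], basename(files), ignore.case = TRUE)]
    if (length(hits) > 0) file.copy(hits, target, overwrite = TRUE)
    sorted[[bucket]] <- basename(hits)
  }
  sorted
}

# Demo with empty placeholder files in temporary folders.
results_dir <- tempfile("predictor_results_")
dir.create(results_dir)
file.create(file.path(results_dir, c("seqA_ANPEP_success.fasta",
                                     "seqB_ANPEP_fail.fasta",
                                     "summary.csv")))
sorted_dir <- tempfile("sorted_")
sorted <- sort_results(results_dir, sorted_dir)
print(sorted)
```

Files are copied rather than moved so the original results folder stays intact, and anything without a recognized tag (such as the summary CSV) is left where it is.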
Why Multiple Scripts
The dry lab initially tried combining all of the scripts into one piece of software. However, because of the large number of files the program was designed to process, the combined program was time-consuming to run. Since the additional functions only need to be run once rather than on every analysis, we determined that the programs were more efficient as separate scripts.
Use as Our Model
We used our software to test the function of the following enzymes against the GluPro 6.1 Database [1]:
- KUMA030
- AN-PEP
- SC-PEP
- EP_B2
- FVp-P
- CARICAIN (the enzyme used in the commercial product GluteGuard)
The enzymes were selected for their potential to degrade gluten, as their catalytic motifs target amino acid sequences commonly found within gluten proteins. They are derived from fungal, plant, or bacterial sources.
To evaluate enzyme performance, the Enzyme Epitope Predictor was used to generate a rough model of the expected gluten fragments, including their size, number, and sequence. These fragments were then compared to a previously established database of gluten IgE epitope sequences. Based on this comparison, it was possible to infer whether a single enzyme would be sufficient to break gluten into non-immunogenic fragments or whether a combination of enzymes would be necessary.
The results of the model indicated that our preferred enzyme candidate, AN-PEP, left a fragment that still matched an IgE epitope, indicating that a combination of enzymes would be required to completely prevent flare-ups.
These results guided our decision to begin exploring enzymes that could be used in combination with AN-PEP.
Limitations
Though this software was valuable in helping the team decide which enzymes to pursue, it has some limitations. The system does not account for enzyme concentration, reaction time, or real-world conditions such as temperature or pH. In essence, it models an idealized system.
The Enzyme Epitope Predictor can determine whether an enzyme could theoretically cleave gluten at every matching motif, but not whether similar results could realistically be achieved in vivo.
The program also does not perform multi-enzyme digests. Each sequence is cut with each enzyme separately, which prevents the program from exploring the effect of enzymes in combination. Given more time to work on the program, we would address this issue.
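One way a multi-enzyme digest could be approached is to apply each enzyme's cleavage rule in turn to the fragments left by the previous one. The sketch below is only an idealized illustration of that idea and is not part of the current software: the helpers `cleave` and `digest_with`, the two rules, and the demo peptide are hypothetical.

```r
# Cut a sequence after every match of `cut_after` (illustrative rule format).
cleave <- function(sequence, cut_after) {
  hits <- gregexpr(cut_after, sequence, perl = TRUE)[[1]]
  if (hits[1] == -1) return(sequence)
  ends <- unique(c(hits + attr(hits, "match.length") - 1, nchar(sequence)))
  starts <- c(1, head(ends, -1) + 1)
  substring(sequence, starts, ends)
}

# Apply several cleavage rules one after another: the fragments produced by
# one rule become the input for the next.
digest_with <- function(sequence, rules) {
  fragments <- sequence
  for (rule in rules) {
    fragments <- unlist(lapply(fragments, cleave, cut_after = rule),
                        use.names = FALSE)
  }
  fragments
}

demo_seq <- "QQPFPQPQLPY"                        # made-up peptide
# Hypothetical rules: enzyme_A cuts after P, enzyme_B cuts after Q.
combined <- digest_with(demo_seq, c(enzyme_A = "P", enzyme_B = "Q"))
print(combined)
```

Like the single-enzyme model, this assumes every matching site is cut completely, so the order of the rules does not change the final set of cut sites.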
Download Instructions
Download the software from our GitLab!
- Ensure you have R installed on your computer.
- All of our programs were built in R version 4.5.1 and run in RStudio. Best results will be obtained using the same version.
Instructions are written under the assumption that RStudio is being used.
- Visit our GitLab and download the required program(s)
- Open the file(s) in R and hit the run button.
- Approve the installation of any packages the software requires.
- Use as directed in the instructions.
FASTA Ripper
- Open the FASTA Ripper in RStudio
- Hit Source (the run button) in the top right-hand corner of the window.
- A file navigator will open. In it, select the .fasta or .txt file you would like to rip.
- A new folder will open with the ripped FASTA sequences inside.
Enzyme Epitope Predictor Instructions
CSV File Prep:
The Enzyme Epitope Predictor requires the epitopes to be provided as a CSV.
The file must have a .csv file extension or the program will not work.
The file should have three columns, as demonstrated in Table 1.
Col 1 | Col 2 | Col 3 |
---|---|---|
Protein Name | Epitope | ID Code |
String (does not need to be unique) | Amino acid sequence, where X is an unknown/variable amino acid | Number (must be unique to each epitope) |
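Before running the predictor, it can help to check that an epitope file follows the layout in Table 1. The sketch below is a hypothetical pre-check, not part of our programs: the function name `read_epitopes`, the assumption that the first row is a header, and the demo rows are illustrative only.

```r
# Hypothetical pre-check for an epitope CSV laid out as in Table 1.
# Assumes the first row is a header (Protein Name, Epitope, ID Code).
read_epitopes <- function(csv_path) {
  if (!grepl("\\.csv$", csv_path, ignore.case = TRUE)) {
    stop("The epitope file must have a .csv extension")
  }
  # Read everything as text so sequences such as "T" are not parsed as logicals.
  tab <- read.csv(csv_path, header = TRUE, colClasses = "character",
                  strip.white = TRUE)
  if (ncol(tab) < 3) stop("Expected three columns: protein name, epitope, ID code")
  tab <- tab[, 1:3]
  names(tab) <- c("protein", "epitope", "id")
  tab$epitope <- toupper(tab$epitope)

  if (anyDuplicated(tab$id) > 0) stop("ID codes must be unique to each epitope")
  # Allow the 20 standard amino acid letters plus X for unknown/variable residues.
  bad <- !grepl("^[ACDEFGHIKLMNPQRSTVWYX]+$", tab$epitope)
  if (any(bad)) stop("Invalid epitope sequence(s) for ID: ",
                     paste(tab$id[bad], collapse = ", "))
  tab
}

# Demo with two made-up rows.
demo_csv <- tempfile(fileext = ".csv")
writeLines(c("Protein Name,Epitope,ID Code",
             "omega-5 gliadin,QQXPQQ,1",
             "gamma-gliadin,qqpfp,2"), demo_csv)
epitopes <- read_epitopes(demo_csv)
print(epitopes)
```

The check stops with an error on duplicate ID codes or on characters that are not amino acid letters, which are the two constraints Table 1 places on the file.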
Instructions
The program runs the same way whether you use the Shiny app or RStudio, and it asks for the same inputs. Only the output changes. We suggest the Shiny app for shorter jobs where you only need the pass/fail information and do not intend to work further with the FASTA files. We suggest the R script for larger datasets where recovering the successful/failed files would be useful (for example, to go back and see whether other enzymes remove any epitope matches).
- Gather your documents:
  - Epitope CSV (.csv format)
  - Folder of .fasta files to be tested (can be made with the FASTA Ripper)
  - Output folder
- Open the Enzyme Epitope Predictor. It will open a Shiny app.
- Upload the epitope CSV
- Select the FASTA folder
- Select the output folder
- Select the enzymes you wish to test (or add the information for your own*)
- If using the Shiny app, select output type:
  - Overall Summary: tabs show successful and failed sequences across all enzymes
  - Summary Per Enzyme: tabs show the information separated by enzyme
- Hit run analysis
*If adding your own enzymes in the R-script, you will need to do so by altering the code. In the Shiny app, this function is built in.
Warning: Depending on the specifications of your device, the program may take a while to run (roughly 5-10 minutes). More complex cleavage patterns take longer.
Interpreting Your Results
Data is only useful if it can be easily interpreted. In this section, we walk you through how to do that, focusing on the output of the Shiny app.

If the user chooses the Overall Summary option, they will be presented with this screen. In the top middle, there are two visible tabs: Success and Failed. The Success tab shows sequences that have matched a given IgE epitope from the CSV file input by the user. From left to right, the information given is as follows:
Col 1 | Col 2 | Col 3 | Col 4 | Col 5 |
---|---|---|---|---|
The enzyme used | The sequence title | The number of the fragment that is a positive match | The fragment sequence produced by cleavage | The epitope code(s) that matched in the sequence; if multiple exist, they are delimited by a semicolon (;) |
There is also the option to show more entries at once: 10, 25, 50 or 100. At the bottom of the results, the user can keep track of how many trials they have run, but cannot retrieve older results once a new run has been started. The user can also download the results to their device as CSV files, either for a specific enzyme or all at once in a ZIP folder. If they download the ZIP file, the trial number is written in the name of that folder.

The Per-Enzyme Summary mode displays the data in the same manner as the Overall Summary, but the tabs correspond to the enzymes instead. Each tab displays the positive matches first, followed by the failed matches. If no positive or failed matches exist, the table displays a “No data Available in Table” message.

After downloading the CSV file, the user will see 4 columns with data:
Col 1 | Col 2 | Col 3 | Col 4 |
---|---|---|---|
The Sequence Name | The Fragment Number | The Sequence in the fragment | The Epitope code if an epitope match occurred |
The sequence name will appear truncated at first, but expanding the cell shows the full title of the sequence.
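Once downloaded, the summary CSV can also be explored in R. The sketch below tallies how often each epitope code appears, splitting semicolon-delimited codes. It is an illustration only: the helper `count_epitope_hits`, the column names, and the demo rows are assumptions. For a real file, the table would first be loaded with `read.csv()`.

```r
# Tally epitope codes from a summary table whose fourth column holds the
# matched code(s), with multiple codes separated by ';'.
count_epitope_hits <- function(summary_tab, code_col = 4) {
  codes <- as.character(summary_tab[[code_col]])
  codes <- codes[!is.na(codes) & nzchar(codes)]  # rows without a match are skipped
  split_codes <- trimws(unlist(strsplit(codes, ";", fixed = TRUE)))
  sort(table(split_codes), decreasing = TRUE)
}

# Made-up rows in the four-column layout (column names are hypothetical).
demo_summary <- data.frame(
  sequence_name = c("seq1", "seq1", "seq2", "seq3"),
  fragment      = c(1, 2, 1, 4),
  fragment_seq  = c("QQPFPQ", "PQLPY", "FPQPQ", "QQPFPQQP"),
  epitope_code  = c("12;15", "12", "", "15;12;20"),
  stringsAsFactors = FALSE
)
hits <- count_epitope_hits(demo_summary)
print(hits)
```

A tally like this gives a quick view of which epitopes survive cleavage most often across the tested sequences.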
File Splicer Instructions
This program is only useful when processing the results of the Enzyme Epitope Predictor R-script. The Shiny app produces only the summary CSV files, not the annotated FASTA sequences.
- Hit Run code
- Select the folder containing the results of the Enzyme Epitope Predictor
- Select the output folder
- Results will appear in the output folder, separated based on success and fail tags.
References
- Bromilow, S., Daly, M., Gethings, L. A., Mills, E. N. C., Nitride, C., & Shewry, P. R. (2020). The GluPro suite of curated cereal seed storage prolamin protein sequence datasets [Data set]. figshare. https://doi.org/12613154
- OpenAI. (2025). ChatGPT (GPT-5) [Large language model]. Retrieved July 15, 2025, from https://chat.openai.com/
- Posit. (2025). Shiny [Web framework]. Retrieved September 19, 2025, from https://shiny.posit.co/
- GeeksforGeeks. (2024, July 25). Biostrings in R. GeeksforGeeks. https://www.geeksforgeeks.org/r-language/biostrings-in-r/
- Chua, E. H. (n.d.). Regular Expressions (Regex). Nanyang Technological University. https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
- RDocumentation. (n.d.). RDocumentation. Retrieved July 15, 2025, from https://www.rdocumentation.org/