SMILES Script

Prediction of Peptide Structure

Abstract

For the high-throughput LC-MS measurement of our donor library, we needed to calculate the masses of 105 expected peptide products (part collection). Doing this by hand would be extremely time-consuming, contradicting the idea of a high-throughput approach and limiting accessibility for other iGEM teams. To solve this, we developed a script that automates the generation of chemical structures and their corresponding masses using SMILES concatenation. On this page, we explain how we built and applied this tool.

Introduction

Our tool consists of a Jupyter Notebook (or Colab Notebook) and an Excel workbook with three sheets. The first sheet holds the output, while the second and third sheets provide the input for the script. This setup keeps the tool accessible even to users with little or no programming experience. The output is a SMILES (Simplified Molecular Input Line Entry System)^[1] string for every peptide that can be generated with our platform. SMILES is a notation that turns complex organic molecules into one-dimensional, computer-readable strings. You do not need to understand how SMILES works to use our tool.

The rationale behind SMILES

In SMILES, atoms are written with the same symbols as in the periodic table (e. g. C for carbon, N for nitrogen, O for oxygen). Hydrogen atoms usually do not need to be written, as they are added implicitly wherever an atom has fewer bonds than its valence requires. For example, the SMILES for water is simply O: oxygen has a valence of two, and since it is not explicitly connected to any other atom, there must be two implicit bonds to hydrogen. Double and triple bonds are shown with the symbols = and #, respectively. Single bonds can be shown with a -, but are usually omitted. The SMILES for ethane, ethene and ethyne are therefore CC (or, less commonly, C-C), C=C and C#C. Branches are enclosed in parentheses, so acetone is CC(=O)C. Rings are written by placing the same number after two atoms to indicate that they are bonded to each other: cyclohexane can be written as C1CCCCC1, where the two 1s indicate that the first and the last carbon atom are connected. Finally, stereochemistry is indicated with @ and @@. These markers require all four substituents around the stereocenter to be specified, which means that the α-hydrogens of stereogenic amino acids must be written explicitly. For example, N[C@H](C)C(=O)O and N[C@@H](C)C(=O)O are D-alanine and L-alanine, respectively. SMILES strings can also encode charge and aromaticity. Fig. 1 shows how a more complicated molecule can be expressed as SMILES.

**Fig. 1:** The peptide C4-W-f-AMMON (where C4 stands for butyric acid and AMMON for ammonia) comparing its chemical structure and SMILES code. The structure and the SMILES are color-coded to show the main chain (red), side chains (gray, turquoise and violet) and cyclizations (yellow).

The SMILES format turned out to be especially useful. First, it allows us to use established tools to calculate molecular masses. Second, SMILES strings can be directly copied into accessible software that converts them into chemical structures, giving us immediate visualizations of our peptides.

The SMILES look-up table

The third sheet of the Excel workbook functions as an internal “database”. It lists every amino acid or other moiety incorporated by an NRPS module, together with its corresponding SMILES fragment. When the SMILES of a peptide product is generated, this sheet serves as a look-up table for the SMILES of the individual amino acids in the peptide. The peptide sequence is determined by the plasmids (see module look-up table) as well as by the combination of the three plasmids used for each NRPS expression (see Code and Output).

These fragments are simply amino acids with the C-terminal OH group removed; for instance, the fragment for L-alanine is N[C@@H](C)C(=O). However, there are many ways of writing the same amino acid: OC(=O)[C@H](C)N would be an equally valid SMILES for L-alanine. Our fragments need to follow stricter rules for the algorithm to work. Amino acids in the peptide chain must always start with the N of the N-terminal amino group and end with C(=O), which represents the C-terminal carboxy group. The same goes for modifications: N-terminal modifications end in C(=O) (e. g. CC(=O) for acetic acid) and C-terminal modifications start with N (e. g. NCCCCN for putrescine). These SMILES can simply be concatenated to create a linear peptide. Fig. 2 shows how the SMILES from Fig. 1, which our code creates for the peptide sequence C4-W-f-AMMON, is assembled directly from the SMILES fragments. The SMILES that we create are also very readable because their main chain always follows the peptide chain. We have color-coded the main chain and side chains in this example, but with a bit of practice it is possible to visualize the chemical structure of a peptide just by looking at the SMILES, which was very helpful for bug fixing!

**Fig. 2:** SMILES for the linear peptide C4-W-f-AMMON, showing how it is constructed from the SMILES fragments in our database.
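The concatenation rule can be sketched in a few lines of Python. This is a minimal illustration, not the code from our notebook: the dictionary is a hypothetical excerpt of the look-up sheet, the code `AC` for acetic acid is made up for this example, and the fragment `N` for `AMMON` is an assumption.

```python
# Hypothetical excerpt of the fragment look-up table (sheet 3).
# The code "AC" and the "AMMON" fragment are assumptions for illustration.
FRAGMENTS = {
    "AC": "CC(=O)",          # acetic acid, N-terminal modification (ends in C(=O))
    "A": "N[C@@H](C)C(=O)",  # L-alanine without the C-terminal OH
    "a": "N[C@H](C)C(=O)",   # D-alanine without the C-terminal OH
    "PUT": "NCCCCN",         # putrescine, C-terminal modification (starts with N)
    "AMMON": "N",            # ammonia, gives a primary amide (assumed fragment)
}


def linear_smiles(sequence: str) -> str:
    """Concatenate the fragments of a hyphen-separated peptide sequence."""
    return "".join(FRAGMENTS[code] for code in sequence.split("-"))


print(linear_smiles("AC-A-a-PUT"))
```

Because every fragment starts with N or ends in C(=O), each junction between two fragments automatically reads C(=O)N, i.e. an amide bond.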

The module look-up table

The second sheet is a look-up sheet containing the names of all plasmids that we used for peptide production and the amino acids that the NRPS modules on these plasmids incorporate. Fig. 3 shows how this sheet looks with the nine plasmids that we used for our initial intein shuffling experiments.

**Fig. 3:** Module lookup table, the second sheet of the workbook filled out for our standard plasmids of the chaiyaphumine, szentiamide and xentrivalpeptide synthetases.

It is straightforward to add a new plasmid to this table: The amino acid sequence is added in column D as a one-letter code with hyphens between the amino acids (e. g. P-V for pIG023, which incorporates proline and valine). L-amino acids are uppercase, D-amino acids are lowercase. Unusual amino acids, methylated amino acids and C- and N-terminal modifications can also be given (e. g. PAA for phenylacetic acid).

Column C indicates whether the unit starts with an E domain or a dual E/C domain. This is important because these domains epimerize the amino acid incorporated by the upstream A domain, which is located on a different plasmid. Consequently, the last amino acid of each plasmid must always be given in uppercase, since whether it is epimerized is determined only by the next downstream plasmid.
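The interplay between column C and the uppercase/lowercase convention could be handled as in the following sketch. The function name `join_units` and the input format are assumptions made for illustration; only the sequence P-V comes from the table above, and the two downstream sequences are invented.

```python
def join_units(units):
    """Join plasmid sequences in starter -> termination order.

    `units` is a list of (sequence, starts_with_e_domain) tuples.
    If a unit starts with an E or dual E/C domain, the last residue
    of the upstream unit is epimerized, i.e. written in lowercase.
    """
    residues = []
    for sequence, starts_with_e_domain in units:
        if starts_with_e_domain and residues:
            residues[-1] = residues[-1].lower()
        residues.extend(sequence.split("-"))
    return "-".join(residues)


# Hypothetical combination: the second plasmid starts with an E domain,
# so the valine of the first plasmid becomes D-valine ("v").
units = [("P-V", False), ("F-A", True), ("L-WATER", False)]
print(join_units(units))
```

Storing the last residue in uppercase and flipping it only when the plasmids are combined means each plasmid row stays valid no matter which downstream partner it is paired with.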

Column E (which must only be filled out for termination plasmids) contains information about the type of peptide produced. ‘0’ means that the peptide will be linear. In this case, the amino acid sequence must not end with the last amino acid, but with the nucleophile that releases the peptide from the NRPS: ammonia (given as ‘AMMON’ in our amino acid code) will lead to a primary amide C-terminus, water (given as ‘WATER’) leads to a carboxylic acid C-terminus and a number of amines such as putrescine (given as ‘PUT’) will lead to secondary amide C-termini.

If column E contains a ‘1’, the peptide is end-to-end cyclized (i. e. the N-terminus and the C-terminus form an amide bond). In this case, the amino acid sequences must contain neither C-terminal nor N-terminal modifications. Finally, a ‘2’ in column E means that the TE domain creates a depsipeptide. In this case, the C-terminus must not be modified (N-terminal modifications are acceptable), and the one-letter code of the threonine or serine at which the depsicyclization occurs must be followed by a ‘1’ (i. e. T1, S1, t1 or s1).
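The rules for column E can be turned into a simple consistency check. The sketch below is hypothetical and not part of our notebook; the sets only contain the codes mentioned on this page, whereas a real table would list more.

```python
DEPSI_CODES = {"T1", "S1", "t1", "s1"}
# Only the codes named in the text; a complete table would contain more.
C_TERMINAL_NUCLEOPHILES = {"AMMON", "WATER", "PUT"}
N_TERMINAL_MODIFICATIONS = {"PAA", "C4"}


def check_sequence(sequence, release_type):
    """Return a list of rule violations for a full peptide sequence."""
    residues = sequence.split("-")
    problems = []
    if release_type == 0:  # linear peptide
        if residues[-1] not in C_TERMINAL_NUCLEOPHILES:
            problems.append("linear peptides must end with the releasing nucleophile")
    elif release_type == 1:  # end-to-end cyclization
        if residues[0] in N_TERMINAL_MODIFICATIONS:
            problems.append("end-to-end cyclization forbids N-terminal modifications")
        if residues[-1] in C_TERMINAL_NUCLEOPHILES:
            problems.append("end-to-end cyclization forbids C-terminal modifications")
    elif release_type == 2:  # depsipeptide
        if residues[-1] in C_TERMINAL_NUCLEOPHILES:
            problems.append("depsipeptides must not have a modified C-terminus")
        if sum(r in DEPSI_CODES for r in residues) != 1:
            problems.append("depsipeptides need exactly one of T1, S1, t1 or s1")
    else:
        problems.append("column E must contain 0, 1 or 2")
    return problems


print(check_sequence("PAA-T1-f-a-P-W", 2))  # chaiyaphumine A, no violations
print(check_sequence("C4-W-f", 0))          # missing nucleophile
```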

Macrocyclization

You might have noticed in Fig. 1 that the number ‘1’ is used to close two different rings. This is possible because SMILES are parsed from left to right: the first occurrence of a number ‘opens’ a ring, and when the same number appears again, the ring is ‘closed’ and the number becomes available again. To create macrocycles, however, a new number has to be used because the two atoms to be connected lie at opposite ends of the SMILES string, with other rings opening and closing in between. We chose ‘3’ because the numbers in our SMILES fragments only go up to ‘2’ (in tryptophan).
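The choice of a free ring number can also be derived programmatically. The following sketch scans fragments for single-digit ring-closure labels and picks the next unused one. It is a simplified illustration: two-digit labels written with `%` are not handled, and the tryptophan fragment shown is our assumption of how such an entry may look.

```python
def ring_closure_digits(smiles):
    """Collect single-digit ring labels, ignoring digits inside [...] atoms."""
    digits = set()
    in_bracket = False
    for char in smiles:
        if char == "[":
            in_bracket = True
        elif char == "]":
            in_bracket = False
        elif char.isdigit() and not in_bracket:
            digits.add(int(char))
    return digits


# Hypothetical excerpt of the fragment table.
fragments = {
    "A": "N[C@@H](C)C(=O)",                  # L-alanine, no rings
    "W": "N[C@@H](Cc1c[nH]c2ccccc12)C(=O)",  # L-tryptophan (assumed entry)
}

used = set().union(*(ring_closure_digits(s) for s in fragments.values()))
macrocycle_label = max(used) + 1
print(used, macrocycle_label)
```

For these two fragments the labels in use are 1 and 2, so the first free label is 3.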

To generate SMILES for macrocyclic peptides, we still start by concatenating the SMILES fragments, as for linear peptides. The two ‘3’s are inserted only afterwards: for end-to-end cyclizations, one ‘3’ is inserted directly after the first character (i. e. after the N-terminal N) and the other is appended as the last character (after the C-terminal C(=O)). This way, both termini are connected by an amide bond (Fig. 4).

**Fig. 4:** Structure and SMILES code of the end-to-end cyclized peptide N-R-P-i-e-c-e-S. Cyclizations are marked in yellow. Because the number ‘3’ is uniquely used for macrocyclizations in our approach, it does not interfere with any other cycles such as the one in proline.
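The insertion of the two ring labels for end-to-end cyclization can be sketched as follows. The function name is hypothetical, and the example input is a toy dipeptide built from the L- and D-alanine fragments rather than one of our library members.

```python
def cyclize_end_to_end(linear_smiles, label="3"):
    """Connect the N-terminal N with the C-terminal C(=O) via a ring label."""
    if not linear_smiles.startswith("N") or not linear_smiles.endswith("C(=O)"):
        raise ValueError("expected a chain that starts with N and ends with C(=O)")
    # One label directly after the first character, one as the last character.
    return linear_smiles[0] + label + linear_smiles[1:] + label


# Toy example: L-alanine followed by D-alanine.
linear = "N[C@@H](C)C(=O)" + "N[C@H](C)C(=O)"
print(cyclize_end_to_end(linear))
```

The check at the start of the function reflects the rule that end-to-end cyclized sequences must not carry terminal modifications.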

For depsipeptides, only one ‘3’ at the very end needs to be added to the concatenated fragments. The other is provided by the depsicyclization-specific SMILES fragments, one (and only one) of which must be a part of the sequence for any depsipeptide. For example, chaiyaphumine A has the sequence PAA-T1-f-a-P-W and the SMILES fragment for T1, N[C@@H]([C@H](O3)C)C(=O), contains the second ‘3’ for the cyclization (Fig. 5).

**Fig. 5:** Structure and SMILES for chaiyaphumine A. Cyclization points are shown in yellow.
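For depsipeptides, the same idea reduces to appending a single label. The sketch below uses the T1 fragment given above together with the alanine fragments. The sequence T1-a-A is a toy example chosen to keep the fragment table short, not a peptide from our library.

```python
# Hypothetical excerpt of the fragment table; T1 already carries one '3'.
FRAGMENTS = {
    "T1": "N[C@@H]([C@H](O3)C)C(=O)",  # L-threonine used for depsicyclization
    "A": "N[C@@H](C)C(=O)",            # L-alanine
    "a": "N[C@H](C)C(=O)",             # D-alanine
}
DEPSI_CODES = {"T1", "S1", "t1", "s1"}


def depsi_smiles(sequence, label="3"):
    """Concatenate fragments and close the ring at the C-terminal C(=O)."""
    codes = sequence.split("-")
    if sum(code in DEPSI_CODES for code in codes) != 1:
        raise ValueError("exactly one depsicyclization residue is required")
    return "".join(FRAGMENTS[code] for code in codes) + label


print(depsi_smiles("T1-a-A"))
```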

Code and Output

Although the code looks up everything it needs in sheets 2 and 3, the output is given in sheet 1. Before running the code, the user must indicate the plasmids that they want to analyze in this sheet by writing the names in columns D, E and F (for starter, elongation and termination units) (Fig. 6).

**Fig. 6:** Sheet 1 before running the code.

Then, they specify the path to the Excel file in the code and run all cells consecutively. This is straightforward in Jupyter Notebook and requires no coding experience. If you are not familiar with Jupyter Notebook, we recommend Colab, which provides a very accessible web-based interface. The first cell installs rdkit (which we use to calculate masses and chemical formulas from the SMILES) and the second loads the provided Excel workbook. The following three cells fill out the columns in sheet 1 one after another: first, all combinations of the given plasmids are created (columns D-G); then, the amino acid sequence is created for each combination using the module look-up sheet (columns H and I); finally, the SMILES, mass and sum formula are calculated for each amino acid sequence (columns J-L). The final Excel workbook can then be downloaded. An example is shown in Fig. 7.

**Fig. 7:** Sheet 1 after running the code.
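The first of the three steps, enumerating every starter/elongation/termination combination, can be illustrated with `itertools.product` from the Python standard library. This is a sketch of the idea and not necessarily how our notebook implements it; all plasmid names below are placeholders.

```python
from itertools import product

# Placeholder plasmid names as they might be entered in columns D, E and F.
starters = ["pStart1", "pStart2"]
elongations = ["pElong1", "pElong2", "pElong3"]
terminations = ["pTerm1", "pTerm2"]

# Every starter is paired with every elongation and every termination unit.
combinations = list(product(starters, elongations, terminations))

print(len(combinations))
for starter, elongation, termination in combinations[:3]:
    print(starter, elongation, termination)
```

The number of rows grows as the product of the three list lengths, which is why calculating the masses by hand quickly becomes impractical.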

Automation of MS spectra analysis

Using the list of masses created with this script, we could also automate the MS analysis. DataAnalysis, the Bruker tool that we used for LC-MS analysis, provides the option to process multiple spectra in batches, which we did using the (slightly modified) script by Gonschorek et al.^[2] The script automatically creates EICs of the expected masses, colors the BPC and EICs to create ready-to-export figures and returns whether the area under the curve of the EICs is under or over a certain threshold (i. e. whether a compound with the correct mass was found or not).
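The core arithmetic of this analysis is small enough to sketch in Python. The constants are taken from the script listed on this page; the function names are hypothetical, and we assume here that the EIC width is given in m/z units and the relative area fraction in percent.

```python
PROTON_MASS = 1.007276        # added to the neutral mass to obtain [M+H]+
EIC_HALF_WIDTH = 0.2          # EIC window on each side of the target m/z
AREA_THRESHOLD = 3_000_000    # minimum integrated peak area
REL_AREA_FRACTION_THRESHOLD = 50  # minimum relative area fraction


def eic_window(monoisotopic_mass):
    """Return (lower, center, upper) m/z of the EIC for the [M+H]+ ion."""
    center = monoisotopic_mass + PROTON_MASS
    return center - EIC_HALF_WIDTH, center, center + EIC_HALF_WIDTH


def is_hit(area, relative_area_fraction):
    """A peak counts as detected only if it passes both thresholds."""
    return (area > AREA_THRESHOLD
            and relative_area_fraction > REL_AREA_FRACTION_THRESHOLD)


# Example with the monoisotopic mass of alanine (C3H7NO2, about 89.0477).
lower, center, upper = eic_window(89.0477)
print(round(lower, 4), round(center, 4), round(upper, 4))
print(is_hit(area=5_000_000, relative_area_fraction=80))
```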

Script for automated MS analysis


    Dim BPC
    Dim EIC
    ' Detection thresholds and EIC window width
    Dim areaThreshold
    areaThreshold = 3000000
    Dim EICwidth
    EICwidth = 0.2
    Dim relAreaFracThreshold
    relAreaFracThreshold = 50
    ' Define the Excel file path
    Dim excelFilePath
    excelFilePath = "PATH-TO_FILE\expIG040_script.xlsx"
    ' Create an instance of Excel
    Dim excelApp
    Set excelApp = CreateObject("Excel.Application")
    Dim workbook
    ' Open the Excel workbook
    Set workbook = excelApp.Workbooks.Open(excelFilePath)
    ' Work only on sheet1 now
    Dim sheet
    Set sheet = workbook.Sheets(1)
    ' Search column C for the cell that matches the current dataset name
    Dim cell
    Dim foundCell
    Set foundCell = Nothing
    For Each cell In sheet.Range("C:C")
        If cell.Value = analysis.name Then
            Set foundCell = cell
            Exit For
        End If
    Next
    ' Check if the matching cell was found
    If Not foundCell Is Nothing Then
        Set BPC = CreateObject("DataAnalysis.BPCChromatogramDefinition")
        Set EIC = CreateObject("DataAnalysis.EICChromatogramDefinition")
        ' Set the colors for the chromatograms
        BPC.Color = RGB(144, 144, 144) ' grey
        EIC.Color = RGB(200, 0, 60) ' red
        ' Add the base peak chromatogram (BPC) without integrating
        Analysis.Chromatograms.Clear
        Analysis.Chromatograms.AddChromatogram BPC
        ' Check for target masses in column K, then every 8th column
        Dim targetMass
        Dim col
        Dim count
        count = 0
        Dim num
        For col = 11 To sheet.UsedRange.Columns.Count Step 8
            targetMass = sheet.Cells(foundCell.Row, col).Value
            If Not IsEmpty(targetMass) Then
                count = count + 1
                Set EIC = CreateObject("DataAnalysis.EICChromatogramDefinition")
                ' Target m/z of the [M+H]+ ion: neutral mass plus one proton
                EIC.Range = targetMass + 1.007276
                EIC.Color = RGB(200 - (count * 25), 0, 60)
                EIC.widthleft = EICwidth
                EIC.widthright = EICwidth
                EIC.BackgroundType = daSpectral
                Analysis.Chromatograms.AddChromatogram EIC
                Analysis.Chromatograms(count + 1).AddRangeSelection 1.0, 15.0, 0, 0
                Analysis.Chromatograms(count + 1).IntegrateOnly
                Dim compoundFound
                compoundFound = False
                ' Determine output columns for RT + Area
                Dim rtCol, areaCol
                rtCol = col + 2 ' retention time (K->M, S->U, etc.)
                areaCol = col + 3 ' peak area (K->N, S->V, etc.)
                For num = 1 To Analysis.Compounds.Count
                    ' Check thresholds
                    If Analysis.Compounds(num).Area > areaThreshold _
                        And InStr(Analysis.Compounds(num).Chromatogram, "EIC") > 0 _
                        And Analysis.Compounds(num).RelativeAreaFraction > relAreaFracThreshold Then
                        ' Write values into sheet1, same row as dataset
                        sheet.Cells(foundCell.Row, rtCol).Value = Analysis.Compounds(num).RetentionTime / 60
                        sheet.Cells(foundCell.Row, areaCol).Value = Analysis.Compounds(num).Area
                        compoundFound = True
                    End If
                Next
                ' Color the mass cell green/red depending on detection
                If compoundFound Then
                    sheet.Cells(foundCell.Row, col).Interior.Color = RGB(0, 255, 0) ' Green
                Else
                    sheet.Cells(foundCell.Row, col).Interior.Color = RGB(255, 0, 0) ' Red
                End If
                Analysis.Compounds.Clear ' clear compound list
            End If
        Next
    Else
        MsgBox "No matching dataset name found in column C."
        workbook.Close False
        excelApp.Quit
        Set workbook = Nothing
        Set excelApp = Nothing
        WScript.Quit
    End If
    ' Close the workbook and quit Excel
    workbook.Close True ' Save changes
    excelApp.Quit
    ' Release Excel object references
    Set workbook = Nothing
    Set excelApp = Nothing
    ' Save data
    analysis.Save
    ' Close script window
    form.Close

References

[1] Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1), 31–36. https://doi.org/10.1021/ci00057a005

[2] Gonschorek, P. et al. (2025). Split inteins for generating combinatorial non-ribosomal peptide libraries. bioRxiv. https://doi.org/10.1101/2025.10.02.680031
