SMILES Script
Prediction of Peptide Structure
Abstract
For the high-throughput LC-MS measurement of our donor library, we needed to calculate the masses of 105 expected peptide products (part collection). Doing this by hand would be extremely time-consuming, contradicting the idea of a high-throughput approach and limiting accessibility for other iGEM teams. To solve this, we developed a script that automates the generation of chemical structures and their corresponding masses using SMILES concatenation. On this page, we explain how we built and applied this tool.
Introduction
Our tool is composed of a Jupyter Notebook (or Colab Notebook) and an Excel workbook with three sheets. The first sheet serves as the output, while the second and third sheets provide the input for the script. This setup ensures an accessible format, even for users with little to no programming experience. The output of our tool is a SMILES code (Simplified Molecular Input Line Entry System)[1] for every peptide that can be generated using our platform. SMILES is a line notation that turns complex organic molecules into a one-dimensional, computer-readable string. You do not have to understand how SMILES works in order to use our tool.
The rationale behind SMILES
In SMILES code, atoms are given the same abbreviations as in the periodic table (e. g. C for carbon, N for nitrogen, O for oxygen). Hydrogen atoms are not required (in most cases), as they are implicitly added when an atom does not have enough bonds. For example, the SMILES code for water is simply O. Oxygen has a valence of two, and since it is not explicitly connected to any other atom, there must be two implicit bonds to hydrogen. Double and triple bonds are shown with the symbols = and #, respectively. Single bonds can be shown with a -, but are usually omitted. The SMILES codes for ethane, ethene and ethyne are therefore CC (or less commonly C-C), C=C and C#C. Branches are shown with parentheses, so acetone is CC(=O)C. Rings are written by putting the same number after two atoms to indicate that they are connected: cyclohexane can be written as C1CCCCC1, where the numbers indicate that the first and the last carbon atom are bound together. Finally, stereochemistry is shown with the indicators @ and @@. These require all four substituents around the carbon atom to be specified, meaning that the α-hydrogens in stereogenic amino acids do need to be written explicitly. An example would be N[C@H](C)C(=O)O and N[C@@H](C)C(=O)O for D-alanine and L-alanine, respectively. Further features that can be included in SMILES strings are charge and aromaticity. Fig. 1 shows how a more complicated molecule can be expressed as SMILES.
The SMILES format turned out to be especially useful. First, it allows us to use established tools to calculate molecular masses. Second, SMILES strings can be directly copied into accessible software that converts them into chemical structures, giving us immediate visualizations of our peptides.
The SMILES look-up table
The third sheet of the Excel workbook functions as an internal “database”. It lists every amino acid or other moiety incorporated by an NRPS module, together with its corresponding SMILES code fragment. During the SMILES code generation of a peptide product, this sheet serves as a look-up table for the SMILES of the individual amino acids in the peptide. The peptide sequence is determined by the plasmids (see module look-up table) as well as the combination of the three plasmids used for every NRPS expression (see Code and Output).
These fragments are amino acids with the C-terminal OH group removed; for instance, the fragment for L-alanine is N[C@@H](C)C(=O). There are many ways of writing the same amino acid (OC(=O)[C@H](C)N would be equally valid SMILES for L-alanine), so our fragments need to follow stricter rules for the algorithm to work. Amino acids in the peptide chain must always start with the N of the N-terminal amino group and end with C(=O), representing the C-terminal carboxy group. The same goes for modifications: N-terminal modifications end with C(=O) (e. g. CC(=O) for acetic acid) and C-terminal modifications start with N (e. g. NCCCCN for putrescine). These SMILES can simply be concatenated to create a linear peptide. Fig. 2 shows how the SMILES from Fig. 1, which our code creates for the peptide sequence C4-W-f-AMMON, is assembled directly from the SMILES fragments. The SMILES that we create are also very readable because their main chain always follows the peptide chain. We have color-coded the main chain and side chains in this example, but with a bit of practice it is possible to visualize the chemical structure of a peptide just by looking at the SMILES, which was very helpful for bug fixing!
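The concatenation rule can be illustrated with a minimal Python sketch. It is not the notebook's actual code: the dictionary keys (AC, A, a, PUT) are hypothetical sheet codes chosen for this example, and the D-alanine fragment is derived from the D-alanine SMILES given earlier by removing the C-terminal OH in the same way as for L-alanine.

```python
# Illustrative look-up table. Fragment strings follow the rules in the text;
# the keys "AC", "A", "a" and "PUT" are hypothetical codes for this example.
FRAGMENTS = {
    "AC": "CC(=O)",          # acetic acid, N-terminal modification
    "A": "N[C@@H](C)C(=O)",  # L-alanine
    "a": "N[C@H](C)C(=O)",   # D-alanine (derived from the D-alanine SMILES)
    "PUT": "NCCCCN",         # putrescine, C-terminal modification
}


def linear_smiles(sequence: str) -> str:
    """Concatenate the fragments of a hyphen-separated sequence."""
    return "".join(FRAGMENTS[code] for code in sequence.split("-"))


example = linear_smiles("AC-A-a-PUT")
print(example)  # CC(=O)N[C@@H](C)C(=O)N[C@H](C)C(=O)NCCCCN
```

Because every fragment starts with N or ends with C(=O), each junction in the joined string reads C(=O)N, which is exactly the amide bond of the peptide backbone.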
The module look-up table
The second sheet is a look-up sheet containing the names of all plasmids that we used for peptide production and the amino acids that the NRPS modules on these plasmids incorporate. Fig. 3 shows how this sheet looks with the nine plasmids that we used for our initial intein shuffling experiments.
It is straightforward to add a new plasmid to this table: The amino acid sequence is added in column D as a one-letter code with hyphens between the amino acids (e. g. P-V for pIG023, which incorporates proline and valine). L-amino acids are uppercase, D-amino acids are lowercase. Unusual amino acids, methylated amino acids and C- and N-terminal modifications can also be given (e. g. PAA for phenylacetic acid).
Column C asks whether the unit starts with an E domain or a dual E/C domain. This is important because these domains epimerize the amino acid incorporated by the upstream A domain, which is on a different plasmid. Consequently, the last amino acid incorporated by each plasmid must always be given as uppercase, since whether it is epimerized is determined only by the next downstream plasmid.
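A minimal sketch of how this rule could be applied when the sequences of several plasmids are joined. The function name, the tuple layout and the example units are assumptions made for illustration and are not taken from the notebook; only the sequence P-V appears in the text above.

```python
def join_plasmids(units):
    """Join per-plasmid sequences into one peptide sequence.

    `units` is a list of (starts_with_e_domain, sequence) tuples ordered from
    the starter to the termination plasmid. If a unit starts with an E or dual
    E/C domain, the last residue of the upstream unit is switched to lowercase,
    i.e. marked as a D-amino acid.
    """
    residues = []
    for starts_with_e_domain, sequence in units:
        if starts_with_e_domain and residues:
            residues[-1] = residues[-1].lower()
        residues.extend(sequence.split("-"))
    return "-".join(residues)


# Hypothetical three-plasmid combination; the second unit starts with an E domain.
units = [(False, "P-V"), (True, "L-F"), (False, "A-AMMON")]
print(join_plasmids(units))  # P-v-L-F-A-AMMON
```

Only the valine is lowercased here, because only the second unit starts with an E domain. This is why the last residue on each plasmid is always entered as uppercase in the sheet.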
Column E (which must only be filled out for termination plasmids) contains information about the type of peptide produced. ‘0’ means that the peptide will be linear. In this case, the amino acid sequence must not end with the last amino acid, but with the nucleophile that releases the peptide from the NRPS: ammonia (given as ‘AMMON’ in our amino acid code) will lead to a primary amide C-terminus, water (given as ‘WATER’) leads to a carboxylic acid C-terminus and a number of amines such as putrescine (given as ‘PUT’) will lead to secondary amide C-termini.
If column E contains a ‘1’, the peptide is end-to-end cyclized (i. e. the N-terminus and the C-terminus form an amide bond). In this case, the amino acid sequences must contain neither C-terminal nor N-terminal modifications. Finally, a ‘2’ in column E means that the TE domain creates a depsipeptide. In this case, the C-terminus must not be modified (but N-terminal modifications are acceptable). However, the one-letter code of the threonine or serine at which the depsicyclization occurs must be followed by a ‘1’ (i. e. T1, S1, t1 or s1).
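The rules for column E lend themselves to a simple consistency check before any SMILES is generated. The sketch below is an assumption about how such a check could look and is not part of our notebook. The two code sets are deliberately small and hypothetical; in practice they would be read from the SMILES look-up sheet.

```python
DEPSI_CODES = {"T1", "S1", "t1", "s1"}
# Hypothetical, incomplete code sets used only for this example.
N_TERMINAL_MODS = {"PAA"}
NUCLEOPHILES = {"AMMON", "WATER", "PUT"}


def check_sequence(sequence: str, peptide_type: int) -> list:
    """Return a list of rule violations (empty if none were found)."""
    codes = sequence.split("-")
    problems = []
    has_n_mod = codes[0] in N_TERMINAL_MODS
    has_nucleophile = codes[-1] in NUCLEOPHILES
    n_depsi = sum(code in DEPSI_CODES for code in codes)

    if peptide_type == 0:
        if not has_nucleophile:
            problems.append("linear peptide must end with a releasing nucleophile")
    elif peptide_type == 1:
        if has_n_mod or has_nucleophile:
            problems.append("end-to-end cyclized peptide must not be modified")
    elif peptide_type == 2:
        if has_nucleophile:
            problems.append("depsipeptide must not have a modified C-terminus")
        if n_depsi != 1:
            problems.append("depsipeptide needs exactly one of T1, S1, t1, s1")
    else:
        problems.append("column E must be 0, 1 or 2")
    return problems


print(check_sequence("PAA-T1-f-a-P-W", 2))  # []
print(check_sequence("P-V", 0))             # one violation reported
```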
Macrocyclization
You might have noticed in Fig. 1 that the number ‘1’ is used to close two different rings. This is possible because SMILES are parsed from left to right: whenever there is a number, that ring is ‘opened’, and when the same number appears again, the ring is ‘closed’ and the number becomes available again. However, to create macrocycles, a new number has to be used because the two parts that will be connected are on completely different ends of the SMILES string. We chose ‘3’ because the numbers in our SMILES fragments only go up to ‘2’ (in tryptophan).
To generate SMILES for macrocyclic peptides, we still start by concatenating the SMILES fragments as for linear peptides. The two ‘3’s are inserted only after this process: for end-to-end cyclizations, one ‘3’ is inserted directly after the first character (i. e. after the N-terminal N) and the other is inserted as the last character (after the C-terminal C(=O)). This way, both termini are connected with an amide bond (Fig. 4).
For depsipeptides, only one ‘3’ at the very end needs to be added to the concatenated fragments. The other is provided by the depsicyclization-specific SMILES fragments, one (and only one) of which must be a part of the sequence for any depsipeptide. For example, chaiyaphumine A has the sequence PAA-T1-f-a-P-W and the SMILES fragment for T1, N[C@@H]([C@H](O3)C)C(=O), contains the second ‘3’ for the cyclization (Fig. 5).
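Both cyclization modes come down to small string operations on the concatenated fragments. The following sketch illustrates them under stated assumptions. The function names are hypothetical, and the depsipeptide check assumes that every depsicyclization fragment contains the substring O3, as the T1 fragment above does. The two-residue inputs are toy examples to show the string handling, not real products.

```python
RING_LABEL = "3"  # the label reserved for the macrocycle in the text


def cyclize_end_to_end(linear: str) -> str:
    """Insert the ring label after the N-terminal N and after the final C(=O)."""
    if not linear.startswith("N") or not linear.endswith("C(=O)"):
        raise ValueError("expected a chain that starts with N and ends with C(=O)")
    return linear[0] + RING_LABEL + linear[1:] + RING_LABEL


def cyclize_depsi(linear: str) -> str:
    """Append the closing label; the opening one comes from T1, S1, t1 or s1."""
    if not linear.endswith("C(=O)"):
        raise ValueError("expected a chain that ends with C(=O)")
    if linear.count("O" + RING_LABEL) != 1:
        raise ValueError("expected exactly one depsicyclization fragment")
    return linear + RING_LABEL


ALA = "N[C@@H](C)C(=O)"          # L-alanine fragment
T1 = "N[C@@H]([C@H](O3)C)C(=O)"  # depsicyclization fragment for threonine

print(cyclize_end_to_end(ALA + ALA))  # N3[C@@H](C)C(=O)N[C@@H](C)C(=O)3
print(cyclize_depsi(T1 + ALA))        # N[C@@H]([C@H](O3)C)C(=O)N[C@@H](C)C(=O)3
```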
Code and Output
Although the code looks up everything it needs in sheets 2 and 3, the output is given in sheet 1. Before running the code, the user must indicate the plasmids that they want to analyze in this sheet by writing the names in columns D, E and F (for starter, elongation and termination units) (Fig. 6).
Then, they specify the path to the Excel file in the code and run all cells consecutively. This is very intuitive with Jupyter Notebooks and requires no coding experience. If you are not familiar with Jupyter Notebook, we recommend that you use Colab, which provides a very accessible web-based interface. The first cell of code installs rdkit (which we use to calculate masses and chemical formulas from the SMILES) and the second loads the provided Excel workbook. The following three cells consecutively fill out the columns in sheet 1: first, all permutations of the given plasmids are created (columns D-G); then, the amino acid sequence is created for each of the permutations using the module look-up sheet (columns H and I); and finally, the SMILES, mass and sum formula are calculated for each amino acid sequence (columns J-L). The final Excel workbook can then be downloaded; an example is shown in Fig. 7.
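The first of these steps can be sketched with the Python standard library, assuming that “all permutations” means every combination of one starter, one elongation and one termination plasmid. The plasmid names below are placeholders and not real constructs.

```python
from itertools import product

# Placeholder plasmid names, as they might be entered in columns D, E and F.
starters = ["starter_A", "starter_B"]
elongations = ["elongation_A", "elongation_B"]
terminations = ["termination_A"]

# One tuple per starter/elongation/termination combination.
combinations = list(product(starters, elongations, terminations))

for starter, elongation, termination in combinations:
    print(starter, elongation, termination)
```

With two starters, two elongation units and one termination unit this yields 2 × 2 × 1 = 4 rows. Each row would then be translated into an amino acid sequence and a SMILES string.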
Automation of MS spectra analysis
Using the list of masses created with this script, we could also automate the MS analysis. DataAnalysis, the Bruker tool that we used for LC-MS analysis, provides the option to process multiple spectra in batches, which we did using the (slightly modified) script by Gonschorek et al.[2] The script automatically creates EICs of the expected masses, colors the BPC and EICs to create ready-to-export figures and returns whether the area under the curve of the EICs is under or over a certain threshold (i. e. whether a compound with the correct mass was found or not).
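The detection logic of the script can be summarized in a few lines of Python. This is only a sketch of the criterion, not a replacement for the DataAnalysis script. The constants are taken from that script; reading the added 1.007276 as the proton mass of an [M+H]+ ion and the EIC width as a half-width in m/z units is our interpretation, and the function names are hypothetical.

```python
PROTON_MASS = 1.007276            # added to the neutral mass, i.e. [M+H]+
EIC_WIDTH = 0.2                   # window half-width (units assumed to be m/z)
AREA_THRESHOLD = 3_000_000        # minimum peak area
REL_AREA_FRACTION_THRESHOLD = 50  # minimum relative area fraction


def eic_window(neutral_mass: float) -> tuple:
    """Return the lower and upper bound of the extraction window."""
    center = neutral_mass + PROTON_MASS
    return center - EIC_WIDTH, center + EIC_WIDTH


def compound_found(peaks) -> bool:
    """`peaks` is an iterable of (area, relative_area_fraction) pairs."""
    return any(
        area > AREA_THRESHOLD and fraction > REL_AREA_FRACTION_THRESHOLD
        for area, fraction in peaks
    )


low, high = eic_window(500.0)
print(round(low, 6), round(high, 6))
print(compound_found([(4_000_000, 60), (1_000_000, 90)]))
```

A peak counts as a hit only if it passes both thresholds, which mirrors the combined condition in the script.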
Script for automated MS analysis
Dim BPC
Dim EIC
' Minimum peak area for a compound to count as detected
Dim areaThreshold
areaThreshold = 3000000
' Width of the EIC window on each side of the target m/z
Dim EICwidth
EICwidth = 0.2
' Minimum relative area fraction for a compound to count as detected
Dim relAreaFracThreshold
relAreaFracThreshold = 50
' Define the Excel file path
Dim excelFilePath
excelFilePath = "PATH-TO_FILE\expIG040_script.xlsx"
' Create an instance of Excel
Dim excelApp
Set excelApp = CreateObject("Excel.Application")
Dim workbook
' Open the Excel workbook
Set workbook = excelApp.Workbooks.Open(excelFilePath)
' Work only on sheet1 now
Dim sheet
Set sheet = workbook.Sheets(1)
' Search column C for the cell that matches the current dataset name
Dim cell
Dim foundCell
Set foundCell = Nothing
For Each cell In sheet.Range("C:C")
    If cell.Value = analysis.name Then
        Set foundCell = cell
        Exit For
    End If
Next
' Check if the matching cell was found
If Not foundCell Is Nothing Then
    Set BPC = CreateObject("DataAnalysis.BPCChromatogramDefinition")
    Set EIC = CreateObject("DataAnalysis.EICChromatogramDefinition")
    ' Set the colors for the chromatograms
    BPC.Color = RGB(144, 144, 144) ' grey
    EIC.Color = RGB(200, 0, 60) ' red
    ' Add the base peak chromatogram (BPC) without integrating
    Analysis.Chromatograms.Clear
    Analysis.Chromatograms.AddChromatogram BPC
    ' Check for target masses in column K, then every 8th column
    Dim targetMass
    Dim col
    Dim count
    count = 0
    For col = 11 To sheet.UsedRange.Columns.Count Step 8
        targetMass = sheet.Cells(foundCell.Row, col).Value
        If Not IsEmpty(targetMass) Then
            count = count + 1
            Set EIC = CreateObject("DataAnalysis.EICChromatogramDefinition")
            ' Neutral mass plus proton mass, i.e. the [M+H]+ ion
            EIC.Range = targetMass + 1.007276
            EIC.Color = RGB(200 - (count * 25), 0, 60)
            EIC.widthleft = EICwidth
            EIC.widthright = EICwidth
            EIC.BackgroundType = daSpectral
            Analysis.Chromatograms.AddChromatogram EIC
            ' The BPC is chromatogram 1, so the current EIC is count + 1
            Analysis.Chromatograms(count + 1).AddRangeSelection 1.0, 15.0, 0, 0
            Analysis.Chromatograms(count + 1).IntegrateOnly
            Dim compoundFound
            compoundFound = False
            ' Determine output columns for RT + Area
            Dim rtCol, areaCol
            rtCol = col + 2 ' retention time (K->M, S->U, etc.)
            areaCol = col + 3 ' peak area (K->N, S->V, etc.)
            For num = 1 To Analysis.Compounds.Count
                ' Check thresholds
                If Analysis.Compounds(num).Area > areaThreshold _
                    And InStr(Analysis.Compounds(num).Chromatogram, "EIC") > 0 _
                    And Analysis.Compounds(num).RelativeAreaFraction > relAreaFracThreshold Then
                    ' Write values into sheet1, same row as dataset
                    sheet.Cells(foundCell.Row, rtCol).Value = Analysis.Compounds(num).RetentionTime / 60
                    sheet.Cells(foundCell.Row, areaCol).Value = Analysis.Compounds(num).Area
                    compoundFound = True
                End If
            Next
            ' Color the mass cell green/red depending on detection
            If compoundFound Then
                sheet.Cells(foundCell.Row, col).Interior.Color = RGB(0, 255, 0) ' Green
            Else
                sheet.Cells(foundCell.Row, col).Interior.Color = RGB(255, 0, 0) ' Red
            End If
            Analysis.Compounds.Clear ' clear compound list
        End If
    Next
Else
    MsgBox "No matching dataset name found in column C."
    workbook.Close False
    excelApp.Quit
    Set workbook = Nothing
    Set excelApp = Nothing
    WScript.Quit
End If
' Close the workbook and quit Excel
workbook.Close True ' Save changes
excelApp.Quit
' Release Excel object references
Set workbook = Nothing
Set excelApp = Nothing
' Save data
analysis.Save
' Close script window
form.Close
References
[1] Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1), 31–36. https://doi.org/10.1021/ci00057a005
[2] Gonschorek, P. et al. (2025). Split inteins for generating combinatorial non-ribosomal peptide libraries. bioRxiv. https://doi.org/10.1101/2025.10.02.680031