Model - ALSense

Glossary

Input:

Information that is introduced to a program. In our case it is a set of random DNA sequences and the sequence of the protein of interest.

Output:

Information that the program gives to the user when it is finished. In our case it's a set of DNA sequences.

Protein of interest:

Protein that one wants to detect with an aptamer.

Protein of negative interest:

Protein that one wants the aptamer not to bind to.

Aptamer:

Polynucleotide that acquires a stable three-dimensional conformation and is able to specifically recognize another molecule.

API:

Aptamer-Protein Interaction.

Affinity:

Strength of the binding between the aptamer (or antibody or nanobody) and its target molecule.

Specificity:

Capacity of the aptamer to bind to the target and not to other molecules.

Overview

To improve the performance of the experimental SELEX procedure, we have designed a preliminary in silico strategy that could be applied if sufficient computational resources were available. We call the pipeline inSilico SELEX because it follows the same rationale as the experimental SELEX, but replaces the experimental evaluation of binding with a computational prediction.

The pipeline comprises three main steps (Figure 1):

DNA variability generation
Protein–DNA interaction prediction
Binding score rating and selection

The input of the whole program is a random set of DNA sequences together with the sequence of the protein of interest. The output is a reduced set of DNA sequences with expected high affinity.
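The three steps above can be sketched as a small loop. This is only an illustration under stated assumptions: `placeholder_score` is a hypothetical stand-in with no biological meaning (a real run would call one of the binding predictors discussed under Software Tools), and the names `random_library`, `select_top`, `run_rounds` and the `keep_fraction` parameter are ours, not part of any existing tool. How variability is regenerated between rounds is not specified here, so the sketch simply shrinks the pool each round.

```python
import random

NUCLEOTIDES = "ACGT"

def random_library(n_sequences, length, rng):
    """Step 1: generate a random set of DNA sequences."""
    return ["".join(rng.choice(NUCLEOTIDES) for _ in range(length))
            for _ in range(n_sequences)]

def placeholder_score(dna, protein):
    """Step 2 (hypothetical placeholder): stand-in for a binding predictor.

    Returns the GC fraction of the DNA and ignores the protein entirely.
    It only exists so the loop can run; it predicts nothing.
    """
    return (dna.count("G") + dna.count("C")) / len(dna)

def select_top(sequences, protein, score_fn, keep_fraction):
    """Step 3: rank by predicted score and keep the best fraction."""
    ranked = sorted(sequences, key=lambda s: score_fn(s, protein), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

def run_rounds(library, protein, score_fn, keep_fraction, rounds):
    """Apply scoring and selection repeatedly, shrinking the pool each round."""
    pool = list(library)
    for _ in range(rounds):
        pool = select_top(pool, protein, score_fn, keep_fraction)
    return pool

rng = random.Random(0)                       # fixed seed for reproducibility
library = random_library(100, 35, rng)       # 100 random 35-mers
survivors = run_rounds(library, "MKT", placeholder_score, 0.5, rounds=3)
```

Keeping the scoring function as a parameter means any of the three predictors could be swapped in without touching the selection logic.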

Before applying the method to any protein of interest, it would be necessary to validate it (Figure 1.A). To do so, we would use a protein sequence with a known high-affinity aptamer binder. Starting with a random set of DNA sequences, we would perform several inSilico SELEX rounds and evaluate whether we recover the known aptamer sequence. The process would be repeated for a dataset of proteins with known high-affinity aptamers, and if a sequence similar to the real one is found in enough cases, the approach would be considered reliable.

Figure 1: inSilico SELEX. A. Validation approach to inSilico SELEX. B. Approach to inSilico SELEX when using a protein of interest, with the purpose of guiding the experimental SELEX.

If the validation succeeds, we would proceed to use the program on our proteins of interest (Figure 1.B). After several rounds of inSilico SELEX, the reduced set of DNA sequences would be aligned and turned into what we call a "probabilistic sequence". This is a DNA sequence in which each position holds not a single nucleotide but a combination of the four, each with its own probability. This probabilistic sequence would be used to order the real DNA library used for the experimental SELEX.
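As an illustration, a probabilistic sequence can be computed as the per-position nucleotide frequencies of the aligned sequences. The sketch below assumes the sequences are already aligned, have equal length and contain no gap characters; the function name `probabilistic_sequence` is ours.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"

def probabilistic_sequence(aligned):
    """Return, for each position, the frequency of A, C, G and T.

    Assumes gap-free sequences of equal length (an assumption of this sketch).
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to equal length")
    profile = []
    for column in zip(*aligned):             # iterate over alignment columns
        counts = Counter(column)
        total = sum(counts[n] for n in NUCLEOTIDES)
        profile.append({n: counts[n] / total for n in NUCLEOTIDES})
    return profile

# Toy example with four aligned 4-mers
profile = probabilistic_sequence(["ACGT", "ACGA", "ACTT", "AAGT"])
```

In this toy example the first position is always A, while the second position is C in three of the four sequences and A in the remaining one.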

Software Tools

In more detail, we have considered three possible ways of evaluating the binding between the protein and the aptamer (Figure 2).

AlphaFold + FoldX

AlphaFold would be used to fold both the protein and the aptamer and also to dock them. The resulting structural complex would be the input for FoldX, an empirical force field calibrated on experimental ΔΔG data. We would use the FoldX AnalyseComplex command, which estimates the interaction energy between the molecules of a complex.

A limitation of this approach is that AlphaFold returns a single docking solution, when in reality there may be many in a dynamic equilibrium. We expect the predicted solution to be the most stable one in most cases. When it is not, the random initial set is likely to contain several very similar sequences that, when submitted to AlphaFold, would give slightly different docking conformations.
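As a rough sketch of the FoldX step, the snippet below only builds the command line for an AnalyseComplex run; it does not execute anything. The flag names follow the FoldX documentation as we understand it and should be checked against the installed version. The executable name `foldx`, the file name `complex.pdb` and the chain identifiers A (protein) and B (aptamer) are assumptions made for illustration.

```python
def foldx_analyse_complex_command(pdb_file, chains=("A", "B"), executable="foldx"):
    """Build the argument list for a FoldX AnalyseComplex run.

    executable, pdb_file and chains are illustrative assumptions; verify the
    flags against the documentation of the FoldX version in use.
    """
    return [
        executable,
        "--command=AnalyseComplex",
        f"--pdb={pdb_file}",
        f"--analyseComplexChains={','.join(chains)}",
    ]

command = foldx_analyse_complex_command("complex.pdb")
# The list could then be passed to subprocess.run(command, check=True)
# on a machine where FoldX is installed.
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.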

AptaTrans

AptaTrans is a pipeline for predicting aptamer–protein interaction (API) using deep learning techniques. It would replace the combination of AlphaFold + FoldX.

AptaBLE

AptaBLE is a sequence-based language model trained on its own experimentally generated databases of DNA and RNA. According to its authors' benchmarking analysis, it outperforms both AptaTrans and AlphaFold. Nevertheless, we would try the three approaches and analyze which works best in our cases.

Figure 2: inSilico SELEX variants. A. inSilico SELEX when using AlphaFold to dock the protein and the aptamer and FoldX to calculate the stability of the complex. B, C. inSilico SELEX when using AptaTrans and AptaBLE, respectively, to calculate the binding or affinity scores. The thickness of the gray line represents the number of DNA sequences at each step.

Complementation with Negative Selection

The inSilico SELEX can be complemented with a negative selection loop (Figure 3). This double loop would require not only the target protein of interest but also another protein (or set of proteins) to which the aptamer shouldn't bind (referred to as "proteins of negative interest" from now on). The complementary negative loop would take the filtered sequences with high affinity for the protein of interest and analyze their interaction with the protein(s) of negative interest. The sequences with high affinity for the protein(s) of negative interest would be discarded, while the ones with the lowest affinities for them would be kept. In short, the first loop increases affinity and the second, negative loop increases specificity.
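The double loop can be sketched as two successive filters. In this minimal example the scoring functions are passed in as arguments and the toy scores are invented for illustration only; `double_selection`, `keep_positive` and `keep_final` are hypothetical names.

```python
def double_selection(sequences, score_positive, score_negative,
                     keep_positive, keep_final):
    """Positive selection for affinity, then negative selection for specificity."""
    # Positive loop: keep the sequences with the highest predicted affinity
    # for the protein of interest.
    high_affinity = sorted(sequences, key=score_positive, reverse=True)[:keep_positive]
    # Negative loop: among those, keep the ones with the lowest predicted
    # affinity for the protein(s) of negative interest.
    return sorted(high_affinity, key=score_negative)[:keep_final]

# Invented toy scores, for illustration only
positive = {"s1": 0.9, "s2": 0.8, "s3": 0.7, "s4": 0.1}
negative = {"s1": 0.9, "s2": 0.2, "s3": 0.1, "s4": 0.0}

selected = double_selection(list(positive), positive.get, negative.get,
                            keep_positive=3, keep_final=2)
```

Here `s4` has the lowest score against the protein of negative interest, but it never reaches the negative loop because its affinity for the target is too low; `s1` binds the target well but is discarded for binding the negative protein too.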

In the case of our biomarkers, we want to specifically detect an isoform of the protein which is the product of an exon inclusion event. Therefore, the aptamer has to bind to the peptide product of the included exon. In the positive selection loop, we would use the protein with the included peptide, but to prevent the aptamer from binding other regions of the protein apart from the included peptide, we would use the WT form of the protein (without the peptide) in the negative selection loop. In this way, all the aptamers that bind to other parts of the protein would be discarded.

Figure 3: inSilico SELEX with positive and negative selections. In the positive selection, the complexes with the highest binding scores are kept and enter the negative selection, where the complexes with the lowest binding scores are kept.

Impact of inSilico SELEX on Experimental SELEX

As mentioned above, the result of the inSilico SELEX would be a "probabilistic sequence" of DNA. This would be used to direct the experimental DNA library design. When ordering the library (from IDT or similar companies), the customer can define the probability of each nucleotide being added at each sequence position. As shown in Figure 4, the probability is normally the same for every nucleotide, but we want to tune the probabilities at each position. The purpose is to increase the probability of finding a high-affinity sequence by leaving out the sequences that are predicted not to bind.
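One practical detail is turning the frequencies at a position into mixing percentages. The sketch below assumes, purely for illustration, that the order form expects integer percentages adding up to 100 at each position; the actual format depends on the supplier. It uses largest-remainder rounding, and the function name `to_percentages` is ours.

```python
def to_percentages(frequencies):
    """Convert nucleotide frequencies into integer percentages summing to 100.

    Uses largest-remainder rounding. That the supplier wants integer
    percentages is an assumption of this sketch.
    """
    total = sum(frequencies.values())
    raw = {n: 100 * f / total for n, f in frequencies.items()}
    rounded = {n: int(v) for n, v in raw.items()}        # round down first
    leftover = 100 - sum(rounded.values())
    # Give the remaining points to the largest fractional parts.
    by_remainder = sorted(raw, key=lambda n: raw[n] - rounded[n], reverse=True)
    for n in by_remainder[:leftover]:
        rounded[n] += 1
    return rounded

uniform = to_percentages({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
biased = to_percentages({"A": 1 / 3, "C": 1 / 3, "G": 1 / 3, "T": 0.0})
```

The uniform case corresponds to the traditional random library, while the biased case shows that a nucleotide never observed at a position receives a percentage of zero.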

Figure 4: Traditional vs inSilico-SELEX-guided DNA library design. In the usual way of ordering a random DNA library, the 4 nucleotides have the same chance of being incorporated at each position. In the inSilico-SELEX-guided DNA library design, the probability of each nucleotide is not equal, but based on its frequency among the sequences resulting from the inSilico SELEX rounds.

This wouldn't be necessary if we could experimentally try all the possibilities, but that's not the case (Figure 5). A purchased library contains about 10¹⁵ sequences. Considering that an aptamer has a mean length of 35 nucleotides (from 20 to 50 approx.), the number of possible sequences is 4³⁵, which is roughly 10²¹. The purchased library therefore covers about 1/10⁶ of the total possibilities, or 0.0001%. It's true that new sequences can be explored when a new round of SELEX starts. Nevertheless, even if 10 rounds are done, the explored percentage would reach 0.001% at most.
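The coverage estimate can be checked with a short calculation. The library size and aptamer length are the values quoted in the text; the assumption that every round explores an entirely new set of sequences is an optimistic upper bound.

```python
library_size = 10**15          # sequences in a purchased library
aptamer_length = 35            # mean aptamer length in nucleotides
rounds = 10                    # number of SELEX rounds considered

sequence_space = 4**aptamer_length            # all possible 35-mers
fraction = library_size / sequence_space      # share covered by one library
# Upper bound: assumes each round explores completely new sequences
fraction_after_rounds = rounds * fraction

print(f"possible sequences: {sequence_space:.2e}")
print(f"covered by one library: {fraction:.2e}")
print(f"upper bound after {rounds} rounds: {fraction_after_rounds:.2e}")
```

Because Python integers have arbitrary precision, `4**35` is computed exactly before the division.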

Using inSilico SELEX to guide the experimental SELEX increases the probability of having the highest-affinity sequence among the ones tried. As schematically represented in Figure 5, the pipeline analyzes many more sequences than could ever be tried experimentally and discards sequences of predicted low affinity. This reduces the SELEX experimentation time and the resource consumption, and increases the overall probability of success.

Figure 5: Increase in the probability of success using inSilico SELEX. Considering that the aptamer has a length of 35 nucleotides, ordering a library of 10¹⁵ sequences would mean exploring roughly 0.0001% of all the possible sequences (10¹⁵ out of 4³⁵, which is about 10²¹). When guiding the DNA library design, the low-probability sequences would be discarded and the probability of having a high-affinity sequence among the 10¹⁵ sequences ordered would be higher.