Overview
To improve the performance of the experimental SELEX procedure, we have designed a preliminary in silico strategy that could be applied if sufficient computational resources were available. We call the pipeline inSilico SELEX because it follows the same rationale as experimental SELEX, but replaces the experimental evaluation of binding with a computational prediction.
The pipeline comprises three main steps (Figure 1):
- DNA variability generation
- Protein–DNA interaction prediction
- Binding score rating and selection
The input of the whole program is a random set of DNA sequences together with the sequence of the protein of interest. The output is a reduced set of DNA sequences with expected high affinity.
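The three steps above can be sketched as a simple selection loop. This is only a minimal illustration under stated assumptions: the point-mutation scheme, the pool size, the fraction kept per round and all function names are hypothetical, and `toy_binding_score` is a stand-in that does not predict real binding. In the actual pipeline it would be replaced by one of the predictors described under Software Tools.

```python
import random

NUCLEOTIDES = "ACGT"


def random_library(n_sequences, length, rng):
    """Input: a random starting set of DNA sequences."""
    return ["".join(rng.choice(NUCLEOTIDES) for _ in range(length))
            for _ in range(n_sequences)]


def mutate(sequence, rate, rng):
    """Step 1 (variability): resample each base with probability `rate`.
    Assumed scheme; a resampled base may equal the original one."""
    return "".join(rng.choice(NUCLEOTIDES) if rng.random() < rate else base
                   for base in sequence)


def toy_binding_score(dna, protein):
    """Step 2 placeholder, NOT a real predictor: it ignores the protein
    and simply returns the G content of the DNA sequence."""
    return dna.count("G") / len(dna)


def selex_round(pool, protein, score_fn, keep_fraction, rate, rng):
    """Step 3: rank by predicted score, keep the best, refill by mutation."""
    ranked = sorted(pool, key=lambda s: score_fn(s, protein), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    survivors = ranked[:n_keep]
    offspring = [mutate(rng.choice(survivors), rate, rng)
                 for _ in range(len(pool) - n_keep)]
    return survivors + offspring


def in_silico_selex(protein, n_rounds=5, pool_size=200, length=35,
                    keep_fraction=0.2, rate=0.05, seed=0,
                    score_fn=toy_binding_score):
    """Run several rounds and return the reduced set of sequences."""
    rng = random.Random(seed)
    pool = random_library(pool_size, length, rng)
    for _ in range(n_rounds):
        pool = selex_round(pool, protein, score_fn, keep_fraction, rate, rng)
    ranked = sorted(pool, key=lambda s: score_fn(s, protein), reverse=True)
    return ranked[:max(1, int(pool_size * keep_fraction))]


# Hypothetical protein sequence, used only to show the call signature.
selected = in_silico_selex("MKTAYIAKQR")
print(len(selected), "sequences kept")
```

Keeping the top-ranked sequences unchanged between rounds means the best predicted score never decreases, which mirrors the enrichment expected from experimental rounds.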
In an initial approach, before applying the method to any protein of interest, it would be necessary to validate the approach (Figure 1.A). To do so, we would use a protein sequence with a known high-affinity aptamer binder. Starting with a random set of DNA sequences, we would perform several inSilico SELEX rounds and evaluate whether or not we recover the known aptamer sequence. The process would be repeated for a dataset of many identified proteins with high-affinity aptamers, and if a sequence similar to the real one is found in enough cases, the approach would be considered reliable.
If the validation succeeds, we would proceed to use the program on our proteins of interest (Figure 1.B). After several rounds of inSilico SELEX, the reduced set of DNA sequences would be aligned and turned into what we call a "probabilistic sequence". This is a DNA sequence in which each position holds not a single nucleotide but the four nucleotides, each with its own probability. This probabilistic sequence would be used to order the real DNA library for the experimental SELEX.
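As an illustration, a probabilistic sequence can be computed as the per-position nucleotide frequencies of the aligned set. The sketch below assumes the sequences are already aligned, gap-free and of equal length; the function name and the example sequences are hypothetical.

```python
from collections import Counter


def probabilistic_sequence(aligned, alphabet="ACGT"):
    """Return one {nucleotide: probability} dict per alignment column.

    Assumes equal-length, gap-free sequences over `alphabet`.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        profile.append({base: counts[base] / len(column) for base in alphabet})
    return profile


# Made-up example set, not real aptamer data.
example = ["ACGT", "ACGA", "ACCT", "AGGT"]
for position, probabilities in enumerate(probabilistic_sequence(example), 1):
    percentages = {b: round(100 * p) for b, p in probabilities.items()}
    print(position, percentages)
```

Expressing each column as percentages gives the per-position mixing ratios that would be specified when ordering the library.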
Software Tools
In more detail, we have considered three possible ways of evaluating the binding between the protein and the aptamer (Figure 2).
AlphaFold + FoldX
AlphaFold would be used to fold both the protein and the aptamer and also to dock them. The resulting structural complex would be the input for FoldX, an empirical force field based on experimental ΔΔG data. We would run the FoldX AnalyseComplex command, which estimates the interaction energy of a complex.
A limitation of this approach is that AlphaFold returns only one docking solution, when in reality there may be many in dynamic equilibrium. Nevertheless, we expect the software to return the most stable one most of the time. When it does not, the random initial set will probably contain several very similar sequences that, when given to AlphaFold, would yield slightly different docking conformations.
AptaTrans
AptaTrans is a pipeline for predicting aptamer–protein interaction (API) using deep learning techniques. It would replace the combination of AlphaFold + FoldX.
AptaBLE
AptaBLE is a sequence-based language model trained on its authors' own experimentally generated DNA and RNA databases. According to their benchmarking analysis, it outperforms both AptaTrans and AlphaFold. Nevertheless, we would try all three approaches and analyze which works best in our cases.
Complementation with Negative Selection
The inSilico SELEX can be complemented with a negative selection loop (Figure 3). This double loop would require not only the target protein of interest but also another protein (or set of proteins) to which the aptamer should not bind (referred to as "proteins of negative interest" from now on). The complementary negative loop would take the filtered sequences with high affinity for the protein of interest and analyze their interaction with the protein(s) of negative interest. Sequences with high affinity for the proteins of negative interest would be discarded, while those with the lowest affinities for them would be kept. In short, the first loop increases affinity and the second, negative loop increases specificity.
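The negative loop can be sketched as a second filter applied to the sequences that passed the positive loop. In this minimal sketch, `score_fn` stands for whichever binding predictor is used, a sequence is judged by its highest score against any protein of negative interest, and `keep_fraction` is an assumed parameter; the scores in the example are invented for illustration.

```python
def negative_selection(candidates, negative_proteins, score_fn,
                       keep_fraction=0.5):
    """Keep the candidates with the lowest predicted off-target binding.

    Each candidate is ranked by its highest predicted score against any
    protein of negative interest (worst case), in ascending order.
    """
    def off_target(dna):
        return max(score_fn(dna, protein) for protein in negative_proteins)

    ranked = sorted(candidates, key=off_target)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Invented scores against a hypothetical wild-type protein "WT".
made_up_scores = {
    ("AAA", "WT"): 0.9,
    ("CCC", "WT"): 0.1,
    ("GGG", "WT"): 0.4,
    ("TTT", "WT"): 0.2,
}
kept = negative_selection(
    ["AAA", "CCC", "GGG", "TTT"],
    ["WT"],
    lambda dna, protein: made_up_scores[(dna, protein)],
)
print(kept)  # ['CCC', 'TTT']
```

Ranking by the worst-case score is a conservative choice: a sequence is discarded if it binds strongly to even one of the proteins of negative interest.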
In the case of our biomarkers, we want to specifically detect an isoform of the protein which is the product of an exon inclusion event. Therefore, the aptamer has to bind to the peptide product of the included exon. In the positive selection loop, we would use the protein with the included peptide, but to prevent the aptamer from binding other regions of the protein apart from the included peptide, we would use the WT form of the protein (without the peptide) in the negative selection loop. In this way, all the aptamers that bind to other parts of the protein would be discarded.
Impact of inSilico SELEX on Experimental SELEX
As mentioned above, the result of the inSilico SELEX would be a "probabilistic sequence" of DNA. This would be used to direct the design of the experimental DNA library. When ordering the library (from IDT or similar companies), the customer can define the probability of each nucleotide being added at each sequence position. As shown in Figure 4, the probability is normally set equal for all four nucleotides, but we want to tune the probabilities at each position. The purpose is to increase the probability of finding a high-affinity sequence by excluding sequences that are predicted not to bind.
This would not be necessary if we could experimentally try all the possibilities, but that is not the case (Figure 5). The purchased libraries contain about 10¹⁵ sequences. For an aptamer with a mean length of 35 nucleotides (approximately 20 to 50), the number of possible sequences is 4³⁵, or about 1.2 × 10²¹. The purchased library therefore covers roughly 1/10⁶ of the total possibilities, or about 0.0001%. It is true that new sequences are explored when a new round of SELEX starts. Nevertheless, even after 10 rounds the explored fraction would reach about 0.001% at most.
Using inSilico SELEX to guide experimental SELEX increases the probability that the highest-affinity sequence is among the ones tried. As schematically represented in Figure 5, the pipeline analyzes many more sequences than could ever be tried experimentally and discards sequences of predicted low affinity. This reduces the time and resources spent on SELEX experimentation and increases the overall probability of success.
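The coverage estimate can be checked with a few lines of arithmetic. The sketch below uses the library size and mean aptamer length given above; the assumption that each of 10 rounds explores an entirely new set of 10¹⁵ sequences is deliberately optimistic, so the last figure is an upper bound.

```python
library_size = 10**15        # sequences in one purchased library
aptamer_length = 35          # mean aptamer length in nucleotides
rounds = 10                  # assumed number of SELEX rounds

# Four possible nucleotides at each of the 35 positions.
sequence_space = 4**aptamer_length

# Fraction of the sequence space covered by one library.
fraction = library_size / sequence_space

# Upper bound: every round is assumed to explore only new sequences.
fraction_after_rounds = rounds * fraction

print(f"possible sequences: {sequence_space:.2e}")        # about 1.18e+21
print(f"covered by one library: {100 * fraction:.2e} %")  # about 8.5e-05 %
print(f"covered after {rounds} rounds: {100 * fraction_after_rounds:.2e} %")
```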