Overview
This project is built around the core concept of “Medicine-Food Collaboration”, aiming to screen and identify aptamers that can specifically bind to different small molecules. Based on these aptamers, we seek to design riboswitches that respond to specific small molecules, ultimately achieving a personalized, patient-centered response system that aligns with individual medical treatment and health management needs.
To this end, we developed the Mol2Aptamer model, which can de novo design riboswitch aptamer sequences targeting specific small molecules. This model provides high-value candidate molecules for subsequent wet-lab screening, laying a solid foundation for the research and development of functional riboswitches.
During the project's iteration, hippuric acid, a bioactive small molecule derived from black rice, was included as a key research target because of its application potential within the Medicine-Food Collaboration framework. We therefore selected hippuric acid as a case study to systematically validate and demonstrate the performance of the Mol2Aptamer model. Literature and public database searches revealed no reported aptamer specific to hippuric acid. However, an aptamer against p-aminohippuric acid has been screened via SELEX (Systematic Evolution of Ligands by Exponential Enrichment), with the sequence[1]:
CATCCGTCACACCTGCTCACGTCATCCGTCACACCTGCTCACGTCATCCGTCACACCTGCTCGGTGTTCGGTCCCGTATC.
This aptamer shows potential binding affinity toward hippuric acid, though its binding strength, interaction stability, and actual binding performance remain to be experimentally confirmed.
To check that the sequences generated by the Mol2Aptamer model are plausible, we analyzed their predicted folding thermodynamics.
To evaluate binding to hippuric acid, we used molecular docking and molecular dynamics simulations to compare the p-aminohippuric acid aptamer with the Mol2Aptamer-designed aptamer. The analysis focuses on their respective binding affinities and interaction stabilities, providing quantitative metrics for aptamer selection. Furthermore, from the calculated binding free energy between the small molecule and the aptamer, we derived the dissociation constant (Kd), which is used to parameterize the Expiry Date Circuit model.
In addition, to promote the practical application of our model, we collaborated with the PekingHSC 2025 iGEM team to jointly develop an integrated platform, RNA-Factory, which incorporates various RNA-related tools and models. The Mol2Aptamer model from this study has been officially integrated into the platform, enabling standardized deployment and efficient utilization of the model.
Mol2Aptamer
1. Modeling Objective
Our goal is to perform de novo design of riboswitch aptamer sequences for specific small molecules, thereby providing high-value candidates for subsequent wet-lab screening. The adoption of generative modeling instead of pure computational chemistry or molecular dynamics is primarily based on two considerations: First, the number of publicly available, experimentally validated small molecule-riboswitch pairs is extremely limited, leading to difficulties in parameterization or calibration for physics-based modeling. Second, the recognition mechanism between small molecules and RNA is complex and may rely on implicit structural or sequence patterns; deep generative models are capable of learning weakly supervised or structured mapping relationships in a large-scale sequence space, thus generating new experimentally verifiable sequences. In summary, we model this task as a conditional sequence generation problem: using small molecule SMILES (Simplified Molecular-Input Line-Entry System) and global physicochemical features as conditions to generate RNA sequences, which simplifies it to a translation or conditional language modeling problem.
2. Overall Architecture
Mol2Aptamer adopts a typical Encoder-Decoder (conditional Transformer) architecture[2], with the overall workflow as follows:
- Input Encoding (Encoder): Encodes small molecule information into a continuous conditional vector (denoted as memory). The small molecule information is derived from two sources: SMILES sequences processed by a BPE (Byte Pair Encoding) tokenizer[3], and numerical features (e.g., molecular weight) extracted via RDKit-based calculation methods.
- Conditional Decoding (Decoder): Based on the output of the encoder, an autoregressive Transformer decoder gradually generates RNA tokens.
- Training Process: Multi-instance likelihood (MIL) is used to aggregate multiple authentic aptamer sequences corresponding to the same small molecule[4], enabling the model to learn the objective that "at least one authentic sequence can explain the molecule". Meanwhile, the model introduces sequence diversity through the latent variables of a conditional variational autoencoder (CVAE)[5].
- Inference and Post-Processing: Multiple candidate sequences are generated via multi-strategy decoding (greedy decoding, top-k sampling, top-p sampling, and temperature scaling). These sequences are then re-ranked and filtered using secondary structure prediction and physical indicators, and the final ranked sequences are output.
3. Hyperparameters
Given the limited size of the training dataset, the model design balances expressiveness and overfitting risk: it must have sufficient capacity to learn complex mappings while using regularization to control overfitting. The hyperparameter settings are shown in the table below:
Table 1. Model hyperparameters
4. Model Construction Details
First, we curated a dataset of experimentally validated small molecule-aptamer pairs by integrating three publicly available databases: AptaDB[16], Aptagen[17], and the Global Nucleic Acid Aptamer Database[18]. After filtering and removing duplicates, the final dataset comprised 238 unique small molecules and 794 corresponding aptamer sequences. Each entry contained the molecular structure in SMILES notation, molecular descriptors, and one or more experimentally verified aptamer sequences.
Because aptamer functionality is determined primarily by secondary structure rather than the DNA or RNA alphabet per se, we standardized all sequences into RNA format. This unification reduces modeling complexity and eliminates noise introduced by heterogeneous alphabets, while still preserving the structure-function mapping.
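The alphabet unification described above can be sketched in a few lines. This is a minimal illustration, not the project's actual preprocessing code; the function name `to_rna` and the decision to reject any symbol outside A, C, G, U are assumptions made for the example.

```python
def to_rna(sequence: str) -> str:
    """Return the sequence in the RNA alphabet: uppercase, with T replaced by U."""
    cleaned = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    rna = cleaned.replace("T", "U")
    unexpected = set(rna) - set("ACGU")
    if unexpected:
        # Assumption: ambiguous or modified bases are rejected rather than guessed.
        raise ValueError(f"unexpected symbols: {sorted(unexpected)}")
    return rna
```

Applied to a DNA aptamer, the function changes only the thymines, so the base-pairing pattern that a folding tool would predict from the sequence is carried over to the RNA alphabet.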
We then trained Byte Pair Encoding (BPE) tokenizers separately for SMILES and RNA sequences, yielding compact and biologically meaningful vocabularies. All sequences were padded or truncated to fixed lengths, with special tokens added for sequence start, end, and padding. Dataset splitting was conducted at the molecule level to prevent information leakage across training, validation, and test sets.
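A molecule-level split can be sketched as follows. The 80/10/10 fractions, the seed, and the function name `split_by_molecule` are assumptions for illustration; the text only states that splitting was done per molecule.

```python
import random
from collections import defaultdict


def split_by_molecule(pairs, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split (smiles, aptamer) pairs so that each molecule lands in exactly one subset."""
    by_molecule = defaultdict(list)
    for smiles, aptamer in pairs:
        by_molecule[smiles].append(aptamer)

    molecules = sorted(by_molecule)            # deterministic order before shuffling
    random.Random(seed).shuffle(molecules)

    n_train = int(fractions[0] * len(molecules))
    n_valid = int(fractions[1] * len(molecules))
    groups = {
        "train": molecules[:n_train],
        "valid": molecules[n_train:n_train + n_valid],
        "test": molecules[n_train + n_valid:],  # remainder goes to the test set
    }
    # Expand each group of molecules back into its (smiles, aptamer) pairs.
    return {
        name: [(m, s) for m in mols for s in by_molecule[m]]
        for name, mols in groups.items()
    }
```

Because the shuffle operates on molecules rather than on individual pairs, all aptamers of one molecule stay together, so no molecule seen in training can reappear in validation or test.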
Our framework, termed Mol2Aptamer, formulates aptamer design as a conditional sequence generation task. The architecture integrates a conditional variational autoencoder (CVAE) with a multi-instance learning (MIL) objective to capture both sequence diversity and the one-to-many mapping between molecules and aptamers[4][5].
SMILES sequences were processed through a Transformer encoder with \(L_e = 4\) layers, hidden dimension \(d_{model} = 256\), and 8 attention heads. Global molecular descriptors were projected through a feed-forward network and fused with the encoder's pooled representation to form a conditional molecular vector \(c \in \mathbb{R}^{256}\).
To enable latent variable modeling, we parameterized a conditional prior distribution over latent vectors \(z\):
$$p_\psi(z \mid m) = \mathcal{N}\left(\mu_p(c), \operatorname{diag}(\sigma_p^2(c))\right)$$
For each aptamer sequence \(s\), a recognition encoder produced the approximate posterior:
$$q_\phi(z \mid s, m) = \mathcal{N}\left(\mu_q(h_s, c), \operatorname{diag}(\sigma_q^2(h_s, c))\right)$$
where \(h_s\) is the encoded sequence embedding. The latent dimension was set to \(d_z = 6\).
Given \(z\) and \(c\), a Transformer decoder (\(L_d = 6\) layers, hidden size 256, 8 heads, FFN dimension 1024) generated RNA sequences token by token. Conditioning was implemented by injecting \((z,\ c)\) as prefix embeddings and as cross-attention context. Dropout was set to 0.1 across all layers.
The model was trained using a combination of CVAE evidence lower bound (ELBO) and a multi-instance likelihood (MIL) objective:
For each sequence-molecule pair (s, m), the loss was defined as:
$${L}_{ {ELBO}}( {s}, {m}) = - {E}_{ {q}_\phi( {z} \mid {s}, {m})}[\log {p}_\theta( {s} \mid {z}, {m})] + \beta \cdot {KL}(q_\phi(z \mid s, m) \| p_\psi(z \mid m))$$
The reconstruction term was computed as token-level cross-entropy with label smoothing \(\epsilon = 0.1\). KL annealing was applied, gradually increasing \(\beta\) from 0 to 1 to mitigate posterior collapse.
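Because both the prior and the posterior are diagonal Gaussians, the KL term has a closed form. The sketch below shows that formula together with one possible annealing schedule. The linear shape and the 10,000-step length of `beta_schedule` are assumptions; the text only states that \(\beta\) increases gradually from 0 to 1.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for two diagonal Gaussians given as lists of means and standard deviations."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def beta_schedule(step, anneal_steps=10_000):
    """KL weight: 0 at step 0, rising linearly to 1 at anneal_steps (assumed schedule)."""
    return min(1.0, step / anneal_steps)


def elbo_loss(reconstruction_nll, kl, step):
    """Negative ELBO for one pair: reconstruction cross-entropy plus the weighted KL term."""
    return reconstruction_nll + beta_schedule(step) * kl
```

Early in training the KL weight is close to zero, so the decoder is pushed to use the latent variable for reconstruction before the posterior is pulled toward the prior.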
For a molecule m with \( {K}_m\) experimentally validated sequences \(s^{(i)}\), we optimized:
$${L}_{ {MIL}}( {m}) = -\log\left(\sum_{i=1}^{K_m} w_i \cdot p(s^{(i)} \mid m)\right)$$
where \(p(s^{(i)} \mid m)\) was approximated via Monte Carlo sampling of \(z \sim p_\psi(z \mid m)\). Uniform weights \(w_i = K_m^{-1}\) were used unless otherwise noted.
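Sequence likelihoods are extremely small numbers, so the MIL loss is usually evaluated from log-probabilities with the log-sum-exp trick. The following is a minimal sketch under that assumption; `mil_loss` is a hypothetical helper and the log-probabilities are taken as given (in the model they would come from the decoder, averaged over sampled \(z\)).

```python
import math


def mil_loss(log_probs, weights=None):
    """-log(sum_i w_i * p_i), computed stably from the log-probabilities log p_i."""
    k = len(log_probs)
    if weights is None:
        weights = [1.0 / k] * k                 # uniform weights w_i = 1 / K_m
    terms = [math.log(w) + lp for w, lp in zip(weights, log_probs)]
    peak = max(terms)                           # subtract the maximum to avoid underflow
    return -(peak + math.log(sum(math.exp(t - peak) for t in terms)))
```

With uniform weights the loss is dominated by the best-explained sequence: a molecule is penalized lightly as long as at least one of its known aptamers receives a high likelihood.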
Final Objective
$$ L = \lambda_{ {ELBO}} \cdot \frac{1}{N_{ {pairs}}} \sum_{(s, m)} {L}_{ {ELBO}}(s, m) + \lambda_{ {MIL}} \cdot \frac{1}{N_{ {mol}}} \sum_{m} {L}_{ {MIL}}( {m})$$
We set \(\lambda_{ {ELBO}} = 1.0\) and \(\lambda_{ {MIL}} = 1.0\).
Optimization was performed using AdamW. The learning-rate schedule consisted of a linear warm-up over the first 2,000 steps, followed by cosine annealing decay. Gradient clipping at norm 1.0 was applied to prevent exploding gradients. Training was performed with a batch size of 16 molecules for up to 500 epochs, with early stopping based on validation loss.
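The warm-up and cosine schedule can be written as a single function of the step index. Only the 2,000 warm-up steps come from the text; the peak learning rate, the total number of steps, and the final learning rate below are placeholder assumptions.

```python
import math


def learning_rate(step, peak_lr=1e-4, warmup_steps=2000, total_steps=50_000, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate rises from zero during warm-up, which keeps the first AdamW updates small while its moment estimates are still noisy, and then decays smoothly toward `min_lr`.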
At inference time, candidate aptamer sequences were generated by first sampling latent variables \(z \sim p_\psi(z \mid m)\), followed by autoregressive decoding. To balance diversity and biological plausibility, we employed multiple decoding strategies:
- Greedy decoding for baseline sequences.
- Top-k sampling (k = 20 to 50) to encourage token-level diversity.
- Nucleus sampling (top-p) with p = 0.9 to restrict sampling to the smallest set of tokens whose cumulative probability reaches p.
- Temperature scaling (T = 0.7 to 1.0) to control randomness in token selection.
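The decoding strategies listed above differ only in how the next-token distribution is filtered before sampling. The sketch below applies them to a plain list of logits; the function names are hypothetical, and in the real model the logits would come from the Transformer decoder at each step.

```python
import math
import random


def greedy_token(logits):
    """Greedy decoding: return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token index after temperature scaling and optional top-k / top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]   # softmax numerators, overflow-safe
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]                      # keep the k most probable tokens

    if top_p is not None:
        kept, mass = [], 0.0
        for prob, idx in ranked:                     # smallest prefix with mass >= top_p
            kept.append((prob, idx))
            mass += prob
            if mass >= top_p:
                break
        ranked = kept

    probs = [p for p, _ in ranked]
    ids = [i for _, i in ranked]
    return rng.choices(ids, weights=probs, k=1)[0]   # choices() renormalizes the weights
```

Lower temperatures sharpen the distribution toward the greedy choice, while top-k and top-p remove the low-probability tail so that sampled sequences stay diverse without drifting into implausible tokens.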
Generated sequences were subsequently filtered by removing duplicates, clustering by edit distance, and secondary structure prediction (RNAfold) to ensure thermodynamic plausibility[6].
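Duplicate removal and edit-distance clustering can be illustrated with a simple greedy scheme: a sequence is kept only if it is farther than a threshold from every sequence already kept. This is one possible reading of the filtering step, not necessarily the procedure used in the project; the threshold `max_distance=5` is an assumption, and the RNAfold step is not shown.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                 # deletion
                current[j - 1] + 1,              # insertion
                previous[j - 1] + (ca != cb),    # substitution or match
            ))
        previous = current
    return previous[-1]


def cluster_representatives(sequences, max_distance=5):
    """Drop exact duplicates, then keep one representative per edit-distance cluster."""
    representatives = []
    for seq in dict.fromkeys(sequences):         # removes duplicates, keeps input order
        if all(edit_distance(seq, rep) > max_distance for rep in representatives):
            representatives.append(seq)
    return representatives
```

Since candidates are visited in input order, passing them in ranked order means each cluster is represented by its best-ranked member.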
5. Model Evaluation

The top-left subplot shows the training loss (blue) and validation loss (orange), both decreasing and gradually converging, with the minimum observed at epoch 486 (red dashed line). The top-right subplot shows token-level accuracy (green), which steadily increases and reaches its highest value (0.6659) at epoch 486. The bottom-left subplot shows sequence-level accuracy (purple), which remains lower than token accuracy but still improves during training, reaching 0.2242 at epoch 486. The bottom-right subplot shows perplexity (brown), which consistently decreases.
During training, both the training and validation losses exhibited a steady downward trend without signs of overfitting, indicating that the model generalized to the held-out validation data. The training loss converged rapidly within the first 100 epochs, while the validation loss decreased more gradually and stabilized after approximately 400 epochs, suggesting that the optimization process effectively minimized the reconstruction error under the CVAE + MIL framework. Concurrently, token-level accuracy increased steadily and reached approximately 65-70% at convergence, demonstrating the model's ability to capture fine-grained token-level dependencies within aptamer sequences. Sequence-level accuracy improved more slowly and plateaued around 20-25%, reflecting the intrinsic difficulty of generating entire aptamer sequences with strict correctness. Nonetheless, this level of performance is encouraging given the limited dataset size and the inherent one-to-many mapping between small molecules and aptamers. Additionally, perplexity, which measures the model's uncertainty over the next token, decreased consistently to below 50 by the final epochs, confirming that the model produced increasingly confident predictions. Collectively, these results demonstrate that Mol2Aptamer achieved stable convergence and learned a meaningful latent representation capable of guiding conditional RNA sequence generation.
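The three reported metrics are related in a simple way, which the sketch below makes explicit. It assumes that predicted and reference sequences are already aligned token by token (for example, after padding to the same length) and that perplexity is the exponential of the mean per-token cross-entropy; both are common conventions rather than details confirmed by the text.

```python
import math


def token_accuracy(predicted, reference):
    """Fraction of aligned positions where the predicted token equals the reference token."""
    pairs = [(p, r) for ps, rs in zip(predicted, reference) for p, r in zip(ps, rs)]
    return sum(p == r for p, r in pairs) / len(pairs)


def sequence_accuracy(predicted, reference):
    """Fraction of sequences reproduced exactly, with every token correct."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def perplexity(mean_token_nll):
    """Perplexity as exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_token_nll)
```

A single wrong token makes a whole sequence count as incorrect, which is why sequence-level accuracy sits well below token-level accuracy.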
Overall, the evaluation suggests that Mol2Aptamer effectively captures the mapping between small molecules and aptamer sequences. While token-level learning appears highly successful, the gap at the sequence level indicates that additional data augmentation, incorporation of structural priors, or improved decoding strategies may be necessary to further enhance the end-to-end generation of biologically functional aptamers.
Validation via Molecular Simulations
1. Thermodynamic Verification of Sequences Generated by Mol2Aptamer
To validate the sequences generated by Mol2Aptamer, we selected hippuric acid as a representative small molecule and analyzed the top-ranked aptamer candidates produced by the model. Candidate sequences were first sorted by thermodynamic stability (minimum free energy, ΔG). Among the predicted sequences, the most stable candidate exhibited ΔG = -5.90 kcal/mol and was therefore chosen for secondary structure analysis.
Top 5 Aptamer Candidates (sorted by ΔG):
- GCAGGGGUAGUGGGCGUGUCGUGGGGGGUAGGGGUCCUGGUGCCCGUAGCUGGGUUG (ΔG = -5.90 kcal/mol)
- GCACGUGGAAUUGCGGGCCGGUAUGUGGUGACGCAUCCGAGCGGGUGCUGGUCGUC (ΔG = -3.00 kcal/mol)
- GCAGGAGGGGGCGGGUCAGAUGAUGCCGGUGCCCCGGGGUGUCAGGGAAUUGUGU (ΔG = -3.00 kcal/mol)
- GCACGGGCGGGGGUGGGCCCGUAGGGU (ΔG = -2.80 kcal/mol)
- GAGUAAUACGACUCACUAUAGGGAGAUCGUGGCGCCACGGUGAAGGAGAG (ΔG = -2.70 kcal/mol)
RNAfold predictions revealed that the optimal secondary structure (MFE structure) of this aptamer has a minimum free energy of -20.10 kcal/mol, with an ensemble free energy of -21.09 kcal/mol. The frequency of the MFE structure in the thermodynamic ensemble was approximately 20%, indicating that although the MFE conformation is the most stable, the aptamer spends nearly 80% of the time adopting alternative metastable conformations. The ensemble diversity score (6.03) further supports that the RNA molecule is structurally heterogeneous, sampling multiple possible folds instead of adopting a unique conformation[6].
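The reported MFE frequency follows directly from the two free energies: the Boltzmann weight of the MFE structure divided by the partition function equals \(\exp(-(E_{MFE} - G_{ensemble})/RT)\). The sketch below checks this relation; the temperature of 37 °C is an assumption (it is RNAfold's default and is not stated in the text).

```python
import math

GAS_CONSTANT_KCAL = 1.98720425864083e-3  # kcal/(mol*K)


def mfe_frequency(mfe, ensemble_energy, temperature_c=37.0):
    """Probability of the MFE structure in the Boltzmann ensemble (energies in kcal/mol)."""
    rt = GAS_CONSTANT_KCAL * (temperature_c + 273.15)
    return math.exp(-(mfe - ensemble_energy) / rt)


# Values reported above: MFE = -20.10 kcal/mol, ensemble free energy = -21.09 kcal/mol.
frequency = mfe_frequency(-20.10, -21.09)
```

Under this assumption the result is close to 0.20, in agreement with the approximately 20% frequency stated above.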

A. Minimum Free Energy (MFE) structure. B. Centroid structure. (Each circle represents a nucleotide (A, U, G, or C), and the color scale indicates the base-pairing probability (from purple/blue = low to red = high). Both structures exhibit a typical stem-loop conformation. The MFE structure shows the most thermodynamically stable configuration, while the centroid structure represents the ensemble's most representative conformation. Minor differences in loop and unpaired regions reflect structural flexibility within the folding ensemble.)
In addition, the centroid structure, computed by weighting base pairs by their occurrence probability, showed a secondary structure highly consistent with the MFE prediction but with slight differences in loop regions. This suggests that the centroid structure may better reflect the “average” folding behavior of the aptamer across the ensemble.
The overlap between the free energy curves derived from MFE, partition function, and centroid analysis further confirmed that the overall folding trend is consistent, with only local instability in loop regions.

A. Comparison of predicted secondary structure heights for the Minimum Free Energy (MFE, red), partition function (pf, green), and centroid (blue) structures. The y-axis indicates the base-pairing “height” or stem depth at each nucleotide position, reflecting the overall folding topology. The three models show high structural consistency, with only minor deviations in loop regions, indicating a well-defined stem-loop conformation. B. Positional entropy plot showing the base-pairing variability at each nucleotide position. Higher entropy values correspond to structurally flexible or alternative pairing regions, whereas low entropy regions indicate stable and well-defined base pairs. The aptamer exhibits low overall entropy, suggesting a stable core structure with localized flexibility at loop regions.
These results indicate that the sequences generated by Mol2Aptamer are thermodynamically plausible: they fold into a defined stem-loop structure with a stable core, consistent with the intended design logic.
2. Molecular docking
We obtained the complete sequences of the p-aminohippuric acid aptamer and the aptamer designed by Mol2Aptamer (hereinafter referred to as the DNA aptamer and the RNA aptamer, respectively). We analyzed the secondary and tertiary structures of these two aptamers using bioinformatics prediction methods to provide structural input for subsequent modeling work. Specifically, we first used the mfold[7] tool to predict the secondary structures of both aptamers, identifying their base-pairing patterns and stem-loop structural features. On this basis, we constructed the tertiary structure models of the two aptamers using the Xiao Lab[8] and RNAComposer[9] tools, respectively.

A. Secondary structure of DNA aptamer. B. Secondary structure of RNA aptamer. C. The tertiary structure of DNA aptamers. D. The tertiary structure of RNA aptamers.
We conducted molecular docking experiments using AutoDock Vina[10], with the primary objectives of identifying the key binding sites and interaction patterns of the two aptamers, as well as obtaining the three-dimensional structures of the aptamer-hippuric acid complexes. Based on these docking results, we further calculated the binding free energy between each aptamer and hippuric acid to derive the dissociation constant (Kd), providing quantitative parameters for the subsequent Expiry Date Circuit model configuration.
To ensure the accuracy and reliability of the docking experiments, structural preprocessing was performed prior to docking. For the ligand (hippuric acid), the standard three-dimensional structure file was downloaded from the PubChem database and hydrogen atoms were added. For the receptor aptamers, water molecules were removed from the predicted tertiary structures and hydrogen atoms were then added.

A. The molecular docking results of DNA aptamers and hippuric acid. B. The molecular docking results of RNA aptamers and hippuric acid.

Molecular docking results suggest that the aptamer predicted by Mol2Aptamer exhibits slightly higher binding affinity to the ligand compared to the p-aminohippuric acid aptamer.
Structural visualization of the docking complexes shows that both the DNA aptamer for p-aminohippuric acid and the RNA aptamer designed by Mol2Aptamer specifically interact with hippuric acid (green small molecule) through hydrogen bonds and other noncovalent interactions (represented by yellow dashed lines). The identified binding sites involve specific nucleotides within the aptamers that provide a stable microenvironment for ligand binding.
According to the AutoDock Vina docking energy and conformational analysis results, the RNA_rank1 complex achieved a binding score (S) of -5.799, the lowest (most negative) among all conformations, indicating the strongest binding affinity to hippuric acid. Meanwhile, the optimized root-mean-square deviations (RMSD) for both RNA_rank1 and DNA_rank1 were 0.000, showing that the optimized conformations perfectly overlap with their reference structures, suggesting excellent structural stability. These findings indicate that the Mol2Aptamer-designed RNA aptamer exhibits more favorable intermolecular binding energy, total optimized energy, and internal conformational stability than the literature-reported p-aminohippuric acid DNA aptamer.
Furthermore, there exists a quantitative relationship between the dissociation constant (Kd) and the molar Gibbs free energy (ΔG, binding free energy):
$$\Delta G = RT \ln K_{d}$$
From our calculations, the binding free energy of the aptamer-ligand complex is -29.2489848 kJ/mol, corresponding to a dissociation constant (Kd) of 7507.83 nM (about 7.5 μM). This parameter serves as a quantitative input for the construction and optimization of the Expiry Date Circuit.
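The conversion from binding free energy to Kd can be reproduced in a few lines. The values R = 8.314 J/(mol·K) and T = 298.15 K are assumptions, chosen because they reproduce the reported Kd; the text does not state them. Kd is expressed relative to a 1 M standard state.

```python
import math


def kd_from_binding_energy(delta_g_kj_per_mol, temperature_k=298.15, gas_constant=8.314):
    """Dissociation constant in mol/L from a binding free energy in kJ/mol.

    Uses delta_G = R * T * ln(Kd), with R in J/(mol*K) and a 1 M standard state.
    """
    return math.exp(delta_g_kj_per_mol * 1000.0 / (gas_constant * temperature_k))


kd_molar = kd_from_binding_energy(-29.2489848)   # reported binding free energy
kd_nanomolar = kd_molar * 1e9
```

With these assumed constants the function returns a value within rounding of the 7507.83 nM quoted above. Because Kd depends exponentially on ΔG, a change of about 5.7 kJ/mol at this temperature shifts Kd by a factor of ten.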
3. Molecular dynamics simulations
We performed GROMACS molecular dynamics simulations on the docked complexes to observe their conformational behavior and compare the binding stability of hippuric acid with a DNA aptamer versus our Mol2Aptamer-designed RNA aptamer.
Before the MD simulations, we performed the following preparations:
- Generated topologies for the aptamer and the small molecule using pdb2gmx and sobtop[13] respectively, then calculated the electrostatic potential with Gaussian[14], and finally assigned atomic charges using Multiwfn[15].
- Used the AMBER99SB-ILDN force field for the aptamers and the TIP3P water model for solvation.
- Ensured force field compatibility by merging hippuric acid's atomic parameters, then assembled the complete aptamer-ligand complex topology.
- Placed the complex in a cubic periodic water box with a minimum 1.0 nm solvent buffer.
- Neutralized the system's total charge by adding Na⁺ or Cl⁻ ions.
- Performed energy minimization to remove steric clashes in the complex, followed by NVT and then NPT equilibration to stabilize the system's temperature and pressure, respectively.
After preparation, we ran MD simulations for both complexes and analyzed the root-mean-square deviation (RMSD) to assess ligand and overall complex stability.
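For reference, the RMSD of a frame is the square root of the mean squared displacement of its atoms from a reference structure. The sketch below computes it for two coordinate lists without any superposition; trajectory analysis tools typically perform a least-squares fit to the reference first, so this is an illustration of the quantity rather than the analysis pipeline used here.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates, without superposition."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must have the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

A curve that stays low and flat over the trajectory means the structure remains close to its starting conformation, while large or fluctuating values indicate rearrangement or, for the ligand, movement away from the initial binding pose.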

The RMSD analysis shows that the DNA aptamer (blue curve) was less stable, exhibiting higher and more strongly fluctuating RMSD values. In contrast, the RNA aptamer (red curve) remained stable at a low RMSD level, indicating superior conformational stability.

The ligand's RMSD relative to the DNA aptamer (blue curve) was high and unstable. Conversely, its RMSD relative to the RNA aptamer (red curve) was low and stable, suggesting a stronger and more consistent binding interaction.
This comparative analysis indicates that, in silico, the aptamer designed by Mol2Aptamer binds hippuric acid more stably and is structurally more stable than the reference DNA aptamer. This supports the model's usefulness in designing high-value aptamer candidates for functional riboswitches.
However, the model has limitations, including scarce training data, the need for wet-lab validation (e.g., SELEX), and limited interpretability. Overcoming these requires expanding the dataset and establishing a Design-Build-Test-Learn (DBTL) feedback loop to continuously improve model performance.
RNA-Factory
We collaborated with the PekingHSC 2025 iGEM team to develop RNA-Factory, a user-friendly, all-in-one, multifunctional RNA analysis platform. The platform focuses on deep learning and integrates large language model technologies, providing analytical tools for structure prediction, interaction prediction, and sequence and structure design. For detailed information, please visit https://2025.igem.wiki/pekinghsc/software.
Our Mol2Aptamer model has been integrated into the RNA-Factory platform. The demonstration below shows how the integrated model is used.

This animation demonstrates how to use our model on RNA-Factory. First, enter the small molecule (as a SMILES string). You can either keep the default parameters or adjust them manually before clicking "Start Analysis". After a short wait, you will obtain the model-predicted riboswitch aptamer sequences. Among these, sequences with ΔG displayed in green are the usable ones. You can run the generation multiple times to select the best candidate.
References
[1] Fischer C, Wessels H, Paschke-Kratzin A, Fischer M. Aptamers: Universal capture units for lateral flow applications. Analytical Biochemistry. 2017;522:53-60. ISSN: 0003-2697.
[2] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention Is All You Need. Adv Neural Inf Process Syst. 2017;30:5998-6008.
[3] Sennrich R, Haddow B, Birch A. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL). 2016;2:1715-1725.
[4] Dietterich TG, Lathrop RH, Lozano-Pérez T. Solving the Multiple Instance Problem with Axis-Parallel Rectangles. Artificial Intelligence. 1997;89(1-2):31-71.
[5] Sohn K, Lee H, Yan X. Learning Structured Output Representation using Deep Conditional Generative Models. Advances in Neural Information Processing Systems (NeurIPS). 2015;28:3483-3491.
[6] Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie. 1994;125(2):167-188.
[7] Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research. 2003;31(13):3406-3415.
[8] Xiao Lab, Huazhong University of Science and Technology (HUST). Xiao Lab [Online]. Wuhan: Huazhong University of Science and Technology (HUST).
[9] Bojarski M, Antczak M, Zok A, et al. RNAComposer: a method for de novo prediction of RNA 3D structure. Nucleic Acids Research. 2014;42(W1):W25-W31.
[10] Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem. 2010;31(2):455-461.
[11] van der Spoel D, Lindahl E, Hess B, Groenhof G, Mark AE, Berendsen HJC. GROMACS: Fast, flexible and free. Journal of Computational Chemistry. 2005;26(16):1701-1718.
[12] Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX. 2015;1-2:19-25.
[13] Lu T. Sobtop (Version 1.0) [Software]. http://sobereva.com/soft/Sobtop. Accessed 2025-09-30.
[14] Gaussian, Inc. Gaussian 09 (Revision D.01) [Software]. Wallingford, CT, USA: Gaussian, Inc., 2009.
[15] Lu T, Chen F. Multiwfn: A multifunctional wavefunction analyzer. Journal of Computational Chemistry. 2012;33(6):580-592.
[16] Chen L, Yu Z, Wu Z, et al. AptaDB: a comprehensive database integrating aptamer-target interactions [DB/OL]. RNA. 2024;30(3). Accessed 2025-10-08.
[17] Aptagen Labs. Peptimers™: Next-Generation Aptamers [EB/OL]. https://www.aptagen.com/. Accessed 2025-10-08.
[18] Institute of Hangzhou Medical Research, Chinese Academy of Sciences (IHMR-CAS). Global Aptamer Database (GAD) [DB/OL]. 2021-01-15. https://www.aptamer.org/aptamerdb. Accessed 2025-10-08.