To assess the compatibility and interaction strength between chitosan and the candidate siRNAs designed to silence Phytophthora capsici, we used two bioinformatics approaches: docking simulations (HDOCK and Maestro Glide) and machine learning-based modeling. Docking allowed us to evaluate the binding affinity of siRNA-nanoparticle complexes and how readily they form, while predictive modeling enabled us to generalize interaction patterns without repeatedly running computationally intensive docking. Together, these methods provided a framework to screen and prioritize siRNA candidates before testing them in the lab.
Docking
1. HDOCK
Docking estimates the binding affinity between a ligand and a receptor. Since chitosan is our nanoparticle carrier, it is essential to determine which siRNAs can bind to it spontaneously, that is, without a large energy input.
- The siRNA candidates designed to silence P. capsici were docked with the chitosan monomer obtained from PubChem, using HDOCK, to check for compatibility and interaction affinity.
- In HDOCK, the chitosan monomer is supplied as the ligand and the designed siRNA as the receptor.
| Rank | Docking Score | Confidence Score | Ligand RMSD (Å) |
|---|---|---|---|
| 1 | -327.07 | 0.9718 | 17.78 |
| 2 | -322.18 | 0.9690 | 20.11 |
| 3 | -301.02 | 0.9535 | 17.31 |
| 4 | -296.22 | 0.9490 | 20.57 |
| 5 | -292.75 | 0.9456 | 19.95 |
| 6 | -281.01 | 0.9322 | 20.21 |
| 7 | -280.65 | 0.9317 | 21.87 |
| 8 | -269.61 | 0.9162 | 21.16 |
| 9 | -253.23 | 0.8874 | 20.08 |
| 10 | -250.97 | 0.8828 | 19.88 |

| Docking Score | Confidence Score | Ligand RMSD (Å) |
|---|---|---|
| -269.15 | 0.9155 | 16.93 |
| -266.94 | 0.912 | 18.05 |
| -254.32 | 0.8896 | 21.71 |
| -253.95 | 0.8888 | 19.83 |
| -249.05 | 0.8788 | 21.02 |
| -239.54 | 0.857 | 20.03 |
| -231.79 | 0.837 | 20.77 |
| -228.62 | 0.8281 | 19.91 |
| -222.11 | 0.8088 | 25.25 |
| -221.94 | 0.8083 | 24.58 |

| Docking Score | Confidence Score | Ligand RMSD (Å) |
|---|---|---|
| -245.17 | 0.8703 | 18.63 |
| -230.71 | 0.834 | 22 |
| -222.74 | 0.8107 | 20.6 |
| -217.21 | 0.7932 | 20.34 |
| -213.11 | 0.7794 | 20.8 |
| -211.74 | 0.7747 | 63.09 |
| -205.98 | 0.7539 | 55.58 |
| -201.5 | 0.7369 | 18.41 |
| -201.32 | 0.7362 | 19.77 |
| -190.58 | 0.6925 | 58.99 |

| Docking Score | Confidence Score | Ligand RMSD (Å) |
|---|---|---|
| -184.19 | 0.6646 | 16.67 |
| -183.8 | 0.6628 | 45.43 |
| -183.78 | 0.6628 | 47.96 |
| -182.15 | 0.6554 | 37.07 |
| -181.72 | 0.6535 | 45.49 |
| -180.18 | 0.6465 | 22.94 |
| -177.94 | 0.6362 | 20.23 |
| -177.69 | 0.635 | 46.94 |
| -177.35 | 0.6334 | 30.05 |
| -176.57 | 0.6298 | 43.13 |
From the docking scores above, siRNA Candidate 1 and siRNA Candidate 2 show the most negative scores, implying that these two siRNAs bind to chitosan more favorably than the others. This supports validating these siRNAs in the wet lab to check their silencing efficacy against P. capsici.
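The Confidence Score column is not an independent measurement: the HDOCK server documentation defines it as an empirical logistic transform of the docking score. The sketch below applies that formula to the top-ranked score of each table above as a consistency check; the function name `hdock_confidence` is our own choice and is not part of HDOCK.

```python
import math


def hdock_confidence(docking_score: float) -> float:
    """Empirical confidence score, as given in the HDOCK server help:
    1 / (1 + exp(0.02 * (docking_score + 150)))."""
    return 1.0 / (1.0 + math.exp(0.02 * (docking_score + 150.0)))


# Top-ranked docking score from each of the four tables above
top_scores = [-327.07, -269.15, -245.17, -184.19]

for table_no, score in enumerate(top_scores, start=1):
    print(f"table {table_no}: score={score:.2f}, "
          f"confidence={hdock_confidence(score):.4f}")
```

Because the transform is monotonic, ranking poses by confidence score gives the same order as ranking by docking score, so the two columns carry the same information.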
2. Maestro Glide
We also used Glide Ligand Docking in Maestro to obtain additional docking parameters as candidate features, from which we selected those best suited to the model.
The procedure is as follows:
- Start with the Protein Preparation Wizard.
- Load the siRNA PDB file.
- Select the cap termini option.
- Load the prepared siRNA structure.
- Open the "Receptor Grid Generation" panel.
- Deselect "Pick to identify ligand".
- Go to the Site tab.
- Set the center to the centroid of selected residues.
- Select all residues for centering.
- Click Add and then OK.
- The selected residues will define the Active Site.
- The grid box appears and gets centered around these residues.
- Submit and generate the receptor grid file.
- Load the chitosan PDB file.
- Prepare it with the LigPrep module.
- Select Epik to generate possible ionization/protonation states.
- Go to the Ligand Docking panel. Input the ligands from the LigPrep output and the receptor grid from the Grid Generation step.
- Submit Docking.
This procedure yielded the following docking scores:
siRNA1: Docking = -8.234 (Rank 1 pose)
siRNA2: Docking = -7.933 (Rank 1 pose)
siRNA3: Docking = -7.517 (Rank 1 pose)
siRNA4: Docking = -6.189 (Rank 1 pose)
siRNA candidate 1 had the most negative docking score, indicating the strongest predicted interaction affinity with chitosan and supporting our choice of siRNA candidates for the wet lab experiments.
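As a small illustration of how the Glide scores above can be ranked programmatically, the sketch below parses lines in the format shown and sorts the candidates from most to least negative. The helper names `parse_scores` and `rank_by_score` are hypothetical and are not part of Maestro or Glide.

```python
import re

# Scores copied from the list above
GLIDE_OUTPUT = """\
siRNA1: Docking = -8.234 (Rank 1 pose)
siRNA2: Docking = -7.933 (Rank 1 pose)
siRNA3: Docking = -7.517 (Rank 1 pose)
siRNA4: Docking = -6.189 (Rank 1 pose)
"""

LINE_PATTERN = re.compile(r"^(\w+): Docking = (-?\d+(?:\.\d+)?)")


def parse_scores(text: str) -> dict[str, float]:
    """Map each candidate name to its docking score."""
    scores = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            scores[match.group(1)] = float(match.group(2))
    return scores


def rank_by_score(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Sort candidates so the most negative (most favorable) score comes first."""
    return sorted(scores.items(), key=lambda item: item[1])


for name, score in rank_by_score(parse_scores(GLIDE_OUTPUT)):
    print(f"{name}: {score}")
```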
Our Stability Model (S.E.N.S.E)
We developed a software tool that predicts a docking score, used as a proxy for the stability of the siRNA-nanoparticle complex, from just the siRNA sequence and the chosen nanoparticle, without running an actual docking simulation.
The results of the approach and analysis are provided below:
Model Selection
The model comparison indicated that XGBoost outperformed the other models, achieving an RMSE of 33.30 and an R² value of 0.7459, explaining approximately 75% of the variance in the docking scores. Random Forest performed almost identically (RMSE 33.38, R² 0.7448), while Lasso regression ranked third with RMSE 33.78 and R² 0.7386. Since both tree-based models performed better than the linear approach, although by a small margin, we inferred that the relationship between siRNA sequences and their binding stability is at least partly non-linear. There are likely complex interactions between nucleotide positions and nanoparticle properties that the non-linear models capture better.
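For readers unfamiliar with the two metrics used in this comparison, the sketch below computes RMSE and R² from their standard definitions. The docking scores in the example are made-up toy values chosen for easy arithmetic, not results from our models.

```python
import math


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: share of variance explained by the model."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (hypothetical docking scores)
actual = [-300.0, -250.0, -200.0, -150.0]
predicted = [-290.0, -260.0, -210.0, -140.0]

print(f"RMSE = {rmse(actual, predicted):.2f}")
print(f"R^2  = {r_squared(actual, predicted):.4f}")
```

RMSE is expressed in the same units as the docking score, so an RMSE of about 33 should be read against the spread of the scores being predicted.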
Feature Engineering
Through feature engineering, we generated 217 features that integrated sequence characteristics such as AU and GC content, sequence entropy, and dinucleotide frequencies, together with nanoparticle type and positional nucleotide encoding. This hybrid strategy proved effective as it captured both the chemical attributes of the sequences and the positional details that influence binding.
- One-hot encoded sequences (215 features): An siRNA sequence can be up to 43 nucleotides long, including the guide and passenger strands. For each of the 43 positions, we used five binary variables (0 or 1) indicating whether that position contains A, U, G, C, or is empty (for shorter sequences). Exactly one of the five is set to 1 and the rest are 0, so 43 positions × 5 options = 215 features. This tells the model exactly which nucleotide appears at each position in the sequence.
- Sequence composition features (19 features):
  - GC content: 1 feature measuring the percentage of G and C nucleotides.
  - AU content: 1 feature measuring the percentage of A and U nucleotides.
  - Dinucleotide counts: 16 features counting how often each pair of adjacent nucleotides appears (AA, AU, AG, AC, UA, UU, UG, UC, GA, GU, GG, GC, CA, CU, CG, CC).
  - Sequence entropy: 1 feature quantifying how varied the base composition of the input siRNA sequence is.
- Nanoparticle type (2 features): Two binary variables (0 or 1) mark whether a complex is chitosan-siRNA or lipid-siRNA, so the model can distinguish between the two nanoparticles.
Total: 215 + 1 + 1 + 16 + 1 + 2 = 236 features. The reported count of 217 suggests that some features were removed during preprocessing, possibly redundant padding positions. Together, these features give the model both precise position-specific details and broader sequence characteristics that influence binding stability.
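The feature groups described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our exact pipeline: the function names, the `-` padding symbol, the feature ordering, and the use of Shannon entropy over base frequencies for the entropy feature are all assumptions, and it produces the full 236 features before any preprocessing removes columns.

```python
import math
from itertools import product

MAX_LEN = 43                      # maximum siRNA length stated above
ALPHABET = "AUGC"
SYMBOLS = ALPHABET + "-"          # "-" marks an empty position (assumed symbol)
DINUCLEOTIDES = ["".join(p) for p in product(ALPHABET, repeat=2)]  # AA ... CC
NANOPARTICLES = ("chitosan", "lipid")


def one_hot(seq: str) -> list[int]:
    """43 positions x 5 symbols = 215 binary features."""
    padded = seq.ljust(MAX_LEN, "-")
    return [1 if base == symbol else 0 for base in padded for symbol in SYMBOLS]


def composition(seq: str) -> list[float]:
    """GC %, AU %, 16 dinucleotide counts, and entropy = 19 features."""
    n = len(seq)
    gc = 100.0 * (seq.count("G") + seq.count("C")) / n
    au = 100.0 * (seq.count("A") + seq.count("U")) / n
    pairs = [seq[i:i + 2] for i in range(n - 1)]   # overlapping adjacent pairs
    di_counts = [float(pairs.count(d)) for d in DINUCLEOTIDES]
    freqs = [seq.count(b) / n for b in ALPHABET]
    # Assumption: Shannon entropy of base frequencies, in bits
    entropy = -sum(f * math.log2(f) for f in freqs if f > 0)
    return [gc, au] + di_counts + [entropy]


def featurize(seq: str, nanoparticle: str) -> list[float]:
    """Concatenate all groups: 215 + 19 + 2 = 236 features."""
    seq = seq.upper().replace("T", "U")
    if len(seq) == 0 or len(seq) > MAX_LEN:
        raise ValueError("sequence must be 1 to 43 nucleotides long")
    if set(seq) - set(ALPHABET):
        raise ValueError("sequence may contain only A, U, G, C")
    if nanoparticle not in NANOPARTICLES:
        raise ValueError(f"nanoparticle must be one of {NANOPARTICLES}")
    carrier = [1 if nanoparticle == name else 0 for name in NANOPARTICLES]
    return one_hot(seq) + composition(seq) + carrier


# Hypothetical 21-nt sequence, used only to show the feature count
example = featurize("AUGCAUGCAUGCAUGCAUGCA", "chitosan")
print(len(example))
```

With two carriers, the two nanoparticle columns always sum to 1, so one of them is redundant for a linear model; this kind of redundancy is one possible reason a preprocessing step would drop columns.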
The most striking observation was that the nanoparticle type had the largest influence on the prediction; specifically, using chitosan shifted the docking score by about 68.4 points compared to lipid.
This makes sense given that chitosan and lipid nanoparticles have fundamentally different surface chemistries and charge properties. AU content came in second with a coefficient of 37.1, followed by GC content at 31.0, and CG dinucleotide frequency at 1.5.
These findings are consistent with what is known about RNA structure. AU-rich regions in RNA are more flexible and can adopt different conformations, which affects how they interact with the nanoparticle surface. GC content influences the overall stability and rigidity of the RNA structure, since GC base pairs are stronger than AU pairs. The CG dinucleotide frequency is interesting because CG steps in nucleic acids have distinct structural properties that could influence binding interfaces. The substantial impact of the nanoparticle type highlights that different delivery vehicles cannot be used interchangeably; the carrier’s chemistry is just as critical as the sequence itself.