Model
Part 1: The Binding Potential between Aptamers and BD-Tau Protein
Part 2: Machine Learning Model Simulation
Alzheimer's Disease (AD) is a progressive neurodegenerative disorder. The abnormal aggregation of Tau protein is considered one of its key pathogenic mechanisms. Simulating the interaction between DNA and Tau protein can contribute to the discovery of potential diagnostic tools.
To simulate the binding behavior of candidate DNA sequences with both BD-Tau and Tau proteins, and to screen for nucleic acid aptamers that exhibit higher affinity for BD-Tau than for Tau.
1. Nucleic Acid Aptamer Structure
A specific DNA sequence (e.g., 15–30 nt) serving as a potential binding ligand.
Secondary Structure Prediction: Use the ViennaRNA (RNAfold) tool to predict the minimum-free-energy conformation (dot-bracket notation). Substitute T with U (i.e., convert the sequence to RNA format) for modeling, as required by RNAComposer. Tool URL: http://rna.tbi.univie.ac.at/cgi-bin/RNAWebSuite/RNAfold.cgi
3D Structure Modeling: Use RNAComposer to convert the sequence + secondary structure into a 3D structural model (PDB format). URL: https://rnacomposer.cs.put.poznan.pl/
Output: 3D coordinates of the DNA structure (PDB format), ready for molecular docking simulations.
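As an illustration of this step, the minimal sketch below folds one of the aptamer sequences (aptamer 08 from the list further down) with the ViennaRNA Python bindings; it assumes the ViennaRNA package is installed and simply mirrors what the RNAfold web server returns (dot-bracket structure plus minimum free energy), ready to paste into RNAComposer.

```python
# Minimal sketch: predict the MFE secondary structure of an aptamer with the
# ViennaRNA Python bindings (pip install ViennaRNA). The DNA sequence is first
# converted to RNA (T -> U), mirroring the substitution described above.
import RNA

aptamer_dna = "TCACCTGAGACTTGACGATGGCACTACCCCTCCCTACTAAGCACGGTATCTTGTACTGAGTGCTATCGTCTGTCCA"  # aptamer 08
aptamer_rna = aptamer_dna.replace("T", "U")

structure, mfe = RNA.fold(aptamer_rna)   # dot-bracket notation and minimum free energy
print(structure)                         # paste sequence + dot-bracket into RNAComposer
print(f"MFE: {mfe:.2f} kcal/mol")
```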
2. Obtain the spatial conformation of the tau protein
Target Proteins:
Tau protein: A microtubule-associated protein that forms neurofibrillary tangles in Alzheimer's disease (AD).
BD-Tau: brain-derived tau, a novel and more targeted tau species that more accurately reflects AD-related Tau pathology.
Structure Source:
AlphaFold-predicted structure: https://alphafold.ebi.ac.uk
Preprocessing Steps:
Removal of water molecules, cofactors, and non-standard residues.
Validation of structural integrity (checking for missing key sequences).
Saving the structure in .pdb format for docking preparation.
3. HADDOCK molecular docking
The tertiary structure of the DNA in .pdb format was generated using the DNA Sequence to Structure web server. The .pdb files for the Tau and BD-Tau proteins were obtained from the AlphaFold database. Both .pdb files were then submitted to the HADDOCK 2.4 web server (https://rascar.science.uu.nl/haddock2.4), which performs docking of biomolecular complexes and interaction analysis.
Each docking run produces multiple clusters. A cluster is obtained by automatically grouping all successfully docked structural models by structural similarity; simply put, one cluster ≈ one primary binding mode (pose/conformation). Different clusters differ significantly in binding interface, relative orientation, and key interactions, whereas structures within the same cluster are highly similar (RMSD < 2 Å), differing only in minor side-chain or backbone adjustments. HADDOCK reports cluster statistics over the four best structures of each cluster; the "Nr 1 best structure" is the top-scoring (lowest-energy) model within that cluster. The best cluster from each docking experiment was selected for visualization.
The z-score is used here as the overall metric for comparing docking results: it measures how many standard deviations a cluster's HADDOCK score lies from the mean of all clusters in the run, so a more negative z-score indicates greater binding potential. The following section presents the docking results for 27 distinct DNA sequences against both the Tau protein and the BD-Tau protein. The .pdb files for each clustered docking result are stored in the directory named "Molecular Docking Results".
1. The HADDOCK score is a comprehensive scoring function where a lower value indicates a more stable complex energy. It is a weighted combination of van der Waals energy, electrostatic energy, desolvation energy, and restraint energy.
2. Cluster size indicates the number of docking conformations contained within the cluster. A larger cluster size suggests that the corresponding binding mode occurs more frequently and is more representative.
3. The Root Mean Square Deviation (RMSD) measures the structural difference between the average structure of the cluster and the globally lowest-energy conformation. A larger value (e.g., 17 Å) indicates that the cluster represents a distinctly different binding mode from the lowest-energy structure.
4. Van der Waals energy describes hydrophobic interactions and steric repulsion between atoms. A negative value indicates good compatibility at the binding interface with minimal repulsion.
5. Electrostatic energy represents the intermolecular electrostatic interaction energy. A large negative value suggests strong electrostatic attraction plays a dominant role in the complex formation (consistent with the characteristic that DNA is negatively charged and proteins possess positively charged residues).
6. Desolvation energy is the energy cost associated with removing water molecules from the interfacial region. A value close to zero indicates that solvent effects have minimal impact on this cluster.
7. Restraints violation energy reflects the degree of deviation from the experimental constraints provided as input (e.g., NMR data, cross-linking information). A larger value indicates greater discrepancy with the experimental restraints; if no constraints were input, this energy term can be considered a secondary reference.
8. Buried surface area (BSA): the molecular surface area buried upon binding. A larger value suggests a more extensive binding interface and a more stable complex (a BSA > 1000 Ų is generally considered indicative of a strong interface).
9. Z-score represents the standardized deviation of the cluster's HADDOCK score relative to all clusters. A more negative value indicates that this cluster is superior to the overall average.
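To make the z-score concrete, the short sketch below recomputes it from the HADDOCK scores of the clusters in a single run; the four scores are hypothetical placeholders, since HADDOCK already reports this value directly.

```python
# Sketch: recompute cluster z-scores from the HADDOCK scores of one docking run.
# z = (cluster score - mean over all clusters) / standard deviation over all clusters.
import statistics

haddock_scores = [-95.2, -80.4, -76.1, -61.9]   # hypothetical cluster scores
mean = statistics.mean(haddock_scores)
stdev = statistics.pstdev(haddock_scores)       # population SD over the clusters

for i, score in enumerate(haddock_scores, start=1):
    z = (score - mean) / stdev
    print(f"Cluster {i}: HADDOCK score = {score:.1f}, z-score = {z:.2f}")
```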
In all figures below, panel A shows the docking scores of each aptamer against Tau, while panel B shows the scores against BD-Tau. The results are as follows:
01: 5'-TCACCTGAGACTTGACGATGGCATCACTCCCCCCCACCTATTACATCATCATAAATTGAGTGCTATCGTCTGTCCA-3'
02: 5'-TCACCTGAGACTTGACGATGGCCTCCCCCTCACGCACTCTTCCGTTTCTTCTTATCTGAGTGCTATCGTCTGTCCA-3'
03: 5'-TCACCTGAGACTTGACGATGGAACTCCCCCCACCATTATCAGCGCACCACCATTGTAGAGTGCTATCGTCTGTCCA-3'
04: 5'-TCACCTGAGACTTGACGATGGTTTAACTCCCCCACGCCCCCCGCCAACCCATCTCCAGAGTGCTATCGTCTGTCCA-3'
05: 5'-TCACCTGAGACTTGACGATGGTACGACGGCCCCCCGATTATGCGACTACTTGATTTGAGTGCTATCGTCTGTCCA-3'
06: 5'-TCACCTGAGACTTGACGATGGTCAGAACGACGCGCCCCCCACCTCATTCATTATTTTGAGTGCTATCGTCTGTCCA-3'
07: 5'-TCACCTGAGACTTGACGATGGTGACCACCCCCCACGCACACACACCTCTTCCATCCTGAGTGCTATCGTCTGTCCA-3'
08: 5'-TCACCTGAGACTTGACGATGGCACTACCCCTCCCTACTAAGCACGGTATCTTGTACTGAGTGCTATCGTCTGTCCA-3'
09: 5'-TCACCTGAGACTTGACGATGGGAACAAACACCGCGACCACCCCCCCACTTAACTCCTGAGTGCTATCGTCTGTCCA-3'
10: 5'-TCACCTGAGACTTGACGATGGCAATCCCCCCGACACCGAATCCTAAGCGAACAACGCGAGTGCTATCGTCTGTCCA-3'
11: 5'-TCACCTGAGACTTGACGATGGACTCACAAACTCGAGCCACCCCCGACCCACACAACAGAGTGCTATCGTCTGTCCA-3'
12: 5'-TCACCTGAGACTTGACGATGGTACTCCCCCCCAACCTAATAGCTCTTTACCCTCTGAGAGTGCTATCGTCTGTCCA-3'
13: 5'-TCACCTGAGACTTGACGATGGCCGACTCCCCACCCTACATCGCAACATTGACTATTAGAGTGCTATCGTCTGTCCA-3'
14: 5'-TCACCTGAGACTTGACGATGGTTCTACACTGCCCCCCCGACCCGCCAGACCAACCCAGAGTGCTATCGTCTGTCCA-3'
15: 5'-TCACCTGAGACTTGACGATGGCAATCCTCCGAGCTCCACCCACCCTTACTCAACATTGAGTGCTATCGTCTGTCCA-3'
16: 5'-TCACCTGAGACTTGACGATGGCGCTACCCCCTAACTTCAACCCGCATTATTCTAGCTGAGTGCTATCGTCTGTCCA-3'
17: 5'-TCACCTGAGACTTGACGATGGTTACCGAACCCGACACCCCCGCCGACACCAGCCCCAGAGTGCTATCGTCTGTCCA-3'
18: 5'-TCACCTGAGACTTGACGATGGCCCCCCCCCGCACCGCCTCATTCAGCATACTAATACGAGTGCTATCGTCTGTCCA-3'
19: 5'-TCACCTGAGACTTGACGATGGCAACCACCCCCCCTGGCTACATCATATTCTTATCTTGAGTGCTATCGTCTGTCCA-3'
20: 5'-TCACCTGAGACTTGACGATGGTTTCTTCGCCCCCCCCACACACTACACGTTTCTTCTGAGTGCTATCGTCTGTCCA-3'
21: 5'-TGACTGATTTACGGAAGCTGAATAAGGACTGCTTAGGATTGCGATGATTCAGCT-3'
22: 5'-TGACTGATTTACGGAAGTTACGGACGGATGTCAGTGGTATAGTAATCCGTAACT-3'
23: 5'-TGACTGATTTACGGAAGCTGAATAAGGACTGCTTAGGATTGCGATGATTCAGCT-3'
24: 5'-CGTAAATCAGTCAGAAGCTGAATAAGGACTGCTTAGGATTGCGATGATTCAGCT-3'
25: 5'-GCGGAGCGTGGCAGG-3'
26: 5'-CCTGCCACGCTCCGC-3'
27: 5'-CTGAATCATCGCAATCCTAAGCAGTCCTTATTCAGAAAAAAAAAAAAAAAA-3'
Based on the overall Z-scores compiled in Table 1, DNA sequences No. 8, No. 14, and No. 16 were selected because their BD-Tau Z-scores were more negative (more favorable) than their Tau Z-scores, indicating higher affinity for BD-Tau and lower affinity for Tau. The resulting complex conformations were visualized using PyMOL.
Table 1. Aptamer Sequences and Z-Scores
| Number | Aptamer Sequence | Tau Z-Score | BD-Tau Z-Score |
| --- | --- | --- | --- |
| 1 | 5'-TCACCTGAGACTTGACGATGGCATCACTCCCCCCCACCTATTACATCATCATAAATTGAGTGCTATCGTCTGTCCA-3' | -1.1 | -1.3 |
| 2 | 5'-TCACCTGAGACTTGACGATGGCCTCCCCCTCACGCACTCTTCCGTTTCTTCTTATCTGAGTGCTATCGTCTGTCCA-3' | -0.9 | -1.2 |
| 3 | 5'-TCACCTGAGACTTGACGATGGAACTCCCCCCACCATTATCAGCGCACCACCATTGTAGAGTGCTATCGTCTGTCCA-3' | -1.2 | -1.5 |
| 4 | 5'-TCACCTGAGACTTGACGATGGTTTAACTCCCCCACGCCCCCCGCCAACCCATCTCCAGAGTGCTATCGTCTGTCCA-3' | -2.5 | -1.6 |
| 5 | 5'-TCACCTGAGACTTGACGATGGTACGACGGCCCCCCGATTATGCGACTACTTGATTTGAGTGCTATCGTCTGTCCA-3' | -1.5 | -1.7 |
| 6 | 5'-TCACCTGAGACTTGACGATGGTCAGAACGACGCGCCCCCCACCTCATTCATTATTTTGAGTGCTATCGTCTGTCCA-3' | -1.7 | -2.1 |
| 7 | 5'-TCACCTGAGACTTGACGATGGTGACCACCCCCCACGCACACACACCTCTTCCATCCTGAGTGCTATCGTCTGTCCA-3' | -1.7 | -2.0 |
| 8 | 5'-TCACCTGAGACTTGACGATGGCACTACCCCTCCCTACTAAGCACGGTATCTTGTACTGAGTGCTATCGTCTGTCCA-3' | -1.2 | -1.8 |
| 9 | 5'-TCACCTGAGACTTGACGATGGGAACAAACACCGCGACCACCCCCCCACTTAACTCCTGAGTGCTATCGTCTGTCCA-3' | -2.1 | -1.4 |
| 10 | 5'-TCACCTGAGACTTGACGATGGCAATCCCCCCGACACCGAATCCTAAGCGAACAACGCGAGTGCTATCGTCTGTCCA-3' | -1.4 | -1.2 |
| 11 | 5'-TCACCTGAGACTTGACGATGGACTCACAAACTCGAGCCACCCCCGACCCACACAACAGAGTGCTATCGTCTGTCCA-3' | -1.7 | -1.8 |
| 12 | 5'-TCACCTGAGACTTGACGATGGTACTCCCCCCCAACCTAATAGCTCTTTACCCTCTGAGAGTGCTATCGTCTGTCCA-3' | -1.5 | -1.8 |
| 13 | 5'-TCACCTGAGACTTGACGATGGCCGACTCCCCACCCTACATCGCAACATTGACTATTAGAGTGCTATCGTCTGTCCA-3' | -1.7 | -1.3 |
| 14 | 5'-TCACCTGAGACTTGACGATGGTTCTACACTGCCCCCCCGACCCGCCAGACCAACCCAGAGTGCTATCGTCTGTCCA-3' | -1.0 | -2.1 |
| 15 | 5'-TCACCTGAGACTTGACGATGGCAATCCTCCGAGCTCCACCCACCCTTACTCAACATTGAGTGCTATCGTCTGTCCA-3' | -2.6 | -2.2 |
| 16 | 5'-TCACCTGAGACTTGACGATGGCGCTACCCCCTAACTTCAACCCGCATTATTCTAGCTGAGTGCTATCGTCTGTCCA-3' | -1.8 | -2.3 |
| 17 | 5'-TCACCTGAGACTTGACGATGGTTACCGAACCCGACACCCCCGCCGACACCAGCCCCAGAGTGCTATCGTCTGTCCA-3' | -2.2 | -1.5 |
| 18 | 5'-TCACCTGAGACTTGACGATGGCCCCCCCCCGCACCGCCTCATTCAGCATACTAATACGAGTGCTATCGTCTGTCCA-3' | -1.9 | -1.2 |
| 19 | 5'-TCACCTGAGACTTGACGATGGCAACCACCCCCCCTGGCTACATCATATTCTTATCTTGAGTGCTATCGTCTGTCCA-3' | -2.1 | -1.3 |
| 20 | 5'-TCACCTGAGACTTGACGATGGTTTCTTCGCCCCCCCCACACACTACACGTTTCTTCTGAGTGCTATCGTCTGTCCA-3' | -2.2 | -2.1 |
| 21 | 5'-TGACTGATTTACGGAAGCTGAATAAGGACTGCTTAGGATTGCGATGATTCAGCT-3' | -2.6 | -1.5 |
| 22 | 5'-TGACTGATTTACGGAAGTTACGGACGGATGTCAGTGGTATAGTAATCCGTAACT-3' | -1.7 | -1.5 |
| 23 | 5'-TGACTGATTTACGGAAGCTGAATAAGGACTGCTTAGGATTGCGATGATTCAGCT-3' | -1.4 | -1.6 |
| 24 | 5'-CGTAAATCAGTCAGAAGCTGAATAAGGACTGCTTAGGATTGCGATGATTCAGCT-3' | -1.4 | -1.2 |
| 25 | 5'-GCGGAGCGTGGCAGG-3' | -2.3 | -1.4 |
| 26 | 5'-CCTGCCACGCTCCGC-3' | -2.3 | -1.6 |
| 27 | 5'-CTGAATCATCGCAATCCTAAGCAGTCCTTATTCAGAAAAAAAAAAAAAAAA-3' | -2.1 | -2.0 |
Visualization Results
Figure 1 shows that the overall binding mode of BD-Tau to the DNA involves local residues "anchoring" to the DNA, while the remaining segments remain free and extended. The detailed view in Figure 1 reveals the specific locations of the binding hotspot residues and their potential interaction patterns. BD-Tau binds to the DNA grooves through localized segments, likely driven by interactions between positively charged residues and the negatively charged DNA backbone, while retaining a largely disordered structure. Tau anchors via a small number of amino acid residues.
Figure 1. Binding site diagram of protein and DNA aptamer 08
In Figure 2, the orange backbone and blue base pairs represent the DNA double helix, while the BD-Tau protein is depicted as a green thread-like structure adhering to one side of the DNA. BD-Tau does not wrap extensively around the DNA but instead attaches locally, exhibiting a fragmentary binding pattern. The figure shows close contacts between certain BD-Tau residues and the DNA bases, suggesting potential key binding hotspot regions that may involve hydrogen bonds or electrostatic interactions.
Portions of the Tau protein fit into the grooves of the DNA helix, revealing a broad potential contact surface. Tau residues form close contacts with either the phosphate backbone or the edges of DNA bases, which may involve hydrogen bonds or electrostatic interactions. Rather than contracting entirely to wrap around the DNA, Tau establishes stable point-like interactions via specific key residues. This indicates that the binding between Tau and DNA is mediated by a limited number of hotspot residues.
Figure 2. Binding site diagram of protein and DNA aptamer 14
In Figure 3, BD-Tau binds the DNA through a combination of localized hotspots and extensive flexible surface adhesion. The figure shows that certain protein residues (represented as yellow sticks) insert into the DNA groove regions, forming close contacts with base pairs or the phosphate backbone. The binding is likely stabilized by hydrogen bonds and electrostatic interactions, particularly between positively charged amino acid residues (such as Lys and Arg) and the negatively charged phosphate groups of the DNA.
Figure 3. Binding site diagram of protein and DNA aptamer 16
To further validate the affinity of the aptamers for BD-Tau, the dissociation constants (Kd) of these three aptamers were determined using Surface Plasmon Resonance (SPR). The aptamers were serially diluted in running buffer (e.g., 0 μM, 5 μM, 10 μM, 30 μM, 100 μM, 150 μM), and the diluted solutions were injected, in order of increasing concentration and at a constant flow rate, over a sensor chip surface on which the target protein had been immobilized, to monitor the binding process. Kd values were calculated from the measured Response Units (RU); a lower Kd indicates higher binding affinity.
Table 2. Aptamer Sequences and Dissociation Constants.
| Aptamer | Sequence | Kd (μM) |
| --- | --- | --- |
| Aptamer-08 | 5'-TCACCTGAGACTTGACGATGGCACTACCCCTCCCTACTAAGCACGGTATCTTGTACTGAGTGCTATCGTCTGTCCA-3' | 11.32 ± 1.1 |
| Aptamer-14 | 5'-TCACCTGAGACTTGACGATGGTTCTACACTGCCCCCCCGACCCGCCAGACCAACCCAGAGTGCTATCGTCTGTCCA-3' | 6.38 ± 0.78 |
| Aptamer-16 | 5'-TCACCTGAGACTTGACGATGGCGCTACCCCCTAACTTCAACCCGCATTATTCTAGCTGAGTGCTATCGTCTGTCCA-3' | 21.83 ± 1.6 |
Figure 3 and Table 2 illustrate the relationship between aptamer concentration and the binding response measured between each aptamer and its target protein. Five different aptamer concentrations were used for the SPR analysis. The second graph re-plots the sensorgram data with the x-axis representing aptamer concentration, and a logistic best-fit curve was derived; the midpoint of this curve was taken as the dissociation constant (Kd). In our SPR analysis of the various aptamers, a lower Kd value indicates tighter binding and therefore greater affinity. Notably, Aptamer-14 exhibited a Kd of approximately 6.38 ± 0.78 μM, signifying high affinity for the target protein. We therefore employed the high-affinity Aptamer-14 for ELISA validation of specificity.
Figure 3. Affinity assessed by using surface plasmon resonance
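For readers who want to reproduce the Kd estimation step, the sketch below fits a simple one-site steady-state binding model to SPR responses; the concentrations follow the dilution series above, while the response values and the one_site helper are illustrative placeholders rather than our measured data.

```python
# Sketch: estimate Kd from steady-state SPR responses with a one-site binding model
# R = Rmax * C / (Kd + C). Response values are placeholders, not measured data.
import numpy as np
from scipy.optimize import curve_fit

conc = np.array([5, 10, 30, 100, 150], dtype=float)   # aptamer concentration (uM)
response = np.array([12.0, 21.0, 40.0, 58.0, 63.0])   # hypothetical steady-state RU

def one_site(c, rmax, kd):
    return rmax * c / (kd + c)

(rmax_fit, kd_fit), _ = curve_fit(one_site, conc, response, p0=[70.0, 10.0])
print(f"Estimated Kd = {kd_fit:.2f} uM, Rmax = {rmax_fit:.1f} RU")
```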
Model Limitations and Optimization
Molecular docking models are well suited to large-scale screening; however, their predictions have certain limitations and require validation against experimental data to confirm their reliability. In our design, we used Surface Plasmon Resonance (SPR) to verify the affinity of the aptamers for BD-Tau, thereby enhancing the credibility of the findings.
First, we utilized the SELEX system to screen 27 aptamers and performed molecular docking with BD-tau, T-tau, and the aptamers to obtain Z-Score values. Three aptamers demonstrating high affinity for BD-tau but low affinity for T-tau were selected, and their Kd values were determined to characterize binding affinity. Based on the SPR results, Aptamer-14 was selected for subsequent ELISA experiments and sensor detection.
Alzheimer's Disease (AD) is a neurological disorder closely associated with neuronal degeneration. The abnormal phosphorylation and aggregation of Tau protein is one of the hallmark pathological features of AD.
BD-Tau: brain-derived tau, a novel and more targeted tau species that more accurately reflects AD-related Tau pathology.
total-Tau: A traditional biomarker reflecting overall changes in Tau levels.
Composite Protein: An integrated measure derived from multiple phenotypes/detection assays.
To utilize expression level data of these three proteins to develop a machine learning model for predicting an individual's AD status.
Three machine learning models were selected for the binary classification task (AD vs. non-AD) based on BD-Tau levels in artificially simulated blood: Random Forest (RF), Support Vector Machine (SVM), and Lasso logistic regression. Each model was tuned according to its characteristics.
Random Forest (RF) is an ensemble learning method that constructs multiple decision trees during training and merges their results to improve predictive performance and robustness. It is particularly advantageous in biomedical classification tasks where input features may be noisy, nonlinear, and prone to overfitting when modeled by simpler algorithms. In our project, RF was used to classify samples based on BD-Tau levels, a key biomarker for Alzheimer's disease (AD), to distinguish between AD and non-AD individuals. To ensure the model's stability and reduce variance, we set the number of decision trees (n_estimators) to 300. This allows the model to average across a sufficiently large number of trees, mitigating the risk of overfitting to any particular subset of the training data.
A common challenge in medical datasets is class imbalance, where one class (e.g., AD patients) significantly outnumbers the other (e.g., non-AD patients). To address this, we enabled the class_weight="balanced" parameter, which automatically adjusts weights inversely proportional to class frequencies in the input data, ensuring that the model pays equal attention to both classes during training. Additionally, to ensure reproducibility and consistent performance across runs, we set a fixed random_state=42. This seeds the random number generator used for bootstrapping and feature selection, so the same model structure is generated every time the code is executed.
The strength of RF lies in its non-parametric nature and ability to capture complex, nonlinear relationships between features — which is especially useful when classifying patients based on subtle differences in biomarker expression. Each decision tree in the forest is trained on a random subset of the data and a random subset of features, promoting diversity among the trees and enhancing generalization ability.
Ultimately, RF produces a classification output by aggregating the predictions of all individual trees (majority voting). This results in a model that is not only robust to outliers and noise, but also well-suited for high-dimensional biological data where interpretability and stability are critical.
By incorporating Random Forest into our modeling pipeline, we ensured a strong baseline performance, particularly in scenarios where the distribution of BD-Tau levels among AD and non-AD individuals is complex and overlapping. The ensemble approach strengthens the model’s ability to generalize from training data to unseen samples — a key requirement for any practical diagnostic tool.
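The sketch below shows the Random Forest configuration described above in scikit-learn; the feature matrix and labels are random placeholders standing in for the simulated plasma dataset (BD-Tau and total-Tau levels).

```python
# Sketch of the Random Forest configuration described above (placeholder data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(98, 2))          # columns: BD-Tau level, total-Tau level (simulated)
y = rng.integers(0, 2, size=98)       # 1 = AD, 0 = non-AD (placeholder labels)

rf = RandomForestClassifier(
    n_estimators=300,                 # 300 trees, as described above
    class_weight="balanced",          # re-weight classes to handle imbalance
    random_state=42,                  # reproducible bootstrapping / feature sampling
)
rf.fit(X, y)
print(rf.predict_proba(X[:5])[:, 1])  # predicted probability of AD for the first samples
```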
The core idea of SVM is to find an optimal hyperplane that separates samples of different categories. This hyperplane should not only separate the two classes but also maximize the distance (margin) between the hyperplane and the samples of each class. The larger the margin, the higher the confidence of the classification and, usually, the stronger the generalization ability of the model. In other words, SVM does not merely seek any dividing boundary; it seeks the one that stays as far as possible from the data points of both classes.
In the nonlinearly separable case we face here, distinguishing AD patients from non-AD patients based on features such as plasma BD-Tau levels, a linear SVM without a kernel function runs into serious problems. The essence of a linear SVM is to find a linear hyperplane (a straight line in two dimensions) that separates the data. If the BD-Tau features of AD and non-AD patients are distributed in a complex, interwoven way in feature space (for example, some AD patients have high values while some non-AD patients also have high values for other reasons), then no straight line can separate the two classes at an acceptable error rate. In this case, a linear model would yield extremely poor classification performance, and its accuracy might approach random guessing.
The fundamental solution offered by kernel functions is to map the data from the original feature space into a higher-dimensional (or even infinite-dimensional) feature space. The original features may comprise only a few indicators such as BD-Tau levels; a kernel function (such as the RBF kernel) maps each sample point into a much higher-dimensional space. In this new space, the distributions of AD and non-AD samples may change fundamentally, from inseparable to linearly separable, and the SVM can then easily find an optimal linear hyperplane there. Projected back into the original "BD-Tau concentration" feature space, the decision boundary is no longer a straight line but may be a complex curve that winds through the data points, enclosing the regions where AD patients cluster while excluding those of non-AD patients, thereby achieving high-precision classification. This boundary is smooth and optimal: it fits the training data while retaining good generalization ability through margin maximization.
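A minimal sketch of such an RBF-kernel SVM in scikit-learn follows; the standardization step, the regularization constant C, and the placeholder data are illustrative assumptions rather than the exact settings of our runs. probability=True enables the predicted probabilities needed for the ROC, PR, and calibration analyses below.

```python
# Sketch: RBF-kernel SVM on two biomarker features (placeholder data).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(98, 2))                               # BD-Tau and total-Tau levels (simulated)
y = (X[:, 0] + 0.5 * rng.normal(size=98) > 0).astype(int)  # placeholder AD labels

svm = make_pipeline(
    StandardScaler(),                                      # SVMs are sensitive to feature scale
    SVC(kernel="rbf", C=1.0, gamma="scale",
        class_weight="balanced", probability=True),
)
svm.fit(X, y)
print(svm.predict_proba(X[:5])[:, 1])                      # probability of the AD class
```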
Lasso logistic regression combines logistic regression with L1 regularization. The core of logistic regression is to map the output of a linear model to the interval [0, 1] through a sigmoid function, interpreting it as the probability of belonging to the positive class (such as AD). Ordinary logistic regression fits the data by minimizing the loss function, but it is prone to overfitting when there are many features or collinearity.
Lasso (Least Absolute Shrinkage and Selection Operator) solves the overfitting problem by adding an L1 norm penalty term to the loss function. During the optimization process, it tends to precisely compress the coefficients of unimportant features to zero. Features with a coefficient of zero are completely excluded by the model. Therefore, Lasso is not merely a regularizer, but an embedded feature selector built into the optimization process. It automatically outputs a sparse model, retaining only the most critical predictor variables.
In studies that distinguish AD from non-AD patients based on plasma biomarkers (such as BD-Tau), Lasso logistic regression plays a crucial role. Plasma proteomics data are usually high-dimensional: one may measure the concentrations of dozens or even hundreds of proteins, peptides, or metabolites, many of which may be unrelated to AD pathology or highly correlated with other features. L1 regularization sets the coefficients of many unimportant or redundant biomarkers to zero, meaning they are not taken into account in the model's predictions.
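A minimal sketch of L1-penalised logistic regression in scikit-learn is shown below; the data are placeholders, and the solver and C value are illustrative assumptions. Coefficients driven exactly to zero correspond to biomarkers dropped by the Lasso.

```python
# Sketch: Lasso (L1-penalised) logistic regression as an embedded feature selector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(98, 2))               # BD-Tau and total-Tau levels (simulated)
y = (X[:, 0] > 0).astype(int)              # placeholder AD labels

X_std = StandardScaler().fit_transform(X)  # standardize so the L1 penalty treats features equally
lasso_lr = LogisticRegression(penalty="l1", solver="liblinear",
                              C=1.0, class_weight="balanced")
lasso_lr.fit(X_std, y)
print("Coefficients:", lasso_lr.coef_)     # zero entries = features excluded by Lasso
```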
To evaluate the generalization ability of the model, 5-fold stratified cross-validation (StratifiedKFold) was employed. This method ensures that the class distribution in each fold remains consistent with the entire dataset, effectively avoiding evaluation bias caused by uneven data partitioning.
In our experiment, we only input the BD-Tau and total-Tau levels in simulated plasma into the model and had the model perform L1 regularization on these two biomarkers. Once real patient plasma is available, however, the model can do more. For example, if the coefficient of BD-Tau is large and non-zero, this provides strong, data-driven statistical evidence that BD-Tau is a core plasma biomarker for AD. The model may also retain other markers with non-zero coefficients, which could be proteins acting in synergy with BD-Tau or entirely new, insufficiently recognized AD associations, providing important clues for subsequent biological research.
1. ROC Curve (Receiver Operating Characteristic Curve)
This chart, used to evaluate the performance of a binary classification model, shows how the model's performance changes under different classification thresholds. The closer the curve is to the top-left corner, the better the model's performance.
The x-axis of the chart represents the False Positive Rate (FPR), also known as the fall-out rate, which indicates the proportion of all actually negative instances that are incorrectly predicted as positive. The y-axis represents the True Positive Rate (TPR), also known as recall or sensitivity, which indicates the proportion of all actually positive instances that are correctly predicted as positive.
The AUC (Area Under Curve) is a key metric for measuring a model's ability to distinguish between classes. Its value ranges from 0 to 1; a value closer to 1 indicates better model performance in distinguishing between the positive and negative classes.
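The sketch below shows how the per-fold AUC values behind Figures 4-6 can be computed with 5-fold stratified cross-validation; the data are placeholders, and the RF settings simply reuse the configuration described earlier.

```python
# Sketch: mean and SD of ROC AUC over 5 stratified folds (placeholder data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(2)
X = rng.normal(size=(98, 2))
y = (X[:, 0] + 0.7 * rng.normal(size=98) > 0).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
aucs = []
for train_idx, test_idx in cv.split(X, y):
    model = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                                   random_state=42)
    model.fit(X[train_idx], y[train_idx])
    prob = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], prob))

print(f"Mean AUC = {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```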
1.1 Random Forest (RF) Model
Figure 4. The ROC curve of Random forest model
Figure 4 indicates that the RF model achieved a mean AUC of 0.789 with a standard deviation of 0.131 under 5-fold cross-validation, showing that the model possesses reasonably good overall discriminative ability.
1.2 Support Vector Machine (SVM) Model
Figure 5. The ROC curve of SVM model
Figure 5 indicates that the SVM model achieved a mean AUC of 0.860 with a standard deviation of 0.119 under 5-fold cross-validation, demonstrating relatively strong overall discriminative ability.
1.3 Lasso Model
Figure 6. The ROC curve of Lasso model
Figure 6 indicates that the Lasso model achieved a mean AUC of 0.844 with a standard deviation of 0.104 under 5-fold cross-validation, demonstrating good and relatively consistent overall discriminative ability.
ROC Curve Cross Comparison
A comparative analysis of the three models reveals a performance hierarchy. The SVM model achieved the highest mean AUC (0.860), indicating superior overall discriminative ability. It was followed closely by the Lasso model (0.844), which also demonstrated the lowest standard deviation (0.104), signifying the most stable and reliable performance across cross-validation folds. While the Random Forest (RF) model possessed reasonably good discriminative ability (AUC 0.789), its higher standard deviation (0.131) points to greater performance variability, and its mean AUC was lower than the other two models. In conclusion, the SVM model is the recommended choice for predictive power, though the Lasso model presents a compelling alternative due to its consistency.
2. The Precision-Recall (PR) Curve
The Precision-Recall (PR) curve is a performance evaluation tool for binary classification models. Its particular value lies in its ability to handle imbalanced datasets, since it focuses on the model's performance on the positive class rather than being skewed by a large number of negative examples. As shown in the graphs, it plots the relationship between the two metrics across all possible classification thresholds. The y-axis represents precision, which answers the question "Of all the samples we labeled as positive, how many are actually positive?", i.e., the fraction of correctly predicted positive instances among all instances predicted as positive. The x-axis represents recall, also known as sensitivity, which answers the question "Of all the actual positive samples, how many did we successfully find?", i.e., the fraction of correctly predicted positive instances among all actual positive instances.
Each point on the curve represents a unique (Precision, Recall) pair achieved at a specific decision threshold. This allows the model evaluators to visually assess the trade-off between these two metrics. A model that has both high precision and high recall will have a curve that leans toward the top-right corner of the plot. In addition, the Average Precision (AP) score, which is the area under the PR curve, demonstrates a single-figure summary of model performance across all thresholds. Specifically, an AP score that is closer to 1 showcases excellent performance. On the contrary, a score near the baseline (representing a random classifier) suggests poor discriminative ability. In this study, the model was evaluated using 5-fold stratified cross-validation to ensure robustness and generalizability.
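As a companion to the description above, the sketch below computes per-fold precision-recall curves and the Average Precision under the same 5-fold stratified scheme; the model and data are placeholders chosen only to keep the example self-contained.

```python
# Sketch: per-fold PR curves and Average Precision (placeholder data and model).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(3)
X = rng.normal(size=(98, 2))
y = (X[:, 0] + 0.7 * rng.normal(size=98) > 0).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
aps = []
for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(class_weight="balanced").fit(X[train_idx], y[train_idx])
    prob = model.predict_proba(X[test_idx])[:, 1]
    precision, recall, _ = precision_recall_curve(y[test_idx], prob)  # curve points for this fold
    aps.append(average_precision_score(y[test_idx], prob))

baseline = y.mean()   # random-classifier baseline = prevalence of the positive class
print(f"Mean AP = {np.mean(aps):.3f} +/- {np.std(aps):.3f} (baseline {baseline:.2f})")
```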
2.1 Random Forest
For the Random Forest method, the PR curves for each fold are displayed, along with a mean AP value of 0.818 and a standard deviation of 0.123 (Figure 7). This indicates strong overall performance and relatively stable behavior across different data splits. The model significantly outperforms the random classifier baseline (dashed line), confirming its practical utility in identifying positive cases even under imbalanced conditions.
Figure 7. Precision-Recall Curve of the Random Forest method
2.2 Support Vector
For the Support Vector method, the mean AP value was 0.891 with a standard deviation of 0.086 (Figure 8). This indicates a very strong ability to correctly identify positive instances while maintaining high precision across decision thresholds. As with the Random Forest model, this model demonstrates excellent performance in distinguishing between the positive and negative classes, as evaluated by the Precision-Recall curve, and is far superior to random guessing.
Figure 8. Precision-Recall Curve of the Support Vector Method
2.3 Lasso method
Figure 9. Precision-Recall Curve of Lasso method
For the Lasso logistic regression method, the PR curves for each fold are displayed, along with a mean AP value of 0.839 and a standard deviation of 0.091 (Figure 9). This likewise indicates strong overall performance and relatively stable behavior across different data splits. The model significantly outperforms the random classifier baseline (dashed line), confirming its practical utility in identifying positive cases. This result demonstrates that Lasso regression, with its inherent feature selection capability, is a highly effective model for this classification task, even under imbalanced conditions.
1. Out-of-Fold (OOF)
First, the total dataset is randomly partitioned, with stratification, into five mutually exclusive subsets (folds) of approximately equal size. Stratification ensures that the proportion of AD patients to non-AD patients in each fold is consistent with the overall dataset. Next, five rounds of training are carried out, each with a different training/test split. For example, in Round 1, Fold 1 is the test set, and Folds 2, 3, 4, and 5 are combined as the training set. A Random Forest model (Model 1) is fitted on the training set, and the trained Model 1 is then used to predict all samples in the test set (Fold 1), recording the true and predicted label of each sample. The key point is that this model never saw the data in Fold 1 during training, so its predictions for Fold 1 are genuinely out-of-sample.
In Round 2, Fold 2 is the test set, and Folds 1, 3, 4, and 5 are combined as the training set; a new model (Model 2) is trained and used to predict all samples in Fold 2. This process is repeated until each fold has served as the test set exactly once.
After five rounds, every sample in the dataset has been predicted by a model that never saw it during training. By concatenating the five folds of predictions in the original sample order, we obtain a complete set of out-of-fold (OOF) predictions.
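The whole OOF loop above can be expressed in a few lines with scikit-learn's cross_val_predict, as in the sketch below; the data are placeholders and the classifier reuses the RF settings described earlier.

```python
# Sketch: out-of-fold (OOF) probabilities - every sample is predicted by a model
# that never saw it during training (placeholder data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(4)
X = rng.normal(size=(98, 2))
y = (X[:, 0] + 0.7 * rng.normal(size=98) > 0).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)

oof_prob = cross_val_predict(rf, X, y, cv=cv, method="predict_proba")[:, 1]  # original sample order
oof_pred = (oof_prob >= 0.5).astype(int)
print(oof_prob[:5], oof_pred[:5])
```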
1.1 RF
Figure 10 shows that the predicted probabilities of the vast majority of truly Normal samples are concentrated around 0.0, while those of the vast majority of truly AD samples are concentrated around 1.0. The separation between the two distributions is very high, indicating that the Random Forest model has strong discriminative ability, can effectively separate positive and negative samples, and makes predictions with high confidence.
Figure 10. Out-of-Fold curve of RF method
1.2 SVM
Figure 11 shows that the predicted probabilities of the vast majority of truly Normal samples are concentrated around 0.2, while those of the vast majority of truly AD samples are concentrated around 1.0. Although this is slightly weaker than the Random Forest result, the separation between the two distributions is still high, indicating that the SVM model also has strong discriminative ability, can effectively separate positive and negative samples, and makes predictions with high confidence.
Figure 11. Out-of-Fold curve of SVM method
1.3 Lasso
Figure 12 shows that the predicted probabilities of truly Normal samples are concentrated around 0.4, while those of truly AD samples are not highly concentrated. This is a comparatively poor result, but it is partly because our input contained only the Tau and BD-Tau levels; including additional biomarker levels in the future should lead to better performance.
Figure 12. Out-of-Fold curve of Lasso method
2. Confusion Matrix
First, the 5-fold OOF process above yields prediction labels for every sample in the dataset, and we also have the corresponding true labels. Next, the OOF prediction labels (y_pred_oof) are compared one-to-one against the true labels (y_true). The confusion matrix is a standardized summary of these comparison results.
Based on the comparison results, all samples are divided into four quadrants:
True positive: the truth is AD, and the OOF prediction is also AD.
True negative: the truth is Normal, and the OOF prediction is also Normal.
False positive: the truth is Normal, but the OOF prediction is AD.
False negative: the truth is AD, but the OOF prediction is Normal.
Filling these four counts into a 2x2 matrix and visualizing it yields the OOF confusion matrices shown below (Figures 13-15).
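In code, the confusion matrix is a single call once the OOF labels are available; the tiny label arrays below are placeholders, not our actual predictions.

```python
# Sketch: 2x2 OOF confusion matrix from true labels and OOF predictions (placeholders).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])       # 0 = Normal, 1 = AD (placeholder)
y_pred_oof = np.array([0, 1, 0, 1, 1, 0, 1, 0])   # OOF predictions (placeholder)

cm = confusion_matrix(y_true, y_pred_oof, labels=[0, 1])  # rows = truth, columns = prediction
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```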
2.1 RF
The Random Forest model correctly identified 35 out of 50 truly Normal samples and 36 out of 48 truly AD samples (Figure 13). This result helps us understand the types of errors made by the model and further optimize it according to specific requirements.
Figure 13. The OOF of RF method
2.2 SVM
The SVM model correctly identified 43 out of 50 truly Normal samples and 35 out of 48 truly AD samples (Figure 14). This helps us understand the types of errors made by the model and further optimize it according to specific requirements.
Figure 14. The OOF of SVM method
2.3 Lasso
The Lasso model correctly identified 46 out of 50 truly Normal samples and 26 out of 48 truly AD samples (Figure 15). This result helps us understand the types of errors made by the model and further optimize it according to specific requirements.
Figure 15. The OOF of Lasso method
3. Calibration Curve
A calibration curve, also known as a reliability diagram, is used to evaluate the reliability of a model's predicted probabilities. The dashed diagonal in the graph represents a perfectly calibrated model: if the curve coincides with the diagonal, the model's predicted probabilities are perfectly reliable.
The x-axis of the calibration curve represents the binned average of predicted probabilities, while the y-axis represents the proportion of actual positive samples within each probability bin. When the curve lies above the diagonal, it indicates that the model's predictions are under-confident (i.e., the predicted probabilities are lower than the actual proportion of positive samples). Conversely, when the curve lies below the diagonal, it suggests that the model's predictions are over-confident.
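The points of such a reliability diagram can be computed directly from the OOF probabilities, as in the sketch below; the probabilities and labels are synthetic placeholders.

```python
# Sketch: reliability-diagram points - binned mean predicted probability vs the
# observed fraction of positives in each bin (synthetic placeholder data).
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(5)
prob = rng.uniform(size=98)                    # OOF predicted probabilities (placeholder)
y = (rng.uniform(size=98) < prob).astype(int)  # labels roughly consistent with the probabilities

frac_positive, mean_predicted = calibration_curve(y, prob, n_bins=5)
for p, f in zip(mean_predicted, frac_positive):
    print(f"predicted ~ {p:.2f}  ->  observed positive fraction = {f:.2f}")
```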
3.1 RF
Figure 16. The calibration curve of RF
As can be seen from Figure 16, at predicted probability levels of 0.2 and 0.6 the data points lie below the diagonal, indicating that the model's predictions are over-confident in this range. Conversely, at probability levels of 0.4 and 0.8, the points lie above the diagonal, suggesting under-confident predictions.
3.2 SVM
Figure 17. The calibration curve of SVM
As can be seen from Figure 17, at predicted probability levels of 0.2 and 0.4 the data points lie below the diagonal, indicating that the model's predictions are over-confident in this range. In the region where the predicted probability approaches 1.0, the curve coincides with the diagonal, demonstrating that the model's predictions are well-calibrated and highly reliable there.
3.3 Lasso
Figure 18. The calibration curve of Lasso
Figure 18 shows that at predicted probabilities of 0.6 and 0.8 the curve points lie above the diagonal, suggesting under-confident predictions. This model will require adjustment and optimization in a later phase.
The cumulative gain curve assesses the model's ability to identify high-risk populations by comparing the model against random selection. The horizontal axis represents the percentage of samples selected, with samples ranked by the model's predicted probability in descending order; the vertical axis represents the percentage of all actual positive samples captured within the selected subset. The underlying question is simple: if we sort the samples from high to low by predicted probability and examine them one by one, can the model identify positive cases (such as AD patients) more efficiently than random sampling? The analysis starts from the probability that the model assigns to each sample of belonging to the positive class. These probabilities are the out-of-fold prediction probabilities generated through 5-fold cross-validation, which keeps the probability estimates unbiased and reliable.
To evaluate the model, a baseline is needed, namely the performance of random selection. If we randomly select X% of the samples without any model, we expect to capture X% of the positive cases. For instance, randomly selecting 20% of the samples is expected to identify 20% of the AD patients, and selecting 100% of the samples naturally finds 100% of them. The baseline is therefore a diagonal line from (0,0) to (1,1).
The core of the cumulative gain analysis lies in the gap between the model curve and this baseline. The steeper the model curve and the closer it rises toward the upper-left corner, the stronger the model's ranking ability, meaning the model successfully assigned high probabilities to the true positive cases.
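The gain curve itself is easy to recompute from the OOF probabilities: sort samples by predicted risk and track the fraction of all positives captured, as in the placeholder sketch below.

```python
# Sketch: cumulative gain - fraction of all positives captured within the
# top-ranked samples (synthetic placeholder data).
import numpy as np

rng = np.random.default_rng(6)
prob = rng.uniform(size=98)                    # OOF predicted probabilities (placeholder)
y = (rng.uniform(size=98) < prob).astype(int)

order = np.argsort(-prob)                      # highest predicted risk first
captured = np.cumsum(y[order]) / y.sum()       # fraction of positives captured so far
selected = np.arange(1, len(y) + 1) / len(y)   # fraction of samples selected so far

for frac in (0.2, 0.4, 0.5):
    idx = int(frac * len(y)) - 1
    print(f"Top {frac:.0%} of samples capture {captured[idx]:.0%} of positives "
          f"(random baseline: {selected[idx]:.0%})")
```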
1.1 RF
As can be seen from Figure 19, when the top 20% of samples are selected, the Random Forest model covers nearly 40% of the positive samples, and when the top 40% are selected, it already covers more than 70% of the positive samples. This indicates that the model has a strong ability to identify high-risk groups and would be very efficient for screening under limited resources.
Figure 19. The cumulative gain curve of RF
1.2 SVM
As can be seen from Figure 20, when the top 40% of samples are selected, the SVM model already covers more than 60% of the positive samples, a markedly higher efficiency than the random baseline. This indicates that the model has a strong ability to identify high-risk groups and would be very efficient for screening under limited resources.
Figure 20. The cumulative gain curve of SVM
1.3 Lasso
As can be seen from Figure 21, when the top 50% of samples are selected, the Lasso model already covers more than 80% of the positive samples, again markedly more efficient than the random baseline. This indicates that the model has a strong ability to identify high-risk groups and would be very efficient for screening under limited resources.
Figure 21. The cumulative gain curve of Lasso
The Lift Curve is a practical tool used to evaluate a classification model’s ability to prioritize true positive cases — in this context, individuals with Alzheimer’s disease (AD). Unlike metrics that consider the model's overall accuracy, the lift curve focuses on how well the model performs when only a subset of the population can be selected for further testing or intervention — a common scenario in clinical settings where resources are limited.
The x-axis of the lift curve represents the proportion of the sample population selected, sorted by descending predicted probability. The y-axis indicates the lift, which is the ratio between the number of true positives identified by the model and the number expected by random selection. A lift of 2.0 at 10% means that the model identifies twice as many true positives as random guessing within the top 10% of highest-scoring predictions. As more samples are included and lower-confidence predictions are considered, the lift typically declines toward 1.0, where the model performs no better than chance.
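Lift at a given selection fraction is simply the positive rate within the top-ranked subset divided by the overall positive rate; the sketch below computes it from placeholder OOF probabilities.

```python
# Sketch: lift = (positive rate within the top fraction of ranked samples) /
# (overall positive rate). Synthetic placeholder data throughout.
import numpy as np

rng = np.random.default_rng(7)
prob = rng.uniform(size=98)                    # OOF predicted probabilities (placeholder)
y = (rng.uniform(size=98) < prob).astype(int)

order = np.argsort(-prob)
overall_rate = y.mean()

for frac in (0.1, 0.2, 0.5, 1.0):
    k = max(1, int(frac * len(y)))
    top_rate = y[order][:k].mean()             # positive rate among the top-k predictions
    print(f"Top {frac:.0%}: lift = {top_rate / overall_rate:.2f}")
```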
2.1 RF Lift Curve
Figure 22: RF Lift Curve
The Random Forest model, which aggregates the output of 300 decision trees, showed a maximum lift of approximately 1.9 within the top 20% of selected samples (Figure 22). This indicates that in this high-confidence range the model is nearly twice as effective as random selection at identifying AD patients based on BD-Tau levels. Beyond this point, the lift curve gradually declines, eventually reaching 1.0 at the rightmost end, where all samples are included. This pattern suggests that the model performs best when targeting a smaller, high-risk subset and that its discriminative power diminishes as it expands to the broader population. The use of class_weight="balanced" and a fixed random_state ensured robustness and reproducibility across folds, while the model's ensemble nature contributed to stable performance in the presence of noisy or overlapping features.
2.2 SVM Lift Curve
Figure 23: SVM Lift Curve
As shown in Figure 23, the SVM model achieves its maximum lift of 2.0 within the top 10% of samples with the highest predicted probability. This reflects the SVM's capacity to separate positive from negative cases effectively in the high-dimensional feature space, especially in the most confident region. The lift remains consistently high, above 1.8, within the top 20% of samples, suggesting that the model prioritizes high-risk individuals effectively. Beyond this point, the lift gradually decreases, approaching 1.0 as more low-probability samples are included, indicating that the model's advantage diminishes and eventually matches random selection.
This performance pattern is typical of a well-calibrated classifier. The smooth decline in lift demonstrates that the model assigns probabilities in a meaningful way and does not overfit to noise or outliers in the data. Compared to other models, SVM appears to maintain strong discrimination in the top-ranked predictions, which is particularly important for clinical applications, where only a subset of patients may be selected for further testing or early intervention.
2.3 Lasso Lift Curve
Figure 24: Lasso Lift Curve
In this case, the Lasso model achieves its maximum lift of just above 1.7 at around 30% of the sample population, indicating that within this top fraction the model identifies roughly 75% more true positives than a random classifier. Notably, the lift remains above 1.5 for up to approximately 50% of the sample set, demonstrating that the model retains strong discriminative ability well beyond the top decile of predictions. Beyond this point, the lift gradually declines, approaching 1.0 as predictions become less confident, reflecting the natural convergence toward random performance when all samples are included. This behavior suggests that the Lasso model is particularly effective in mid-range targeting scenarios, where up to half of the population may be screened or prioritized. While it may not achieve the same peak lift as more complex non-linear models, its stable performance across a broader range, combined with its simplicity, interpretability, and intrinsic feature selection, makes it a valuable tool for scalable and resource-conscious diagnostic pipelines.
1. Expand the dataset: The current model is based on only 48 samples. Future directions include expanding the dataset with more AD/non-AD experimental data and performing parameter tuning to optimize accuracy.
2. Incorporate multimodal data: The model currently uses simulated plasma data with homogeneous variables. A key optimization is to incorporate additional predictive indicators to improve robustness.
Utilizing expression level data from these three proteins, we developed a machine learning model to predict individual Alzheimer's Disease (AD) status. The results demonstrate the feasibility of this approach. The Support Vector Machine (SVM) model, in particular, demonstrated superior performance, while the Random Forest (RF) model requires further optimization. Additionally, we employed methods such as Out-of-Fold (OOF) prediction to thoroughly evaluate the model's accuracy and reliability.