This study utilized MCF-7 and MDA-MB-231 cell lines to conduct experiments across groups with varying plasmid concentrations: Control, 1 μg, 2 μg, and 4 μg.
In the CCK8 assay, each plasmid concentration group included five replicates, with CCK8 levels measured at 0, 24, and 48 hours.
In the wound healing assay, each group included three replicates to assess cell migration.
Similarly, in the Transwell assay, each group included three replicates to evaluate cell migration capacity.
2.1 Data Distribution Analysis
Following data preprocessing, box plots were generated using the ggboxplot function from the ggpubr package to visualize variations in CCK8 proliferation, wound healing migration, and Transwell migration across plasmid concentration groups. The analysis was conducted using the following code snippets:
2.1.1 CCK8 Assay Plotting Code
library(ggpubr)
# `data` holds the CCK8 readings of the cell line being plotted
ggboxplot(data, x = "timing", y = "value", color = "treatment", add = "jitter", ylab = "CCK8 Cell proliferation level", xlab = "timing", main = "CCK8 experiment of MCF-7 cell line", ylim = c(0, 700))
ggboxplot(data, x = "timing", y = "value", color = "treatment", add = "jitter", ylab = "CCK8 Cell proliferation level", xlab = "timing", main = "CCK8 experiment of MDA-MB-231 cell line", ylim = c(0, 700))
2.1.2 Wound Healing Assay Plotting Code
ggboxplot(data, x = "celltype", y = "value", color = "treatment", add = "jitter", ylab = "Wound healing Cell migration level", xlab = "Cell type", main = "Wound healing experiment of two cell lines")
2.1.3 Transwell Assay Plotting Code
ggboxplot(data, x = "celltype", y = "value", color = "treatment", add = "jitter", ylab = "Transwell Cell migration level", xlab = "Cell type", main = "Transwell experiment of two cell lines")
2.2 Model Construction
The modeling process was divided into four steps: 1) Calculation of CCK8 time series; 2) Integration of static data; 3) Machine learning model construction; 4) Model evaluation.
2.2.1 CCK8 Time Series Calculation
Since the CCK8 data in this project includes measurements at 0, 24, and 48 hours, this study first analyzed the relationship between cell proliferation and time and extracted temporal dynamic features. Specifically, the proliferation rate (the change in signal from 0 to 48 hours divided by 48), the area under the curve (AUC), and the maximum growth value (the highest reading at 24 or 48 hours) were calculated for each sample for subsequent analysis. The analysis was conducted using the following code:
# First step: extract temporal features from the CCK8 time series
library(dplyr)
dynamic_features <- data1 %>%
  group_by(SampleID, Concentration) %>%
  summarise(
    Slope_0_48 = (value[timing == 48] - value[timing == 0]) / 48, # proliferation rate over 0-48 h
    AUC = MESS::auc(timing, value), # area under the growth curve
    Max_Growth = max(value[timing %in% c(24, 48)]), # highest reading after 0 h
    .groups = "drop"
  )
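To make the three features concrete, the following is a minimal base-R sketch on made-up readings. The data frame `toy`, the helper `trapz_auc`, and all numeric values are hypothetical and are not taken from the study. The helper uses the trapezoidal rule, which is assumed here to correspond to the default linear interpolation of MESS::auc.

```r
# Toy CCK8 readings for two hypothetical samples (values are invented)
toy <- data.frame(
  SampleID = rep(c("S1", "S2"), each = 3),
  timing   = rep(c(0, 24, 48), times = 2),
  value    = c(100, 220, 400, 100, 180, 280)
)

# Trapezoidal area under the curve for one sample
trapz_auc <- function(x, y) {
  ord <- order(x)
  x <- x[ord]
  y <- y[ord]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Compute the three dynamic features per sample
toy_features <- do.call(rbind, lapply(split(toy, toy$SampleID), function(d) {
  data.frame(
    SampleID   = d$SampleID[1],
    Slope_0_48 = (d$value[d$timing == 48] - d$value[d$timing == 0]) / 48,
    AUC        = trapz_auc(d$timing, d$value),
    Max_Growth = max(d$value[d$timing %in% c(24, 48)])
  )
}))
print(toy_features)
```

For sample S1 this gives a slope of (400 - 100) / 48 = 6.25, an AUC of 24 × 160 + 24 × 310 = 11280, and a maximum growth value of 400.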
2.2.2 Integration of Static Data
Since the CCK8 assay included five replicate samples, while the wound healing and Transwell assays included only three, only the first three replicates of the CCK8 assay were included in the analysis. After the wound healing and Transwell data were extracted, they were merged with the CCK8 temporal dynamic features from the previous step using the merge function, keyed on sample ID, and the result was stored as the full_data object. In this step, the factor function was used to convert the plasmid concentration groups into a categorical variable to facilitate subsequent analysis. The analysis was conducted using the following code:
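library(dplyr) # provides distinct()
# Load the Transwell and wound healing measurements
data2 <- read.csv("01figure/transwell.csv", header = TRUE, row.names = 1)
data3 <- read.csv("01figure/woundhealing.csv", header = TRUE, row.names = 1)
# Treatment labels are taken from the same rows as the Transwell values so that every column has 12 entries
other_data <- data.frame(Transwell = data2$value[13:24], WH = data3$value[1:12], Treatment = data2$treatment[13:24])
other_data$SampleID <- 1:12
other_data$Concentration <- rep(c(0, 1, 2, 4), each = 3)
other_data$Treatment <- factor(other_data$Treatment, levels = c("Control", "Ab=1", "Ab=2", "Ab=4"))
# Merge the migration data with the CCK8 dynamic features by sample ID
full_data <- merge(dynamic_features,
  distinct(other_data[, c("SampleID", "WH", "Transwell", "Treatment")]),
  by = "SampleID")
# Written with saveRDS(), so the file must be read back with readRDS() despite the .RData extension
saveRDS(full_data, "01figure/train_data.RData")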
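The effect of merging on sample ID when the two sources have different numbers of replicates can be illustrated with a small base-R sketch. The objects `cck8_features` and `migration` and their values are hypothetical. With its default arguments, merge performs an inner join, so CCK8 replicates without matching migration data are dropped.

```r
# Hypothetical CCK8 features (5 replicates) and migration data (3 replicates)
cck8_features <- data.frame(SampleID = 1:5,
                            AUC = c(11280, 8880, 9100, 9500, 10020))
migration <- data.frame(SampleID = 1:3,
                        WH = c(0.62, 0.48, 0.51),
                        Transwell = c(210, 150, 166))

# Inner join on SampleID: only samples present in both tables are kept
merged <- merge(cck8_features, migration, by = "SampleID")
print(merged)

# Samples that did not match are worth inspecting before modelling
unmatched <- setdiff(cck8_features$SampleID, merged$SampleID)
print(unmatched)
```

Checking the unmatched IDs is a cheap safeguard: a merge that silently loses rows would shrink an already small training set.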
2.2.3 Construction of the Machine Learning Model
In this study, a machine learning approach was employed to model the relationship between plasmid concentration and experimental measurements. First, the data from the MDA-MB-231 cell line, after undergoing the aforementioned preprocessing steps, was designated as the training set, while the processed data from the MCF-7 cell line was designated as the test set.
Second, a random forest algorithm was selected to evaluate the contribution of each variable to the model, including proliferation rate (Slope_0_48), area under the curve (AUC), maximum growth (Max_Growth), wound healing assay results (WH), and Transwell assay results (Transwell).
Third, the model was trained on the training set and subsequently validated using the test set data. The final model was stored as “rf_final.rda”.
The analysis was conducted using the following code:
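library(caret)
# Second step: fit a random forest and inspect variable importance.
# train_control is not defined in this excerpt; it is assumed to be a caret::trainControl() object created beforehand.
# Concentration must be a factor for caret to fit a classifier rather than a regression model.
effit <- train(
  Concentration ~ Slope_0_48 + AUC + Max_Growth + WH + Transwell,
  data = full_data,
  method = "rf",
  trControl = train_control,
  verbose = FALSE
)
ggplot(varImp(effit)) + theme_minimal()
# Third step: load the cached final model, or refit with the selected tuning parameters
if (file.exists("01figure/rf_final.rda")) {
  rf_final <- readRDS("01figure/rf_final.rda")
} else {
  trControl <- trainControl(method = "none", classProbs = TRUE)
  set.seed(1516)
  rf_final <- train(
    Concentration ~ Slope_0_48 + AUC + Max_Growth + WH + Transwell,
    data = full_data,
    method = "rf",
    tuneGrid = effit$bestTune,
    trControl = trControl
  )
  saveRDS(rf_final, "01figure/rf_final.rda")
}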
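One practical point when requesting class probabilities: caret requires the outcome's factor levels to be syntactically valid R names when classProbs = TRUE, and labels such as "Ab=1" are not. The sketch below is a hypothetical illustration, not part of the original analysis. It shows how such levels could be checked and renamed with base R's make.names if a factor like Treatment were used as the outcome.

```r
# Treatment labels as they appear in the integration step
treatment <- factor(c("Control", "Ab=1", "Ab=2", "Ab=4"),
                    levels = c("Control", "Ab=1", "Ab=2", "Ab=4"))

# Which levels are already syntactically valid R names?
valid <- levels(treatment) == make.names(levels(treatment))
print(valid)

# Rename the levels in place; the order of the levels is preserved
levels(treatment) <- make.names(levels(treatment))
print(levels(treatment))
```

After renaming, the levels read Control, Ab.1, Ab.2, and Ab.4, and the columns of the predicted probability matrix would carry the same names.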
2.2.4 Evaluation of Model Accuracy
The reliability of the model was assessed using the ROC (Receiver Operating Characteristic) curve generated with the roc function from the pROC package. Visualization of the ROC curve was performed using the ggplot function from the ggplot2 package. The analysis was conducted using the following code:
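library(pROC)
library(ggplot2)
# Class probabilities for the held-out test set
prediction_prob <- predict(rf_final, newdata = test_data, type = "prob")
# roc() needs the observed classes as the response; the outcome column is assumed to be Concentration, as in the model formula
roc_obj <- roc(response = test_data$Concentration, predictor = prediction_prob[, 1])
roc_obj
ROC_data <- data.frame(FPR = 1 - roc_obj$specificities, TPR = roc_obj$sensitivities)
ROC_data <- ROC_data[order(ROC_data$FPR), ]
# The x-axis shows the false positive rate, i.e. 1 - specificity
p <- ggplot(data = ROC_data, mapping = aes(x = FPR, y = TPR)) +
  geom_step(color = "blue", size = 1, direction = "mid") +
  geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1)) + theme_classic() +
  xlab("1 - Specificity") +
  ylab("Sensitivity") +
  coord_fixed(1) +
  xlim(0, 1) +
  ylim(0, 1) +
  annotate("text", x = 0.5, y = 0.25, label = paste("AUC:", round(roc_obj$auc, 2)))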
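As a reminder of what the AUC measures, the following base-R sketch computes it by hand for a binary case, as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The vectors `labels` and `scores` and the helper `rank_auc` are hypothetical and unrelated to the study data. The function uses the rank-sum (Mann-Whitney) formulation rather than pROC.

```r
# Hypothetical binary labels (1 = positive class) and predicted probabilities
labels <- c(1, 1, 1, 0, 1, 0, 0, 0)
scores <- c(0.9, 0.8, 0.55, 0.6, 0.4, 0.35, 0.3, 0.1)

# AUC as the probability that a random positive outranks a random negative
rank_auc <- function(labels, scores) {
  r  <- rank(scores) # average ranks handle tied scores
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_value <- rank_auc(labels, scores)
print(auc_value)
```

Here 14 of the 16 positive-negative pairs are ordered correctly, so the AUC is 14 / 16 = 0.875. A value of 0.5 corresponds to random ranking, which is the diagonal reference line drawn in the ROC plot.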
3.1 Data Distribution
After recording and organizing the experimental data, box plots were used to visualize the distribution of values. In the MCF-7 cell line, CCK8 cell proliferation levels increased over time. At 24 hours, there was no significant difference among plasmid concentration groups; however, at 48 hours, the 1 μg plasmid group showed significantly lower CCK8 levels compared to the other groups (Figure 1A). A similar trend was observed in the MDA-MB-231 cell line, where at 48 hours, plasmid-treated groups exhibited significantly lower CCK8 levels than the control group (Figure 1B). These findings suggest that the plasmid exerts an inhibitory effect on cell proliferation in both MCF-7 and MDA-MB-231 cell lines, with more pronounced effects at 48 hours.
In the wound healing assay, all plasmid-treated groups showed significantly reduced cell migration compared to the control group, consistent across both MCF-7 and MDA-MB-231 cell lines (Figure 2). Higher plasmid concentrations demonstrated stronger inhibitory effects on migration. Notably, in the MCF-7 cell line, even the 1 μg plasmid group showed substantial inhibition, whereas in the MDA-MB-231 cell line, the 2 μg and 4 μg plasmid groups exhibited similar levels of suppression.
The Transwell assay results were largely consistent with those of the wound healing assay. In both MCF-7 and MDA-MB-231 cell lines, all plasmid-treated groups exhibited significantly reduced migration compared to the control group (Figure 3). Moreover, the inhibitory effect of the plasmid on cell migration appeared to follow a concentration-dependent gradient.
3.2 Data Modeling and Analysis
In this study, a random forest machine learning model was used to analyze the data. In the variable importance ranking, Transwell cell migration level and proliferation rate (Slope_0_48) contributed the most to the model (relative importance greater than 50%), while wound healing migration level (WH) and maximum growth (Max_Growth) contributed less (below 25%). These values indicate how strongly each feature informs the prediction of the dependent variable (plasmid concentration). The final model was saved as “rf_final.rda” so that it can be reloaded in R for later use.
Model performance was then evaluated using a ROC (Receiver Operating Characteristic) curve. The area under the curve (AUC) was 0.771, suggesting that the model possesses moderate predictive power. Because the model was trained on MDA-MB-231 data and tested on MCF-7 data, it is hypothesized that heterogeneity between the two cell lines may have introduced variability in the outcomes.