Software

Overview


Data can only be interpreted well when it is placed in the proper context. Our main goal is to contextualize every piece of health information a person has so that they can better understand the progression of their bacterial vaginosis (BV). Consequently, our interpretation of a user’s data needs to be based on that user’s own health statistics, not just on statistics available in the literature. The vaginal microbiome is unique to each individual, so our project must tailor its explanations to each user. The primary intent of our software is therefore to create a working baseline from a user’s own body dynamics. We also need a graphical user interface (GUI) to communicate with users and display information to them. Easy, accessible communication is the cornerstone of increasing product availability. By communicating our science in an ethical and responsible way, we reaffirm our values of empathy and integrity. To see our GUI, please click here.

Requirement Analysis


To create our software, we had to consider several requirements. Our algorithm needs to integrate easily with a GUI, update daily, and learn and adapt to a user’s features. It also has to be accurate and fast, which we prioritized after examining our survey responses (read more about our survey here). We chose Random Forest Classification as our machine-learning algorithm to interpret the data. A random forest is an ensemble model, which can improve generalization on noisy or sparse data. Furthermore, each decision tree in a random forest is trained on a bootstrap sample of the original dataset, and combining the predictions of many such trees reduces the variance of the model compared with a single tree. Originally, we planned to use Random Forest Regression, but we could not find data with discrete numerical values to represent BV states, so we switched to a classifier.
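The two ideas behind a random forest mentioned above, bootstrap sampling and combining the trees' predictions, can be illustrated with a small sketch. This is not our project code: the function names, the stand-in records, and the "BV"/"healthy" labels are assumptions chosen only for illustration, and a real forest would also train a decision tree on each sample.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the training set for one tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine the class labels predicted by individual trees into one label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)      # fixed seed so the sketch is repeatable
rows = list(range(10))      # stand-ins for 10 training records (hypothetical)

# Same size as the original data, but rows may repeat and some may be left out.
sample = bootstrap_sample(rows, rng)
print("bootstrap sample:", sample)

# Hypothetical per-tree predictions for one user; the ensemble takes the majority.
tree_votes = ["BV", "healthy", "BV"]
print("ensemble prediction:", majority_vote(tree_votes))
```

Because every tree sees a slightly different resampling of the data, the errors of individual trees tend to differ, and the vote averages part of them out.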

Coding


We wrote two classifiers to predict BV status. The first used age, pH, and a normalized putrescine value as features; the second used only pH and age. Because our biosensor detects putrescine, we built one classifier with putrescine and one without it to capture its effect on accuracy and prediction. We also used Matplotlib to plot feature importance. In the first classifier, putrescine was the most important feature, which suggests that putrescine is relevant for predicting BV status. The accuracy of the first classifier was 85%, whereas that of the second was 70%. This difference of 15 percentage points suggests that putrescine is important for detecting BV status, and it is our contribution to the understanding of BV.

Figure 1. Random Forest Feature Importance
Figure 2. Random Forest Decision Tree

The actual code uses scikit-learn to train and evaluate the machine learning model.
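A minimal sketch of such a workflow, comparing a classifier with putrescine to one without it, might look like the following. It is not our actual code: the data is synthetic and generated by an invented toy rule, the helper `fit_and_score`, the 75/25 split, and the 100 trees are all assumptions, and the numbers it prints say nothing about our reported accuracies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic placeholder data (NOT the project's dataset).
age = rng.integers(18, 50, size=n)
ph = rng.normal(4.5, 0.6, size=n)
putrescine = rng.random(n)  # stands in for a value normalized to [0, 1]

# Invented toy rule so the example has a signal to learn; 1 = BV, 0 = no BV.
y = ((putrescine + 0.3 * (ph - 4.5) + rng.normal(0, 0.2, n)) > 0.5).astype(int)

feature_names = ["age", "ph", "putrescine"]
X_full = np.column_stack([age, ph, putrescine])
X_reduced = X_full[:, :2]  # age and pH only


def fit_and_score(X, y):
    """Train a random forest on a train split and return it with its test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))


model_full, acc_full = fit_and_score(X_full, y)
model_reduced, acc_reduced = fit_and_score(X_reduced, y)

print(f"accuracy with putrescine:    {acc_full:.2f}")
print(f"accuracy without putrescine: {acc_reduced:.2f}")

# Impurity-based importances of the full model; they sum to 1.
for name, importance in zip(feature_names, model_full.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The values in `feature_importances_` are the kind of quantity shown in Figure 1, and could be passed to a Matplotlib bar chart to produce a similar plot.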

Impact


Because our first classifier found putrescine to be the most important feature for predicting BV status, we propose that putrescine be given more weight in the diagnostic process for BV. Our software’s contribution is evidence for the importance of putrescine in detecting BV. This opens the way for new putrescine-based diagnostic tools and approaches that can be used to track the progression of BV. Furthermore, by calling attention to putrescine in relation to BV, we hope to sharpen the focus of treatment research.