As the coronavirus disease 2019 (COVID-19) pandemic has spread around the world, huge amounts of bioinformatics data have been generated and analyzed. Logistic regression models have been central to many papers that have helped illuminate important features of the disease, such as which mutations are linked to more serious disease outcomes.
Logistic regression models are used for binary classification, can be generalized to multi-class classification, and usually work very well. Researchers from the US Air Force Medical Readiness Agency have studied how the training of logistic regression models affects performance, and which features are best included when examining datasets of people suffering from COVID-19.
A preprint version of the study is available on the medRxiv* server while the article undergoes peer review.
The initial export of raw Global Initiative on Sharing Avian Influenza Data (GISAID) data was curated using shell scripts, with FASTA sequences parsed from the export. Samples with patient outcome metadata attached were separated out, with ~30,000 samples with severe outcomes and ~25,000 samples with mild outcomes used in the analyses. Scikit-learn was used to fit logistic regression models, and a train/test split was created on the data, with the test data used only to evaluate the models' performance. A total of five different logistic regression models were created with different input features.
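The curation step described above can be pictured with a short sketch. The study used shell scripts; this is a Python stand-in, not the authors' pipeline. The header layout, the sample names, and the `outcomes` lookup are all hypothetical, since the article does not describe the GISAID export format.

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_outcome(records, outcomes):
    """Group records by outcome label, skipping samples without metadata."""
    groups = {"severe": [], "mild": []}
    for header, sequence in records:
        label = outcomes.get(header)
        if label in groups:
            groups[label].append((header, sequence))
    return groups


# Hypothetical toy export: three samples, one sequence wrapped over two lines.
fasta_text = """>sample_1
ATGTTTGTT
TTTCTTGTT
>sample_2
ATGTTCGTT
>sample_3
ATGTTTGTA
"""

# Hypothetical outcome metadata; sample_3 has none and is dropped.
outcomes = {"sample_1": "severe", "sample_2": "mild"}

records = list(parse_fasta(StringIO(fasta_text)))
groups = split_by_outcome(records, outcomes)
print({label: [h for h, _ in recs] for label, recs in groups.items()})
```

Only samples with a recognized outcome label are kept, which mirrors the separation of samples with patient outcome metadata described in the article.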
To begin with, the researchers reproduced previous results using the same data set to validate the accuracy and the area under the curve (AUC) of the logistic regression models, the latter being a measure of how well a model discriminates between the two classes. The model that used age, gender, region, and the variant of COVID-19 as features showed both the highest AUC, at 0.91, and the highest accuracy, at 91%. This was followed by models that used fewer features. The models identified the same mutations associated with the severity of the disease as in the previous experiment.
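To make the two metrics concrete, here is a minimal pure-Python sketch of accuracy and AUC. The labels and scores are invented for illustration and are not taken from the study. AUC is computed here as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which is equivalent to the area under the ROC curve.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This pairwise form is quadratic in sample count, fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: 1 = severe outcome, 0 = mild outcome.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # predicted probability of "severe"
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"AUC      = {roc_auc(y_true, scores):.3f}")
```

Accuracy depends on the 0.5 decision threshold chosen here, whereas AUC depends only on how the scores rank the samples, which is why the two numbers can differ.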
Following this, the classification performance of the logistic regression models used in the previous experiment was examined using the more recent data set. The mutations included in the updated dataset were restricted to match the feature space of the trained models, so that new mutations absent from the original 2020 dataset were excluded. In general, the earlier models showed a decrease in performance when applied to the later data set, especially the models that included the region feature.
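Restricting new data to an existing feature space can be sketched as follows. This is an illustration under assumptions, not the authors' code: mutations are encoded as binary presence/absence columns, and the mutation names are placeholders rather than results from the study.

```python
def build_vocabulary(samples):
    """Sorted list of every mutation seen in the training samples."""
    return sorted({mutation for sample in samples for mutation in sample})


def encode(sample_mutations, vocabulary):
    """Binary presence vector over a fixed vocabulary.

    Mutations that are not in the vocabulary are silently dropped, so the
    output always has the column layout the trained model expects.
    """
    present = set(sample_mutations)
    return [1 if mutation in present else 0 for mutation in vocabulary]


# Placeholder data: each sample is the set of mutations it carries.
train_2020 = [{"D614G", "P323L"}, {"D614G"}, {"Q57H"}]
new_2021 = [{"D614G", "N501Y"}, {"E484K"}]

vocabulary = build_vocabulary(train_2020)
X_new = [encode(sample, vocabulary) for sample in new_2021]

print(vocabulary)
print(X_new)
```

A sample that carries only mutations unseen in the earlier data becomes an all-zero vector, which is one way a model trained on earlier data can lose information on later data.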
The fitted logistic regression models were then retrained on the new data set, with retraining performed using the training split of the extended data set and performance evaluated using the test split. The retrained models were then compared with the models trained on the original data set. As expected, the models using age, gender, region, and variants (AGRV) continued to show the best performance, and the models trained on the original data set outperformed the models trained on the later data set.
The decrease in the performance of the retrained model could indicate a reduced ability to distinguish between severe and mild outcomes in the extended data set, or could be explained by an inconsistent definition of case severity between the two data sets. The mutations most commonly associated with severe and mild outcomes in the 2020 dataset were not identified in the 2021 dataset, with no overlap in the top 40 mutations. However, 10 of the top 20 mutations associated with severe outcomes in the previous study were also associated with severe outcomes in the 2021 dataset.
Other binary machine learning classifiers were also explored, including Random Forest, Naive Bayes, and Neural Network algorithms. To compare their performance with that of the logistic regression model, 3,386 samples were used for the analysis, of which 2,694 were associated with severe outcomes and 692 with mild outcomes.
AGRV was again used as the feature set for all the tested models, with a stratified split of 67% training and 33% test data. Five-fold cross-validation was performed to select the best parameters for each model before scikit-learn ensemble modules were used to run each of the models. The random forest model performed significantly better than all other models, including the logistic regression model on which the paper focuses, reaching an AUC of 0.936 and an accuracy of 0.918.
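The comparison workflow described above might look roughly like the following scikit-learn sketch. It runs on synthetic data from `make_classification`, not the study's AGRV features, and the parameter grids, random seeds, and roughly 80/20 class balance are assumptions chosen for illustration. The numbers it prints will not match the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real features, with an imbalanced label.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.2, 0.8], random_state=0,
)

# Stratified 67% / 33% split keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# Assumed, deliberately small parameter grids for each candidate model.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validation on the training split selects the parameters;
    # the best estimator is then refit on the whole training split.
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)
    proba = search.predict_proba(X_test)[:, 1]
    results[name] = {
        "best_params": search.best_params_,
        "auc": roc_auc_score(y_test, proba),
        "accuracy": accuracy_score(y_test, search.predict(X_test)),
    }

for name, metrics in results.items():
    print(name, metrics)
```

The test split is touched only once per model, after parameter selection, so the reported AUC and accuracy are not influenced by the cross-validation search.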
The researchers found that Random Forest was the best-performing algorithm for classification, which could indicate the presence of non-linear interactions between features.
In addition to this, they have identified the most effective features for examining COVID-19 data with logistic regression models, which should be helpful for bioinformatics students studying data sets where Random Forest is an appropriate analysis method.
medRxiv publishes preliminary scientific reports that are not peer-reviewed and therefore should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.