Setting and study data
The Israeli Ministry of Health publicly released data on individuals who were tested for SARS-CoV-2 by RT-PCR assay of a nasopharyngeal swab11. The dataset contains initial records, updated daily, of all residents tested for COVID-19 nationwide. In addition to the test date and result, it includes clinical symptoms, sex and a binary indication of whether the tested individual is aged 60 years or above. Based on these data, we developed a model that predicts COVID-19 test results using eight binary features: sex, age 60 years or above, known contact with an infected individual, and five initial clinical symptoms.
The training-validation set consisted of records from 51,831 tested individuals (of whom 4769 were confirmed to have COVID-19), from the period March 22nd, 2020 through March 31st, 2020. The test set contained data from the subsequent week, April 1st through April 7th (47,401 tested individuals, of whom 3624 were confirmed to have COVID-19). The training-validation set was further divided into training and validation sets at a ratio of 4:1 (Table 1).
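The 4:1 training–validation split can be sketched as follows. This is a minimal illustration on synthetic placeholder data (the study's records are not reproduced here), using scikit-learn's `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training-validation records: 8 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 8))
y = rng.integers(0, 2, size=1000)

# Hold out one fifth of the records for validation (a 4:1 ratio),
# stratified so both splits keep a similar fraction of positive cases.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

Note that the train/test boundary itself is chronological (March vs. April records), so only the training-validation pool is split randomly.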
The following list describes each of the dataset’s features used by the model:
Sex (male/female).
Age ≥60 years (true/false).
Cough (true/false).
Fever (true/false).
Sore throat (true/false).
Shortness of breath (true/false).
Headache (true/false).
Known contact with an individual confirmed to have COVID-19 (true/false).
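Encoded as model input, each record reduces to a vector of binary indicators. A minimal sketch is shown below; the field names are hypothetical, and missing responses are kept as None so a gradient-boosting predictor can handle them natively:

```python
# Hypothetical field names for the eight binary features.
FEATURES = ["male_sex", "age_60_and_above", "cough", "fever",
            "sore_throat", "shortness_of_breath", "headache",
            "contact_with_confirmed"]

def encode(record):
    """Map a raw record (dict of true/false/None) to a binary vector,
    preserving None for missing values."""
    return [None if record.get(f) is None else int(bool(record.get(f)))
            for f in FEATURES]

vec = encode({"cough": True, "fever": False, "age_60_and_above": None,
              "contact_with_confirmed": True})
```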
Development of the model
Predictions were generated using a gradient-boosting machine model built with decision-tree base-learners20. Gradient boosting is widely considered state of the art for prediction on tabular data21 and underlies many successful machine-learning algorithms22. As suggested by previous studies23, missing values were handled natively by the gradient-boosting predictor24. The model was trained with the LightGBM25 Python package, using the validation set for early stopping26 with auROC as the performance measure.
To identify the principal features driving the model's predictions, SHAP values27 were calculated. These values are suited to complex models such as artificial neural networks and gradient-boosting machines28. Originating in game theory, SHAP values partition the prediction of each sample into the contributions of its constituent feature values, by estimating the differences between models trained on subsets of the feature space. Averaging across samples, SHAP values estimate the contribution of each feature to the overall model predictions.
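The subset-difference idea can be made concrete with an exact (brute-force) Shapley computation over a toy value function. The scores below are invented purely for illustration; real SHAP implementations use efficient tree-specific approximations rather than enumerating all subsets:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values: each feature's weighted average marginal
    contribution over all subsets of the remaining features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value_fn(set(S) | {f}) - value_fn(set(S)))
        phi[f] = total
    return phi

# Hypothetical value function: model score when only the given feature
# subset is available (numbers are illustrative only).
scores = {frozenset(): 0.5, frozenset({"fever"}): 0.7,
          frozenset({"cough"}): 0.6, frozenset({"fever", "cough"}): 0.9}
phi = shapley_values(["fever", "cough"], lambda S: scores[frozenset(S)])
```

By construction the values satisfy the efficiency property: they sum to the difference between the full-feature score and the empty-set baseline.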
Evaluation of the model
The model was scored on the test set using the auROC. In addition, the PPV was plotted against the sensitivity across different decision thresholds (the precision-recall curve). At each threshold along the ROC curve, additional metrics were calculated, including sensitivity, specificity, PPV, negative predictive value, false-positive rate, false-negative rate, false discovery rate and overall accuracy. Confidence intervals (CIs) for the various performance measures were derived by resampling, using the bootstrap percentile method29 with 1000 repetitions.
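The bootstrap percentile CI can be sketched as follows, here for auROC on synthetic labels and scores; the same scheme applies to any of the metrics above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-method bootstrap CI: resample with replacement,
    recompute the metric, and take the empirical percentiles."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                            # auROC needs both classes
        stats.append(metric(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Synthetic labels and scores with some class separation, for illustration.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
scores = y * 0.6 + rng.random(500) * 0.8
lo, hi = bootstrap_ci(y, scores, roc_auc_score)
```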
The Tel-Aviv University institutional review board (IRB) determined that the public Israeli Ministry of Health dataset used in this study does not require IRB approval for analysis, and that this study is therefore exempt from approval.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.