Application of Machine Learning and Predictive Modeling to Viral Retention and Suppression in South African HIV Treatment Cohorts


A total of 445,636 patients were included in the retention model and 363,977 in the VL model. Almost a third (30%) of patients were male, with a median age of 39 years (IQR 31–49) at the time of visit. In the retention dataset, patients had a median of 18 (IQR 10–25) visits since entering care and had been in care for a median of 31 (IQR 15–43) months. The vast majority (91%) of patients visited only one facility during the period analyzed.

Predictor variables and baselines

We generated 75 potential predictor variables per visit and 42 predictor variables per VL test. The retention and unsuppressed-VL models were built using the AdaBoost and random forest15 binary classification algorithms, respectively, from the scikit-learn16 open-source project, and were tested against unseen data to assess predictive performance.
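As a minimal sketch of the two scikit-learn classifiers named above, the following fits an AdaBoost and a random forest model on synthetic stand-in data; the feature matrix and outcome here are hypothetical placeholders for the real per-visit and per-test predictor sets, which are not reproduced in this section.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))                 # stand-in for the per-visit predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in binary outcome (e.g. missed visit)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

retention_clf = AdaBoostClassifier(n_estimators=100, random_state=0)  # retention model
vl_clf = RandomForestClassifier(n_estimators=100, random_state=0)     # unsuppressed-VL model

retention_clf.fit(X_tr, y_tr)
vl_clf.fit(X_tr, y_tr)
```

Both classifiers expose the same `fit`/`predict`/`predict_proba` interface, which makes it straightforward to evaluate them against a held-out test set with common metrics.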

For the retention model, the test set consisted of 1,399,145 randomly selected unseen visits between 2016 and 2018. The baseline prevalence of missed visits in the test set was 10.5% (n=146,881 visits), matching the prevalence of LTFU observed in both the full dataset and the training set. This observed baseline was comparable to meta-analyses of LTFU at 1 year in South Africa, 2011–201517. For the unsuppressed VL model, the dataset was split into training and test sets, with the test set consisting of 30% (n=211,315) of tests from the study period, randomly selected and unseen during training. In the VL test set, there were 21,679 unsuppressed viral load results (>400 copies/mL), for a baseline prevalence of unsuppressed VL results of 10.3%.
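The two baseline prevalences follow directly from the counts given above:

```python
# Baseline prevalence of each outcome in the two test sets, from the reported counts.
def prevalence(positives: int, total: int) -> float:
    """Share of the test set carrying the positive outcome label."""
    return positives / total

missed_visit_prev = prevalence(146_881, 1_399_145)   # retention test set
unsuppressed_prev = prevalence(21_679, 211_315)      # VL test set

print(f"{missed_visit_prev:.1%}, {unsuppressed_prev:.1%}")  # → 10.5%, 10.3%
```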

Retention model results

We took two approaches to the training sets: first, a sample balanced on outcome classes (50% missed visits and 50% not-missed visits); and second, an unbalanced sample (60% not-missed visits and 40% missed visits). The AdaBoost classifier was first trained with a 50:50 balanced sample from the modeling set, giving 343,078 visits of each class (missed or not missed) in the training set. On the test set, the retention model correctly classified 926,814 of ~1.4 million visits, yielding an accuracy of 66.2% (Table 2A). A total of 89,140 missed scheduled visits were correctly identified out of 146,881 known missed visits, giving a sensitivity of 60.6%. Conversely, 837,674 visits were correctly identified as not missed out of a total of 1,252,264 visits observed as not missed, for a specificity of 67% and a negative predictive value of 94%.
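The balanced-model metrics reported above can be recovered from the four confusion-matrix cells implied by the text:

```python
# Confusion-matrix cells for the balanced (50:50) retention model, from the counts above.
tp = 89_140                 # missed visits correctly flagged
fn = 146_881 - tp           # missed visits the model failed to flag
tn = 837_674                # not-missed visits correctly cleared
fp = 1_252_264 - tn         # not-missed visits falsely flagged
total = tp + fn + tn + fp   # 1,399,145 test visits

accuracy = (tp + tn) / total       # ≈ 0.662
sensitivity = tp / (tp + fn)       # ≈ 0.607 (reported as 60.6%)
specificity = tn / (tn + fp)       # ≈ 0.669 (reported as 67%)
npv = tn / (tn + fn)               # ≈ 0.936 (reported as 94%)
```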

Table 2 Late visit model metrics based on (A) balanced and (B) unbalanced training sets.

Next, the AdaBoost classifier was trained with an unbalanced 60:40 sample from the modeling set. This translated to 343,180 missed visits and 514,770 on-time visits in the training set. The retention model trained on the unbalanced sample correctly classified 1,100,341 of the ~1.4 million test-set visits, for an accuracy of 78.6% (Table 2B). However, only 59,739 of the missed visits were correctly identified, giving a sensitivity of 40.6% and a false negative rate of 59.3%. The negative predictive value of the model remained high at 92%, further suggesting that attended scheduled visits are easier to identify than missed visits.
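Training samples with a chosen class ratio can be drawn by downsampling the majority class, as in this sketch; the outcome vector here is synthetic, with roughly the 10.5% positive rate seen in the data.

```python
import numpy as np

def downsample(y: np.ndarray, pos_share: float, rng) -> np.ndarray:
    """Return row indices giving the requested positive-class share,
    keeping every positive (minority) example and subsampling negatives."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = int(len(pos) * (1 - pos_share) / pos_share)
    keep_neg = rng.choice(neg, size=min(n_neg, len(neg)), replace=False)
    return np.concatenate([pos, keep_neg])

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.105).astype(int)   # ~10.5% positives, as in the test data
idx_balanced = downsample(y, 0.5, rng)          # 50:50 missed / not-missed
idx_unbalanced = downsample(y, 0.4, rng)        # 40:60 missed / not-missed
```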

Both models demonstrate the trade-off between accuracy, precision, and sensitivity that can be manipulated when training models18. However, the predictive power of the model in separating the classes – represented by the AUC metric – remained consistent across models. The two ROC curves, shown in Figures 2A,B, have the same AUC and near-identical shapes. While this difference in sampling approach demonstrates how such metrics can be manipulated, it is important to note that rebalancing and resampling the training set can also introduce underrepresentation or misrepresentation of subclasses, with each dataset particularly susceptible to imbalance issues at smaller sample sizes19,20.
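AUC summarizes class separation across all decision thresholds, which is why it stays stable under the resampling schemes above while threshold-dependent metrics shift. A minimal sketch with scikit-learn's `roc_curve` and `roc_auc_score`, on hypothetical model scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(5000) < 0.105).astype(int)   # ~10.5% positives, as in the test set
# Hypothetical model scores: positives shifted upward on average.
scores = rng.normal(size=5000) + y_true * 1.2

fpr, tpr, thresholds = roc_curve(y_true, scores)  # points tracing the ROC curve
auc = roc_auc_score(y_true, scores)               # threshold-independent summary
```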

Figure 2

ROC curves of the (A) balanced 50:50 late visit classifier, (B) unbalanced 60:40 late visit classifier, and (C) balanced 50:50 unsuppressed VL classifier.

Unsuppressed VL model results

For the unsuppressed VL model, the final training set was downsampled to 101,976 tests to give a 50:50 balanced sample. The model correctly classified 153,183 VL results out of the test set of 211,315, yielding an accuracy of 72.5% (Table 3). A total of 14,225 unsuppressed viral load tests were correctly predicted out of 21,679 unsuppressed test results, giving a sensitivity of 65.6%. The negative predictive value of the model was very high at 95%, again suggesting that suppressed VL results (i.e. lower risk) are easier to recognize. Overall, the model had an AUC of 0.758 (Table 3, Figure 2C).

Table 3 Unsuppressed VL model metrics based on balanced training sets (50:50).

Predictor importance

The original set of over 75 input predictor variables for the retention model (and 42 for the unsuppressed VL model) was reduced to a more practical number through feature selection, using a random forest algorithm over all inputs. The random forest permutes the values of different predictor groups across trees, and the change in the model's predictive power (as measured by AUC) under each permutation is calculated. This process prioritizes groups of predictor variables that together improve predictive power and deprioritizes those that contribute little or nothing to AUC. The random forest was thus able to rank the relative importance of the features in the full input set for each model. Figures 3A,B illustrate the relative importance of each feature in helping to correctly and repeatedly classify a particular observation with respect to the target outcome. Predictor variables of higher importance help the algorithm distinguish the classes more often and more correctly than those of lower importance. For example, in the retention model (Fig. 3A), sex, represented by the Boolean variable "Is male", has some correlation with the missed-visit outcome, and measurably more than the eliminated predictor variables, which had none. However, it is clear that the algorithm relied on correlations in prior patient behavior (frequency of late visits, time on treatment, etc.) to segment outcome risk, and together these explained more of the difference than sex alone.
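The permutation procedure described above can be sketched with scikit-learn's `permutation_importance`, scoring by AUC; the data and feature set here are synthetic stand-ins, with the first feature deliberately made the most informative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
# Outcome driven mostly by feature 0, weakly by feature 1, plus noise.
y = (X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=3000) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, scoring="roc_auc",
                                n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]   # most important feature first
```

Permuting an important feature breaks its correlation with the outcome and drops the AUC; the size of that drop, averaged over repeats, is the feature's importance score.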

Figure 3

(A) Final input features included in the late visit model, ranked by importance. (B) Final input features included in the unsuppressed VL model, ranked by importance.

Our results indicated that prior patient behavior and treatment history were extremely important in predicting both visit attendance and viral load outcomes in these datasets, and that traditional demographic predictors were less useful than behavioral indicators. These more powerful predictor variables can also be used to further stratify populations by risk and to segment them more granularly for targeted interventions and differentiated care.

During feature selection, we investigated overfitting to particular features through comparative feature permutation importance tests, with the aim of identifying features whose strong correlation with the outcome in the training set was spurious and not reflected in the test set (Supplementary Fig. 1). We also performed correlation checks on the candidate input features. Rather than assuming that multicollinearity in the input variables necessarily leads to loss of information, during the feature selection phase we tried several combinations of feature groups to test the relationship of particular groups to the prediction metrics. The matrix of these feature correlation checks is shown in Supplementary Figure 2.
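A pairwise correlation check of this kind can be sketched as follows; the feature names and values here are hypothetical stand-ins for the real candidate predictors.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
visits = pd.DataFrame({
    "months_in_care": rng.integers(1, 60, size=1000),   # hypothetical feature names
    "n_prior_visits": rng.integers(1, 40, size=1000),
    "n_prior_late": rng.integers(0, 10, size=1000),
    "is_male": rng.integers(0, 2, size=1000),
})

corr = visits.corr()                              # Pearson correlation matrix
high = (corr.abs() > 0.8) & (corr.abs() < 1.0)    # flag strongly collinear pairs
```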

We also report model performance metrics for various subsets of the ranked input features, to determine whether narrowing the model to a small set of the most important features affected performance. As shown in Supplementary Table 1, overall model accuracy changed by only 5 percentage points between a model including only the 5 most important features (62%) and a model including all 75 features (67%). The difference in AUC between these two models was less than 0.04 (Supplementary Fig. 3).
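Such a top-k comparison can be sketched by refitting the model on feature subsets and comparing held-out AUC; the data and informative-feature indices here are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 20))
# Only the first 3 of 20 features carry signal in this synthetic setup.
y = (X[:, :3].sum(axis=1) + rng.normal(scale=0.7, size=4000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def auc_with_features(cols):
    """Fit on the given feature columns only and return held-out AUC."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr[:, cols], y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1])

auc_top3 = auc_with_features([0, 1, 2])        # only the informative features
auc_all = auc_with_features(list(range(20)))   # full feature set
```

When the top-ranked features capture most of the signal, the gap between the reduced and full models stays small, mirroring the pattern reported above.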
