Machine learning approach for predicting under-five mortality determinants in Ethiopia: evidence from the 2016 Ethiopian Demographic and Health Survey

There is a dearth of literature on the use of machine learning models to predict important under-five mortality risks in Ethiopia. In this study, we showed spatial variations of under-five mortality and used machine learning models to predict its important sociodemographic determinants in Ethiopia. The study data were drawn from the 2016 Ethiopian Demographic and Health Survey. We used three machine learning models (random forests, logistic regression, and K-nearest neighbors) as well as one traditional logistic regression model to predict under-five mortality determinants. Measures of model accuracy and receiver operating characteristic curves were used to evaluate the predictive power of each machine learning model. The descriptive results show that there are considerable regional variations in under-five mortality rates in Ethiopia. The under-five mortality prediction accuracy was found to be between 46.3 and 67.2% for the models considered, with the random forest model (67.2%) showing the best performance. The best predictive model shows that household size, time to the source of water, breastfeeding status, number of births in the preceding 5 years, sex of a child, birth intervals, antenatal care, birth order, type of water source, and mother's body mass index play an important role in under-five mortality levels in Ethiopia. The random forest machine learning model offers better predictive power for estimating under-five mortality risk factors and may help to improve policy decision-making in this regard. Childhood survival chances can be improved considerably by using these important factors to inform relevant policies.

countries was 69 deaths per 1000 live births in 2017, almost 14 times the rate in high-income countries (5 deaths per 1000 live births) (UNICEF, WHO, World Bank Group, and United Nations, 2018). It has been observed that more than half of these deaths are due to infectious diseases (such as pneumonia and diarrhea) that are preventable and treatable through simple, affordable interventions (World Health Organization, 2017).
Despite the considerable improvements over the past decades, sub-Saharan Africa remains the region with the highest level of under-five mortality in the world, with about half of the global under-five mortality burden (UNICEF, WHO, World Bank Group, and United Nations, 2018). Ethiopia appears to have the fifth-highest number of newborn deaths in the world, following India, Pakistan, Nigeria, and the Democratic Republic of Congo (UNICEF, 2017). It is estimated that about 472,000 children die in Ethiopia each year before their fifth birthday, which places Ethiopia sixth among the countries of the world in terms of absolute numbers of under-five deaths (Federal Ministry of Health, 2005). In Ethiopia, the under-five mortality rate declined by two-thirds from the 1990 figure of 204 per 1000 live births to 58 per 1000 live births in 2016, thereby achieving the target for Millennium Development Goal 4 (MDG 4) (You, Hug, Ejdemyr, Idele, et al., 2015). Despite this achievement, the under-five mortality rate in Ethiopia remains higher than those of many low- and middle-income countries (LMIC).
Previous studies have provided much evidence on the socioeconomic and demographic factors that are associated with under-five mortality in Ethiopia (Ayele & Zewotir, 2016; Ayele, Zewotir, & Mwambi, 2017; Bereka, Habtewold, & Nebi, 2017), using traditional regression models. In this study, we predict the important determinants of under-five mortality in Ethiopia using non-traditional regression models drawing on nationally representative data. Specifically, we employed machine learning techniques to predict under-five mortality risks in the study sample. The main aim of this study is to show the spatial distribution of under-five mortality and the potential of machine learning algorithms in predicting important sociodemographic factors underlying the spatial variations in under-five mortality. As such, we initially develop a spatial visualization of the under-five mortality rate by region in Ethiopia. This visually highlights the spatial disparities in under-five mortality in the country, while the models predict the most important factors underlying these disparities. This study informs and strengthens appropriate extant policies or intervention strategies aimed at reducing under-five mortality in the country. It also underscores the potential role of the machine learning approach in demographic research.

Data source
This study is based on data from the 2016 Ethiopian Demographic and Health Survey (EDHS), the most recent in the demographic and health survey series that is conducted every five years. The EDHS is a nationally representative household survey that collects data on a wide range of population, health, and nutrition indicators to improve maternal and child health in Ethiopia (Central Statistical Agency (CSA) [Ethiopia] and ICF International, 2016). The survey used a multi-stage stratified sampling technique based on the 2007 National Population and Housing Census of Ethiopia to select respondents from a total of 624 clusters (187 urban and 437 rural) (Central Statistical Agency (CSA) [Ethiopia] and ICF International, 2016). The unit of analysis is under-five children with a total sample size of 10,641 selected from 645 clusters across Ethiopia. This is based on children's data obtained from retrospective information from mothers about their children that died before age 5 within the 5 years preceding the survey (2011 to 2016).

Study variables and measurements
In this study, the outcome variable, under-five mortality, was measured as a binary outcome: alive (coded as 0) or dead (coded as 1) for all the models.
The predictors (features) used in this study include individual, household, community, and health service factors. The individual-level factors consisted of maternal and child characteristics. Maternal factors include mother's age at birth (< 20, > 20), education (no education, primary, secondary/higher), contraceptive use (yes/no), and mother's body mass index (BMI) (underweight/overweight and normal). Child factors include whether the child was wanted (wanted then, wanted later, not at all), sex of the child, birth order (1-2, 3/later), births in the last 5 years, and previous birth interval (< 2, 2-4, > 4 years), as well as whether the child was breastfed within 1 h of birth. The household factors are the source of drinking water (improved/unimproved), time to the water source, toilet facility (improved/unimproved), household wealth index (low, middle, high), and household size. The community factors comprised residence type (urban/rural) and geographical region (Tigray, Afar, Amhara, Oromia, Somali, Benishangul-Gumuz, Southern Nations Nationalities and People Region (SNNPR), Gambella, Harari, Dire Dawa, and Addis Ababa). The health service factors include antenatal visits (0, 1-4, 5+ visits), place/mode of delivery services (facility with caesarean section (CS) services, facility without CS, home), and postnatal visits within 2 months after delivery (yes/no). The selection of these predictor variables was based on information from existing literature on the subject (Aheto, 2019; Bereka et al., 2017; Yaya, Bishwajit, Okonofua, & Uthman, 2018).

Analytic strategy
The R programming language (version 3.6.0) and the caret package (Kuhn, 2020) were used to perform the data processing and analysis. We first developed a spatial map of crude under-five mortality rates by region in Ethiopia to document the regional disparities in under-five mortality in the country. In this regard, we estimated the rates of under-five mortality by region and then merged them with an Ethiopian regional shapefile before mapping them.
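The regional rate calculation described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' R code: the function name `crude_u5mr_by_region` and the toy records are hypothetical, and the sketch ignores the survey weights and the shapefile merge.

```python
from collections import defaultdict

def crude_u5mr_by_region(records):
    """Return under-five deaths per 1000 children for each region.

    `records` is an iterable of (region, died) pairs, where died is
    coded 0 (alive) or 1 (dead), as in the outcome coding of the study.
    """
    children = defaultdict(int)
    deaths = defaultdict(int)
    for region, died in records:
        children[region] += 1
        deaths[region] += died
    return {r: 1000.0 * deaths[r] / children[r] for r in children}

# Made-up toy records for illustration only (not EDHS data).
toy = [
    ("Afar", 1), ("Afar", 0), ("Afar", 0), ("Afar", 0),
    ("Addis Ababa", 1), ("Addis Ababa", 0), ("Addis Ababa", 0),
    ("Addis Ababa", 0), ("Addis Ababa", 0),
]
rates = crude_u5mr_by_region(toy)
```

The resulting dictionary, keyed by region name, is the kind of table that would then be joined to regional polygons for mapping.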
We also used three widely used machine learning (ML) algorithms, namely logistic regression, random forest (RF), and K-nearest neighbors (KNN), to predict under-five mortality determinants in Ethiopia and compared the results of the best algorithm to the results of the traditional logistic regression model. These three models were selected for various reasons. Logistic regression is typically used to analyze binary data and is commonly used as an inferential tool in population health research, but it can also be used as a binary classification model. The KNN model was chosen for its ability to detect both linear and nonlinear boundaries between groups. Here, k is the number of nearest neighbors, which is the core deciding factor in this classifier: the k closest observations are used to predict the value of a given observation, so the method relies on finding the best value of k. Thus, when k = 1, a new data object is simply assigned to the class of its nearest neighbor. The "nearness" of observations is most widely measured using the Euclidean distance, although various other numerical measures exist (Ali, Neagu, & Trundle, 2019; Larose, 2015). The main concept behind KNN is therefore to calculate the distances between the test and training data samples in order to identify the nearest neighbors. The RF model is commonly used in machine learning because it is highly flexible and often provides better predictive performance. The RF model repeatedly resamples the training data set, each time using a random set of predictor variables to produce decision trees. After many of these trees are formed, the forest is examined to see which variables consistently produce better predictions. These groups of relatively uncorrelated models can produce ensemble predictions that are more accurate than any of the individual predictions. This is because the trees protect each other from their errors (as long as they do not all constantly err in the same direction).
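To make the KNN rule concrete, the following minimal sketch classifies a new observation by majority vote among its k nearest training points under Euclidean distance. It is an illustration in plain Python rather than the caret implementation used in the study; the function name `knn_predict` and the toy coordinates are assumptions introduced here.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the class of `x` by majority vote of its k nearest neighbors."""
    # Pair each training point's Euclidean distance to x with its label.
    dists = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    # Count the labels of the k closest points and return the most common.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy two-feature data: class 0 clusters near the origin, class 1 near (5, 5).
train_X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_X, train_y, (0.5, 0.5), k=3)
near_far = knn_predict(train_X, train_y, (5.5, 5.5), k=3)
single = knn_predict(train_X, train_y, (4.0, 4.0), k=1)  # k = 1: nearest neighbor only
```

With k = 1 the prediction is simply the label of the single closest training point, which matches the special case described in the text.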
ML was born from pattern recognition and the theory that computers can learn without being programmed to perform specific tasks. It allows computers to learn from complex data sources and potentially to find previously unseen insights without being explicitly programmed where to look (Elisa, 2018). It can also be used to automate tasks by building analytical models using algorithms that iteratively learn from data. In demographic parlance, ML appears to address some of the major challenges in demographic research by helping to draw insights from available datasets collected for different purposes at different points in time, which in most cases may be challenging to incorporate in the traditional techniques. It may also be used to predict future occurrences of the principal components of population change (fertility, mortality, and migration) and associated factors. As such, ML techniques can be used both to predict previously identified proximate correlates and new "significant" demographic variables, and to shed light on how important previously used variables are in terms of prediction.
In this regard, we randomly sampled 80% of the total sample as training data, which was used with 10-fold cross-validation to tune the model parameters. The remaining 20% random sample was used as test data to estimate the measures of model performance. Because the outcome is unbalanced (there is a low fraction of under-five deaths in the data), the training data were down-sampled so that children who were alive and children who had died before age 5 were represented in equal proportions. Model accuracy metrics such as sensitivity, specificity, positive predictive value, and negative predictive value were calculated to show how well the models perform in terms of predicting the dead and alive cases. Sensitivity ("positivity in disease") refers to the proportion of dead cases (reference standard positive) that give positive test results. Specificity ("negativity in health") is the proportion of alive cases (reference standard negative) that give negative test results. Positive predictive value is the proportion of positive results that are true positives (i.e., truly dead), whereas negative predictive value is the proportion of negative results that are true negatives (i.e., truly alive). Predictive values vary depending on the prevalence of the target condition in the population being studied, even if the sensitivity and specificity remain the same (Price & Christenson, 2007).
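The 80/20 split and the down-sampling step can be sketched as below. This is a hedged illustration in plain Python, not the caret workflow used in the study; the helper names `train_test_split` and `down_sample`, the seed, and the toy data are all assumptions made for the example, and cross-validation is omitted.

```python
import random

def train_test_split(rows, test_frac=0.2, seed=0):
    """Shuffle rows and return (train, test) with `test_frac` held out."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def down_sample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced

# Toy data: 100 children, 10 of whom died (label 1); not EDHS data.
rows = [(i, 1 if i < 10 else 0) for i in range(100)]
train, test = train_test_split(rows, test_frac=0.2)
balanced = down_sample(train, label_of=lambda r: r[1])
n_dead = sum(label for _, label in balanced)
```

Only the training portion is balanced here; the held-out test rows keep the original class imbalance, so performance is measured at a realistic prevalence.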
Metrics such as the area under the curve (AUC) and the receiver operating characteristic (ROC) curve were also used to evaluate model performance in distinguishing between the dead and alive cases. The ROC curves compare sensitivity versus specificity across a range of threshold values to determine the ability to predict a dichotomous outcome. The AUC is a measure of the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve (Florkowski, 2008). Thus, the higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes (Florkowski, 2008).
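One way to see what the AUC summarizes is its rank interpretation: it equals the probability that a randomly chosen positive (dead) case receives a higher predicted score than a randomly chosen negative (alive) case, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auc` and the toy scores are illustrative assumptions, not output from the study's models.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case scored above the negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy predicted probabilities of death and true outcomes (1 = dead).
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)      # 5 of 6 pairs ranked correctly
perfect_auc = auc([0.9, 0.1], [1, 0])      # perfect separation
```

An AUC of 0.5 corresponds to ranking no better than chance, and 1.0 to perfect separation of the two classes.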
The results of all the models were weighted using person weights provided by the data. For the traditional logistic regression model, we infer the importance and significance of predictors using odds ratios and confidence intervals derived from the model estimation, while for the ML models, the Mean Decrease in Gini was calculated for each variable, which is a measure of variable importance for these models. The top 10 categories of variables based on their Mean Decrease in Gini were automatically generated and then presented in diagrams for each ML model.
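As background on the importance measure, the quantity accumulated in a Mean Decrease in Gini is the drop in Gini impurity achieved when a tree node is split on a variable; these drops are summed over the splits that use the variable and averaged over the trees. The sketch below shows the per-split calculation for a binary outcome. It is a simplified illustration with hypothetical function names, not the code path of the R packages used in the study.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# A split that separates the classes perfectly removes all impurity.
perfect = gini_decrease([1, 1, 0, 0], left=[1, 1], right=[0, 0])
# A split that leaves both children as mixed as the parent removes none.
useless = gini_decrease([1, 1, 0, 0], left=[1, 0], right=[1, 0])
```

A variable that repeatedly yields large decreases across many trees ends up with a high importance score, but the score carries no sign, which is why it does not indicate the direction of an effect.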

Results
Descriptive results of the background characteristics
Table 1 shows the results of under-five mortality by the sample characteristics. Of the 10,641 under-five children in the sample, there appears to be a significant difference in mortality prevalence between the sexes, with female children experiencing higher mortality (6.7%) than males (4.2%). There were also considerable differences by birth interval, with under-five mortality being more prevalent among children with birth intervals of less than 2 years (9.3%) than among children with birth intervals of 2-4 and over 4 years (4.45% and 4.53%, respectively). Under-five mortality was also significantly more prevalent among children using unimproved water sources (5.8%) than among those who used improved water sources (2.9%). Significant differences were also observed regarding antenatal visits and postnatal care, with under-five mortality being considerably prevalent among children whose mothers did not receive antenatal (5.6%) and postnatal care (4.2%). Children who were breastfed more than 1 h after birth had a significantly higher prevalence of death (9.8%) than those breastfed within 1 h of birth (4.5%), while there was also evidence of a significant difference in under-five mortality regarding the number of people in the household. The rest of the characteristics did not show any significant difference in mortality prevalence among their categories.
Spatial distribution of under-five mortality
Figure 1 shows the spatial distribution of crude under-five mortality rates by region in Ethiopia. The under-five mortality rate in the map is presented as the number of under-five deaths per 1000 live births. The Afar region recorded the highest under-five mortality rate of 125 per 1000 live births, followed by Benishangul-Gumuz and Somali, which recorded 98 and 94 per 1000 live births, respectively. The lowest under-five mortality rate is recorded in Addis Ababa, with a rate of 39 per 1000 live births.

Predicting under-five mortality
Below, we report the results of the three machine learning models (logistic regression, random forest, and K-nearest neighbors) (Table 2). The under-five mortality prediction accuracy was found to be low for all models, between 46.3 and 67.2% on the test dataset, with the RF model having the highest overall accuracy. The RF model had high sensitivity, meaning that it was more accurate in identifying dead cases, but had low specificity, meaning that it was poor in identifying the alive cases. However, the model correctly identified 70% of the real dead cases (28/(28+12)) and 67% of real alive cases (698/(698+343)), which means that the RF model is relatively better at predicting both real positive (dead) and negative (alive) cases. The logistic and KNN models both show lower overall accuracy (59.9 and 46.3%, respectively), and lower sensitivity, specificity, and positive as well as negative predictive values. A visualization of the receiver operating characteristic (ROC) curves is shown in Fig. 2. Among the three machine learning models employed in this study, the curve of the RF model shows the highest AUC value, indicating that it is the best of the models at classifying dead and alive cases.
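The RF percentages quoted above can be reproduced from the four test-set counts implied by the fractions in the text: 28 dead cases classified as dead, 12 dead cases missed, 698 alive cases classified as alive, and 343 alive cases classified as dead. The sketch below is a plain Python check with a hypothetical helper name; the predictive values it returns are derived from these counts alone and are not copied from Table 2.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix metrics, with 'dead' as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),            # dead cases correctly identified
        "specificity": tn / (tn + fp),            # alive cases correctly identified
        "ppv": tp / (tp + fp),                    # predicted dead that are truly dead
        "npv": tn / (tn + fn),                    # predicted alive that are truly alive
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Counts taken from the fractions quoted in the text for the RF model.
rf = classification_metrics(tp=28, fn=12, tn=698, fp=343)
# rf["sensitivity"] is 0.70, rf["specificity"] is about 0.67,
# and rf["accuracy"] is about 0.672, matching the reported 67.2%.
```

Because deaths are rare in the test data (40 of 1081 cases), the positive predictive value from these counts is small (about 0.075) even though sensitivity is 0.70, which illustrates the dependence of predictive values on prevalence noted in the methods.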
Figures 3, 4, and 5 show the variable importance measures, measured by the scaled mean decrease in the Gini coefficient for each variable, as calculated during the k-fold cross-validation process. This is an effective measure of how important a variable is for predicting under-five mortality across all the cross-validation estimates. The three  (births5_ys), birth interval (b_interval), and child sex (male). Unlike the ML model results presented above, the traditional logistic regression model is the only one that allows direct interpretation of the model coefficients (Table 3). Table 3 shows the estimated odds ratios and confidence intervals for the model parameters. Factors associated with under-five mortality were sex, birth order, birth interval, water source, place of delivery, antenatal visit, postnatal care, breastfeeding, and household size. Increased risks of under-five mortality were found among males, higher birth order children, and children born in a facility without C-section services. On the contrary, reduced risks were found among children with longer birth intervals, improved water sources, children who received antenatal and postnatal care as well as those from larger households.

Discussion
This study briefly described spatial variations in under-five mortality and predicted under-five mortality risks in Ethiopia using machine learning techniques. The spatial map provides evidence of considerable regional disparities in under-five mortality rates in Ethiopia, similar to what has been observed in Ghana (Aheto, 2019). Tigray and some regions in the central part of the country show the lowest under-five mortality rates, whereas regions in the eastern and western parts of the country have the highest under-five mortality rates. This evidence of spatial variation in under-five mortality underscores the need to better understand the underlying risk factors. Regarding the predictive analysis, the prediction accuracies and AUC statistics are found to be highest for the RF model. The RF model shows a higher predictive power compared to the other ML models included in this study. In this regard, the RF model shows that household size, time to the water source, breastfeeding behavior, births in the preceding 5 years, sex of a child, birth intervals, birth order, antenatal visits, type of water source, and mother's BMI are the top 10 important predictors of under-five mortality in Ethiopia. The important role played by some of these factors in under-five mortality levels is widely documented in the literature (Abir, Agho, Page, Milton, & Dibley, 2015; Dendup, Zhao, & Dema, 2018; Howell, Holla, & Waidmann, 2016; Yaya et al., 2018). In comparison, the findings of the best performing ML model appear to be largely consistent with the traditional logistic regression analysis, which also shows that a child's sex, birth interval, birth order, water source, place of delivery, antenatal visits, postnatal care, household size, and breastfeeding behavior play a significant role in under-five mortality levels in Ethiopia.
Only the number of births in the preceding 5 years and the mother's BMI appear to play an important role in the ML models but play an insignificant role in the traditional logistic regression analysis. This is an indication that ML models may produce some "new variables" or previously unseen insights by the traditional regression models which may play a crucial role in policy decision making. From the traditional logistic regression findings, male children have shown a significantly higher risk of dying before age 5 compared with female children. This is consistent with the finding of a cross-sectional study conducted in Bangladesh (Abir et al., 2015). It has been shown that male children have an increased risk of dying in the first month of life because of high vulnerability to infectious disease. This may be because female neonates are more likely to develop early fetal lung maturity in the first week of life, which may result in a lower incidence of respiratory diseases in female compared with male neonates (Khoury, Marks, McCarthy, & Zaro, 1985). Also, higher birth order of children appears to be associated with a significantly higher risk of under-five mortality. Analogously, the unfavorable effect of higher birth order on childhood survival chances has been well documented in Africa (Howell et al., 2016) as well as some parts of Asia (Dendup et al., 2018;Hong & Hor, 2013) and may provide a better understanding of the spatial variations in the country.
Furthermore, the risk of under-five mortality is significantly higher among children with a birth interval of less than 2 years than among children with a birth interval of more than 2 years. There is much evidence that longer birth intervals improve the survival chances of succeeding children (Kozuki & Walker, 2013; Yaya et al., 2018). A short preceding birth interval can be said to influence under-five mortality through three main mechanisms: first, closely spaced births may cause depletion of the mother. The second mechanism is through competition for scarce household resources among children, while the third is the transmission of infectious diseases between the closely spaced children (Majumder, May, & Pant, 1997). While the first mechanism is biological, the last two are said to be behavioral effects of a short preceding birth interval (Koenig, Phillips, Campbell, & Dsouza, 1990). Additionally, this study finds that the use of unimproved drinking water is associated with an increased risk of under-five mortality. Lack of access to clean water has been considered as one of the important factors that contribute to more than 80% of child deaths in the world (UNICEF, 2018). There is also considerable evidence from studies in developing countries showing that household sanitation and a clean water supply promote child health and survival (Ezeh, Agho, Dibley, Hall, & Page, 2014; Mugo, Agho, Zwi, Damundu, & Dibley, 2018). In Ethiopia, the proportion of the population using improved drinking water sources is only 57%, and those who use improved sanitation are less than 5% (World Health Organization, 2017). This may have serious implications for variations in under-five mortality in the country. This study further provides evidence that children whose mothers do not use any contraceptives have a significantly higher risk of under-five mortality than their counterparts whose mothers use modern contraceptives.
This study also finds that delivery in health facilities without CS services and at home is associated with a higher under-five mortality risk. This may be mainly related to dealing with delivery complications that may raise under-five mortality risks. Health facilities with CS services are very scarce in Ethiopia, and where they are available, transportation challenges encourage women to deliver at home even when facility-based delivery is available at a minimal cost (Shiferaw, Spigt, Godefrooij, Melkamu, & Tekie, 2013). Moreover, the study finds a positive effect of antenatal and postnatal care checkups on under-five survival chances. This is consistent with the significant association observed between antenatal and postnatal care and lower under-five mortality risk in the literature (Bitew & Nyarko, 2019;Machio, 2018). The implication is that children whose mothers do not receive antenatal and postnatal care services may experience several proximate under-five mortality risk factors, such as congenital and infectious diseases, than their counterparts. This study has also shown a considerable positive effect of early timing of breastfeeding on childhood survival chances. Breastfeeding has long been shown as an important protective factor against under-five mortality, particularly among developing countries (Azuine, Murray, Alsafi, & Singh, 2015;Nyarko, Tanle, & Kumi-Kyereme, 2014) and may play a key part in childhood survival interventions in Ethiopia. Quite surprisingly, larger household size appears to be associated with reduced under-five mortality risk in this study, contrary to what is documented in the literature (Dendup et al., 2018). However, this may well be underscored by some household-level contextual factors in the country such as availability of considerable social support from parents and siblings.
This study is not without limitations. The survey comprised only surviving women, and since neonatal and maternal mortalities may occur concurrently, this may have led to an underestimation of the under-five mortality rates. Moreover, unlike the traditional regression models, the ML results appear to be mostly uninterpretable because they have no regression coefficients and therefore no direction of effect. In effect, the ML models in the current study rank variables by how important they are in predicting under-five mortality, without indicating whether they raise or lower the risk. In this case, extant empirical literature from studies using the traditional methodologies may be used to determine the direction of effect of these important variables. There are also possible recall biases or non-disclosure of deaths by mothers, which may underestimate the number of deaths. Nevertheless, machine learning techniques are considered to be very useful in predicting population health and other phenomena and can lead to better policy decisions (Ashrafian & Darzi, 2018; Holzinger, 2017).

Conclusions
The findings show that considerable regional disparities in under-five mortality rates persist in Ethiopia, with the highest rates being found in the Afar, Benishangul-Gumuz, and Somali regions. Also, the RF model provides moderately better predictive power than the logistic regression and KNN ML models in predicting under-five mortality determinants in Ethiopia. Even though the RF model and the traditional logistic regression model have shown similar factors, the RF model appears to reveal some important factors that may not be identified by the traditional logistic regression model. This model may, therefore, proffer better policy directions regarding under-five childhood survival. Thus, household size, time to the water source, breastfeeding behavior, number of births in the past 5 years, sex of a child, birth intervals, antenatal visits, birth order, type of water source, and mother's BMI may play an important role in under-five survival chances in Ethiopia. This study highlights the use of machine learning algorithms to predict and better understand very important under-five mortality risk factors to improve crucial policy directions. As a corollary, ML methods may also apply to other areas of demographic research, including fertility and migration studies. Our findings reinforce the need to focus on the most important predicted factors, including breastfeeding, birth interval control, and antenatal care, among others, in developing policies aimed at enhancing childhood survival chances. Also, based on the findings, expanding access to improved drinking water will help to substantially reduce future under-five mortality levels in Ethiopia.

Authors' contributions
FHB conceived and designed the study. FHB and CSS performed the analysis with technical support from SHN. FHB wrote the initial draft of the manuscript with technical support from SHN, LP, and CSS. All authors critically reviewed the manuscript for important intellectual content and then approved the final version of the manuscript for publication.

Funding
No funding was received for this study.

Availability of data and materials
The datasets analyzed in this study are freely available at the DHS Program repository.