Data source
This study is based on data from the 2016 Ethiopia Demographic and Health Survey (EDHS), the most recent in a series of demographic and health surveys conducted every five years. The EDHS is a nationally representative household survey that collects data on a wide range of population, health, and nutrition indicators to improve maternal and child health in Ethiopia (Central Statistical Agency (CSA) [Ethiopia] and ICF International, 2016). The survey used a multi-stage stratified sampling technique based on the 2007 National Population and Housing Census of Ethiopia to select respondents from a total of 624 clusters (187 urban and 437 rural) (Central Statistical Agency (CSA) [Ethiopia] and ICF International, 2016). The unit of analysis is the under-five child, with a total sample of 10,641 children selected from 645 clusters across Ethiopia. These data are drawn from mothers' retrospective reports about their children, including those who died before age 5, in the 5 years preceding the survey (2011 to 2016).
Study variables and measurements
In this study, the outcome variable, under-five mortality, was measured as a binary outcome: in all models, a child was coded as 0 if alive and as 1 if dead.
The predictors (features) used in this study include individual, household, community, and health service factors. The individual-level factors consisted of maternal and child characteristics. Maternal factors include mother’s age at birth (< 20, > 20), education (no education, primary, secondary/higher), contraceptive use (yes/no), and mother’s body mass index (BMI) (underweight/overweight and normal). Child factors include whether the child was wanted (wanted then, wanted later, not wanted at all), sex of the child, birth order (1–2, 3/later), births in the last 5 years, previous birth interval (< 2, 2–4, > 4 years), and whether the child was breastfed within 1 hour of birth. The household factors were the source of drinking water (improved/unimproved), time to the water source, toilet facility (improved/unimproved), household wealth index (low, middle, high), and household size. The community factors comprised residence type (urban/rural) and geographical region (Tigray, Afar, Amhara, Oromia, Somali, Benishangul-Gumuz, Southern Nations Nationalities and People Region (SNNPR), Gambella, Harari, Dire Dawa, and Addis Ababa). The health service factors include antenatal visits (0, 1–4, 5+ visits), place/mode of delivery services (facility with caesarean section (CS) services, facility without CS, home), and postnatal visits within 2 months after delivery (yes/no). These predictor variables were selected on the basis of the existing literature on the subject (Aheto, 2019; Bereka et al., 2017; Yaya, Bishwajit, Okonofua, & Uthman, 2018).
Analytic strategy
The R programming language (version 3.6.0) and the caret package (Kuhn, 2020) were used to perform the data processing and analysis. We first developed a spatial map of crude under-five mortality rates by region to document the regional disparities in under-five mortality in Ethiopia. To do so, we estimated the rates of under-five mortality by region and then merged them with an Ethiopian regional shapefile before mapping them.
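The regional rate step described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the study itself used R), and it assumes a weighted crude rate expressed per 1,000 children. The function name, the toy records, and the placeholder shape table are hypothetical and are not EDHS data.

```python
from collections import defaultdict

def crude_rate_by_region(records):
    """Return weighted deaths per 1,000 children for each region.

    `records` is an iterable of (region, died, weight) tuples, where
    `died` is 1 for a child who died before age 5 and 0 otherwise.
    The use of weights and the per-1,000 scale are assumptions.
    """
    deaths = defaultdict(float)
    total = defaultdict(float)
    for region, died, weight in records:
        deaths[region] += died * weight
        total[region] += weight
    return {region: 1000.0 * deaths[region] / total[region] for region in total}

# Hypothetical toy records (region, died, weight); not survey data.
records = [
    ("Afar", 1, 1.0), ("Afar", 0, 1.0), ("Afar", 0, 2.0),
    ("Tigray", 0, 1.0), ("Tigray", 0, 1.0), ("Tigray", 1, 0.5), ("Tigray", 0, 1.5),
]
rates = crude_rate_by_region(records)

# Stand-in for a shapefile attribute table: one row per region polygon.
shapes = [
    {"region": "Afar", "geometry": "polygon-placeholder-1"},
    {"region": "Tigray", "geometry": "polygon-placeholder-2"},
]
# Merge the estimated rates onto the shapes by region name before mapping.
mapped = [dict(shape, rate=rates.get(shape["region"])) for shape in shapes]
```

In a real workflow the merged table would be passed to a mapping library; that step is omitted here because it depends on the shapefile in use.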
We also used three widely used machine learning (ML) algorithms, logistic regression, random forest (RF), and K-nearest neighbors (KNN), to predict under-five mortality and identify its determinants in Ethiopia, and compared the results of the best-performing algorithm with those of the traditional logistic regression model. These three models were selected for different reasons. Logistic regression is typically used to analyze binary data and is commonly used as an inferential tool in population health research, but it can also serve as a binary classification model. The KNN model was chosen for its ability to detect both linear and nonlinear boundaries between groups. KNN works by calculating the distances between a test observation and the training observations to identify its nearest neighbors. The value k is the number of nearest neighbors and is the core deciding factor in this classifier: the method relies on finding the best value of k so that the k closest observations are used to predict the value of a given observation. Thus, when k = 1, a new observation is simply assigned to the class of its nearest neighbor. The “nearness” of observations is most often measured using the Euclidean distance, although various other numerical measures exist (Ali, Neagu, & Trundle, 2019; Larose, 2015). The RF model is commonly used in machine learning because it is highly flexible and often provides good predictive performance. It repeatedly resamples the training data set, each time using a random subset of predictor variables to grow a decision tree. After many such trees are grown, the forest is examined to see which variables consistently produce better predictions. These groups of relatively uncorrelated trees can produce ensemble predictions that are more accurate than any of the individual predictions, because the trees protect each other from their errors (as long as they do not all consistently err in the same direction).
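To make the KNN rule concrete, the following is a minimal sketch of majority-vote classification with Euclidean distance. It is an illustration in Python, not the caret implementation used in the study, and the function name and toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x_new, k):
    """Classify x_new by majority vote among its k nearest training points.

    Distances are Euclidean. An odd k avoids tied votes when there are
    two classes.
    """
    distances = sorted(
        (math.dist(x, x_new), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two numeric features, label 1 = dead, 0 = alive.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

pred_k1 = knn_predict(train_X, train_y, (4.5, 5.0), k=1)  # nearest neighbor only
pred_k3 = knn_predict(train_X, train_y, (0.5, 0.5), k=3)  # vote among three
```

With k = 1 the prediction is simply the label of the closest training point. Larger values of k smooth the decision boundary, which is why k is usually tuned by cross-validation.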
ML grew out of pattern recognition and the theory that computers can learn without being programmed to perform specific tasks. It allows computers to learn from complex data sources and potentially to find previously unseen insights without being explicitly told where to look (Elisa, 2018). It can also be used to automate tasks by building analytical models with algorithms that iteratively learn from data. In demographic research, ML appears to address some major challenges by helping to draw insights from available datasets collected for different purposes at different points in time, which are often difficult to incorporate into traditional techniques. It may also be used to predict future occurrences of the principal components of population change (fertility, mortality, and migration) and their associated factors. As such, ML techniques can be used both to assess previously identified proximate correlates and to identify new “significant” demographic variables, and can also shed light on how important previously used variables are for prediction.
We randomly sampled 80% of the total sample as training data, on which 10-fold cross-validation was used to tune the model parameters. The remaining 20% was used as test data to estimate the measures of model performance. Because the outcome is unbalanced (under-five deaths make up a small fraction of the data), the training data were down-sampled so that children who died before age 5 and children who were alive appeared in equal proportions. Model accuracy metrics such as sensitivity, specificity, positive predictive value, and negative predictive value were calculated to show how well the models predict the dead and alive cases. Sensitivity (“positivity in disease”) is the proportion of children who died (reference standard positive) who are predicted as dead. Specificity (“negativity in health”) is the proportion of children who are alive (reference standard negative) who are predicted as alive. Positive predictive value is the proportion of positive results that are true positives (i.e., truly dead), whereas negative predictive value is the proportion of negative results that are true negatives (i.e., truly alive). Predictive values vary with the prevalence of the target condition in the population being studied, even if sensitivity and specificity remain the same (Price & Christenson, 2007).
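The split, down-sampling, and confusion-matrix metrics described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the caret workflow used in the study: all function names and toy data are hypothetical, cross-validation is omitted, and the metric function assumes every denominator is non-zero.

```python
import random

def split_train_test(rows, train_fraction=0.8, seed=0):
    """Randomly split rows into a training list and a test list."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def down_sample(rows, seed=0):
    """Randomly drop rows from larger classes until all classes are equal in size.

    Each row is a (features, label) pair.
    """
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[1], []).append(row)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n_min))
    rng.shuffle(balanced)
    return balanced

def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, and NPV with 1 = dead as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Hypothetical toy data: 100 children, 10 of whom died (label 1).
rows = [((i,), 1 if i < 10 else 0) for i in range(100)]
train, test = split_train_test(rows)
balanced_train = down_sample(train)

# Hypothetical predictions on eight test children.
metrics = confusion_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1, 1])
```

Only the training data are down-sampled. The test data keep their original class balance, so the predictive values reflect the prevalence in the test set.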
The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were also used to evaluate model performance in distinguishing between the dead and alive cases. ROC curves trace the trade-off between sensitivity and specificity across a range of classification thresholds to show how well a model predicts a dichotomous outcome. The AUC summarizes the ROC curve in a single measure of a classifier's ability to distinguish between classes (Florkowski, 2008). Thus, the higher the AUC, the better the model is at distinguishing between the positive and negative classes (Florkowski, 2008).
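One way to see what the AUC measures is through its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The following minimal Python sketch computes it that way. The function name and the toy scores are hypothetical, and this is not the routine used in the study.

```python
def auc_by_ranks(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A pair counts as 1 if the positive case has the higher score,
    0.5 if the scores are tied, and 0 otherwise.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities of death for five children.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
auc = auc_by_ranks(y_true, scores)
```

An AUC of 0.5 corresponds to ranking no better than chance, and an AUC of 1 corresponds to perfect separation of the two classes.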
The results of all the models were weighted using the person weights provided with the survey data. For the traditional logistic regression model, we inferred the importance and significance of predictors from the odds ratios and confidence intervals derived from the model estimation. For the ML models, we calculated the Mean Decrease in Gini for each variable, which is a measure of variable importance for these models. The top 10 variable categories by Mean Decrease in Gini were automatically generated and presented in diagrams for each ML model.
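The two importance measures above rest on simple calculations, sketched here in Python under stated assumptions. The odds ratio and its interval are obtained by exponentiating a logistic coefficient and its Wald limits (a 95% interval with z = 1.96 is assumed). The Gini function shows the impurity decrease for a single binary split, which is the quantity that Mean Decrease in Gini aggregates over all splits on a variable and averages across trees. All names and numbers are hypothetical.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald confidence limits from a logistic coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def gini_decrease(parent, left, right):
    """Impurity decrease from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Hypothetical coefficient and standard error for one predictor.
or_est, or_low, or_high = odds_ratio_ci(beta=0.5, se=0.2)

# Hypothetical node of eight children (1 = dead) split by one predictor.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]
decrease = gini_decrease(parent, left, right)
```

A confidence interval that excludes 1 indicates a statistically significant association at the chosen level, while a larger impurity decrease indicates a split that separates dead and alive cases more cleanly.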