Synergy in fertility forecasting: improving forecast accuracy through model averaging

Accuracy in fertility forecasting has proved challenging and warrants renewed attention. One way to improve accuracy is to combine the strengths of a set of existing models through model averaging. The model-averaged forecast is derived using empirical model weights that optimise forecast accuracy at each forecast horizon based on historical data. We apply model averaging to fertility forecasting for the first time, using data for 17 countries and six models. Four model-averaging methods are compared: frequentist, Bayesian, model confidence set, and equal weights. We compute individual-model and model-averaged point and interval forecasts at horizons of one to 20 years. We demonstrate gains in average accuracy of 4–23% for point forecasts and 3–24% for interval forecasts, with greater gains from the frequentist and equal weights approaches at longer horizons. Data for England and Wales are used to illustrate model averaging in forecasting age-specific fertility to 2036. The advantages and further potential of model averaging for fertility forecasting are discussed. As the accuracy of model-averaged forecasts depends on the accuracy of the individual models, there is an ongoing need to develop better models of fertility for use in forecasting and model averaging. We conclude that model averaging holds considerable promise for the improvement of fertility forecasting in a systematic way using existing models and warrants further investigation.


Introduction
Fertility forecasts are a vital element of population and labour force forecasts, and accurate fertility forecasting is essential for government policy, planning and decision-making regarding the allocation of resources to multiple sectors, including maternal and child health, childcare, education and housing. Though the level of fertility in industrialised countries has at times been of major concern (e.g., Kuczynski (1937); Booth (1986); Longman (2004)), models for forecasting fertility have been developed relatively recently (for an earlier review, see Booth (2006)). Indeed, many of the models used in fertility forecasting were originally designed to smooth age-specific fertility rates or to complete the incomplete experience of a single cohort, and relatively few have been used in longer-term forecasting of period fertility rates; further, many are deterministic, providing no indication of uncertainty (Bohk-Ewald et al. 2018).
Stochastic models for forecasting period age-specific fertility rates (ASFRs) make use of various time-series extrapolation models and generally involve extensive simulation to take account of covariance in the estimation of uncertainty. The most straightforward approach uses univariate time-series models (see, e.g., Box et al. (2008)) to extrapolate the trend at each age. The main limitation of this approach is inconsistency among age-specific forecasts, possibly producing an implausible age pattern of future fertility. To remedy this, the age pattern is first modelled, and its parameters are then forecast. Parametric models used in forecasting include the beta, gamma, double exponential and Hadwiger functions (see, e.g., Thompson et al. (1989); Congdon (1990, 1993); Knudsen et al. (1993); Keilman and Pham (2000)), while semi-parametric models include the Coale-Trussell and Relational Gompertz models (see, e.g., Coale and Trussell (1974); Brass (1981); Murphy (1982); Booth (1984); Zeng et al. (2000)). The use of these models is variously limited by uninterpretable parameters, over-parameterisation and the need for vector autoregression. Structural change also limits their utility, especially where vector autoregression is involved (Booth 2006).
The strengths and weaknesses of fertility forecasting models have not been thoroughly evaluated. Noting the absence of guidance on model choice, Bohk-Ewald et al. (2018) compared the point and interval forecast accuracy of 20 major models for fertility forecasting with 162 variants. In the context of completing cohort fertility, their evaluation found that only four methods were consistently more accurate than the constant (no change) model and that, among these, complex Bayesian models did not outperform simple extrapolative models. These findings held across most of the settings examined. In earlier research, Shang (2012b) compared several models for fertility forecasting in terms of point and interval forecast accuracy, finding the weighted Hyndman-Ullah model (see the "Six selected models" section) to be marginally more accurate on both counts. While this research is useful in identifying models that perform well based on extensive data, such models may not perform well in every circumstance. Moreover, there is undoubtedly scope to improve fertility forecasting further.
In other areas of forecasting, model averaging (Bates and Granger 1969; Dickinson 1975; Clemen 1989) has been employed to improve point and interval forecast accuracy. However, with the exceptions of Shang (2012a) and Shang and Haberman (2018), model averaging has been neglected in the forecasting of demographic rates. This paper aims to empirically assess the extent to which forecast accuracy can be improved through model averaging in the context of age-specific fertility forecasting. Our assessment covers 17 countries with varied fertility experience and is based on data series extending back to 1950 or before. We consider six selected models and four methods for selecting model-averaging weights.
The structure of this article is as follows. In the "Model averaging" section, we introduce model averaging and briefly present the four methods for selecting weights; the technical details of these methods are given in Appendix A. The data, six selected models and study design are described in the "Data and design" section, and technical details of the six models appear in Appendix B. Based on the accuracy measures discussed in Appendix C, in the "Results" section, we evaluate the point and interval forecast accuracies of the six models and of the four model-averaged forecasts. In the "Model averaging in practice" section, we provide an example of model averaging in forecasting age-specific fertility, using data for England and Wales. Finally, the discussion appears in the "Discussion" section.

Model averaging
The idea of model averaging has often been studied in statistics, dating back to the seminal work of Bates and Granger (1969). A flurry of articles dedicated to the topic followed; see Clemen (1989) for a review from a frequentist viewpoint and Hoeting et al. (1999) for a review from a Bayesian perspective. More recent developments in model averaging are collected in the monograph by Claeskens and Hjort (2008). In demographic forecasting, usage has been limited, but notable exceptions include Smith and Shahidullah (1995), Ahlburg (1998, 2001) and Sanderson (1998) in the context of census tract forecasting.
In essence, the model-averaging approach combines forecasts from a set of two or more models. Because these models may reflect different assumptions, model structures and degrees of model complexity, it is expected that better forecast accuracy can be achieved through averaging. The forecasts from each model are averaged using weights that are specific to each forecast horizon. This is designed to achieve accuracy in the year of the forecast horizon, and not in all years up to and including the horizon year. These model weights are applied at all ages.
In all model-averaging methods, the model-averaged point forecast for horizon h is computed as the weighted mean

\hat{y}_{n+h|n,M} = \sum_{\ell=1}^{L} w_\ell \, \hat{y}_{n+h|n,M_\ell},

where M denotes the model average; M_\ell, for \ell = 1, \ldots, L, denotes the individual models; \hat{y}_{n+h|n,M_\ell} denotes the point forecast at horizon h obtained from model \ell; \hat{y}_{n+h|n,M} denotes the model-averaged forecast; and (w_1, w_2, \ldots, w_L) are empirical point forecast weights that sum to 1.
For the model-averaged point forecast \hat{y}_{n+h|n,M}, its prediction interval is constructed from its variance assuming a normal distribution. Following Burnham and Anderson (2002, 2004), the unconditional variance of the estimator \hat{y}_{n+h|n,M} based on the L models weighted by empirical interval forecast weights is given by

\mathrm{Var}(\hat{y}_{n+h|n,M}) = \left[ \sum_{\ell=1}^{L} \tilde{w}_\ell \sqrt{ \mathrm{Var}(\hat{y}_{n+h|n,M_\ell}) + \left( \hat{y}_{n+h|n,M_\ell} - \tilde{y}_{n+h|n,M} \right)^2 } \right]^2,

where \tilde{w}_\ell denotes the interval forecast weight under model M_\ell, and \tilde{y}_{n+h|n,M} denotes the model-averaged point forecast obtained using the interval forecast weights. The first term under the square root measures the variance conditional on model M_\ell, while the second term measures the squared bias.
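As an illustration of these two computations, the following sketch (in Python for convenience; the paper's analysis was carried out in R, and both function names here are ours, not the authors') forms the weighted-mean point forecast and the Burnham-Anderson unconditional variance from a set of individual-model forecasts:

```python
import numpy as np

def model_averaged_forecast(point_forecasts, weights):
    """Weighted mean of individual-model point forecasts at one horizon."""
    w = np.asarray(weights, dtype=float)
    y = np.asarray(point_forecasts, dtype=float)
    return float(w @ y)

def unconditional_variance(point_forecasts, variances, interval_weights):
    """Burnham-Anderson unconditional variance of the model-averaged
    estimator: [sum_l w_l * sqrt(Var_l + (y_l - y_tilde)^2)]^2, where
    y_tilde is the weighted mean under the interval forecast weights."""
    w = np.asarray(interval_weights, dtype=float)
    y = np.asarray(point_forecasts, dtype=float)
    v = np.asarray(variances, dtype=float)
    y_tilde = w @ y  # model-averaged point forecast under interval weights
    return float((w @ np.sqrt(v + (y - y_tilde) ** 2)) ** 2)
```

For instance, with two models forecasting rates of 0.05 and 0.07 at some age and equal weights, the model-averaged forecast is 0.06; the variance estimate then adds the between-model squared bias to each model's conditional variance.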
The crucial ingredient in model averaging is the empirical weights. The computation of the point forecast weights and interval forecast weights is based on measures of forecast accuracy. The derivation of weights differs by model-averaging method. We employ four methods chosen on the basis of their ability to perform well in mortality forecasting (Shang 2012a;Shang and Haberman 2018). These four methods for computing the weights are briefly described in the following sections. Further details are given in Appendix A.
It should be noted that in this study, the measures of accuracy are further averaged as part of the study design (see the "Study design" section), increasing the reliability and stability of the weights.

A frequentist approach
Under the frequentist approach employed, the accuracy of a point forecast is measured by mean absolute forecast error (MAFE) while the accuracy of an interval forecast is measured by the mean interval score. These measures are described in Appendix C.
In the simple case of a single point forecast for a particular horizon h, MAFE is averaged over age for the year n + h, and the weight for each model is taken to be the inverse of its MAFE. Weights are obtained for the set of forecasting models and standardised to sum to 1, giving the standardised point forecast weights (w_1, w_2, \ldots, w_L). Similarly, interval forecast accuracy is based on the mean interval score for year n + h, and the weight is the inverse of this score; for the set of models, these weights are likewise standardised to sum to 1, giving the standardised interval forecast weights (\tilde{w}_1, \tilde{w}_2, \ldots, \tilde{w}_L). Further averaging of MAFE and the mean interval score takes place as a result of the study design, before weight calculation.
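A minimal sketch of this inverse-accuracy weighting (Python used for illustration; the paper's code is in R, and the function name is ours):

```python
import numpy as np

def inverse_error_weights(errors):
    """Standardised weights proportional to the inverse of each model's
    accuracy measure (MAFE for point forecasts, mean interval score for
    interval forecasts). Smaller error means larger weight; weights sum to 1."""
    inv = 1.0 / np.asarray(errors, dtype=float)
    return inv / inv.sum()
```

For example, two models with MAFEs of 0.01 and 0.03 would receive weights of 0.75 and 0.25.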

A Bayesian approach
In Bayesian model averaging, a single set of model weights is used for both the point and interval forecasts. The weights are derived from the Bayesian information criterion (BIC), which rewards goodness of fit while penalising the number of parameters; models that fit the data well relative to their complexity therefore receive greater weight. Further details appear in Appendix A.

Model confidence set (MCS)
To examine statistical significance among the set of models and select a subset of superior models, we consider the MCS procedure proposed by Hansen et al. (2011). The procedure consists of a sequence of tests of the hypothesis of equal predictive ability (EPA) between pairs of models, eliminating the worse-performing model at each step and resulting in a smaller set of superior models for which the EPA hypothesis cannot be rejected. After this set of superior models is determined, their forecasts are averaged with equal weights.

Equal weights
As a baseline approach, we also consider assigning equal weights to all six models. Prior research has shown that including all models with equal (or no) weights can yield substantial gains in forecast accuracy. Several studies provide analytical solutions for the conditions under which equal weights provide more accurate out-of-sample forecasts than regression weights (e.g., Einhorn and Hogarth (1975); Davis-Stober (2011)). Equal weights can be advantageous when models fit the data poorly (Graefe 2015).

Data sets
The study is based on fertility data for 17 countries. For all but one country, the data were taken from the Human Fertility Database (2020). Australian fertility rates and populations were obtained from the Australian Bureau of Statistics (Cat. No. 3105.0.65.001, Table 38); this data set is also available in the rainbow package (Shang and Hyndman 2019) in the statistical software R (R Core Team 2020). The data consist of annual fertility rates by single-year age of women aged 15 to 49 and the corresponding populations (births to women aged 50 and older are included in births at age 49). The 17 selected countries all have reliable data series commencing in 1950 or earlier, as shown in Table 1, and extending to 2011.
The overall level of fertility has declined in all countries considered. Figure 1 shows that the general decline in total fertility (births per woman, calculated as the sum of ASFRs in each year) has not been monotonic. It also indicates that the substantial decline in fertility since 1960 has stalled or reversed since about 2000. While trends in ASFRs may differ from this overall pattern, total fertility indicates the average trend.

Six selected models
We select six models used in forecasting fertility. These comprise three functional time-series models for modelling the schedule of ASFRs, where age is treated as a continuum, and three univariate time-series models that model fertility at each age individually. All models are applied to transformed ASFRs so as to ensure non-negative forecast rates.

Functional time-series models
The three models included here stem from the work of Hyndman and Ullah (2007). These models use principal component decomposition with time series forecasting of the time parameter and are similar to the well-known method by Lee and Carter (1992) for modelling and forecasting mortality rates.
The Hyndman-Ullah (HU) model uses smoothed transformed ASFRs. The functional principal component decomposition produces smooth age parameters and corresponding time parameters. Several components are required to adequately describe the data. The time parameters are each modelled using time series methods, providing the forecast. Two variants of the HU model are also selected. The robust HU model (HUrob) is robust to outliers. The weighted HU model (HUw) gives greater weight to more recent data in order to reduce the potential gap between the last year of observed data and the first year of the forecast. The HU model and its two variants are described in detail in Appendix B.

Univariate time-series models
The three selected univariate time-series models (Box et al. 2008) include two random-walk models and an optimal autoregressive integrated moving average (ARIMA) model, and are applied to transformed ASFRs at each age. They are described in detail in Appendix B.
The random-walk models are the random walk (RW) and the random walk with drift (RWD). Each model is applied independently to each age-specific rate. While the RW model assumes a constant underlying rate over time, with all deviations assumed to be error, the RWD assumes a constant underlying change in the transformed rates at each age. The optimal ARIMA model is the model that best describes the time series of an age-specific fertility rate based on its own past values, including its own lags and lagged forecast errors. The fitted equation is then used to forecast future values of the series.
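On the transformed scale, the RW and RWD forecasts have simple closed forms, sketched here in Python (illustrative only; the paper obtains these via the rwf function in R's forecast package, and the drift estimate shown, the mean first difference, is the usual choice but an assumption on our part):

```python
import numpy as np

def rw_forecast(series, h):
    """Random walk: the h-step-ahead forecast is the last observed value."""
    return float(series[-1])

def rwd_forecast(series, h):
    """Random walk with drift: last value plus h times the average
    one-step change in the (transformed) series."""
    y = np.asarray(series, dtype=float)
    drift = (y[-1] - y[0]) / (len(y) - 1)  # equals the mean first difference
    return float(y[-1] + h * drift)
```

Under the RW the forecast is flat at all horizons, whereas under the RWD it continues the average historical trend, which is why the RWD is sensitive to changing trends at long horizons.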

Study design
We implement a two-stage design. At the first stage, the six forecasting models are applied to determine model weights based on forecast errors. At the second stage, the six models and their model averages are evaluated. These two stages use different periods of data as illustrated in Fig. 2. We divide the data for each country into an in-sample period ending in 1991 and a holdout sample period, 1992 to 2011.
The in-sample data are used in the first stage. We further divide these data into a fitting period and a forecasting period. The initial fitting period ended in 1971, and the forecasting period was 1972 to 1991. Using the data in this fitting period, we fit each model, compute 1-year-ahead to 20-year-ahead forecasts of 35 ASFRs, and calculate forecast errors or information criterion values by comparing the forecasts with observed data in the relevant year. Figure 3 provides an example. Then, we expand the fitting period by one year, and compute 1-year-ahead to 19-year-ahead forecasts, and again calculate forecast errors or information criterion values. This process is repeated until the fitting period extends to 1990. In so doing, we have 20 sets of 1-year-ahead forecast errors, 19 sets of 2-year-ahead forecast errors, . . . and one set of 20-year-ahead forecast errors. These forecast errors are used in the calculation of appropriate weights for the model-averaging approach. In practice, the model-averaging weights are derived from forecast errors averaged over horizon-specific forecasts and over the 17 countries in the study. There is thus one set of six weights for each model-averaging method.
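The expanding-window error collection described above can be sketched as follows (Python for illustration; `forecaster` is a hypothetical callback standing in for any of the six models):

```python
def expanding_window_errors(series, first_fit_end, max_h, forecaster):
    """Collect horizon-specific absolute forecast errors over expanding
    fitting periods. `series` holds one age's (transformed) rates in year
    order; `forecaster(fit, h)` returns an h-step-ahead forecast from the
    fitting data. Returns {h: [errors]}, with fewer errors at longer horizons."""
    errors = {h: [] for h in range(1, max_h + 1)}
    n = len(series)
    for fit_end in range(first_fit_end, n):
        fit = series[:fit_end]
        # Forecast only as far as observed data allow for this fitting period.
        for h in range(1, min(max_h, n - fit_end) + 1):
            errors[h].append(abs(forecaster(fit, h) - series[fit_end + h - 1]))
    return errors
```

With a 20-year stage-1 forecasting period this yields 20 sets of 1-step errors down to a single set of 20-step errors, matching the counts described above.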
In the HUw model, where geometrically decaying weights are used, the decay parameter, λ, is estimated from data for 1960 to 1981, for h = 1, . . . , 20. The overlap with the in-sample forecasting period (1972 to 1991) is unavoidable given the short data series available for some countries.
At the second stage, the entire dataset is used. The initial fitting period ended in 1991, and the forecasting period was 1992 to 2011. An expanding fitting period is again employed. Figure 2 shows the fitting and forecasting periods for weight estimation (stage 1) and model-averaged forecasting (stage 2); the start of the fitting period is determined by data availability or purposive choice.

Implementation of the HU models presented in this paper is straightforward with the readily available R package demography (Hyndman 2019). Point and interval forecasts based on the RW, RWD and ARIMA models can be obtained via the rwf and auto.arima functions in the forecast package (Hyndman 2020). The computational code in R for the entire analysis is available upon request.

Point forecast accuracy
In Table 2, we present the out-of-sample point forecast accuracy based on 1-year-ahead to 20-year-ahead MAFEs, which are averaged over ages, years in the forecasting period and countries. Irregularities at longer horizons are due to two factors: the small number of forecasts involved (see Fig. 3) and the fact that they are based entirely on fitting periods ending in 1986 to 1990, a period of changing trends in several countries (see Fig. 1).
As shown in Table 2, we find that gains in forecast accuracy due to model averaging have been achieved at longer (h > 7) rather than shorter forecast horizons. The median value of MAFE is 0.0096 for the model-averaged forecasts with weights selected by the frequentist approach and equal weights, which indicates gains in accuracy over the six models of 4 to 23%. By contrast, the model averaging methods based on the Bayesian approach and MCS do not perform well.

Interval forecast accuracy
In Table 3, we present the mean interval scores for the 1-year-ahead to 20-year-ahead forecasts. As expected, the mean interval score increases with the forecast horizon, reflecting the loss of interval forecast accuracy as the horizon increases. For short horizons, the model-averaged interval forecasts perform less well than the individual models, and the Bayesian and MCS approaches are superior only at these short horizons. For horizons of 7 or more years, however, the frequentist and equal weights approaches consistently out-perform both the individual models and the more complex model-averaging approaches. Across all horizons, the smallest median (5.11) of the mean interval scores (×100) occurs for the model-averaged forecasts based on the frequentist and equal weights approaches.
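Appendix C is not reproduced in this excerpt; the interval score referred to is, we assume, the standard one of Gneiting and Raftery (2007) for a central (1 − α) prediction interval, sketched below in Python:

```python
def interval_score(lower, upper, y, alpha=0.2):
    """Interval score for a central (1 - alpha) prediction interval
    [lower, upper] and realised value y: the interval width, plus a penalty
    of (2 / alpha) times the miss distance when y falls outside the interval.
    Lower scores are better; averaging over ages and years in the forecasting
    period gives the mean interval score."""
    score = upper - lower
    if y < lower:
        score += (2.0 / alpha) * (lower - y)
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)
    return score
```

The score thus rewards narrow intervals but only if they achieve their nominal coverage, which is why it increases with horizon as intervals widen and misses become larger.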

Model averaging in practice
To illustrate the application of model averaging to potentially achieve greater forecast accuracy, we use data for England and Wales for 1938 to 2016 and produce frequentist model-averaged point and interval forecasts for horizons of one year and 20 years, based on the six models. At the first stage, horizon-specific weight estimation is based on fitting periods commencing in 1938 and ending successively in 1996 to 2015, with the corresponding forecasting periods starting in 1997 to 2016. In Table 4, we present the empirical weights for producing the point and interval frequentist model-averaged forecasts. These weights vary little across models, indicating that the six models do not differ appreciably in accuracy. Further, for each model, the point forecast weights and interval forecast weights are roughly equal, showing that the point and interval forecasts are compatible with each other. The fact that the weights differ little across horizons indicates that, in this example, the models do not gain or lose relative advantage over the forecasting period. At the second stage, the fitting period is 1938 to 2016, and the forecasts are produced using the empirical weights obtained at the first stage. The 1-year-ahead and 20-year-ahead model-averaged point and interval forecasts of age-specific fertility are shown in Fig. 4. The 20-year-ahead point forecast portrays a shift in fertility rates to older ages, continuing the previous trend. However, the shift is more pronounced before than after the mode, which changes little, indicating that the tempo effect is nearing the end of its course. Additionally, this forecast substantially reduces the bulge in fertility rates at young ages seen in recent decades (Chandola et al. 1999).

Discussion
This paper has used computational methods to demonstrate that the accuracy of point and interval forecasts for age-specific fertility can be improved through model averaging. The investigation involved four methods for the empirical determination of the weights to be used in model averaging for both the point forecast and the interval forecast. Six models produced the initial set of age-specific fertility forecasts. Among the four model-averaging approaches compared, the frequentist and equal weights approaches performed best on average, over 17 countries and forecast horizons of 1 to 20 years. This evaluation used weights derived from the accuracy of forecasts for 1992 to 2011 based on a long series of observations up to 1991. The frequentist and equal weights approaches produced gains in point forecast accuracy of 4 to 23% over the six individual models. However, the Bayesian and MCS approaches produced reductions of up to 10%, though some gains were achieved. For the interval forecast, the frequentist and equal weights approaches also produced the greatest improvement in accuracy, with gains of 3 to 24%. Again, the Bayesian and MCS approaches produced the least gains and some losses in forecast accuracy.

These results hold considerable promise for the improvement of fertility forecasting in a systematic way using existing models. The frequentist approach and the equal weights approach both offer a simple way to implement model averaging. Combined with the findings of Bohk-Ewald et al. (2018) that simple forecasting models are as accurate as the most complex, it would seem that improved accuracy can be readily achieved through simple model averaging of simple models. The often-cited impediment to the implementation of model averaging, namely complexity, is thus largely removed. To further assist the uptake of model averaging, we look forward to software facilitating its application in fertility forecasting.

Model-averaging approaches
Among the four approaches for the selection of weights, the frequentist and equal weights approaches performed much better than the more complex Bayesian and MCS approaches. The superior performance of the frequentist approach can be attributed to the fact that it assigns weights based on in-sample forecast accuracy. Since ASFRs change in limited ways across age and time, it can be expected that a model that produces smaller in-sample point and interval forecast errors will also produce smaller out-of-sample errors. The performance of the equal weights approach is consistent with previous research (Graefe 2015). It is noted that variation in accuracy across models in Tables 2 and 3 is somewhat limited, partially explaining the superior performance of the equal weights approach. The poor performance of the Bayesian approach can be attributed to the fact that the weights are determined by in-sample goodness of fit rather than in-sample forecast errors; model goodness of fit is not a good indicator of forecast accuracy (Makridakis et al. 2020). The poor performance of the MCS approach can be attributed to the 85% confidence level employed; with a higher confidence level, the MCS method would have selected more accurate models.

Advantages of model averaging
An essential advantage of model averaging is its synergy: improvements in forecast accuracy can be achieved while using existing methods. Further, in order to achieve optimal accuracy, the 'mix' of the set of models employed is allowed to change over the forecasting period by using horizon-specific weights favouring more accurate methods. Thus, a model that is highly accurate in the short term but inaccurate in the long term (or vice versa) can be included in the set of models without jeopardising the accuracy of the model-averaged forecast. This allows the forecaster to make use of a wide range of models and approaches, including those that might otherwise be regarded as unsuitable because of their horizon-limited accuracy. Moreover, the weights determining the mix of models are empirical, removing an element of subjectivity in model choice, though a model must of course be included in the initial set of models if it is to be considered at all.
In theory, forecast accuracy is improved by taking relevant additional information into account, most often in the form of other data, such as in the case of joint forecasting of the fertility of a group of countries (Ortega and Poncela 2005). In model averaging, given a single set of data, the additional information encompasses the features of multiple forecasting models. Model averaging can be expected to improve forecast accuracy because the strengths of different assumptions, model structures and degrees of model complexity are essentially combined through weighting that assigns greater weight to more accurate models. In other words, the model-averaged forecasts are more robust to model misspecification.
This advantage might also be expected to increase with the forecast horizon. For any model, uncertainty increases with forecast horizon because the relevant experience in the increasingly distant fitting period becomes more and more attenuated. The effects of model misspecification may accumulate over time or become increasingly accentuated. By using model averaging, the different effects of model misspecification are averaged rather than accumulated, potentially resulting in reduced uncertainty, especially when counterbalancing occurs, and curtailing the extent to which uncertainty increases with forecast horizon. Our results provide some evidence of a greater model-averaging advantage at longer horizons: the model-averaged point and interval forecasts gain accuracy relative to their six constituent forecasts as the forecast horizon increases (see Tables 2 and 3).
Model averaging can also be expected to improve the reliability of forecast accuracy because, as an average, its range is limited. The model-averaged forecast may not be the most accurate among the set, but it is unlikely to be the least accurate; in effect, model averaging bounds how poor point and interval forecast accuracy can be. When the weights are derived empirically from forecast accuracy on past data, the accuracy of the model average can be expected to be relatively high and can exceed that of any individual model. An example is seen in Table 3.
It is important to note that these advantages of model averaging hold for any set of models designed to predict period or cohort fertility behaviour at any level of detail. In other words, it is not a requirement that the forecasting models are purely extrapolative, as in this paper. Models based on other approaches such as expert judgement (e.g., Lutz et al. (1996)) and explanation (e.g., Ermisch (1992)) as well as Bayesian modelling (e.g., Schmertmann et al. (2014)) can also be used in model averaging in order to take into account their different strengths in predicting fertility behaviour. Indeed, the traditional 'medium' fertility assumption could be included in the set of models.

Limitations
A limitation of this study is that only six models from two families were considered. The set of models included in model averaging will influence the outcome. In this study, we included the RWD, although its appropriateness for longer-term fertility forecasting is limited because of changing trends and potential inconsistencies arising from the independent forecasting of age-specific rates (Bell 1988). The study confirms the RWD to be the least accurate among the six models for both point and interval forecasts. Similarly, the RW is less appropriate for longer horizons because of changing trends. Despite these limitations of the selected models, gains in forecast accuracy were achieved through model averaging. With a larger set of models, greater gains may be achieved because a more diverse range of assumptions and model structures would be taken into account.
A further limitation, arising from data availability, is the focus on period fertility rates, although it is period rates that are used in population projection. Cohort fertility rates have the advantage of describing the fertility behaviour of women over their life course, while parity-specific period rates would provide useful detail to improve model accuracy. Unfortunately, lengthy annual time series of cohort data and parity-specific period data are not available for most of the selected countries, precluding their use in forecasting.

Future research
The findings of this research suggest that model averaging is a potentially fruitful approach to improving the accuracy of fertility forecasting, using models that already exist. A useful technical extension in the case of the frequentist approach to model averaging would be to develop a single set of weights based on a collective criterion incorporating both point and interval forecast accuracy. Additionally, a potential direction for further technical development is to examine other ways of combining models (bagging, boosting and stacking, for example) to achieve greater accuracy than any of the individual models (Bates and Granger 1969).
Based on our findings, we conclude that in the context of fertility forecasting the model-averaging approach warrants further examination using a more extensive range of models and considering both period and cohort data. The choice of potential models is extensive (Booth 2006;Bohk-Ewald et al. 2018) and could usefully include short-term models and those based on explanation and expert opinion. Examination of the weights assigned to different models over time may help to increase understanding of fertility change. Further insights may be gained from the application of model averaging to forecasting total fertility (Saboia 1977;Ortega and Poncela 2005;Alkema et al. 2011), with comparisons of accuracy between direct and indirect forecasting of this aggregate measure.
Finally, we note that optimal forecast accuracy is unlikely to be achieved by preselecting an approach or particular models, but rather is to be found in applying model averaging to a wide range of models encompassing differing strengths in predicting human behaviour. However, while model averaging can be expected to improve forecast accuracy beyond that of the individual models, it is also the case that the accuracy of the individual models broadly determines the range of accuracy that the model-averaged forecast might achieve. Thus, while the prospect of increased accuracy in fertility forecasting is enhanced by the potential of model averaging, the addition of model averaging to the forecasting toolset does not absolve demographers from striving to better understand and model fertility behaviour.

Appendix A: Model averaging methods: technical details
Model averaging involves the computation of weighted means for the point forecast and the interval forecast. Depending on the model-averaging method employed, the weights may be the same for point and interval forecasting, or they may differ. The derivation of weights is described below for the two more complex methods employed in this study, the Bayesian and model confidence set approaches. Full details of the frequentist approach and the use of equal weights appear in the "A frequentist approach" and "Equal weights" sections.
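As a minimal illustration of the weighted-mean combination described above, the following Python sketch combines point forecasts from several models; the function name and numerical values are ours, purely for illustration:

```python
import numpy as np

def average_forecasts(forecasts, weights):
    """Combine individual-model point forecasts into a model-averaged
    forecast as a weighted mean.

    forecasts : array of shape (L, p), forecasts of p age-specific
                fertility rates from L models (hypothetical values).
    weights   : array of length L; normalised to sum to one.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # normalise defensively
    return weights @ forecasts          # weighted mean across models

# Three hypothetical forecasts of a single rate, combined with equal weights:
combined = average_forecasts([[0.10], [0.12], [0.14]], [1/3, 1/3, 1/3])
```

The same weighted mean applies to interval bounds, with weights that may differ from those used for the point forecast, depending on the method.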

A Bayesian viewpoint
Among the models in the set, let one be considered the correct model, and let $\theta_1, \theta_2, \ldots, \theta_L$ be the vectors of parameters associated with each model. Let $\Delta$ be the quantity of interest, such as the combined forecast of ASFRs; its posterior distribution given the observed data, $D$, is
\[
\Pr(\Delta \mid D) = \sum_{\ell=1}^{L} \Pr(\Delta \mid M_\ell, D)\,\Pr(M_\ell \mid D),
\]
where the posterior probability of model $M_\ell$ is approximated by
\[
\Pr(M_\ell \mid D) \approx \frac{\exp(-\mathrm{BIC}_\ell/2)}{\sum_{k=1}^{L} \exp(-\mathrm{BIC}_k/2)}, \tag{1}
\]
where $\mathrm{BIC}_\ell = 2\mathcal{L}_\ell + \log(n)\,\eta_\ell$, $\mathcal{L}_\ell$ is the negative of the log-likelihood, $\eta_\ell$ is the number of parameters in model $\ell$, $n$ represents the number of years in the fitting period, and $\mathrm{BIC}_\ell$ is the Bayesian information criterion for model $\ell$, which can be used as a simple and accurate approximation of the log Bayes factor (see also Kass and Wasserman (1995); Kass and Raftery (1995); Raftery (1995); Burnham and Anderson (2002); Burnham and Anderson (2004); Ntzoufras (2009), Section 11.11). In the case of least squares estimation with normally distributed errors, BIC can be expressed as $\mathrm{BIC}_\ell = n \log(\hat{\sigma}^2_\ell) + \log(n)\,\eta_\ell$, where $\hat{\sigma}^2_\ell = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 / n$ and $(\hat{\varepsilon}_1, \ldots, \hat{\varepsilon}_n)$ are the residuals from the fitted model. The raw BIC values are not directly interpretable, as they contain arbitrary constants and are much affected by sample size. We therefore adjust each BIC value by subtracting the minimum, $\min\{\mathrm{BIC}_\ell;\ \ell = 1, \ldots, L\}$. The adjusted BIC is used in (1).
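The BIC-based weight computation can be sketched as follows (a minimal Python illustration with hypothetical likelihood values; the function name is ours):

```python
import numpy as np

def bma_weights(neg_loglik, n_params, n_years):
    """Approximate posterior model probabilities from BIC values,
    where BIC_l = 2*L_l + log(n)*eta_l, with L_l the negative
    log-likelihood, eta_l the number of parameters, and n the
    number of years in the fitting period."""
    neg_loglik = np.asarray(neg_loglik, dtype=float)
    n_params = np.asarray(n_params, dtype=float)
    bic = 2.0 * neg_loglik + np.log(n_years) * n_params
    bic -= bic.min()            # adjust by subtracting the minimum BIC
    w = np.exp(-bic / 2.0)      # exp(-BIC/2) approximates the Bayes factor
    return w / w.sum()          # normalise to posterior model probabilities

# Hypothetical values for three models fitted to 30 years of data:
weights = bma_weights([100.0, 101.0, 105.0], [4, 5, 6], 30)
```

Subtracting the minimum BIC leaves the normalised weights unchanged while avoiding overflow in the exponentials.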

Model confidence set (MCS)
As the equal predictive ability (EPA) test statistic can be evaluated for any loss function, we adopt the tractable absolute error measure. The procedure begins with an initial set of models of dimension $L$ encompassing all the models considered, $M_0 = \{M_1, M_2, \ldots, M_L\}$. For a given confidence level $1 - \alpha$, a smaller set, the superior set of models $M^{*}_{1-\alpha}$ of dimension $m^{*}$, is determined, where $m^{*} \leq L$. The best scenario is when the final set consists of a single model, i.e., $m^{*} = 1$. Let $l_{\ell,t}$ denote the forecast error for model $\ell$ at time $t$, and let $d_{\ell j,t}$ denote the loss differential between two models $\ell$ and $j$, that is,
\[
d_{\ell j,t} = l_{\ell,t} - l_{j,t}, \qquad \ell, j = 1, \ldots, L, \quad t = 1, \ldots, n.
\]
The EPA hypothesis of equal expected loss is tested on these differentials; while it is rejected, the worst-performing model is eliminated, and the surviving models form the superior set.

Box-Cox transformation

The observed ASFRs are first transformed using the Box-Cox transformation, which is defined as
\[
m_t(x_i) = \begin{cases} \left(f_t(x_i)^{\kappa} - 1\right)/\kappa, & \kappa \neq 0, \\ \log f_t(x_i), & \kappa = 0, \end{cases}
\]
where $f_t(x_i)$ represents the observed ASFR at age $x_i$ in year $t$, $m_t(x_i)$ represents the transformed ASFR, and $\kappa$ is the transformation parameter. Following Hyndman and Booth (2008) and Shang (2015), we use $\kappa = 0.4$, as it gave relatively small holdout-sample forecast errors on the untransformed scale.
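The Box-Cox transformation with $\kappa = 0.4$ and its back-transformation can be sketched as follows (standard Box-Cox form; the function names are illustrative, not from the paper):

```python
import numpy as np

KAPPA = 0.4  # transformation parameter, following Hyndman and Booth (2008)

def boxcox(f, kappa=KAPPA):
    """Box-Cox transformation of observed ASFRs f_t(x_i), f > 0."""
    f = np.asarray(f, dtype=float)
    if kappa == 0.0:
        return np.log(f)
    return (f ** kappa - 1.0) / kappa

def inv_boxcox(m, kappa=KAPPA):
    """Back-transform fitted or forecast values to the ASFR scale."""
    m = np.asarray(m, dtype=float)
    if kappa == 0.0:
        return np.exp(m)
    return (kappa * m + 1.0) ** (1.0 / kappa)

rates = np.array([0.02, 0.08, 0.12])   # hypothetical ASFRs
back = inv_boxcox(boxcox(rates))       # round-trips to the original rates
```

Forecast errors are evaluated on the untransformed scale, so forecasts produced on the transformed scale are back-transformed before assessment.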

Hyndman-Ullah (HU) model and two variants
The HU model (Hyndman and Ullah 2007) is described below, along with the defining features of its two variants, the robust HU model (HUrob) and the weighted HU model (HUw).
1) The transformed ASFRs are first smoothed using a concave regression spline (for details, see Hyndman and Ullah (2007); Shang et al. (2016)). We assume an underlying continuous and smooth function $\{s_t(x);\ x \in [x_1, x_p]\}$ that is observed with error at discrete ages,
\[
m_t(x_i) = s_t(x_i) + \sigma_t(x_i)\,\varepsilon_{t,i},
\]
where $\sigma_t(x_i)$ allows the amount of noise to vary with $x_i$ in year $t$, and $\varepsilon_{t,i}$ is an independent and identically distributed (iid) standard normal random variable.
2) The mean function $\mu(x)$ is estimated by the weighted average
\[
\hat{\mu}(x) = \sum_{t=1}^{n} w_t\,\hat{s}_t(x),
\]
where $w_t = 1/n$ represents equal weighting in the HU and HUrob models. In the HUw model, $\{w_t = \lambda(1-\lambda)^{n-t},\ t = 1, 2, \ldots, n\}$ represents a set of geometrically decaying weights (on the selection of the optimal $\lambda$, see Hyndman and Shang (2009)). The distinguishing feature of the HUw model is that forecasts are based more heavily on recent data.
3) Using functional principal components analysis, the set of continuous and smooth curves $\{\hat{s}_t(x);\ t = 1, 2, \ldots, n\}$ is decomposed into orthogonal functional principal components and their uncorrelated scores:
\[
\hat{s}_t(x) = \hat{\mu}(x) + \sum_{j=1}^{J} k_{t,j}\,b_j(x) + e_t(x),
\]
where $b_1(x), b_2(x), \ldots, b_J(x)$ represents a set of weighted (as above) functional principal components; $k_{t,1}, k_{t,2}, \ldots, k_{t,J}$ is a set of uncorrelated scores; $e_t(x)$ is the estimated error function with mean zero; and $J < n$ is the number of retained functional principal components. Following Hyndman and Booth (2008), we use $J = 6$, as this has been shown to be sufficiently large to produce iid residuals with a mean of zero and finite variance. The HUrob model uses $e_t(x)$ to provide information about possible outlying curves.
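The decomposition in step 3 can be sketched with an SVD-based principal components computation (a simplified Python illustration on a discrete age grid, with hypothetical input; the weighted scores here use ordinary projection, which is a simplification of the full weighted functional approach):

```python
import numpy as np

def fpca_scores(curves, weights=None, J=6):
    """Weighted principal components of smoothed curves on a discrete grid.

    curves  : array of shape (n, p) of smoothed transformed ASFR curves
              s_t(x) evaluated at p ages (hypothetical input).
    weights : length-n weights w_t (equal weights if None).
    J       : number of retained components.
    Returns the weighted mean curve, the first J components b_j(x),
    and the score series k_{t,j}.
    """
    curves = np.asarray(curves, dtype=float)
    n, p = curves.shape
    if weights is None:
        weights = np.full(n, 1.0 / n)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    mean = weights @ curves                               # weighted mean function
    centred = (curves - mean) * np.sqrt(weights)[:, None] # weight the rows
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    basis = Vt[:J]                                        # b_j(x), j = 1..J
    scores = (curves - mean) @ basis.T                    # k_{t,j} by projection
    return mean, basis, scores

# Hypothetical curves for n = 20 years on a grid of p = 8 ages; retaining
# all p components reproduces the curves exactly.
rng = np.random.default_rng(0)
curves = rng.normal(size=(20, 8))
mean, basis, scores = fpca_scores(curves, J=8)
```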
A curve with a comparatively large integrated squared error may be considered an outlier, generated by a different data-generating process than the rest of the observations. An example of this is fertility rates during the baby boom, when sharp temporal increases occurred. Computationally, we use the hybrid algorithm of Hyndman and Ullah (2007). The $h$-year-ahead forecast is obtained as
\[
\hat{s}_{n+h|n}(x) = \hat{\mu}(x) + \sum_{j=1}^{J} \hat{k}_{n+h|n,j}\,b_j(x),
\]
where $\hat{k}_{n+h|n,j}$ denotes the $h$-year-ahead forecast of $k_{n+h,j}$ using a univariate time-series model, such as an optimal autoregressive integrated moving average (ARIMA) model (see the "Optimal ARIMA model" section). Note that multivariate time-series forecasting models, such as vector autoregressive and vector autoregressive moving average models, can also be used to forecast the principal component scores (see, e.g., Aue et al. (2015)). Because of the orthogonality of the functional principal components, the overall forecast variance can be approximated by the sum of four variances:
\[
\operatorname{Var}[m_{n+h}(x)] \approx \hat{\sigma}^2_{\mu}(x) + \sum_{j=1}^{J} b_j^2(x)\,u_{n+h|n,j} + v(x) + \sigma^2_{n+h}(x), \tag{6}
\]
where $\hat{\sigma}^2_{\mu}(x)$ is the variance of the mean function, estimated from the difference between the sample mean function and the smoothed mean function; $b_j^2(x)\,u_{n+h|n,j}$ is the variance of the $j$th term of the functional principal component decomposition, where $u_{n+h|n,j} = \operatorname{Var}\!\left(k_{n+h,j} \mid k_{1,j}, \ldots, k_{n,j}\right)$ can be obtained from the univariate time-series models used for forecasting the scores; $v(x)$ is the model error variance, estimated by averaging $\{\hat{e}_1^2(x), \ldots, \hat{e}_n^2(x)\}$ for each $x$; and $\sigma^2_{n+h}(x)$ is the smoothing error variance, estimated by averaging $\{\hat{\sigma}_1^2(x), \ldots, \hat{\sigma}_n^2(x)\}$ for each $x$. The prediction interval is constructed via (6) on the basis of normality.
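The construction of a normal prediction interval from the four variance components can be sketched as follows (an illustrative Python function with hypothetical inputs; the function name and argument names are ours):

```python
import numpy as np
from statistics import NormalDist

def prediction_interval(point, var_mu, basis, u, v, var_smooth, level=0.95):
    """Normal prediction interval for the h-year-ahead forecast at each age.

    point      : point forecast over the age grid (length p)
    var_mu     : variance of the mean function
    basis      : J x p matrix of principal components b_j(x)
    u          : length-J score forecast variances u_{n+h|n,j}
    v          : model error variance v(x)
    var_smooth : smoothing error variance
    """
    basis = np.asarray(basis, dtype=float)
    u = np.asarray(u, dtype=float)
    total_var = (np.asarray(var_mu, dtype=float)
                 + (basis ** 2).T @ u        # sum_j b_j(x)^2 * u_{n+h|n,j}
                 + np.asarray(v, dtype=float)
                 + np.asarray(var_smooth, dtype=float))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # normal quantile
    half = z * np.sqrt(total_var)
    point = np.asarray(point, dtype=float)
    return point - half, point + half

# Hypothetical single-age example on the transformed scale:
lo, hi = prediction_interval(point=[1.0], var_mu=[0.01], basis=[[1.0]],
                             u=[0.01], v=[0.01], var_smooth=[0.01])
```

Intervals computed on the transformed scale are back-transformed to the ASFR scale for reporting.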