Synergy in fertility forecasting: Improving forecast accuracy through model averaging

Accuracy in fertility forecasting has proved challenging and warrants renewed attention. One way to improve accuracy is to combine the strengths of a set of existing models through model averaging. The model-averaged forecast is derived using empirical model weights that optimise forecast accuracy at each forecast horizon based on historical data. We apply model averaging to fertility forecasting for the first time, using data for 17 countries and six models. Four model-averaging methods are compared: frequentist, Bayesian, model confidence set, and equal weights. We compute individual-model and model-averaged point and interval forecasts at horizons of one to 20 years. We demonstrate gains in average accuracy of 4-23% for point forecasts and 3-24% for interval forecasts, with greater gains from the frequentist and equal-weights approaches at longer horizons. Data for England & Wales are used to illustrate model averaging in forecasting age-specific fertility to 2036. The advantages and further potential of model averaging for fertility forecasting are discussed. As the accuracy of model-averaged forecasts depends on the accuracy of the individual models, there is an ongoing need to develop better models of fertility for use in forecasting and model averaging. We conclude that model averaging holds considerable promise for improving fertility forecasting in a systematic way using existing models and warrants further investigation.

Fertility forecasts are a vital element of population and labour force forecasts, and accurate fertility forecasting is essential for government policy, planning and decision-making regarding the allocation of resources to multiple sectors, including maternal and child health, childcare, education and housing.
Though the level of fertility in industrialised countries has at times been of major concern (e.g., Kuczynski 1937, Booth 1986, Longman 2004), models for forecasting fertility have been developed relatively recently (for an earlier review, see Booth 2006). Indeed, many of the models used in fertility forecasting were originally designed to smooth age-specific fertility rates or to complete the incomplete experience of a single cohort, and relatively few have been used in longer-term forecasting of period fertility rates; further, many are deterministic, providing no indication of uncertainty (Bohk-Ewald et al. 2018).
Stochastic models for forecasting period age-specific fertility rates (ASFRs) make use of various time-series extrapolation models and generally involve extensive simulation to take account of covariance in the estimation of uncertainty. The most straightforward approach uses univariate time-series models (see, e.g., Box et al. 2008) to extrapolate the trend at each age. The main limitation of this approach is inconsistency among age-specific forecasts, possibly producing an implausible age pattern of future fertility. To remedy this, the age pattern is first modelled, and its parameters are then forecast. Parametric models used in forecasting include the beta, gamma, double exponential and Hadwiger functions (see, e.g., Thompson et al. 1989, Congdon 1990, Knudsen et al. 1993, Keilman & Pham 2000), while semi-parametric models include the Coale-Trussell and Relational Gompertz models (see, e.g., Coale & Trussell 1974, Brass 1981, Murphy 1982, Booth 1984, Zeng et al. 2000). The use of these models is variously limited by uninterpretable parameters, overparameterization and the need for vector autoregression. Structural change also limits their utility, especially where vector autoregression is involved (Booth 2006).
Nonparametric methods use a dimension-reduction technique, such as principal components analysis, to linearly transform ASFRs into a series of time-varying indices to be forecast (see, e.g., Bozik & Bell 1987, Bell 1992, Lee 1993, Hyndman & Ullah 2007). This approach parallels the Lee-Carter model and its many variants and extensions in mortality forecasting (see, e.g., Shang et al. 2011), but has received far less attention. Other contributions include state-space, Bayesian and stochastic diffusion approaches (see, e.g., Rueda & Rodríguez 2010, Schmertmann et al. 2014).
The strengths and weaknesses of fertility forecasting models have not been thoroughly evaluated.
Noting the absence of guidance on model choice, Bohk-Ewald et al. (2018) compared the point and interval forecast accuracy of 20 major models for fertility forecasting with 162 variants. In the context of completing cohort fertility, their evaluation found that only four methods were consistently more accurate than the constant (no change) model and that, among these, complex Bayesian models did not outperform simple extrapolative models. These findings were largely consistent across countries. In earlier research, Shang (2012a) compared the point and interval forecast accuracy of several models for fertility forecasting, finding the weighted Hyndman-Ullah model (see Section 3.2) to be marginally more accurate on both counts. While this research is useful in identifying models that perform well based on extensive data, such models may not perform well in every circumstance. Moreover, there is undoubtedly scope for improvement in fertility forecasting.
In other areas of forecasting, model averaging (Bates & Granger 1969, Dickinson 1975, Clemen 1989) has been employed to improve point and interval forecast accuracy. However, with the exceptions of Shang (2012b) and Shang & Haberman (2018), model averaging has been neglected in forecasting demographic rates. This paper aims to empirically assess the extent to which forecast accuracy can be improved through model averaging in the context of age-specific fertility forecasting.
Our assessment covers 17 countries with varied fertility experience and is based on data series extending back to 1950 or before. We consider six selected models and four methods for selecting model averaging weights.
The structure of this article is as follows. In Section 2, we introduce model averaging and briefly present the four methods for selecting weights; the technical details of these methods are given in Appendix A. The data, six selected models and study design are described in Section 3, and technical details of the six models appear in Appendix B. Based on the accuracy measures discussed in Appendix C, in Section 4, we evaluate the point and interval forecast accuracies of the six models and of the four model-averaged forecasts. In Section 5, we provide an example of model averaging in forecasting age-specific fertility, using data for England & Wales. Finally, the discussion appears in Section 6.

Model averaging
The idea of model averaging has often been studied in statistics, dating back to the seminal work of Bates & Granger (1969). A flurry of articles dedicated to the topic then appeared; see Clemen (1989) for a review from a frequentist viewpoint and Hoeting et al. (1999) for a review from a Bayesian perspective. More recent developments in model averaging are collected in the monograph by Claeskens & Hjort (2008). In demographic forecasting, there has been limited usage, but notable exceptions include Smith & Shahidullah (1995), Ahlburg (1998, 2001) and Sanderson (1998) in the context of census tract forecasting.
In essence, the model-averaging approach combines forecasts from a set of two or more models. Because these models may reflect different assumptions, model structures and degrees of model complexity, it is expected that better forecast accuracy can be achieved through averaging. The forecasts from each model are averaged using weights that are specific to each forecast horizon. This is designed to achieve accuracy in the year of the forecast horizon, and not in all years up to and including the horizon year. These model weights are applied at all ages.
In all model-averaging methods, the model-averaged point forecast for horizon $h$ is computed as the weighted mean

$$\hat{y}_{n+h|n,M} = \sum_{\ell=1}^{L} w_\ell\, \hat{y}_{n+h|n,M_\ell},$$

where $M$ represents the model average; $M_\ell$, for $\ell = 1, \ldots, L$, represents the individual models; $\hat{y}_{n+h|n,M_\ell}$ represents the point forecast at horizon $h$ obtained from model $\ell$; $\hat{y}_{n+h|n,M}$ represents the model-averaged forecast; and $(w_1, w_2, \ldots, w_L)$ are empirical point forecast weights that sum to 1.
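As a concrete sketch of this weighted mean, the snippet below combines six point forecasts at a single horizon. The forecast values and weights are hypothetical, and Python is used purely for illustration (the study itself was carried out in R):

```python
# Model-averaged point forecast at a single horizon h:
# a weighted mean of the individual models' forecasts,
# with weights summing to 1. All numbers are hypothetical.
forecasts = {"HU": 1.82, "HUrob": 1.79, "HUw": 1.85,
             "RW": 1.90, "RWD": 1.74, "ARIMA": 1.80}
weights = {"HU": 0.20, "HUrob": 0.18, "HUw": 0.22,
           "RW": 0.12, "RWD": 0.10, "ARIMA": 0.18}

assert abs(sum(weights.values()) - 1.0) < 1e-12  # weights sum to 1

averaged = sum(weights[m] * forecasts[m] for m in forecasts)
print(round(averaged, 4))  # prints 1.8192
```

Horizon-specific weights simply mean that a separate `weights` dictionary is estimated for each h.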
For the model-averaged point forecast $\tilde{y}_{n+h|n,M}$, its prediction interval is constructed from its variance assuming a normal distribution. Following Burnham & Anderson (2002, 2004), the unconditional variance of the estimator $\tilde{y}_{n+h|n,M}$ based on the $L$ models weighted by empirical interval forecast weights $(\tilde{w}_1, \tilde{w}_2, \ldots, \tilde{w}_L)$ is given by

$$\mathrm{Var}\big(\tilde{y}_{n+h|n,M}\big) = \left[\sum_{\ell=1}^{L} \tilde{w}_\ell \sqrt{\mathrm{Var}\big(\hat{y}_{n+h|n,M_\ell}\big) + \big(\hat{y}_{n+h|n,M_\ell} - \tilde{y}_{n+h|n,M}\big)^2}\right]^2.$$

The crucial ingredient in model averaging is the empirical weights. The computation of the point forecast weights and interval forecast weights is based on measures of forecast accuracy. The derivation of weights differs by model-averaging method. We employ four methods chosen on the basis of their ability to perform well in mortality forecasting (Shang 2012b, Shang & Haberman 2018).
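A minimal sketch of the Burnham-Anderson unconditional variance and the resulting normal prediction interval, using hypothetical forecasts, variances and weights (the actual analysis was implemented in R):

```python
import math

# Burnham-Anderson (2002, 2004) unconditional variance of a model average:
# each model's variance is inflated by the squared deviation of its forecast
# from the model-averaged forecast. All numbers are hypothetical.
point = [1.82, 1.79, 1.85, 1.90, 1.74, 1.80]      # individual point forecasts
var = [0.010, 0.012, 0.009, 0.020, 0.015, 0.011]  # individual forecast variances
w = [0.20, 0.18, 0.22, 0.12, 0.10, 0.18]          # interval forecast weights (sum to 1)

avg = sum(wi * yi for wi, yi in zip(w, point))
sd_avg = sum(wi * math.sqrt(vi + (yi - avg) ** 2)
             for wi, vi, yi in zip(w, var, point))
var_avg = sd_avg ** 2

# 95% prediction interval assuming normality
z = 1.959964
lower, upper = avg - z * sd_avg, avg + z * sd_avg
print(round(lower, 3), round(upper, 3))
```

Note that the deviation term widens the interval when the individual forecasts disagree, which is the intended behaviour of the unconditional variance.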
These four methods for computing the weights are briefly described in the following sections. Further details are given in Appendix A.
It should be noted that in this study, the measures of accuracy are further averaged as part of the study design (see Section 3.3), increasing the reliability and stability of the weights.

A frequentist approach
Under the frequentist approach employed, the accuracy of a point forecast is measured by mean absolute forecast error (MAFE) while the accuracy of an interval forecast is measured by the mean interval score. These measures are described in Appendix C.
In the simple case of a single point forecast for a particular horizon h, MAFE is averaged over age for the year n + h, and the weight is taken to be equal to the inverse of MAFE. Weights are obtained for the set of forecasting models and standardised to sum to 1, giving the standardised weights (w 1 , w 2 , . . . , w L ). Similarly, interval forecast accuracy is based on the mean interval score for year n + h, and the weight is equal to its inverse. For the set of models, weights are standardised to achieve proportionality, giving the standardised weights (w 1 , w 2 , . . . , w L ). Further averaging of MAFE and the mean interval score takes place as a result of study design, before weight calculation.
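The inverse-MAFE weighting described above can be sketched as follows; the MAFE values are hypothetical, and Python is used only for illustration:

```python
# Frequentist weighting scheme: weights proportional to the inverse of each
# model's mean absolute forecast error (MAFE), standardised to sum to 1.
# The MAFE values are hypothetical.
mafe = {"HU": 0.05, "HUrob": 0.06, "HUw": 0.04,
        "RW": 0.10, "RWD": 0.12, "ARIMA": 0.05}

inv = {m: 1.0 / e for m, e in mafe.items()}
total = sum(inv.values())
weights = {m: v / total for m, v in inv.items()}

assert abs(sum(weights.values()) - 1.0) < 1e-12
# The most accurate model (smallest MAFE) receives the largest weight.
assert max(weights, key=weights.get) == min(mafe, key=mafe.get)
print({m: round(v, 3) for m, v in weights.items()})
```

Interval forecast weights follow the same construction, with the mean interval score replacing MAFE.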

A Bayesian approach
In Bayesian model averaging, a single set of model weights is used for the point and interval forecasts.
The weights are derived from the Bayesian information criterion (BIC) and are thus proportional to model goodness-of-fit and incorporate a penalty for more parameters. Further details appear in Appendix A.

Model confidence set (MCS)
To examine statistical significance among the set of models and select a set of superior models, we consider the MCS procedure. The MCS procedure proposed by Hansen et al. (2011) consists of a sequence of tests of the hypothesis of equal predictive ability (EPA), eliminating the worst-performing model at each step and resulting in a smaller set of superior models for which the EPA hypothesis cannot be rejected. After this set of superior models is determined, their forecasts are averaged without weights.

Equal weights
As a baseline approach, we also consider assigning equal weights to all six models. Prior research has shown that including all models in an equal-weights (or unweighted) model average can yield substantial gains in forecast accuracy. Several studies provide analytical solutions for the conditions under which equal weights provide more accurate out-of-sample forecasts than regression weights (e.g., Einhorn & Hogarth 1975, Davis-Stober 2011). Equal weights can be advantageous when models fit the data poorly (Graefe 2015).

Data sets
The study is based on fertility data for 17 countries. For all but one country, the data were taken from the Human Fertility Database (2020). Australian fertility rates and populations were obtained from the Australian Bureau of Statistics (Cat. No. 3105.0.65.001, Table 38); this data set is also available in the rainbow package (Shang & Hyndman 2019) in the statistical software R (R Core Team 2020). The data consist of annual fertility rates by single-year age of women aged 15 to 49 and corresponding population (births to women aged 50 and older are included in births at age 49). The 17 selected countries all have reliable data series commencing in 1950 or earlier, as shown in Table 1, and extending to 2011.

Table 1: Periods of available fertility data for the 17 selected countries
Australia 1921-
Austria 1951-2017
Canada 1921-2011
Czech Republic 1950-2016
Denmark 1916-2016
England & Wales 1938-2016
Finland 1939-
France 1946-2016
Hungary 1950-2017
Japan 1947-2016
Netherlands 1950-2016
Scotland 1945-2016
Slovakia 1950-
Spain 1922-2016
Sweden 1891-2016
Switzerland 1932-2016
USA 1933-2016

The overall level of fertility has declined in all countries considered. Figure 1 shows that the general decline in total fertility (births per woman, calculated as the sum of ASFRs in each year) has not been monotonic. It also indicates that the substantial decline in fertility since 1960 has stalled or reversed since about 2000. While trends in ASFRs may differ from this overall pattern, total fertility provides the average trend.

Six selected models
We select six models used in forecasting fertility. These include three functional time-series models for modelling the schedule of ASFRs, where age is treated as a continuum, and three univariate time-series models that model fertility at each age individually. All models are applied to transformed ASFRs so as to ensure non-negative forecast rates.

Functional time-series models
The three models included here stem from the work of Hyndman & Ullah (2007). These models use principal component decomposition with time series forecasting of the time parameter, and are similar to the well-known method by Lee & Carter (1992) for modelling and forecasting mortality rates. Several components are required to adequately describe the data. The time parameters are each modelled using time series methods, providing the forecast.
Two variants of the HU model are also selected. The robust HU model (HUrob) is robust to outliers. The weighted HU model (HUw) gives greater weight to more recent data in order to reduce the potential gap between the last year of observed data and the first year of the forecast. The HU model and its two variants are described in detail in Appendix B.

Univariate time-series models
The three selected univariate time-series models (Box et al. 2008) include two random-walk models and an optimal autoregressive integrated moving average (ARIMA) model, and are applied to transformed ASFRs at each age. They are described in detail in Appendix B.
The random-walk models include the random walk (RW) and the random walk with drift (RWD).
Each model is applied independently to each age-specific rate. While the RW model assumes a constant underlying rate over time with all deviations assumed to be error, the RWD assumes a constant underlying change in the transformed rates at each age.
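The two random-walk forecasts can be sketched as follows, using a short hypothetical series of transformed rates (Python is used for illustration; the study used the R forecast package):

```python
# RW and RWD forecasts for a single (transformed) age-specific series.
# The series and horizon are hypothetical.
rates = [1.10, 1.08, 1.05, 1.03, 1.00, 0.98]  # transformed ASFRs, years 1..n
h = 5  # forecast horizon in years

# RW: constant underlying rate; the h-step forecast is the last observation.
rw_forecast = rates[-1]

# RWD: constant underlying change; the drift is the average one-step change,
# and the forecast extrapolates it h steps ahead.
drift = (rates[-1] - rates[0]) / (len(rates) - 1)
rwd_forecast = rates[-1] + h * drift

print(rw_forecast, round(rwd_forecast, 3))  # prints 0.98 0.86
```

The contrast is clear: the RW forecast is flat at the last observed value, while the RWD continues the average historical trend.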
The optimal ARIMA model is the model that best describes the time series of an age-specific fertility rate in terms of its own lags and lagged forecast errors. The fitted equation is then used to forecast future values of the series.

Study design
We implement a two-stage design. At the first stage, the six forecasting models are applied to determine model weights based on forecast errors. At the second stage, the six models and their model averages are evaluated. These two stages use different periods of data as illustrated in Figure 2.
We divide the data for each country into an in-sample period ending in 1991 and a holdout sample period, 1992 to 2011.
Forecast errors are again calculated by comparison with observed data and used in the evaluation of point and interval forecast accuracy as measured by MAFE and the mean interval score, respectively.
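The two accuracy measures can be sketched as follows. The interval score here follows the widely used Gneiting-Raftery scoring rule for a central interval with nominal coverage 1 - alpha; it may differ in minor details from the definition in Appendix C, and the data below are hypothetical:

```python
# Mean absolute forecast error (MAFE) for point forecasts, and the mean
# interval score for interval forecasts. Observations, forecasts and
# interval bounds are hypothetical.
def mafe(observed, forecast):
    return sum(abs(o - f) for o, f in zip(observed, forecast)) / len(observed)

def mean_interval_score(observed, lower, upper, alpha=0.2):
    score = 0.0
    for y, l, u in zip(observed, lower, upper):
        s = u - l                         # width penalty
        if y < l:
            s += (2.0 / alpha) * (l - y)  # penalty for falling below the interval
        elif y > u:
            s += (2.0 / alpha) * (y - u)  # penalty for falling above the interval
        score += s
    return score / len(observed)

obs = [1.80, 1.75, 1.70]
fc = [1.78, 1.77, 1.65]
lo = [1.70, 1.68, 1.60]
hi = [1.90, 1.86, 1.72]
print(round(mafe(obs, fc), 4), round(mean_interval_score(obs, lo, hi), 4))
```

Smaller values of both measures indicate better accuracy, which is why the frequentist weights are taken as their inverses.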
Implementation of the HU models presented in this paper is straightforward with the readily available R package demography (Hyndman 2019). Point and interval forecasts based on the RW, RWD and ARIMA models can be obtained via the rwf and auto.arima functions in the forecast package (Hyndman 2020). The computational code in R for the entire analysis is available upon request.

Point forecast accuracy
In Table 2, we present the out-of-sample point forecast accuracy based on one-year-ahead to 20-year-ahead MAFEs, which are averaged over ages, years in the forecasting period and countries.
Irregularities at longer horizons are due to two factors: the small number of forecasts involved (see Figure 2) and the fact that they are based entirely on fitting periods ending in 1986 to 1990, a period of changing trends in several countries (see Figure 1).

Interval forecast accuracy
In Table 3, we present the mean interval scores for the one-year-ahead to 20-year-ahead forecasts.
As expected, the mean interval score increases with the forecast horizon, reflecting the loss of interval forecast accuracy as the horizon increases. For short horizons, the model-averaged interval forecasts perform less well than individual models, and the Bayesian and MCS approaches are briefly superior. For horizons of 7 or more years, however, the frequentist and equal weights approaches consistently out-perform the individual models and the more complex model-averaging approaches.
Across all horizons, the smallest median (5.11) of the mean interval scores (×100) occurs for the model-averaged forecasts based on the frequentist and equal-weights approaches.

The fact that the weights differ little across horizons indicates that, in this example, the models do not gain or lose relative advantage over the forecasting period. At the second stage, the fitting period is 1938 to 2016, and the forecasts are produced using the empirical weights obtained at the first stage. The one-year-ahead and 20-year-ahead model-averaged point and interval forecasts of age-specific fertility are shown in Figure 4. The 20-year-ahead point forecast portrays a shift in fertility rates to older ages, continuing the previous trend. However, the shift is more pronounced before than after the mode, which changes little, indicating that the tempo effect is nearing the end of its course. Additionally, this forecast substantially reduces the bulge in fertility rates at young ages seen in recent decades (Chandola et al. 1999).

For the point forecast, the frequentist and equal-weights approaches produced the greatest improvement in accuracy, with gains of 4% to 23%. For the interval forecast, the frequentist and equal-weights approaches also produced the greatest improvement in accuracy, with gains of 3% to 24%. Again, the Bayesian and MCS approaches produced the least gains and some losses in forecast accuracy.
These results hold considerable promise for the improvement of fertility forecasting in a systematic way using existing models. The frequentist approach and the equal weights approach both offer a simple way to implement model averaging. Combined with the findings of Bohk-Ewald et al. (2018) that simple forecasting models are as accurate as the most complex, it would seem that improved accuracy can be readily achieved through simple model averaging of simple models. The often cited impediment to the implementation of model averaging, namely complexity, is thus largely removed. To further assist the uptake of model averaging, we look forward to software facilitating its application in fertility forecasting.

Model-averaging approaches
Among the four approaches for the selection of weights, the frequentist and equal-weights approaches performed much better than the more complex Bayesian and MCS approaches. The superior performance of the frequentist approach can be attributed to the fact that it assigns weights based on in-sample forecast accuracy. Since ASFRs change in limited ways across age and time, it can be expected that a model that produces smaller in-sample point and interval forecast errors will also produce smaller out-of-sample errors. The performance of the equal-weights approach is consistent with previous research (Graefe 2015). It is noted that variation in accuracy across models in Tables 2 and 3 is somewhat limited, partially explaining the superior performance of the equal-weights approach. The poor performance of the Bayesian approach can be attributed to the fact that the weights are determined based on in-sample goodness of fit rather than in-sample forecast errors, given that model goodness of fit is not a good indicator of forecast accuracy (Makridakis et al. 2020). The poor performance of the MCS approach can be attributed to the 85% confidence level employed. With a higher confidence level, the MCS method would have selected more accurate models.

Advantages of model averaging
An essential advantage of model averaging is its synergy. In other words, improvements in forecast accuracy can be achieved while using existing methods. Further, in order to achieve optimal accuracy, the 'mix' of the set of models employed is allowed to change over the forecasting period by using horizon-specific weights favouring more accurate methods. Thus, a model that is highly accurate in the short term but inaccurate in the long term (or vice versa) can be included in the set of models without jeopardising the accuracy of the model-averaged forecast. This allows the forecaster to make use of a wide range of models and approaches, including those that might otherwise be regarded as unsuitable because of their horizon-limited accuracy. Moreover, the weights determining the mix of models are empirical, removing an element of subjectivity in model choice, though a model must of course be included in the initial set of models if it is to be considered at all.
In theory, forecast accuracy is improved by taking relevant additional information into account, most often in the form of other data, such as in the case of joint forecasting of the fertility of a group of countries (Ortega & Poncela 2005). In model averaging, given a single set of data, the additional information encompasses the features of multiple forecasting models. Model averaging can be expected to improve forecast accuracy because the strengths of different assumptions, model structures and degrees of model complexity are essentially combined through weighting that assigns greater weight to more accurate models. In other words, the model-averaged forecasts are more robust to model misspecification.
This advantage might also be expected to increase with the forecast horizon. For any model, uncertainty increases with forecast horizon because the relevant experience in the increasingly distant fitting period becomes more and more attenuated. The effects of model misspecification may be cumulated over time, or increasingly accentuated. By using model averaging, the different effects of model misspecification are averaged rather than cumulated, potentially resulting in reduced uncertainty, especially when counterbalancing occurs, and curtailing the extent to which uncertainty increases with forecast horizon. Our results provide some evidence of a greater model-averaging advantage at longer horizons. The model-averaged point and interval forecasts gain ground as forecast horizon increases in terms of accuracy in relation to their six constituent forecasts (see Tables 2 and 3).
Model averaging can also be expected to improve the reliability of forecast accuracy because as an average its range is limited. The model averaging approaches may not be the most accurate forecast among the set, but they are not likely to be the least accurate. In effect, model averaging imposes a lower bound on point and interval forecast accuracy. When the weights are derived empirically based on forecast accuracy from past data, model average accuracy can be expected to be relatively high and can be more accurate than any individual model. An example is seen in Table 3.
It is important to note that these advantages of model averaging hold for any set of models designed to predict period or cohort fertility behaviour at any level of detail. In other words, it is not a requirement that the forecasting models are purely extrapolative, as in this paper. Models based on other approaches such as expert judgement (e.g., Lutz et al. 1996) and explanation (e.g., Ermisch 1992), as well as Bayesian modelling (e.g., Schmertmann et al. 2014) can also be used in model averaging in order to take into account their different strengths in predicting fertility behaviour. Indeed, the traditional 'medium' fertility assumption could be included in the set of models.

Limitations
A limitation of this study is that only six models from two families were considered. The set of models included in model averaging will influence the outcome. In this study, we included the RWD, although its appropriateness for longer-term fertility forecasting is limited because of changing trends and potential inconsistencies arising from independent forecasting of age-specific rates (Bell 1988). The study confirms the RWD to be the least accurate among the six models for both point and interval forecasts. Similarly, the RW is less appropriate for longer horizons because of changing trends. Despite these limitations of the selected models, gains in forecast accuracy were achieved through model averaging. With a larger set of models, greater gains may be achieved because a more diverse range of assumptions and model structures would be taken into account.
A further limitation, arising from data availability, is the focus on period fertility rates, although it is period rates that are used in population projection. Cohort fertility rates have the advantage of describing the fertility behaviour of women over their life course, while parity-specific period rates would provide useful detail to improve model accuracy. Unfortunately, lengthy annual time series of cohort data and parity-specific period data are not available for most of the selected countries, precluding their use in forecasting.

Future research
The findings of this research suggest that model averaging is a potentially fruitful approach to improving the accuracy of fertility forecasting, using models that already exist. A useful technical extension, in the case of the frequentist approach to model averaging, would be to develop a single set of weights based on a collective criterion incorporating both point and interval forecast accuracy.
Additionally, a potential direction for further technical development is to examine other ways of combining models (bagging, boosting and stacking, for example) to achieve greater accuracy than any of the individual models (Bates & Granger 1969).
Based on our findings, we conclude that in the context of fertility forecasting the model-averaging approach warrants further examination using a more extensive range of models and considering both period and cohort data. The choice of potential models is extensive (Booth 2006, Bohk-Ewald et al. 2018) and could usefully include short-term models and those based on explanation and expert opinion. Examination of the weights assigned to different models over time may help to increase understanding of fertility change. Further insights may be gained from the application of model averaging to forecasting total fertility (Saboia 1977, Ortega & Poncela 2005, Alkema et al. 2011), with comparisons of accuracy between direct and indirect forecasting of this aggregate measure.
Finally, we note that optimal forecast accuracy is unlikely to be achieved by preselecting an approach or particular models, but rather is to be found in applying model averaging to a wide range of models encompassing differing strengths in predicting human behaviour. However, while model averaging can be expected to improve forecast accuracy beyond that of the individual models, it is also the case that the accuracy of the individual models broadly determines the range of accuracy that the model-averaged forecast might achieve. Thus, while the prospect of increased accuracy in fertility forecasting is enhanced by the potential of model averaging, the addition of model averaging to the forecasting toolset does not absolve demographers from striving to better understand and model fertility behaviour.
Model averaging involves the computation of weighted means for the point forecast and the interval forecast. Depending on the model-averaging method employed, the weights may be the same for point and interval forecasting, or they may differ. The derivation of weights is described below for the two more complex methods employed in this study, the Bayesian and model confidence set approaches. Full details of the frequentist approach and the use of equal weights appear in Sections 2.1 and 2.4.

A.1 A Bayesian viewpoint
Among the $L$ models in the set, let one be considered the correct model, and let $\theta_1, \theta_2, \ldots, \theta_L$ be the vectors of parameters associated with each model. Let $\Delta$ be the quantity of interest, such as the combined forecast of ASFRs; its posterior distribution given the observed data, $D$, is

$$\Pr(\Delta \mid D) = \sum_{\ell=1}^{L} \Pr(\Delta \mid M_\ell, D)\Pr(M_\ell \mid D),$$

where the posterior model probabilities, which serve as the model weights, can be approximated by

$$w_\ell = \Pr(M_\ell \mid D) \approx \frac{\exp(-\mathrm{BIC}_\ell/2)}{\sum_{\tau=1}^{L}\exp(-\mathrm{BIC}_\tau/2)},$$

where $\mathrm{BIC}_\ell = 2\mathcal{L}_\ell + \log(n)\eta_\ell$, $\mathcal{L}_\ell$ is the negative of the log-likelihood, $\eta_\ell$ is the number of parameters in model $\ell$, $n$ represents the number of years in the fitting period, and $\mathrm{BIC}_\ell$ is the Bayesian information criterion for model $\ell$, which can be used as a simple and accurate approximation of the log Bayes factor (see also Kass & Wasserman 1995, Kass & Raftery 1995, Raftery 1995, Burnham & Anderson 2002, 2004, Ntzoufras 2009). In the case of least squares estimation with normally distributed errors, $\mathrm{BIC}_\ell$ can be expressed as

$$\mathrm{BIC}_\ell = n\log\big(\hat{\sigma}^2_\ell\big) + \log(n)\eta_\ell,$$

where $\hat{\sigma}^2_\ell$ is the residual sum of squares of model $\ell$ divided by $n$.
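The BIC-based weights can be sketched as follows, with hypothetical BIC values (Python is used only for illustration):

```python
import math

# Bayesian-model-averaging weights derived from BIC: each model's weight is
# proportional to exp(-BIC/2), approximating the posterior model probability.
# The BIC values are hypothetical.
bic = {"HU": 210.4, "HUrob": 212.1, "HUw": 208.9,
       "RW": 230.5, "RWD": 225.0, "ARIMA": 211.3}

best = min(bic.values())
# Subtract the minimum BIC before exponentiating for numerical stability;
# the constant cancels in the normalisation.
raw = {m: math.exp(-(b - best) / 2.0) for m, b in bic.items()}
total = sum(raw.values())
weights = {m: r / total for m, r in raw.items()}

assert abs(sum(weights.values()) - 1.0) < 1e-12
# The model with the smallest BIC receives the largest weight.
assert max(weights, key=weights.get) == min(bic, key=bic.get)
print({m: round(w, 3) for m, w in weights.items()})
```

Because the weights decay exponentially in BIC, even modest BIC differences concentrate nearly all weight on the best-fitting model, which is consistent with the sensitivity of this approach to in-sample fit noted in the discussion.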

A.2 Model confidence set (MCS)
As the equal predictive ability (EPA) test statistic can be evaluated for any loss function, we adopt the following notation. Let $L_{\ell,t}$ denote the loss of model $\ell$ at time point $t$, and let $d_{\ell\tau,t}$ denote the loss differential between two models $\ell$ and $\tau$, that is,

$$d_{\ell\tau,t} = L_{\ell,t} - L_{\tau,t}, \qquad \ell, \tau = 1, \ldots, L, \quad t = 1, \ldots, n,$$

calculated as the loss of model $\ell$ relative to any other model $\tau$ at time point $t$. The EPA hypothesis for a given set $M$ of models can be formulated in two ways:

$$H_{0,M}: c_{\ell\tau} = \mathrm{E}(d_{\ell\tau,t}) = 0, \qquad \text{for all } \ell, \tau = 1, 2, \ldots, L,$$

or, equivalently, $H_{0,M}: c_{\ell\cdot} = \mathrm{E}(d_{\ell\cdot,t}) = 0$ for all $\ell$, where $d_{\ell\cdot,t}$ denotes the loss of model $\ell$ relative to the average loss across the models in $M$. The selection of the worse-performing model is determined by an elimination rule that is consistent with the test statistic

$$t_{\ell\cdot} = \frac{\bar{d}_{\ell\cdot}}{\sqrt{\widehat{\mathrm{Var}}\big(\bar{d}_{\ell\cdot}\big)}},$$

where $\bar{d}_{\ell\cdot}$ is the sample mean of $d_{\ell\cdot,t}$ and the variance estimate is obtained by bootstrap (Hansen et al. 2011).
To summarise, the MCS procedure to obtain a superior set of models consists of the following steps:
1) Set M equal to the full set of candidate models, and test the null hypothesis of EPA, H_{0,M}, at the chosen confidence level;
2) If the null hypothesis is accepted, then the final model confidence set is M* = M; otherwise, use the elimination rule defined in (5) to determine the worst-performing model;
3) Remove the worst-performing model from M and return to Step 2).
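A highly simplified sketch of this elimination loop is given below. Hansen et al. (2011) obtain the critical value by bootstrap; here a fixed t-statistic threshold and hypothetical per-period losses stand in for that machinery, so this is an illustration of the loop's structure rather than the full procedure:

```python
import math

# Toy MCS elimination loop with a fixed threshold in place of the
# bootstrap critical value. Losses are hypothetical.
losses = {
    "A": [0.10, 0.12, 0.11, 0.13, 0.10],
    "B": [0.11, 0.10, 0.12, 0.11, 0.12],
    "C": [0.30, 0.28, 0.31, 0.29, 0.30],  # clearly worse than A and B
}
THRESHOLD = 2.0  # stand-in for the bootstrap critical value

def t_stat(model, kept):
    """t statistic of a model's loss relative to the average over kept models."""
    n = len(losses[model])
    d = [losses[model][t] - sum(losses[m][t] for m in kept) / len(kept)
         for t in range(n)]
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n) if var > 0 else 0.0

kept = set(losses)
while len(kept) > 1:
    worst = max(kept, key=lambda m: t_stat(m, kept))
    if t_stat(worst, kept) <= THRESHOLD:  # EPA not rejected: stop
        break
    kept.remove(worst)  # eliminate the worst-performing model

print(sorted(kept))  # prints ['A', 'B']
```

The clearly inferior model C is eliminated, after which the EPA hypothesis for the remaining pair cannot be rejected and the loop stops.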
In the limiting cases of the MCS approach, a sufficiently high confidence level results in the inclusion of all models without weights, while a sufficiently low confidence level eliminates models until only the most accurate model remains.

Appendix B Selected models: technical details
We introduce the six models and describe the calculation of point and interval forecasts for ASFRs. In order to ensure non-negative predictions, the ASFRs are first transformed, prior to modelling, using the Box-Cox transformation, which is defined as

$$m_t(x_i) = \begin{cases} \dfrac{1}{\kappa}\left\{f_t(x_i)^{\kappa} - 1\right\}, & 0 < \kappa \le 1, \\[4pt] \ln f_t(x_i), & \kappa = 0, \end{cases}$$

where $f_t(x_i)$ represents the observed ASFR at age $x_i$ in year $t$, $m_t(x_i)$ represents the transformed ASFR, and $\kappa$ is the transformation parameter. Following Hyndman & Booth (2008) and Shang (2015), we use $\kappa = 0.4$, as it gave relatively small holdout-sample forecast errors on the untransformed scale.
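The transformation and its inverse (used to back-transform forecasts to the rate scale, which guarantees non-negative rates) can be sketched as follows, with a hypothetical rate:

```python
import math

# Box-Cox transformation with kappa = 0.4, applied to ASFRs before
# modelling, and its inverse for back-transforming forecasts.
KAPPA = 0.4

def boxcox(f, kappa=KAPPA):
    if kappa == 0:
        return math.log(f)
    return (f ** kappa - 1.0) / kappa

def inv_boxcox(m, kappa=KAPPA):
    if kappa == 0:
        return math.exp(m)
    return (kappa * m + 1.0) ** (1.0 / kappa)

rate = 0.085  # a hypothetical age-specific fertility rate
m = boxcox(rate)
assert abs(inv_boxcox(m) - rate) < 1e-12  # round trip recovers the rate
print(round(m, 4))
```

Any finite forecast on the transformed scale maps back to a positive rate through the power in the inverse, which is the point of transforming before modelling.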

B.1 Hyndman-Ullah (HU) model and two variants
The HU model (Hyndman & Ullah 2007) is described below, along with the defining features of its two variants, the robust HU model (HUrob) and the weighted HU model (HUw).
1) The transformed ASFRs are first smoothed using a concave regression spline (see Hyndman & Ullah 2007, Shang et al. 2016, for details). We assume an underlying continuous and smooth function $\{s_t(x);\, x \in [x_1, x_p]\}$ that is observed with error at discrete ages:

$$m_t(x_i) = s_t(x_i) + \sigma_t(x_i)\varepsilon_{t,i},$$

where $\sigma_t(x_i)$ allows the amount of noise to vary with $x_i$ in year $t$, and $\varepsilon_{t,i}$ is an independent and identically distributed (iid) standard normal random variable. (An example of the smoothed curves is shown in Figure 5.)

2) The mean function is estimated as the weighted average

$$\bar{s}(x) = \sum_{t=1}^{n} w_t\, s_t(x),$$

where $w_t = 1/n$ represents equal weighting in the HU and HUrob models. In the HUw model, $\{w_t = \lambda(1-\lambda)^{n-t};\, t = 1, \ldots, n\}$ represents a set of geometrically-decaying weights with $0 < \lambda < 1$ (see Hyndman & Shang 2009, on the selection of the optimal $\lambda$). The distinguishing feature of the HUw model is that forecasts are based more heavily on recent data.
3) Using functional principal components analysis, the set of continuous and smooth curves {s_t(x); t = 1, 2, . . . , n} is decomposed into orthogonal functional principal components and their uncorrelated scores:
\[
s_t(x) = \bar{s}(x) + \sum_{j=1}^{J} k_{t,j}\, b_j(x) + e_t(x),
\]
where {b_1(x), . . . , b_J(x)} represents a set of weighted (as above) functional principal components; {k_{t,1}, k_{t,2}, . . . , k_{t,J}} is a set of uncorrelated scores; e_t(x) is the estimated error function with mean zero; and J < n is the number of retained functional principal components. Following Hyndman & Booth (2008), we use J = 6 as this has been shown to be sufficiently large to produce iid residuals with a mean of zero and finite variance.
The HUrob model uses e_t(x) to provide information about possible outlying curves. A curve with a comparatively large integrated squared error may be considered an outlier, generated by a different data-generating process from the rest of the observations. An example of this is fertility rates during the baby boom, when sharp temporal increases occurred.
Computationally, we use the hybrid algorithm of Hyndman & Ullah (2007) with 95% efficiency (where the most outlying 5% of the data are removed).
4) By conditioning on the observed data I = {m_1(x_i), . . . , m_n(x_i)} and the set of estimated functional principal components B = {b_1(x), . . . , b_J(x)}, the h-year-ahead point forecast of m_{n+h}(x_i) can be expressed as
\[
\widehat{m}_{n+h|n}(x_i) = \mathrm{E}\left[m_{n+h}(x_i) \mid I, B\right] = \bar{s}(x_i) + \sum_{j=1}^{J} \widehat{k}_{n+h|n,j}\, b_j(x_i),
\]
where \widehat{k}_{n+h|n,j} denotes the h-year-ahead forecast of k_{n+h,j} using a univariate time-series model, such as an optimal autoregressive integrated moving average (ARIMA) model (see Section B.2). Note that multivariate time-series forecasting models, such as vector autoregressive and vector autoregressive moving average models, can also be used to forecast the principal component scores (see, e.g., Aue et al. 2015).
Because of the orthogonality of the functional principal components, the overall forecast variance can be approximated by the sum of four variances:
\[
\mathrm{Var}\left[m_{n+h}(x) \mid I, B\right] \approx \sigma^2_a(x) + \sum_{j=1}^{J} b_j^2(x)\, u_{n+h|n,j} + v(x) + \sigma^2_{n+h}(x),
\]
where σ²_a(x) is the variance of the mean function, estimated from the difference between the sample mean function and the smoothed mean function; b²_j(x) u_{n+h|n,j} is the variance contribution of the jth functional principal component, where u_{n+h|n,j} = Var(k_{n+h,j} | k_{1,j}, . . . , k_{n,j}) can be obtained from the univariate time-series models used for forecasting the scores; v(x) is the model error variance, estimated by averaging {e²_1(x), . . . , e²_n(x)} for each x; and σ²_{n+h}(x) is the smoothing error variance, estimated by averaging {σ²_1(x), . . . , σ²_n(x)} for each x. The prediction interval is constructed via (6) on the basis of normality.
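The steps above can be sketched in a few lines. This is an illustration only, under simplifying assumptions: the smoothing and weighting of Steps 1–2 are omitted, principal components are obtained by an ordinary (unweighted) singular value decomposition, and each score series is forecast with a random walk with drift rather than the optimal ARIMA model of the paper. The function name and inputs are hypothetical.

```python
import numpy as np

def hu_point_forecast(M, J=2, h=5):
    """Sketch of the HU point-forecast pipeline (smoothing omitted).

    M : (n, p) matrix of transformed ASFRs, rows = years, cols = ages.
    Decompose the centred curves by SVD, forecast each score series
    with a random walk with drift, and reassemble the h-year-ahead
    curve as mean + sum_j k_{n+h|n,j} * b_j(x).
    """
    n, p = M.shape
    mean = M.mean(axis=0)                        # estimated mean function
    U, s, Vt = np.linalg.svd(M - mean, full_matrices=False)
    B = Vt[:J]                                   # J principal components b_j(x)
    K = (M - mean) @ B.T                         # score series k_{t,j}, shape (n, J)
    drift = (K[-1] - K[0]) / (n - 1)             # RWD drift for each score
    K_fc = K[-1] + h * drift                     # h-step-ahead score forecasts
    return mean + K_fc @ B                       # forecast curve at horizon h
```

In the full model, each score series would instead be forecast by its own optimal ARIMA model, and the four variance components above would be accumulated to form prediction intervals.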

B.2 Univariate time-series models
The three selected univariate time-series models (Box et al. 2008) are applied to transformed ASFRs at each age.
Random-walk models For each age x_i, the random walk (RW) and random walk with drift (RWD) models are
\[
m_{t+1}(x_i) = m_t(x_i) + c + e_{t+1}(x_i),
\]
where c represents the drift term capturing a possible trend in the data (in the RW model, c = 0), and e_{t+1}(x_i) represents an iid normal error with a mean of zero. The h-year-ahead point forecast is
\[
\widehat{m}_{n+h|n}(x_i) = m_n(x_i) + h\,c,
\]
where h represents the forecast horizon.
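A minimal sketch of these two point forecasts, with the drift estimated as the average one-step change (so the RWD forecast is the last observation plus h times the drift, and the RW forecast is simply the last observation):

```python
import numpy as np

def rw_forecast(series, h, drift=True):
    """RW / RWD point forecast for one age's transformed ASFR series.

    drift=True  : random walk with drift, forecast = m_n + h * c,
                  with c estimated as the mean one-step change.
    drift=False : plain random walk, forecast = m_n.
    """
    series = np.asarray(series, dtype=float)
    c = (series[-1] - series[0]) / (len(series) - 1) if drift else 0.0
    return series[-1] + h * c
```

The same function would be applied independently at each age x_i.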
Optimal ARIMA model An ARIMA(p, d, q) model has autoregressive components of order p and moving-average components of order q, with d being the degree of differencing needed to achieve stationarity (Box et al. 2008). The model can be expressed as
\[
\Delta^d m_t(x_i) = c + \beta_1 \Delta^d m_{t-1}(x_i) + \cdots + \beta_p \Delta^d m_{t-p}(x_i) + \omega_t(x_i) + \psi_1 \omega_{t-1}(x_i) + \cdots + \psi_q \omega_{t-q}(x_i),
\]
where ∆^d m_t(x_i) represents the stationary time series after applying the difference operator of order d; c is the drift term; {β_1, . . . , β_p} represents the coefficients of the autoregressive components; {ψ_1, . . . , ψ_q} represents the coefficients of the moving-average components; and ω_t(x_i) is a sequence of iid random variables with mean zero and variance σ²_ω.
We use the auto.arima algorithm (Hyndman & Khandakar 2008) in the forecast package to select the optimal orders based on the corrected Akaike information criterion (AICc), and then estimate the parameters by maximum likelihood. The one-year-ahead point forecast is the conditional mean \widehat{m}_{n+1|n}(x_i) = E[m_{n+1}(x_i) | m_1(x_i), . . . , m_n(x_i)], and the interval forecast is constructed from the forecast variance Var[m_{n+1}(x_i) | m_1(x_i), . . . , m_n(x_i)] = σ²_ω(1 + ψ²_1 + · · · + ψ²_q).
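The order-selection idea can be illustrated without the full auto.arima machinery. The sketch below is a simplified analogue, not the algorithm itself: it fixes d = 1, restricts the search to pure AR(p) models fit by least squares on a common sample, and returns the order minimising AICc. The function name and the AR-only restriction are this illustration's own simplifications.

```python
import numpy as np

def select_ar_order(x, max_p=5):
    """Pick an AR order for the differenced series by minimum AICc.

    A simplified stand-in for auto.arima's order search: d is fixed
    at 1, only AR(p) candidates (p = 0..max_p) are considered, and
    all candidates are fit on the same sample so AICc is comparable.
    """
    x = np.diff(np.asarray(x, dtype=float))       # d = 1 for illustration
    best_p, best_aicc = 0, np.inf
    for p in range(max_p + 1):
        n = len(x) - max_p                        # common sample size
        y = x[max_p:]
        X = np.column_stack(
            [np.ones(n)] + [x[max_p - j:len(x) - j] for j in range(1, p + 1)]
        )
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        k = p + 2                                 # AR coefs + intercept + variance
        aicc = n * np.log(rss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)
        if aicc < best_aicc:
            best_p, best_aicc = p, aicc
    return best_p
```

auto.arima additionally searches over d and the moving-average order q, with the parameters then re-estimated by maximum likelihood.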
Appendix C Forecast accuracy evaluation

C.1 Evaluation of point forecast accuracy

Following Booth et al. (2006) and Shang et al. (2011), we use MAFE to measure point forecast accuracy. MAFE is the average of the absolute error, |actual − forecast|, across countries (in this study, g = 1, . . . , 17), years in the forecasting period and ages; it measures forecast precision regardless of sign:
\[
\mathrm{MAFE}(h) = \frac{1}{17 \times n_h \times 35} \sum_{g=1}^{17} \sum_{r=1}^{n_h} \sum_{i=1}^{35} \left| m_{r+h}(x_i) - \widehat{m}_{r+h|r}(x_i) \right|,
\]
where h denotes the forecast horizon, n_h denotes the number of years in the forecasting period at horizon h, m_{r+h}(x_i) represents the actual ASFR at age x_i in the forecasting period, and \widehat{m}_{r+h|r}(x_i) represents its forecast. Note that x_1 corresponds to age 15, and x_35 corresponds to age 49.
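Computationally, MAFE reduces to a single mean of absolute differences over all countries, holdout years and ages, as in this minimal sketch:

```python
import numpy as np

def mafe(actual, forecast):
    """Mean absolute forecast error.

    actual, forecast : arrays of matching shape, e.g.
    (countries, holdout_years, ages). Averages |actual - forecast|
    over every dimension, measuring precision regardless of sign.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs(actual - forecast)))
```

In the study, this is computed separately for each horizon h = 1, . . . , 20.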

C.2 Evaluation of interval forecast accuracy
To assess interval forecast accuracy, we use the interval score of Gneiting & Raftery (2007) (see also Gneiting & Katzfuss 2014). For each year in the forecasting period, one-year-ahead to 20-year-ahead prediction intervals were calculated at the 80% nominal coverage probability, with lower and upper bounds given by the predictive quantiles at 10% and 90%, denoted \widehat{m}_{r+h|r}(x_l) and \widehat{m}_{r+h|r}(x_u). As defined by Gneiting & Raftery (2007), the scoring rule for the interval forecast at age x_i is
\[
S_{\alpha}\left[\widehat{m}_{r+h|r}(x_l), \widehat{m}_{r+h|r}(x_u); m_{r+h}(x_i)\right] = \left[\widehat{m}_{r+h|r}(x_u) - \widehat{m}_{r+h|r}(x_l)\right] + \frac{2}{\alpha}\left[\widehat{m}_{r+h|r}(x_l) - m_{r+h}(x_i)\right] \mathbb{1}\left\{m_{r+h}(x_i) < \widehat{m}_{r+h|r}(x_l)\right\} + \frac{2}{\alpha}\left[m_{r+h}(x_i) - \widehat{m}_{r+h|r}(x_u)\right] \mathbb{1}\left\{m_{r+h}(x_i) > \widehat{m}_{r+h|r}(x_u)\right\},
\]
where 𝟙{·} denotes the indicator function, α (customarily 0.2) denotes the level of significance, and \widehat{m}_{r+h|r}(x_l) and \widehat{m}_{r+h|r}(x_u) represent the lower and upper bounds of the prediction interval at age x_i in the relevant forecasting period and country. The interval score rewards a narrow prediction interval but imposes a penalty, scaled by α, when the true observation lies outside it. The optimal (minimal) score is achieved when m_{r+h}(x_i) lies between \widehat{m}_{r+h|r}(x_l) and \widehat{m}_{r+h|r}(x_u), and the distance between \widehat{m}_{r+h|r}(x_l) and \widehat{m}_{r+h|r}(x_u) is minimal.
The mean interval score is obtained by averaging S_α over age. For multiple countries (here, 1 to 17) and years in the forecasting period (here, 1 to 20), the mean interval score is further averaged:
\[
\bar{S}_{\alpha}(h) = \frac{1}{17 \times n_h \times 35} \sum_{g=1}^{17} \sum_{r=1}^{n_h} \sum_{i=1}^{35} S_{\alpha}\left[\widehat{m}_{r+h|r}(x_l), \widehat{m}_{r+h|r}(x_u); m_{r+h}(x_i)\right],
\]
where n_h denotes the number of years in the forecasting period at horizon h.
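The interval score admits a direct vectorised implementation; the sketch below follows the standard Gneiting–Raftery definition, with the width term plus two one-sided penalty terms scaled by 2/α:

```python
import numpy as np

def interval_score(lower, upper, actual, alpha=0.2):
    """Interval score for a (1 - alpha)*100% prediction interval.

    Width of the interval plus penalties of (2/alpha) times the
    exceedance whenever the actual value falls below the lower or
    above the upper bound; averaged over all elements. Lower is better.
    """
    lower, upper, actual = map(np.asarray, (lower, upper, actual))
    width = upper - lower
    below = (2.0 / alpha) * (lower - actual) * (actual < lower)
    above = (2.0 / alpha) * (actual - upper) * (actual > upper)
    return float(np.mean(width + below + above))
```

Averaging this quantity over ages, countries and forecast years yields the mean interval score used to compare methods at each horizon.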