Forecasting Italian retail sales at the aggregate level and down to 430 product categories
employing hierarchical time series models and a time series clustering application
Abstract:
In this thesis the forecasting accuracy of ETS, Arima and autoregressive neural network models is evaluated in the empirical case of total Italian retail sales. Using a dataset of 430 time series for product categories, hierarchical forecasts are built with the optimal combination approach developed in Hyndman et al. (2011). The predictive accuracy of these models is then evaluated out of sample. Finally, a custom hierarchy based on time series clustering procedures is introduced with the aim of improving the predictive accuracy of the hierarchical models at the highest level of aggregation.
Introduction
This thesis is concerned with forecasting Italian retail sales, first at the aggregate level; in the second part a wider dataset with time series data for single product categories is used as the basis for hierarchical forecasting.
The first hypothesis tested is concerned with verifying whether the ETS forecast method is the best performing automated forecasting technique for the time series of total Italian retail sales when compared with Arima and artificial neural network autoregressive models. These forecasting techniques are considered since they are methods apt to model a trend and strong seasonal patterns, both of which characterise the series at hand.
Retail sales also present a very clear hierarchical structure: according to the ECR classification, single product categories can be aggregated into sectors, which in turn sum up to eight departments. In the second part of this work total retail sales are disaggregated into three lower levels delineating a hierarchical structure, which is used as the basis for a battery of empirical tests of the forecasting methods considered.
The second hypothesis of this work is therefore that the optimal combination hierarchical forecasting method is the best performing hierarchical forecasting method. The main issue with building forecasts for a hierarchy of time series is ensuring coherence across all levels, that is, ensuring that the sum of the forecasts for a given level equals the forecast for their aggregate. Traditional approaches start by forecasting either the top level (top-down) or the bottom one (bottom-up) and then use these results to build the other levels of the hierarchy. Hyndman et al. (2011) introduce a novel approach based on forecasting all series at all levels of the hierarchy separately and then using a regression model to combine the forecasts into a coherent result, which can be proved to be unbiased and to have minimum variance among all unbiased linear combinations.
The retail sales time series obtained from IRI data will provide the basis for a battery of empirical tests on the forecasting performance of different hierarchical models, aiming first to provide some additional empirical tests which are lacking in the literature due to the relatively young age of the "optimal combination approach" and of some of its variations. Particular focus will be devoted to a comparison of alternative weighting procedures.
The third hypothesis is then to confirm that the forecasting accuracy for total Italian retail sales can be improved by employing hierarchical forecasting instead of considering only the time series at the aggregate level.
The last hypothesis tested in this work is that the predictive accuracy of the hierarchical forecasts
can be further improved by changing the hierarchical aggregation structure from that described by
the ECR classification to a custom one obtained through time series clustering procedures.
The next section provides a literature review; the univariate econometric procedures and the data are discussed in section 2; section 3 contains the analysis of aggregate retail sales. Section 4 contains a discussion of hierarchical forecasting models, which are then evaluated in their empirical application in section 5; section 6 deals with restructuring the hierarchy through time series clustering, and section 7 concludes.
1. Literature review:
This work attempts to provide a forecasting framework which can aid industry practitioners in
dealing with organizational problems: Mentzer and Bienstock (1998) identify forecasts of sales as
essential inputs to many decisional challenges in fields such as marketing, sales, purchasing and
accounting. Alon et al. (2001) add that more accurate forecasts for aggregate retail sales can be useful for large retail chains, as changes in their sales are often systemic and more strongly driven by macroeconomic conditions.
Zhang (2009) argues that accurate demand forecasting plays a fundamental role in the profitability
of retail operations thanks to its importance in planning purchasing, production, transportation and
labour force. Moreover it seems important to note that improving the ability of retailing managers
to estimate future sales can lead to positive results in customer satisfaction, reduced loss of
products, increased sales revenue and more efficient production planning (Chen and Ou, 2011a,
2011b).
Kremer et al. (2016) study common biases in judgemental hierarchical forecasts for the retail
industry and provide some guidelines on how to better communicate forecasts to industry
practitioners.
On the topic of hierarchical models a substantial body of literature has been produced comparing the performance of bottom-up and top-down methods. Narasimhan et al. (1995) and Fliedner (1999) side in favour of the latter, pointing out that forecasts at the aggregate level are better with aggregate data, while Edwards & Orcutt (1969) argue that aggregation of time series data causes a non-negligible loss of information and therefore that bottom-up forecasts are preferable.
In general, empirical tests on the performance of these two methods tend to come out in favour of the bottom-up method; one earlier example is Kinney (1971). More recently, Marcellino et al. (2002) show that inflation and economic activity in the Euro area can be forecast more accurately by aggregating results from a different econometric model for each country than by considering directly the series at the aggregate level. Stock & Watson (2002) present similar results, identifying potential for the improvement of aggregate forecasts in the aggregation of country-specific effects.
If estimation uncertainty is an issue Giacomini & Granger (2004) state that aggregating forecasts
from a space-time AR model is weakly more efficient than the aggregate of the forecasts from a
VAR. Hendry et al. (2011) argue that forecasts for eurozone inflation can be improved by including
variables at the disaggregate level in the model for the aggregate level.
Hyndman et al. (2011) introduce the optimal combination approach for hierarchical time series, which is going to be employed extensively in this work. Building on this novel approach, Hyndman et al. (2015) provide some improved computational techniques useful for large datasets and some alternative weighting schemes.
Another empirical application of the optimal combination approach can be seen in Ben Taieb et al. (2017), where it is applied to a hierarchy of electricity demand derived from smart meter data; there a new weighting procedure is introduced along with a battery of empirical tests on its forecasting accuracy.
2. Forecasting models:
This section provides a brief description of the forecasting models which we are going to apply to the retail sales data.
Exponential smoothing:
The term exponential smoothing describes a family of forecasting methods in which the forecasts are built as a weighted combination of past observations, with more recent observations receiving a higher weight than older ones. The weight of the observations decreases exponentially as they get older, hence the name exponential smoothing.
The origin of this class of forecasting methods dates back to the 1950s, when they were first introduced by Robert G. Brown for the purpose of forecasting spare parts demand in the inventory system of the US Navy. In the same period, but independently, Charles Holt was also working on exponential smoothing for the US Office of Naval Research; his work appeared in an internal document (Holt, 1957) which was widely circulated and quoted but which was formally published only decades later.
Holt is credited with a wide body of work on additive and multiplicative exponential smoothing which became well known through a paper by his student Peter Winters (1960), which provided empirical tests of the methods.
John Muth (1960), who collaborated with Holt, introduced two statistical models whose forecasts are equal to those given by simple exponential smoothing. His models were the first of a long series of statistical models related to forecasting using exponential smoothing; many of these are state space models, including the ones introduced by Muth, for which the minimum mean squared error forecasts coincide with those of exponential smoothing.
Let $y_t$ denote the observation at time t and let $x_t$ be the state vector containing the unobserved components, namely the level, trend and seasonality of the series. Following the notation first delineated by Anderson and Moore (1979), the linear innovations form of state space models can be written as:

$$y_t = w' x_{t-1} + \varepsilon_t \qquad (1.1a)$$
$$x_t = F x_{t-1} + g \varepsilon_t \qquad (1.1b)$$

where $\varepsilon_t$ is a white noise series and F, g and w are coefficients. Equation (1.1a) is known as the measurement equation, as it models the relationship between the unobserved states $x_{t-1}$ and the observation $y_t$.
The second equation (1.1b) is known as the transition equation; it describes the evolution of the unobserved states over time.
This innovations formulation is the framework behind the ETS model of Hyndman et al. (2002) who, building on the work of Ord et al. (1997), introduce a class of state space models that underlies all of the exponential smoothing methods; moreover they propose a modelling framework which provides stochastic models allowing for likelihood calculations, the construction of prediction intervals and automated model selection through information criteria.
ETS models:
In the context of time series decomposition, four main components of a time series can be identified: trend, seasonality, cycle and the irregular (error) term. ETS models as described in Hyndman et al. (2008) will only consider three of them, choosing to omit the cyclical component.
As can be read in Hyndman et al. (2008), the idea behind the ETS model, namely Error, Trend, Seasonality, is that by combining these three components 30 state space models can be identified, representing the whole class of exponential smoothing. This allows for the construction of the very useful ets() function included in the R package "forecast", which includes automated model selection and is therefore extremely useful for automated batch forecasting.
Given that we have 430 time series for product sales available, the time cost of manually estimating parameters for each of them would be prohibitive; moreover, we could also be interested in updating the models as new observations get recorded, thus justifying the need for automated forecasts.
Exponential smoothing places strong emphasis on the trend component, which is itself broken down into the level term (l) and the growth term (b). Let $T_h$ denote the trend h periods ahead; the taxonomy considers five variations of the trend component:
• None: $T_h = l$
• Additive: $T_h = l + bh$
• Additive damped: $T_h = l + (\varphi + \varphi^2 + \ldots + \varphi^h)\,b$
• Multiplicative: $T_h = l\,b^h$
• Multiplicative damped: $T_h = l\,b^{(\varphi + \varphi^2 + \ldots + \varphi^h)}$
Besides the trend, the ETS framework must estimate a seasonal structure, which can be absent, additive or multiplicative. At last the error component must be considered; it can be either additive or multiplicative, completing the taxonomy of 30 models.
This taxonomy is based on the work of Gardner (1985), which was modified by Hyndman et al. (2002).
Most of the models resulting from variations of this taxonomy are already established and widely used exponential smoothing methods: for example, a model with no trend and no seasonality is the simple exponential smoothing method (SES), while Holt's linear method is described by an additive trend and no seasonality. A seasonal component combined with an additive trend identifies the Holt-Winters method, which can be additive or multiplicative depending on the seasonal component.
Two models with the same parameters for seasonality and trend but with different error terms, which can be either additive or multiplicative, will produce the same point forecasts but different prediction intervals.
Here we write the general state space model which underlies all 30 exponential smoothing models. Let $x_t = (l_t, b_t, s_t, s_{t-1}, \ldots, s_{t-m+1})'$ be the state vector; the state space equations are

$$y_t = w(x_{t-1}) + r(x_{t-1})\,\varepsilon_t$$
$$x_t = f(x_{t-1}) + g(x_{t-1})\,\varepsilon_t$$

where $\{\varepsilon_t\}$ is a Gaussian white noise process with variance $\sigma^2$ and $\mu_t = w(x_{t-1})$.
In the case of the model with additive errors $r(x_{t-1}) = 1$, so that $y_t = \mu_t + \varepsilon_t$, while a multiplicative error model has $r(x_{t-1}) = \mu_t$, therefore setting the value of $y_t$ equal to $\mu_t(1 + \varepsilon_t)$.
In their excellent book "Forecasting with Exponential Smoothing: The State Space Approach", Hyndman et al. (2008) provide complete formulas for each one of the 30 models.
ETS algorithm
In the R package "forecast" Hyndman provides the forecasting function ets(), which builds forecasts based on the exponential smoothing taxonomy previously described. Since we are later going to employ said function extensively, it appears relevant to delve deeper into its workings:
• All appropriate models are applied to the series and each model's parameters are optimized.
• The best model is then selected according to an information criterion (the AICc by default) and used to produce forecasts.
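As an illustration, a minimal sketch of this automated selection with the "forecast" package follows; the series name and values are placeholders, not the IRI data.

```r
# Minimal sketch (placeholder data): automated ETS model selection with the
# ets() function from the "forecast" package.
library(forecast)

# stand-in for one of the monthly retail series used in this work
sales <- ts(100 + 10 * sin(2 * pi * (1:69) / 12) + rnorm(69),
            start = c(2012, 1), frequency = 12)

fit <- ets(sales)            # fits the admissible models, selects by AICc
summary(fit)                 # reports the selected (E,T,S) specification
fc  <- forecast(fit, h = 12) # 12-month-ahead forecasts with intervals
```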
Hyndman et al. (2002) conducted extensive empirical tests on the forecasting performance of this technique, which was tested on the M-competition data (Makridakis et al., 1982) and on the IJF-M3 competition data (Makridakis & Hibon, 2000). These tests highlighted the performance of the model on short-term forecasting horizons and on seasonal series in particular.
The vector of initial values is defined using a simple heuristic scheme based on Hyndman et al. (2008):
• A trend is first estimated through a moving average computed on the first years of data and the data is detrended; the initial seasonal indices are then the result of a two-step process, initiated by taking the average of the detrended data in each seasonal period and concluded by normalizing the indices thus obtained so that they sum up to zero for additive seasonal methods or to m for multiplicative seasonality.
• In order to estimate the starting level there are two options: in the case of seasonal data a linear trend is computed on the first ten seasonally adjusted values available; when dealing with nonseasonal data it is sufficient to compute a linear trend on the first ten observations against a time vector from one to ten. In both cases the initial level is set equal to the intercept of the trend.
• Lastly, the initial growth is set to be the slope of the trend in the case of an additive trend, while the presence of a multiplicative trend requires the initial growth $b_0 = 1 + b/a$, where a is the intercept and b is the slope of the fitted trend.
After computing the initial states the estimation procedure can start. While state space models with multiple sources of error need to employ the Kalman filter, the innovations state space models involved in the ETS framework are single source of error models, so that the criterion

$$L^*(\theta, x_0) = n \log\!\left(\sum_{t=1}^{n} \varepsilon_t^2\right) + 2 \sum_{t=1}^{n} \log\lvert r(x_{t-1})\rvert$$

is equal, up to a constant, to twice the negative logarithm of the likelihood function conditional on the parameters $\theta$ and the initial states $x_0$.
Therefore the parameters and the improved initial states can be easily computed using the formulas for recursive calculation of the point forecasts specific to each model and minimizing $L^*$.
Of the 30 models which make an appearance in the ETS framework, I believe it is warranted to regard in further detail the ones dealing with Holt-Winters exponential smoothing, both in order to give an example of the structure of an exponential smoothing model and because it is the one most often selected when the method is applied to the retail sales time series.
Holt (1957) introduced the first examples of exponential smoothing models and Winters (1960) expanded on them and added the seasonal component, thus creating a very effective method for modelling series which display a strong seasonal pattern. The Holt-Winters seasonal method consists of four equations: the forecast equation and three smoothing equations, one for the level $l_t$, one for the trend $b_t$, and one for the seasonal component $s_t$, with corresponding smoothing parameters $\alpha$, $\beta^*$ and $\gamma$.
There are two versions of this method, which compute the seasonal component differently. The additive method performs better when the seasonal variations are roughly constant through the series, while the multiplicative method is preferred when the seasonal variations change proportionally to the level of the series. In the case of the additive method the seasonal component is subtracted in order to create the seasonally adjusted series. The multiplicative method reads the seasonal component as a percentage, and the seasonal adjustment is performed by dividing by the seasonal component.
The structural form for the additive method can therefore be written as

$$\hat{y}_{t+h|t} = l_t + h b_t + s_{t+h-m\,h_m}, \qquad h_m = \lfloor (h-1)/m \rfloor + 1$$
$$l_t = \alpha\,(y_t - s_{t-m}) + (1 - \alpha)(l_{t-1} + b_{t-1})$$
$$b_t = \beta^*\,(l_t - l_{t-1}) + (1 - \beta^*)\,b_{t-1}$$
$$s_t = \gamma\,(y_t - l_{t-1} - b_{t-1}) + (1 - \gamma)\,s_{t-m}$$

while in the multiplicative method the level equation, for example, becomes

$$l_t = \alpha \left( \frac{y_t}{s_{t-m}} \right) + (1 - \alpha)(l_{t-1} + b_{t-1})$$

with the other equations adjusted analogously.
In our case, due to the strong seasonality displayed by our series, the multiplicative method performs better; moreover, adding a damped trend instead of a linear one can further improve forecast accuracy.
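As an aside, this specific variant (multiplicative seasonality with a damped trend) can be fit directly with the hw() function of the "forecast" package; the series below is a placeholder, and the model choice simply mirrors the discussion above.

```r
# Hedged sketch (placeholder data): multiplicative Holt-Winters with a
# damped trend, the (M, Ad, M)-type specification favoured in this work.
library(forecast)

sales <- ts(100 * (1 + 0.3 * sin(2 * pi * (1:69) / 12)) + rnorm(69),
            start = c(2012, 1), frequency = 12)

fc <- hw(sales, seasonal = "multiplicative", damped = TRUE, h = 12)
plot(fc)   # forecasts with prediction intervals
```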
Neural network autoregressive models:
Lately one does not have to look far in order to see multiple success stories of the application of machine learning techniques to the most disparate tasks, from image processing to playing Go, from asset allocation to forecasting energy demand. As the price of computational resources declines and big data becomes more and more widely available, machine learning models seem set to spread further still.
Alon et al. (2001) present one of the few examples of an application of neural network forecasting models to retail sales; the authors draw a comparison of forecasting accuracy between neural networks and traditional models of time series analysis for the case of aggregate US retail sales in the period 1978-1995. The authors conclude that this model performs better than Box-Jenkins Arima models and add a note praising Holt-Winters exponential smoothing for its solid performance despite its simplicity.
The model we are going to fit to the retail sales available is the autoregressive neural network (AR-NN) model; the basic idea is to generalize the standard AR(p) model

$$y_t = \varphi_1 y_{t-1} + \ldots + \varphi_p y_{t-p} + a_t$$

by allowing a nonlinear functional form:

$$y_t = f(y_{t-1}, \ldots, y_{t-p}; w) + a_t$$
where $a_t$ is a noise process, w is the weight vector and $f(y_{t-1}, \ldots, y_{t-p}; w)$ is a feedforward neural network. The function we are using is the nnetar() function included in the R package forecast; it builds only single-hidden-layer perceptrons and works upon three basic parameters: the number of lags in the input layer, the number of seasonal lags and the number of neurons in the hidden layer. The resulting notation mirrors that of a seasonal Arima model except for the stationarity restrictions: an AR-NN(2,1,2)₁₂ would include as inputs $y_{t-1}, y_{t-2}, y_{t-12}$ and would include 2 neurons in the hidden layer.
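A short sketch of how such a model can be specified with nnetar() follows; the lag orders shown simply mirror the AR-NN(2,1,2)₁₂ example above, and the series is a placeholder.

```r
# Hedged sketch (placeholder data): an AR-NN(2,1,2)_12 specification with
# nnetar() from the "forecast" package; omitting p, P and size lets the
# function choose them automatically.
library(forecast)

sales <- ts(100 + 10 * sin(2 * pi * (1:69) / 12) + rnorm(69),
            start = c(2012, 1), frequency = 12)

fit <- nnetar(sales, p = 2, P = 1, size = 2)  # inputs y[t-1], y[t-2], y[t-12]
fc  <- forecast(fit, h = 12)                  # iterated point forecasts
```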
The data:
Monthly data were made available by IRI recording LCC ("Largo consumo confezionato", packaged consumer goods) retail sales for Supermarkets, Ipermarkets and "Libero Servizio" (LIS) stores, therefore covering most of the large-scale Italian retail distribution network. Ipermarkets are defined as retail stores with a sales area larger than 2,500 m², Supermarkets are the stores between 400 m² and 2,500 m², and lastly stores with an area between 100 m² and 400 m² are classified as LIS. The only part of the Italian retail distribution network missing from our dataset is the Discount class, which accounts for 17% of total sales and is defined as stores whose assortment does not generally include branded products.
We have four time series for retail sales in the Italian macro areas; these time series start in 2002 and end in September 2017 and can be aggregated to obtain total retail sales in Italy. Moreover, starting from the beginning of 2012 we can also make use of time series data for single product categories: we have 430 products available, which can be aggregated into 73 sectors, which in turn sum up to 8 departments.
Due to the differences in length we are going to employ two different sets of techniques for our analysis. Regarding the historical sales from 2002 to 2017 we are going to employ some classical time series modelling approaches, namely Arima, Holt-Winters exponential smoothing and ETS models. On the other hand, in the case of the multiple time series available from 2012 we are going to build and analyse hierarchical models, building heavily on Hyndman et al. (2011).
In figure one the time series for monthly aggregate Italian LCC sales can be observed. The strength of the seasonal component of the series can be easily noted: the yearly peak is the month of December, which registers an increase in sales of about 30% compared to the November average, a figure which can get as high as a 60% increase when measuring sales in the beverages department.
The LCC acronym stands for "Largo consumo confezionato": according to the classification structure adopted by most Italian retail chains, total sales are divided into three groups: LCC, "peso variabile" (variable weight) and Nonfood. The second category accounts for around 20% of total sales and describes the sales of products which can be bought in a variable quantity chosen by the customer; the third category covers only about 5% of the distribution and relates to products which are, as the name suggests, not food items.
LCC is therefore the most important subset of the sales of Italian retail distributors and consists mostly, but not entirely, of food products: of the 8 departments it is subdivided in, 2 are personal care and home cleaning products, which account respectively for 7% and 8% of LCC sales.
In terms of relative percentages the heaviest department is general groceries accounting for 36%
of total LCC sales, followed by Fresh which reaches 20% and Beverages with 15%; a lower volume
is covered by frozen products (5%), pet care supplies (2%) and fruits and vegetables with 6%. The
last number’s apparent low value can be better framed if one remembers that most of the sales of
fruits and vegetables do not fall in the LCC category as the customers can pick the quantities and
thus are not included in the dataset which was made available for this analysis.
The departments are subdivided into a total of 73 sectors which are unevenly distributed: general groceries alone is divided into 22 sectors, while pet care only has 3 and fruits and vegetables only two. Moreover, departments often contain products which display different characteristics: for example, beverage sales are divided into wine, beer and liquors as well as water, sparkling beverages and juices; another example of the in-department variability is general groceries, as it contains bread and pasta as well as sweets, condiments and pickled food, beside other categories.
As can be seen in figure one, retail sales display strong seasonal patterns which appear both at the aggregate level and at the lower levels; in fact much of the improvement in accuracy gained from the construction of a hierarchical model stems from modelling the seasonal components of the disaggregated series.
A better visualization of the seasonality at the aggregate level can be seen in figure two with the representation provided by a polar plot. We can see that the general structure of the seasonal patterns does not vary much with time: from 2002 onwards the month with the highest sales is always December, with June and September also recording high revenue.
Regarding the months of March and April we can see a high variability in sales; this is explained by the Easter effect: Easter is a powerful driver for retail sales and it can fall in either month, thus shifting sales from one to the other depending on the calendar.
In assessing the forecast accuracy of the various forecasting models on the empirical substrate of Italian retail sales, a rolling forecast origin time series cross validation approach is adopted.
This validation method is described in Hyndman and Athanasopoulos (2018, section 3.4). It starts by defining the minimum length (k) necessary for estimating the model parameters; each model is then trained on the period from the starting point of the series up to k, and forecasts up to 12 months ahead are produced, together with measures of the forecast accuracy.
Each model is then re-estimated using a rolling forecast origin, that is, stretching the estimation window by one additional month in each step. This procedure is repeated for every window from k up to the penultimate observation in our time series, producing T−k measures of the 1-step-ahead forecast error and up to T−k−11 measures of the 12-step-ahead forecast error.
The advantage of this approach is that it mimics the form of the inventory problem, where the forecaster re-estimates the models and updates the forecasts every time a new observation is recorded.
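A minimal sketch of this expanding-window procedure follows, under the assumption of a placeholder series and ETS base forecasts.

```r
# Hedged sketch (placeholder data): rolling-forecast-origin cross validation
# with an expanding window of minimum length k, collecting percentage errors
# per forecast horizon.
library(forecast)

sales <- ts(100 + 10 * sin(2 * pi * (1:189) / 12) + rnorm(189),
            start = c(2002, 1), frequency = 12)

k <- 84                                        # minimum estimation length (7 years)
n <- length(sales)
h <- 12
pe <- matrix(NA_real_, nrow = n - k, ncol = h) # percentage errors by horizon

for (i in k:(n - 1)) {
  train   <- window(sales, end = time(sales)[i])  # expanding estimation window
  n_ahead <- min(h, n - i)
  fc      <- forecast(ets(train), h = n_ahead)
  actual  <- sales[(i + 1):(i + n_ahead)]
  pe[i - k + 1, 1:n_ahead] <- 100 * (actual - as.numeric(fc$mean)) / actual
}

mape_by_h <- colMeans(abs(pe), na.rm = TRUE)   # average MAPE per horizon
```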
3. Forecasting model comparison:
Our time series for the aggregate Italian sales and for the four macro areas start in 2002 and include 15 years and 9 months of data. We set, somewhat arbitrarily, the minimum length necessary for estimating the models to seven years and we proceed to apply the procedure described in the previous paragraph. As the total length of each time series is 189 monthly observations, we will have for each series 105 measures of the one-step-ahead forecasting accuracy, 104 of the 2-steps-ahead, and so on.
The computations have all been carried out with the statistical software R (R Core Team 2018); specifically, the R distribution "Microsoft R Open" (2017) has been employed with the Intel MKL libraries for parallel computing.
For the construction of the forecasts the R package "forecast" by Hyndman and Khandakar (2008) has been used: specifically, the ets() and nnetar() functions working as described above have been employed, as well as the auto.arima() function, which is responsible for the automated estimation and forecasting of an Arima model; further details on its structure can be found in Hyndman and Khandakar (2008).
It is interesting to note that for every estimation period and for each one of the five time series studied, the ETS procedure always selects, out of its pool of 30 state space models, the Holt-Winters exponential smoothing model with multiplicative errors, additive damped trend and multiplicative seasonality (M, Ad, M). This observation is corroborated by the relevant literature, as this model is widely regarded as performing well in the context of modelling series with strong seasonal patterns.
The neural network model employed is an AR-NN(13,1,7)₁₂ model, meaning that we include 13 lagged observations as well as one seasonal lag of the series analysed, with seven neurons in the hidden layer.
Table 1: mean MAPE for out-of-sample forecasting for different models; in bold is the lowest forecast error for a given horizon.
In table one the results of the rolling forecast origin cross validation procedure can be seen; as a measure of out-of-sample forecasting accuracy the average MAPE at a given forecasting horizon is displayed, as well as the average across horizons.
There is not a model which is unequivocally better than the others but, as is often the case in forecasting, different models prevail at different horizons and for different series.
It seems that the Arima model displays a very strong performance when dealing with very short term forecasts, as it shows the lowest forecasting error measures at the aggregate level and for some of the macro areas.
On the other hand, for forecasting horizons equal to or greater than five the ETS model appears to be the best performing, with the NN-AR placing second; in fact, even if at the aggregate level the latter shows the lowest forecast error, in all macro areas the ETS model proves to be the best performer.
4. Hierarchical models:
As an alternative to the aggregate sales time series data analysed in the previous section, starting from 2012 a more detailed set of time series data is available at the product level. The dataset obtained from IRI consists of 430 series for sales of classes of products; these series can be allocated into 73 sectors and 8 departments according to the ECR classification. This data framework seems extremely suitable for the construction of a hierarchical model for forecasting sales both at the aggregate and at the disaggregate levels.
Retail sales display very strong and varied seasonal patterns: for example, sales of cold beverages peak in the summer months while beef or apple sales tend to flatline. The biggest advantage of employing a hierarchical model is that this allows us to take into account the different seasonal patterns of the disaggregated series.
In their paper Hyndman et al. (2011) introduce an innovative method of computing hierarchical forecasts, while previous approaches were either top-down or bottom-up methods or some combination of the two.
Top-down forecasting starts with the computation of a forecast for the highest aggregate level, which is then disaggregated onto the lower levels; the forecasts of the lower levels are constrained according to historical proportions in order to sum up to the one for the aggregate time series. Gross & Sohl (1990) describe several approaches to the computation of these proportions.
The opposite approach is the bottom-up method, where each series at the lowermost level is forecast separately and the upper levels are then obtained by summing the forecasts below them.
The innovation introduced by Hyndman et al. (2011) consists in a new approach to hierarchical forecast reconciliation. This method computes independently a forecast for each series in the hierarchy at each level and follows up by using a regression model to optimally combine and reconcile these forecasts. Under some relatively simple assumptions the forecasts produced by this method have been proven to be unbiased and to have minimum variance among all unbiased linear combinations.
It seems useful to introduce the notation for hierarchical modelling and forecasting adopted in Hyndman et al. (2011), as it will serve as a basis for a description of the statistical structure of the optimal combination approach.
Let our observations be recorded at times t = 1, …, n and let t = n+1, …, n+h be the period we are interested in forecasting. We call $Y_t$ the value of the aggregate at the highest level of the hierarchy; adding an index we have that $Y_t = \sum_i Y_{i,t}$, where $Y_{i,t}$ denotes the generic member of the first level. This property holds at all levels: for the second level $Y_{i,t} = \sum_j Y_{ij,t}$, for the third $Y_{ij,t} = \sum_k Y_{ijk,t}$, and so on.
We then call $m_i$ the number of series at a given level, such that in our case $m_1 = 8$ and $m_2 = 73$; m denotes the total number of series and K the number of levels in the hierarchy, so that $Y_{K,t}$ represents the collection of all series at the bottom level of the hierarchy. Defining $Y_{i,t}$ as the vector of all observations at a given level i and time t, $Y_t$ will represent the collection of all series in the hierarchy, at all levels.
With these definitions Hyndman et al. (2011) propose a modelling framework which can be used to give a common structure to all hierarchical forecasting models, be they bottom-up or top-down:

$$Y_t = S\,Y_{K,t}$$

with S playing the crucial role of the "summing matrix", which is responsible for defining the aggregation structure of the hierarchy up from the bottom level. The order of the summing matrix is $m \times m_K$, where the first row is a unit vector of length equal to the number of series in the bottom level.
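To make the notation concrete, consider a toy hierarchy (much smaller than the 430-series one used in this work) with two level-1 nodes A and B, where A has two children and B one, so that m = 6 and $m_K = 3$:

$$
\begin{pmatrix} Y_t \\ Y_{A,t} \\ Y_{B,t} \\ Y_{AA,t} \\ Y_{AB,t} \\ Y_{BA,t} \end{pmatrix}
=
\underbrace{\begin{pmatrix}
1 & 1 & 1 \\
1 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix}}_{S}
\begin{pmatrix} Y_{AA,t} \\ Y_{AB,t} \\ Y_{BA,t} \end{pmatrix}
$$

The first row sums all three bottom series into the total, the next two rows build the level-1 aggregates, and the identity block reproduces the bottom level.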
Now that we have described the notation for a hierarchy, we can describe the generalised framework, also introduced in Hyndman et al. (2011), which can describe every currently used method for hierarchical forecasting. Given a hierarchy of time series data with K levels and m series, we start by computing independent forecasts for each one of the m series for periods n+1, …, n+h, calling $\hat{Y}_{X,n}(h)$ the forecast h periods ahead for series X, where $\hat{Y}_n(h)$ will then denote the vector of all the base forecasts, built with the same structure as $Y_t$.
Then, given that all currently used hierarchical forecasting methods consist of some linear combination of the base forecasts, each one of them can be described by introducing a matrix P of order $m_K \times m$:

$$\tilde{Y}_n(h) = S\,P\,\hat{Y}_n(h)$$
In fact every hierarchical method must produce coherent forecasts, in the sense that the sum of the forecasts for the lower levels must add up to those of the higher levels. This process is implemented by the P matrix, which extracts the relevant elements from the base forecasts, which are then summed up by the summing matrix S.
This notation can be used to describe both bottom-up and top-down hierarchical forecasting methods. Using

$$P = \left[\, 0_{m_K \times (m - m_K)} \;\middle|\; I_{m_K} \,\right]$$

forecasts aggregated through the bottom-up method are obtained: the P matrix extracts only the bottom-level forecasts, as indicated by the $m_K \times m_K$ identity matrix, which are then summed up according to the hierarchical structure defined by the summing matrix. While with
$$P = \left[\, p \;\middle|\; 0_{m_K \times (m-1)} \,\right]$$
the top-down method is described, where p is a vector of proportions summing to one; note that different top-down forecasting methods can be obtained through different choices of p.
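For the toy hierarchy introduced above (m = 6, $m_K = 3$), the two matrices take the form

$$
P_{\text{bottom-up}} =
\begin{pmatrix}
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix},
\qquad
P_{\text{top-down}} =
\begin{pmatrix}
p_1 & 0 & 0 & 0 & 0 & 0 \\
p_2 & 0 & 0 & 0 & 0 & 0 \\
p_3 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$

with $p_1 + p_2 + p_3 = 1$: the former picks the three bottom-level base forecasts, while the latter distributes the top-level base forecast according to the proportions p.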
The first important result in Hyndman et al. (2011) stems directly from this set of definitions of hierarchical forecasting methods. The authors propose that, assuming the base forecasts are unbiased, that is $E[\hat{Y}_n(h)] = E[Y_n(h)]$, the revised forecasts are preferable if also unbiased, so that the following equality must be maintained: $E[\tilde{Y}_n(h)] = E[Y_n(h)] = S\,E[Y_{K,n}(h)]$. Let us define $\beta_n(h) = E[Y_{K,n+h} \mid Y_1, \ldots, Y_n]$ as the true mean of the future values of the bottom-level series. Then, since $E[\tilde{Y}_n(h)] = S\,P\,E[\hat{Y}_n(h)] = S\,P\,S\,\beta_n(h)$, the authors are able to state that the unbiasedness of the revised forecasts holds if and only if

$$S\,P\,S = S$$
Therefore, calling $\Sigma_h$ the covariance matrix of the base forecast errors, the authors state that the variance of the revised forecasts is

$$\operatorname{Var}[\tilde{Y}_n(h)] = S\,P\,\Sigma_h\,P'S'$$
Also, the set of base forecasts can be expressed as

$$\hat{Y}_n(h) = S\,\beta_n(h) + \varepsilon_h$$

where $\beta_n(h)$ is unknown and $\varepsilon_h$ has zero mean and covariance matrix $\operatorname{Var}(\varepsilon_h) = \Sigma_h$.
The intuition which forms the basis for the main result of the paper is that the previous equation can be treated as a regression model: assuming full knowledge of $\Sigma_h$, generalized least squares could be used to obtain the minimum variance unbiased estimate of $\beta_n(h)$.
As in most practical applications this is not the case, Hyndman et al. (2011) introduce one further assumption: that the forecast errors display the same aggregation structure as the original data, that is $\varepsilon_h \approx S\,\varepsilon_{K,h}$, where $\varepsilon_{K,h}$ contains the forecast errors of the bottom-level series. As a direct result of adopting this assumption the variance matrix can be written as $\Sigma_h = S\,\Omega_h\,S'$, with $\Omega_h = \operatorname{Var}(\varepsilon_{K,h})$.
Given $Y = S\beta_h + \varepsilon$ with $\operatorname{Var}(\varepsilon) = \Sigma_h = S\,\Omega_h\,S'$ and S a summing matrix, the generalized least squares estimate of $\beta_h$ is independent of $\Omega_h$:

$$\hat{\beta}_h = (S'S)^{-1} S'\,Y$$

with variance matrix $\operatorname{Var}(\hat{\beta}_h) = \Omega_h$; this is the minimum variance linear unbiased estimate.
This theorem shows that OLS can be employed rather than GLS in building the set of revised forecasts; therefore the set of revised forecasts for the optimal combination approach is

$$\tilde{Y}_n(h) = S\,(S'S)^{-1} S'\,\hat{Y}_n(h)$$
This result also shows that under the assumption $\Sigma_h = S\,\Omega_h\,S'$ the optimal combination of base forecasts depends uniquely on the aggregation structure, therefore allowing for the use of any forecasting method to produce the base forecasts.
Lastly, the authors point out that estimating the covariance matrix $\Sigma_h$ can be avoided unless one is interested in building prediction intervals, as $\operatorname{Var}(\tilde{Y}_n(h)) = \Sigma_h$.
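A minimal numerical sketch of this reconciliation, in base R and using the toy summing matrix from above with made-up base forecasts, is the following.

```r
# Hedged sketch (made-up numbers): OLS reconciliation of base forecasts,
# Y_tilde = S (S'S)^{-1} S' Y_hat, for the 6-series toy hierarchy.
S <- rbind(c(1, 1, 1),   # total
           c(1, 1, 0),   # level-1 node A
           c(0, 0, 1),   # level-1 node B
           diag(3))      # bottom level

yhat <- c(100, 68, 35, 40, 30, 34)   # incoherent base forecasts (68 + 35 != 100)

ytilde <- S %*% solve(t(S) %*% S, t(S) %*% yhat)

sum(ytilde[4:6]) - ytilde[1]   # ~0: the revised forecasts are coherent
```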
Having described the forecasting framework, the basis has been laid to begin the empirical application: in the next section this framework for hierarchical time series forecasting is applied to the Italian retail sales hierarchy.
5. Comparative analysis:
In the following section we will compare measures of forecast accuracy for the main hierarchical reconciliation methods, namely the classical top-down and bottom-up approaches and the optimal combination forecasts presented by Hyndman et al. (2011). We will repeat this analysis with base forecasts built with both Arima models and ETS models, in order to determine which model performs better when applied to the time series of Italian retail sales structured as a 4-level hierarchy. The aim is to identify the optimal formulation of hierarchical models when applied to retail sales in Italy and to provide further experimental data on the forecasting performance of the optimal combination approach.
The dataset we are going to use as a basis for the empirical tests was provided by IRI and consists of 430 time series for the sales of the 430 product categories, starting from January 2012 and ending in September 2017. These form the bottom level of the hierarchy, with two intermediate levels, departments and sectors, respectively made up of 8 and 73 series, summing up to total sales at level zero. The aggregation structure of this hierarchy, which also defines the summing matrix, is given by the ECR tree of product categories, which is adhered to by most operators in the Italian retail industry.
When comparing the forecasting performances of hierarchical models a rolling origin time series
cross validation approach is adopted, the general structure is the same as the one which was
described previously. We have 69 observations available for each time series, starting from
January 2012 and ending in September 2017. The minimum length required for estimation is
assumed to be 3 years, which is far shorter than would have been preferable but was forced by the relatively short length of the data available. In fact, given a monthly seasonal frequency, at the shortest estimation length each model has only three observations for each seasonal period; this hampers in particular the estimation of the seasonal Arima models, thus providing a possible reason for their relatively poor out-of-sample performance compared to the ETS model.
The rolling origin cross validation procedure will give as output, for each series in the hierarchy, 33 observations of the one-step-ahead forecasting accuracy and 21 observations of the 12-steps-ahead one.
As a measure of the forecasting performance the MAE is adopted, changing from the MAPE reported in the previous table; this choice is motivated by the problem of variable magnitude at any given level of the hierarchy: for example, at the departments level general grocery accounts for 37% of sales while pet care only has a weight of 2%. This problem is compounded when considering single product categories: sales of some seasonal products, of insecticides or of some innovative product categories like vegetarian cured meats are extremely hard to predict, as much of their sales levels depend on assortment and inventory decisions, on volatile seasonal factors or on other factors. Nonetheless these products account for a very low volume of total sales, and it seems misleading to give them the same importance in the assessment of model accuracy as that of the heaviest product categories.
Therefore the tables below are built according to the following procedure: for each origin point t (k < t < T) forecasts are computed, then the MAE for each series at each level is summed up to obtain a measure of the total MAE for each level of the hierarchy, for a single forecasting horizon at a single t.
Finally, an average is taken of the MAE measures for each forecasting horizon in order to obtain a final measure of the average forecasting error for a given level of the hierarchy at a given horizon. For example, we take the average of the 31 measures of the 3-months-ahead MAE as the measure of the forecast error for a given method at a given level.
Combination methods:
The tables below contain the measures of the out-of-sample forecast errors obtained through this procedure; table 2 and table 3 differ only in the computation of the base forecasts, which are ETS models in the former and Arima models in the latter.
The software used for statistical modelling is R, with the "forecast" package for the base forecasts and the "hts" package for hierarchical reconciliation. The computational time was of 4 hours and 40 minutes for both tables, computed on a laptop with an i5-7200 CPU, clock speed 2.5 GHz, and 8 Gb DDR4 RAM.
Table 2: average of total MAE for out-of-sample forecasting of alternative aggregation methods for hierarchical forecasting with the ETS forecast method; "comb" stands for the optimal combination approach.
Arima forecasts: forecast horizon (h)
1 2 3 4 5 6 7 8 9 10 11 12 Average
Top level: Italia
comb 66,1 69,6 67,8 69,7 73,0 71,1 79,0 76,5 82,4 85,7 86,2 93,2 76,7
bottom up 95,8 95,2 82,8 97,4 92,4 96,1 107,6 106,8 107,5 116,7 121,6 120,8 103,4
top down 49,9 51,1 63,6 65,6 71,6 67,6 73,8 70,9 68,4 69,8 71,9 81,6 67,2
Level 1: Departments
comb 113,4 130,3 139,6 144,2 151,3 151,6 156,3 148,7 149,2 149,1 152,0 155,2 145,1
bottom up 132,5 148,8 155,9 171,3 174,1 175,2 182,4 172,5 171,5 173,9 174,8 175,1 167,3
top down 127,4 142,2 137,9 145,0 150,8 137,4 152,0 143,2 150,7 154,8 156,0 165,7 146,9
Level 2: Sectors
comb 210,4 247,0 274,4 287,5 295,6 293,8 293,0 271,1 265,5 267,5 272,5 279,1 271,4
bottom up 213,5 249,5 271,7 287,6 291,8 293,5 294,4 281,8 272,7 272,3 273,4 271,1 272,8
top down 254,3 289,2 325,1 344,2 349,8 347,7 358,9 327,7 325,5 329,5 332,5 345,5 327,5
Level 3: Products
comb 317,0 362,0 398,1 421,0 423,8 425,5 434,0 417,2 414,7 416,8 422,7 425,9 406,6
bottom up 291,8 335,8 365,2 387,2 394,2 395,8 402,2 389,6 386,0 388,8 393,0 393,2 376,9
top down 351,2 405,5 447,9 470,4 471,8 469,6 486,5 462,4 458,6 462,8 469,0 472,8 452,4
Table 3: average of total MAE for out-of-sample forecasting of alternative aggregation methods for hierarchical forecasting with the Arima forecast method; "comb" stands for the optimal combination approach.
These results seem to confirm that the overall performance of the optimal combination hierarchical reconciliation method is better than that of the bottom-up and top-down approaches.
In table 2, where base forecasts are built with the ETS model, we can easily see that both the optimal combination and the bottom-up methods seem to univocally outperform the top-down approach. This observation is coherent with the relevant literature, as the bottom-up method is widely regarded as superior (Schwarzkopf et al., 1988) except when the bottom-level series are too noisy to be modelled reliably.
At the intermediate levels, both in the case of departments and sectors, the optimal combination method's performance is strictly superior to the bottom-up's, while at the bottom level the latter can claim to be the most accurate, even if it beats the former only by a small margin. At the top level, that of total sales, the situation is less clear, as neither of the two is univocally better than the other, but the average MAE is slightly lower for the optimal combination approach.
Table 3 presents slightly messier results: the bottom-up and top-down methods show the lowest MAE values respectively at the bottom and at the top level, while the combination method outperforms them both at the intermediate levels in terms of average MAE, though it is not univocally better for every forecasting horizon. However, it should be noted that the optimal combination approach is either the best performing or the second best for all 4 levels of the hierarchy.
Having confirmed that in general the optimal combination approach seems to be the best performing one, in the next section we are going to test some different weighting procedures.
Weighting methods:
The optimal combination forecast reconciliation method allows for the implementation of any set of weights which is independent of the observed time series (Hyndman et al. 2011). In this section the wls, MinT and nseries weighting procedures for hierarchical reconciliation are compared.
The nseries weights are the simplest ones, as they are based on the number of series present at each node of the hierarchy: they are equal to the inverse of the row sums of the summing matrix.
Hyndman, Lee and Wang (2016) introduce the use of weighted least squares (wls) to produce reconciled forecasts in the context of optimal combination reconciliation and develop an improved algorithm for the computation of the variance-covariance matrices which are used to build the weights.
Athanasopoulos et al. (2018) highlight that this method ignores the off-diagonal covariance elements and, building on the approach introduced in Hyndman et al. (2011), introduce the MinT weighting procedure, which uses a full estimate of the covariance matrix to build the weights. This novel approach is aimed at the minimization of the variances of the forecast errors, assuming unbiased base forecasts.
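A sketch of how these options are exposed by the "hts" package follows; the toy hierarchy and data are placeholders, and the configuration shown is illustrative rather than the exact code used for the tables.

```r
# Hedged sketch (placeholder data): optimal combination forecasts with
# different weighting schemes via the forecast method for hts objects.
library(hts)

bts <- ts(matrix(rnorm(69 * 3, mean = 100, sd = 5), ncol = 3),
          start = c(2012, 1), frequency = 12)
toy <- hts(bts, nodes = list(2, c(2, 1)))   # 2 level-1 nodes with (2, 1) children

fc_ns   <- forecast(toy, h = 12, method = "comb", weights = "nseries",
                    fmethod = "ets")
fc_wls  <- forecast(toy, h = 12, method = "comb", weights = "wls",
                    fmethod = "ets")
fc_mint <- forecast(toy, h = 12, method = "comb", weights = "mint",
                    covariance = "shr", fmethod = "ets")
```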
Table four below displays the forecasting performance of these three weighting procedures; the evaluation is carried out implementing the same procedure of rolling forecast origin cross validation which was applied to the different combination methods in the previous section, with the computations carried out in the same software environment.
Table 4: average of total MAE for out-of-sample forecasting employing different weighting methods, the
base forecasts are built with ETS model and optimal combination reconciliation
The most important result is that in this empirical test the nseries weights outperform the other two methods at the higher levels of aggregation: this procedure is univocally better for the two highest levels of the hierarchy and is essentially tied with the wls weights at the sector level, while the wls weights prevail at the bottom level.
The most likely reason behind the high accuracy displayed by the nseries weights has to do with the structure of the hierarchy, as both at the departments and at the sectors levels the ECR classification is very unbalanced.
For example, the whole department of fruits and vegetables, despite being composed of 57 product categories, is divided into only two sectors: one for fruits and one for vegetables. Also, pet care supplies, which account for only around 2% of total sales, are nonetheless counted as one of the 8 departments and are divided into 3 of the 73 sectors, one of which contains only a single product category.
On the other hand, the department of general groceries accounts for 36% of total sales and contains a staggering 22 of the 73 sectors as well as 128 of the 430 product categories.
Therefore the nseries weights seem able to outperform the other methods because they better reflect the structure of our hierarchy by taking into account the different number of series at each node.
6. Restructuring the hierarchy:
The hypothesis underlying this section is that the hierarchical structure we have been using up until now may be suboptimal and that changing it could improve the forecasting performance at the highest level of aggregation.
The ECR classification is produced by GS1, a non-profit organization that develops and maintains global standards (the best known of which is the barcode), aiming at improving the efficiency of supply chains and providing the industry with a common nomenclature. Providing forecasts for the middle groups, namely departments and sectors, is a worthwhile effort as they are universally adopted by the industry; however, if one is interested mainly in forecasts for sales at the aggregate level, then the aggregation structure of the hierarchy could be revisited.
In fact the hierarchy we have been using so far is mainly built to address concerns about supply chain management and depends principally on assortment allocation; thus it may be suboptimal in a context of pure time series analysis, giving us reason to consider revisiting it. The observations in the previous section point to a very imbalanced hierarchical structure, which could be a source of inefficiency; in what follows an alternative structure is therefore introduced.
Time series clustering:
Time series clustering methods are employed to build an alternative hierarchy, which is then tested in terms of forecasting accuracy at the aggregate level against the original hierarchy.
The alternative hierarchy is obtained through partitional time series clustering employing the dynamic time warping (DTW) distance and DBA centroids.
Partitional clustering is a clustering strategy which assigns each observation univocally to one of k clusters, where k is given at the beginning of the procedure. In this case 40 clusters are employed, a number chosen by observing the scree plot of the sum of squared within-cluster distances at different values of k, the silhouette curve and the CH curve.
This set of partitional clusters is built using the tsclust() function included in the "dtwclust" R package. The clustering algorithm starts by initialising k centroids, created by randomly choosing k series from our database and equating them to the centroids of k individual clusters; then the following procedure is implemented either a set number of times or until no change in cluster membership occurs (a code sketch follows the list below):
• The distance of each series from each centroid is calculated and each series is assigned to the cluster with the nearest centroid.
• The centroids of the clusters which have been changed in the previous step are updated.
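A hedged sketch of this step with tsclust() follows; the series list, the number of clusters and the window size are placeholders rather than the actual configuration used in this work.

```r
# Hedged sketch (placeholder data): partitional clustering with DTW distance
# (lower-bound accelerated) and DBA centroids via the "dtwclust" package.
library(dtwclust)

set.seed(1)
series_list <- lapply(1:20, function(i) zscore(cumsum(rnorm(69))))

clus <- tsclust(series_list, type = "partitional", k = 5L,
                distance = "dtw_lb", centroid = "dba",
                args = tsclust_args(dist = list(window.size = 12L),
                                    cent = list(window.size = 12L)),
                seed = 1)
table(clus@cluster)   # cluster memberships, the basis for the new hierarchy
```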
The main issue of time series clustering is the definition of the distance function: as each observation in our dataset consists of a single time series, the distance function must provide a measure of dissimilarity between whole series.
The simplest approach would be to use the Euclidean distance, calculated by taking the average of the Euclidean distances between each pair of points in the two time series; as shown in Liao (2005), this metric can be improved by implementing some normalization procedure before the clustering.
As the results of the application of this measure appeared unsatisfactory, a more complex approach was deemed necessary; thus the Dynamic Time Warping (DTW) distance was implemented. Giorgino (2009) provides a good review of the algorithm, which is briefly summarised below.
This measure, which is the namesake of the dtwclust package, in general terms works by first stretching or compressing a pair of time series in order to make them as similar as possible and then computing the sum of the pairwise distances between the aligned time series points.
The computation of the DTW distance starts by building an $m \times n$ matrix, called the Local Cost Matrix (LCM), where m and n are the lengths of the two series x and y being compared. Each $(i, j)$ entry of the LCM consists of the $l_p$ norm between $x_i$ and $y_j$; the computation can therefore get expensive for large datasets. Fortunately, in our case we only have 430 series of 69 observations each, which, despite not being a trivial number, does not pose insurmountable computational issues.
Once the local cost matrix is established, the DTW algorithm must find the optimal path through it, starting from lcm(1,1) and ending at lcm(n,m); this process substantially amounts to optimally aligning the two series before calculating the pointwise distance between them. The path is calculated step by step, each time minimizing the cost increase under a set of constraints which are arbitrarily chosen. Among these the most important is the step pattern, which determines which moves are available from each point in the LCM matrix as well as the cost associated with each direction; in our case we are employing the "symmetric2" pattern, which places a higher cost on diagonal moves compared to orthogonal ones.
Given $\phi = \{(1,1), \ldots, (n,m)\}$, the set of points which constitute this optimal path, the DTW distance is defined as

$$DTW_p(x, y) = \left( \sum_{k \in \phi} \frac{m_\phi \, lcm(k)^p}{M_\phi} \right)^{1/p}$$

where $m_\phi$ is a per-step weighting coefficient and $M_\phi$ is the normalization constant, both determined by the chosen step pattern.
This is the algorithm employed in the dtwclust package for the standard DTW distance, but in order to improve computational times we introduce global constraints, which restrict the region of the LCM that the optimal path can traverse.
Figure 3: from Espinosa 2018, structure of a lower bound constraint in the lcm matrix for the calculation of DTW distance
A lower bound can be imposed on the LCM matrix, which is then restricted as seen in figure three above; this both restricts the possible optimal paths and reduces the computational load, as it permits us to skip the calculations for a good part of the LCM matrix which normally does not contain the optimal path.
In our clustering model the "improved lower bound" (Lemire, 2009) is used, as implemented in the dtwclust package.
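As a hedged illustration of the lower bound at work, lb_improved() in dtwclust computes Lemire's bound for a pair of equal-length series, which can then be compared with the full DTW distance; the window size here is an arbitrary choice.

```r
# Hedged sketch: the improved lower bound is cheap to compute and never
# exceeds the windowed DTW distance, which is what allows skipping full
# DTW computations during clustering.
library(dtwclust)

x <- zscore(sin(seq(0, 4 * pi, length.out = 69)))
y <- zscore(cos(seq(0, 4 * pi, length.out = 69)))

lb <- lb_improved(x, y, window.size = 12)   # Lemire's improved lower bound
d  <- dtw_basic(x, y, window.size = 12)     # full (windowed) DTW distance
c(lower_bound = lb, dtw = d)
```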
DBA centroids:
The partitional clustering procedure we are applying requires both the definition of a distance function, which has been given in the previous section, and the definition of the procedure for estimating the centroid of each cluster. In order to solve this second problem the DTW barycenter averaging (DBA) algorithm is employed, which can deal efficiently with the estimation of centroids under the DTW distance.
This function is particularly useful as the centroid structure is independent of the order in which the series enter the calculation, meaning that the computational requirements for updating the centroid in case of changes in the clusters are greatly reduced.
The DBA algorithm starts by selecting a random series in the cluster and then proceeds by
computing iteratively the DTW alignment of the centroid with each series in the cluster in a
procedure that ends with a centroid that is as closely aligned as possible to all the members of the
cluster.
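A short sketch of the centroid computation with dtwclust::DBA() follows, again on placeholder series.

```r
# Hedged sketch (placeholder data): a DBA centroid iteratively refined under
# DTW alignment; the result has the same length as the cluster members.
library(dtwclust)

set.seed(2)
cluster_members <- lapply(1:10, function(i) zscore(cumsum(rnorm(69))))

centroid <- DBA(cluster_members, window.size = 12, max.iter = 20)
length(centroid)   # 69, same length as the members
```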
Clustering Structure:
Having defined the structure of the time series clustering algorithm employed, it is relevant to note that the structure of the clusters thus obtained is very different from that of the original hierarchy, as the new hierarchical structure seems to be based mainly on quantitative features: the level of the series, their seasonality and their trend. In the figures below the time series for the sales of the product categories belonging to two of the 40 clusters we are using can be seen.
Figure 4: product categories belonging to cluster 11
Cluster 11 seems to contain products with very strong seasonal sales; note that, thanks to the use of the Dynamic Time Warping distance, series whose seasonal peaks happen in different months are put together in the same cluster through the time series alignment procedure, which is a central part of the DTW algorithm.
Cluster 30, on the other hand, groups together product categories with consistent sales, fluctuating around 20 million euros per month and displaying a strong positive trend.
Empirical application:
The aim of this section is twofold: measures of the forecast errors at the highest level of aggregation for hierarchical models built according to the ECR hierarchical structure, and those of the same models employing the clustering-derived classification, are going to be compared not only with each other but also against forecasts for the univariate time series of total sales computed with the three methods described earlier.
All the models considered display a negative mean error at all 12 forecast horizons. This seems preferable to a positive one, as a negative mean error implies an overestimation of future sales, which can result in unsold stock, while a positive mean error can more easily be interpreted as lost sales.
we produced a series of simulations of two year ahead forecasts starting from September 2016,
figure five, which appears below, displays the residuals for the forecasts built with the Arima
Figure 5: residuals for two years ahead forecast for the highest level of the clustering based hierarchical model with Arima forecasts
Table 5 below shows the results of the same procedure of rolling origin cross validation which has
been employed in the previous sections which is going to produce the final set of results of this
body of work. The sample for the computation of the out of sample forecasts is the same as in the
previous example, that is, between January 2015 and September 2017.
While time series data at the total sales level are available since 2002, in order to compare the accuracy of the univariate forecast methods with that of the hierarchical forecasts, for which data are available only from 2012, the length of the estimation period for the univariate analysis has been restricted accordingly.
The computational time was of 15 minutes for execution of the clustering algorithm and of 2 hours
for the computation of the time series rolling forecast origin cross validation procedure related to
the custom hierarchy. The program was executed on Microsoft R Open 3.4.3 using Intel MKL for
parallel computing on a laptop with i5-7200 CPU, clock speed 2.5 GHz, and 8 Gb DDR4 RAM.
Table 5: out-of-sample MAPE for forecasts of total Italian retail sales in the period 2015-2017 for the different models and hierarchies considered.
These results are coherent with the relevant literature, as they point out that the forecasts for the highest level series obtained through hierarchical models are more accurate than the forecasts built on the univariate aggregate series alone.
It is also noteworthy that the forecasting error for the hierarchical models structured according to the custom hierarchy is univocally lower than that of the hierarchies built along the ECR classification. The accuracy gain is more pronounced when using Arima models compared to ETS models; this observation could be motivated by the already lower forecast error of the ETS models. If we assume that our target series have a stochastic component which cannot be accurately modelled, we could also argue that the ETS model is already close to capturing the deterministic component of total sales, thus limiting the potential gains from restructuring the hierarchy.
7. Conclusion:
In this last section the results of the empirical tests conducted in this work are summarised. Regarding the first hypothesis, about the best univariate time series forecasting model for total retail sales of packaged products at the Italian level, the results obtained are not conclusive. The Neural Network Autoregressive model shows the best performance in the case of total sales at the Italian level with the Arima as a close second; however, we also tested the forecasting accuracy for the sales of the 4 macro areas which divide the Italian territory, and here the former model fares worse, showing the worst performance, while the ETS model excels and displays the lowest forecast errors.
Regarding the second hypothesis, the simulations of out-of-sample forecasting built for hierarchical models on Italian retail sales produced results which are easier to assess: the outcome of the first battery of tests on the accuracy of different hierarchical reconciliation methods aligns with the results of Hyndman et al. (2011), showing that the optimal combination hierarchical approach outperforms both the bottom-up and the top-down methods when applied to Italian retail sales. Moreover, it appears that the ETS forecasting model consistently outperforms the Arima models in producing forecasts for the single retail sales series; this is probably caused by the strong seasonality of these series. The comparison between the weighting schemes of the optimal combination approach also produced interpretable results: in the out-of-sample forecast simulations conducted on the available data, the nseries weights are preferable if the analyst is interested in accurately modelling the top-level series, while the wls weights are more accurate at the bottom and department levels.
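For reference, the sketch below shows how these two weighting schemes are selected in the hts package; sales_hts is a hypothetical hts object built from the category-level series, and the horizon and level indices are illustrative assumptions.

library(hts)

# 'sales_hts' is a hypothetical hts object built from the category series
fc_nseries <- forecast(sales_hts, h = 12, method = "comb",
                       weights = "nseries", fmethod = "ets")
fc_wls     <- forecast(sales_hts, h = 12, method = "comb",
                       weights = "wls", fmethod = "ets")

# aggts() extracts the reconciled forecasts at each aggregation level;
# level 0 is the top (total sales) series
top_nseries <- aggts(fc_nseries, levels = 0)
bottom_wls  <- aggts(fc_wls, levels = 3)  # bottom level of a four-level hierarchy (assumption)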
Regarding the third hypothesis, some interesting results have also been produced in section 6, which concludes that hierarchical models show better performance than the univariate ones in forecasting the top-level series. Moreover, in section 6 some evidence is provided pointing to potential reductions in forecast error achievable by changing the hierarchical structure from the standard ECR one to a custom hierarchy built through time series clustering; a sketch of this procedure appears below. It should be noted, however, that this approach is only viable if one is not interested in the intermediate levels of the hierarchy but only in total sales, that is, in the series at the highest level of aggregation.
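A minimal sketch of how such a custom hierarchy could be assembled with the dtwclust and hts packages is shown below; the matrix name sales_matrix, the choice of 8 clusters (mirroring the number of ECR departments) and the DTW distance settings are illustrative assumptions rather than the tuned values of section 6.

library(dtwclust)
library(hts)

# 'sales_matrix' is a hypothetical ts matrix with one bottom-level
# (product category) series per column
series_list <- lapply(seq_len(ncol(sales_matrix)),
                      function(j) sales_matrix[, j])
cl <- tsclust(series_list, type = "hierarchical", k = 8,
              distance = "dtw_basic")

# reorder the bottom-level series by cluster membership and encode the
# grouping in the 'nodes' argument of hts(), so that the clusters replace
# the ECR departments as the intermediate level of the hierarchy
ord   <- order(cl@cluster)
nodes <- list(8, as.numeric(table(cl@cluster)))
custom_hts <- hts(sales_matrix[, ord], nodes = nodes)

The resulting object can then be passed to forecast() exactly as in the previous sketch, so that only the aggregation structure changes while the reconciliation step remains the same.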
Summing up, the main conclusion of this work is that a hierarchical model with independent ETS forecasts, reconciled with the optimal combination method using nseries or wls weights, seems the optimal approach to forecasting Italian retail sales of packaged goods, provided a good informational base is available in terms of time series of sales for single product categories.
Bibliography:
Aghabozorgi, S., Seyed Shirkhorshidi, A., & Ying Wah, T. (2015). Time-series clustering - A decade
review. Information Systems, 53, 16–38. https://doi.org/10.1016/j.is.2015.04.007
Alon, I., Qi, M., & Sadowski, R. J. (2001). Forecasting aggregate retail sales: A comparison of artificial neural networks and traditional methods. Journal of Retailing and Consumer Services, 8(3), 147–156. https://doi.org/10.1016/S0969-6989(00)00011-4
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J. M., & Perona, I. (2013). An extensive
comparative study of cluster validity indices. Pattern Recognition, 46(1), 243–256.
https://doi.org/10.1016/j.patcog.2012.07.021
Arlot, S., & Celisse, A. (2009). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79. https://doi.org/10.1214/09-SS054
Athanasopoulos, G., Ahmed, R. A., & Hyndman, R. J. (2009). Hierarchical forecasts for Australian domestic tourism. International Journal of Forecasting, 25(1), 146–166. https://doi.org/10.1016/j.ijforecast.2008.07.004
Ben Taieb, S., Taylor, J. W., & Hyndman, R. J. (2017). Hierarchical probabilistic forecasting of electricity demand with smart meter data. Working paper, 1–30. Retrieved from http://souhaib-bentaieb.com/pdf/jasa_probhts.pdf
Ben Taieb, S., Taylor, J. W., & Hyndman, R. J. (2017). Coherent probabilistic forecasts for hierarchical time series. Proceedings of the 34th International Conference on Machine Learning, 70, 3348–3357. Retrieved from http://proceedings.mlr.press/v70/taieb17a.html
Bergmeir, C., Hyndman, R. J., & Koehler, A. B. (2015). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis (in press). https://doi.org/10.1016/j.csda.2017.11.003
Bienstock, C. C., Mentzer, J. T., & Bird, M. M. (1997). Measuring physical distribution service quality. Journal of the Academy of Marketing Science, 25(1), 31.
Chen, F., & Ou, T. (2011). Constructing a sales forecasting model by integrating GRA and ELM: A case study for retail industry. International Journal of Electronic Business Management, 9(2), 107–121.
Crone, S. F., Hibon, M., & Nikolopoulos, K. (2011). Advances in forecasting with neural networks?
Empirical evidence from the NN3 competition on time series prediction. International Journal
of Forecasting, 27(3), 635–660. https://doi.org/10.1016/j.ijforecast.2011.04.001
de Livera, A. M., Hyndman, R. J., & Snyder, R. D. (2011). Forecasting time series with complex
seasonal patterns using exponential smoothing. Journal of the American Statistical
Association, 106(496), 1513–1527. https://doi.org/10.1198/jasa.2011.tm09771
Fan, S., & Hyndman, R. J. (2010). Short-term load forecasting based on a semi-parametric additive model. Working paper (August).
Fliedner, G. (1999). An investigation of aggregate variable time series forecast strategies with
specific subaggregate time series statistical correlation. Computers & Operations Research,
26(10-11), 1133-1149.
Giorgino, T. (2009). Computing and visualizing dynamic time warping alignments in R: The dtw package. Journal of Statistical Software, 31(7), 1–24. https://doi.org/10.18637/jss.v031.i07
Hendry, D. F., & Hubrich, K. (2011). Combining disaggregate forecasts or combining disaggregate
information to forecast an aggregate. Journal of Business and Economic Statistics, 29(2), 216–
227. https://doi.org/10.1198/jbes.2009.07112
Hyndman, R. J., Koehler, A. B., Snyder, R. D., & Grose, S. (2002). A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting, 18(3), 439–454. https://doi.org/10.1016/S0169-2070(01)00110-8
Hyndman, R. J., Koehler, A. B., Ord, J. K., & Snyder, R. D. (2008). Forecasting with exponential smoothing: The state space approach. Berlin: Springer.
Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 27(3), 1–22. https://doi.org/10.18637/jss.v027.i03
Hyndman, R. J., & Lee, A. J. (2014). Fast computation of reconciled forecasts for hierarchical and grouped time series. Computational Statistics & Data Analysis, 97, 16–32. Retrieved from http://robjhyndman.com/papers/hgts6.pdf
Hyndman, R. J., Ahmed, R. A., Athanasopoulos, G., & Shang, H. L. (2011). Optimal combination
forecasts for hierarchical time series. Computational Statistics & Data Analysis, 55(9), 2579-
2589.
Kang, Y., Hyndman, R. J., & Smith-Miles, K. (2016). Visualising forecasting algorithm performance using time series instance spaces. https://doi.org/10.1016/j.ijforecast.2016.09.004
Kinney, W. R. (1971). Predicting earnings: entity versus subentity data. Journal of Accounting Research, 9(1), 127–136.
Kotsialos, A., Papageorgiou, M., & Poulimenos, A. (2005). Long-term sales forecasting using holt-
winters and neural network methods. Journal of Forecasting, 24(5), 353–368.
https://doi.org/10.1002/for.943
Kremer, M., Siemsen, E., & Thomas, D. J. (2015). The sum and its parts: Judgmental hierarchical forecasting. Management Science, advance online publication. https://doi.org/10.1287/mnsc.2015.2259
Marcellino, M., Stock, J. H., & Watson, M. W. (2003). Macroeconomic forecasting in the Euro area:
Country specific versus area-wide information. European Economic Review, 47(1), 1–18.
https://doi.org/10.1016/S0014-2921(02)00206-4
Narasimhan, S. L., McLeavey, D. W., & Billington, P. J. (1995). Production planning and inventory
control (pp. 1-165). Englewood Cliffs: Prentice Hall.
Orcutt, G. H., Watts, H. W., & Edwards, J. B. (1968). Data aggregation and information loss. The American Economic Review, 58(4), 773–787.
Peña, D., Tiao, G. C., & Tsay, R. S. (2001). A course in time series analysis. New York: Wiley.
Podobnik, B., & Stanley, H. E. (2008). Detrended cross-correlation analysis: A new method for analyzing two non-stationary time series. Physical Review Letters, 100(8), 084102. https://doi.org/10.1103/PhysRevLett.100.084102
Rodrigues, P. P., Gama, J., & Pedroso, J. P. (2008). Hierarchical clustering of time-series data
streams. IEEE Transactions on Knowledge and Data Engineering, 20(5), 615–627.
https://doi.org/10.1109/TKDE.2007.190727
Smith, K. A., & Gupta, J. N. (2000). Neural networks in business: techniques and applications for
the operations researcher. Computers & Operations Research, 27(11-12), 1023-1044.
Stock, J. H., & Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460), 1167–1179.
Wang, Y., & Powell, W. (2017). MOLTE: A modular optimal learning testing environment. Working paper.
Warren Liao, T. (2005). Clustering of time series data - A survey. Pattern Recognition, 38(11),
1857–1874. https://doi.org/10.1016/j.patcog.2005.01.025
Zhang, G., Patuwo, B. E., & Hu, M. Y. (1998). Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting, 14(1), 35–62. https://doi.org/10.1016/S0169-2070(97)00044-7
Zhang, X. (2009). Retailers' multichannel and price advertising strategies. Marketing Science, 28(6), 1080–1094.
R packages:
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Microsoft R Open 3.4.3 The enhanced R distribution from Microsoft Using the Intel MKL for
parallel mathematical computing, 2017 Microsoft Corporation. URL
https://mran.microsoft.com/.
Hyndman RJ (2017). _forecast: Forecasting functions for time series and linear models_. R package version 8.2. URL: http://pkg.robjhyndman.com/forecast.
Hyndman and Khandakar (2008). "Automatic time series forecasting: the forecast package for R." _Journal of Statistical Software_, *27*(3), pp. 1-22. URL: http://www.jstatsoft.org/article/view/v027i03.
Rob Hyndman, Alan Lee and Earo Wang (2017). hts: Hierarchical and Grouped Time Series. R
package version 5.1.4. URL: https://CRAN.R-project.org/package=hts
Alexis Sarda-Espinosa (2017). dtwclust: Time Series Clustering Along with Optimizations for the
Dynamic Time Warping Distance. R package version 5.1.0.
URL: https://CRAN.R-project.org/package=dtwclust.
Hadley Wickham, Romain Francois, Lionel Henry and Kirill Müller (2017). dplyr: A Grammar of Data
Manipulation. R package version 0.7.4. URL: https://CRAN.R-project.org/package=dplyr.
Hadley Wickham and Lionel Henry (2017). tidyr: Easily Tidy Data with 'spread()' and 'gather()'
Functions. R package version 0.7.2. URL: https://CRAN.R-project.org/package=tidyr.
Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis.Springer-Verlag New York, 2009.
URL: https://cran.r-project.org/web/packages/ggplot2/index.html.