

A Deep Learning Algorithm to Forecast Sales of Pharmaceutical Products

Oscar Chang
Research and Innovation, IT Empresarial, Yachay, Ecuador
oscarchang@it-empresarial.com

Ivan Naranjo
Centro de Despacho, FARMAENLACE, Yachay, Ecuador
ivannaranjo@farmaenlace.com

Christian Guerron
Sistemas, IT Empresarial, Yachay, Ecuador
christianguerron@it-empresarial.com

Dennys Criollo
Gerencia de Sistemas, IT Empresarial, Yachay, Ecuador
denniscriollo@farmaenlace.com

Jefferson Guerron
Sistemas, IT Empresarial, Yachay, Ecuador
jeffersonguerron@it-empresarial.com

Galo Mosquera
Centro de Despacho, FARMAENLACE, Quito, Ecuador
gemosquera@usfq.edu.ec

Abstract— This work proposes a deep neural network (DNN) algorithm that accomplishes consistent sales forecasting for weekly data of pharmaceutical products. The resultant time series is used to train a DNN step by step with backpropagation, where shallow nets face selected scenarios with different space-time data considerations. In each step, by using a sum of square differences and a peak-search procedure, a reasonable quality in the obtained abstract representations is pursued. First, an autoencoder is trained so as to develop in its hidden layer neural data abstractions about a random moving window. Thereafter the autoencoder abstractions are used to train a second shallow net, which operates in a restricted area and specializes in one-week-ahead predictions. Lastly, by using the abstraction of this second net plus recently captured information, a third shallow net is trained to produce its own one-week-ahead estimates, using new timing and data procedures. After training, the whole stacked system can produce stable weekly forecasting with hit rates from 55% up to 91%, for assorted products and periods. The system has been tested in real time with real data.

Keywords— Deep Learning, Time Series Prediction, Sales Forecasting.

I. INTRODUCTION

In the deep learning world, state-of-the-art performance has earned a good reputation in fields like object recognition [1], speech recognition [2], natural language processing [3], physiological affect modelling [4] and many others. More recently, papers on time-series prediction or classification with deep neural networks have been reported [5] [6] [7] [8].

The search for depth

Both in biology and in circuit complexity theory it is maintained that deep architectures can be much more efficient (even exponentially more efficient) than shallow ones in terms of computational power and abstract representation of some functions [10] [11]. Unfortunately, well-established gradient descent methods such as backpropagation, which have proved effective when applied to shallow architectures, do not work well when applied to deep architectures.

In previous works [12] [13] [14] we have shown an innovative line of deep learning algorithms, with its own set of advantages and disadvantages, but eventually producing efficient neural computing processors. We have taken these ideas further, and in this paper we propose a DNN specialized in forecasting the sales of pharmaceutical products. The general problem is to find, for each outlet and for each product, an ideal balance that minimizes inventory costs and maximizes customer attention. For a distribution center with hundreds of outlets and thousands of products, this becomes a most entangled and important operation, where deep learning could contribute with practical solutions.

Our methodology contemplates training shallow networks with backpropagation inside explicit scenarios, with specialized tasks, where predictive information about sales circulates freely and is used as immediate targets or rewards for local neural training. The final objective is to produce reliable abstract representations of the data behavior under both short-term and long-term influences, codified in hidden layers, and then stack them together so as to produce forecasting information.

We also propose a primordial method to measure the quality of the abstract representations generated in the different hidden layers used, by monitoring the neural activity of the hidden neurons while training is in progress. This procedure requires a quadratic sum of differences over a selected period.



II. DEEP NETWORK

A. Deep architecture
The proposed stacked network utilizes five layers of sigmoidal neurons organized as one input, three hidden and one output layer. To combine the higher-level features of different data behaviors, the hidden layers are trained separately and then stacked, on top of which the output layer is added. The third hidden layer also incorporates as input recently captured information, such as the last eight weeks' average and some fresh peak and valley values (Fig. 1).

Figure 1. Deep predictor
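As a purely illustrative picture of how such a stack can be wired (not the authors' code), the sketch below chains three shallow sigmoidal sub-networks with the layer sizes reported later in the paper: a 19-entry input window, 11 autoencoder hidden units, 7 precursor hidden units and 24 gambler inputs. The gambler hidden size of 10 is our own assumption.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def make_layer(n_in, n_out, rng):
        # Small random initial weights, as in the paper's training runs.
        return rng.uniform(-0.1, 0.1, (n_out, n_in)), np.zeros(n_out)

    rng = np.random.default_rng(0)
    W_enc, b_enc = make_layer(19, 11, rng)   # autoencoder: 19 inputs -> 11 hidden (decoder omitted)
    W_pre, b_pre = make_layer(11, 7, rng)    # precursor: 11 abstractions -> 7 hidden
    W_gam, b_gam = make_layer(24, 10, rng)   # gambler: 7 + 1 + 16 = 24 inputs -> 10 hidden (assumed)
    W_out, b_out = make_layer(10, 1, rng)    # single output: one-week-ahead sales estimate

    def stacked_forward(window_vec, recent_avg, peaks_valleys):
        """window_vec: 19 normalized entries; recent_avg: last-8-weeks average; peaks_valleys: 16 values."""
        h1 = sigmoid(W_enc @ window_vec + b_enc)               # hidden 1: autoencoder abstraction
        h2 = sigmoid(W_pre @ h1 + b_pre)                       # hidden 2: precursor abstraction
        gambler_in = np.concatenate([h2, [recent_avg], peaks_valleys])
        h3 = sigmoid(W_gam @ gambler_in + b_gam)               # hidden 3: gambler
        return sigmoid(W_out @ h3 + b_out)[0]                  # normalized one-week-ahead prediction

The input layer, the three hidden layers and the single output account for the five layers mentioned above.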

For operative purposes, the proposed stacked architecture is derived from three shallow networks called Autoencoder, Precursor and Gambler.

B. Data Handling
Given three years of daily sales grouped in weeks, the network unravels the problem of predicting sales one week ahead of the current input window (one product, one outlet). The dataset is taken from the database of a real pharmaceutical company in Ecuador. For training purposes, the available data is divided in three mobile zones (Fig. 2), where time moves to the right.

The first initial zone, to the left, is reserved to train the first two shallow nets, "autoencoder-precursor", which work as a coordinated duet.

The next zone, of about 10 weeks, is reserved to train the net "gambler", which holds the single final network output and provides the final prediction information. Finally, the zone "unknown future" is used to test the performance of the system and to make a real prediction, when the unknown future line reaches the end of the data. At any time, more data can be added and the system responds by creating new predictions.

C. Input Vector
The input vector is composed of a moving window of 16 consecutive weeks plus three other elements defined by the day/month/year at which the top right of the moving window stays at a given time instant (Fig. 2). All 19 entries are normalized to neural values inside the analog segment [0,1]. When a target is needed, it is taken as the sale value of the week next to the right of the sample window (near future). The shown data ranges from January 2014 to April 2017.
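A minimal sketch of how such a 19-entry input could be assembled follows; the min-max style scaling and the linear encoding of day, month and year are our assumptions, since the paper only states that all entries are normalized to [0,1].

    import numpy as np

    def build_input_vector(weekly_sales, end_index, end_date, sales_max):
        """19-entry input: 16 consecutive weekly sales plus day/month/year of the window's right edge.

        weekly_sales : 1-D array of weekly unit sales for one product at one outlet
        end_index    : index of the most recent week inside the moving window
        end_date     : datetime.date of that week
        sales_max    : scaling constant, e.g. the largest sale seen in the training zones
        """
        window = weekly_sales[end_index - 15 : end_index + 1]        # 16 consecutive weeks
        window_scaled = np.clip(window / sales_max, 0.0, 1.0)        # assumed scaling to [0, 1]
        date_feats = np.array([end_date.day / 31.0,                  # assumed linear date encoding
                               end_date.month / 12.0,
                               (end_date.year - 2014) / 3.0])        # data spans 2014-2017
        return np.concatenate([window_scaled, date_feats])           # 19 entries in [0, 1]

    def target_for(weekly_sales, end_index, sales_max):
        """Target: the sale value of the week to the right of the window (near future)."""
        return min(weekly_sales[end_index + 1] / sales_max, 1.0)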

Figure 2. Data handling and input vector. Weekly sales behavior of a typical pharmaceutical product, with an erratic pattern of consumption, and a moving window of 16 weeks of sales data plus window location date information.

For training purposes, the moving window travels in different space-time patterns for diverse training scenarios.

III. FIRST SCENARIO: THE AUTOENCODER

Our autoencoder has 19 input, 11 hidden and 19 output neurons. To train it, the moving window is located at a random position inside the autoencoder zone and the same input vector is used as target.

The job of the trained autoencoder is to reproduce in its output, as exactly as possible, the image of the moving window just loaded in its inputs, for any random position in the allowed area. Since there are fewer hidden neurons than input neurons, data compression and abstract representations must occur during training. Our stacked system works with abstractions that travel from layer to layer as the main source of information, so we take special care about abstraction quality.

We tried several metric methods to avoid overfitting-underfitting problems [15] and at the same time tried to guarantee quality abstractions from the involved hidden layers. We finally adopted the following scheme, which begins by measuring the quadratic variation V among all the outputs of the hidden neurons for two consecutive randomly selected images at times t and t-1. That is:

    V_t = sum_{i=1..n} (o_{i,t} - o_{i,t-1})^2        (1)

Where:
V_t = hidden outputs variation between two consecutive inputs
n = number of hidden neurons
o_{i,t} = output of hidden neuron i at time t.

In a typical run, with small initial random weights in the hidden layer, V starts from a small value and then grows into a random oscillatory time series. We use this outcome and introduce a selective peak-search procedure where the last found peak value of V is stored until a bigger peak value is found. In pseudo code:

    cycles_count = 0;
    do
    {
        calculate net;                  /* forward pass */
        backpropagation;                /* one weight update */
        calculate Vn;                   /* hidden-layer variation, Eq. (1) */
        if (Vn > peak) { peak = Vn; cycles_count = 0; }
        cycles_count++;
    } while (cycles_count < 3000);

It turns out that after a while the period required to reach the next peak value (cycles count) grows exponentially, probably as an overtraining signal. So, for our purposes, if in 3000 consecutive cycles no new bigger peak originates, the net is assumed to be done and training stops. In our experiments, this peak-search scheme produces small output error and high hidden layer activity in the autoencoder, taking an average of 50k cycles of backpropagation to be completed.

IV. SECOND SCENARIO: THE PRECURSOR

The precursor is a three-layer neural net whose inputs are the nonlinear abstractions generated by the trained autoencoder. Its only output is trained to predict, inside the precursor zone, the sale value for the week next to the window position. The precursor never sees the actual data patterns captured by the window but only the resulting abstraction generated by the autoencoder, which is why good quality abstractions were required.

To train the precursor, the moving window is located at a random position inside the allowed zone (Fig. 2) and the value of the week next to the moving window is used as target. After feeding forward all the participating layers, one backpropagation cycle is carried out, only for the precursor layers. As before, we check the hidden layer activity using (1) and the mentioned peak-search algorithm.

Figure 3. The Autoencoder and the Precursor. Once the Autoencoder is trained, its hidden layer becomes the input to the Precursor, which never sees the real input windows but only abstractions created by hidden1. Also, the learning cycles of the Precursor do not affect the weights of the Autoencoder.

A. Trained Stacked Behavior

In Fig. 4 the behavior of the trained, stacked autoencoder-precursor duet is shown. The quadratic error inside the allowed training zone shrinks to a minimum. The two curves look almost identical, so it is expected that the hidden layers of the precursor convey valuable feature abstractions about predictions in the allowed training zone. Noticeably enough, the predictive capacity of the duet fades away when it moves to the future, toward never seen data (Fig. 4).

Figure 4. The behavior of the stacked Autoencoder-Precursor duet. After training, the quadratic error inside the allowed training zone shrinks to a minimum. The two curves look almost identical, so the hidden layers of the precursor should convey valuable feature abstractions about predicting. The predictive capacity fades away when the duet moves to the unknown future, with never seen data.
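Since both the autoencoder and the precursor are stopped with the same criterion on the hidden-layer variation V of (1), a runnable restatement of that monitoring loop may help; `forward_hidden` and `backprop_step` stand for whatever network implementation is in use and are our placeholders, not the paper's code.

    import numpy as np

    def hidden_variation(h_t, h_prev):
        """Eq. (1): quadratic variation V between hidden outputs for two consecutive inputs."""
        return float(np.sum((h_t - h_prev) ** 2))

    def train_with_peak_search(net, sample_random_window, patience=3000):
        """Backpropagate until no larger peak of V has appeared for `patience` consecutive cycles."""
        peak, cycles_since_peak, total_cycles = 0.0, 0, 0
        h_prev = None
        while cycles_since_peak < patience:
            x, target = sample_random_window()      # random window position in the allowed zone
            h_t = net.forward_hidden(x)             # hidden-layer outputs (placeholder API)
            net.backprop_step(x, target)            # one backpropagation cycle (placeholder API)
            if h_prev is not None:
                v = hidden_variation(h_t, h_prev)
                if v > peak:                        # a new, bigger peak resets the counter
                    peak, cycles_since_peak = v, 0
                else:
                    cycles_since_peak += 1
            h_prev = h_t
            total_cycles += 1
        return total_cycles                         # around 50k cycles for the autoencoder

For the autoencoder the target is the input window itself; for the precursor it is the sale value of the week next to the window.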

V. THIRD SCENARIO: THE GAMBLER

The Gambler is a third neural network, responsible for predicting the sales value of the week immediately after the moving window, in a new, never seen training zone. As shown in Fig. 1, this shallow network accepts as input the abstractions generated by the Precursor (7 inputs) and a selected group of recent events defined by the average value of the last 8 weeks (1 input) and the peaks and valleys of the last 8 weeks (16 inputs). This totals 24 signals that carry both recent and old information.

A. The Gambler Training
To train the Gambler, the moving window moves step by step strictly toward the future, beginning in the gambler training zone of Fig. 2 and Fig. 4 and going forward in successive steps, up to the border of the unknown future, carrying out one backpropagation cycle in each step. Once the unknown future is touched, the window returns to the beginning of the gambler training zone. This left-to-right sweep is repeated while the hidden layer activity of the Gambler is monitored with the peak-search method previously described. Termination occurs in an average of 25k backpropagation cycles.

VI. RESULTS

We work with a group of several different pharmaceutical products and for illustrative purposes we selected four products A, B, C, D with different natures and behaviors. For training purposes, the data covers more than 170 weeks, from 2014 to 2017. In the shown graphics, the one-week-ahead predictions include 22 consecutive weeks from December 2016 to June 2017.

Weeks 16 to 150 are used to train the Autoencoder-Precursor duet and weeks 150 to 160 are used to train the Gambler. After each prediction, an error measure is compared against a threshold obtained as 10% of the average value of the given data. If the threshold is satisfied, the prediction is declared "true" and stored. If the threshold is not met, the Autoencoder-Gambler duet is retrained for up to four more opportunities. After this, the prediction is declared "false" and stored. Once one prediction is finished, the whole system advances one week toward the unknown future, and so on, until the end of the data is reached.

The resultant time series (blue) and their respective one-week-ahead predictions (red) are shown in Fig. 5; time runs to the right. Black vertical lines represent fails, where the predictions did not satisfy the 10% threshold and re-training of the AUTOENCODER-PRECURSOR duet was done. When new weekly information is added to the data, the process fires again and a new prediction is produced.
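A compact sketch of this rolling evaluation follows; `predict_one_week` and `retrain_duet` are our placeholders for the trained stack and its retraining routine, and the comparison of the absolute prediction error against the 10% threshold is our reading of the text.

    import numpy as np

    def rolling_evaluation(weekly_sales, start_week, end_week,
                           predict_one_week, retrain_duet, max_retrains=4):
        """Declare each one-week-ahead prediction 'true' or 'false' against the 10% threshold,
        retraining up to four more times on a miss, then advance one week toward the future."""
        threshold = 0.10 * np.mean(weekly_sales)      # 10% of the average value of the given data
        hits = []
        for week in range(start_week, end_week):
            hit = False
            for _attempt in range(1 + max_retrains):
                prediction = predict_one_week(week)   # forecast for week + 1
                if abs(prediction - weekly_sales[week + 1]) <= threshold:
                    hit = True
                    break
                retrain_duet(week)                    # retrain before trying again
            hits.append(hit)
        hit_rate = 100.0 * sum(hits) / len(hits)      # percentage of "true" predictions
        return hits, hit_rate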
Figure 5. The Results. From left to right: high, medium, low-low and low rotation product sales (blue) and their respective one-week-ahead predictions (red), from December 2016 to June 2017. Black vertical lines represent failures, where the prediction did not satisfy the required 10% threshold and retraining of the whole network was required. For three out of four randomly selected products, the hit rate is noticeably high.

Product A. High rotation, average sale 72 units per week. There are 3 fails in 22 trials. The hit rate for this run is 86.4%. Some peaks and valleys are correctly anticipated. Other runs may go down to 66.3%.

Product B. Medium rotation, average sale 9.6 units per week. There are 4 fails in 22 trials. The hit rate is 81.8%. Some peaks and valleys are correctly anticipated.

Product C. Low-low rotation, average sale 0.6 units per week. There are 10 fails in 22 trials. The hit rate is 54.5%.

Product D. Low rotation, average sale 110 units per week. There are 4 fails in 22 trials. The hit rate is 81.8%.

VII. DISCUSSION

For some products, the system produces good hit-rate forecasting, where some peaks and valleys are well predicted. For other kinds of products, the hit rate barely keeps above 54%. Further parameters and training strategies should yet be developed for these cases.

According to our results, the proposed peak-search scheme produces good enough abstractions that convey important information, raising the hit rates well above 50%.

VIII. CONCLUSIONS

We presented a sales forecasting deep learning model that makes sales predictions by stacking abstract representations, whose quality is monitored by using a sum of square differences and a peak-search scheme. Abstractions are produced by three different shallow networks, autoencoder, precursor and gambler, trained inside explicit scenarios with focused tasks, timing and reward procedures. Our training algorithm accomplishes a temporal analysis that uses both short-term and long-term features and, in experiments with real-world data, delivers good results for products with different consumption behaviors. Due to the many possibilities in training strategies, links with other training techniques such as reinforcement learning and genetic algorithms are foreseen.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1-9, 2012. (object recognition)
[2] O. Abdel-Hamid and A. Mohamed, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," Acoustics, Speech, and Signal Processing, 2012. (speech recognition)
[3] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," Proceedings of the 25th International Conference on Machine Learning, 2008. (natural language processing)
[4] H. P. Martinez, "Learning deep physiological models of affect," Computational Intelligence Magazine, (April):20-33, 2013. (physiological affect modelling)
[5] A. F. Atiya, N. El Gayar, and H. El-Shishiny, "An empirical comparison of machine learning models for time series forecasting," Econometric Reviews, 29(5):594-621, 2010.
[6] S. Prasad and P. Prasad, "Deep recurrent neural networks for time series prediction."
[7] X. Ding, Y. Zhang, T. Liu, and J. Duan, "Deep learning for event-driven stock prediction."
[8] M. Dalto, "Deep neural networks for time series prediction with applications in ultra-short-term wind forecasting," 2015 IEEE International Conference on Industrial Technology (ICIT). University of Zagreb, Faculty of Electrical Engineering and Computing. (time series I)
[9] C. Deep Prakash et al., "Data analytics based Deep Mayo predictor for IPL-9," International Journal of Computer Applications (0975-8887), vol. 152, no. 6, October 2016. (time series II)
[10] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," Pattern Analysis and Applications, (1993):1-30, 2013.
[11] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
[12] O. Chang, P. Constante, A. Gordon, and M. Singaña, "A novel deep neural network that uses space-time features for tracking and recognizing a moving object," Journal of Artificial Intelligence and Soft Computing Research, Poland, 2017 (in press).
[13] O. Chang, P. Constante, A. Gordon, M. Singaña, and F. Acuna, "A deep architecture to visually analyze Pap cells," IEEE 2nd Colombian Conference on Automatic Control (CCAC), Oct. 2015. DOI: 10.1109/CCAC.2015.7345210.
[14] O. Chang, "A bio-inspired robot with visual perception of affordances," in Computer Vision - ECCV 2014 Workshops, vol. 8926, Lecture Notes in Computer Science, pp. 420-426, Springer International Publishing, 2015. http://link.springer.com/chapter/10.1007
[15] MathWorks, "Improve Neural Network Generalization and Avoid Overfitting," https://www.mathworks.com/help/nnet/ug/improve-neural-network-generalization-and-avoid-overfitting.html?requestedDomain=www.mathworks.com#responsive_offcanvas, 2017.
