
Organizational Strategy to Execution

Predictive Analytics with SAS Enterprise Miner


Garrett Stauber
The Citadel
Charleston, USA

Abstract—The expansive amount of available data presents organizations with a valuable resource. By harnessing and analyzing data through predictive analytical processes, valuable insights can be gained about future events and behaviors. To reach their full potential, however, these insights need to find their way to organizational decision makers. By making predictive analytics accessible to those in strategic roles via well-designed software tools, data analysis can feed business strategies that are responsive to the organization's intrinsic and extrinsic environment.

I. INTRODUCTION
Organizations are established with the intent of achieving profitability
[1]. To achieve and improve profitability, high-level business
goals must be established to position the organization for future
success. To disseminate these goals, an organization typically
pushes the goals downward through the business layers.
Meaningful and achievable strategies must be extracted from
the goals at each business layer.
This process of setting goals and then translating them into
objectives for the various organizational layers is designed to
provide top down control over the organization. The top layer
in this hierarchy of control is known as the strategic layer [1].
In this layer high-level business goals are established, and a
business strategy is devised to achieve these goals. The
business strategy will provide an overall direction for the
organization [1]. At the next layer of the organization, known
as the tactical layer, the business strategy must be translated
into tactics which various business units can employ to
contribute towards the strategy [1]. Finally, at the lowest level
of the organization control hierarchy, the operational layer, the
business units' tactics are transformed into actions which can
be accomplished by business processes in their day-to-day
operations [1].
II. PREDICTIVE ANALYTICS & ORGANIZATIONAL LAYERS
The organizational control flow begins with goal setting.
The process of establishing goals and the strategies to achieve
them requires a tremendous amount of insight. Today's businesses have the benefit of massive amounts of data available to them about the external environment, and the ability to collect large amounts of data about their internal operations. From this data countless insights can be drawn. The problem, however, is that this massive amount of data is difficult to organize and analyze. Thus, specialized fields have arisen that focus on organizing and extracting meaningful relationships from big data. The field known as predictive analytics is one of these big-data fields.
Predictive analytics is aimed at drawing conclusions about
future events and behaviors based upon historical trends found
in data [2].
A. Relationship to Strategy Execution
At the strategic control layer, predictive analytics provides
the insights needed to make informed decisions regarding future
changes to the business environment. At this layer, the external
environment is of primary concern. By providing insights into
future customer and competitor behaviors, goals and
strategies can be formulated to beneficially position the
organization ahead of the competition.
B. Relationship to Tactical Execution
At the tactical control layer, predictive analytics can
provide business units with insights into the intrinsic trends of the
organization. These trends can then be leveraged to shift
intrinsic factors such as cost, efficiency, or output.
C. Relationship to Operational Execution
At the operational control layer, business processes generate the data that feeds predictive analytics. This data can then be used to identify business processes in need of improvement at the tactical layer, or to identify organization-wide trends that can be addressed at the strategic level.
III. PREDICTIVE ANALYTICS TOOLS
Due to the complexity of big data, many tools are required
to manage and analyze it. Thus, predictive analytics tools are
often packaged in a suite with other big-data tools. There are
numerous software suites focusing on big data that include
predictive analytics capabilities. These include offerings from large
enterprise software vendors such as SAP, SAS, Oracle, and IBM, as
well as software from the open-source Apache community [3].
A. Analysis Techniques
A four-step process for conducting predictive analytics on big data is outlined in Predictive Analytics: It's everywhere by Popow [4]. The first step is to ensure the data is of high quality. Here the data is scanned for incomplete or erroneous records, which must be corrected or removed to prevent statistical errors. Next, the data-mining process occurs. In this phase trends, patterns, and other relationships in the data are discovered. Using this information, the next step is to build mathematical models to conduct predictive analysis on the data. These models contain the logic for conducting analysis on a known set of data. Finally, the models are validated by running them against older data and comparing the output to the known results. The types of models used to conduct predictive analytics can broadly be grouped into two categories: regression and machine learning models [3].
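Returning to the first step for a moment, a minimal quality screen might look like the following Python sketch. The file and column names here are hypothetical, invented purely for illustration:

import pandas as pd

# Load a hypothetical historical data set (file name is illustrative only).
records = pd.read_csv("claims.csv")

# Step one: screen for incomplete or erroneous records. Drop rows with
# missing values and discard obviously invalid entries, such as negative
# claim amounts.
records = records.dropna()
records = records[records["claim_amount"] >= 0]

print(len(records), "clean records remain for mining and modeling.")

Only after this screening does the data move on to mining, modeling, and validation.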
In regression modeling, an equation is established whose output is the focus of the analysis [3]. The analyst will want to discover which factors lead to maximization or minimization of this equation. By looking at historical data, and with knowledge of the problem at hand, new variables can be introduced to the equation and their effect on the output can be gauged.
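A minimal regression sketch in Python, with invented figures for two candidate factors (advertising spend and store count) driving a revenue output, might look like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy historical data: each row is [advertising spend, store count];
# y is the monthly revenue we treat as the equation's output.
X = np.array([[10.0, 2], [15.0, 3], [20.0, 3], [25.0, 4], [30.0, 5]])
y = np.array([120.0, 150.0, 170.0, 210.0, 240.0])

model = LinearRegression().fit(X, y)

# The fitted coefficients gauge each factor's effect on the output,
# guiding which variables to keep, drop, or introduce next.
print("coefficients:", model.coef_)
print("predicted revenue for [35, 5]:", model.predict([[35.0, 5]]))

Swapping variables in and out of X and re-fitting is the code-level analogue of gauging a new variable's effect on the equation's output.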
Machine learning is different in that the equation is not explicitly stated and the machine has no prior notion of which factors are relevant to the problem at hand [5]. The machine is simply given access to the data, and it remembers the states of the data in which a certain event occurred. The machine can then use this memory to recognize an occurrence of such a state in the future. By feeding the machine more known occurrences to learn from, it can be trained to better determine when an event of interest will occur.
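The memorization described above is close in spirit to a nearest-neighbor learner. Below is a minimal sketch, with made-up sensor readings standing in for recorded states:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Illustrative training data: each row is a recorded state (two invented
# sensor readings); the label marks whether the event of interest occurred.
states = np.array([[0.2, 0.1], [0.3, 0.2], [0.8, 0.9], [0.9, 0.8], [0.7, 0.95]])
event_occurred = np.array([0, 0, 1, 1, 1])

# The classifier effectively remembers the labeled states; feeding it more
# known occurrences improves its ability to recognize similar states later.
model = KNeighborsClassifier(n_neighbors=3).fit(states, event_occurred)

print(model.predict([[0.85, 0.9]]))  # [1]: resembles past event states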
B. PMML
Big-data tools are largely proprietary and practitioners are
often experts in a limited set of tools. As big-data analysis
becomes more prominent in business strategy, standardization
is being introduced to the set of available tools. This allows
businesses to transition between big-data suites without
extensive investments in re-training their professionals.
One form of standardization in big data, specific to predictive analytics, is the Predictive Model Markup Language (PMML). PMML is an XML-based file format developed by the Data Mining Group to provide a way for applications to describe and exchange models produced by data-mining and machine learning algorithms [6].
The primary purpose of PMML is to provide a standardized
structure to represent the various data-mining models [6]. This
allows the models to be imported and exported from various
statistical language packages [6]. Large enterprise suites like
SAS, which is explored in the next section, support both
importing and exporting PMML.
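Outside of SAS, one way to produce a PMML file is the open-source sklearn2pmml package for Python. The sketch below assumes that package (and the Java runtime it requires) is installed; it is one illustrative route to PMML, not the only one:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)

# Wrap the model in a PMMLPipeline so it can be serialized to PMML.
pipeline = PMMLPipeline([("classifier", DecisionTreeClassifier())])
pipeline.fit(X, y)

# Write the fitted model as a PMML (XML) document that a PMML-aware
# tool could then import.
sklearn2pmml(pipeline, "decision_tree.pmml")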

IV. SAS
SAS is a software suite originally developed at North
Carolina State University and released in 1972 [7]. Today it
comprises over 200 components [7]. It can mine, alter,
manage and retrieve data from a variety of sources and
perform statistical analysis on it [7]. SAS provides a graphical
point-and-click user interface for non-technical users and
more advanced options through the SAS programming
language.
A. Enterprise Miner
The SAS module most relevant to predictive analytics is
SAS Enterprise Miner. This module is focused
on the mining of data relationships and the creation of
accurate descriptive and predictive data models [8].
B. Predictive Analysis in Enterprise Miner
The process of data mining and analysis in Enterprise Miner is described in five steps, abbreviated SEMMA, in Getting Started with SAS Enterprise Miner 14.1 [8, page 1].
Sample the data by creating one or more data sets. The
sample should be large enough to contain significant
information, yet small enough to process. This step includes
the use of data preparation tools for data import, merge,
append, and filter, as well as statistical sampling techniques.
Explore the data by searching for relationships, trends, and
anomalies in order to gain understanding and ideas. This step
includes the use of tools for statistical reporting and graphical
exploration, variable selection methods, and variable
clustering.
Modify the data by creating, selecting, and transforming the
variables to focus the model selection process. This step
includes the use of tools for defining transformations, missing
value handling, value recoding, and interactive binning.
Model the data by using the analytical tools to train a
statistical or machine learning model to reliably predict a
desired outcome. This step includes the use of techniques such
as linear and logistic regression, decision trees, neural
networks, partial least squares, LARS and LASSO, nearest
neighbor, and importing models defined by other users or even
outside SAS Enterprise Miner.
Assess the data by evaluating the usefulness and reliability
of the findings from the data mining process. This step
includes the use of tools for comparing models and computing
new fit statistics, cutoff analysis, decision support, report
generation, and score code management.
After the models have been created and assessed,
Enterprise Miner will report which of your models have the
highest degree of accuracy [8, page 2]. These models can then
be applied to new data, to score the likelihood of the target
outcome [8, page 2].

C. Enterprise Miner Nodes
The heart of Enterprise Miner's purpose and functionality lies in what it calls nodes. The nodes provide the user with all of the functionality needed to import, mine, and analyze data. The simple drag-and-drop user interface provided by Enterprise Miner allows the user to create a process flow containing various nodes that follow the SEMMA methodology [8, page 2]. Using this interface, it is possible to rapidly generate and analyze many data models. The most commonly used nodes are organized below, with a brief description, according to their SEMMA phase as documented in the software documentation [8, page 47].
1) Sample
a) Append
The Append node is used to combine data sets within
the process flow diagram [8, page 47].
b) Data Partition
The Data Partition node is used to divide a data set
into a subset used for model fitting and other subsets set
aside for model validation and testing [8, page 47].
c) Filter
The Filter node can be used to exclude data based
upon variable values [8, page 48].
d) Input Data
The Input Data node is used to include SAS
recognized data sets in the process flow [8, page 48].
e) Sample
The Sample node is used to extract a subset from a
larger dataset [8, page 48].

2) Explore
a) Association
The Association node is used to identify relationships
of interest between variables in the dataset [8, page 48].
b) Graph Explore
The Graph Explore node provides data visualizations
along with interactive exploration to aid in pattern and
trend discovery [8, page 49].
c) MultiPlot
The MultiPlot node will automatically create charts
of the input and target variables [8, page 49].
d) Path Analysis
The Path Analysis node is used to explore web logs
to determine the paths taken to navigate a website [8,
page 49].
e) StatExplore
The StatExplore node is used to examine the
statistical properties of an input data set [8, page 49].
f) Variable Clustering
The Variable Clustering node is used to replace
redundant and collinear variables in the data with a
single variable [8, page 49].
g) Variable Selection
The Variable Selection node is used to select which
variables are useful to the model for target prediction [8,
page 49].

3) Modify
a) Drop
The Drop node is used to remove variables from the
dataset or to hide metadata [8, page 49].
b) Impute
The Impute node is used to compute and fill in
missing variable values in the dataset [8, page 50].
c) Transform Variables
The Transform Variables node is used to replace a
variable with a value that is a transformation of its
existing value [8, page 50].

4) Model
a) AutoNeural
The AutoNeural node is an automated tool that will
aid in discovering the optimal configuration for a neural
network model [8, page 50].
b) Decision Tree
The Decision Tree node is used to fit decision trees to
the data and includes auto-ranking of the input variables
[8, page 51].

c) Model Import
The Model Import node is used to import a model not
created in Enterprise Miner [8, page 52].
d) Neural Network
The Neural Network node is used to construct, train,
and validate multilayer, feed-forward neural networks
[8, page 52].
e) Regression
The Regression node is used to fit linear and logistic
regression models to the data [8, page 52].

5) Assess
a) Model Comparison
The Model Comparison node is used to compare
model nodes and their predictions. The node will
generate visualizations showing the usefulness of the
given models [8, page 53].
b) Score
The Score node is used to manage the SAS scoring code
that is generated by the models. Scoring is the
generation of predicted values for a data set that might
not contain a target variable [8, page 53].

D. Sample Enterprise Miner Process Flow Diagram
In its simplest form, an example usage of these nodes following the SEMMA methodology would look something like the following:
• Our first node would come from the Sample category: an Input Data node. Using this node, we would link to a SAS data source, define which fields we would like to exclude, and identify our target field [8, page 11].
• To facilitate analysis, we must next include a Data Partition node. Using this node, we will define what percentages of our data source to divide into subsets used for initial model fitting, model validation, and testing [8, page 18].
• Our next node will be used to explore the data source. In this example we will use the StatExplore node. Once we link this node to our Input Data node, we can run the StatExplore node and view the results [8, page 16]. The results will display a graph charting the correlation between our various data source fields and our target field [8, page 16].
• Next, before conducting analysis on our data, we will include an Impute node to fill in any missing data that can be calculated or interpreted from other fields. This step can increase the amount of data available to generate and validate our models, since model nodes will discard incomplete records [8, page 30].
• Once we have defined the impute rules, we can choose a Model node to fit to our data. In this case we will use a Regression node and select a linear regression algorithm in the node's properties.
• Finally, we can complete our process flow with a Score node. The Score node is used to predict the target variable on a data source outside of the data used for model fitting and validation. When we run the Score node, it will execute all previous nodes and score the data set aside in the Data Partition node for testing. The predicted result for the target variable will be stored in the Score node's results, and our analysis is complete [8, page 44].
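For readers more comfortable in code than in the Enterprise Miner interface, the sketch below is a rough Python analogue of the flow above. The data source and column names are hypothetical, and the scikit-learn calls merely approximate what the corresponding nodes do:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Input Data: load a hypothetical source and identify the target field.
data = pd.read_csv("customers.csv")
X, y = data.drop(columns=["target"]), data["target"]

# Data Partition: split into fitting, validation, and test subsets.
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# StatExplore: inspect the correlation between each field and the target.
print(X_fit.corrwith(y_fit))

# Impute: fill in missing values so records are not discarded.
imputer = SimpleImputer(strategy="mean").fit(X_fit)
X_fit, X_val, X_test = (imputer.transform(d) for d in (X_fit, X_val, X_test))

# Model: fit a linear regression, then check it on the validation subset.
model = LinearRegression().fit(X_fit, y_fit)
print("validation R^2:", model.score(X_val, y_val))

# Score: predict the target for the test partition set aside earlier.
print("scored test records:", model.predict(X_test)[:5])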


V. HOW PREDICTIVE ANALYTICS TOOLS AID ORGANIZATIONAL CONTROL
With limited training, it is possible to conduct
effective analysis on big data using enterprise software tools
such as SAS Enterprise Miner [8, page 2]. This allows
predictive analysis to be conducted at organizational control
levels staffed largely by non-technical members, putting the
ability to gain important insights into the hands of the
organizational decision makers. At the strategic level,
predictive analytical tools can evaluate data generated daily to
dynamically guide organizational strategies and evaluate the
probability of achieving organizational goals. At the tactical
layer, various tactics can be evaluated and compared to view
their effect on profitability. Predictive analytics could also be
used to predict business process deficiencies and avert them
before they affect the bottom line. At the operational level,
day-to-day operations can be planned efficiently for future
events, and stresses on daily operations can be avoided.

REFERENCES
[1] Boundless. (2015). Boundless Management. Retrieved December 7, 2015, from https://www.boundless.com/management/textbooks/boundless-management-textbook/control-8/types-of-control-62/strategic-tactical-and-operational-control-313-3960/
[2] Predictive Analytics: What it is and why it matters. (n.d.). Retrieved December 7, 2015, from http://www.sas.com/en_us/insights/analytics/predictive-analytics.html
[3] Predictive Analytics Tools - Predictive Analytics Today. (2013, November 1). Retrieved December 7, 2015, from http://www.predictiveanalyticstoday.com/predictive-analytics-tools/
[4] Popow, D. (2008, April 3). Predictive Analytics: It's everywhere. Retrieved December 7, 2015, from http://claimsmanagement.theclm.org/home/article/predictive-analytics
[5] Machine Learning: What it is and why it matters. (n.d.). Retrieved December 7, 2015, from http://www.sas.com/en_us/insights/analytics/machine-learning.html
[6] Guazzelli, A., Zeller, M., Lin, W., & Williams, G. (2009). PMML: An Open Standard for Sharing Models. The R Journal, 1, 60-65.
[7] SAS. (n.d.). Retrieved December 7, 2015, from https://en.wikipedia.org/wiki/SAS_(software)
[8] Getting Started with SAS Enterprise Miner 14.1. (2015). Cary, North Carolina: SAS Institute.
