
Building Your First Machine Learning Model Using KNIME


Get started with KNIME, a GUI-driven tool for predictive analytics and machine
learning, without writing a single line of code!

by
Shantanu Kumar
·
Oct. 30, 17 · AI Zone · Tutorial

One of the biggest challenges for beginners in machine learning and data science is that
there is too much to learn at once, especially if you don't know how to code. You need to
get comfortable with linear algebra, statistics, and other mathematical concepts quickly,
and also learn how to code them! That can be overwhelming for new users.

If you have no background in coding and find it hard to cope, you can start learning
data science with a GUI-driven tool. This lets you focus your efforts on learning the
actual subject when you're just starting out. Once you are comfortable with the basic
concepts, you can always learn how to code later.

In today’s article, I will get you started with one such GUI-based tool: KNIME. By the end
of this article, you will be able to predict sales for a retail store without writing a
single line of code!

Let’s get started!

Why KNIME?
KNIME is a platform built for powerful analytics on a GUI-based workflow. This means that
you do not have to know how to code (a relief for beginners like me) to work with KNIME
and derive insights.

You can perform functions ranging from basic I/O to data manipulation, transformation,
and data mining. It consolidates all the functions of the entire process into a single
workflow.

Setting Up Your System


To begin with KNIME, you first need to install it and set it up on your PC.

Go to the KNIME downloads page.

Identify the right version for your PC:

Install the platform and set the working directory for KNIME to store its files:

This is how your KNIME home screen will look.

Creating Your First Workflow


Before we delve more into how KNIME works, let’s define a few key terms to help us in our
understanding and then see how to open up a new project in KNIME.

• Node: A node is the basic processing point of any data manipulation. It can perform a
number of actions based on what you choose in your workflow.

• Workflow: A workflow is the sequence of steps or actions you take in your platform
to accomplish a particular task.
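
You don't need any code to use KNIME, but if you are curious how these two terms relate, here is a minimal Python sketch of the idea. It is only an analogy and not how KNIME is implemented: a node is modeled as a function from a table to a table, and a workflow as a list of such functions run in order. The helper names and the two sample rows are made up for illustration.

```python
from typing import Callable, Iterable, List

# Analogy only: a "table" is a list of row dicts, a "node" is a function
# that takes a table and returns a new table.
Table = List[dict]
Node = Callable[[Table], Table]


def run_workflow(table: Table, nodes: Iterable[Node]) -> Table:
    """Pass the table through each node in order, like following the arrows."""
    for node in nodes:
        table = node(table)
    return table


def drop_rows_without_sales(table: Table) -> Table:
    """Hypothetical node: keep only rows that have a sales value."""
    return [row for row in table if row.get("Item_Outlet_Sales") is not None]


def keep_columns(columns: List[str]) -> Node:
    """Hypothetical node factory: build a node that keeps the given columns."""
    def node(table: Table) -> Table:
        return [{c: row[c] for c in columns} for row in table]
    return node


# Made-up rows, just to have something to push through the workflow.
rows = [
    {"Item_Identifier": "A1", "Item_Type": "Dairy", "Item_Outlet_Sales": 120.0},
    {"Item_Identifier": "B2", "Item_Type": "Snack Foods", "Item_Outlet_Sales": None},
]

workflow = [drop_rows_without_sales,
            keep_columns(["Item_Identifier", "Item_Outlet_Sales"])]
result = run_workflow(rows, workflow)
print(result)
```

Each function only sees the output of the one before it, which is the same discipline KNIME enforces visually when you connect one node's output port to the next node's input port.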

The workflow coach in the top-left corner shows you what percentage of the KNIME
community recommends a particular node. The node repository lists all the nodes that a
workflow can contain, depending on your needs. You can also go
to Browse Example Workflows to check out more workflows once you have created
your first one. This is the first step towards building a solution to any problem.

To set up a workflow, you can follow these steps.

Go to the File menu and click on New:


Create a new KNIME Workflow in your platform and name it Introduction.

Now, when you click Finish, you will have created your first KNIME workflow.

This is your blank workflow in KNIME. Now, you’re ready to explore and solve any
problem by dragging any node from the repository to your workflow.

Introducing KNIME
KNIME is a platform that can help us tackle a very wide range of problems within the
boundaries of data science today. From the most basic visualizations or linear
regressions to advanced deep learning, KNIME covers them all.

As a sample use case, the problem we’re looking to solve in this tutorial is the practice
problem BigMart Sales that can be accessed at Datahack.

The problem statement is as follows:

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10
stores in different cities. Also, certain attributes of each product and store have been
defined. The aim is to build a predictive model and find out the sales of each product at a
particular store. Using this model, BigMart will try to understand the properties of products
and stores which play a key role in increasing sales.

You can find the approach and solution to the BigMart Sales problem here.

Importing the Data Files
Let's start with the first (yet very important) step in understanding the problem: importing
our data.

Drag and drop the File reader node onto the workflow and double-click it. Next, browse to
the file you need to import into your workflow.
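
For readers who later move on to code, the following is a rough Python equivalent of what a file-reading step does: it turns a CSV file into rows keyed by column name. This is a sketch under assumptions, not what KNIME runs internally. The two sample rows are made up, and `Item_Weight` is an assumed column name used only to show a missing cell.

```python
import csv
import io

# Made-up sample standing in for the training file (the real file has more
# columns and rows). Item_Weight is an assumed column name for illustration.
SAMPLE_CSV = """Item_Identifier,Item_Weight,Item_Type,Item_Outlet_Sales
A1,9.3,Dairy,3735.14
B2,,Soft Drinks,443.42
"""


def read_table(handle):
    """Read CSV rows into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


table = read_table(io.StringIO(SAMPLE_CSV))
print(len(table), "rows with columns", list(table[0].keys()))
```

To read a file from disk instead, you could pass `open(path, newline="")` to `read_table`. Note that an empty cell comes back as an empty string, which is exactly the kind of missing value that has to be dealt with before training a model.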

In this article, as we will be learning how to solve the practice problem for BigMart Sales, I
will import the training dataset from BigMart Sales:

This is what the preview looks like once you import the dataset.

Let's visualize some relevant columns and find the correlation between
them. Correlation helps us find what columns might be related to each other and have a
higher predictive power to help us in our final results. To learn more about correlation, read
this article.

To create a correlation matrix, we type “linear correlation” in the node repository, then drag
and drop the node onto our workflow.

After we drag and drop it as shown, we connect the output of the File reader to the
input of the Linear correlation node.

Click the green Execute button on the topmost panel. Now right-click the correlation node
and select View: Correlation Matrix to generate the image below.

Hovering over a particular cell lets you inspect that pair of columns, which helps you
select the features that are important and required for better predictions.
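
To make the numbers in such a matrix less mysterious, here is a small Python sketch that computes a correlation matrix by hand. It assumes the value shown for a pair of numeric columns is Pearson's correlation coefficient, which ranges from -1 (perfectly opposite) to +1 (perfectly aligned). The column names and values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented columns: sales rises in step with price, weight falls as price rises.
columns = {
    "price": [10.0, 20.0, 30.0, 40.0],
    "sales": [25.0, 45.0, 65.0, 85.0],
    "weight": [4.0, 3.0, 2.0, 1.0],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

A value near +1 or -1 against the target column suggests a feature worth keeping, while a value near 0 suggests little linear relationship.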

Next, we will visualize the range and patterns of the dataset to understand it better.

Visualization and Analysis


One of the first things we would like to know from our data is which item type sells the
most.

There are two ways to view this information: a scatter plot and a pie chart.

Scatterplot

Search for Scatter Plot under the Views tab in our node repository. Drag and drop it onto
your workflow in the same way and connect the output of the File reader to this node.

Next, configure your node to select how many rows of the data you need and wish to
visualize (I chose 3000).

Click Execute, and then View: Scatter Plot.


I have selected the X axis to be Item_Type and the Y axis to be Item_Outlet_Sales.

The plot above represents the sales of each item type individually and shows us that fruits
and vegetables are sold in the highest numbers.

Pie Chart

To estimate the average sales of each product type in our dataset, we will use a
pie chart.

Click on the Pie Chart node under Views and connect it to your File reader. Choose the
columns you need for segregation and choose your preferred aggregation methods, then
apply.

This chart shows us that average sales were spread fairly evenly across all kinds of
products. “Starchy foods” had the highest share of average sales at 7.7%.
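
The aggregation behind a chart like this can be sketched in a few lines of Python: group the rows by a column, take the mean of another column per group, and express each mean as a percentage of the total. This is an illustrative sketch with made-up rows, not the Pie Chart node's actual code.

```python
from collections import defaultdict


def mean_by_group(rows, group_key, value_key):
    """Average of value_key for each distinct value of group_key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
        counts[row[group_key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


def shares(values):
    """Convert a dict of numbers into percentages of their sum (the pie slices)."""
    total = sum(values.values())
    return {key: 100.0 * value / total for key, value in values.items()}


# Made-up rows for illustration only.
rows = [
    {"Item_Type": "Dairy", "Item_Outlet_Sales": 100.0},
    {"Item_Type": "Dairy", "Item_Outlet_Sales": 300.0},
    {"Item_Type": "Snack Foods", "Item_Outlet_Sales": 600.0},
]

means = mean_by_group(rows, "Item_Type", "Item_Outlet_Sales")
slices = shares(means)
print(means)
print(slices)
```

Keep in mind that a slice here is a share of the summed averages, not a share of total sales, so a product type with few but large sales can still get a big slice.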

I have used only two types of visuals, although you can explore the data in numerous other
forms as you browse through the Views tab. You can use histograms, line plots, etc. to
better visualize your data.

For data visualization specifically, I consider tools like Tableau to be among the strongest.

How Do You Clean Your Data?


The other things you can include in your approach before training your model are data
cleaning and feature extraction. Here, I will provide an overview of data cleaning steps in
KNIME.

Finding Missing Values


Before we impute values, we need to know which ones are missing.

Go to the node repository again and find the Missing value node. Drag and drop it, and
connect the output of our File reader to the node.
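
If you want to see what "knowing which values are missing" means in code, here is a minimal Python sketch that counts empty cells per column. It treats `None` and the empty string as missing, which is an assumption about how the data was read. The rows and the `Item_Weight` and `Outlet_Size` column names are made up for illustration.

```python
def count_missing(rows):
    """Count cells that are None or an empty string, per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = value is None or value == ""
            counts[column] = counts.get(column, 0) + (1 if is_missing else 0)
    return counts


# Made-up rows; Item_Weight and Outlet_Size are assumed column names.
rows = [
    {"Item_Identifier": "A1", "Item_Weight": "9.3", "Outlet_Size": "Medium"},
    {"Item_Identifier": "B2", "Item_Weight": "", "Outlet_Size": ""},
    {"Item_Identifier": "C3", "Item_Weight": "", "Outlet_Size": "Small"},
]

missing = count_missing(rows)
print(missing)
```

Columns with a count of zero can be left alone, and the rest need one of the imputation choices described in this section.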
Imputations
To impute values, select the Missing value node and click Configure. Select the
imputation appropriate for each type of data and hit Apply.

Now when we execute it, our complete dataset with imputed values is ready in the output
port of the Missing value node. For my analysis, I have chosen the imputation methods
as:

• String:

o Next value
o Previous value
o Custom value
o Remove row

• Number (double and integer):


o Mean
o Median
o Previous value
o Next value
o Custom value
o Linear interpolation
o Moving average
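
Three of the numeric options above (mean, median, and previous value) can be sketched in Python as follows. This is my own minimal illustration of the ideas, not KNIME's implementation, and the sample values are made up. Missing cells are represented as `None`.

```python
from statistics import mean, median


def impute_constant(values, fill):
    """Replace every missing (None) entry with a fixed value."""
    return [fill if v is None else v for v in values]


def impute_mean(values):
    """Fill gaps with the mean of the values that are present."""
    return impute_constant(values, mean(v for v in values if v is not None))


def impute_median(values):
    """Fill gaps with the median of the values that are present."""
    return impute_constant(values, median(v for v in values if v is not None))


def impute_previous(values):
    """Fill each gap with the last seen value; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Made-up column with two missing cells.
weights = [2.0, None, 4.0, None, 12.0]

by_mean = impute_mean(weights)
by_median = impute_median(weights)
by_previous = impute_previous(weights)
print(by_mean, by_median, by_previous, sep="\n")
```

The three strategies give different answers on the same column, which is why the choice depends on the data: the mean is pulled toward outliers such as the 12.0 here, the median is not, and the previous value only makes sense when row order carries meaning.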

Training Your First Model


Let's take a look at how we would build a machine learning model in KNIME.

Implementing a Linear Model


To start with the basics, we will first train a linear model that uses all the features of
the dataset, just to understand how to select features and build a model. Here is a
beginners' guide to linear regression.

Go to your node repository and drag the Linear Regression Learner to your workflow.
Then connect the clean data from the output port of the Missing value node.

This should be your screen visual as of now. In the Configuration tab, exclude
the Item_Identifier and select the target variable on top. After you complete this task,
you need to import your test data to run your model.

Drag and drop another file reader to your workflow and select the test data from your
system.

As we can see, the test data contains missing values as well. We will run it through
the Missing value node in the same way we did for the training data.

After we’ve cleaned our test data too, we will introduce a new node: Regression
predictor.
Load your model into the predictor by connecting the learner’s output to the predictor’s
input. In the predictor’s second input, load your test data. The predictor will automatically
adjust the prediction column based on your learner, but you can alter it manually, as well.
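
The learner/predictor split is worth a small illustration. In the Python sketch below, `fit_line` plays the role of the learner (it turns training data into a model) and `predict` plays the role of the predictor (it applies that model to new data). It is deliberately simplified to a single numeric feature with ordinary least squares, whereas the workflow above uses all the features, and the training numbers are made up.

```python
def fit_line(xs, ys):
    """Learner: ordinary least squares for one feature, returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, xs):
    """Predictor: apply a fitted (slope, intercept) model to new inputs."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


# Made-up training data that follows y = 2x + 1 exactly.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

model = fit_line(train_x, train_y)      # learner output
test_x = [5.0, 6.0]                     # unseen "test data"
predictions = predict(model, test_x)    # predictor output
print(model, predictions)
```

Keeping the two steps separate is what lets the same trained model be reused on the test file without looking at its sales values.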

KNIME can also train more specialized models, which you will find under
the Analytics tab. Here is a list:

1. Clustering
2. Neural networks
3. Ensemble learners
4. Naïve Bayes

Submitting Your Solution


After you execute your predictor, the output is almost ready for submission.

Find the Column filter node in your node repository and drag it to your workflow.
Connect the output of your predictor to the column filter and configure it to keep only the
columns you need. In this case, you need Item_Identifier, Outlet_Identifier, and the
prediction of Item_Outlet_Sales.

Execute the Column filter and, finally, search for the CSV writer node and write
your predictions to your hard drive.

Adjust the path to where you want the CSV file stored, and execute this node. Finally,
open the CSV file and correct the column names to match the required solution format.
Compress the CSV file into a ZIP file and submit your solution!
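
For comparison, the same filter, rename, write, and compress steps can be sketched in Python. Everything here is illustrative: the single prediction row is made up, the input column name `prediction` and the file name `submission.csv` are assumptions, and the work is done in memory so nothing is written to your disk.

```python
import csv
import io
import zipfile


def select_and_rename(rows, mapping):
    """Column filter plus rename: keep only mapped columns, under their new names."""
    return [{new: row[old] for old, new in mapping.items()} for row in rows]


def to_csv_text(rows, columns):
    """CSV writer: header line followed by one line per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def zip_bytes(filename, text):
    """Compress one text file into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(filename, text)
    return buffer.getvalue()


# Made-up predictor output; "prediction" is an assumed column name.
predicted = [
    {"Item_Identifier": "A1", "Outlet_Identifier": "OUT01",
     "Item_Type": "Dairy", "prediction": 1250.5},
]

# Old column name -> name expected in the submission file.
mapping = {
    "Item_Identifier": "Item_Identifier",
    "Outlet_Identifier": "Outlet_Identifier",
    "prediction": "Item_Outlet_Sales",
}

submission = select_and_rename(predicted, mapping)
csv_text = to_csv_text(submission, list(mapping.values()))
archive_bytes = zip_bytes("submission.csv", csv_text)
print(csv_text)
```

To produce a real file, you could write `archive_bytes` to a path of your choice with `open(path, "wb")`.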
This is the final workflow diagram that was obtained.

KNIME workflows are very handy when it comes to portability. They can be sent to your
friends or colleagues to build on together, adding to the functionality of your product!

To export a KNIME workflow, you can simply click on File > Export KNIME Workflow.
After that, select the suitable workflow that you need to export and click Finish!

This will create a .knwf file that you can send to anyone, and they will be able to
access it with one click!

Limitations
KNIME is a very powerful open-source tool, but it has its own set of limitations. The
primary ones are:

• The visualizations are not as neat and polished as those in some other open-source
software (e.g., RStudio).
• Version updates are not supported well; you will have to reinstall the software (e.g.,
to update KNIME from version 2 to version 3, you will need a fresh installation,
and updating in place won’t work).
• The contributing community is not as large as the Python or CRAN communities, so
new additions to KNIME take a long time.
