
Restricted Boltzmann Machines on Multi-Core

Processors
By
Sai Prasad Nooka
Stavan Karia

Overview
What is machine learning?
Brain Vs Processors
Motivation & Goal
Introduction to Artificial Neural Networks
What is a Deep Neural Network?
Why Deep Neural Network?
Boltzmann Machines
Restricted Boltzmann Machines
Semi-Supervised Learning
Learning Feature Hierarchy
RBM Implementation
Compute Unified Device Architecture (CUDA)
Implementation on GPU
Results
Conclusion

What is Machine Learning?


A model of digit recognition (demo: http://www.cs.toronto.edu/~hinton/adi/index.htm).
This model learns to generate combinations of labels and images.
Architecture (Hinton): 2000 top-level neurons connected to 10 label neurons and to a 500-neuron layer, with a second 500-neuron layer below it and a 28 x 28 pixel image at the bottom.

Brain Vs Processors
The brain is made up of billions of cells, called neurons, with highly parallel and adaptive connections.
We now have a similar number of transistors per chip, but their connections are not adaptive.
Switching time:
Neurons switch at a frequency of about 10 kHz.
Processor switching frequencies are approaching 10 GHz, so in this respect processors are far faster.
Connections:
In the brain, each neuron is interconnected with thousands of other neurons.
In processors, a transistor has at most about 10 connections.

Motivation & Goal

http://publications.csail.mit.edu/abstracts/abstracts07/brussell2/brussell2.html

Motivation & Goal

The goal is to solve practical problems using novel learning algorithms inspired by the brain, and to make computers more user-friendly.
We try to achieve human-like performance on problems such as:
Object Detection
Speech recognition

Introduction to Artificial Neural Networks

Activation function
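
As a reminder, a single neuron combines its inputs with weights and a bias and passes the result through the activation function (the standard single-neuron formulation, with the logistic sigmoid as one example; not taken from the slides):

y = \sigma\Bigl(\sum_i w_i x_i + b\Bigr), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}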

Introduction to Artificial Neural Networks

http://neuralnetworksanddeeplearning.com/chap5.html

What is a Deep Neural Network?

http://neuralnetworksanddeeplearning.com/chap5.html

Why Deep Neural Networks?


Backpropagation has a high chance of getting stuck in local minima.
It is very slow in networks with multiple hidden layers.
It requires labeled training data.

Boltzmann Machines
Boltzmann Machines were introduced by Hinton & Sejnowski (1983).
Boltzmann Machines have bidirectional connections.
Each neuron has a binary-valued state (on or off).
Boltzmann Machines learn the complex regularities in the training data.
They use a probabilistic state-transition mechanism.
The learning algorithm is very slow in networks with many layers, which gave rise to Restricted Boltzmann Machines.

Restricted Boltzmann Machines (RBM)

RBMs are Boltzmann Machines with the following restrictions:
There are no connections between any two visible units.
There are no connections between any two hidden units.

With these restrictions, the hidden units are conditionally independent given a visible vector (and, symmetrically, the visible units are conditionally independent given a hidden vector).
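
For reference, the standard energy-based formulation behind this statement (usual RBM notation, not taken from the slides: a_i and b_j are the visible and hidden biases, w_{ij} the weight between visible unit i and hidden unit j):

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j

P(v, h) = \frac{e^{-E(v, h)}}{Z}

Because there are no hidden-hidden connections, E(v, h) is linear in each h_j once v is fixed, so the conditional distribution factorizes:

p(h \mid v) = \prod_j p(h_j \mid v)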

Semi-Supervised Learning

Unlabeled images (all cars/elephants) are used for training; at test time the model is asked to label new images as "elephant" or "car".
Source: Caltech-101

Learning Feature Hierarchy

Akshay N Hegde

Compute Unified Device Architecture (CUDA)

CUDA is a general-purpose parallel computing architecture that allows computation to run in parallel on NVIDIA GPUs.
The CUDA programming model uses C and C++ to create special functions, called kernels, that define data-parallel computations.
Kernels are executed by many threads on the GPU, which operates as a coprocessor/accelerator to the CPU.
To run a kernel, threads must first be organized into blocks that can execute independently of each other.
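
As a minimal sketch of this kernel/grid/block model (illustrative only, not code from [2]; array sizes and launch parameters are arbitrary):

// Each thread applies the logistic sigmoid to one element of an array.
#include <cstdio>
#include <cmath>

__global__ void sigmoidKernel(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) y[i] = 1.0f / (1.0f + expf(-x[i]));
}

int main() {
    const int n = 1024;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));        // unified memory for simplicity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = (i - n / 2) * 0.01f;

    int threadsPerBlock = 256;                       // threads per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    sigmoidKernel<<<blocks, threadsPerBlock>>>(x, y, n);
    cudaDeviceSynchronize();

    printf("y[0]=%f  y[n-1]=%f\n", y[0], y[n - 1]);
    cudaFree(x); cudaFree(y);
    return 0;
}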

RBM Implementation
The joint distribution over a visible vector v and a hidden vector h is P(v, h) = e^{-E(v, h)} / Z, where Z is the partition function given by:

Z = \sum_{v, h} e^{-E(v, h)}

Given a random input v, the probability of hidden unit j being 1 is:

p(h_j = 1 \mid v) = \sigma\Bigl(b_j + \sum_i v_i w_{ij}\Bigr)

where \sigma(x) = \frac{1}{1 + e^{-x}} is the sigmoid function.

Similarly, given a random hidden vector h, visible unit i is set to 1 with probability:

p(v_i = 1 \mid h) = \sigma\Bigl(a_i + \sum_j h_j w_{ij}\Bigr)

[2] N. Lopes, B. Ribeiro, and J. Goncalves, "Restricted Boltzmann Machines and Deep Belief Networks on Multi-Core Processors," in WCCI 2012 IEEE World Congress on Computational Intelligence, Brisbane, Australia, June 10-15, 2012.
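
A minimal CPU-side sketch of the hidden-unit conditional above (function and variable names are assumptions for illustration, not code from [2]; W is stored row-major as a J x I matrix, the layout favoured later for the Compute Status Hidden Units kernel):

// Computes p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * w_ij) for every hidden unit j.
#include <vector>
#include <cmath>

float sigmoidf(float x) { return 1.0f / (1.0f + std::exp(-x)); }

std::vector<float> hiddenProbabilities(const std::vector<float>& v,   // I visible states
                                       const std::vector<float>& W,   // J*I weights, row-major (hidden x visible)
                                       const std::vector<float>& b,   // J hidden biases
                                       int I, int J) {
    std::vector<float> p(J);
    for (int j = 0; j < J; ++j) {
        float sum = b[j];
        for (int i = 0; i < I; ++i) sum += v[i] * W[j * I + i];
        p[j] = sigmoidf(sum);                                         // activation probability of hidden unit j
    }
    return p;
}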

RBM Implementation
Updating the weights and biases, equations (6)-(8), in the standard contrastive-divergence (CD-1) form:

\Delta w_{ij} = \eta \bigl( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} \bigr)
\Delta a_i = \eta \bigl( \langle v_i \rangle_{data} - \langle v_i \rangle_{recon} \bigr)
\Delta b_j = \eta \bigl( \langle h_j \rangle_{data} - \langle h_j \rangle_{recon} \bigr)

where \eta is the learning rate and \langle \cdot \rangle_{data}, \langle \cdot \rangle_{recon} denote averages over the training data and the one-step reconstruction, respectively.
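
A sketch of how these updates might be applied after one CD-1 step (a single-sample illustration; eta and the variable names are assumptions, and momentum, if used, is omitted):

// v0/h0: data-driven visible and hidden states; v1/h1: one-step reconstruction.
// W is J x I row-major, a holds the I visible biases, b the J hidden biases.
#include <vector>

void cd1Update(std::vector<float>& W, std::vector<float>& a, std::vector<float>& b,
               const std::vector<float>& v0, const std::vector<float>& h0,
               const std::vector<float>& v1, const std::vector<float>& h1,
               int I, int J, float eta) {
    for (int j = 0; j < J; ++j)
        for (int i = 0; i < I; ++i)                                    // weight update
            W[j * I + i] += eta * (h0[j] * v0[i] - h1[j] * v1[i]);
    for (int i = 0; i < I; ++i) a[i] += eta * (v0[i] - v1[i]);         // visible bias update
    for (int j = 0; j < J; ++j) b[j] += eta * (h0[j] - h1[j]);         // hidden bias update
}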


Algorithm 1


Algorithm 2


RBM implementation on GPU


RBM kernels:
Compute Status Hidden Units
Compute Status Visible Units
Correct Weights

Sequence of GPU Kernel calls per epoch


Kernel Implementation
Compute Status Hidden Units kernel & Compute Status Visible Units kernel:
Each neuron in the visible or hidden layer is assigned to a block; the values computed by the block's threads are summed using a reduction process, and the block then computes the output of its neuron for the active sample (a simplified sketch follows).
The order in which the weight matrix is placed in memory affects both kernels. The weights are stored as a J x I matrix, which favours the Compute Status Hidden Units kernel because it is executed more often.
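
A simplified sketch of this idea (one block per hidden unit, threads accumulate partial dot products, then a shared-memory tree reduction; an illustration under assumed names, not the kernel from [2]):

// Launch example: computeStatusHiddenUnits<<<J, 256, 256 * sizeof(float)>>>(d_v, d_W, d_b, d_h, I);
// blockDim.x is assumed to be a power of two. h[j] receives the activation
// probability; sampling a binary state from it is omitted here.
__global__ void computeStatusHiddenUnits(const float* v, const float* W,
                                         const float* b, float* h, int I) {
    extern __shared__ float partial[];
    int j = blockIdx.x;                            // one block per hidden neuron
    int tid = threadIdx.x;

    float sum = 0.0f;
    for (int i = tid; i < I; i += blockDim.x)      // strided loop over visible units
        sum += v[i] * W[j * I + i];
    partial[tid] = sum;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) { // tree reduction in shared memory
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        h[j] = 1.0f / (1.0f + expf(-(partial[0] + b[j])));
}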

Correct weights Kernel 1st Method:


The Correct Weights kernel sums the contributions of all samples within each block.
Each thread gathers and sums the values for all the samples, then a reduction process takes place in order to compute the weight and bias updates.


Kernel Implementation
Correct weights Kernel 2nd Method:
Each block has 16 x 16 threads; the first dimension of the block (x) is associated with an input unit i, while the second dimension (y) is associated with a hidden unit j. Each thread within a block processes all the samples (see the sketch below).
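
A hedged sketch of this second scheme (each thread owns one weight and loops over all N samples; the names, the averaged update, and the omission of bias handling and momentum are assumptions, not the kernel from [2]):

// v0/h0: data-driven states, v1/h1: reconstructions, stored sample-major
// (sample s occupies offsets s*I for visible and s*J for hidden states).
// Launch example: dim3 block(16, 16); dim3 grid((I + 15) / 16, (J + 15) / 16);
__global__ void correctWeights(float* W, const float* v0, const float* h0,
                               const float* v1, const float* h1,
                               int I, int J, int N, float eta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // input (visible) unit
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // hidden unit
    if (i >= I || j >= J) return;

    float delta = 0.0f;
    for (int s = 0; s < N; ++s)                      // every thread walks all samples
        delta += h0[s * J + j] * v0[s * I + i] - h1[s * J + j] * v1[s * I + i];

    W[j * I + i] += eta * delta / N;                 // averaged CD-1 weight update
}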


Comparison between two methods

Fig. Proportion of time spent, per epoch, in each task/kernel (as measured on a GTX 280 device).

Limitations Of 1st Approach


Two main problems related to memory access:
Memory accesses are not coalesced, so cache/memory performance is not at its best.
Many blocks try to access exactly the same memory addresses, generating memory conflicts.


Experiment Setup
Data set: MNIST
Number of samples: 60,000
Number of visible units: 784 (28 x 28)
CPU: Intel dual-core i5-2410M with 8 GB of memory
GPU: NVIDIA GeForce GTX 460
The number of hidden units and the number of training samples are varied.


Results
Figures (results): effect of increasing the sample size, and of increasing the number of hidden units (across the horizontal dimension).


Analysis
The GPU speedups obtained range from 22 to 46 times.
For example, with N = 60,000 samples and 800 hidden units, training takes about 40 minutes per epoch on the CPU but only 53 seconds per epoch on the GPU (2400 s / 53 s ≈ 45x, consistent with the stated range).

Factor                  | Change    | Speedup              | Execution time
Number of samples       | Increases | Tremendous increase  | Drastic fall
Number of hidden units  | Increases | Sub-linear increase  | Moderate reduction


Conclusion
Training Deep Belief Network models is time-consuming and computationally expensive.
With the help of GPUs, by taking advantage of their inherently parallel architecture, we can run a large number of experiments in a short period of time.


References
[1] G. O. Young, "Synthetic structure of industrial plastics," in Plastics, 2nd ed., vol. 3, J. Peters, Ed. New York: McGraw-Hill, 1964, pp. 15-64.
[2] N. Lopes, B. Ribeiro, and J. Goncalves, "Restricted Boltzmann Machines and Deep Belief Networks on Multi-Core Processors," in WCCI 2012 IEEE World Congress on Computational Intelligence, Brisbane, Australia, June 10-15, 2012.

Thank You
