
Restricted Boltzmann Machines on Multi-Core

Processors
By
Sai Prasad Nooka
Stavan Karia

Overview
What is machine learning?
Brain Vs Processors
Motivation & Goal
Introduction to Artificial Neural Networks
What is a Deep Neural Network?
Why Deep Neural Network?
Boltzmann Machines
Restricted Boltzmann Machines
Semi-Supervised Learning
Learning Feature Hierarchy
RBM Implementation
Compute Unified Device Architecture (CUDA)
Implementation on GPU
Results
Conclusion

What is Machine Learning?


A model of digit recognition (demo: http://www.cs.toronto.edu/~hinton/adi/index.htm).
This model learns to generate combinations of labels and images.
Architecture (Hinton): 2000 top-level neurons connected to 10 label neurons and to a 500-neuron layer, with a second 500-neuron layer below it and a 28 x 28 pixel image at the bottom.

Brain Vs Processors
The brain is made up of billions of cells, called neurons, with highly parallel and adaptive connections.
We now have a similar number of transistors per chip, but their connections are not adaptive.
Switching time:
Neurons switch at a frequency of about 10 kHz.
Processor switching frequencies are approaching 10 GHz, so in this respect processors are far faster.
Connections:
In the brain, each neuron is interconnected with thousands of other neurons.
In processors, a transistor has at most about 10 connections.

Motivation & Goal

http://publications.csail.mit.edu/abstracts/abstracts07/brussell2/brussell2.html

Motivation & Goal

The goal is to solve practical problems using novel learning algorithms inspired by the brain, and to make computers more user-friendly.
We try to achieve human-like performance on problems such as:
Object Detection
Speech recognition

Introduction to Artificial Neural Networks

Activation function
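
As a reminder, a single neuron combines its inputs with weights and a bias and passes the result through the activation function (the standard single-neuron formulation, with the logistic sigmoid as one example; not taken from the slides):

y = \sigma\Bigl(\sum_i w_i x_i + b\Bigr), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}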

Introduction to Artificial Neural Networks

http://neuralnetworksanddeeplearning.com/chap5.html

What is a Deep Neural Network?

http://neuralnetworksanddeeplearning.com/chap5.html

Why Deep Neural Networks?


Backpropagation has a high chance of getting stuck in local minima.
It is very slow in networks with multiple hidden layers.
It requires labeled training data.

Boltzmann Machines
Boltzmann Machines were introduced by Hinton & Sejnowski (1983).
Boltzmann Machines have bidirectional connections.
Each neuron has a binary-valued state (on or off).
Boltzmann Machines learn the complex regularities in the training data.
They use a probabilistic state-transition mechanism.
The learning algorithm is very slow in networks with many layers, which gave rise to Restricted Boltzmann Machines.

Restricted Boltzmann Machines (RBM)

RBMs are Boltzmann Machines with the following restrictions:
There are no connections between any two visible units.
There are no connections between any two hidden units.

With these restrictions, the hidden units are conditionally independent given a visible vector (and, symmetrically, the visible units are conditionally independent given a hidden vector).
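
For reference, the standard energy-based formulation behind this statement (usual RBM notation, not taken from the slides: a_i and b_j are the visible and hidden biases, w_{ij} the weight between visible unit i and hidden unit j):

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j

P(v, h) = \frac{e^{-E(v, h)}}{Z}

Because there are no hidden-hidden connections, E(v, h) is linear in each h_j once v is fixed, so the conditional distribution factorizes:

p(h \mid v) = \prod_j p(h_j \mid v)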

Semi-Supervised Learning

Unlabeled images (all cars/elephants) are used for training; at test time the model is asked to label new images as "elephant" or "car".
Source: Caltech-101

Learning Feature Hierarchy

Akshay N Hegde

Compute Unified Device Architecture (CUDA)

CUDA is a general-purpose parallel computing architecture that allows computation to run in parallel on NVIDIA GPUs.
The CUDA programming model uses C and C++ to create special functions, called kernels, that define data-parallel computations.
Kernels are executed by many threads on the GPU, which operates as a coprocessor/accelerator to the CPU.
To run a kernel, threads must first be organized into blocks that can execute independently of each other.
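
As a minimal sketch of this kernel/grid/block model (illustrative only, not code from [2]; array sizes and launch parameters are arbitrary):

// Each thread applies the logistic sigmoid to one element of an array.
#include <cstdio>
#include <cmath>

__global__ void sigmoidKernel(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) y[i] = 1.0f / (1.0f + expf(-x[i]));
}

int main() {
    const int n = 1024;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));        // unified memory for simplicity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = (i - n / 2) * 0.01f;

    int threadsPerBlock = 256;                       // threads per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    sigmoidKernel<<<blocks, threadsPerBlock>>>(x, y, n);
    cudaDeviceSynchronize();

    printf("y[0]=%f  y[n-1]=%f\n", y[0], y[n - 1]);
    cudaFree(x); cudaFree(y);
    return 0;
}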

RBM Implementation
The joint distribution over a visible vector v and a hidden vector h is P(v, h) = e^{-E(v, h)} / Z, where Z is the partition function given by:

Z = \sum_{v, h} e^{-E(v, h)}

Given a random input v, the probability of hidden unit j being 1 is:

p(h_j = 1 \mid v) = \sigma\Bigl(b_j + \sum_i v_i w_{ij}\Bigr)

where \sigma(x) = \frac{1}{1 + e^{-x}} is the sigmoid function.

Similarly, given a random hidden vector h, visible unit i is set to 1 with probability:

p(v_i = 1 \mid h) = \sigma\Bigl(a_i + \sum_j h_j w_{ij}\Bigr)

[2] N. Lopes, B. Ribeiro, and J. Goncalves, "Restricted Boltzmann Machines and Deep Belief Networks on Multi-Core Processors," in WCCI 2012 IEEE World Congress on Computational Intelligence, Brisbane, Australia, June 10-15, 2012.
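
A minimal CPU-side sketch of the hidden-unit conditional above (function and variable names are assumptions for illustration, not code from [2]; W is stored row-major as a J x I matrix, the layout favoured later for the Compute Status Hidden Units kernel):

// Computes p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * w_ij) for every hidden unit j.
#include <vector>
#include <cmath>

float sigmoidf(float x) { return 1.0f / (1.0f + std::exp(-x)); }

std::vector<float> hiddenProbabilities(const std::vector<float>& v,   // I visible states
                                       const std::vector<float>& W,   // J*I weights, row-major (hidden x visible)
                                       const std::vector<float>& b,   // J hidden biases
                                       int I, int J) {
    std::vector<float> p(J);
    for (int j = 0; j < J; ++j) {
        float sum = b[j];
        for (int i = 0; i < I; ++i) sum += v[i] * W[j * I + i];
        p[j] = sigmoidf(sum);                                         // activation probability of hidden unit j
    }
    return p;
}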

RBM Implementation
Updating the weights and biases, equations (6)-(8), in the standard contrastive-divergence (CD-1) form:

\Delta w_{ij} = \eta \bigl( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} \bigr)
\Delta a_i = \eta \bigl( \langle v_i \rangle_{data} - \langle v_i \rangle_{recon} \bigr)
\Delta b_j = \eta \bigl( \langle h_j \rangle_{data} - \langle h_j \rangle_{recon} \bigr)

where \eta is the learning rate and \langle \cdot \rangle_{data}, \langle \cdot \rangle_{recon} denote averages over the training data and the one-step reconstruction, respectively.
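
A sketch of how these updates might be applied after one CD-1 step (a single-sample illustration; eta and the variable names are assumptions, and momentum, if used, is omitted):

// v0/h0: data-driven visible and hidden states; v1/h1: one-step reconstruction.
// W is J x I row-major, a holds the I visible biases, b the J hidden biases.
#include <vector>

void cd1Update(std::vector<float>& W, std::vector<float>& a, std::vector<float>& b,
               const std::vector<float>& v0, const std::vector<float>& h0,
               const std::vector<float>& v1, const std::vector<float>& h1,
               int I, int J, float eta) {
    for (int j = 0; j < J; ++j)
        for (int i = 0; i < I; ++i)                                    // weight update
            W[j * I + i] += eta * (h0[j] * v0[i] - h1[j] * v1[i]);
    for (int i = 0; i < I; ++i) a[i] += eta * (v0[i] - v1[i]);         // visible bias update
    for (int j = 0; j < J; ++j) b[j] += eta * (h0[j] - h1[j]);         // hidden bias update
}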


Algorithm 1


Algorithm 2


RBM implementation on GPU


RBM kernels:
Compute Status Hidden Units
Compute Status Visible Units
Correct Weights

Sequence of GPU Kernel calls per epoch


Kernel Implementation
Compute Status Hidden Units kernel & Compute Status Visible Units kernel:
Each neuron in the visible or hidden layer is assigned to a block; the values computed by the block's threads are summed using a reduction process, and the block then computes the output of its neuron for the active sample (a simplified sketch follows).
The order in which the weight matrix is placed in memory affects both kernels. The weights are stored as a J x I matrix, which favours the Compute Status Hidden Units kernel because it is executed more often.
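
A simplified sketch of this idea (one block per hidden unit, threads accumulate partial dot products, then a shared-memory tree reduction; an illustration under assumed names, not the kernel from [2]):

// Launch example: computeStatusHiddenUnits<<<J, 256, 256 * sizeof(float)>>>(d_v, d_W, d_b, d_h, I);
// blockDim.x is assumed to be a power of two. h[j] receives the activation
// probability; sampling a binary state from it is omitted here.
__global__ void computeStatusHiddenUnits(const float* v, const float* W,
                                         const float* b, float* h, int I) {
    extern __shared__ float partial[];
    int j = blockIdx.x;                            // one block per hidden neuron
    int tid = threadIdx.x;

    float sum = 0.0f;
    for (int i = tid; i < I; i += blockDim.x)      // strided loop over visible units
        sum += v[i] * W[j * I + i];
    partial[tid] = sum;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) { // tree reduction in shared memory
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        h[j] = 1.0f / (1.0f + expf(-(partial[0] + b[j])));
}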

Correct weights Kernel 1st Method:


The Correct Weights kernel sums the contributions of all samples within each block.
Each thread gathers and sums the values for all the samples, then a reduction process takes place in order to compute the weight and bias updates.


Kernel Implementation
Correct weights Kernel 2nd Method:
Each block has 16 x 16 threads; the first dimension of the block (x) is associated with an input unit i, while the second dimension (y) is associated with a hidden unit j. Each thread within a block processes all the samples (see the sketch below).
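
A hedged sketch of this second scheme (each thread owns one weight and loops over all N samples; the names, the averaged update, and the omission of bias handling and momentum are assumptions, not the kernel from [2]):

// v0/h0: data-driven states, v1/h1: reconstructions, stored sample-major
// (sample s occupies offsets s*I for visible and s*J for hidden states).
// Launch example: dim3 block(16, 16); dim3 grid((I + 15) / 16, (J + 15) / 16);
__global__ void correctWeights(float* W, const float* v0, const float* h0,
                               const float* v1, const float* h1,
                               int I, int J, int N, float eta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // input (visible) unit
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // hidden unit
    if (i >= I || j >= J) return;

    float delta = 0.0f;
    for (int s = 0; s < N; ++s)                      // every thread walks all samples
        delta += h0[s * J + j] * v0[s * I + i] - h1[s * J + j] * v1[s * I + i];

    W[j * I + i] += eta * delta / N;                 // averaged CD-1 weight update
}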


Comparison between two methods

Fig. Proportion of time spent, per epoch, in each task/kernel (as measured on a GTX 280 device).

Limitations Of 1st Approach


Two main problems related to memory access:
Memory accesses are not coalesced, so cache/memory performance is not at its best.
Many blocks try to access exactly the same memory addresses, generating memory conflicts.


Experiment Setup
Data set: MNIST
Number of samples: 60,000
Number of visible units: 784 (28 x 28)
CPU: Intel dual-core i5-2410M with 8 GB of memory
GPU: NVIDIA GeForce GTX 460
The number of hidden units and the number of training samples are varied.


Results
Figures (results): effect of increasing the sample size, and of increasing the number of hidden units (across the horizontal dimension).


Analysis
The GPU speedups obtained range from 22 to 46 times.
For example, with N = 60,000 samples and 800 hidden units, training takes about 40 minutes per epoch on the CPU but only 53 seconds per epoch on the GPU (2400 s / 53 s ≈ 45x, consistent with the stated range).

Factor                  | Change    | Speedup              | Execution time
Number of samples       | Increases | Tremendous increase  | Drastic fall
Number of hidden units  | Increases | Sub-linear increase  | Moderate reduction


Conclusion
Training Deep Belief Network models is time-consuming and computationally expensive.
With the help of GPUs, by taking advantage of their inherently parallel architecture, we can run a large number of experiments in a short period of time.


References
[1] G. O. Young, "Synthetic structure of industrial plastics," in Plastics, 2nd ed., vol. 3, J. Peters, Ed. New York: McGraw-Hill, 1964, pp. 15-64.
[2] N. Lopes, B. Ribeiro, and J. Goncalves, "Restricted Boltzmann Machines and Deep Belief Networks on Multi-Core Processors," in WCCI 2012 IEEE World Congress on Computational Intelligence, Brisbane, Australia, June 10-15, 2012.

Thank You
