
Intellectual Property Acknowledgement

You, the author of this white paper, agree that you have not used any copyrighted material in this white paper or have appropriately indicated the copyright of the respective object. You further agree that you have not used any other intellectual property of any third party in this white paper or have appropriately indicated the intellectual property of the respective object.
You confirm, consistent with the terms of your employment with Accenture, that any rights, title and interest whatsoever, including, but not limited to, patents, copyright, trade secret and design rights, mask rights, whether registrable or not, arising or created as a result of the development and/or application of any tangible or intangible work product or materials produced by you during or as a consequence of your employment (including in this white paper), whether alone or in conjunction with others and whether during normal working hours or not, including, but not limited to, any invention, design, discovery, improvement, computer program, documentation, or other material (including in this white paper) which you conceive, discover or create during or in consequence of employment hereunder ("Work Product") shall belong exclusively to Accenture and its affiliates. You hereby convey ownership of such rights, title and interest to Accenture and its affiliates upon inception or development.
All Work Product shall constitute work(s) made for hire under all copyright acts. To the extent that any Work Product does not constitute a work made for hire under the foregoing laws, you hereby irrevocably assign all worldwide right, title, and interest (including, without limitation, patents, copyright, trade secret, trademarks, design rights, contract and licensing rights) in such Work Product to Accenture and its affiliates. You retain no rights to use the Work Product and agree not to challenge the validity of Accenture's and its affiliates' ownership of the Work Product. You hereby forever waive all moral rights in the Work Product and any results or proceeds therefrom, even after expiration or termination of your employment hereunder. You agree that you will not violate or attempt to violate the intellectual property rights (IPR), interests or title of any third party. Accenture shall be entitled to immediate injunctive or similar relief upon a potential or actual breach of this section by you. You agree to cooperate with Accenture and its affiliates as necessary and to provide any documentation consistent with and in furtherance of the rights conveyed hereunder.
By submitting this white paper, you agree that you have read and understood the disclaimer set out above, understand that it may affect your rights, and agree to be bound by these terms.

4/18/2014

Ver 1.0

1 of 10

Submitter Credentials

Name of the Employee: Brindha Krishnan
Employee SAP ID: 10621010
Enterprise ID: brindha.krishnan
Current Project Name: EMC PPMG
Topic Category: D3
Industry Group (if locked to one): CMT
Technology Platform (if locked to a capability organization): Performance Testing

Name of the Employee: Krunal Gosalia
Employee SAP ID: 10847000
Enterprise ID: k.sandeep.gosalia
Current Project Name: EMC PPMG
Topic Category: D3
Industry Group (if locked to one): CMT

Note: The above data being solicited is for Internal use & referential purposes only.


Big Data Performance Testing: The SandStorm Way


1. Abstract
In the modern era, the intrinsic needs of social organisations (institutions, communities, and businesses) are constantly increasing. Organisations are forced to rely on innovative data analysis techniques in order to compete and grow, and this task only gets tougher with the proliferation of interconnected devices and the Internet. The billions of data points transmitted by consumers provide a rich source of information for targeting customer segments and predicting industry trends: welcome to the era of Big Data!
The sheer volume of data coming into the mainstream of business is forcing enterprises to validate and test their Big Data projects. While scale is the promise of Big Data technologies, reliability under specific use cases and strict SLA conditions is paramount. Ensuring that a Big Data application is reliable and scalable is therefore a critical business consideration.
For the smooth operation of any Big Data project, it is crucial to define a robust performance testing strategy and to ensure optimal performance of all components of the application. This white paper provides an overview of the essential components and the process of performance testing a Big Data application using an automated tool called SandStorm.

2. Introduction

2.1. Big Data:
In today's world of fast-growing technologies, Big Data is the buzzword of the decade. The term "Big Data" describes volumes of structured and unstructured data so massive that they are difficult to process using traditional database and software techniques. In most enterprise scenarios, the data is too big (in petabytes or exabytes), moves too fast, or exceeds current processing capacity, and it is often so loosely structured that it is incomplete and inaccessible.
2.2. Big Data Technologies:
Typically, a Big Data application makes use of the following set of technologies:

MapReduce: MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. E.g. Apache Hadoop, Cloudera.
NoSQL databases: A NoSQL database provides a mechanism for the storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. E.g. MongoDB, Apache HBase.
Message queues: A publish-subscribe messaging system used to feed data into a system from multiple sources. E.g. Kafka, RabbitMQ, ActiveMQ.
Search components: E.g. Elasticsearch.
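As an illustration of the MapReduce programming model mentioned above, the following minimal Python sketch mimics the map and reduce phases locally. The function names are illustrative only, not Hadoop's API; a real Hadoop job would distribute these phases across a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data moves fast", "big data is big"]
word_counts = reduce_phase(map_phase(docs))
```

In a distributed setting, the map outputs would be partitioned by key across many reducer nodes; the local version above preserves only the programming model, not the parallelism.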


2.3. Need for Big Data Performance Testing:
Big Data projects involve processing huge volumes of structured and unstructured data, typically across multiple nodes, to complete jobs in less time. Poor architecture and badly designed code can degrade performance, defeating the very purpose of setting up Hadoop, MongoDB or any other Big Data technology. Performance testing therefore plays a key role in any Big Data project, given the huge volumes of data and the complex architecture involved.

3. Big Data Performance Testing

3.1. Big Data Performance Testing Challenges:
Performance testing Big Data is challenging because of a lack of knowledge about what to test, how to test it, and how much data to test. Further challenges lie in defining strategies for validating the performance of individual sub-components, creating an appropriate test environment, and working with NoSQL and other systems. These challenges lead to poor quality in production, delayed implementation, and increased cost. Let's look at some of them in more detail:

Diverse set of technologies: A Big Data application involves many sub-components, each belonging to a different technology. We need to test each component in isolation, but no single tool supports all of the technologies.

Unavailability of specific tools: No single tool can cater to every technology. For example, database testing tools for NoSQL might not fit message queues. Similarly, custom tools and utilities are required to test MapReduce jobs.

Test scripting: There are no record-and-playback mechanisms for such systems. A high degree of scripting is required to design test cases and scenarios.

Test environment: It might not always be feasible, because of cost and scale, to create a performance testing environment that simulates production usage. We need a scaled-down version, with all the components, sufficient to predict realistic performance.

Monitoring solutions: Since each component exposes performance metrics in a different way, few solutions exist that can monitor the entire environment for performance anomalies and detect issues.

Diagnostic solutions: Custom solutions need to be developed to drill further into performance bottleneck areas.


3.2. Big Data Performance Testing Focus Areas:
Unlike traditional web applications, Big Data systems focus on an entirely different set of performance testing areas:

Data ingestion and throughput: Data ingestion is done through messaging queues such as Kafka, RabbitMQ, and ZeroMQ, so it is very important that the queues perform optimally and deliver maximum throughput.

Data processing: Data processing refers to the speed with which queries and/or MapReduce jobs execute to generate the resultant data set, on which analytics can be run for business reports and analysis.

Data persistence: It is important to conduct a performance evaluation of different databases and select the one that suits the application's performance requirements. We also need to identify how quickly data can be inserted into the underlying data store. For example, what is the insertion rate into a MongoDB or Cassandra database?
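To make the persistence metric concrete, the sketch below measures insert throughput in operations per second. The in-memory list is a deliberate stand-in for a real data store; in practice you would replace it with a client for the actual system (e.g. a pymongo collection or a Kafka producer), and the numbers reported would then reflect the live cluster.

```python
import time

def measure_insert_throughput(n_ops=100_000):
    # Stand-in for a real data store; swap in a MongoDB collection
    # or Kafka producer to measure a live system instead.
    store = []
    start = time.perf_counter()
    for i in range(n_ops):
        store.append({"id": i, "payload": "x" * 32})
    elapsed = time.perf_counter() - start
    return n_ops / elapsed  # operations per second

throughput = measure_insert_throughput()
```

Running the same harness against two candidate stores (say MongoDB and Cassandra) under identical workloads gives the comparative insertion rates the paragraph above asks for.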

3.3. Big Data Performance Testing Approach:
The procedural approach to load testing Big Data applications is as follows:

i. The process starts with setting up the Big Data cluster that is to be tested for performance.

ii. Depending on the typical usage of the components, we need to identify and create corresponding workloads. For example, a typical workload could perform inserts 90% of the time, with the remainder being read operations.


iii. After identifying the workload, the next step is to create and prepare the custom scripts for a single user.

iv. Tests are executed to simulate realistic usage, and the results are analyzed for issues and bottlenecks.

v. Based on the results, we tune the cluster and re-execute the tests until the maximum performance is achieved.
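The workload identified in step ii (90% inserts, 10% reads) can be sketched as a simple operation chooser. This is an illustrative harness, not SandStorm's scripting language; the dictionary stands in for the real cluster, and a seeded random generator makes runs reproducible.

```python
import random

def run_workload(n_ops=1000, insert_ratio=0.9, seed=42):
    # Drive a mixed insert/read workload against a stand-in store.
    rng = random.Random(seed)
    store = {}
    inserts = reads = 0
    for i in range(n_ops):
        if rng.random() < insert_ratio or not store:
            store[i] = f"value-{i}"          # insert operation
            inserts += 1
        else:
            _ = store[rng.choice(list(store))]  # read operation
            reads += 1
    return inserts, reads

inserts, reads = run_workload()
```

Adjusting insert_ratio lets the same harness express other mixes (e.g. read-heavy analytics workloads) before scaling them up to multi-user tests.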

3.4. Big Data Performance Testing Solutions:

YCSB: The Yahoo! Cloud Serving Benchmark (YCSB) is a testing client that performs reads, writes and updates according to specified workloads. Running from the command line, it can create an arbitrary number of threads that query the system under test. It measures throughput in operations per second and records the latency of these operations. YCSB can run in parallel from multiple hosts.

JMeter: JMeter provides plugins for applying load to Cassandra. The plugin acts as a Cassandra client, can send requests over Thrift, and is fully configurable. After creating the thread group, you can confirm that the Cassandra JMeter plugin has loaded correctly: select the Thread Group, right-click to open the menu, then select Add, then Sampler. The seven Cassandra samplers should appear in the list.

Independent custom utilities: e.g. the Cassandra stress test.

SandStorm: SandStorm is an enterprise performance testing solution that enables us to predict the scalability, reliability, and performance issues of Big Data applications.
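The YCSB pattern described above (multiple threads issuing operations while throughput and per-operation latency are recorded) can be sketched in a few lines of Python. This is a toy reimplementation of the idea, not YCSB itself; the in-memory dictionary is a placeholder for the actual system under test.

```python
import statistics
import threading
import time

def ycsb_style_run(n_threads=4, ops_per_thread=1000):
    # Each worker issues operations against the store and records
    # per-operation latency, as YCSB does for a real database.
    store, latencies, lock = {}, [], threading.Lock()

    def worker(tid):
        local = []
        for i in range(ops_per_thread):
            t0 = time.perf_counter()
            store[(tid, i)] = "payload"      # the operation under test
            local.append(time.perf_counter() - t0)
        with lock:
            latencies.extend(local)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return {
        "ops": len(latencies),
        "throughput_ops_s": len(latencies) / elapsed,
        "mean_latency_s": statistics.mean(latencies),
    }

results = ycsb_style_run()
```

Replacing the dictionary assignment with a real client call (and the latency list with a histogram) is essentially what the tools in this section do at scale.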

4. Big Data Performance Testing Using SandStorm: A Case Study

4.1. Requirement:
Performance test all the components of a Big Data application and suggest ways to improve its performance.

4.2. Challenge:
Any Big Data application comprises components such as queues, data processing frameworks and data storage systems. The underlying architecture and technologies for each of these components differ; for example, the data store may be MongoDB, Cassandra or HBase. The way in which each of these components accepts and processes data also differs, which makes performance testing these diverse technologies a tough task. Even though the individual products provide custom utilities for load testing, the challenge lies in load testing the system as a whole.


4.3. Solution:
The load testing tool SandStorm fits this picture perfectly. In the earlier sections we discussed the focus areas for Big Data performance testing, which comprise three major components that have to be performance tested for any Big Data application: data ingestion (queues), data processing (MapReduce/ETL) and data persistence (data store/NoSQL).
To provide an effective technique for validating these components, SandStorm has built-in features to record/script the data input, execute the load tests, and report and analyze the results. Below are a few features offered by SandStorm:

- Facilitates optimization of infrastructure usage by tuning existing components and servers.
- Helps save time and effort through a ready-to-use solution for performance testing of Big Data applications.
- Easily creates test scripts for different technologies such as Cassandra, MongoDB and Kafka.
- Designs test scenarios quickly using UI wizards, mixing different activities.
- Supports monitoring of multiple Big Data and NoSQL technologies, such as Apache Hadoop, Cassandra, MongoDB, Oracle NoSQL, Kafka and ActiveMQ, alongside popular J2EE servers and operating systems.
- Creates custom scripts for different technologies.
- Offers a cloud platform for scalable, high-volume testing.
- Offers powerful analytics.

4.4. Implementation:

4.4.1. Architecture:
SandStorm provides a Recorder component, a Controller component, and a scripting interface, which together are used to load and stress test any Big Data stack.


i. Recorder: The Recorder component quickly creates test scripts for end-to-end workflows and for performance testing the entire application. These scripts can be used to test any NoSQL or messaging component in the Big Data cluster.

ii. Controller: The Controller component creates and designs test scenarios. Additionally, SandStorm provides resource monitoring of the Big Data cluster to identify performance issues during test execution.

iii. Command Launcher: Command launchers are used for distributed load generation in a test scenario. As shown in the diagram below, more than one command launcher may be used, depending on the concurrency. Command launchers send results back to the Controller for real-time dashboards and result storage.

iv. Analyzer: After execution, the results can be analysed using the Analyzer component, which generates multiple reports and graphs for quick and easy analysis.


4.4.2. SandStorm Performance Testing Approach:

i. Configuration settings for supported technologies: Add the configuration as per the requirement (e.g. an ActiveMQ configuration for testing the messaging component) and fill in the details for the selected configuration.

ii. Adding a test script: Add a test script, or a blank test script, by right-clicking on the Init, Action or End node.

iii. Selecting the request type: Select the request type to be tested and fill in the appropriate properties. E.g. ActiveMQ Producer is selected to test the ActiveMQ component.

iv. Parameterization: Parameterize the dynamic values so the script can run for multiple iterations.
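SandStorm's parameterization screens are not reproduced here; as a generic illustration of the idea, the sketch below substitutes per-iteration values into a message template the way a data-driven test script would. The field names and sample rows are hypothetical.

```python
import string

# A message template with dynamic placeholders, as a test script
# might define for each request body.
template = string.Template('{"orderId": "$order_id", "user": "$user"}')

# Per-iteration parameter rows; in a real test these would
# typically be loaded from a CSV or generated on the fly.
rows = [
    {"order_id": "1001", "user": "alice"},
    {"order_id": "1002", "user": "bob"},
]

# Each iteration sends a message with that iteration's values.
messages = [template.substitute(row) for row in rows]
```

This is what parameterization buys you: each virtual-user iteration sends distinct, realistic data instead of replaying one recorded value.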


v. Execution and analysis: The tests are executed from the Controller component. After execution, the results can be analysed using the Analyzer component and used to find application bottlenecks.

5. Conclusion
Every organisation chooses the performance testing solution that best suits its requirements to deliver a better-performing application. SandStorm is a tool that helps us validate the performance of a diverse set of data storage architectures. It helps create test scripts for end-to-end workflows, execute load tests both on premise and in the cloud, and analyse the results. These scripts can be used to test any NoSQL or messaging component in the Big Data cluster. SandStorm also supports monitoring of various Big Data technologies, including Hadoop, Cassandra, MongoDB, HBase, Kafka, ActiveMQ and RabbitMQ.

6. Disclaimer
The information contained in this document represents the views of the author(s) and
Accenture is not liable to any party for any direct/indirect consequential damages.

7. References
1. http://sandstorm.impetus.com
2. http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
3. http://caffeet.org/2013/05/10/the-advent-of-big-data-a-revolution/
4. http://www.appdynamics.com/blog/apm/big-data-monitoring/
5. http://en.wikipedia.org/wiki/Big_data

