
International Journal of Engineering Trends and Technology- May to June Issue 2011

PERFORMANCE OF MAP REDUCE


V. Anitha Moses, Professor, Department of Computer Application, Panimalar Engineering College, Chennai

B. Palanivel, PG Scholar, Department of Computer Application, Panimalar Engineering College, Chennai

S. Srinidhi, Asst. Prof., Department of MCA, Panimalar Engineering College, Chennai

ABSTRACT
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

INTRODUCTION
MapReduce-based systems are increasingly being used for large-scale data analysis. There are several reasons for this. First, the interface of MapReduce is simple yet expressive. Although MapReduce only involves two functions, map() and reduce(), a number of data analytical tasks, including traditional SQL queries, data mining, machine learning, and graph processing, can be expressed with a set of MapReduce jobs. Second, MapReduce is flexible: it is designed to be independent of storage systems and is able to analyze various kinds of data, structured and unstructured. Finally, MapReduce is scalable; installation of MapReduce on a 4,000-node shared-nothing cluster has been reported [2]. MapReduce also provides fine-grain fault tolerance, whereby only tasks on failed nodes need to be restarted.

Traditionally, the large-scale data analysis market has been dominated by Parallel Database systems. The popularity of MapReduce gives rise to the question of whether there are fundamental differences between MapReduce-based and Parallel Database systems. Along this direction, a comparative evaluation of the two systems has been reported across many dimensions, including schema support, data access methods, fault tolerance, and so on. The authors also introduced a benchmark to evaluate the performance of both systems. The results showed that the observed performance of a Parallel Database system is much better than that of a MapReduce-based system, and the authors speculated about possible architectural causes for the performance gap. For instance, MapReduce-based systems need to repetitively parse records since they are designed to be independent of the storage system; thus, parsing introduces performance overhead.



HADOOP
Hadoop consists of the Hadoop Common package, which provides access to the filesystems supported by Hadoop. The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section that includes projects from the Hadoop community. A key feature is that, for effective scheduling of work, every filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to run work on the node where the data is and, failing that, on the same rack/switch, thereby reducing backbone traffic. The HDFS filesystem uses this when replicating data, to try to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure so that, even if these events occur, the data may still be readable. A typical Hadoop cluster includes a single master and multiple slave nodes. The master node consists of a jobtracker, tasktracker, namenode, and datanode; a slave or compute node consists of a datanode and tasktracker. Hadoop requires JRE 1.6 or higher, and the standard startup and shutdown scripts require ssh to be set up between the nodes in the cluster. While Microsoft Windows and OS X are supported for development, as of April 2011 there are no public claims of their use on large production clusters.

The HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. Each node in a Hadoop instance typically has a single datanode; a cluster of datanodes forms the HDFS cluster. This is typical rather than mandatory, because a node is not required to have a datanode present. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses the TCP/IP layer for communication; clients use RPC to communicate with each other. The HDFS stores large files (an ideal file size is a multiple of 64 MB) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value of 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.

The HDFS is not fully POSIX compliant because the requirements for a POSIX filesystem differ from the target goals of a Hadoop application; the trade-off of not having a fully POSIX-compliant filesystem is increased data throughput. The HDFS was designed to handle very large files, but it does not provide high availability. The filesystem requires one unique server, the namenode, which is a single point of failure for an HDFS installation. If the namenode goes down, the filesystem is offline; when it comes back up, the namenode must replay all outstanding operations, and this replay process can take over half an hour for a big cluster. The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the Primary Namenode goes offline, the Secondary Namenode takes over; in fact, the Secondary Namenode only periodically checkpoints the Primary Namenode's metadata and cannot take over its role.
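To make the client-side view of these defaults concrete, the following is a small illustrative sketch (not from the paper) that uses the HDFS Java API to write a file with an explicit replication factor of 3 and a 64 MB block size; the path and the text written are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // reads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);              // handle to the configured filesystem (HDFS here)

    Path file = new Path("/user/example/sample.txt");  // hypothetical HDFS path
    short replication = 3;                             // two copies on one rack, one on another
    long blockSize = 64L * 1024 * 1024;                // 64 MB, the classic HDFS block size
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);

    FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
    out.writeUTF("hello hdfs");                        // namenode records metadata; datanodes store the blocks
    out.close();
  }
}

Reading the file back with fs.open(file) would consult the namenode for block locations and then stream the data directly from the datanodes.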

MAPREDUCE
MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers. The framework is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. MapReduce is a framework for processing huge datasets for certain kinds of distributable problems using a large number of computer nodes, collectively referred to as a cluster if all nodes use the same hardware, or as a grid if the nodes use different hardware. Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured).
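To make the map() and reduce() roles concrete, here is a minimal word-count sketch written against the Hadoop MapReduce Java API. This is an illustration added here, not part of the original paper; it assumes a Hadoop 2.x-style client where Job.getInstance is available, and the class and job names are arbitrary.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map(): emits an intermediate (word, 1) pair for every token in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce(): sums all intermediate counts associated with the same word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner cuts down intermediate traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Submitted with a command such as "hadoop jar wordcount.jar WordCount <input dir> <output dir>", the map phase would emit (word, 1) pairs across the cluster and the reduce phase would sum them per word.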

CLUSTERING
Clustering allows us to run an application on several parallel servers (cluster nodes). The load is distributed across the different servers, and even if one of the servers fails, the application is still accessible via the other cluster nodes. Clustering is crucial for scalable enterprise applications, as performance can be improved simply by adding more nodes to the cluster. In this sense, a cluster is a set of nodes, where a node is a JBoss server instance; thus, to build a cluster, several JBoss instances have to be grouped together into what is known as a partition. On the same network we may have several clusters, so to differentiate them each cluster must have an individual name.

In data analysis, clustering can be considered the most important unsupervised learning problem; as with every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering is the process of organizing objects into groups whose members are similar in some way; a small sketch of this idea follows at the end of this section.

Finally, in writing instruction, clustering is a nonlinear activity that generates ideas, images, and feelings around a stimulus word. As students cluster, their thoughts tumble out, enlarging their word bank for writing and often enabling them to see patterns in their ideas. Clustering may be a class or an individual activity.
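The unsupervised-learning sense of clustering above is only described informally, so the following is a hedged sketch (not from the paper) of one-dimensional k-means: the classic assign-then-update loop that groups unlabeled values so that the members of each group are close to one another. The initialization and the toy data are illustrative assumptions.

import java.util.Arrays;

public class KMeans1D {
  public static double[] cluster(double[] data, int k, int iterations) {
    double[] centroids = Arrays.copyOf(data, k);        // naive init: first k points as centroids
    int[] assignment = new int[data.length];
    for (int iter = 0; iter < iterations; iter++) {
      // assignment step: each point joins its nearest centroid
      for (int i = 0; i < data.length; i++) {
        int best = 0;
        for (int c = 1; c < k; c++) {
          if (Math.abs(data[i] - centroids[c]) < Math.abs(data[i] - centroids[best])) best = c;
        }
        assignment[i] = best;
      }
      // update step: each centroid moves to the mean of its members
      double[] sum = new double[k];
      int[] count = new int[k];
      for (int i = 0; i < data.length; i++) {
        sum[assignment[i]] += data[i];
        count[assignment[i]]++;
      }
      for (int c = 0; c < k; c++) {
        if (count[c] > 0) centroids[c] = sum[c] / count[c];
      }
    }
    return centroids;
  }

  public static void main(String[] args) {
    double[] heights = {150, 152, 151, 180, 182, 181};          // toy data
    System.out.println(Arrays.toString(cluster(heights, 2, 10))); // prints roughly [151.0, 181.0]
  }
}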




EXISTING SYSTEM
In the existing system, job completion times are not tracked; large jobs or heavy users can monopolize the cluster; error latencies occur in long-running tasks; there is no load balancing; and time delays result. MapReduce systems face enormous challenges due to increasing growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. In this paper, we build the case for going beyond benchmarks for MapReduce performance evaluations. We analyze and compare two production MapReduce traces to develop a vocabulary for describing MapReduce workloads. We show that existing benchmarks fail to capture the rich workload characteristics observed in traces, and propose a framework to synthesize and execute representative workloads. We demonstrate that performance evaluations using realistic workloads give cluster operators new ways to identify workload-specific resource bottlenecks and to make workload-specific choices of MapReduce task schedulers. We expect that, once available, workload suites will allow cluster operators to accomplish previously challenging tasks beyond what we can now imagine, thus serving as a useful tool to help design and manage MapReduce systems.

DATASET
A data set is a collection of data, usually presented in tabular form. Each column represents a particular variable, and each row corresponds to a given member of the data set in question, listing its values for each of the variables, such as the height and weight of an object or the values of random numbers. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows. A data set has several characteristics that define its structure and properties, including the number and types of the attributes or variables and the various statistical measures that may be applied to them, such as standard deviation and kurtosis. In the simplest case there is only one variable, and the data set then consists of a single column of values, often represented as a list. In spite of the name, such a univariate data set is not a set in the usual mathematical sense, since a given value may occur multiple times. Normally the order does not matter, and the collection of values may then be considered a multiset rather than an (ordered) list. The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement. For each variable, the values will normally all be of the same kind. However, there may also be "missing values", which need to be indicated in some way.
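As a small illustrative sketch (not from the paper), the following represents a univariate data set as a single column of values, uses Double.NaN to stand in for "missing values", and computes two simple statistical measures over the non-missing entries; the data values are hypothetical.

public class UnivariateStats {
  public static void main(String[] args) {
    // heights in centimeters; NaN marks a missing value
    double[] heights = {172.0, 165.5, Double.NaN, 180.2, 168.9};

    int n = 0;
    double sum = 0;
    for (double v : heights) {
      if (!Double.isNaN(v)) { n++; sum += v; }
    }
    double mean = sum / n;

    double sqDev = 0;
    for (double v : heights) {
      if (!Double.isNaN(v)) sqDev += (v - mean) * (v - mean);
    }
    double stdDev = Math.sqrt(sqDev / (n - 1));   // sample standard deviation

    System.out.printf("n=%d mean=%.2f stddev=%.2f%n", n, mean, stdDev);
  }
}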

PROPOSED SYSTEM
The main insights from our analysis are that: (i) job completion times and cluster allocation patterns follow a long-tailed distribution and require fair job schedulers to prevent large jobs or heavy users from monopolizing the cluster; (ii) better diagnosis and recovery approaches are needed to reduce error latencies in long-running tasks; (iii) evenly-balanced load across most jobs implies that peer comparison is a suitable strategy for anomaly detection, as described in our previous work; and (iv) low variability in user behavior over short periods of time allows us to exploit temporal locality to predict job completion times. The classified data are stored in different databases in different styles.

Under different levels of database tables, a valid user can access the authorized attributes through multiple authentication steps. These authentication processes are mutual, so the scheme is secure against spoofing and masquerading attacks. Intermediate values associated with the same key are merged on distributed compute nodes; the runtime system takes care of data partitioning, scheduling, load balancing, fault tolerance, and network communication. The simple interface of MapReduce allows programmers to easily design parallel and distributed applications.

REPORTS GENERATED

CONCLUSION
We analyzed Hadoop logs from the 400-node M45 supercomputing cluster which Yahoo! made freely available to select universities for systems research. Our study tracks the evolution in cluster utilization patterns from its launch at Carnegie Mellon University in April 2008 to April 2009. Job completion times and cluster allocation patterns followed a long-tailed distribution, motivating the need for fair job schedulers [5] to prevent large jobs or heavy users from monopolizing the cluster. We also observed large error latencies in some long-running tasks, indicating that better diagnosis and recovery approaches are needed. Users tended to run the same job repeatedly over short intervals of time, thereby allowing us to exploit temporal locality to predict job completion times. We compared the effectiveness of a distance-weighted algorithm against a locally-weighted linear algorithm at predicting job completion times when we scaled the map input sizes of incoming jobs; locally-weighted linear regression performs better, with a mean relative prediction error of 26%.

The MapReduce programming model has been successfully used at Google for many different purposes. We attribute this success to several reasons. First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. Second, a large variety of problems are easily expressible as MapReduce computations. For example, MapReduce is used for the generation of data for Google's production web search service, for sorting, for data mining, for machine learning, and for many other systems. Third, we have developed an implementation of MapReduce that scales to clusters comprising thousands of machines. The implementation makes efficient use of these machine resources and is therefore suitable for many of the large computational problems encountered at Google.
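The paper does not include the prediction code, so the following is only a rough sketch of one-dimensional locally-weighted linear regression: it predicts a completion time for a new job by fitting a line to historical (input size, completion time) pairs, weighting nearby jobs more heavily. The bandwidth tau and the toy data are illustrative assumptions, not values from the study.

public class LwlrPredictor {
  // x: historical map input sizes, y: observed completion times,
  // query: input size of the new job, tau: bandwidth controlling how local the fit is.
  public static double predict(double[] x, double[] y, double query, double tau) {
    double sw = 0, sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < x.length; i++) {
      double d = x[i] - query;
      double w = Math.exp(-(d * d) / (2 * tau * tau));  // Gaussian kernel weight
      sw += w; sx += w * x[i]; sy += w * y[i];
      sxx += w * x[i] * x[i]; sxy += w * x[i] * y[i];
    }
    double denom = sw * sxx - sx * sx;
    if (Math.abs(denom) < 1e-12) return sy / sw;        // degenerate case: weighted mean
    double slope = (sw * sxy - sx * sy) / denom;        // weighted least-squares line
    double intercept = (sy - slope * sx) / sw;
    return intercept + slope * query;
  }

  public static void main(String[] args) {
    double[] sizesGb = {1, 2, 4, 8, 16};                 // toy job history
    double[] minutes = {3, 5, 9, 18, 35};
    System.out.println(predict(sizesGb, minutes, 6, 4)); // roughly between 9 and 18 minutes
  }
}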

REFERENCES
1. J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107-113, 2008.
2. Wikipedia, "Clustering," http://en.wikipedia.org/wiki/Clustering.
3. Hadoop, "Powered by Hadoop," http://wiki.apache.org/hadoop/PoweredBy.
4. http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/0,,contentMDK:20699301~pagePK:64214825~piPK:64214943~theSitePK:469382,00.html.
5. R. Sahoo, M. Squillante, A. Sivasubramaniam, and Y. Zhang, "Failure data analysis of a large-scale heterogeneous server environment," in Dependable Systems and Networks, Florence, Italy, Jun. 2004.
6. Yahoo!, "Hadoop capacity scheduler," 2008, https://issues.apache.org/jira/browse/HADOOP-3445.
7. M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: Fair scheduling for distributed computing clusters," in ACM Symposium on Operating Systems Principles, Big Sky, Montana, Oct. 2009, pp. 261-276.
8. The Apache Software Foundation, "The Map/Reduce Tutorial," 2008, http://hadoop.apache.org/common/docs/current/mapred_tutorial.html.
