
Next-generation technologies (the best way to jump into a parade is to jump in front of one that is already going)

We are going to talk about the framework that backs the technological infrastructure of the biggest players of the Internet world; some of them appear in the image below. These are just a few of the big names; there are many more on the list.

[Image: logos of some of the companies that run Hadoop]

Here we are talking about a next-generation computing technology that offers scalability, fault tolerance, and much more. The term "cloud" will not be unheard of for you, but here I am going to talk about a technology that is set to be the backbone of cloud and distributed computing. Now you may be thinking: what is that technology? It is called Hadoop. The best thing about it is that it is open source and readily available, so you can contribute to it, experiment with it, and use it. As the Apache website says:

"The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures."

Let's talk about some of its best features first:
- High scalability
- High availability
- High performance
- Multi-dimensional data storage
- Distributed storage

Let's first look at some scenarios in the Internet world:

How much data do you think an Internet player needs to process? Do you know how much data Twitter processes daily? It is about 7 TB per day. How much time would it take a general computer just to read that much data?

About 4 hours, and that is just for reading, not processing. (The arithmetic works out if you assume an aggregate read throughput of roughly 500 MB/s: 7,000,000 MB ÷ 500 MB/s ≈ 14,000 seconds, which is close to 4 hours.) Can you imagine processing all of Twitter's data? Would it not take years?

So here Hadoop comes into play: it has sorted a petabyte of data in 16.25 hours and a terabyte in 62 seconds. Is it not a good choice? Yes, it surely is. Likewise, think about the amount of data Facebook, Google, and Amazon need to process daily. The best thing about a Hadoop setup is that you do not need special, costly, high-end servers; rather, you can build a Hadoop cluster out of commodity computers. Keep adding computers, and you keep increasing storage and processing power; the sketch below shows what that looks like in practice.
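To make "keep adding computers" concrete, here is a minimal, hedged sketch following the Hadoop 1.x-era file layout (the hostnames are hypothetical). The master keeps one worker hostname per line in conf/slaves, and every node points at the same NameNode through fs.default.name in conf/core-site.xml:

    # conf/slaves (on the master): one worker hostname per line
    worker01.example.com
    worker02.example.com
    # the line below is a newly added commodity machine
    worker03.example.com

    <!-- conf/core-site.xml (the same on every node) -->
    <configuration>
      <property>
        <name>fs.default.name</name>  <!-- called fs.defaultFS in newer releases -->
        <value>hdfs://master.example.com:9000</value>
      </property>
    </configuration>

Once the new machine is up, starting its daemons (bin/hadoop-daemon.sh start datanode, and likewise for the tasktracker, in the 1.x scripts) adds its disk and CPU to the cluster; no application code changes at all.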

So, ultimately, here are some points answering "Why Hadoop?":
- Need to process multi-petabyte datasets.
- It is expensive to build reliability into each application.
- Nodes fail every day; failure is expected, rather than exceptional.
- The number of nodes in a cluster is not constant.
- Need for a common infrastructure: efficient, reliable, and open source (Apache License).
- The above goals are the same as Condor's, but the workloads are I/O bound, not CPU bound.

Hadoop basically rests on the following components:

1. Hadoop Common (the base): Hadoop Common is a set of utilities that support the other Hadoop subprojects. It includes the FileSystem abstraction, RPC, and serialization libraries. A small example of the FileSystem abstraction follows.
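Here is a minimal sketch in Java of that FileSystem abstraction at work; it writes a file and reads it back through whatever file system the configuration points at, local disk or HDFS (the path /user/demo/hello.txt is just an illustration):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
      public static void main(String[] args) throws Exception {
        // Configuration reads core-site.xml from the classpath; the
        // fs.default.name property there decides which FileSystem we get.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file (blocks are replicated automatically on HDFS).
        Path file = new Path("/user/demo/hello.txt");
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("Hello, HDFS\n");
        out.close();

        // Read it back through the same abstraction.
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();
      }
    }

The same program runs unchanged against a single laptop or a thousand-node cluster; only the configuration differs.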

2. HDFS (Hadoop Distributed File System) (the storage): HDFS is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

3. Map-Reduce (the code): Hadoop Map-Reduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes (see the word-count sketch at the end of this section).

So what is it used for?

1. Internet-scale data:
   a. Web logs: years of logs, at terabytes per day.
   b. Web search: all the web pages present on this earth.
   c. Social data: all the messages, images, tweets, scraps, and wall posts generated on Facebook, Twitter, and other social media.
2. Cutting-edge analytics:
   a. Machine learning, data mining.
3. Enterprise applications:
   a. Network instrumentation, mobile logs.
   b. Video and audio processing.
   c. Text mining.
4. And lots more.

Let's see the timeline:

[Image: Hadoop project timeline]
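To close the Map-Reduce discussion, here is the classic word count, the canonical example from the Hadoop documentation, lightly trimmed: the map step emits (word, 1) for every word it reads, and the reduce step sums those counts per word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: for each input line, emit (word, 1) for every word.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce: sum all the 1s emitted for each distinct word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packed into a jar, it runs as bin/hadoop jar wordcount.jar WordCount <input dir> <output dir>; the framework splits the input across the cluster, runs map tasks on the nodes where the data blocks live, shuffles the (word, count) pairs, and writes the reducers' output back into HDFS.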

References: http://hadoop.apache.org and http://developer.yahoo.com/hadoop/. These are the best places to find information about Hadoop. There you will find lots of wiki pages and links to ongoing work, covering how to get started with Hadoop and answering all the how, where, and why questions. Just visit these sites, explore, and experiment with the next-generation technology that is going to be the backbone of the Internet. In coming articles, we will talk about other technologies related to Hadoop, such as HBase, Hive, Avro, Cassandra, Chukwa, Mahout, Pig, and ZooKeeper.

Shashwat Shriparv
