
USER RECOMMENDER SYSTEM BASED UPON

KNOWLEDGE, AVAILABILITY AND REPUTATION


FROM INTERACTIONS IN FORUMS

A PROJECT REPORT

Submitted by

C.SANJUNA DEVI (111713205083)

P.PRIYANKA (111713205082)

R.SANJIVA SINDHUJA (111713205075)

in partial fulfillment for the award of the degree


of
BACHELOR OF TECHNOLOGY

IN

INFORMATION TECHNOLOGY
RMK ENGINEERING COLLEGE, CHENNAI

ANNA UNIVERSITY: CHENNAI 600 025

APRIL 2017
BONAFIDE CERTIFICATE

Certified that this project report on AUTOMATED DATA VALIDATION FRAMEWORK is the bonafide work of SANJIVA SINDHUJA R (111713205082), SANJUNA DEVI C (111713205083) and PRIYANKA P (111713205053), who carried out the project under my supervision.

SIGNATURE                                      SIGNATURE

Dr. K. VIJAYA, M.E., Ph.D.                     Ms. PRATHUSHA LAXMI, M.E., (Ph.D.)
HEAD OF THE DEPARTMENT                         SUPERVISOR
Dept. of Information Technology                Dept. of Information Technology
R.M.K. Engineering College                     R.M.K. Engineering College
R.S.M. Nagar                                   R.S.M. Nagar
Kavaraipettai - 601206                         Kavaraipettai - 601206
CERTIFICATE OF EVALUATION

College Name : RMK ENGINEERING COLLEGE

Department : INFORMATION TECHNOLOGY

Semester : 08

Title of Project : USER RECOMMENDER SYSTEM BASED UPON KNOWLEDGE, AVAILABILITY AND REPUTATION FROM INTERACTIONS IN FORUMS

Name of the Students : SANJUNA DEVI C, PRIYANKA P, SANJIVA SINDHUJA R

Name of the Supervisor with designation : Ms. PRATHUSHA LAXMI, M.E., (Ph.D.), HEAD OF THE DEPARTMENT, Dept. of Information Technology

The report of the project work submitted by the above students, in partial fulfillment for the award of the Bachelor of Technology degree in INFORMATION TECHNOLOGY of Anna University, was evaluated and confirmed to be the report of the work done by the above students.

Submitted for the project viva voce examination held on __________.

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT

At the outset, we would like to express our gratitude to our beloved and respected Chairman, Thiru. R.S. Munirathnam, for his support and blessings to accomplish the project.

We would like to express our thanks to our Vice Chairman Thiru. R.M.Kishore
for his encouragement and guidance.

We thank our Principal, Dr. K.A. Mohamed Junaid, for creating the wonderful
environment for us and enabling us to complete the project.

We wish to express our sincere thanks and gratitude to Dr. K. VIJAYA, M.E., Ph.D., Head, Department of Information Technology, and to our Project Guide Ms. PRATHUSHA LAXMI, M.E., (Ph.D), who have been a constant source of inspiration to us and have extended their fullest co-operation and guidance, without which this project would not have been a success.

We express our sincere gratitude and thanks to our beloved Project Coordinator, Mr. K. CHIDAMBARA THANU, M.E., (Ph.D), for having extended his fullest co-operation and guidance, without which this project would not have been a success.

Our thanks to all faculty and non-teaching staff members of our department for their constant support in completing this project.

ABSTRACT

One emerging challenge in handling large amounts of data is verifiability: analysis is only as good as the correctness of the data behind it, and invalid data produces wrong or misleading results that deviate from the desired intention. Data validation testing is therefore essential after a data migration to ensure data quality. This project deals with medical claims and settlements from around the globe and applies different validity checks to different types of files. Medical claims usually involve very large volumes of data, so it is especially important to establish whether the received data is valid. In this project we design a framework that automates the validation process and is flexible enough to add, remove, or modify validation checks. Tables are created in the Hive database and datasets of any format are loaded into them. The validation criteria are then loaded into HBase tables, and the data is checked against those conditions. Records that satisfy the conditions are stored in lookup tables, on which further analyses are performed. Finally, once the validations are concluded, the lookup tables are refreshed and an email notification is sent to the pre-defined recipients.

TABLE OF CONTENTS

CHAPTER NO    TITLE

              ABSTRACT
              LIST OF FIGURES
              LIST OF ABBREVIATIONS

1             INTRODUCTION
              1.1 GENERAL
              1.1.1 IMPORTANCE OF BIG DATA
              1.2 CHARACTERISTICS OF BIG DATA

2             HADOOP DISTRIBUTED FILE SYSTEM
              2.1 INTRODUCTION
              2.2 PROBLEMS IN HADOOP 1.x
              2.3 COMPONENTS OF HADOOP 2.x
              2.3.1 HDFS
              2.3.2 YARN
              2.3.3 MAP REDUCE
              2.4 DESIGN OF HDFS
              2.5 HDFS BLOCKS
              2.6 NAMENODE AND DATANODES
              2.7 GOAL OF HDFS

3             DESCRIPTION OF TOOLS USED
              3.1 APACHE HADOOP
              3.2 HIVE
              3.2.1 FEATURES OF HIVE
              3.2.2 APACHE HIVE
              3.2.3 WORKING OF HIVE
              3.3 MAP REDUCE
              3.3.1 ALGORITHM OF MAP-REDUCING TECHNIQUE
              3.3.2 TERMINOLOGIES
              3.4 APACHE HBASE
              3.4.1 INTRODUCTION TO HBASE
              3.4.2 FEATURES OF HBASE
              3.5 PYTHON
              3.6 MYSQL
              3.7 UBUNTU

4             AUTOMATED DATA VALIDATION FRAMEWORK
              4.1 OBJECTIVE
              4.2 EXISTING SYSTEMS
              4.3 LIMITATIONS
              4.4 PROPOSED SYSTEM
              4.5 ADVANTAGES

5             SYSTEM DESIGN
              5.1 USECASE DIAGRAM
              5.2 SEQUENCE DIAGRAM
              5.3 ACTIVITY DIAGRAM
              5.4 COLLABORATION DIAGRAM
              5.5 COMPONENT DIAGRAM

6             INITIAL SETUP PROCESS STEPS
              6.1 INSTALLATION OF VIRTUAL BOX
              6.2 IMPORT OF APPLIANCE
              6.3 SINGLE NODE HADOOP 2.x CLUSTER SETUP ON CENTOS
              6.4 STARTING DAEMONS

7             PROCESSING STEPS

8             CONCLUSION

9             APPENDIX 1: SCREENSHOTS

10            REFERENCES

LIST OF FIGURES

FIGURE NO     TITLE

1.1           TRENDS OF BIG DATA
2.6           HDFS ARCHITECTURE
3.1           APACHE HADOOP MULTI NODE CLUSTER
3.2           ARCHITECTURE OF HIVE
3.2.3         WORKING OF HIVE
3.3.1         MAP REDUCE ALGORITHM
3.4           WORKING OF HBASE
5.1           USECASE DIAGRAM
5.2           SEQUENCE DIAGRAM
5.3           ACTIVITY DIAGRAM
5.4           COLLABORATION DIAGRAM
5.5           COMPONENT DIAGRAM

LIST OF ABBREVIATIONS

HDFS - Hadoop Distributed File System

CSV - Comma-Separated Values

ORC - Optimized Row Columnar

Hive - Apache Hive data warehouse software

HQL - Hive Query Language

SQL - Structured Query Language

HBase - Hadoop Database

CHAPTER 1
INTRODUCTION

1.1 GENERAL:
1.1.1. Importance of Big Data:

In today's world, with the rapidly increasing need to store large amounts of data, a technology is needed to support it; one such technology is "Big Data". It plays a vital role in handling data sets that are so large or complex that traditional data processing software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behavior analytics, or other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that's not the most relevant characteristic of this new data ecosystem." Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on." Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data sets in areas including Internet search, finance, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology and environmental research.

Fig 1.1 Trends of Big Data - Big Data market forecast, 2011-2017 (in US$ billions):
2011: $7.30, 2012: $11.80, 2013: $18.60, 2014: $28.30, 2015: $38.40, 2016: $45.30, 2017: $50.10

1.2. CHARACTERISTICS:

Big data can be described by the following characteristics:

Volume
The term Volume refers to the amount of data generated. In handling medical claims, a large amount of data is generated each day, containing details about individuals and their policies that vary on a daily basis. Many entries are registered every day, and the volume of data multiplies rapidly, potentially reaching petabytes or exabytes, which are too complex to handle with traditional tools.

Variety
The term Variety refers to the various sources and forms of data. Medical claims consist of data coming from different sources, which may be structured, semi-structured or unstructured, for example entries made in spreadsheets and claim holders' images. In short, it covers all types of data, such as text, documents and images.

Velocity
The term Velocity refers to the speed at which data is generated and processed. In medical claims there is a new entry every second, and large volumes of data are injected into the databases for processing.

Variability
The term Variability refers to data whose meaning is constantly changing. Medical claims data keeps changing because individuals continually update their policies, for example closing a claim policy or adding something new to an existing one. Since the data varies with time, it must remain meaningful for analysis.

Veracity
The quality of captured data can vary greatly, affecting the accuracy of analysis. For effective analysis of medical claims, all the data entered in the database should be accurate enough to support future references. For example, the company may want to analyze, for a particular period of time, the amounts claimed by claim holders.

CHAPTER 2
HADOOP DISTRIBUTED FILE SYSTEM

2.1. Introduction

Hadoop runs a number of applications on distributed systems with thousands of nodes involving petabytes of data. It has a distributed file system, called the Hadoop Distributed File System or HDFS, which enables fast data transfer among the nodes. Files stored in HDFS get scalable, fault-tolerant storage at low cost. The HDFS software detects and compensates for hardware issues, including disk problems and server failure. HDFS stores files across the collection of servers in a cluster. Files are decomposed into blocks, and each block is written to more than one of the servers. This replication provides both fault tolerance and performance.

2.2. Problems in Hadoop 1.x:

Name Node is a single point of failure

Solution:

High Availability: In HDFS High Availability, multiple NameNodes are used in an Active-Standby configuration with shared edits to handle NameNode failure.

Secondary Name Node cannot act as Name Node

Solution:

HDFS Federation: The Hadoop 2.x architecture allows multiple namespaces to be managed by enabling multiple NameNodes. On the HDFS shell, multiple directories are available, and two different directories may be managed by two active NameNodes at the same time.

2.3. Components of Hadoop 2.x:

Hadoop 2.x has the following three Major Components:

HDFS

YARN

Map Reduce

2.3.1. HDFS:

HDFS stands for Hadoop Distributed File System. It is also known as HDFS
V2 as it is part of Hadoop 2.x with some enhanced features. It is used as a
Distributed Storage System in Hadoop Architecture.

2.3.2. YARN:

YARN stands for Yet Another Resource Negotiator. It is a new component in Hadoop 2.x and is also known as MR V2. It is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications. YARN combines a central resource manager, which reconciles the way applications use Hadoop system resources, with node manager agents that monitor the processing operations of individual cluster nodes.

2.3.3. Map Reduce:

Map Reduce is a batch processing or distributed data processing module. It is also known as MR V1, as it was part of Hadoop 1.x, and is carried over with some updated features.

2.4. Design of HDFS

HDFS has been designed keeping in view the following features:

Very large files: Files that are megabytes, gigabytes, terabytes or petabytes in size.

Streaming data access: HDFS is built around the idea that data is written once but read many times. A dataset is copied from the source and then analysis is done on that dataset over time.

Commodity hardware: Hadoop does not require expensive, highly reliable hardware, as it is designed to run on clusters of commodity hardware.

2.5. HDFS Blocks

A hard disk has concentric circles that form tracks, and one file can span many blocks. Blocks in a local file system are typically around 512 bytes and are not necessarily contiguous. Since HDFS is designed for large files, its block size is 128 MB by default. Moreover, it allocates the underlying local file system blocks contiguously to minimize head seek time.
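As a quick illustration of this layout, the small Python sketch below computes how many 128 MB blocks a file occupies; the file size used in the example is arbitrary.

import math

# Number of HDFS blocks a file occupies, assuming the default 128 MB block size
def hdfs_block_count(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    return math.ceil(file_size_bytes / block_size_bytes)

# Example: a 1 GB file is split into 8 blocks of 128 MB each
print(hdfs_block_count(1024 * 1024 * 1024))  # 8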

2.6. NameNodes and Datanodes:

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode contains the Hadoop file system tree and other metadata about files and directories, as well as an in-memory mapping of which blocks are stored on which DataNode. A Secondary NameNode performs housekeeping activities for the NameNode, such as periodic merging of the namespace image and edits. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories, and it also determines the mapping of blocks to DataNodes. The NameNode keeps two important files on its hard disk:

1. fsimage (file system image): It contains the entire directory structure of HDFS, the replication level of each file, modification and access times, access permissions of files and directories, the block size of files, and the blocks constituting each file.

2. Edits: When any write operation takes place in HDFS, the directory structure gets modified. These modifications are stored in memory as well as in the edits files (which are kept on hard disk). Merging the existing fsimage with the edits produces an updated fsimage. This process is called checkpointing and is carried out by the Secondary NameNode, which takes the fsimage and edits files from the NameNode and returns an updated fsimage after merging.

The DataNode stores the actual data blocks of files in HDFS on its own local disk. It sends periodic signals to the NameNode (called heartbeats) to indicate that it is alive, and it sends block reports to the NameNode on cluster startup as well as periodically, at every tenth heartbeat. The DataNodes are the workhorses of the system: they perform all block operations, including periodic checksums, and they receive instructions from the NameNode on where and how to place blocks.

Fig 2.6 HDFS Architecture

2.7. Goal of HDFS:
Fault detection and recovery:

Since HDFS runs on a large number of commodity hardware components, failures are frequent. Therefore HDFS should have mechanisms for quick, automatic fault detection and recovery.

Huge datasets:

HDFS should have hundreds of nodes per cluster to manage the


applications having huge datasets.

Hardware at data:

A requested task can be done efficiently, when the computation takes place
near the data. Especially where huge datasets are involved, it reduces the network
traffic and increases the throughput.

CHAPTER 3
DESCRIPTION OF THE TOOLS USED:

3.1. Apache Hadoop

Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model. It consists of computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes to process the data in parallel. This approach takes advantage of data locality (nodes manipulating the data they have access to), allowing the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

Fig 3.1 Apache Hadoop

3.2. Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop.


It resides on top of Hadoop to summarize Big Data, and makes querying and
analyzing easy.

Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

A relational database
A design for On Line Transaction Processing (OLTP)
A language for real-time queries and row-level updates
3.2.1. FEATURES OF HIVE:

It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides an SQL-type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.

Fig 3.2 Architecture of Hive

3.2.2 APACHE HIVE:


Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services. Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 file system. It provides an SQL-like language called HiveQL with schema on read and transparently converts queries to MapReduce, Apache Tez and Spark jobs. All three execution engines can run in Hadoop YARN. To accelerate queries, it provides indexes, including bitmap indexes.

3.2.3. Working of Hive


Hive chooses respective database servers to store the schema or Metadata
of tables, databases, columns in a table, their data types, and HDFS mapping.
HiveQL is similar to SQL for querying on schema info on the Metastore. It is one
of the replacements of traditional approach for Map Reduce program. Instead of
writing Map Reduce program in Java, we can write a query for Map Reduce job
and process it.

The conjunction part of HiveQL process Engine and Map Reduce is Hive
Execution Engine. Execution engine processes the query and generates results as
same as Map Reduce results. It uses the flavor of Map Reduce.
Hadoop distributed file system or HBASE are the data storage techniques to store
data into file system.
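As a minimal sketch of this flow, the Python snippet below submits a HiveQL query through the Hive command-line client and prints the result; the table name and query are hypothetical, and it assumes the hive CLI is available on the PATH.

import subprocess

# Hypothetical staging table and query, shown only to illustrate the flow above.
query = "SELECT claim_id, claim_amount FROM claims_stage LIMIT 10;"

# The Hive CLI parses the HiveQL, the execution engine turns it into
# MapReduce (or Tez/Spark) jobs, and the result rows are printed to stdout.
result = subprocess.run(["hive", "-e", query], capture_output=True, text=True, check=True)
print(result.stdout)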

Fig 3.2.3 Working of Hive

3.3. MAP REDUCE:

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for resource management, tracking resource consumption and availability, and scheduling the job's component tasks on the slaves, monitoring them and re-executing the failed tasks. The slave TaskTrackers execute the tasks as directed by the master and provide task-status information to the master periodically.

The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the JobTracker goes down, all running jobs are halted.

MapReduce is a processing technique and a program model for distributed


computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data


processing over multiple computing nodes. Under the MapReduce model, the
data processing primitives are called mappers and reducers. Decomposing a data
processing application into mappers and reducers is sometimes nontrivial. But,
once we write an application in the MapReduce form, scaling the application to
run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change. This simple scalability is what has attracted
many programmers to use the MapReduce model.

3.3.1. ALGORITHM OF MAP-REDUCING TECHNIQUE:


Generally, the MapReduce paradigm is based on sending the computation to where the data resides.

A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage; a small Python sketch of this key/value flow follows the list below.

o Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

o Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.

The framework manages all the details of data-passing such as issuing


tasks, verifying task completion, and copying data around the cluster
between the nodes.

Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.

After completion of the given tasks, the cluster collects and reduces the data
to form an appropriate result, and sends it back to the Hadoop server.
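The following pure-Python sketch imitates the map, shuffle and reduce phases on a tiny in-memory word count; it involves no Hadoop at all and only illustrates the key/value flow described above.

from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word of every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: combine the grouped values into one output value per key
    return {key: sum(values) for key, values in groups.items()}

lines = ["claim approved", "claim rejected", "claim approved"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'claim': 3, 'approved': 2, 'rejected': 1}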

Fig 3.3.1 Map-Reduce Algorithm

Overall, Mapper implementations are passed the JobConf for the job via
the JobConfigurable.configure(JobConf) method and override it to initialize
themselves. The framework then calls map(WritableComparable, Writable,
OutputCollector, Reporter) for each key/value pair in the InputSplit for that task.
Applications can then override the Closeable.close() method to perform any
required cleanup.

Output pairs do not need to be of the same types as input pairs. A given
input pair may map to zero or many output pairs. Output pairs are collected with
calls to OutputCollector.collect(WritableComparable, Writable).

Applications can use the Reporter to report progress, set application-level


status messages and update Counters, or just indicate that they are alive.

All intermediate values associated with a given output key are subsequently
grouped by the framework, and passed to the Reducer(s) to determine the final
output. Users can control the grouping by specifying
a Comparator via JobConf.setOutputKeyComparatorClass(Class).

The Mapper outputs are sorted and then partitioned per Reducer. The total
number of partitions is the same as the number of reduce tasks for the job. Users
can control which keys (and hence records) go to which Reducer by implementing
a custom Partitioner.

Users can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.

The intermediate, sorted outputs are always stored in a simple (key-len, key,
value-len, value) format. Applications can control if, and how, the intermediate
outputs are to be compressed and the Compression Codec to be used via
the JobConf.

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of


nodes> * mapred.tasktracker.reduce.tasks.maximum).

With 0.95 all of the reduces can launch immediately and start transferring
map outputs as the maps finish. With 1.75 the faster nodes will finish their first
round of reduces and launch a second wave of reduces doing a much better job of
load balancing.

Increasing the number of reduces increases the framework overhead, but


increases load balancing and lowers the cost of failures.

The scaling factors above are slightly less than whole numbers to reserve a few
reduce slots in the framework for speculative-tasks and failed tasks.
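As a small worked example of this rule of thumb, the Python snippet below computes both suggested reduce counts for a hypothetical 10-node cluster with four reduce slots per node; the cluster figures are assumptions for illustration only.

# Rule of thumb from above: reduces ~ factor * (nodes * reduce slots per node)
nodes = 10                     # hypothetical cluster size
reduce_slots_per_node = 4      # hypothetical mapred.tasktracker.reduce.tasks.maximum

for factor in (0.95, 1.75):
    print(factor, "->", int(factor * nodes * reduce_slots_per_node), "reduce tasks")
# 0.95 -> 38 reduce tasks (all launch in a single wave)
# 1.75 -> 70 reduce tasks (faster nodes run a second wave, improving load balancing)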

Reducer NONE

It is legal to set the number of reduce-tasks to zero if no reduction is


desired.

In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them out to the FileSystem.

Partitioner

Partitioner partitions the key space.

Partitioner controls the partitioning of the keys of the intermediate map-


outputs. The key (or a subset of the key) is used to derive the partition, typically
by a hash function. The total number of partitions is the same as the number of
reduce tasks for the job. Hence this controls which of the m reduce tasks the
intermediate key (and hence the record) is sent to for reduction.

HashPartitioner is the default Partitioner.
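A simplified Python sketch of the idea is shown below; note that Python's built-in hash differs from the Java hashCode used by Hadoop's HashPartitioner, so this only illustrates the principle, and the keys are made up.

# Simplified view of a hash partitioner: the same key always lands on the
# same reduce task, so all of its values are reduced together.
def partition(key, num_reduce_tasks):
    return hash(key) % num_reduce_tasks

for key in ("chennai", "mumbai", "delhi", "chennai"):
    print(key, "-> reduce task", partition(key, 3))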

Reporter

Reporter is a facility for MapReduce applications to report progress, set


application-level status messages and update Counters.

Mapper and Reducer implementations can use the Reporter to report


progress or just indicate that they are alive. In scenarios where the application
takes a significant amount of time to process individual key/value pairs, this is
crucial since the framework might assume that the task has timed-out and kill that
task. Another way to avoid this is to set the configuration
parameter mapred.task.timeout to a high-enough value (or even set it to zero for
no time-outs).

Applications can also update Counters using the Reporter.

OutputCollector

OutputCollector is a generalization of the facility provided by the


MapReduce framework to collect data output by the Mapper or the Reducer (either
the intermediate outputs or the output of the job).

Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers, and partitioners.

3.3.2. TERMINOLOGIES:

PayLoad - Applications implement the Map and the Reduce functions, and
form the core of the job.

Mapper - Mapper maps the input key/value pairs to a set of intermediate


key/value pair.

NameNode - Node that manages the Hadoop Distributed File System (HDFS).

DataNode - Node where data is presented in advance before any processing


takes place.

MasterNode - Node where JobTracker runs and which accepts job requests
from clients.

SlaveNode - Node where Map and Reduce program runs.

JobTracker - Schedules jobs and tracks the assigned jobs to the Task Tracker.

Task Tracker - Tracks the task and reports status to JobTracker.

Job - A program is an execution of a Mapper and Reducer across a dataset.

Task - An execution of a Mapper or a Reducer on a slice of data.

Task Attempt - A particular instance of an attempt to execute a task on a


SlaveNode.

3.4. Apache Hbase

3.4.1. Introduction to Hbase

HBase is a distributed column-oriented database built on top of the Hadoop file


system. It is an open-source project and is horizontally scalable.

HBase is a data model, similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS).

It is a part of the Hadoop ecosystem that provides random real-time read/write


access to data in the Hadoop File System.

One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the
Hadoop File System and provides read and write access.

Fig 3.4 Working of HBASE

Features of HBase

HBase is linearly scalable.
It has automatic failure support.
It provides consistent reads and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy Java API for clients.
It provides data replication across clusters.
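As a hedged sketch of this read/write access from Python, the snippet below uses the third-party happybase client against a local HBase Thrift server; the table, column family and row key are made up for the example.

import happybase

# Assumes the happybase package and an HBase Thrift server on localhost;
# the table, column family and row key below are purely illustrative.
connection = happybase.Connection("localhost")
table = connection.table("claims_lookup")

# Random, real-time write and read by row key
table.put(b"claim-1001", {b"info:country": b"India", b"info:amount": b"2500"})
row = table.row(b"claim-1001")
print(row[b"info:country"], row[b"info:amount"])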

3.5. Python

Python is a widely used high-level programming language for general-purpose programming, created by Guido van Rossum and first released in 1991. An interpreted language, Python has a design philosophy which emphasizes code readability (notably using whitespace indentation to delimit code blocks rather than curly braces or keywords), and a syntax which allows programmers to express concepts in fewer lines of code than is possible in languages such as C++ or Java. The language provides constructs intended to enable writing clear programs on both a small and large scale.

Python features a dynamic type system and automatic memory management and
supports multiple programming paradigms, including object-oriented, imperative,
functional programming, and procedural styles. It has a large and comprehensive
standard library.
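As a small illustration of these language features, the sketch below relies on indentation to delimit blocks and lets the same loop variable hold values of different types; the function and values are made up for the example.

# Indentation delimits each block (no braces or keywords), and the same
# loop variable holds values of different types at run time.
def describe(value):
    if isinstance(value, int):
        return "an integer"
    return "something else"

for value in (42, "forty-two"):
    print(value, "is", describe(value))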

Python interpreters are available for many operating systems, allowing Python
code to run on a wide variety of systems. CPython, the reference implementation
of Python, is open source software and has a community-based development
model, as do nearly all of its variant implementations. CPython is managed by the
non-profit Python Software Foundation.

3.6. MySQL

MySQL is an open source relational database management system (RDBMS) based on Structured Query Language (SQL). MySQL runs on virtually all platforms, including Linux, Unix, and Windows. Although it can be used in a wide range of applications, MySQL is most often associated with web-based applications and online publishing, and it is an important component of an open source enterprise stack called LAMP. LAMP is a web development platform that uses Linux as the operating system, Apache as the web server, MySQL as the relational database management system and PHP as the object-oriented scripting language. MySQL is customizable: the open-source GPL license allows programmers to modify the MySQL software to fit their own specific environments. MySQL is a fast, easy-to-use RDBMS used by many small and large businesses.
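A minimal sketch of querying MySQL from Python is shown below, assuming the third-party PyMySQL driver and a local server; the credentials, database and table are placeholders, not part of the project.

import pymysql

# Placeholder credentials and schema, purely illustrative.
connection = pymysql.connect(host="localhost", user="claims_user",
                             password="secret", database="claims_db")
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT claim_id, amount FROM claims WHERE amount > %s", (0,))
        for claim_id, amount in cursor.fetchall():
            print(claim_id, amount)
finally:
    connection.close()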

3.7. UBUNTU

Ubuntu is a Debian-based Linux operating system and distribution, with Unity as its default desktop environment for personal computers; later versions also target smartphones. Ubuntu also runs network servers.

It is based on free software and named after the Southern African


philosophy of Ubuntu, which often is translated as "humanity towards others" or
"the belief in a universal bond of sharing that connects all humanity".

A default installation of Ubuntu contains a wide range of software that includes LibreOffice, Firefox, Thunderbird, Transmission, and several lightweight games such as Sudoku and chess. Many additional software packages, including titles no longer in the default installation such as Evolution, GIMP, Pidgin, and Synaptic, are accessible from the built-in Ubuntu Software Center as well as any other APT-based package management tool.

Ubuntu operates under the GNU General Public License (GPL) and all of
the application software installed by default is free software. In addition, Ubuntu
installs some hardware drivers that are available only in binary format, but such
packages are clearly marked in the restricted component.

Ubuntu's goal is to be secure "out-of-the box". By default, the user's

23
programs run with low privileges and cannot corrupt the operating system or other
users' files. For increased security, the sudo tool is used to assign temporary
privileges for performing administrative tasks, which allows the root account to
remain locked and helps prevent inexperienced users from inadvertently making
catastrophic system changes or opening security holes. Policy Kit is also being
widely implemented into the desktop to further harden the system. Most network
ports are closed by default to prevent hacking. A built-in firewall allows end-users
who install network servers to control access. A GUI (GUI for Uncomplicated
Firewall) is available to configure it. Ubuntu compiles its packages
using GCC features such as PIE and buffer overflow protection to harden its
software. These extra features greatly increase security at the performance expense
of 1% in 32 bit and 0.01% in bit. The home and Private directories can be
encrypted.

CHAPTER 4
AUTOMATED DATA VALIDATION
FRAMEWORK

4.1. OBJECTIVE

This framework is mainly designed to automate the validation of data in the field of medical claims, based on the constraints given by the respective organization on a periodic basis.

4.2. EXISTING SYSTEM

Earlier, all medical claims were processed manually. All the data were recorded by the organization's agents and maintained in separate records for each category. Later, people started moving these manual records into computer storage, using various spreadsheets and databases such as MySQL to store the data.

4.3. LIMITATIONS / DISADVANTAGES

With manual entry of data, the process was time consuming, mistakes could not be traced easily, and if an error existed or a modification was needed, the entire data had to be changed and maintained.

It also required a lot of manpower.

After the adoption of computer software, all the data were moved to and maintained in spreadsheets. As time went on, the amount of data generated and entered grew so large that it could not be stored and processed in the existing software.

Since only a limited amount of data can be stored in a spreadsheet, it became very difficult to manage and process the data. Handling very large amounts of data, i.e. petabytes and exabytes of data per day, is therefore a difficult task.

4.4. PROPOSED SYSTEM

In the proposed system, medical claims data are collected and loaded into the database. The data are then cleaned and transformed to make them suitable for validation. Once the data are loaded into the database, the system automatically validates them against the given constraints, keeping the data consistent and ready for effective analysis. After validation, the invalid data are moved to the BAD_FILES directory of HDFS. The administrator is notified about the data refresh through an email notification. This automation makes it easier for clients to validate large amounts of medical claims data, makes the data more effective for analysis, and also saves a lot of time.
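As a hedged sketch of the last two steps, the Python snippet below relocates rejected records inside HDFS and emails the administrator; the directory, file paths, addresses and the local mail relay are assumptions for illustration and not the project's actual code.

import smtplib
import subprocess
from email.message import EmailMessage

# Hypothetical HDFS directory, file path and email addresses, purely illustrative.
BAD_FILES_DIR = "/user/hadoop/BAD_FILES"

def move_invalid_to_bad_files(invalid_path):
    # hdfs dfs -mv relocates the rejected records within HDFS
    subprocess.run(["hdfs", "dfs", "-mv", invalid_path, BAD_FILES_DIR], check=True)

def notify_admin(recipient="admin@example.com"):
    message = EmailMessage()
    message["Subject"] = "Data refresh and validation completed"
    message["From"] = "validation@example.com"
    message["To"] = recipient
    message.set_content("Lookup tables were refreshed; invalid records moved to BAD_FILES.")
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay is running
        smtp.send_message(message)

move_invalid_to_bad_files("/user/hadoop/staging/invalid_claims.csv")
notify_admin()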

4.5. ADVANTAGES

It is time efficient.

It is an automated process and hence does not require much manpower.

Once the validation process is completed and the lookup refresh succeeds, an email notification is sent to the pre-defined recipients.

CHAPTER 5
SYSTEM DESIGN

Systems design is the process of defining the architecture, components, modules,


interfaces, and data for a system to satisfy specified requirements.

Systems design could be seen as the application of systems theory to product


development.

If the broader topic of product development "blends the perspective of change in


settings, design, and integrating maps into a single approach to product
development," then design is the act of taking the user profile settings values
information and creating the design of the product to be developed. Systems design is
therefore the process of defining and developing systems to satisfy
specified requirements of the user.

The system design is depicted using the following diagrams


Use case diagram
Activity diagram
Sequence diagram
Collaboration diagram
Component Diagram

5.1. USECASE DIAGRAM

Use case diagrams give a graphic overview of the actors involved in a system, the different functions needed by those actors, and how those different functions interact.

Fig 5.1 Usecase diagram

5.2. SEQUENCE DIAGRAM

Sequence diagrams in UML show how objects interact with each other and the order in which those interactions occur. It is important to note that they show the interactions for a particular scenario. The processes are represented vertically and the interactions are shown as arrows.

Fig 5.2 Sequence diagram

5.3. ACTIVITY DIAGRAM

Activity diagrams represent workflows in a graphical way. They can be used to


describe business workflow or the operational workflow of any component in a system.
Sometimes activity diagrams are used as an alternative to State machine diagrams.

Fig 5.3 Activity diagram

5.4. COLLABORATION DIAGRAM

A collaboration diagram, also called a communication diagram or interaction diagram,


is an illustration of the relationships and interactions among software objects in
the Unified Modeling Language (UML).

Fig 5.4 Collaboration diagram

5.5. COMPONENT DIAGRAM

A component diagram displays the structural relationship of components of a software


system. These are mostly used when working with complex systems that have many
components. Components communicate with each other using interfaces. The interfaces
are linked using connectors.

Fig 5.5 Component diagram

CHAPTER 6
INITIAL SETUP PROCESS STEPS

6.1. Installation of VirtualBox

The first step in the setup process is the installation of Oracle VirtualBox of a proper version; here the installed version is VirtualBox-5.0.16-105871.Win.exe.

6.2. Import of the appliance


Open VirtualBox and then follow the steps below:
Click on File -> Import Appliance.
In the next window that appears, click on the folder option to browse to the location where the Hadoop VM is present.
Then, select the VM to be imported and click on the Open button. After that, click on Next to proceed to the next step.
Finally, click on the Import button to start importing the acadgild VM. Once the VM starts getting imported it gives a prompt, and once the import has finished, click on the Start button, after which the password needs to be entered.

6.3. Single Node Hadoop 2.X Cluster Setup On CentOS

Download CentOS, which will be downloaded as a compressed file. Unzip it using any unzipping software by right-clicking on the file and selecting the option Extract Here.
Then, click on the New option and enter the appropriate details. Name: type in any name for your VM. Type: select the option Linux from the drop-down list. Version: select Other Linux (64 bit) from the drop-down list and click Next.
On clicking Next, a prompt appears to set up the RAM size for the VM. Increase the RAM up to 2048 MB if the system has 8 GB RAM, and up to 1 GB if the system has 4 GB RAM.
Before powering on the VM, click on the Settings option and then increase the RAM size.
Click on Next to get the option of selecting the hard disk; choose the third option, i.e. using an existing virtual hard drive file.
After that, click on the folder icon to browse to the location where the unzipped CentOS file is present.
Select the imported VM and click on the Start button to start the VM, then type the username and password.
Open the terminal and log in as the root user to have administrator permissions, and type the password.
Add more users in CentOS by using the command adduser followed by the username.
Then, set the password of the added user by using the command passwd followed by the password.
Disable the firewall in CentOS using the required command.
Add the user to the sudoers file to give administrator rights to the created user. Type the required command to add the created user to the sudoers file.
Add the user by scrolling the cursor down to the appropriate position.
To type any command in the above file, enter insert mode by pressing I on the keyboard, add the users in the sudoers file, then press the Esc button to come out of insert mode and type :wq to save and exit.
Reboot the machine and then log in to the created user. Select the option shown with the red colored arrow symbol; on clicking it, the download will start and get saved in the Downloads folder.
Move the above file into the /home directory using the mv command and then switch to /home by typing the cd command.
Untar the JDK and extract the Java files by using the tar command.
Enter the command ls to see the extracted JDK in the same folder /home.
Download the Hadoop file and follow the same steps to untar it and move it to /home.
Update the .bashrc file with the required environment variables, including the Hadoop path.
Type the command source .bashrc to make the environment variables take effect.
Create two directories to store the NameNode metadata and the DataNode blocks. Note: change the permissions of the directories.
Change the directory to the location where Hadoop is installed.
Open hadoop-env.sh and add the Java home (path) and Hadoop home (path) in it.
Open core-site.xml and add the required properties between the configuration tags of core-site.xml.
Open hdfs-site.xml and add the required lines between the configuration tags.
Open yarn-site.xml and add the required lines between the configuration tags.
Copy the mapred-site.xml template into mapred-site.xml and then add the required properties.
Log in as the root user and then install the OpenSSH server in CentOS.
Generate an SSH key for the hadoop user.
Copy the public key from the .ssh directory into the authorized_keys file.
Change the directory to .ssh and then type the required command to copy the keys into the authorized_keys file.
Type the command ls to check whether the authorized_keys file has been created or not.
To ensure that the keys have been copied, type the cat command.
Change the permissions of the .ssh directory and restart the ssh service.

6.4. Starting the daemons


Open the terminal and type the command jps to check the list of running daemons; for the first run we need to start all the daemons. We can start the Hadoop daemons in two ways:
I. Starting all the Hadoop daemons by using the start-all.sh command.
II. Starting all the Hadoop daemons manually, by starting the NameNode, DataNode, Resource Manager, Node Manager and Job History Server one by one.
Before doing so, change the directory to the sbin directory of Hadoop using the command cd /usr/local/hadoop-2.6.0/sbin
Starting the namenode: use the command ./hadoop-daemon.sh start namenode
Starting the datanode: use the command ./hadoop-daemon.sh start datanode
Starting the resourcemanager: use the command ./yarn-daemon.sh start resourcemanager
Starting the nodemanager: use the command ./yarn-daemon.sh start nodemanager
Finally, type the command jps to check the health of the daemons; if all the daemons are running, the cluster is ready (a small Python cross-check follows below).
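As a cross-check of that final step, the hedged Python sketch below parses the jps output and reports any daemon that failed to start; the daemon names assume a single-node Hadoop 2.x setup with the job history server running.

import subprocess

# Daemon names as they appear in jps for a single-node Hadoop 2.x cluster
EXPECTED = {"NameNode", "DataNode", "ResourceManager", "NodeManager", "JobHistoryServer"}

# jps prints one "<pid> <Name>" pair per line for each running JVM
output = subprocess.run(["jps"], capture_output=True, text=True, check=True).stdout
running = set()
for line in output.splitlines():
    parts = line.split()
    if len(parts) == 2:
        running.add(parts[1])

missing = EXPECTED - running
print("All daemons running" if not missing else "Missing daemons: " + ", ".join(sorted(missing)))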

CHAPTER 7
PROCESSING STEPS
1. Data Load

2. Data Validation

3. Data Analysis

DATA LOAD

The first step before the data load is to start the daemons such as the namenode, datanode, resource manager, node manager, etc.

Once the daemons are started, the data needs to be loaded into the Hive and HBase tables to make it available for the validation parameters.

DATA VALIDATION

After the data is loaded, the data are refreshed, and during the data validation phase the data are checked against the given criteria.

For this, the validity of the data is checked based on the following criteria (a Python sketch of these checks appears after the list):

o Country must be one of the values present in the valid_countries table

o City must be one of the values in the valid_cities table

o Claims must be greater than 0 and must be an integer

o Date must be of the format YYYY-MM-DD

o ID must start with alphabets (all upper case) and end with a digit

The valid data are stored in the HDFS directory and the invalid data are stored in the bad_files directory.
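The sketch below is a minimal Python rendering of these checks; the lookup sets and the sample record are hypothetical, since in the project the valid values come from the lookup tables in Hive/HBase.

import re

# Hypothetical lookup sets; in the project these values come from the
# valid_countries and valid_cities lookup tables.
VALID_COUNTRIES = {"India", "USA"}
VALID_CITIES = {"Chennai", "Mumbai", "New York"}

def is_valid(country, city, claims, date, record_id):
    return (
        country in VALID_COUNTRIES
        and city in VALID_CITIES
        and isinstance(claims, int) and claims > 0
        and re.fullmatch(r"\d{4}-\d{2}-\d{2}", date) is not None    # YYYY-MM-DD
        and re.fullmatch(r"[A-Z]+.*\d", record_id) is not None      # upper-case start, digit end
    )

print(is_valid("India", "Chennai", 3, "2017-04-10", "AB123"))  # True
print(is_valid("India", "Paris", 3, "2017-04-10", "AB123"))    # False: city not in lookup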

DATA ANALYSIS

The data are stored in the Hive and lookup tables and validated based on the given conditions; the data are then analyzed, the use cases are run, and finally the output of the use cases is displayed.

Fig 7 Processing steps

CHAPTER 8
CONCLUSION

Thus, to conclude, we have created an automated data validation framework based on the latest trends. When developed as a complete project, this framework will ease the validation of large data sets and reduce the time taken for transfers between the Hive tables and HDFS, meeting the growing need to store huge amounts of validated data in an astonishingly quick way.

CHAPTER 9
SCREENSHOTS

HADOOP INSTALLATION

HDFS OVERVIEW

Stopping all the existing daemons

Starting all the new daemons

Creating the Hive tables

Loading the data and displaying it

Lookup tables are created and loaded with data

Data refresh is performed

The config file and the validation file are run

The use cases are put in the analysis file and it is run
SAMPLE CODE:

Validation_project_master.sh

# Let us start by creating the tables in Hive

echo "LET'S GET THE DAEMONS STARTED...."

sh stop-daemons.sh
sh start-daemons.sh
jps

echo "DAEMONS STARTED"

echo "PREPARING TO CREATE HIVE DATABASE AND LOAD DATA INSIDE IT..."

hive -f createhivetables.hql
hive -f loadstagetables.hql

echo "HIVE TABLES CREATED AND LOADED!"

echo "LET US NOW CREATE THE LOOK UP TABLES IN HIVE AND HBASE AT THE SAME TIME..."

hive -f lookupload.hql

echo "LOOK UP TABLES CREATED!"

echo "LET US DO THE DATA REFRESH NOW..."

hive -f data_refresh.hql

echo "DATA REFRESH COMPLETE!"

echo "LET US VALIDATE THE DATA..."

# Python validation step (uncomment to run the config and validation scripts)
#python config.py
#python data_validation.py

echo "VALIDATION COMPLETE!"

echo "LET'S RUN THE USE CASES..."

hive -f analysis.hql

echo "ANALYSIS COMPLETE!"

sh invalid_file.sh

echo "INVALID FILES ARE MOVED TO BAD_FILES DIRECTORY OF HDFS"

