
DATA LEAKAGE DETECTION

OBJECTIVE /MOTIVATION

A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data is leaked and found in an unauthorized place (e.g., on the web or on somebody's laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means.

Data allocation strategies (across the agents) that improve the probability of identifying leakages are proposed.

These methods do not rely on alterations of the released data (e.g., watermarks). In some cases the distributor can also inject realistic but fake data records to further improve the chances of detecting leakage and identifying the guilty party.

Our goal is to detect when the distributor's sensitive data has been leaked by agents and, if possible, to identify the agent that leaked the data.

AIM

The aim is to detect when the distributor's sensitive data has been leaked by agents and, if possible, to identify the agent that leaked the data.

ABSTRACT

A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data is leaked and found in an unauthorized place (e.g., on the web or on somebody's laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Data allocation strategies (across the agents) that improve the probability of identifying leakages have been proposed. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases the distributor can also inject realistic but fake data records to further improve the chances of detecting leakage and identifying the guilty party.

LITERATURE SURVEY

GENERAL INTRODUCTION

The guilt detection approach presented here is related to the data provenance problem: tracing the lineage of the leaked objects essentially amounts to detecting the guilty agents. Suggested solutions are domain specific, such as lineage tracing for data warehouses, and assume some prior knowledge of the way a data view is created out of data sources. Watermarks were initially used in images, video, and audio data, whose digital representation includes considerable redundancy. Watermarking is similar in the sense of providing agents with some kind of receiver-identifying information. However, by its very nature, a watermark modifies the item being watermarked. If the object to be watermarked cannot be modified, then a watermark cannot be inserted; in such cases, methods that attach watermarks to the distributed data are not applicable. Recent work has also studied the insertion of marks into relational data. There is also much other work on mechanisms that allow only authorized users to access sensitive data through access control policies. Such approaches prevent data leakage, in some sense, by sharing information only with trusted parties. However, these policies are restrictive and may make it impossible to satisfy agents' requests.

ACHIEVING K-ANONYMITY PRIVACY PROTECTION USING GENERALIZATION AND SUPPRESSION

Abstract: Often a data holder, such as a hospital or bank, needs to share person-specific records in such a way that the identities of the individuals who are the subjects of the data cannot be determined. One way to achieve this is to have the released records adhere to k-anonymity, which means each released record has at least (k-1) other records in the release whose values are indistinct over those fields that appear in external data. So, k-anonymity provides privacy protection by guaranteeing that each released record will relate to at least k individuals even if the records are directly linked to external information. This paper provides a formal presentation of combining generalization and suppression to achieve k-anonymity. Generalization involves replacing (or recoding) a value with a less specific but semantically consistent value. Suppression involves not releasing a value at all. The Preferred Minimal Generalization Algorithm (MinGen), which is a theoretical algorithm presented herein, combines these techniques to provide k-anonymity protection with minimal distortion. The real-world algorithms Datafly and µ-Argus are compared to MinGen. Both Datafly and µ-Argus use heuristics to make approximations, and so they do not always yield optimal results. It is shown that Datafly can over-distort data and µ-Argus can additionally fail to provide adequate protection.

WATERMARKING DIGITAL IMAGES FOR COPYRIGHT PROTECTION

A watermark is an invisible mark placed on an image that is designed to identify both the source of an image and its intended recipient. The authors present an overview of watermarking techniques and demonstrate a solution to one of the key problems in image watermarking, namely how to hide a robust, invisible mark within an image.

WHY AND WHERE: A CHARACTERIZATION OF DATA PROVENANCE

Abstract. With the proliferation of database views and curated databases, the issue of data provenance (where a piece of data came from and the process by which it arrived in the database) is becoming increasingly important, especially in scientific databases, where understanding provenance is crucial to the accuracy and currency of data. In this paper we describe an approach to computing provenance when the data of interest has been created by a database query. We adopt a syntactic approach and present results for a general data model that applies to relational databases as well as to hierarchical data such as XML. A novel aspect of our work is a distinction between "why" provenance (which refers to the source data that had some influence on the existence of the data) and "where" provenance (which refers to the location(s) in the source databases from which the data was extracted).

PRIVACY, PRESERVATION AND PERFORMANCE: THE 3 PS OF DISTRIBUTED DATA MANAGEMENT

Privacy, preservation and performance (3 Ps) are central design objectives for secure distributed data management systems. However, these objectives tend to compete with one another. This paper introduces a model for describing distributed data management systems, along with a framework for measuring privacy, preservation and performance. The framework enables a system designer to quantitatively explore the tradeoffs between the 3 Ps.

EXISTING SYSTEM

Perturbation
Applications where the original sensitive data cannot be perturbed have been considered. Perturbation is a very useful technique in which the data is modified and made less sensitive before being handed to agents. For example, one can add random noise to certain attributes, or one can replace exact values by ranges. However, in some cases it is important not to alter the original distributor's data. For example, if an outsourcer is doing the payroll, he must have the exact salary and customer bank account numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data for the patients.

Watermarking
Traditionally, leakage detection is handled by watermarking; e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases but, again, involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.

Disadvantages

1. In applications where the original sensitive data cannot be perturbed, perturbation is not an option: although it is a very useful technique in which the data is modified and made less sensitive before being handed to agents, in some cases it is important not to alter the original distributor's data.

2. Traditionally, leakage detection is handled by watermarking; e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases but, again, involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.

PROPOSED SYSTEM

Unobtrusive techniques for detecting leakage of a set of objects or records have been studied. After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. (For example, the data may be found on a web site, or may be obtained through a legal discovery process.) At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Using an analogy with cookies stolen from a cookie jar, if Freddie is caught with a single cookie, he can argue that a friend gave him the cookie. But if Freddie is caught with five cookies, it will be much harder for him to argue that his hands were not in the cookie jar. If the distributor sees enough evidence that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings. A model for assessing the guilt of agents has been developed. An algorithm for distributing objects to agents, in a way that improves the chances of identifying a leaker, has been proposed. The option of adding fake objects to the distributed set has also been considered. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out that an agent was given one or more fake objects that were leaked, then the distributor can be more confident that the agent was guilty.

Advantages

After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. If the distributor sees enough evidence that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings. A model for assessing the guilt of agents is developed, along with algorithms for distributing objects to agents in a way that improves the chances of identifying a leaker. The option of adding fake objects to the distributed set is also considered. Such objects do not correspond to real entities but appear realistic to the agents. If it turns out that an agent was given one or more fake objects that were leaked, then the distributor can be more confident that the agent was guilty.

REQUIREMENT ANALYSIS

User Requirements
The user needs to perform the following activities to create the data leakage model.

Distributor: The distributor forwards the requested data to the agent along with the created fake object. The distributor must load a text file containing the leaked data, and can then compare the data in that file with the data that was allocated to each agent.

Agents: Agents receive the data from the distributor, which they can use for their own purposes. Sometimes an agent may leak data to an unauthorized person.

Unauthorized person: An unauthorized person may obtain data from an agent.

System Requirements:

Hardware requirements:

Processor      : Any processor above 500 MHz
RAM            : 128 MB
Hard Disk      : 10 GB
Compact Disk   : 650 MB
Input device   : Standard keyboard and mouse
Output device  : VGA and high-resolution monitor

Software requirements:

Operating System : Windows family
Language         : JDK 1.5
Front End        : Java Swing
Database         : MS Access

Introduction to JAVA:

Java is a programming language developed by Sun Microsystems. Java is an object-oriented programming language that is used in conjunction with Java-enabled web browsers. These browsers can interpret the byte codes created by the language compiler. The technical design of Java is architecture neutral; the term "architecture" in this sense refers to computer hardware. Programmers can create Java programs without having to worry about the underlying architecture of a user's computer. Instead, the HotJava browser is customized to the user's machine.

Features and advantages of JAVA:

To see how Java fits network communication, programmers need to understand some more specific technical characteristics of Java. Java is a simple, object-oriented, distributed, interpreted, robust, secure, architecture-neutral, portable, high-performance, multithreaded, and dynamic language. Java is platform independent. Because Java was designed to support distributed applications over computer networks, it can be used with a variety of CPUs and operating systems. To achieve this goal, a compiler was created that produces architecture-neutral object files from Java code.

Java is secure. Java allows virus-free, tamper-free systems to be created because of the number of security checks performed before a piece of code can be executed. This is possible because pointers and memory allocation are removed at compile time.

Java is distributed. Java is built with network communications in mind, and this comes for free. Java has a comprehensive library of routines for dealing with network protocols such as Transmission Control Protocol/Internet Protocol (TCP/IP).

Java is object oriented; the object-oriented features of Java are essentially those of C++. The object-oriented approach supplies a generic form of Lego bricks, from which all other Lego bricks can be derived.

Java is robust; Java requires declarations, ensuring that the data types passed to a routine are exactly the data types that the routine requires. Java does not allow automatic casting of data types; the programmer must explicitly write casts.

Java is dynamic; Java is designed to adapt to a constantly evolving environment. Java is capable of dynamically linking in new class libraries, methods, and instance variables as it goes, without breaking.

Java is strongly associated with the Internet because the first application program written in Java was HotJava, a web browser to run applets on the Internet. Internet users can use Java to create applet programs and run them locally using a Java-enabled browser; they can also use a Java-enabled browser to download an applet located on a computer anywhere on the Internet and run it on their local computer.

IMPLEMENTATION

5.1. SOFTWARE DEVELOPMENT FLOW: SOFTWARE DEVELOPMENT LIFE CYCLE

[Fig 5.1: Waterfall Cycle - concept exploration (what), design (how), implementation, test, installation and checkout, operation and maintenance (operation), replacement, with evolve/feedback loops between the phases]

The waterfall life-cycle model describes a sequence of activities that begins with concept exploration and concludes with maintenance and eventual replacement. The waterfall model caters to forward engineering of software products. This means starting with a high-level conceptual model for a system. After a description of the conceptual model for a system has been worked out, the software process continues with the design, implementation, and testing of a physical model of the system.

5.2 Design:

System design is the process of planning a new system, moving from the problem domain to the solution domain. The design phase translates the logical aspects of the system into physical aspects of the system.

JAVA:

Java is a new computer programming language developed by Sun Microsystems. Java has a good chance to be the first really successful new computer language in several decades. Advanced programmers like it because it has a clean, well-designed definition. Businesses like it because it dominates an important new application area, Web programming. Java has several important features:

- A Java program runs exactly the same way on all computers. Most other languages allow small differences in interpretation of the standards.

- It is not just the source that is portable. A Java program is a stream of bytes that can be run on any machine. An interpreter program is built into Web browsers, though it can run separately. Java programs can be distributed through the Web to any client computer.

- Java applets are safe. The interpreter program does not allow Java code loaded from the network to access local disk files, other machines on the local network, or local databases. The code can display information on the screen and communicate back to the server from which it was loaded.

A group at Sun reluctantly invented Java when they decided that existing computer languages could not solve the problem of distributing applications over the network. C++ inherited many unsafe practices from the old C language. Basic was too static and constrained to support the development of large applications and libraries. Today, every major vendor supports Java. Netscape incorporates Java support in every version of its Browser and Server products. Oracle will support Java on the Client, the Web Server, and the Database Server. IBM looks to Java to solve the problems caused by its heterogeneous product line. The Java programming language and environment are designed to solve a number of problems in modern programming practice. Java has many interesting features that make it an ideal language for software development. It is a high-level language that can be characterized by all of the following buzzwords:

Features

Sun describes Java as:

- Simple
- Object-oriented
- Distributed
- Robust
- Secure
- Architecture neutral
- Portable
- Interpreted
- High performance
- Multithreaded
- Dynamic

Java is simple. What is meant by simple is being small and familiar. Sun designed Java as closely to C++ as possible in order to make the system more comprehensible, but removed many rarely used, poorly understood, confusing features of C++. These primarily include operator overloading, multiple inheritance, and extensive automatic coercions. The most important simplification is that Java does not use pointers and implements automatic garbage collection, so that we don't need to worry about dangling pointers, invalid pointer references, memory leaks, or memory management.

Java is object-oriented.

This means that the programmer can focus on the data in his application and the interface to it. In Java, everything must be done via method invocation on a Java object. We must view our whole application as an object, an object of a particular class.

Java is distributed. Java is designed to support applications on networks. Java supports various levels of network connectivity through classes in java.net. For instance, the URL class provides a very simple interface to networking. If we want more control over the downloaded data than the simple URL methods provide, we can use a URLConnection object, which is returned by a URL's openConnection() method. We can also do our own networking with the Socket and ServerSocket classes.
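As a rough illustration of the java.net classes mentioned above, the following sketch fetches a page through URL and URLConnection; the address used is a placeholder and not part of the project.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class UrlDemo {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.example.com/");   // placeholder address

        // For more control than the simple URL methods, obtain a URLConnection
        // first, e.g. to read headers before consuming the stream.
        URLConnection conn = url.openConnection();
        System.out.println("Content type: " + conn.getContentType());

        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}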

Java is robust. Java is designed for writing highly reliable or robust software. Java puts a lot of emphasis on early checking for possible problems, later dynamic (runtime) checking, and eliminating situations that are error prone. The removal of pointers eliminates the possibility of overwriting memory and corrupting data.

Java is secure. Java is intended to be used in networked environments. Toward that end, Java implements several security mechanisms to protect us against malicious code that might try to invade our file system. Java provides a firewall between a networked application and our computer.

Java is architecture-neutral: Java programs are compiled to an architecture-neutral byte-code format. The primary advantage of this approach is that it allows a Java application to run on any system that implements the Java Virtual Machine. This is useful not only for networks but also for single-system software distribution. With the multiple flavors of Windows 95 and Windows NT on the PC, and the new PowerPC Macintosh, it is becoming increasingly difficult to produce software that runs on all platforms.

Java is portable.

The portability actually comes from architecture neutrality. But Java goes even further by explicitly specifying the size of each of the primitive data types to eliminate implementation dependence. The Java system itself is quite portable. The Java compiler is written in Java, while the Java run-time system is written in ANSI C with a clean portability boundary.

Java is interpreted.

The Java compiler generates byte codes. The Java interpreter executes the translated byte codes directly on any system that implements the Java Virtual Machine. Java's linking phase is only a process of loading classes into the environment.

Java is high-performance. Compared to high-level, fully interpreted scripting languages, Java is high-performance. If just-in-time compilers are used, Sun claims that the performance of byte codes converted to machine code is nearly as good as native C or C++. Java, moreover, was designed to perform well even on very low-power CPUs.

Java is multithreaded. Java provides support for multiple threads of execution that can handle different tasks, with a Thread class in the java.lang package. The Thread class supports methods to start a thread, run a thread, stop a thread, and check the status of a thread. This makes programming in Java with threads much easier than programming in the conventional single-threaded C and C++ style.
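A minimal sketch of the Thread usage described above; the task body and thread name are illustrative.

public class WorkerDemo {
    public static void main(String[] args) throws InterruptedException {
        // A task defined by implementing Runnable; run() holds the work.
        Runnable task = new Runnable() {
            public void run() {
                for (int i = 1; i <= 3; i++) {
                    System.out.println(Thread.currentThread().getName() + " step " + i);
                }
            }
        };

        Thread worker = new Thread(task, "worker-1");
        worker.start();                                     // begin running run() in a new thread
        System.out.println("alive? " + worker.isAlive());   // check the thread's status
        worker.join();                                      // wait for the worker to finish
        System.out.println("alive? " + worker.isAlive());
    }
}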

Java is dynamic. The Java language was designed to adapt to an evolving environment. It is a more dynamic language than C or C++. Java loads in classes as they are needed, even from across a network. This makes upgrading software much easier and more effective. With the compiler, we first translate a program into an intermediate language called Java byte codes: the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. Java byte codes can be thought of as the machine code instructions for the Java Virtual Machine (JVM). Every Java interpreter, whether it is a development tool or a web browser that can run applets, is an implementation of the Java Virtual Machine.

The Java Platform

A platform is the hardware or software environment in which a program runs. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it is a software-only platform that runs on top of other, hardware-based platforms. The Java platform has two components: 1. The Java Virtual Machine (JVM). 2. The Java Application Programming Interface (Java API).

The JVM has been explained above. It is the base for the Java platform and is ported onto various hardware-based platforms. The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) components. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. Native code is code that, after you compile it, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

MODULE DESCRIPTION:

DATA ALLOCATION:

This module is mainly designed to transfer data from the distributor to the agents. The same module can also be used to model illegal data transfer from authorized agents to other agents. The distributor intelligently gives data to agents in order to improve the chances of detecting a guilty agent. Four instances of this problem can be addressed, depending on the type of data requests made by agents and on whether fake objects are allowed. The two types of requests handled are sample and explicit. Fake objects are objects generated by the distributor; they are designed to look like real objects and are distributed to agents together with the real data, in order to increase the chances of detecting agents that leak data.

Two basic cases are considered: either all agents make explicit requests or all agents make sample requests. The results can be extended to handle mixed cases, with some explicit and some sample requests.
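The following is a simplified, illustrative sketch of how explicit and sample requests might be served. The class and method names are hypothetical and records are represented as plain strings; this is not the project's actual code.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Allocator {

    /** Explicit request: the agent names the exact records it wants. */
    public static Set<String> allocateExplicit(Set<String> available, Set<String> requested) {
        Set<String> granted = new HashSet<String>(requested);
        granted.retainAll(available);          // give only records the distributor actually holds
        return granted;
    }

    /** Sample request: the agent asks for any m records; pick them at random. */
    public static Set<String> allocateSample(List<String> available, int m) {
        List<String> copy = new ArrayList<String>(available);
        Collections.shuffle(copy);             // random selection
        return new HashSet<String>(copy.subList(0, Math.min(m, copy.size())));
    }
}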

GUILT MODEL

This module is designed using the agent guilt model. When the distributor sends data to an agent, at run time this module allocates a unique fake object among the tuples provided to that agent. A copy of the data transferred to the agents is stored in the distributor's database. The distributor adds fake objects to the distributed data in order to improve the effectiveness of detecting guilty agents. However, fake objects may impact the correctness of what agents do, so they may not always be allowable. In this module, the set of distributed objects is perturbed by adding fake elements. In some applications, fake objects may cause fewer problems than perturbing real objects. For example, say the distributed data objects are patient records and the agents are researchers. In this case, even small modifications to the records of actual patients may be undesirable. However, the addition of some fake objects may be acceptable.
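A minimal sketch of fake-object injection along the lines described above, assuming records are plain strings; the record format, class, and method names are illustrative only.

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class FakeObjectInjector {

    /**
     * Returns the agent's allocation with one realistic-looking but fabricated
     * record appended. The fake record would be remembered in the distributor's
     * database (here it is just printed) so that leaks can later be traced back.
     */
    public static List<String> withFakeRecord(String agentId, List<String> allocation) {
        // A fabricated record keyed to this agent; the UUID keeps it unique per agent.
        String fake = "PATIENT," + UUID.randomUUID() + ",synthetic-for-" + agentId;

        List<String> result = new ArrayList<String>(allocation);
        result.add(fake);

        System.out.println("store in distributor DB: agent=" + agentId + " fake=" + fake);
        return result;
    }
}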

AGENT-GUILT MODEL:

This module is mainly designed for determining guilty agents. It uses the fake objects (stored in the database by the guilt model module) and determines the guilty agent along with a probability. Once the distributor finds his data in an unauthorized place, he can compare the released data with his own copy of the data that was distributed to the agents, and thereby identify the guilty agent. To compute this probability, we need an estimate of the probability that values can be guessed by the target. We use the probability of guessing to identify agents that have leaked information. The probabilities are estimated based on experiments.
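The sketch below shows one simplified way such a probability could be computed, assuming every leaked record the agent held could have been guessed independently with probability p; it is an illustrative approximation rather than the exact estimator used in the literature.

import java.util.HashSet;
import java.util.Set;

public class GuiltEstimator {

    /**
     * Probability that the agent leaked, given the set of records found in the
     * unauthorized place ('leaked') and the records handed to this agent
     * ('allocated'). p is the estimated probability that any single record could
     * have been guessed independently by the target (obtained from experiments).
     */
    public static double guiltProbability(Set<String> allocated, Set<String> leaked, double p) {
        Set<String> overlap = new HashSet<String>(leaked);
        overlap.retainAll(allocated);          // leaked records this agent actually had

        // If every overlapping record could have been guessed (probability p each,
        // independently), the agent may still be innocent; otherwise guilty.
        double probAllGuessed = Math.pow(p, overlap.size());
        return 1.0 - probAllGuessed;
    }
}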

OVERLAP MINIMIZATION
When the distributor allocates data upon request from agents, more than one agent may ask for the same tuples, so the same tuple may be given to more than one agent and their allocations overlap. When leaked data is later found, the distributor must still be able to assess which agent is guilty. Thus we arrive at a solution: overlap minimization. In addition, when the distributor allocates data along with fake objects, the fake objects should be generated such that, for each agent and each tuple, the fake object is unique.
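A greedy, illustrative sketch of overlap minimization: each agent is preferentially given tuples that have been handed out to the fewest other agents so far. All names are hypothetical.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class OverlapMinimizer {

    /**
     * Picks m tuples for one agent, preferring tuples that have been handed out
     * to the fewest agents so far, so that any two agents end up sharing as few
     * tuples as possible. timesAllocated maps each tuple to how many agents
     * already hold it, and is updated with the new allocation.
     */
    public static List<String> pickForAgent(List<String> tuples,
                                            final Map<String, Integer> timesAllocated,
                                            int m) {
        List<String> candidates = new ArrayList<String>(tuples);
        Collections.sort(candidates, new Comparator<String>() {
            public int compare(String a, String b) {
                return count(a) - count(b);        // least-shared tuples first
            }
            private int count(String t) {
                Integer c = timesAllocated.get(t);
                return c == null ? 0 : c;
            }
        });

        List<String> chosen = candidates.subList(0, Math.min(m, candidates.size()));
        for (String t : chosen) {                  // record the new allocations
            Integer c = timesAllocated.get(t);
            timesAllocated.put(t, c == null ? 1 : c + 1);
        }
        return new ArrayList<String>(chosen);
    }
}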

FINDING PROBABILITY
When the distributor finds the leaked data in an unauthorized place, such as a website or a laptop, he must be able to find out which agent is guilty. For this, he can compute, for each agent, the fraction of that agent's allocated records that appear among the records found in the unauthorized place. The agent with the highest such fraction is the most likely to be guilty. Fake objects are used here to confirm the guilty agent more clearly.
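A small illustrative sketch of this computation, again assuming records are plain strings and that the distributor keeps each agent's allocation in memory; names are hypothetical.

import java.util.Map;
import java.util.Set;

public class LeakProbability {

    /**
     * For each agent, prints the fraction of that agent's allocated records that
     * were found in the unauthorized place. The agent with the highest fraction
     * is the most likely leaker; known fake objects found in the leak give
     * additional confirmation.
     */
    public static void printSuspicion(Map<String, Set<String>> allocations, Set<String> leaked) {
        for (Map.Entry<String, Set<String>> e : allocations.entrySet()) {
            int found = 0;
            for (String record : e.getValue()) {
                if (leaked.contains(record)) {
                    found++;
                }
            }
            double fraction = e.getValue().isEmpty() ? 0.0 : (double) found / e.getValue().size();
            System.out.println(e.getKey() + " -> " + fraction);
        }
    }
}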

SYSTEM ARCHITECTURE
[System architecture figure: agents send explicit or sample requests to the distributor; the distributor sends sensitive data to Agents 1-5, creating fake objects at run time and allocating them to the agents; an agent leaks data to an unauthorized person; when the distributor finds his data in an unauthorized place (the leaked file), he compares it with his database (DB) to find the guilty agent.]

Data Flow Diagram: LEVEL 1: DATA TRANSFER PROCESS

[Level 1 DFD: an agent gives a request to the distributor; the distributor adds a fake object and performs the data transfer to the agent, recording the agent id and fake object; only the guilty agent's data reaches the unauthorized person (third party), recorded as agent id, fake object, and number of files transferred.]

LEVEL 2: GUILT MODEL ANALYSIS

[Level 2 DFD: using the stored agent id and fake object, the distributor views the released data, finds the guilty agent, and computes the probability for each suspected agent.]

UML DIAGRAMS

USE CASE DIAGRAM:

[Use case diagram: the Agent makes an explicit or sample request and may transfer data to an unauthorized person; the Distributor adds fake objects, distributes data to agents, finds the probability, and finds the guilty agent; the Unauthorized person receives the leaked data.]

SEQUENCE DIAGRAM:

[Sequence diagram: login as distributor; distribute data to agents (store data into database); view distributed data (view from database for data leakage); find guilty agents; find probability of data transfer to agents.]

COLLABORATION DIAGRAM:

[Collaboration diagram: 1. login as distributor; 2. store data into database (distribute data to agents); 3. view from database for data leakage (view distributed data); 4. find probability of data transfer to agents (find guilty agents / probability distribution of data).]

ACTIVITY DIAGRAM:

[Activity diagram: login, then distribute data to agents, view data distributed to agents, find guilty agents, and find probability of data leakage.]

ARCHITECTURE DIAGRAM:

4. Major software functions

Number  Description                    Functions
1       Windows XP/7 with MS-Office    Maintaining compatibility with the versions of the office software used.
2       MS-Access                      Maintains the record of the logs of users, intruders, and signatures.
3       NetBeans 1.6.0                 Coding done and user interface created.
4       JDK 1.6.0                      Java development and runtime environment.

4 RISK MANAGEMENT

This section discusses project risks and the approach to managing them.

4.3.1 Project Risks

The RMMM plan tackles risk through Risk Assessment and Risk Control. Risk Assessment involves Risk Identification, Risk Analysis, and Risk Prioritization, while Risk Control involves Risk Management Planning, Risk Resolution, and Risk Monitoring.

Project name: Data Leakage Detection.
Purpose: The RMMM plan outlines the risk management strategy adopted. We adopt a proactive approach to tackling risks and thus reduce the performance, schedule, and cost overruns that we may incur due to the occurrence of unexpected problems. This Risk Mitigation, Monitoring and Management Plan identifies the risks associated with our project, Data Leakage Detection. In addition to project risks and technical risks, business risks are also identified, analyzed, and documented. This document outlines the strategy that we have adopted to avoid these risks. A contingency plan is also prepared for each risk, in case it becomes a reality. Only those risks have been treated whose probability and impact are relatively high, i.e., above a referent level.

4.3.2 Risk Table

Impact levels: The risks are categorized on the basis of their probability of occurrence and the impact that they would have if they do occur. Their impact is rated as follows:

Catastrophic : 1
Critical     : 2
Marginal     : 3
Negligible   : 4

Sr No  Risk                                                           Category   Probability  Impact
1      Increase of work load                                          Personal   20%          3
2      Inexperience in software environment                           Technical  25%          3
3      Overly optimistic schedules                                    Project    20%          3
4      Lack of sufficient research                                    Technical  50%          3
5      Modules require more testing and further implementation work   Project    50%          2
6      Inconsistency in input                                         Project    30%          -

Table: 4.3

4.4 PROJECT TIMELINE

[The original timeline is a week-by-week (W1-W4) Gantt chart spanning August through May; only the task breakdown is reproduced here.]

1. Requirement gathering (August - September): information from the internet; information from books; information from stakeholders.
2. Requirement analysis: analysis of information related to mobile agents; analysis of information related to RMI; analysis of information related to the wireless environment; analysis of NetBeans IDE 6.9.
3. Problem definition (October - November): meet internal guide; identify project constraints; establish project statement.
4. Feasibility: economic feasibility; technical feasibility; behavioral feasibility.
5. Planning: scheduling of tasks; task division and time allocation; effort allocation; resource allocation.
6. Designing: RMI logic studied; designing of GUI; designing of database.
7. Coding (February - March): coding of GUI; coding of RMI; coding of the RMI registry; designing of database.
8. Implementation details (April - May): linking GUI and project files.
9. Testing: unit testing; integration testing; system testing.
10. Evaluation: project evaluation; documentation; review and recommendation.

Table: 4.4

Testing
6.1 INTRODUCTION

Testing is the process of executing a program with the intent of finding errors. Testing is a process used to help identify the correctness, completeness, and quality of developed computer software. Testing helps in verifying and validating that the software is working as it is intended to work.

What is software testing? It is exercising (analyzing) a system or component with defined inputs, capturing the monitored outputs, and comparing the outputs with the specified or intended requirements, so as to maximize the number of errors found by a finite number of test cases. Testing is successful if you can show that the product does what it should do and does not do what it should not do. Test cases are devised with this purpose in mind. A test case is a set of data that the system will process as input. However, the data are created with the intent of determining whether the system will process them correctly, without any errors, to produce the required output.

6.2 INTEGRATION TESTING

When the individual components are working correctly and meeting the specified objectives, they are combined into a working system. This integration is planned and coordinated so that when a failure occurs, there is some idea of what caused it. In addition, the order in which components are tested affects the choice of test cases and tools. This test strategy explains why and how the components are combined to test the working system. It affects not only the integration timing and coding order, but also the cost and thoroughness of the testing.

6.2.1 BOTTOM-UP INTEGRATION

One popular approach for merging components into the larger system is bottom-up testing. When this method is used, each component at the lowest level of the system hierarchy is tested individually. Then, the next components to be tested are those that call the previously tested ones. This approach is followed repeatedly until all components are included in the testing.

The bottom-up method is useful when many of the low-level components are general-purpose utility routines that are invoked often by others, when the design is object-oriented, or when the system is integrated using a large number of stand-alone reused components.

6.2.2 TOP-DOWN INTEGRATION

Many developers prefer to use a top-down approach, which in many ways is the reverse of bottom-up. The top level, usually one controlling component, is tested by itself. Then, all components called by the tested components are combined and tested as a larger unit. This approach is reapplied until all components are incorporated.

6.3 BLACK BOX TESTING

Black Box Testing involves testing without knowledge of the internal workings of the item being tested. For example, when black box testing is applied to software engineering, the tester would only know the "legal" inputs and what the expected outputs should be, but not how the program actually arrives at those outputs. It is because of this that black box testing can be considered testing with respect to the specifications, no other knowledge about the program is necessary. For this reason, the tester and the programmer can be independent of one another, avoiding programmer bias toward his own work. For this testing, test groups are often used. Also, due to the nature of black box testing, the test planning can begin as soon as the specifications are written. The opposite of this would be glass box testing where test data are derived from direct examination of the code to be tested. For glass box testing, the test cases cannot be determined until the code has actually been written. Both of these testing techniques have advantages and disadvantages, but when combined, they help to ensure thorough testing of the product.

6.3.1 TESTING STRATEGIES AND TECHNIQUES

Black box testing should make use of randomly generated inputs (only a test range should be specified by the tester) to eliminate any guesswork by the tester. Data outside of the specified input range should be tested to check the robustness of the program.

The boundary cases should be tested (top and bottom of specified range) to make sure that the highest and lowest allowable inputs produce proper output.

The number zero should be tested when numerical data is given as input. Stress testing (try to overload the program with inputs to see where it reaches its maximum capacity) should be performed, especially with real time systems.

Crash testing should be performed to identify the scenarios that would make the system down.

Test monitoring tools should be used whenever possible to track which tests have already been performed; the outputs of these tests are used to avoid repetition and also to aid in software maintenance.

Other functional testing techniques include transaction testing, syntax testing, domain testing, logic testing, and state testing.

Finite state machine models can be used as a guide to design the functional tests.

6.3.2 ADVANTAGES OF BLACK BOX TESTING

- It is more effective on larger units of code than glass box testing.
- The tester needs no knowledge of the implementation, including specific programming languages.
- The tester and the programmer are independent of one another.

6.3.3 DISADVANTAGES OF BLACK BOX TESTING

- Only a small number of possible inputs can actually be tested.
- The test cases are hard to design without clear and concise specifications.

6.4 WHITE BOX TESTING

White box testing uses an internal perspective of the system to design test cases based on internal structure. It is also known as glass box, structural, clear box, and open box testing. It requires programming skills to identify all paths of the software. The tester chooses test case inputs to exercise all paths and to determine the appropriate outputs. In electrical hardware testing, every node in a circuit may be probed and measured, e.g., in-circuit testing (ICT). Since the tests are based on the actual implementation, when the implementation changes the tests probably also change. For instance, ICT needs an update if a component value changes, and needs a modified or new fixture if the circuit changes. This adds financial resistance to the change process, so buggy products may stay buggy. Automated Optical Inspection (AOI) offers similar component-level correctness checking without the cost of ICT fixtures, but changes still require test updates. While white box testing is applicable at the unit, integration, and system levels of the software testing process, it is typically applied at the unit level. Although it normally tests paths within a unit, it can also test paths between units during integration, and between subsystems during a system-level test. Though this method of test design can uncover an overwhelming number of test cases, it might not detect unimplemented parts of the specification or missing requirements. It does, however, ensure that all the paths through the test objects are executed.

The white box testing strategy deals with the internal logic and structure of the code. The tests that are written based on the white box testing strategy incorporate coverage of the code written, branches, paths, statements, and internal logic of the code. In order to implement white box testing, the tester has to deal with the code and hence should possess knowledge of coding and logic, i.e., the internal working of the code. White box testing also needs the tester to look into the code and find out which unit, statement, or chunk of the code is malfunctioning.

6.4.2 ADVANTAGES OF WHITE BOX TESTING

- As knowledge of the internal coding structure is a prerequisite, it becomes very easy to find out which type of input or data can help in testing the application effectively.
- It helps in optimizing the code and in removing the extra lines of code, which can bring in hidden defects.
- Introspection, or the ability to look inside the application, means that testers can identify objects programmatically. This is helpful when the GUI is changing frequently or is yet unknown, as it allows testing to proceed. It can also, in some situations, decrease the fragility of test scripts, provided the name of an object does not change.

6.4.3 DISADVANTAGES OF WHITE BOX TESTING

- As knowledge of the code and internal structure is a prerequisite, a skilled tester is needed to carry out this type of testing, which increases the cost.
- It is nearly impossible to look into every bit of code to find out hidden errors, which may create problems resulting in failure of the application.
- Time is the single biggest disadvantage of white box testing; testing time can be extremely expensive.

SCREEN SHOTS

REFERENCES

[1] R. Agrawal and J. Kiernan. Watermarking relational databases. In VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases, pages 155-166. VLDB Endowment, 2002.

[2] P. Bonatti, S. D. C. di Vimercati, and P. Samarati. An algebra for composing access control policies. ACM Transactions on Information and System Security, 5(1):1-35, 2002.

[3] P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In J. V. den Bussche and V. Vianu, editors, Database Theory - ICDT 2001, 8th International Conference, London, UK, January 4-6, 2001, Proceedings, volume 1973 of Lecture Notes in Computer Science, pages 316-330. Springer, 2001.

[4] P. Buneman and W.-C. Tan. Provenance in databases. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 1171-1173, New York, NY, USA, 2007. ACM.

[5] Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. The VLDB Journal, pages 471-480, 2001.
