
A SEMINAR REPORT ON

DATA LEAKAGE DETECTION

SUBMITTED BY
ANUSRI RAMCHANDRAN (3)
PRADEEP CHAUHAN ASHOK (7)
POOJA SHANTILAL CHHIPA (8)

UNDER THE GUIDANCE OF MRS. JYOTI WADMARE

DEPARTMENT OF COMPUTER ENGINEERING
K.J. SOMAIYA INSTITUTE OF ENGINEERING AND INFORMATION TECHNOLOGY
EVERARD NAGAR, EASTERN EXPRESS HIGHWAY, SION, MUMBAI-42

Certificate

This is to certify that the seminar entitled DATA LEAKAGE DETECTION has been submitted by the following students.

Anusri Ramchandran (3), Pradeep Chauhan Ashok (7), Pooja Shantilal Chhipa (8)

Under the guidance of Prof. Jyoti Wadmare for the subject Seminar in Semester VI.

Internal examiner (Prof._____________________)

Head of Department (Prof. Uday Rote)

External examiner (________________________)

Principal (Dr. S.G. Kirloskar)

Acknowledgement
Data Leakage Detection is a seminar report brought about by the effort of a dynamic team and a lot of other people, and we would like to extend our sincere thanks to them. We express our deepest gratitude to Mrs. Jyoti Wadmare, who guided us through the entire seminar report. We sincerely thank our college staff for their help and guidance. We also extend our formal gratitude to Prof. Uday Rote, Head of the Department of Computer Engineering, who provided us with the necessary facilities for carrying out the seminar. We are really indebted to everyone involved in the realization of this project. We are also thankful to our family members and friends for their patience and encouragement.

Anusri Ramchandran Pradeep Chauhan Pooja Shantilal Chhipa

Abstract
We study the following problem: a data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data are leaked and found in an unauthorized place (e.g., on the web or on somebody's laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. We propose data allocation strategies (across the agents) that improve the probability of identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases, we can also inject realistic but fake data records to further improve our chances of detecting leakage and identifying the guilty party.

An organization secures its data/information from intruders (i.e., hackers) by protecting its network, but data grow rapidly as organizations grow (e.g., due to globalization), the number of data points (machines and servers) through which these data are accessed keeps rising, and modes of communication become ever simpler. Sometimes data leak from within the organization, intentionally or even unintentionally, and this has become a painful reality. It has led to growing awareness of information security in general and of outbound content management in particular.

Index

1. Introduction
2. Literature Review
3. Report on Present Investigation
   3.1 Introduction to Data Leakage
   3.2 The Leaking Faucet
   3.3 Data Leakage Detection
       3.3.1 Data Allocation Strategy
       3.3.2 Guilt Detection Model
       3.3.3 Symbols and Terminology
4. Agent Guilt Model
   4.1 Guilty Agent
   4.2 Guilt Agent Detection
5. Allocation Strategy
   5.1 Explicit Data Requests
   5.2 Sample Data Requests
   5.3 Data Allocation Problem
       5.3.1 Fake Objects
       5.3.2 Optimization Problem
6. Intelligent Data Distribution
   6.1 Hashed Distribution Algorithm
   6.2 Detection Process
   6.3 Benefits of Hashed Distribution
7. Summary
8. References and Links

Chapter 1

Introduction
In the course of business, sensitive data must sometimes be given to trusted third parties. For example, a company may have partnerships with other companies that require sharing customer data to complete some business process. Similarly, a hospital may give patient records to researchers who will devise new treatments, and an enterprise may outsource its data processing, so data must be handed over to various other companies. We call the owner of the data the distributor and the supposedly trusted third parties the agents [1]. Our goal is to detect, when the distributor has distributed sensitive data among various agents, which agent leaked the data, and if possible to identify that guilty agent.

Leakage detection is traditionally handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases, but they involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.

In this report we study unobtrusive techniques for detecting leakage of a set of objects or records. Specifically, we study the following scenario: after giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. For example, the data may be found on a web site, or may be obtained through a legal discovery process. At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. To establish an agent's guilt, intelligent distribution techniques must be developed that identify the agent who leaked the data objects found at the unauthorized place, together with a model for assessing the guilt of agents. In these algorithms we also consider the option of adding fake objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents; they act as a watermark for the distributed data set and are used to identify the agent responsible for leaking data objects, without altering the original records themselves.

Chapter 2
Literature Review

The introduction to data leakage detection systems presents the guilt detection approach and data allocation strategies; alongside these, various other works cover mechanisms that allow only authorized users to access sensitive data through access control policies. Such approaches prevent, in some sense, data leakage by sharing information only with trusted parties. However, these policies are restrictive and may make it impossible to satisfy agents' requests. The guilt detection approach is related to the data provenance problem, in which tracing the lineage of leaked objects essentially amounts to detecting the guilty agents. As far as the data allocation strategies are concerned, the work is most relevant to watermarking, which is used as a means of establishing original ownership of distributed objects. The algorithms presented there have been implemented in a prototype lineage tracing system and preliminary performance results are reported. Enabling lineage tracing in a data warehousing environment has several benefits and applications, including in-depth data analysis and data mining, authorization management, view update, efficient warehouse recovery, and others as outlined in [6].

Panagiotis Papadimitriou and Hector Garcia-Molina analyzed various data allocation strategies and related them to watermarking, which is used as a means of establishing original ownership of distributed objects. Watermarks were initially used in images, video, and audio data, whose digital representation includes considerable redundancy. More recently, other works have also studied inserting marks into relational data. Their approach and watermarking are similar in the sense of providing agents with some kind of receiver-identifying information. However, by its very nature, a watermark modifies the item being watermarked. If the object to be watermarked cannot be modified, then a watermark cannot be inserted, and in such cases methods that attach watermarks to the distributed data are not applicable [1]. The authors conducted experiments with simulated data leakage problems to evaluate performance, presenting evaluations for sample requests and explicit data requests, respectively. To calculate the guilt probabilities and differences, they used p = 0.5 throughout; experiments with other p values showed that the relative performance of the algorithms and the main conclusions do not change. As p approaches 0, it becomes easier to find guilty agents and algorithm performance converges. On the other hand, as p

approaches 1, the relative differences among the algorithms grow, since more evidence is needed to find an agent guilty. The presented experiments confirmed that fake objects can have a significant impact on the chances of detecting a guilty agent: by allocating 10 percent fake objects, the distributor can detect a guilty agent even in the worst-case leakage scenario, whereas without fake objects he is unsuccessful not only in the worst case but also in the average case. With explicit data requests, few objects are shared among multiple agents. These are the most interesting scenarios, since object sharing makes it difficult to distinguish a guilty agent from a non-guilty one. Scenarios with more objects to distribute, or with objects shared among fewer agents, are obviously easier to handle. Scenarios with many objects to distribute and many overlapping agent requests are similar, since they can be mapped to the distribution of many small subsets. With sample data requests, agents are not interested in particular objects, so object sharing is not explicitly defined by their requests. The distributor is forced to allocate certain objects to multiple agents only if the number of requested objects exceeds the number of objects in set T. The more data objects the agents request in total, the more recipients, on average, an object has; and the more objects are shared among different agents, the more difficult it is to detect a guilty agent.

Piero Bonatti, Sabrina De Capitani di Vimercati, and Pierangela Samarati worked on access control policies, i.e., mechanisms that allow only authorized users to access sensitive data. Such approaches prevent, in some sense, data leakage by sharing information only with trusted parties, but these access control models are all based on monolithic and complete specifications. The approach for data leakage detection considered here is similar to the watermark added to an image, audio, or video for identification through a unique embedded code; in the case of a data object set, some fake records are added instead.

Chapter 3
Report on Present Investigation
Data leakage happens every day when confidential business information such as customer or

patient data, source code or design specifications, price lists, intellectual property and trade secrets, or forecasts and budgets in spreadsheets is leaked out. In this report we consider a problem in which a data distributor has given sensitive data to a set of supposedly trusted agents, and some of the data are leaked and found in an unauthorized place by some means. The problem with data leakage is that once the data are no longer within the distributor's domain, the company is at serious risk. The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. We propose data allocation strategies (across the agents) that improve the probability of identifying leakages. These methods do not rely on alterations of the released data. In some cases, we can also inject realistic but fake data records to further improve our chances of detecting leakage and identifying the guilty party. A further modification is applied in order to overcome the problems of the current algorithm by intelligently distributing data objects among the agents in such a way that identifying the guilty agent becomes simple.

3.1 Introduction to Data Leakage
Data leakage, put simply, is the unauthorized transmission of data (or information) from within an organization to an external destination or recipient. Leakage can happen either intentionally or unintentionally, by an internal or an external user: internals are authorized users of the system who can access the data under a valid access control policy, whereas an external intruder accesses data through some attack on the target machine, either active or passive. The transmission may be electronic or via a physical method. Data leakage is synonymous with the term information leakage; it harms the image of the organization and makes partners reluctant to continue the relationship with the distributor, since it proved unable to protect the sensitive information. The reader is encouraged to be mindful that unauthorized does not automatically mean intentional or malicious; unintentional or inadvertent data leakage is also unauthorized.

Figure 3.1: Data Leakage

According to data compiled from EPIC.org and PerkinsCoie.com, which surveyed data leakage across various organizations, 52% of data security breaches come from internal sources, compared to the remaining 48% from external hackers; hence protection is needed from internal users as well.

Type of data leaked        Percentage
Confidential information   15
Intellectual property       4
Customer data              73
Health record               8

Table 3.2: Types of data leaked [11]

The noteworthy aspect of these figures is that, when the internal breaches are examined, the percentage due to malicious intent is remarkably low (less than 1%), while the level of inadvertent data breach is significant (96%). This breaks down further into 46% due to employee oversight and 50% due to poor business process.

3.2 The Leaking Faucet
Data protection programs at most organizations are concerned with protecting sensitive data from external malicious attacks, relying on technical controls that include perimeter security, network/wireless surveillance and monitoring, application and endpoint security management, and user awareness and education. But what about inadvertent data leaks that aren't so sensational, for example unencrypted information on a lost or stolen laptop, USB drive, or other device? Like the steady drip from a leaking faucet, everyday data leaks are making headlines more often than the nefarious attack scenarios around which organizations plan most, if not all, of their data leakage prevention methods. However, to truly protect their critical data, organizations also need a more data-centric approach to their security programs that protects against leaks of sensitive data.

Organizations protect sensitive data from external malicious attacks and from internal staff who can access those data, relying on technical controls that include perimeter security, network/wireless surveillance and monitoring, application and endpoint security management, user awareness and education, and DLP solutions. But what about data leaks by trusted third parties, called agents, who are not present inside the network and whose activity is not easily traceable? In this situation some care must be taken so that the data are not misused by them. Various kinds of sensitive information, such as financial data, private data, credit card information, health record information, confidential information, and personal information, are part of different organizations and can be protected in several ways, as shown in Figure 3.3. The leaking faucet's controls include education, prevention, and detection of the leakage of sensitive information. Education remains the most important factor among all the protection measures; it includes training and awareness programs on handling sensitive information and on its importance to the organization. Figure 3.3 represents the data leakage faucet: in the center the different kinds of sensitive information are placed, surrounded by the protection mechanisms that prevent the leakage of valuable information from the organization.

Figure 3.3: The Leaking Faucet

The prevention mechanism deals with DLP, a suite of technologies that prevents the leakage of data by classifying sensitive information, monitoring it, and restricting access through various access control policies. Education prevents leakage because, most of the time, leakage occurs unintentionally through internal users. The detection process detects the leakage of information distributed to a trusted third party, called an agent, and estimates the agent's involvement in the leakage.

3.3 Data Leakage Detection
Organizations used to think of data/information security only in terms of protecting their network from intruders (e.g., hackers). But with the growing amount of data, rapid growth in the size of organizations (e.g., due to globalization), a rise in the number of data points (machines and servers),

and easier modes of communication, accidental or even deliberate leakage of data from within the organization has become a painful reality. This has led to growing awareness about information security in general and about outbound content management in particular. Data leakage, put simply, is the unauthorized transmission of data (or information) from within an organization to an external destination or recipient; it may be electronic or via a physical method. Data leakage is synonymous with the term information leakage. Again, unauthorized does not automatically mean intentional or malicious; unintentional or inadvertent data leakage is also unauthorized.

In the course of doing business, sometimes sensitive data must be handed over to supposedly trusted third parties. For example, a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data. Another enterprise may outsource its data processing, so data must be given to various other companies. We call the owner of the data the distributor and the supposedly trusted third parties the agents. Our goal is to detect the guilty agent among all the trusted agents when the distributor's sensitive data have been leaked by one of them, and if possible to identify the agent that leaked the data.

We consider applications where the original sensitive data cannot be perturbed. Perturbation is a very useful technique in which the data are modified and made less sensitive before being handed to agents; for example, one can add random noise to certain attributes, or one can replace exact values by ranges [4]. However, in some cases sensitive data must be sent exactly as they are: an employee's contact number passed to a third party for recruitment, or a bank account number, cannot be altered, since a perturbed value would be useless to the receiver. To handle such situations, an effective method is required that makes distribution possible without modifying the valuable information. For example, if an outsourcer is doing our payroll, he must have the exact salary and customer bank account numbers.

Fig 3.4: Data Leakage Detection Process

If medical researchers are treating patients (as opposed to simply computing statistics), they may need accurate data about the patients. Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases, but again, they involve some modification of the original data by adding redundancy. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious, since such a recipient may be aware of various techniques for tampering with the watermark. The data leakage detection system is mainly divided into two modules.

3.3.1 Data Allocation Strategy

This module helps in the intelligent distribution of the data set so that, if the data are leaked, the guilty agent can be identified.

Fig 3.5: Data Distribution Scenario

3.3.2 Guilt Detection Model

This model helps to determine whether an agent is responsible for the leakage or whether the data set obtained by the target was acquired by some other means. It requires complete domain knowledge to calculate the probability p used to evaluate a guilty agent. From domain knowledge, proper analysis, and experiment, a probability factor is calculated which acts as a threshold of evidence to establish the guilt of an agent: when the number of leaked records exceeds this threshold, the agent is considered guilty, and when it is lower, the agent is considered not guilty, because in that situation it is possible that the leaked objects were obtained by the target by some other means.

Fig 3.6: Guilt Detection Model

An unobtrusive technique for detecting leakage of a set of objects or records is proposed in this report. After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place (for example, the data may be found on a website, or may be obtained through a legal discovery process). At this point, the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Using an analogy with cookies stolen from a cookie jar, if we catch Freddie with a single cookie, he can argue that a friend gave him the cookie; but if we catch Freddie with five cookies, it will be much harder for him to argue that his hands were not in the cookie jar. If the distributor sees enough evidence that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings. In this report, we develop a model for assessing the guilt of agents. We also present algorithms for distributing objects to agents in a way that improves our chances of identifying a leaker. Finally, we also consider the option of adding fake objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out that an agent was given one or more fake objects that were leaked, then the distributor can be more confident that the agent was guilty.

3.3.3 Symbols and Terminology
A distributor owns a set T = {t1, t2, t3, ...} of valuable and sensitive data objects. The distributor wants to share some of the objects with a set of agents U1, U2, ..., Un, but does not wish the objects to be leaked to other third parties. The objects in T could be of any type and size; e.g., they could be tuples in a relation, or relations in a database. An agent Ui receives a subset of objects Ri ⊆ T, determined either by a sample request or an explicit request:

Distributor: the owner of the set T = {t1, t2, t3, ...} of valuable and sensitive data objects.

Agent (Ui): the distributor shares some of the objects with a set of agents U1, U2, ..., Un, but does not wish the objects to be leaked to other third parties. Each agent receives a set Ri ⊆ T from the distributor.

Target: an unauthorized third party caught with a leaked data set S ⊆ T.

Example: say T contains customer records for a given company A. Company A hires a marketing agency U1 to do an online survey of customers. Since any customers will do for the survey, U1 requests a sample of 1,000 customer records. At the same time, company A subcontracts with agent U2 to handle billing for all California customers. Thus, U2 receives all T records that satisfy the condition "state is California". Suppose that after giving objects to the agents, the distributor discovers that a set S ⊆ T has leaked. This means that some third party, called the target, has been caught in possession of S. For example, this target may be displaying S on its web site, or perhaps as part of a legal discovery process the target turned S over to the distributor. Since agents U1, U2, ..., Un have some of the data, it is reasonable to suspect them of leaking the data. However, the agents can argue that they are innocent and that the S data was obtained by the target through other means. For example, say one of the objects in S represents a customer X. Perhaps X is also a customer of some other company, and that company provided the data to the target; or perhaps X can be reconstructed from various publicly available sources on the web. Our goal is to estimate the likelihood that the leaked data came from the agents as opposed to other sources. Intuitively, the more data in S, the harder it is for the agents to argue they did not leak anything; similarly, the rarer the objects, the harder it is to argue that the target obtained them through other means. Not only do we want to estimate the likelihood that the agents leaked data, but we would also like to find out whether one of them in particular was more likely to be the leaker. For instance, if one of the S objects was given only to agent U1, while the other objects were given to all agents, we may suspect U1 more. The model we present next captures this intuition. We say an agent Ui is guilty if it contributes one or more objects to the target. To reduce the complexity of the guilt computation during implementation and research work, we make several assumptions, described in the next chapter.
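The definitions and the example above can be pictured with simple data structures. The following minimal Python sketch (the names and values are illustrative, not taken from the report) represents the distributor's set T, the request sets Ri given to the agents, and a leaked set S found at the target:

# Minimal sketch of the entities in Section 3.3.3 (illustrative names and values).
# T: the distributor's sensitive objects; R[i]: the subset given to agent Ui;
# S: the set later found in the target's possession.
T = {"t1", "t2", "t3", "t4", "t5"}           # distributor's data objects
R = {
    "U1": {"t1", "t2"},                       # objects given to agent U1
    "U2": {"t1", "t3"},                       # objects given to agent U2
}
S = {"t1", "t2"}                              # leaked set discovered at the target

# Sanity check: every agent set and the leaked set must be subsets of T.
assert all(r <= T for r in R.values()) and S <= T

# Agents holding at least one leaked object are the initial suspects.
suspects = [uid for uid, r in R.items() if r & S]
print("Suspected agents:", suspects)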

Chapter 4

Agent Guilt Model


This model helps to determine whether an agent is responsible for the leakage or whether the data set obtained by the target was acquired by some other means. The distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Using an analogy with cookies stolen from a cookie jar, if we catch Freddie with a single cookie, he can argue that a friend gave him the cookie; but if we catch Freddie with five cookies, it will be much harder for him to argue that his hands were not in the cookie jar. If the distributor sees enough evidence that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings.

4.1 Guilty Agent
To compute the probability that an agent is guilty, we need an estimate of the probability that the values in S can be guessed by the target. For instance, say some of the objects in T are email addresses of individuals. We can conduct an experiment and ask a person with approximately the expertise and resources of the target to find the email addresses of, say, 100 individuals. If this person can find, say, 90 of them, then we can reasonably guess that the probability of finding one email address is 0.9. On the other hand, if the objects in question are bank account numbers, the person may only discover, say, 20, leading to an estimate of 0.2. We call this estimate pt, the probability that object t can be guessed by the target. To simplify the formulas, we assume that all T objects have the same probability, which we call p. Next, we make two assumptions regarding the relationship among the various leakage events. The first assumption simply states that an agent's decision to leak an object is not related to other objects.

Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has leaked. This means that some third party, called the target, has been caught in possession of S. For example, this target may be displaying S on its web site, or perhaps as part of a legal discovery process the target turned S over to the distributor. Since the agents U1, ..., Un have some of the data, it is reasonable to suspect them of leaking the data. However, the agents can argue that they are innocent and that the S data was obtained by the target through other means. For example, say one of the objects in S represents a customer X. Perhaps X is also a customer of some other company, and that company provided the data to the target; or

perhaps X can be reconstructed from various publicly available sources on the web. Our goal is to estimate the likelihood that the leaked data came from the agents as opposed to other sources. Intuitively, the more data in S, the harder it is for the agents to argue they did not leak anything; similarly, the rarer the objects, the harder it is to argue that the target obtained them through other means. Not only do we want to estimate the likelihood that the agents leaked data, but we would also like to find out whether one of them in particular was more likely to be the leaker. For instance, if one of the S objects was given only to agent U1, while the other objects were given to all agents, we may suspect U1 more. The model we present next captures this intuition. We say an agent Ui is guilty if it contributes one or more objects to the target. We denote the event that agent Ui is guilty for a given leaked set S by {Gi | S}. Our next step is to estimate Pr{Gi | S}, i.e., the probability that agent Ui is guilty given evidence S.

4.2 Guilt Agent Detection
We can conduct an experiment and ask a person with approximately the expertise and resources of the target to find the email addresses of, say, 100 individuals. If this person can find, say, 90 of them, then we can reasonably guess that the probability of finding one email address is 0.9. On the other hand, if the objects in question are bank account numbers, the person may only discover, say, 20, leading to an estimate of 0.2. We call this estimate pt, the probability that object t can be guessed by the target [2]. For simplicity, we assume that all T objects have the same pt, which we call p.

Next, we make two assumptions regarding the relationship among the various leakage events. The first assumption simply states that an agent's decision to leak an object is not related to other objects.

Assumption 1. For all t, t' in S such that t ≠ t', the provenance of t is independent of the provenance of t' [1].

The term provenance in this assumption refers to the source of a value t that appears in the leaked set. The source can be any of the agents who have t in their sets, or the target itself (guessing). The following assumption states that joint events have negligible probability.

Assumption 2. An object t in S can only be obtained by the target in one of two ways:
- a single agent Ui leaked t from its own Ri set; or
- the target guessed t, or obtained it through other means, without the help of any of the n agents.

In other words, for all t in S, the event that the target guesses t and the events that agent Ui (i = 1, ..., n) leaks object t are disjoint.

Assume that the distributor set T, the agent sets Ri, and the target set S are:
T = {t1, t2, t3}, R1 = {t1, t2}, R2 = {t1, t3}, S = {t1, t2, t3}.
In this case, all three of the distributor's objects have been leaked and appear in S. Let us first consider how the target may have obtained object t1, which was given to both agents. From Assumption 2, the target either guessed t1 or one of U1 or U2 leaked it. We know that the probability of the former event is p, so, assuming that the probability that each of the two agents leaked t1 is the same, we have the following cases:
- the target guessed t1 with probability p;
- agent U1 leaked t1 to S with probability (1 − p)/2;
- agent U2 leaked t1 to S with probability (1 − p)/2.
Similarly, we find that agent U1 leaked t2 to S with probability 1 − p, since he is the only agent that has t2. Given these values, the probability that agent U1 is not guilty, namely that U1 did not leak either object, is

Pr{Ḡ1 | S} = (1 − (1 − p)/2) × (1 − (1 − p)),

and the probability that U1 is guilty is

Pr{G1 | S} = 1 − Pr{Ḡ1 | S}.

If Assumption 2 did not hold, our analysis would be more complex, because we would need to consider joint events, e.g., the target guesses t1 and, at the same time, one or two agents leak the value. In our simplified analysis, we say that an agent is not guilty when the object can be guessed, regardless of whether the agent leaked the value. Since we are not counting instances when an agent leaks information, the simplified analysis yields conservative (smaller) guilt probabilities.
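The guilt computation described above can be summarized in a short sketch. The helper below is a hypothetical Python implementation of the model under Assumptions 1 and 2, not the report's code: each leaked object held by an agent was either guessed (probability p) or leaked by one of the agents holding it, each with equal probability.

# Sketch of the agent guilt model of Chapter 4 (hypothetical helper).
def guilt_probability(agent, R, S, p):
    # Pr{G_agent | S}: probability that the given agent is guilty for leaked set S.
    not_guilty = 1.0
    for t in S & R[agent]:                         # leaked objects this agent holds
        holders = sum(1 for r in R.values() if t in r)
        not_guilty *= 1.0 - (1.0 - p) / holders    # agent did not leak this object
    return 1.0 - not_guilty

# Worked example from Section 4.2: R1 = {t1, t2}, R2 = {t1, t3}, S = {t1, t2, t3}.
R = {"U1": {"t1", "t2"}, "U2": {"t1", "t3"}}
S = {"t1", "t2", "t3"}
print(guilt_probability("U1", R, S, 0.5))          # 1 - (1 - 0.25) * (1 - 0.5) = 0.625

Running it on the example above with p = 0.5 gives Pr{G1 | S} = 1 − 0.75 × 0.5 = 0.625, matching the formula.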

Chapter 5
Allocation Strategy

Allocation strategies applicable to the different problem instances of data requests are discussed here. We deal with problems with explicit data requests and with problems with sample data requests.

5.1 Explicit Data Requests
In problems where the distributor is not allowed to add fake objects to the distributed data, the data allocation is fully defined by the agents' data requests. In EF problems (explicit requests with fake objects allowed), objective values are initialized by the agents' data requests. Say, for example, that T = {t1, t2} and there are two agents with explicit data requests such that R1 = {t1, t2} and R2 = {t1}. The value of the sum objective in this case is 1/2 + 1/1 = 1.5. The distributor cannot remove or alter the R1 or R2 data to decrease the overlap R1 ∩ R2. However, say that the distributor can create one fake object (B = 1) and that both agents can receive one fake object (b1 = b2 = 1). In this case, the distributor can add one fake object to either R1 or R2 to increase the corresponding denominator of the summation term. Assume that the distributor creates a fake object f and gives it to agent U1. Agent U1 now has R1 = {t1, t2, f} and F1 = {f}, and the value of the sum objective decreases to 1/3 + 1/1 = 1.33 < 1.5. If the distributor were able to create more fake objects, he could further improve the objective.

Algorithm 1 is a general driver used for allocation in the case of explicit requests with fake records. In the algorithm, a random agent is first selected from the list and its request is analyzed; fake records are then created by the function CREATEFAKEOBJECT(), added to the data set, and given back to the agent that requested that data set. Fake records help in identifying the agent from the leaked data set.

Algorithm 1: Allocation for Explicit Data Requests (EF)
Input: R1, ..., Rn, cond1, ..., condn, b1, ..., bn, B
Output: R1, ..., Rn, F1, ..., Fn

R ← ∅                                   // agents that can still receive fake objects
for i = 1, ..., n do
    if bi > 0 then
        R ← R ∪ {i}
    Fi ← ∅
while B > 0 do
    i ← SELECTAGENT(R, R1, ..., Rn)
    f ← CREATEFAKEOBJECT(Ri, Fi, condi)
    Ri ← Ri ∪ {f}
    Fi ← Fi ∪ {f}
    bi ← bi − 1
    if bi = 0 then
        R ← R \ {i}
    B ← B − 1

Algorithm 2: Agent Selection for e-random
function SELECTAGENT(R, R1, ..., Rn)
    i ← select an agent from R at random
    return i
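A minimal Python sketch of Algorithms 1 and 2 follows; the function and variable names, and the trivial fake-object generator, are illustrative assumptions rather than the report's implementation.

import random

def create_fake_object(Ri, Fi, cond):
    # Hypothetical stand-in for the CREATEFAKEOBJECT black box: returns a new,
    # unique object assumed to satisfy the agent's condition `cond`.
    return ("fake", cond, len(Fi))

def allocate_explicit_with_fakes(R, cond, b, B):
    # Sketch of Algorithms 1 and 2 (e-random): hand out up to B fake objects,
    # at most b[i] of them to agent i. R maps agent -> set of requested objects
    # and is updated in place; the returned F maps agent -> fake objects given.
    F = {i: set() for i in R}
    eligible = [i for i in R if b[i] > 0]        # agents that may still receive fakes
    while B > 0 and eligible:
        i = random.choice(eligible)              # Algorithm 2: pick an agent at random
        f = create_fake_object(R[i], F[i], cond[i])
        R[i].add(f)
        F[i].add(f)
        b[i] -= 1
        if b[i] == 0:
            eligible.remove(i)
        B -= 1
    return F

# Example from Section 5.1: R1 = {t1, t2}, R2 = {t1}, B = 1, b1 = b2 = 1.
R = {"U1": {"t1", "t2"}, "U2": {"t1"}}
F = allocate_explicit_with_fakes(R, {"U1": None, "U2": None}, {"U1": 1, "U2": 1}, B=1)
print(R)
print(F)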

Flow chart for implementation of the above algorithms: (a) Allocation for Explicit Data Requests (EF) with fake objects; (b) Agent Selection for e-random and e-optimal. In both cases the CREATEFAKEOBJECT() method generates a fake object.

5.2 Sample Data Requests

With sample data requests, each agent Ui may receive any subset of T of the requested size, out of many different possible ones; hence, there are many different object allocations. In every allocation, the distributor can permute the T objects and keep the same chances of guilty agent detection. The reason is that the guilt probability depends only on which agents have received the leaked objects and not on the identity of the leaked objects. The distributor's problem is to pick one allocation that optimizes his objective. The distributor can increase the number of possible allocations by adding fake objects.

Algorithm 3: Allocation for Sample Data Requests (SF̄)
Input: m1, ..., mn, |T|
Output: R1, ..., Rn

a ← 0|T|                                // a[k]: number of agents that have received object tk
R1 ← ∅, ..., Rn ← ∅
remaining ← m1 + ... + mn
while remaining > 0 do
    for all i = 1, ..., n : |Ri| < mi do
        k ← SELECTOBJECT(i, Ri)
        Ri ← Ri ∪ {tk}
        a[k] ← a[k] + 1
        remaining ← remaining − 1

Algorithm 4: Object Selection for s-random
function SELECTOBJECT(i, Ri)
    k ← select at random an element from the set {k' | tk' ∉ Ri}
    return k
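The sample-request allocation of Algorithms 3 and 4 (s-random) can likewise be sketched in a few lines of Python; again, the names and the example request sizes are illustrative assumptions.

import random

def allocate_sample_requests(T, m):
    # Sketch of Algorithms 3 and 4 (s-random): in each round, every agent that
    # has not yet received its m[i] objects gets one object chosen at random
    # from the objects it does not already hold.
    R = {i: set() for i in m}
    a = {t: 0 for t in T}                        # a[t]: number of agents holding object t
    remaining = sum(m.values())
    while remaining > 0:
        for i in m:
            if len(R[i]) < m[i]:
                k = random.choice([t for t in T if t not in R[i]])   # Algorithm 4
                R[i].add(k)
                a[k] += 1
                remaining -= 1
    return R

# Example: four objects, two agents asking for samples of sizes 2 and 3.
T = ["t1", "t2", "t3", "t4"]
print(allocate_sample_requests(T, {"U1": 2, "U2": 3}))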

Flow chart for implementation of the above algorithms: (a) Allocation for Sample Data Requests (SF̄) without any fake objects; (b) Object Selection for s-random. In both cases the SELECTOBJECT() method is invoked to pick the next object.

5.3 Data Allocation Problem
The main focus of our work is the data allocation problem: how can the distributor intelligently give data to agents in order to improve the chances of detecting a guilty agent? As illustrated in Fig. 6.1, there are four instances of this problem, depending on the type of data requests made by the agents (E for explicit and S for sample requests) and on whether fake objects are allowed (F when the use of fake objects is allowed, and F̄ when it is not). Fake objects are objects generated by the distributor that are not in set T. They are designed to look like real objects and are distributed to agents together with the T objects, in order to increase the chances of detecting agents that leak data.

Fig 6.1: Leakage problem instances.

Assume that we have two agents with requests R1 = EXPLICIT(T, cond1) and R2 = SAMPLE(T', 1), where T' = EXPLICIT(T, cond2). Further, say that cond1 is "state = CA" (objects have a state field). If agent U2 has the same condition, cond2 = cond1, we can create an equivalent problem with sample data requests on set T'. That is, our problem will be how to distribute the CA objects to the two agents, with R1 = SAMPLE(T', |T'|) and R2 = SAMPLE(T', 1). If instead U2 uses the condition "state = NY", we can solve two different problems for sets T' and T − T'; in each problem there will be only one agent. Finally, if the conditions partially overlap, i.e., R1 ∩ T' ≠ ∅ but R1 ≠ T', we can solve three different problems for sets R1 − T', R1 ∩ T', and T' − R1.

5.3.1 Fake Objects
The distributor may be able to add fake objects to the distributed data in order to improve his effectiveness in detecting guilty agents. However, fake objects may impact the correctness of what agents do, so they may not always be allowable. The idea of perturbing data to detect leakage is not new; however, in most cases individual objects are perturbed, e.g., by adding random noise to sensitive salaries or by adding a watermark to an image. In our case, the set of distributor objects is perturbed by adding fake elements. For example, say the distributed data objects are medical records and the agents are hospitals. In this case, even small modifications to the records of actual patients may be undesirable. However, the addition of some fake medical records may be acceptable, since no patient matches these records, and hence no one will ever be treated based on fake records. A trace file is maintained to identify the guilty agent; trace files are a type of fake object that helps identify improper use of data. The creation of fake but real-looking objects is a nontrivial problem whose thorough investigation is beyond the scope of this report. Here, we model the creation of a fake object for agent Ui as a black-box function CREATEFAKEOBJECT(Ri, Fi, condi) that takes as input the set of all objects Ri, the subset of fake objects Fi that Ui has received so far, and condi, and returns a new fake object. This function needs condi to produce a valid object that satisfies Ui's condition. Set Ri is needed as input so that the created fake object is not only valid but also indistinguishable from other real objects.
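As a rough illustration of the CREATEFAKEOBJECT black box described above, the sketch below recombines field values from the agent's real records so that the fake record satisfies the agent's condition and looks plausible while matching no real entity. The record fields and values are invented for the example; an actual generator would need to be far more careful.

import random

def create_fake_object(Ri, Fi, cond):
    # Illustrative sketch of CREATEFAKEOBJECT(Ri, Fi, cond_i): recombine field
    # values drawn from the agent's real records Ri, force the agent's
    # condition, and retry until the result is neither a real record nor a
    # previously created fake.
    fields = ("name", "state", "diagnosis")
    while True:
        fake = {f: random.choice(Ri)[f] for f in fields}
        fake.update(cond)                        # e.g. enforce state = "CA"
        if fake not in Ri and fake not in Fi:
            return fake

# Illustrative records given to an agent whose condition is state = "CA".
Ri = [
    {"name": "Alice", "state": "CA", "diagnosis": "flu"},
    {"name": "Bob",   "state": "CA", "diagnosis": "asthma"},
]
Fi = []
Fi.append(create_fake_object(Ri, Fi, {"state": "CA"}))
print(Fi[0])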
5.3.2 Optimization Problem

The distributor's data allocation to agents has one constraint and one objective. The constraint is to satisfy the agents' requests by providing them with the number of objects they request, or with all available objects that satisfy their conditions. The objective is to be able to detect an agent who leaks any of his data objects. We consider the constraint as strict: the distributor may not deny serving an agent's request and may not provide agents with different perturbed versions of the same objects. We consider fake object allocation as the only possible constraint relaxation. The detection objective, stated ideally, is intractable: detection would be assured only if the distributor gave no data object to any agent.

We use instead the following objective: maximize the chances of detecting a guilty agent that leaks all of his objects. We now introduce some notation to state the distributor's objective formally. Recall that Pr{Gj | S = Ri}, or simply Pr{Gj | Ri}, is the probability that agent Uj is guilty if the distributor discovers a leaked table S that contains all Ri objects. We define the difference function Δ(i, j) as

Δ(i, j) = Pr{Gi | S = Ri} − Pr{Gj | S = Ri},   i, j = 1, ..., n.

Note that the differences have non-negative values: given that set Ri contains all the leaked objects, agent Ui is at least as likely to be guilty as any other agent. The difference Δ(i, j) is positive for any agent Uj whose set Rj does not contain all the data of S, and zero if Ri ⊆ Rj. In the latter case, the distributor will consider agents Ui and Uj equally guilty, since they have both received all the leaked objects. The larger a Δ(i, j) value is, the easier it is to identify Ui as the leaking agent. Thus, we want to distribute data so that Δ values are large.
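The difference function can be evaluated directly from the guilt model of Chapter 4. The following self-contained Python sketch (helper names are illustrative) computes Δ(i, j) for the two-agent example used earlier, assuming the leaked table S equals Ri:

def guilt_probability(agent, R, S, p):
    # Pr{G_agent | S} under Assumptions 1 and 2 (see the Chapter 4 sketch).
    not_guilty = 1.0
    for t in S & R[agent]:
        holders = sum(1 for r in R.values() if t in r)
        not_guilty *= 1.0 - (1.0 - p) / holders
    return 1.0 - not_guilty

def delta(i, j, R, p):
    # delta(i, j) = Pr{G_i | S = R_i} - Pr{G_j | S = R_i}: how much more guilty
    # agent U_i looks than U_j when exactly U_i's objects are found leaked.
    S = set(R[i])
    return guilt_probability(i, R, S, p) - guilt_probability(j, R, S, p)

# Two agents sharing object t1: when R1 leaks, U1 looks clearly more guilty.
R = {"U1": {"t1", "t2"}, "U2": {"t1", "t3"}}
print(delta("U1", "U2", R, p=0.5))               # 0.625 - 0.25 = 0.375 > 0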

Chapter 6
Intelligent Data Distribution
6.1 Hashed Distribution Algorithm
Input: agent ID (UID), number of data items requested in the dataset (N), fake records (F)
Output: distribution set (dataset + fake records)
1. Start.
2. Accept the data request from the agent and analyze:
   a. the type of request {sample, explicit};
   b. the probability of obtaining records by means other than the distributor, Pr{guessing};
   c. the number of records in the dataset (N), used to calculate the number of fake records to add in order to determine the guilty agent;
   d. the agent ID requesting the data (UID).
3. Generate the list of data to be sent to the agent (the dataset) and assign each record a unique distribution ID (DID).
4. For i = 1 to F, for each fake record:
   Mapping_Function(UID, FID): Hash(UID) → DID → FID
   Store DistributionDetails {FID, DID, UID}.
5. For i = 1 to F: AddFakeRecord(DistributionDetails). Output: distribution set.
6. Stop.
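One possible reading of the hashed distribution algorithm is sketched below in Python. The choice of hash function, the substitution of a fake record at the hashed position, and the record layout are assumptions made for illustration; the report treats these details abstractly.

import hashlib

def hashed_distribution(uid, dataset, fakes):
    # Sketch of the hashed distribution of Section 6.1. Each fake record replaces
    # an original record at a position derived from a hash of the agent ID; the
    # distributor keeps the {FID, DID, UID} mapping plus the displaced original
    # and the fake itself, so the agent can later be identified. Hash-position
    # collisions between fakes are ignored in this simplified sketch.
    records = list(dataset)
    details = []                                 # distributor-side DistributionDetails
    for fid, fake in enumerate(fakes):
        digest = hashlib.sha256(f"{uid}:{fid}".encode()).hexdigest()
        did = int(digest, 16) % len(records)     # hashed location for this fake record
        details.append({"FID": fid, "DID": did, "UID": uid,
                        "fake": fake, "substituted": records[did]})
        records[did] = fake                      # substitute the fake at that location
    return records, details

dataset = ["r1", "r2", "r3", "r4"]
dist_set, details = hashed_distribution("U1", dataset, fakes=["f1"])
print(dist_set)
print(details)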

6.2 Detection Process
The detection process starts when the set of distributed sensitive records is found in some unauthorized place. It completes in two phases. In phase one, the agent is identified by the presence of a fake record in the obtained set; if no matching fake record is identified, phase two begins, which searches for the missing record at the location where a fake record was substituted. The advantage of the second phase is that it works even when the agent identifies and deletes the fake records before leaking the data to the target (a small sketch of both phases is given at the end of this chapter).

Inverse mapping function (leaked data set):
1. Attach a DID to every record.
2. Sort the records in order of DID.
3. Search for and map fake records.
4. For every record:
   if a fake record is found, MapAgent(FID);
   else if no fake record is found, Map(UID) to obtain the hash location, identify the absence of the substituted record, and MapAgent(DID);
   else, the objects were obtained by some other means.

6.3 Benefits of Hashed Distribution
Once the data are distributed, fake records are used to identify the guilty agent. Here, instead of the record itself, its location is used to determine the guilty agent, so even when the agent detects and deletes a fake record, the location can still be determined by the distributor; the very absence of the fake record then reveals the identity of the agent by tracking the absence of the original record.

This distribution technique thus solves the data allocation and optimization problems to some extent.
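Continuing the sketch above, the two-phase detection process of Section 6.2 could look as follows; the DistributionDetails layout matches the illustrative hashed_distribution() sketch and is not the report's actual format.

def detect_leaker(leaked_set, all_details):
    # Sketch of the two-phase detection of Section 6.2, using the illustrative
    # DistributionDetails records produced by hashed_distribution() above.
    # Phase 1: a known fake record found in the leaked set names the agent.
    # Phase 2: if the agent stripped the fakes before leaking, the original
    # record substituted at the hashed location is missing from the leaked set,
    # which still points back to that agent.
    for d in all_details:
        if d["fake"] in leaked_set:              # phase 1: fake record present
            return d["UID"]
    for d in all_details:
        if d["substituted"] not in leaked_set:   # phase 2: substituted original absent
            return d["UID"]
    return None                                  # data probably obtained by other means

# Distributor-side details recorded at distribution time (illustrative values).
details = [{"FID": 0, "DID": 2, "UID": "U1", "fake": "f1", "substituted": "r3"}]
print(detect_leaker({"r1", "r2", "f1", "r4"}, details))   # phase 1 -> U1
print(detect_leaker({"r1", "r2", "r4"}, details))         # phase 2 -> U1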

Chapter 7
Summary
In a perfect world, there would be no need to hand over sensitive data to agents that may unknowingly or maliciously leak it. And even if we had to hand over sensitive data, in a perfect world we could watermark each object so that we could trace its origins with absolute certainty. However, in many cases we must indeed work with agents that may not be 100 percent trusted, and we may not be certain whether a leaked object came from an agent or from some other source, since certain data cannot admit watermarks. In spite of these difficulties, we have shown that it is possible to assess the likelihood that an agent is responsible for a leak, based on the overlap of his data with the leaked data and with the data of other agents, and based on the probability that objects can be guessed by other means. Our model is relatively simple, but we believe it captures the essential trade-offs. The algorithms we have presented implement a variety of data distribution strategies that can improve the distributor's chances of identifying a leaker. We have shown that distributing objects judiciously can make a significant difference in identifying guilty agents, especially in cases where there is large overlap in the data that agents must receive. In some cases, realistic but fake data records are injected to further improve the chances of detecting leakage and identifying the guilty party. In the future, our allocation strategies can be extended to handle agent requests in an online fashion (the presented strategies assume that there is a fixed set of agents with requests known in advance).

References and Links

[1] P. Papadimitriou and H. Garcia-Molina, "Data Leakage Detection," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 1, January 2011.
[2] S. Umamaheswari and H. Arthi Geetha, "Detection of Guilty Agents," Coimbatore Institute of Engineering and Technology.
[3] P. Papadimitriou and H. Garcia-Molina, "Data Leakage Detection," Stanford University.
[4] L. Sweeney, "Achieving k-Anonymity Privacy Protection Using Generalization and Suppression," http://en.scientificcommons.org/43196131, 2002.
[5] P. Gordon, "Data Leakage - Threats and Mitigation," SANS Institute Reading Room, October 15, 2007.
[6] S. Jajodia, P. Samarati, M.L. Sapino, and V.S. Subrahmanian, "Flexible Support for Multiple Access Control Policies," ACM Trans. Database Systems, vol. 26, no. 2, pp. 214-260, 2001.
