Detecting Data Leakage

Panagiotis Papadimitriou
papadimitriou@stanford.edu

Hector Garcia-Molina
hector@cs.stanford.edu

Leakage Problem
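[Figure: the profiles of Kathryn, Jeremy, Sarah and Mark (e.g. "Name: Sarah, Sex: Female" and "Name: Mark, Sex: Male") are shared with App. U1 and App. U2; a profile can also reach the outside through other sources, e.g. Sarah's network]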

Stanford Infolab

Outline
Problem Description
Guilt Models
Pr{U1 leaked data} = 0.7, Pr{U2 leaked data} = 0.2
Distribution Strategies

Problem Entities
Entity: Distributor (Facebook)
Dataset: T, the set of all Facebook profiles

Entity: Agents (Facebook apps U1, ..., Un)
Dataset: R1, ..., Rn, where Ri is the set of profiles of the people who have added the application Ui

Entity: Leaker
Dataset: S, the set of leaked profiles
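As a rough illustration, the entities above map naturally onto sets. The sketch below is not from the talk: the assignment of profiles to apps and the contents of S are hypothetical toy data, and `leaked_overlap` is an illustrative helper name.

```python
# Hypothetical toy data: which profile added which app is assumed for illustration.
T = {"Kathryn", "Jeremy", "Sarah", "Mark"}   # distributor's dataset: all profiles
R = {
    "U1": {"Jeremy", "Sarah"},               # profiles of people who added app U1
    "U2": {"Sarah", "Mark"},                 # profiles of people who added app U2
}
S = {"Sarah", "Mark"}                        # leaked profiles

# Every agent's set must be drawn from the distributor's dataset.
assert all(Ri <= T for Ri in R.values())


def leaked_overlap(R, S):
    """For each agent, return the leaked profiles that the agent received."""
    return {agent: S & Ri for agent, Ri in R.items()}


overlap = leaked_overlap(R, S)
for agent, profiles in sorted(overlap.items()):
    print(agent, sorted(profiles))
```

Only the profiles in the intersection of S and Ri can count as evidence against agent Ui, which is why the intersection is the natural starting point for any guilt estimate.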

Agents' Data Requests

Sample
100 profiles of Stanford people

Explicit
All people who added the application (the example we used so far)
All Stanford profiles

Guilt Models (1/3)


p: posterior probability that a leaked profile comes from other sources

Guilty agent: an agent who leaks at least one profile
Pr{Gi|S}: probability that agent Ui is guilty, given the leaked set of profiles S
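The slide defines Pr{Gi|S} without giving a formula. One way to make it concrete is sketched below under two assumptions that are not stated on the slide: each leaked profile's origin is independent of the others, and when a profile did come from an agent, every agent holding it is equally likely to be the source. Under those assumptions, Pr{Gi|S} = 1 - prod over t in (S ∩ Ri) of (1 - (1 - p) / |Vt|), where Vt is the set of agents that received t. The function name and the toy data are hypothetical.

```python
def guilt_probability(agent, R, S, p):
    """Estimate Pr{G_agent | S} under the independence assumptions stated above.

    R: dict mapping agent name -> set of profiles given to that agent
    S: set of leaked profiles
    p: probability that a leaked profile comes from other sources
    """
    prob_innocent = 1.0
    for t in S & R[agent]:
        # Number of agents that received profile t (at least 1: the agent itself).
        holders = sum(1 for Rj in R.values() if t in Rj)
        # Probability that this agent did NOT leak t.
        prob_innocent *= 1.0 - (1.0 - p) / holders
    return 1.0 - prob_innocent


# Hypothetical toy data.
R = {"U1": {"Jeremy", "Sarah"}, "U2": {"Sarah", "Mark"}}
S = {"Jeremy", "Sarah"}
g1 = guilt_probability("U1", R, S, p=0.2)  # about 0.88
g2 = guilt_probability("U2", R, S, p=0.2)  # about 0.40
print(round(g1, 2), round(g2, 2))
```

U1 looks far more suspicious than U2 here because the leaked profile "Jeremy" was given to U1 only, while "Sarah" is shared by both agents.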

Guilt Models (2/3)


Agents leak each of their data items independently
or
Agents leak all their data items OR nothing

[Figure: the possible origins of two leaked profiles, with probabilities p^2, p(1-p), (1-p)p and (1-p)^2]
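The four probabilities on this slide can be reproduced by enumerating where each of two leaked profiles came from. This is a minimal sketch, assuming each profile's origin is independent and is "other" (other sources) with probability p or "agent" with probability 1 - p; the function name and the value p = 0.3 are illustrative.

```python
from itertools import product


def origin_scenarios(p, n_items=2):
    """Map every combination of origins for n_items leaked profiles to its probability."""
    scenarios = {}
    for origins in product(("other", "agent"), repeat=n_items):
        prob = 1.0
        for origin in origins:
            prob *= p if origin == "other" else (1.0 - p)
        scenarios[origins] = prob
    return scenarios


scenarios = origin_scenarios(p=0.3)
for origins, prob in scenarios.items():
    print(origins, round(prob, 4))

# The scenarios are exhaustive and mutually exclusive, so they sum to 1.
total = sum(scenarios.values())
```

The probability that at least one of the two profiles came from an agent is 1 - p^2, the complement of the ("other", "other") scenario.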

Guilt Models (3/3)


[Figure: two plots, titled "Independently" and "NOT Independently", each with axes Pr{G1} and Pr{G2}]

The Distributor's Objective (1/2)

[Figure: the sets R1, ..., R4 given to agents U1, ..., U4, and the leaked set S]

Pr{G1|S} >> Pr{G2|S}
Pr{G1|S} >> Pr{G4|S}

The Distributor's Objective (2/2)

To achieve his objective, the distributor has to distribute sets R1, ..., Rn that minimize
$\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|, \quad i, j = 1, \dots, n$

Intuition: Minimized data sharing among agents makes leaked data reveal the guilty agents
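To make the intuition concrete, the sketch below evaluates one reading of the quantity being minimized: the sum over agents of the overlap with every other agent, normalized by the size of the agent's own set. The function name and the toy allocations are hypothetical.

```python
def sum_objective(R):
    """Return sum_i (1/|R_i|) * sum_{j != i} |R_i & R_j| for a dict of agent -> set."""
    total = 0.0
    for i, Ri in R.items():
        shared = sum(len(Ri & Rj) for j, Rj in R.items() if j != i)
        total += shared / len(Ri)
    return total


# Hypothetical allocations: disjoint sets share nothing, so the objective is 0.
disjoint = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Sarah", "Mark"}}
# One shared profile contributes 1/2 for each of the two agents.
shared_one = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Jeremy", "Sarah"}}

print(sum_objective(disjoint), sum_objective(shared_one))
```

A lower value means a leaked profile points to fewer candidate agents, which is what lets the leaked set S single out the guilty one.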

Distribution Strategies Sample (1/4)


Set T has four profiles:
Kathryn, Jeremy, Sarah and Mark

There are 4 agents:


U1, U2, U3 and U4

Each agent requests a sample of any 2 profiles of T for a market survey


Distribution Strategies Sample (2/4)


Poor
Minimize $\sum_{i \neq j} |R_i \cap R_j|$

[Figure: assignment of profiles to agents U1, U2, U3 and U4]

Distribution Strategies Sample (3/4)


Optimal Distribution
Avoid full overlaps and minimize $\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|$

[Figure: assignment of profiles to agents U1, U2, U3 and U4]
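The difference between a poor and a good distribution can be checked numerically on the four-profile example. The two allocations below are hypothetical, since the slide figures did not survive extraction. `max_relative_overlap` is an illustrative measure of "full overlaps" chosen for this sketch; it is not taken from the slides.

```python
def total_overlap(R):
    """Sum of |R_i & R_j| over all ordered pairs of distinct agents."""
    return sum(len(Ri & Rj) for i, Ri in R.items() for j, Rj in R.items() if i != j)


def max_relative_overlap(R):
    """Largest fraction of any agent's set that is shared with a single other agent."""
    return max(
        len(Ri & Rj) / len(Ri)
        for i, Ri in R.items()
        for j, Rj in R.items()
        if i != j
    )


# Hypothetical "poor" allocation: agents receive identical pairs.
poor = {
    "U1": {"Kathryn", "Jeremy"}, "U2": {"Kathryn", "Jeremy"},
    "U3": {"Sarah", "Mark"},     "U4": {"Sarah", "Mark"},
}
# Hypothetical allocation without full overlaps: each pair shares at most one profile.
spread = {
    "U1": {"Kathryn", "Jeremy"}, "U2": {"Jeremy", "Sarah"},
    "U3": {"Sarah", "Mark"},     "U4": {"Mark", "Kathryn"},
}

print(total_overlap(poor), total_overlap(spread))
print(max_relative_overlap(poor), max_relative_overlap(spread))
```

Both allocations have the same total overlap, so minimizing the total alone cannot tell them apart. In the first one, a leak of Kathryn and Jeremy leaves U1 and U2 indistinguishable, which is why full overlaps are worth avoiding as a separate criterion.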

Distribution Strategies Sample (4/4)

Distribution Strategies

Sample Data Requests
The distributor has the freedom to select the data items to provide the agents with
General idea: provide agents with sets of data that are as disjoint as possible
Problem: there are cases where the distributed data must overlap, e.g., |R1| + ... + |Rn| > |T|

Explicit Data Requests
The distributor must provide agents with the data they request
General idea: add fake data to the distributed data to minimize the overlap of the distributed data
Problem: agents can collude and identify fake data (NOT COVERED in this talk)
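For sample data requests, the idea of handing out sets that are as disjoint as possible can be sketched with a simple greedy heuristic. This is a hypothetical illustration, not the algorithm from the talk. For each agent it repeatedly picks the profile that has been handed out the fewest times and, among ties, the one that keeps the largest overlap with any earlier agent smallest. All function names are assumptions.

```python
def overlap_is_forced(T, requests):
    """True when the requested sample sizes add up to more than |T|."""
    return sum(requests.values()) > len(T)


def allocate_samples(T, requests):
    """Greedy sketch: requests maps agent -> sample size; returns agent -> set of items."""
    times_given = {t: 0 for t in T}
    R = {}
    for agent, m in requests.items():
        if m > len(T):
            raise ValueError("an agent cannot receive more items than T holds")
        chosen = set()
        for _ in range(m):
            def score(t):
                # Worst overlap with any earlier agent if t were added to this sample.
                worst = max((len((chosen | {t}) & Rj) for Rj in R.values()), default=0)
                # Prefer rarely given items, then small overlaps; the name breaks ties.
                return (times_given[t], worst, t)

            best = min(T - chosen, key=score)
            chosen.add(best)
            times_given[best] += 1
        R[agent] = chosen
    return R


T = {"Kathryn", "Jeremy", "Sarah", "Mark"}
requests = {"U1": 2, "U2": 2, "U3": 2, "U4": 2}
forced = overlap_is_forced(T, requests)   # 8 requested items, only 4 profiles
allocation = allocate_samples(T, requests)
for agent, items in allocation.items():
    print(agent, sorted(items))
```

On this example overlap is unavoidable because eight items are requested from four profiles, but the heuristic still ensures that no two agents receive the same pair.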

Conclusions
Data leakage modeled as a maximum likelihood problem
Data distribution strategies that help identify the guilty agents

Thank You!
