
Data Mining: Introduction

Lecture Notes for Chapter 1: Introduction to Data Mining

by Tan, Steinbach, Kumar

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

Why Mine Data? Commercial Viewpoint


- Lots of data is being collected and warehoused
  - Web data, e-commerce
  - purchases at department/grocery stores
  - Bank/Credit Card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong
  - Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint


- Data collected and stored at enormous speeds (GB/hour)
  - remote sensors on a satellite
  - telescopes scanning the skies
  - microarrays generating gene expression data
  - scientific simulations generating terabytes of data
- Traditional techniques infeasible for raw data
- Data mining may help scientists
  - in classifying and segmenting data
  - in hypothesis formation

Mining Large Data Sets - Motivation

- There is often information "hidden" in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since 1995 and number of analysts, 1995 to 1999.]

From: R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications"

What is Data Mining?


- Many Definitions
  - Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  - Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

What is (not) Data Mining?

What is not Data Mining?
- Look up a phone number in a phone directory
- Query a Web search engine for information about "Amazon"

What is Data Mining?
- Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly... in the Boston area)
- Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Origins of Data Mining

- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data

[Figure: Data Mining at the overlap of Statistics, Machine Learning/AI/Pattern Recognition, and Database systems.]

Data Mining Tasks

- Prediction Methods
  - Use some variables to predict unknown or future values of other variables.
- Description Methods
  - Find human-interpretable patterns that describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...

- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]

Classification: Definition

- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Classification Example

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the Training Set is used to learn a classifier (the Model), which is then applied to the Test Set.]
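A minimal sketch of the train/test procedure, using the ten labeled records of the example. The 1-nearest-neighbour rule, the `distance` weighting (income gap scaled by 100K) and the 7/3 split are assumptions chosen for brevity; they are not the method of the slides, and any classifier could take their place.

```python
# Labeled records from the example: (refund, marital_status, income_in_K, cheat)
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def distance(a, b):
    """Assumed mixed-attribute distance: categorical mismatches plus scaled income gap."""
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / 100.0


def predict(training, record):
    """1-nearest-neighbour: copy the class of the closest training record."""
    nearest = min(training, key=lambda t: distance(t, record))
    return nearest[3]


def accuracy(training, test):
    """Fraction of test records whose predicted class matches the true class."""
    hits = sum(predict(training, r) == r[3] for r in test)
    return hits / len(test)


# Build the model on the first seven records, validate on the remaining three.
training_set, test_set = RECORDS[:7], RECORDS[7:]
print("held-out accuracy:", accuracy(training_set, test_set))

# Classify a previously unseen record (class unknown, so the last field is None).
print("prediction:", predict(RECORDS, ("No", "Single", 75, None)))
```

With only ten records the held-out accuracy is a very noisy estimate; the point is the protocol (build on one part, validate on the other), not the particular model.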

Classification: Application 1

- Direct Marketing
  - Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  - Approach:
    - Use the data for a similar product introduced before.
    - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
    - Collect various demographic, lifestyle, and company-interaction related information about all such customers.
      - Type of business, where they stay, how much they earn, etc.
    - Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 2

- Fraud Detection
  - Goal: Predict fraudulent cases in credit card transactions.
  - Approach:
    - Use credit card transactions and the information on its account-holder as attributes.
      - When does a customer buy, what does he buy, how often he pays on time, etc.
    - Label past transactions as fraud or fair transactions. This forms the class attribute.
    - Learn a model for the class of the transactions.
    - Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 3

- Customer Attrition/Churn:
  - Goal: To predict whether a customer is likely to be lost to a competitor.
  - Approach:
    - Use detailed record of transactions with each of the past and present customers, to find attributes.
      - How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital status, etc.
    - Label the customers as loyal or disloyal.
    - Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 4

- Sky Survey Cataloging
  - Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
    - 3000 images with 23,040 x 23,040 pixels per image.
  - Approach:
    - Segment the image.
    - Measure image attributes (features) - 40 of them per object.
    - Model the class based on these features.
    - Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies
Courtesy: http://aps.umn.edu

Class:
- Stages of Formation (Early, Intermediate, Late)

Attributes:
- Image features
- Characteristics of light waves received, etc.

Data Size:
- 72 million stars, 20 million galaxies
- Object Catalog: 9 GB
- Image Database: 150 GB

Clustering Definition

- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.
- Similarity Measures:
  - Euclidean Distance if attributes are continuous.
  - Other Problem-specific Measures.

Illustrating Clustering
- Euclidean Distance Based Clustering in 3-D space.
  - Intracluster distances are minimized
  - Intercluster distances are maximized
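The two objectives (small intracluster distances, large intercluster distances) can be sketched with a few rounds of k-means on 3-D points. The choice of k-means, the sample points and the starting centroids are illustrative assumptions; the slides do not name a specific algorithm.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
        for p in points
    ]


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids to cluster means."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign(points, centroids), centroids


# Hypothetical sample: two well-separated groups in 3-D space.
points = [
    (1.0, 1.0, 1.0), (1.2, 0.9, 1.1), (0.8, 1.1, 0.9),
    (5.0, 5.0, 5.0), (5.1, 4.9, 5.2), (4.9, 5.2, 4.8),
]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

Each update step moves a centroid to the mean of its members, which reduces the within-cluster distances for the current assignment.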

Clustering: Application 1

- Market Segmentation:
  - Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  - Approach:
    - Collect different attributes of customers based on their geographical and lifestyle related information.
    - Find clusters of similar customers.
    - Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

Clustering: Application 2

- Document Clustering:
  - Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  - Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  - Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
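One way to form a similarity measure from term frequencies is cosine similarity between term-count vectors. This is a sketch under assumptions: the slides do not say which measure was used, and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated term occurs in a document."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: two financial, one sports.
docs = [
    "stocks bonds market rally",
    "market stocks fall bonds",
    "team wins game final",
]
vectors = [term_frequencies(d) for d in docs]
print(cosine_similarity(vectors[0], vectors[1]))  # shares three of four terms
print(cosine_similarity(vectors[0], vectors[2]))  # shares no terms
```

A clustering algorithm can then group documents whose pairwise similarity is high; in practice very common words are usually filtered out first.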

Illustrating Document Clustering


- Clustering Points: 3204 Articles of Los Angeles Times.
- Similarity Measure: How many words are common in these documents (after some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278

Clustering of S&P 500 Stock Data

- Observe Stock Movements every day.
- Clustering points: Stock-{UP/DOWN}
- Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
  - We used association rules to quantify a similarity measure.

Discovered Clusters and Industry Group

Cluster 1 (Industry Group: Technology1-DOWN)
  Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN

Cluster 2 (Industry Group: Technology2-DOWN)
  Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN

Cluster 3 (Industry Group: Financial-DOWN)
  Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN

Cluster 4 (Industry Group: Oil-UP)
  Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

Association Rule Discovery: Definition

- Given a set of records each of which contain some number of items from a given collection;
  - Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
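How strongly the five transactions back each discovered rule can be checked with the standard support and confidence measures. These two measures are not named on this slide, so their use here is an assumption about how the rules would be scored.

```python
# The five transactions from the table, one set of items per TID.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in TRANSACTIONS) / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", round(support(lhs | rhs), 2),
          "confidence:", round(confidence(lhs, rhs), 2))
```

{Milk} --> {Coke} holds in 3 of the 5 transactions and in 3 of the 4 that contain Milk; {Diaper, Milk} --> {Beer} holds in 2 of 5 and in 2 of the 3 that contain both Diaper and Milk.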

Association Rule Discovery: Application 1

- Marketing and Sales Promotion:
  - Let the rule discovered be {Bagels, ...} --> {Potato Chips}
  - Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
  - Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
  - Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!

Association Rule Discovery: Application 2

- Supermarket shelf management.
  - Goal: To identify items that are bought together by sufficiently many customers.
  - Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
  - A classic rule --
    - If a customer buys diaper and milk, then he is very likely to buy beer.
    - So, don't be surprised if you find six-packs stacked next to diapers!

Association Rule Discovery: Application 3

- Inventory Management:
  - Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumer households.
  - Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

Sequential Pattern Discovery: Definition

- Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

    (A B)   (C)   -->   (D E)

- Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

[Figure: the pattern (A B) (C) (D E) annotated with the timing constraints <= xg, > ng, <= ws and <= ms.]

Sequential Pattern Discovery: Examples

- In telecommunications alarm logs,
  - (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
- In point-of-sale transaction sequences,
  - Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk)
  - Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
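A small sketch of the basic test behind such rules: whether one object's timeline contains the itemsets of a pattern in order. Timing constraints are left out for simplicity, and the sample timeline is hypothetical.

```python
def contains_sequence(timeline, pattern):
    """True if each itemset of the pattern occurs, in order, within the timeline.

    Both arguments are lists of sets; every pattern itemset must be a subset
    of a distinct, strictly later event than the previous itemset's match.
    """
    position = 0
    for itemset in pattern:
        # Advance to the first remaining event that contains the whole itemset.
        while position < len(timeline) and not itemset <= timeline[position]:
            position += 1
        if position == len(timeline):
            return False
        position += 1  # the next itemset must match a later event
    return True


# Hypothetical purchase timeline for one customer of the athletic apparel store.
timeline = [{"Shoes"}, {"Socks"}, {"Racket", "Racketball"}, {"Sports_Jacket"}]
pattern = [{"Shoes"}, {"Racket", "Racketball"}, {"Sports_Jacket"}]

print(contains_sequence(timeline, pattern))
print(contains_sequence(timeline, list(reversed(pattern))))
```

Counting how many timelines contain a pattern gives its support; a rule such as (Shoes) (Racket, Racketball) --> (Sports_Jacket) is then judged by how often the full pattern follows its first two itemsets.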

Regression

- Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Greatly studied in statistics, neural network fields.
- Examples:
  - Predicting sales amounts of new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.
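For the linear case, the first example (sales as a function of advertising expenditure) can be sketched with an ordinary least-squares line fit. The numbers below are made up for illustration and the single-variable model is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure and resulting sales amounts.
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 5.0 + intercept  # predicted sales for an unseen expenditure
print(slope, intercept, predicted)
```

A nonlinear model of dependency would replace the straight line with another function family, but the idea of fitting parameters to known values and predicting unknown ones stays the same.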

Deviation/Anomaly Detection

- Detect significant deviations from normal behavior
- Applications:
  - Credit Card Fraud Detection
  - Network Intrusion Detection

Typical network traffic at University level may reach over 100 million connections per day
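One simple way to make "significant deviation from normal behavior" concrete is to describe normal behavior by the mean and standard deviation of past values and flag new values that fall too far away. The three-standard-deviation threshold and the sample amounts are assumptions for illustration only.

```python
import statistics


def flag_deviations(baseline, observations, threshold=3.0):
    """Return observations lying more than `threshold` standard deviations
    from the mean of the baseline (the assumed normal behavior)."""
    mean = statistics.mean(baseline)
    spread = statistics.pstdev(baseline)
    return [x for x in observations if abs(x - mean) > threshold * spread]


# Hypothetical card history: typical purchase amounts, then three new transactions.
normal_amounts = [20, 22, 19, 21, 20, 23]
new_amounts = [21, 500, 18]

print(flag_deviations(normal_amounts, new_amounts))
```

The baseline is estimated from past behavior only, so a single extreme new value cannot inflate the spread and hide itself.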

Challenges of Data Mining


- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Data Quality
- Data Ownership and Distribution
- Privacy Preservation
- Streaming Data
