Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Lots of data is being collected and warehoused
- Web data, e-commerce
- purchases at department/grocery stores
- Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
- Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Data collected and stored at enormous speeds (GB/hour)
- remote sensors on a satellite
- telescopes scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
- in classifying and segmenting data
- in hypothesis formation
There is often information "hidden" in the data that is not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all
[Figure: chart over the years 1995 to 1999 with a vertical axis from 0 to 4,000,000; one series is labeled "Number of analysts".]
From: R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications"
Definitions
- Non-trivial extraction of implicit, previously unknown and potentially useful information from data
- Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
What is Data Mining?
- Certain names are more prevalent in certain US locations (O'Brien, O'Rurke, O'Reilly... in Boston area)
- Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)

What is not Data Mining?
- Look up phone number in phone directory
- Query a Web search engine for information about "Amazon"
Prediction Methods
- Use some variables to predict unknown or future values of other variables.
Description Methods
- Find human-interpretable patterns that describe the data.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
Classification: Definition
Find a model for the class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
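The train/test procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the records are hypothetical, each record has a single continuous attribute, and the "model" is a simple nearest-centroid rule chosen for brevity, not a method prescribed by the slides.

```python
import random

def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def learn_model(training_set):
    """Build the model: the mean attribute value (centroid) of each class."""
    sums, counts = {}, {}
    for value, label in training_set:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(model, value):
    """Assign the class whose centroid is closest to the attribute value."""
    return min(model, key=lambda label: abs(model[label] - value))

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for value, label in test_set if predict(model, value) == label)
    return correct / len(test_set)

# Hypothetical records: (continuous attribute, class label).
records = [(1.0, "no"), (1.2, "no"), (0.8, "no"), (1.1, "no"), (0.9, "no"),
           (5.0, "yes"), (5.2, "yes"), (4.8, "yes"), (5.1, "yes"), (4.9, "yes")]

training_set, test_set = split_train_test(records)
model = learn_model(training_set)            # built from the training set only
test_accuracy = accuracy(model, test_set)    # validated on unseen records
print("test accuracy:", test_accuracy)
```

The test records play no part in building the model, so the reported accuracy estimates how well the model handles previously unseen records.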
Classification Example
[Figure: a table of records with two categorical attributes, one continuous attribute (Taxable Income) and a class attribute (Cheat), divided into a Training Set and a Test Set. The Training Set is used to learn a classifier, which produces the Model.]
Classification: Application 1
Direct Marketing
- Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
- Approach:
  - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
  - Collect various demographic, lifestyle, and company-interaction related information about all such customers.
    + Type of business, where they stay, how much they earn, etc.
  - Use this information to learn a classifier.
Classification: Application 2
Fraud Detection
- Goal: Predict fraudulent cases in credit card transactions.
- Approach:
  - Label past transactions as fraud or fair transactions. This forms the class attribute.
  - Learn a model for the class of the transactions.
  - Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 3
Customer Attrition/Churn:
- Goal: To predict whether a customer is likely to be lost to a competitor.
- Approach:
  - Use detailed records of transactions with each of the past and present customers to find attributes.
    + How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
  - Label each customer by whether they were lost to a competitor.
Classification: Application 4
Sky Survey Cataloging
- Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
  + 3000 images with 23,040 x 23,040 pixels per image.
- Approach:
  - Segment the image.
  - Measure image attributes (features).
  - Model the class based on these features.
- Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Courtesy: http://aps.umn.edu

Class:
- Stages of Formation (Early, Intermediate, Late)

Attributes:
- Image features
- Characteristics of light waves received, etc.

Data Size:
- 72 million stars, 20 million galaxies
- Object Catalog: 9 GB
- Image Database: 150 GB
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
- Data points in one cluster are more similar to one another.
- Data points in separate clusters are less similar to one another.
Similarity Measures:
- Euclidean Distance if attributes are continuous.
- Other Problem-specific Measures.
Illustrating Clustering
- Euclidean Distance Based Clustering in 3-D space.
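One possible way to make Euclidean-distance-based clustering concrete is the k-means procedure sketched below. The slides do not name a specific algorithm, so k-means is an assumption made here for illustration, and the six 3-D points and the two starting centroids are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid, until the assignment stops changing."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_point(members)
    return assignment, centroids

# Hypothetical data: two well-separated groups of points in 3-D space.
points = [(1.0, 1.0, 1.0), (1.2, 0.9, 1.1), (0.8, 1.1, 0.9),
          (8.0, 8.0, 8.0), (8.1, 7.9, 8.2), (7.9, 8.2, 7.8)]

assignment, centroids = k_means(points, initial_centroids=[points[0], points[3]])
print(assignment)
print(centroids)
```

Points that end up in the same cluster are close to each other and far from the points of the other cluster, which is the property the clustering definition asks for.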
Clustering: Application 1
Market Segmentation:
- Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach:
  - Collect different attributes of customers based on their geographical and lifestyle related information.
  - Find clusters of similar customers.
  - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
Clustering: Application 2
Document Clustering:
- Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
- Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
- Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
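A similarity measure based on term frequencies could look like the sketch below. Cosine similarity is an assumption made for illustration (the slides only say the measure is based on term frequencies), and the stop-word list and the three short documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical list of words to filter out before counting terms.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}

def term_frequencies(text):
    """Count how often each remaining term occurs in the document."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOP_WORDS)

def cosine_similarity(tf_a, tf_b):
    """Similarity in [0, 1] computed from two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a if t in tf_b)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents.
doc_a = term_frequencies("Stocks fell as the market reacted to interest rates.")
doc_b = term_frequencies("The market rallied and stocks rose on lower interest rates.")
doc_c = term_frequencies("The team won the final game of the season.")

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print("a vs b:", round(sim_ab, 3))
print("a vs c:", round(sim_ac, 3))
```

The two documents that share terms score higher than the unrelated pair, so a clustering method fed with this measure would tend to group them together.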
Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents (after some word filtering).

Category        Total Articles   Correctly Placed
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278
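The table can be summarized as the fraction of articles placed correctly, per category and overall. The short script below is only a worked-arithmetic sketch using the counts reported for the Los Angeles Times articles; the variable names are illustrative.

```python
# (category, total articles, correctly placed), as reported for the
# Los Angeles Times document clustering.
results = [
    ("Financial", 555, 364),
    ("Foreign", 341, 260),
    ("National", 273, 36),
    ("Metro", 943, 746),
    ("Sports", 738, 573),
    ("Entertainment", 354, 278),
]

total_articles = sum(total for _, total, _ in results)
total_correct = sum(correct for _, _, correct in results)

# Fraction of correctly placed articles in each category.
for category, total, correct in results:
    print(f"{category:<14}{correct / total:.1%}")

# Overall fraction across all categories.
overall_fraction = total_correct / total_articles
print(f"{'Overall':<14}{overall_fraction:.1%}")
```

The category totals add up to the 3204 articles stated above, which is a quick consistency check on the table.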
Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Given a set of records, each of which contains some number of items from a given collection:
- Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
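How strongly the five transactions back each discovered rule can be checked with two common measures, support and confidence. These measures are not named in the slides and are introduced here as an assumption; the function names are illustrative.

```python
# The five transactions from the table (TID -> set of items).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for antecedent, consequent in rules:
    print(sorted(antecedent), "-->", sorted(consequent),
          "support:", support(antecedent, consequent),
          "confidence:", round(confidence(antecedent, consequent), 2))
```

For {Milk} --> {Coke}, Milk appears in four transactions and three of those also contain Coke, so the confidence is 3/4.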
Marketing and Sales Promotion:
- Let the rule discovered be {Bagels, ...} --> {Potato Chips}
- Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
- Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
- Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
Supermarket shelf management.
- Goal: To identify items that are bought together by sufficiently many customers.
- Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
- A classic rule:
  - If a customer buys diaper and milk, then he is very likely to buy beer.
  - So, don't be surprised if you find six-packs stacked next to diapers!
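Finding items bought together by sufficiently many customers can be sketched as counting item pairs across baskets and keeping those above a threshold. The baskets and the threshold below are hypothetical, and counting only pairs is a simplification made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_customers):
    """Return the item pairs that occur together in at least
    `min_customers` baskets, with their counts."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_customers}

# Hypothetical point-of-sale data: one list of items per customer visit.
baskets = [
    ["diaper", "milk", "beer"],
    ["diaper", "milk", "beer", "bread"],
    ["milk", "bread"],
    ["diaper", "beer"],
    ["milk", "coke"],
]

pairs = frequent_pairs(baskets, min_customers=3)
print(pairs)
```

Lowering `min_customers` admits more pairs; the threshold is what turns "bought together" into "bought together by sufficiently many customers".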
Inventory Management:
- Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
- Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
[Figure: a sequential pattern of the form (A B) (C) (D E).]
Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
[Figure: the pattern (A B) (C) (D E) annotated with the timing constraints <= xg, > ng, <= ms and <= ws.]
In telecommunications alarm logs,
- (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
In point-of-sale transaction sequences,
- Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
- Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
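Checking whether a customer's purchase history supports a sequential rule such as the bookstore example can be sketched as an ordered-containment test. The customer histories below are hypothetical, timing constraints are ignored for simplicity, and the confidence ratio at the end is an assumed way of scoring the rule, not one given in the slides.

```python
def contains_in_order(sequence, pattern):
    """True if every element of `pattern` is a subset of some element of
    `sequence`, with the matches occurring in the same order.
    Timing constraints are ignored in this sketch."""
    position = 0
    for wanted in pattern:
        # Advance to the next transaction that contains all wanted items.
        while position < len(sequence) and not wanted <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # later pattern elements must match later transactions
    return True

antecedent = [{"Intro_To_Visual_C"}, {"C++_Primer"}]
consequent = [{"Perl_for_dummies", "Tcl_Tk"}]

# Hypothetical customers: each history is a time-ordered list of transactions.
histories = {
    "customer_1": [{"Intro_To_Visual_C"}, {"C++_Primer"},
                   {"Perl_for_dummies", "Tcl_Tk"}],
    "customer_2": [{"C++_Primer"}, {"Intro_To_Visual_C"}],
    "customer_3": [{"Intro_To_Visual_C", "Notebook"}, {"C++_Primer"}, {"Tcl_Tk"}],
}

matches_antecedent = [name for name, seq in histories.items()
                      if contains_in_order(seq, antecedent)]
matches_rule = [name for name, seq in histories.items()
                if contains_in_order(seq, antecedent + consequent)]

rule_confidence = len(matches_rule) / len(matches_antecedent)
print(matches_antecedent, matches_rule, rule_confidence)
```

The second customer bought both books but in the wrong order, so that history does not count toward the rule; order is what separates sequential patterns from plain co-occurrence.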
Regression
Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Greatly studied in statistics and neural network fields.
Examples:
- Predicting sales amounts of a new product based on advertising expenditure.
- Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
- Time series prediction of stock market indices.
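For the linear case with a single predictor, the dependency can be fitted by ordinary least squares, as in the sketch below. The advertising and sales figures are hypothetical and chosen to lie exactly on a line so the result is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)

# Predict the sales amount for a new expenditure level.
predicted_sales = slope * 6.0 + intercept
print(slope, intercept, predicted_sales)
```

With real data the points will not fall exactly on a line, and a nonlinear model of dependency may fit better than this linear one.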
Deviation/Anomaly Detection
Detect significant deviations from normal behavior.
Applications:
- Credit Card Fraud Detection
Typical network traffic at University level may reach over 100 million connections per day.
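One simple way to flag significant deviations from normal behavior is to measure how many standard deviations each value lies from the mean. This z-score rule is an assumption chosen for illustration, not a technique named in the slides, and the transaction amounts and threshold are hypothetical.

```python
import statistics

def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean of all values."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical credit card transaction amounts on one account.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]

suspicious = find_deviations(amounts)
print(suspicious)
```

A single extreme value also inflates the mean and standard deviation it is compared against, so robust statistics such as the median are often preferred in practice.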
- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Data Quality
- Data Ownership and Distribution
- Privacy Preservation
- Streaming Data