
A Tutorial for Text Classification using SQL Server 2005 Beta2 Data Mining


Peter Pyungchul Kim
SQL Business Intelligence
Microsoft Corporation
Introduction
This tutorial presents detailed steps for performing a typical text classification
task using SQL Server 2005 Beta2. The sample dataset is obtained from
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html. The dataset is a
small subset of USENET newsgroup postings that belong to 5 different groups. The task
is to build a mining model to classify each posting into its group. This tutorial document
should be available together with an import-ready file, NGArticles.txt (or
NGArticles.zip).
1 Create a database
1.1 In SQL Mgmt Studio, connect to the local SQL server (localhost).
1.2 Create a new database and name it "TDM".

2 Import News Group Articles to the database
2.1 Right-click the database, TDM, and select Task > Import.
    Source: NGArticles.txt (Flat File, unzipped from the provided NGArticles.zip)
        Header row delimiter: ;;;;
        Check "Column names in the first data row"
        Row delimiter: ;;;;
        Column delimiter: <<<<
        Column property for "ArticleText": Change DataType to DT_NTEXT
    Destination:
        Server: local SQL server (localhost)
        Database: TDM
        Table: NGArticles
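
To make the import settings concrete, the following Python sketch shows how a flat file with multi-character row and column delimiters might be split into records outside the import wizard. It is only an illustration under assumptions: the function name parse_flat_file, the sample text, and the newsgroup names in it are hypothetical, and the delimiter strings are passed in as parameters rather than taken as known values.

```python
def parse_flat_file(text, row_delim, col_delim):
    """Split raw text into a list of dicts, using the first record as column names."""
    # Drop empty records, such as the one produced by a trailing row delimiter.
    records = [r for r in text.split(row_delim) if r.strip()]
    header = [name.strip() for name in records[0].split(col_delim)]
    rows = []
    for record in records[1:]:
        fields = [f.strip() for f in record.split(col_delim)]
        rows.append(dict(zip(header, fields)))
    return rows


# Hypothetical two-article sample; the delimiter strings are placeholders.
ROW_DELIM, COL_DELIM = ";;;;", "<<<<"
sample = (
    "ID<<<<NewsGroup<<<<ArticleText;;;;"
    "1<<<<sci.space<<<<The shuttle launch\nwas delayed.;;;;"
    "2<<<<rec.autos<<<<New engine, old car.;;;;"
)
articles = parse_flat_file(sample, ROW_DELIM, COL_DELIM)
```

Multi-character delimiters are useful here because article text can itself contain commas, tabs, and line breaks, which would break a conventional CSV layout.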




3 Build a dictionary
3.1 Start Business Intelligence Development Studio with a new Integration Services
project called "TextDataMining". This will create a solution and an Integration
Services project in it, both of which are named "TextDataMining".
3.2 Rename the Integration Services project as "PrepareArticles" just for convenience.

3.3 Create a new DTS (SSIS) package
3.4 Rename the package to BuildDictionary.dtsx
3.5 Go to the Data Flow tab and add a new Data Flow task
3.6 In the data flow task, add an "OLE DB Source" transform
    Connection: create a new one for localhost.TDM
    Table: NGArticles
    Columns: ArticleText only
3.7 Add a "Term Extraction" transform and connect from the OLE DB Source transform
    Term Type: Noun and Noun Phrase
    Score Type: TFIDF
    Parameters: Frequency=10, Length=2
3.8 Add a "Sort" transform and connect it.
    Sort "Term" in ascending order
    Don't pass through the Score column
3.9 Add an "OLE DB Destination" transform and connect it.
    Use the connection: localhost.TDM
    Click "New" and name it "Dictionary"
    In Mappings, connect the column, "Term"
3.10 Execute the package
    It automatically enters debugging mode
    It may take a few minutes
3.11 Stop debugging
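
As a rough illustration of what the dictionary-building data flow computes, the Python sketch below scores candidate terms with a TF-IDF formula and keeps those that meet a frequency threshold. It rests on several assumptions: it substitutes plain word n-grams for the noun and noun-phrase detection of the Term Extraction transform, it takes the score to be (term frequency) x log(number of rows / number of rows containing the term), and it reads Frequency=10 as "occurs at least 10 times" and Length=2 as "at most 2 words". The function name build_dictionary and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def build_dictionary(texts, min_frequency=10, max_length=2):
    """Return sorted (term, score) pairs for word n-grams of up to max_length words."""
    term_freq = Counter()  # total occurrences of each term across all texts
    doc_freq = Counter()   # number of texts that contain each term
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        terms = [
            " ".join(words[i:i + n])
            for n in range(1, max_length + 1)
            for i in range(len(words) - n + 1)
        ]
        term_freq.update(terms)
        doc_freq.update(set(terms))
    n_docs = len(texts)
    scored = {
        term: freq * math.log(n_docs / doc_freq[term])
        for term, freq in term_freq.items()
        if freq >= min_frequency
    }
    # Sort by term in ascending order, as the Sort transform does.
    return sorted(scored.items())


# Hypothetical mini-corpus with a lowered threshold so that one term survives.
texts = ["space shuttle launch", "space station", "car engine"]
dictionary = build_dictionary(texts, min_frequency=2)
```

In the tutorial's package only the Term column is written to the Dictionary table; the score is used for extraction and then dropped.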
4 Build term vectors
4.1 Create a new DTS (SSIS) package
4.2 Rename the package to BuildTermVectors.dtsx
4.3 Go to the Data Flow tab and add a new Data Flow task
4.4 In the data flow task, add an "OLE DB Source" transform
    Connection: create a new one for localhost.TDM
    Table: NGArticles
    Columns: ID, ArticleText only
4.5 Add a "Term Lookup" transform and connect from the previous transform
    Reference table: Dictionary
    PassThru column: ID
    Lookup input column: ArticleText

4.6 Add a "Sort" transform and connect it.
    Sort "ID" in ascending order, then "Term" in ascending order, no duplicates
4.7 Add an "OLE DB Destination" transform and connect it.
    Use the connection: localhost.TDM
    Click "New" and name it "TermVectors"
    In Mappings, make sure to connect all columns, "Term", "Frequency", "ID"
4.8 Execute the package
    It automatically enters debugging mode
    It may take a few minutes
4.9 Stop debugging

(Note that the picture doesn't include the Derived Column transform built in step 4.5.)
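
The shape of the TermVectors table can be sketched in Python as follows. This is a simplified stand-in for the Term Lookup transform, not a description of its actual matching rules: it counts every occurrence of every dictionary term independently (so "space" is also counted inside "space shuttle"), and the function name build_term_vectors and the sample rows are hypothetical.

```python
import re
from collections import Counter


def build_term_vectors(articles, dictionary):
    """Return sorted (ID, Term, Frequency) rows for dictionary terms found in each article."""
    terms = set(dictionary)
    max_len = max((len(t.split()) for t in terms), default=0)
    rows = []
    for article in articles:
        words = re.findall(r"[a-z]+", article["ArticleText"].lower())
        # Count every word n-gram up to the longest dictionary term.
        counts = Counter(
            " ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)
        )
        for term in terms:
            if counts[term] > 0:
                rows.append((article["ID"], term, counts[term]))
    # Sort by ID, then by Term, matching the Sort transform in step 4.6.
    return sorted(rows)


# Hypothetical sample rows and dictionary terms.
sample_articles = [
    {"ID": 1, "ArticleText": "Space shuttle. The space station."},
    {"ID": 2, "ArticleText": "Car engine"},
]
term_vectors = build_term_vectors(sample_articles, ["space", "space shuttle", "engine"])
```

Each output row pairs an article ID with one term and its count, which is the layout the mining structure later consumes as a nested table.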
5 Prepare train/test samples
5.1 Create a new DTS (SSIS) package
5.2 Rename the package to PrepareSamples.dtsx
5.3 Go to the Data Flow tab and add a new Data Flow task
5.4 In the data flow task, add an "OLE DB Source" transform
    Connection: create a new one for localhost.TDM
    Table: NGArticles
    Columns: ID, NewsGroup only
5.5 Add a "Percentage Sampling" transform and connect from the OLE DB Source
transform
    Sampling rate: 70%
    Selected rows: Train sample (70%)
    Unselected rows: Test sample (30%)

5.6 Add two "OLE DB Destination" transforms and connect them from the Percentage
Sampling transform (one from the Train sample, another from the Test sample)
    Use the connection: localhost.TDM
    Click "New" and name them "TrainArticles" and "TestArticles" respectively.
    In Mappings, make sure to connect all columns, "ID", "NewsGroup"
5.7 Execute the package
    It automatically enters debugging mode
5.8 Stop debugging.
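
The 70/30 split can be sketched in Python as a per-row random draw. This assumes the sampling behaves like an independent coin flip for each row, so the split is only approximately 70/30; the function name percentage_split and the seed parameter are hypothetical additions for illustration and reproducibility.

```python
import random


def percentage_split(rows, rate=0.70, seed=0):
    """Randomly route each row to the selected (train) or unselected (test) output."""
    rng = random.Random(seed)  # fixed seed so the same split can be reproduced
    selected, unselected = [], []
    for row in rows:
        if rng.random() < rate:
            selected.append(row)
        else:
            unselected.append(row)
    return selected, unselected


# Hypothetical article IDs standing in for the rows of NGArticles.
article_ids = list(range(1000))
train_ids, test_ids = percentage_split(article_ids, rate=0.70)
```

Every row lands in exactly one of the two outputs, so the train and test samples never overlap.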
6 Build/Test/Refine data mining models
6.1 Add a new Analysis Services project, and name it "DataMining".

6.2 Create a Data Source that refers to the database, TDM, on the local SQL server.
6.3 Create a Data Source View using the data source, TDM. Add the following tables to
the DSV: TrainArticles, TestArticles, and TermVectors.

6.4 Create a Mining Structure as follows:
    Algorithm: Microsoft_Decision_Trees
    DSV to use: TDM
    Case table: TrainArticles
    Nested table: TermVectors
    Columns usage:

    Name the structure "NGArticlesDM" and the model "NGArticlesDM_DT"
6.5 Right-click the model, NGArticlesDM_DT, and select "New Mining Model…" to add
the following two additional models:
    NGArticlesDM_NB with the Microsoft_Naive_Bayes algorithm
    NGArticlesDM_NN with the Microsoft_Logistic_Regression algorithm
6.6 Right-click each model and set the algorithm parameters as follows:
    NGArticlesDM_DT:
        Disable automatic feature selection
        (MAXIMUM_INPUT_ATTRIBUTES=0)
    NGArticlesDM_NB:
        Disable automatic feature selection
        (MAXIMUM_INPUT_ATTRIBUTES=0)
    NGArticlesDM_NN:
        Disable automatic feature selection
        (MAXIMUM_INPUT_ATTRIBUTES=0)
6.7 Deploy the project by pressing F5. It may take several minutes to train all three
models.
6.8 Select the "Mining Accuracy" tab to see the lift chart using "TestArticles" and
"TermVectors" to compare the classification accuracy of the three trained models.



6.9 Browse the models. Note that browsing a model's content may take a considerably long
time because of the complexity of the models. For example, NGArticlesDM_NB and NGArticlesDM_NN
involve more than 5,000 attributes (scoring/coefficients). For instance, browsing
NGArticlesDM_NN took 3 minutes on a PC with a 3 GHz Xeon CPU and 2 GB of memory.
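
The accuracy comparison in step 6.8 can also be approximated outside the lift chart by counting how often each model's predicted group matches the actual group of a test article. The Python sketch below shows only the arithmetic. The labels and predictions are made-up placeholder values, not results from this tutorial, and the function name accuracy is hypothetical.

```python
def accuracy(actual, predicted):
    """Return the fraction of cases whose predicted group equals the actual group."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Placeholder labels and predictions (NOT real results from the tutorial's models).
actual_groups = ["g1", "g2", "g1", "g3"]
predicted_groups = {
    "NGArticlesDM_DT": ["g1", "g2", "g3", "g3"],
    "NGArticlesDM_NB": ["g1", "g2", "g1", "g3"],
    "NGArticlesDM_NN": ["g2", "g2", "g1", "g1"],
}
scores = {
    model: accuracy(actual_groups, predictions)
    for model, predictions in predicted_groups.items()
}
```

In practice the predictions would come from querying each deployed model against TestArticles joined with TermVectors.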
7 Deployment of data mining models
Not covered in this tutorial at this time.
