Student ID (MSHV): CH0401059
Knowledge discovery and data mining techniques and their application to forecasting from socio-economic information
Compiled

The growth of information technology and its adoption across many areas of daily life, the economy and society over the years has meant that the volume of data collected and stored by organizations keeps accumulating. Organizations store this data in the belief that it holds some latent value. According to statistics, however, only a small share of it (roughly 5% to 10%) is ever analyzed. For the rest, they do not know what they should or could do with it, yet they keep collecting it at great cost for fear that something important might otherwise be missed and be needed later. At the same time, in a competitive environment people need more information, faster, to support decision making, and more and more qualitative questions have to be answered from the huge volume of data already at hand. For these reasons, traditional methods of database management and exploitation increasingly fall short of practical needs, which has given rise to a new technical trend: knowledge discovery and data mining (KDD - Knowledge Discovery and Data Mining).

KDD has been and is being studied and applied in many fields in countries around the world. In Vietnam the technique is still relatively new, but it too is being studied and gradually put into use. This article gives an overview of knowledge discovery and data mining. On that basis it poses a forecasting problem on world population and solves it by simple regression, to give readers a general view of the technique and of its relationship to traditional statistical methods.

1. Overview of knowledge discovery and data mining (KDD - Knowledge Discovery and Data Mining)

What is knowledge discovery and data mining?

If electrons and electromagnetic waves are the essence of traditional electronic technology, then data, information and knowledge are now the focus of a new field of research and application: knowledge discovery and data mining. We usually regard data as a string of bits, or numbers and symbols, or objects that carry some meaning when passed to a program in a given format. We use bits to measure information, and regard information as data from which redundancy has been removed, reduced to the minimum needed to characterize the data. We can regard knowledge as integrated information, comprising facts and the relationships among them. These relationships can be understood, discovered, or learned. In other words, knowledge can be regarded as data at a high level of abstraction and organization.
Knowledge discovery in databases is the process of identifying patterns or models in data that are valid, novel, useful and understandable. Data mining is one step in that process: it applies specialized data mining algorithms, within acceptable limits on computational efficiency, to find patterns or models in the data. In other words, the goal of knowledge discovery and data mining is to find patterns and/or models that exist in databases but remain hidden under mountains of data.

Statisticians, for their part, view data mining as an analytic process designed to explore very large amounts of data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of the data. This process has three basic stages: exploration, model building or pattern definition, and validation/verification.

The knowledge discovery process

The knowledge discovery process is summarized in Figure 1:

Figure 1. The knowledge discovery process

The first step is to understand the application domain and formulate the problem. This step determines whether useful knowledge can be extracted and allows data mining methods to be chosen that suit the purpose of the application and the nature of the data.

The second step is data collection and raw processing, also called data preprocessing: removing noise, handling missing data, and transforming and reducing the data where necessary. This step usually takes the most time in the whole knowledge discovery process.

The third step is data mining, in other words extracting patterns and/or models.
The fourth step is to interpret the discovered knowledge, in particular to clarify the descriptions and predictions. The steps above may be repeated a number of times, and the results can be averaged over all runs.

Data mining methods

With prediction and description as the two main goals of data mining, the following methods are commonly used:
- Classification
- Regression
- Clustering
- Summarization
- Dependency modeling
- Change and deviation detection
- Model representation
- Model evaluation
- Search method

Fields related to knowledge discovery and data mining

Knowledge discovery and data mining draws on many disciplines and fields: statistics, artificial intelligence, databases, algorithms, parallel and high-performance computing, knowledge acquisition for expert systems, data visualization, and others. It is especially close to statistics, using statistical methods to model data and to discover patterns and rules. Data warehousing and online analytical processing (OLAP) tools are also very closely related to knowledge discovery and data mining.

Table 1. World population at mid-year (millions of people)

Year  Population    Year  Population    Year  Population
1950  2,555         1970  3,708         1990  5,275
1951  2,593         1971  3,785         1991  5,359
1952  2,635         1972  3,862         1992  5,443
1953  2,680         1973  3,938         1993  5,524
1954  2,728         1974  4,014         1994  5,604
1955  2,779         1975  4,087         1995  5,685
1956  2,832         1976  4,159         1996  5,764
1957  2,888         1977  4,231         1997  5,844
1958  2,945         1978  4,303         1998  5,923
1959  2,997         1979  4,378         1999  6,001
1960  3,039         1980  4,454         2000  6,078
1961  3,080         1981  4,530         2001  6,153
1962  3,136         1982  4,610         2002  6,228
1963  3,206         1983  4,690
1964  3,277         1984  4,769
1965  3,346         1985  4,850
1966  3,416         1986  4,932
1967  3,486         1987  5,017
1968  3,558         1988  5,102
1969  3,632         1989  5,188

Source: U.S. Bureau of the Census, International Data Base. Updated 10/10/2002.

Student: Nguyen Van Thuan

Applications of knowledge discovery and data mining
- Business information:
  + Marketing and customer data analysis
  + Investment analysis
  + Loan approval
  + Fraud detection, etc.
- Technical information:
  + Control and scheduling
  + Network management
  + Analysis of experimental results, etc.
- Scientific information
- Personal information, etc.
Challenges for knowledge discovery and data mining
- Large databases
- High dimensionality
- Changing data and knowledge, which can make previously discovered patterns no longer valid
- Missing or noisy data
- Complex relationships between fields
- Interaction with users and integration of existing knowledge
- Integration with other systems, etc.

Table 2. Basic parameters of the regression models

Model              Correlation coefficient (R^2)   Fitted function
Linear             0.9947                          y = 73.173x + 2256.3
Logarithmic (Ln)   0.7689                          y = 1112.6 Ln(x) + 866.28
Polynomial         0.9994                          y = 0.3657x^2 + 53.426x + 2437.3
Exponential        0.9972                          y = 2527.6 e^(0.0177x)

Table 3. World population forecasts from the regression models (millions of people)

Year   Linear   Logarithmic (Ln)   Polynomial   Exponential
2003   6,208    5,304              6,389        6,574
2004   6,281    5,325              6,482        6,691
2005   6,354    5,345              6,576        6,811
2006   6,427    5,365              6,671        6,932
2007   6,500    5,384              6,766        7,056
2008   6,574    5,403              6,862        7,182
2009   6,647    5,422              6,959        7,310
2010   6,720    5,440              7,057        7,441
2011   6,793    5,458              7,155        7,574
2012   6,866    5,476              7,255        7,709
2013   6,939    5,493              7,354        7,847
2014   7,013    5,511              7,455        7,987
2015   7,086    5,528              7,556        8,129
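The forecasts in Table 3 appear to follow directly from the fitted functions in Table 2. The sketch below evaluates those four functions for a given year. It rests on one assumption that the article does not state: x is taken to be the year index counted from 1950, that is x = year - 1949. The names BASE_YEAR, MODELS and forecast are hypothetical and chosen only for this illustration.

```python
import math

# Assumption (not stated in the article): x = year - 1949, so 1950 -> 1.
BASE_YEAR = 1949

# Fitted functions copied from Table 2; y is in millions of people.
MODELS = {
    "Linear": lambda x: 73.173 * x + 2256.3,
    "Logarithmic": lambda x: 1112.6 * math.log(x) + 866.28,
    "Polynomial": lambda x: 0.3657 * x ** 2 + 53.426 * x + 2437.3,
    "Exponential": lambda x: 2527.6 * math.exp(0.0177 * x),
}


def forecast(year):
    """Return each model's forecast for `year`, rounded to whole millions."""
    x = year - BASE_YEAR
    return {name: round(model(x)) for name, model in MODELS.items()}


for year in (2003, 2015):
    print(year, forecast(year))
```

Under this assumption the values for 2003 and 2015 agree with the first and last rows of Table 3, which supports reading x as the year index. Because the coefficients in Table 2 are rounded, a row could in principle differ by one unit in the last digit.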
2. Application to forecasting from socio-economic information

In this section the author takes a forecasting problem, world population up to 2015, based on world population statistics for 1950 - 2002, and solves it by regression. Although the data set is not as large as other socio-economic data sets, the problem still shows how different analytical models yield different results when mining the same data. For simplicity, the data collection and preprocessing step is not covered; the data in Table 1 are treated as complete for this problem. In addition, the actual figures are mid-year estimates, so the computed population figures are likewise understood to refer to mid-year.

Mining the population data by simple regression with four different models, Linear (linear function), Logarithmic (natural logarithm), Polynomial (here a second-degree polynomial) and Exponential (exponential function), gives the results shown in Tables 2 and 3 and Figures 2, 3, 4 and 5.
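As a rough illustration of how coefficients like those in Table 2 can be obtained, the sketch below fits the four models to the Table 1 data by ordinary least squares in plain Python. It rests on assumptions that the article does not state: x is taken as the year index (x = year - 1949, so 1950 maps to 1), the logarithmic and exponential models are fitted as straight lines after a log transform, and R^2 is computed on the original population scale. The helper names polyfit and r_squared are hypothetical, and the tool the author actually used is not identified in the text.

```python
import math

# World population at mid-year, 1950-2002, in millions (Table 1).
POPULATION = [
    2555, 2593, 2635, 2680, 2728, 2779, 2832, 2888, 2945, 2997,
    3039, 3080, 3136, 3206, 3277, 3346, 3416, 3486, 3558, 3632,
    3708, 3785, 3862, 3938, 4014, 4087, 4159, 4231, 4303, 4378,
    4454, 4530, 4610, 4690, 4769, 4850, 4932, 5017, 5102, 5188,
    5275, 5359, 5443, 5524, 5604, 5685, 5764, 5844, 5923, 6001,
    6078, 6153, 6228,
]
# Assumption: x is the year index, x = year - 1949 (1950 -> 1, 2002 -> 53).
XS = list(range(1, len(POPULATION) + 1))


def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest order first."""
    n = degree + 1
    # Normal equations A c = b, A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def r_squared(ys, predicted):
    """Coefficient of determination on the original scale of ys."""
    mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


lin = polyfit(XS, POPULATION, 1)                             # y = a*x + b
quad = polyfit(XS, POPULATION, 2)                            # y = a*x^2 + b*x + c
log_fit = polyfit([math.log(x) for x in XS], POPULATION, 1)  # y = a*ln(x) + b
exp_fit = polyfit(XS, [math.log(y) for y in POPULATION], 1)  # ln(y) = a*x + b

models = {
    "Linear": lambda x: lin[0] + lin[1] * x,
    "Logarithmic": lambda x: log_fit[0] + log_fit[1] * math.log(x),
    "Polynomial": lambda x: quad[0] + quad[1] * x + quad[2] * x ** 2,
    "Exponential": lambda x: math.exp(exp_fit[0] + exp_fit[1] * x),
}
r2 = {name: r_squared(POPULATION, [f(x) for x in XS]) for name, f in models.items()}

for name in models:
    print(f"{name:12s} R^2 = {r2[name]:.4f}")
```

With the Table 1 data, the linear fit comes out at a slope of about 73.17 and an intercept of about 2256.3, in line with the Linear row of Table 2. Because R^2 is computed here on the original scale, the value for the exponential model may differ slightly from the one reported in Table 2.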
Figure 3. Actual and theoretical world population by year, Logarithmic (Ln) model
Figure 5. Actual and theoretical world population by year, Exponential model

Among these results, the second-degree polynomial model has a higher correlation than the other models (R^2 = 0.9994 in Table 2), so in this particular case its forecasts can be used. The author stops here and does not analyze further how the forecast data could be applied in different fields.

3. Conclusion

The material presented in section 1 and the application in section 2 show that even with a small real data set and a specific problem, the same data mining method can be approached in several different ways, with different results. This illustrates both the great practical potential of knowledge discovery and data mining and the challenges it faces in socio-economic problems and in many other fields.

References
[1] Knowledge Discovery Nuggets: http://www.kdnuggets.com/
[2] Ho Tu Bao: Introduction to Knowledge Discovery and Data Mining, Institute of Information Technology.
[3] Dr. Hàn Viết Thuận (ed.): Giáo trình Tin học ứng dụng [Applied Informatics], NXB Thống kê, 1999.
[4] Dr. Dang Quang A and Dr. Bui The Hong: Statistical Data Analysis, Institute of Information Technology.
[5] Information from the website http://blue.census.gov/ipc/www/