
Data Analysis and Graphics with R

Nguyen Van Tuan
Garvan Institute of Medical Research
Sydney, Australia

Contents

1 Downloading and installing R

2 Downloading and installing R packages

3 R grammar
3.1 Naming objects in R
3.2 Getting help in R

4 Entering data into R
4.1 Entering data directly: c()
4.2 Entering data directly: edit(data.frame())
4.3 Reading data from a text file: read.table
4.4 Reading data from Excel
4.5 Reading data from SPSS
4.6 Information about the data
4.7 Generating sequences with seq, rep and gl

5 Data manipulation
5.1 Splitting data: subset
5.2 Extracting data from a data.frame
5.3 Merging two data.frames into one: merge
5.4 Recoding data (data coding)
5.5 Recoding data with replace
5.6 Converting to a factor
5.7 Grouping data with cut2 (Hmisc)

6 Using R for simple calculations
6.1 Simple calculations
6.2 Using R for matrix operations

7 Using R for probability calculations
7.1 Permutations
7.2 Random variables and distribution functions
7.3 Random variables and distribution functions
7.3.1 Binomial distribution
7.3.2 Poisson distribution
7.3.3 Normal distribution
7.3.4 Standardized normal distribution
7.4 Random sampling

8 Graphics
8.1 Data for graphical analysis
8.2 Plots for one discrete variable: barplot
8.3 Plots for two discrete variables: barplot
8.4 Pie charts
8.5 Plots for one continuous variable: stripchart and hist
8.5.1 Stripchart
8.5.2 Histogram
8.6 Boxplots
8.7 Graphical analysis of two continuous variables
8.7.1 Scatter plots
8.8 Graphical analysis of several variables: pairs
8.9 Plots with standard errors

9 Descriptive statistical analysis
9.1 Descriptive statistics (summary)
9.2 Descriptive statistics by group
9.3 The t-test (t.test)
9.3.1 One-sample t-test
9.3.2 Two-sample t-test
9.4 Wilcoxon test for two samples (wilcox.test)
9.5 Paired t-test (t.test)
9.6 Wilcoxon test for paired data (wilcox.test)
9.7 Frequencies
9.8 Testing a proportion (prop.test, binom.test)
9.9 Comparing two proportions (prop.test, binom.test)
9.10 Comparing several proportions (prop.test, chisq.test)
9.10.1 Chi-squared test (chisq.test)
9.10.2 Fisher's exact test (fisher.test)

10 Linear regression analysis
10.1 Correlation coefficients
10.1.1 Pearson correlation coefficient
10.1.2 Spearman correlation coefficient
10.1.3 Kendall correlation coefficient
10.2 The simple linear regression model
10.3 The multiple linear regression model

11 Analysis of variance
11.1 One-way analysis of variance
11.2 Multiple comparisons and adjustment of p-values
11.3 Non-parametric analysis
11.4 Two-way analysis of variance (two-way ANOVA)

12 Logistic regression analysis
12.1 The logistic regression model
12.2 Logistic regression analysis with R
12.3 Estimating probabilities with R

13 Sample size estimation
13.1 The concept of power
13.2 Data needed for sample size estimation
13.4 Sample size estimation
13.4.1 Sample size for a mean
13.4.2 Sample size for comparing two means
13.4.3 Sample size for analysis of variance
13.4.4 Sample size for estimating a proportion
13.4.5 Sample size for comparing two proportions

14 References

15 Glossary of terms used in the book

Introduction to R

Data analysis and graphics are usually carried out with widely used programs such as
SAS, SPSS, Stata, Statistica and S-Plus. These programs were developed by software
companies and brought to market over roughly the past three decades, and they are
used for teaching and research by universities, research centres and industrial
companies around the world. But because licences for them are relatively expensive
(sometimes hundreds of thousands of dollars a year), some universities in developing
countries (and even in some developed countries) cannot afford to use them in the
long term. Statistical researchers around the world therefore worked together to
develop a new, open-source program that everyone in the statistical and mathematical
community could use in a uniform way and entirely free of charge.

In 1996, in an important paper on statistical computing, two statisticians, Ross
Ihaka and Robert Gentleman, then at the University of Auckland, New Zealand,
sketched out a new language for statistical analysis that they named R [1]. Many
statisticians around the world endorsed the initiative and joined in developing R.

To date, after less than 10 years of development, more and more statisticians,
mathematicians and researchers in every field have switched to R for analysing
scientific data. Worldwide there is a network of more than a million R users, and
the number is growing quickly. It is fair to say that within the next 10 years
commercial statistical software will no longer play as large a role as it has so far.

So what is R? Briefly, R is a program for statistical analysis and graphics. In
essence, though, R is a general-purpose computer language that can be used for many
purposes, from simple calculations and recreational mathematics to matrix
computations and complex statistical analyses. Because it is a language, R can also
be used to build specialised software for a particular computational problem.

For these reasons, anyone doing scientific research, especially in countries that
are still poor such as ours, should learn to use R for statistical analysis and
graphics. This short text shows the reader how to use R. I assume that the reader
knows nothing about R, but I do expect some familiarity with using a computer.


1. Downloading and installing R

To use R, the first step is to install it on our own computer. To do so, we go
online to the website of the Comprehensive R Archive Network (CRAN):

http://cran.R-project.org.

The file to download depends on the version, but its name usually begins with the
letter R followed by the version number. For example, the version I was using at the
end of 2005 was 2.2.1, so the file to download was:

R-2.2.1-win32.exe

The file is about 26 MB, and it can be downloaded from:

http://cran.r-project.org/bin/windows/base/R-2.2.1-win32.exe

The same website offers a great deal of documentation on using R at every level,
from beginner to advanced. For readers who are not yet comfortable with English,
this text should provide the information needed to use R without having to read the
other documents.

Once R has been downloaded, the next step is to install (set up) it on the computer.
Simply double-click the downloaded file and follow the installation instructions on
the screen. This is a very simple step; the installation takes about a minute.

Once installation is complete, an icon labelled "R 2.2.1" appears on the desktop. R
is now ready to use: double-click the icon and the R console window opens.

[Figure: the R console window]


2. Downloading and installing R packages

R provides a computer language and a set of functions for basic, simple analyses.
For more complex analyses we need to download additional packages. A package is a
small piece of software, written by statisticians to solve a specific problem, that
runs inside R. For linear regression, for example, R has the function lm, but deeper
and more complex analyses require packages such as lme4. These packages must be
downloaded and installed on the computer.

Packages are also downloaded from http://cran.r-project.org: click on Packages in
the menu on the left of the web page. In my view, the packages worth installing for
epidemiological analyses are:

Package     Purpose
trellis     Drawing graphs and making them look better
lattice     Drawing graphs and making them look better
Hmisc       Data modelling methods by F. Harrell
Design      Study design models by F. Harrell
Epi         Epidemiological analyses
epitools    Another package for epidemiological analyses
foreign     Importing data from other software such as SPSS, Stata, SAS, etc.
rmeta       Meta-analysis
meta        Another package for meta-analysis
survival    Survival analysis with the Cox proportional hazards model
Zelig       Statistical analyses in the social sciences
genetics    Analysis of genetic data
BMA         Bayesian model averaging

These packages can be installed online by choosing Install package(s) from the
Packages menu of R. Alternatively, if a package has already been downloaded to the
computer, it can be installed more quickly by choosing Install package(s) from local
zip file, also in the Packages menu.

[Figure: the Packages menu of the R console]




3. R grammar

R is an interactive language: when we issue a command, and the command is
grammatically correct, R answers with a result, and the interaction continues until
we have what we need. The basic unit of R grammar is a command, or function. A
function needs arguments, so the function name is followed by the arguments that we
must supply. The general syntax of R is:

object <- function(argument 1, argument 2, ..., argument n)

For example, in:

> reg <- lm(y ~ x)

reg is an object, lm is a function, and y ~ x is the argument of the function.
Or, in:

> setwd("c:/works/stats")

setwd is a function and "c:/works/stats" is its argument.

To find out which arguments a function takes, we use the command args(x) (args is
short for arguments), where x is the function we are interested in:

> args(lm)
function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
NULL

R is an object-oriented language, which means that data in R are stored in objects.
This orientation also influences how R code is written. For example, to test
whether x equals 5 we do not write x = 5 as we would in everyday notation; R
requires x == 5.

In R, x = 5 is an assignment, equivalent to x <- 5. The second form (with the <-
symbol) is preferred over the first (=). For example:

> x <- rnorm(10)

simulates 10 values and stores them in the object x. We could also write
x = rnorm(10).

Some symbols commonly used in R:

x == 5      x equals 5
x != 5      x is not equal to 5
y < x       y is less than x
x > y       x is greater than y
z <= 7      z is less than or equal to 7
p >= 1      p is greater than or equal to 1
is.na(x)    is x a missing value?
A & B       A and B (AND)
A | B       A or B (OR)
!           not (NOT)
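The following is a minimal sketch of how these operators behave when applied to a vector. The vector x is a made-up illustration, not data from the text.

```r
# Hypothetical vector with one missing value (for illustration only)
x <- c(3, 5, NA, 8)

x == 5               # element-wise test of equality; NA stays NA
x != 5               # element-wise test of inequality
is.na(x)             # TRUE only for the missing element
!is.na(x) & x > 4    # not missing AND greater than 4
```

Each operator works element by element, so the result is a logical vector of the same length as x.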

In R, everything that follows the # symbol on a line is ignored; # lets the user add
comments. For example:

> # the following command simulates 10 normal values
> x <- rnorm(10)


3.1 Naming objects in R

Naming an object or a variable in R is quite flexible, because R has fewer
restrictions than other software. An object name must be written as one unbroken
word (with no spaces). For example, R accepts myobject but not my object.

> myobject <- rnorm(10)
> my object <- rnorm(10)
Error: syntax error in "my object"

A name such as myobject can be hard to read, so it is better to separate the words
with a full stop ("."), as in my.object.

> my.object <- rnorm(10)

An important point is that R distinguishes upper-case from lower-case letters, so
My.object is different from my.object. For example:

> My.object.u <- 15
> my.object.L <- 5
> My.object.u + my.object.L
[1] 20

A few things to keep in mind when naming objects in R:

Avoid the underscore (_) in names, as in my_object. A hyphen, as in my-object,
cannot be used at all, because R reads it as a subtraction.

Avoid giving an object the same name as a variable inside a dataset. For example,
if we have a data.frame (a dataset) that contains a variable age, we should not
create a separate object that is also called age, i.e. we should not write
age <- age. However, if the data.frame is called data, we can refer to its variable
age with the $ symbol, as data$age (the variable age in the data.frame data), and
in that case age <- data$age is acceptable.

3.2 Getting help in R

Besides args(), R provides help() so that users can look up the grammar of each
function. For example, to find out which arguments the function lm takes, simply
type:

> help(lm)

or

> ?lm

A window appears on the right of the screen describing how to use the function,
often with examples. Readers can simply copy and paste the examples into R to see
how they work.

Before using R, readers may also want to look through the documentation that ships
with R by choosing the Help menu and then Html help. The commands shown there can
likewise be copied and pasted into R to see how R behaves.

[Figure: the Help menu of the R console with Html help selected]


4. Entering data into R

To analyse data with R, the data must be in a form that R can understand and
process. Data that R can work with must be held in a data.frame. There are several
ways to get data into a data.frame, from typing them in directly to importing them
from other sources. The most common are described below.

4.1 Entering data directly: c()

Example 1: we have age and insulin data for 10 patients, as follows, and want to
enter them into R.

50 16.5
62 10.8
60 32.3
40 19.3
48 14.2
47 11.3
57 15.5
70 15.8
48 16.2
67 11.2

We can use the function named c as follows:

> age <- c(50,62, 60,40,48,47,57,70,48,67)
> insulin <- c(16.5,10.8,32.3,19.3,14.2,11.3,15.5,15.8,16.2,11.2)

The first command tells R that we want to create a column of data (which I will call
a variable from now on) named age, and the second creates another column named
insulin. We are of course free to choose other names.

We use the function c (short for concatenation) to enter the data. Note that the
values for successive patients are separated by commas.

The notation insulin <- (which can also be written insulin =) means that the values
that follow are stored in the variable insulin. We will meet this notation many
times when using R.

R is an object-oriented language, and every column of data and every data.frame is
an object as far as R is concerned. So age and insulin are two separate objects. We
now need to combine them into a single data.frame so that R can process them later.
For this we use the function data.frame:

> tuan <- data.frame(age, insulin)

This command tells R to combine the two columns (objects) age and insulin into one
object named tuan.

We now have a complete object ready for statistical analysis. To see what tuan
contains, simply type:

> tuan

and R reports:

age insulin
1 50 16.5
2 62 10.8
3 60 32.3
4 40 19.3
5 48 14.2
6 47 11.3
7 57 15.5
8 70 15.8
9 48 16.2
10 67 11.2

To store these data in a file in R format, we use the save command. Suppose we want
to keep the data in the directory c:\works\insulin; we type:

> setwd("c:/works/insulin")
> save(tuan, file="tuan.rda")

The first command (setwd; wd stands for working directory) tells R that we want to
work in the directory c:\works\insulin. Note that Windows normally uses the
backslash (\), but in R we write the path with forward slashes (/).

The second command (save) tells R to store the object tuan in a file named
tuan.rda. After these two commands have been run, a file called tuan.rda will be
present in that directory.
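As a rough sketch of the full round trip, the saved file can be read back with load, which is not discussed in the text above. To keep the example runnable anywhere, it writes to a temporary directory (an assumption made here) rather than c:/works/insulin, and it uses only the first three patients.

```r
# First three patients from Example 1, for illustration
tuan <- data.frame(age = c(50, 62, 60), insulin = c(16.5, 10.8, 32.3))

# Assumption: a temporary directory stands in for c:/works/insulin
path <- file.path(tempdir(), "tuan.rda")
save(tuan, file = path)

rm(tuan)      # remove the object from the workspace
load(path)    # restore it from the file, under its original name
tuan
```

Note that load recreates the object under the name it had when it was saved, so no assignment is needed.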


4.2 Entering data directly: edit(data.frame())

Example 1 (continued): we can also enter the age and insulin data for the 10
patients with a very handy function: edit(data.frame()). R opens a new window with
rows and columns like an Excel sheet, and we type the data into that grid. For
example:

> ins <- edit(data.frame())

A window like the following appears:

[Figure: the R data editor showing an empty grid with columns var1, var2, ...]

At this point R does not know what variables we have, so it lists them as var1,
var2, and so on. Click on the column var1 and rename it age; click on the column
var2 and rename it insulin. Then type the values into each column. When finished,
click the X button in the right-hand corner of the spreadsheet, and we have a
data.frame named ins with two variables, age and insulin.

4.3 Reading data from a text file: read.table

Example 2: we have collected data on age and cholesterol from a study of 50
patients with hypertension. The data are stored in a text file named chol.txt in
the directory c:\works\insulin. The columns are, in order: the patient's
identification number (id), sex, age, body mass index (bmi), HDL cholesterol (hdl),
LDL cholesterol (ldl), total cholesterol (tc) and triglycerides (tg).

id sex age bmi hdl ldl tc tg
1 Nam 57 17 5.000 2.0 4.0 1.1
2 Nu 64 18 4.380 3.0 3.5 2.1
3 Nu 60 18 3.360 3.0 4.7 0.8
4 Nam 65 18 5.920 4.0 7.7 1.1
5 Nam 47 18 6.250 2.1 5.0 2.1
6 Nu 65 18 4.150 3.0 4.2 1.5
7 Nam 76 19 0.737 3.0 5.9 2.6
8 Nam 61 19 7.170 3.0 6.1 1.5
9 Nam 59 19 6.942 3.0 5.9 5.4
10 Nu 57 19 5.000 2.0 4.0 1.9
...
46 Nu 52 24 3.360 2.0 3.7 1.2
47 Nam 64 24 7.170 1.0 6.1 1.9
48 Nam 45 24 7.880 4.0 6.7 3.3
49 Nu 64 25 7.360 4.6 8.1 4.0
50 Nu 62 25 7.750 4.0 6.2 2.5



We want to read these data into R for later analysis, using the read.table command:

> setwd("c:/works/insulin")
> chol <- read.table("chol.txt", header=TRUE)

The first command makes sure that R is working in the directory where the data are
stored. The second asks R to read the data from the file chol.txt (in the directory
c:\works\insulin) and put them in the object chol. In this command, header=TRUE
tells R to treat the first line of the file as the column names.
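To try read.table without having chol.txt on disk, the first few lines of the listing above can be passed in directly through the text argument. This is only a sketch: the object name chol_head is made up for the illustration, and only three of the 50 rows are used.

```r
# Header line plus the first three data rows of the listing above
txt <- "id sex age bmi hdl ldl tc tg
1 Nam 57 17 5.000 2.0 4.0 1.1
2 Nu 64 18 4.380 3.0 3.5 2.1
3 Nu 60 18 3.360 3.0 4.7 0.8"

# header = TRUE: the first line supplies the column names
chol_head <- read.table(text = txt, header = TRUE)
str(chol_head)
```

The call is otherwise the same as for a file; only the source of the text differs.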

We can check whether R has read all the data by typing:

> chol

or

> names(chol)

R lists the columns in the data (names asks which columns the data contain and what
they are called):

[1] "id" "sex" "age" "bmi" "hdl" "ldl" "tc" "tg"

We can now save the data in R format for later use:

> save(chol, file="chol.rda")


4.4 Reading data from Excel: read.csv

Reading data from Excel takes two steps:

Step 1: use Save as in Excel to save the data in csv format;
Step 2: use R (the read.csv command) to read the csv file.

Example 3: the following data are stored in an Excel file named excel.xls, and we
want to bring them into R for analysis.

ID Age Sex Ethnicity IGFI IGFBP3 ALS PINP ICTP P3NP
1 18 1 1 148.27 5.14 316.00 61.84 5.81 4.21
2 28 1 1 114.50 5.23 296.42 98.64 4.96 5.33
3 20 1 1 109.82 4.33 269.82 93.26 7.74 4.56
4 21 1 1 112.13 4.38 247.96 101.59 6.66 4.61
5 28 1 1 102.86 4.04 240.04 58.77 4.62 4.95
6 23 1 4 129.59 4.16 266.95 48.93 5.32 3.82
7 20 1 1 142.50 3.85 300.86 135.62 8.78 6.75
8 20 1 1 118.69 3.44 277.46 79.51 7.19 5.11
9 20 1 1 197.69 4.12 335.23 57.25 6.21 4.44
10 20 1 1 163.69 3.96 306.83 74.03 4.95 4.84
11 22 1 1 144.81 3.63 295.46 68.26 4.54 3.70
12 27 0 2 141.60 3.48 231.20 56.78 4.47 4.07
13 26 1 1 161.80 4.10 244.80 75.75 6.27 5.26
14 33 1 1 89.20 2.82 177.20 48.57 3.58 3.68
15 34 1 3 161.80 3.80 243.60 50.68 3.52 3.35
16 32 1 1 148.50 3.72 234.80 83.98 4.85 3.80
17 28 1 1 157.70 3.98 224.80 60.42 4.89 4.09
18 18 0 2 222.90 3.98 281.40 74.17 6.43 5.84
19 26 0 2 186.70 4.64 340.80 38.05 5.12 5.77
20 27 1 2 167.56 3.56 321.12 30.18 4.78 6.12

The first thing to do, as mentioned above, is to save the data from Excel in csv
format:
In Excel, choose File, then Save as.
Under Save as type, choose CSV (Comma delimited).

[Figure: the Excel Save as dialog with CSV (Comma delimited) selected]

We then have a file named excel.csv in the directory c:\works\insulin.

The second step is to go into R and type:

> setwd("c:/works/insulin")
> gh <- read.csv("excel.csv", header=TRUE)

The read.csv command asks R to read the data from excel.csv, use the first line as
the column names, and store the data in an object named gh.

We can now save gh in R format for later use:

> save(gh, file="gh.rda")
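The csv step can be tried without Excel. The sketch below writes a small csv file with write.csv (standing in for Excel's Save as) and reads it back with read.csv. The three rows are taken from the listing above; the temporary directory and the name gh_demo are assumptions made for the illustration.

```r
# Three rows and three columns from the Excel listing, for illustration
gh_demo <- data.frame(ID = 1:3,
                      Age = c(18, 28, 20),
                      IGFI = c(148.27, 114.50, 109.82))

# Assumption: a temporary directory stands in for c:/works/insulin
csv_path <- file.path(tempdir(), "excel.csv")
write.csv(gh_demo, csv_path, row.names = FALSE)

# Read the file back; the first line supplies the column names
gh <- read.csv(csv_path, header = TRUE)
gh
```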


4.5 Reading data from SPSS: read.spss

SPSS stores data in sav format. Suppose we have a dataset named testo.sav in the
directory c:\works\insulin and want to convert it to a form that R can understand.
We use the read.spss command from the package foreign, as follows.

First, load foreign with the library command:

> library(foreign)

Then call read.spss:

> setwd("c:/works/insulin")
> testo <- read.spss("testo.sav", to.data.frame=TRUE)

The read.spss command asks R to read the data from testo.sav and put them in a
data.frame named testo.

We can now save testo in R format for later use:

> save(testo, file="testo.rda")


4.6 Information about the data

Suppose we have read the data into a data.frame named chol, as in Example 2. To
find out what the data contain, we can proceed as follows.

Tell R that we want to work with chol, using attach(arg), where arg is the name of
the dataset:

> attach(chol)

Check whether chol is a data.frame with is.data.frame(arg), where arg is the name
of the dataset:

> is.data.frame(chol)
[1] TRUE

R confirms that chol is indeed a data.frame.

How many columns (variables) and rows (observations) does the dataset have? Use
dim(arg), where arg is the name of the dataset (dim is short for dimension). R
prints the result directly below the command:

> dim(chol)
[1] 50 8

So we have 50 rows and 8 columns (variables). What are the variables called? Use
names(arg), where arg is the name of the dataset:

> names(chol)
[1] "id" "sex" "age" "bmi" "hdl" "ldl" "tc" "tg"

How many men and women are there in the variable sex? Use table(arg), where arg is
the name of the variable:

> table(sex)
sex
nam Nam  Nu
  1  21  28

The result shows 21 patients coded Nam (male) and 28 coded Nu (female). One more
record is coded nam in lower case; because R is case-sensitive it is counted as a
separate category, although it is presumably a male patient whose sex was typed
inconsistently.
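The table above also contains a single record coded nam in lower case. Assuming that this is a data-entry variant of Nam (the text does not say so explicitly), one possible way to merge the two codes before counting is sketched below. The vector sex is rebuilt here from the counts in the table, and sex_clean is a hypothetical name.

```r
# Rebuild a vector with the same counts as the table above
sex <- c("nam", rep("Nam", 21), rep("Nu", 28))

# Assumption: "nam" is a typing variant of "Nam"
sex_clean <- ifelse(tolower(sex) == "nam", "Nam", sex)
table(sex_clean)
```

With the two codes merged, the counts become 22 for Nam and 28 for Nu, which add up to the 50 patients in the dataset.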


4.7 Generating sequences with seq, rep and gl

R can also generate sequences of numbers, which is very convenient for simulation
and experimental design. The usual functions are seq (sequence), rep (repetition)
and gl (generate levels):

Using seq

Create a vector of the numbers 1 to 12:
> x <- (1:12)
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12

> seq(12)
[1] 1 2 3 4 5 6 7 8 9 10 11 12

Create a vector of the numbers from 12 down to 5:
> x <- (12:5)
> x
[1] 12 11 10 9 8 7 6 5

Or, with seq, from 12 down to 7:
> seq(12,7)
[1] 12 11 10 9 8 7

The general form of seq is seq(from, to, by= ) or seq(from, to, length.out= ). The
following examples illustrate its use:

Create a vector from 4 to 6 in steps of 0.25:
> seq(4, 6, 0.25)
[1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00

Create a vector of 10 numbers, the smallest being 2 and the largest 15:
> seq(length=10, from=2, to=15)
[1] 2.000000 3.444444 4.888889 6.333333 7.777778 9.222222
10.666667 12.111111 13.555556 15.000000
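The two forms of seq describe the same kind of sequence in different ways: by fixes the step, while length.out fixes the number of values. A small sketch, using the 4-to-6 example from above with the argument names written out in full:

```r
# Same sequence specified two ways
a <- seq(from = 4, to = 6, by = 0.25)       # fix the step size
b <- seq(from = 4, to = 6, length.out = 9)  # fix the number of values

length(a)        # 9 values
all.equal(a, b)  # both calls give the same sequence
```

Going from 4 to 6 in steps of 0.25 takes 8 steps, hence 9 values, which is why length.out = 9 reproduces the same vector.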


Using rep

The general form of rep is rep(x, times, ...), where x is a value or vector and
times is the number of repetitions. For example:

Repeat the number 10 three times:
> rep(10, 3)
[1] 10 10 10

Repeat the numbers 1 to 4 three times:
> rep(c(1:4), 3)
[1] 1 2 3 4 1 2 3 4 1 2 3 4

Repeat the numbers 1.2, 2.7, 4.8 five times:
> rep(c(1.2, 2.7, 4.8), 5)
[1] 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8


Using gl

gl creates a categorical variable (a factor), that is, a variable used for counting
rather than for arithmetic. The general form of gl is gl(n, k, length = n*k,
labels = 1:n, ordered = FALSE); the following examples illustrate its use:

Create a variable with levels 1 and 2, each repeated 8 times:
> gl(2, 8)
[1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
Levels: 1 2

Or a variable with levels 1, 2 and 3, each repeated 5 times:
> gl(3, 5)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3

Create a variable with levels 1 and 2, each repeated 10 times (hence length=20):
> gl(2, 10, length=20)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2

Or:

> gl(2, 2, length=20)
[1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
Levels: 1 2

Adding labels:
> gl(2, 5, label=c("C", "T"))
[1] C C C C C T T T T T
Levels: C T

Create a variable with the 4 values 1, 2, 3, 4, each repeated twice:
> rep(1:4, c(2,2,2,2))
[1] 1 1 2 2 3 3 4 4

This is equivalent to:
> rep(1:4, each = 2)
[1] 1 1 2 2 3 3 4 4

With dates and times:
> x <- .leap.seconds[1:3]
> rep(x, 2)
[1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-12-31 16:00:00
Pacific Standard Time"
[3] "1973-12-31 16:00:00 Pacific Standard Time" "1972-06-30 17:00:00
Pacific Standard Time"
[5] "1972-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00
Pacific Standard Time"

> rep(as.POSIXlt(x), rep(2, 3))
[1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-06-30 17:00:00
Pacific Standard Time"
[3] "1972-12-31 16:00:00 Pacific Standard Time" "1972-12-31 16:00:00
Pacific Standard Time"
[5] "1973-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00
Pacific Standard Time"


5. Data manipulation

5.1 Splitting data: subset

We return to the chol data of Example 2. To make the discussion easier to follow,
recall that we read the data into an R dataset named chol from a text file named
chol.txt:

> setwd("c:/works/insulin")
> chol <- read.table("chol.txt", header=TRUE)
> attach(chol)

If, for some reason, we want to analyse men and women separately, we can split chol
into two data.frames, say nam and nu. We use subset(data, cond), where data is the
data.frame to split and cond is the condition. For example:

> nam <- subset(chol, sex=="Nam")
> nu <- subset(chol, sex=="Nu")

After these two commands we have two new data.frames, nam and nu. Note that in the
conditions sex == "Nam" and sex == "Nu" we use == rather than =, because we are
testing for exact equality.

We can of course split the data into other data.frames with conditions on other
variables. For example, the following command creates a new data.frame named old
containing the patients aged 60 or over:

> old <- subset(chol, age>=60)
> dim(old)
[1] 25 8

Or a new data.frame of male patients aged 60 or over:

> n60 <- subset(chol, age>=60 & sex=="Nam")
> dim(n60)
[1] 9 8
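The same selection can also be written with square-bracket indexing instead of subset. The sketch below uses only the first ten patients of the chol listing (id, sex and age), so the counts differ from those for the full dataset; the names chol10, n60_a and n60_b are made up for the illustration.

```r
# First ten rows of the chol listing (id, sex, age only)
chol10 <- data.frame(
  id  = 1:10,
  sex = c("Nam", "Nu", "Nu", "Nam", "Nam", "Nu", "Nam", "Nam", "Nam", "Nu"),
  age = c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57)
)

# Men aged 60 or over, selected in two equivalent ways
n60_a <- subset(chol10, age >= 60 & sex == "Nam")
n60_b <- chol10[chol10$age >= 60 & chol10$sex == "Nam", ]

n60_a$id                    # ids of the selected patients
all(n60_a$id == n60_b$id)   # both approaches select the same rows
```

Inside subset the column names can be used directly, whereas the bracket form needs the chol10$ prefix; the comma before the closing bracket keeps all columns.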


5.2 Extracting data from a data.frame

chol has 8 variables. We can extract from chol only the variables we need, such as
the identification number (id), age (age) and total cholesterol (tc). From
names(chol) we know that id is column 1, age is column 3 and tc is column 7, so we
can use:

> data2 <- chol[, c(1,3,7)]

Here we tell R to select columns 1, 3 and 7 and to put all the data from these
three columns into a new data.frame named data2. Note that we use square brackets
[] and not parentheses (), because chol is not a function. The comma before c means
that we select all rows of the data.frame chol.

If we want only the first 10 rows, the command is:

> data3 <- chol[1:10, c(1,3,7)]
> print(data3)
   id age  tc
1   1  57 4.0
2   2  64 3.5
3   3  60 4.7
4   4  65 7.7
5   5  47 5.0
6   6  65 4.2
7   7  76 5.9
8   8  61 6.1
9   9  59 5.9
10 10  57 4.0

Note that print(arg) simply lists all the data in the data.frame arg. In fact,
typing data3 on its own gives exactly the same result as print(data3).


5.3 Merging two data.frames into one: merge

Suppose we have data held in two data.frames. The first, d1, has 3 columns, id, sex
and tc:

id sex tc
1 Nam 4.0
2 Nu 3.5
3 Nu 4.7
4 Nam 7.7
5 Nam 5.0
6 Nu 4.2
7 Nam 5.9
8 Nam 6.1
9 Nam 5.9
10 Nu 4.0

The second, d2, has 3 columns, id, sex and tg:

id sex tg
1 Nam 1.1
2 Nu 2.1
3 Nu 0.8
4 Nam 1.1
5 Nam 2.1
6 Nu 1.5
7 Nam 2.6
8 Nam 1.5
9 Nam 5.4
10 Nu 1.9
11 Nu 1.7

The two datasets share the variables id and sex, but d1 has 10 rows and d2 has 11.
We can combine them into a single data.frame with merge:

> d <- merge(d1, d2, by="id", all=TRUE)
> d
id sex.x tc sex.y tg
1 1 Nam 4.0 Nam 1.1
2 2 Nu 3.5 Nu 2.1
3 3 Nu 4.7 Nu 0.8
4 4 Nam 7.7 Nam 1.1
5 5 Nam 5.0 Nam 2.1
6 6 Nu 4.2 Nu 1.5
7 7 Nam 5.9 Nam 2.6
8 8 Nam 6.1 Nam 1.5
9 9 Nam 5.9 Nam 5.4
10 10 Nu 4.0 Nu 1.9
11 11 <NA> NA Nu 1.7

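In the merge command we ask R to combine d1 and d2 into a new data.frame named d,
using the variable id as the key. Because sex appears in both datasets but is not
part of the key, R keeps both copies, as sex.x (from d1) and sex.y (from d2).
Patient 11 has no data for tc, so R reports NA (not available).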
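Merging on id alone leaves two copies of sex in the result (sex.x and sex.y). One possible variation, sketched here on shortened versions of d1 and d2 (the first three and four rows of the listings above), is to merge on both shared variables so that sex appears only once.

```r
# Shortened versions of d1 and d2, for illustration
d1 <- data.frame(id = 1:3,
                 sex = c("Nam", "Nu", "Nu"),
                 tc = c(4.0, 3.5, 4.7))
d2 <- data.frame(id = 1:4,
                 sex = c("Nam", "Nu", "Nu", "Nam"),
                 tg = c(1.1, 2.1, 0.8, 1.1))

# Use both shared columns as the key; all = TRUE keeps unmatched rows
d <- merge(d1, d2, by = c("id", "sex"), all = TRUE)
d
```

Patient 4 appears only in the shortened d2, so tc is NA for that row, just as for patient 11 in the full example.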

5.4 Recoding data (data coding)

In processing epidemiological data we often need to convert a continuous variable
into a categorical one. For example, in the diagnosis of osteoporosis, women whose
bone mineral density (BMD) T-score is at or below -2.5 are considered to have
osteoporosis, those with a T-score between -2.5 and -1.0 have osteopenia, and those
above -1.0 are normal. Suppose we have BMD values for 10 patients:

-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11

To enter these values into R we can use the function c:

> bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,-2.00,1.71,2.12,-2.11)

To classify the three groups (osteoporosis, osteopenia and normal) we can use the
codes 1, 2 and 3. In other words, we want to create another variable (call it
diagnosis) that takes these three values according to the value of bmd. We proceed
as follows:

> # temporarily set diagnosis equal to bmd
> diagnosis <- bmd

> # recode bmd into diagnosis
> diagnosis[bmd <= -2.5] <- 1
> diagnosis[bmd > -2.5 & bmd <= -1.0] <- 2
> diagnosis[bmd > -1.0] <- 3

> # combine into a data frame
> data <- data.frame(bmd, diagnosis)

> # list the data to check that the commands worked
> data
bmd diagnosis
1 -0.92 3
2 0.21 3
3 0.17 3
4 -3.21 1
5 -1.80 2
6 -2.60 1
7 -2.00 2
8 1.71 3
9 2.12 3
10 -2.11 2
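As an alternative sketch, the same three-way classification can be obtained in a single step with the base function cut, which is not used in the text above. The cut points -2.5 and -1.0 are the thresholds from the text; by default each interval includes its upper limit, which matches the rule "at or below -2.5".

```r
bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)

# Intervals: (-Inf, -2.5], (-2.5, -1.0], (-1.0, Inf)
diagnosis <- cut(bmd, breaks = c(-Inf, -2.5, -1.0, Inf), labels = c(1, 2, 3))

data.frame(bmd, diagnosis)
```

Unlike the step-by-step recoding, cut returns a factor rather than a numeric vector.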


5.5 Recoding data with replace

Another way to recode data is to use replace, although it is a little more
long-winded. Continuing the example above, we recode bmd into diagnosis as follows:

> diagnosis <- bmd
> diagnosis <- replace(diagnosis, bmd <= -2.5, 1)
> diagnosis <- replace(diagnosis, bmd > -2.5 & bmd <= -1.0, 2)
> diagnosis <- replace(diagnosis, bmd > -1.0, 3)


5.6 Converting to a factor

In statistical analysis we distinguish between a factor (a categorical variable)
and an ordinary continuous variable. A factor cannot be used in arithmetic such as
addition, subtraction, multiplication or division, whereas a numeric variable can.
In the bmd and diagnosis example above, diagnosis is a factor, because the mean of
the codes 1 and 2 has no practical meaning; bmd is a numeric variable.

At the moment, however, R treats diagnosis as a numeric variable. To convert it to
a factor we use the function factor:

> diag <- factor(diagnosis)
> diag
[1] 3 3 3 1 2 1 2 3 3 2
Levels: 1 2 3

Note that R now tells us that diag has 3 levels: 1, 2 and 3. If we ask R for the
mean of diag, R will not comply, because diag is not a numeric variable:

> mean(diag)
[1] NA
Warning message:
argument is not numeric or logical: returning NA in: mean.default(diag)
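Because the codes 1, 2 and 3 are only labels, it can help to attach the names of the categories when creating the factor. The sketch below assumes the coding of section 5.4 (1 = osteoporosis, 2 = osteopenia, 3 = normal); the object name diag_labelled is made up for the illustration.

```r
# Codes as obtained in section 5.4
diagnosis <- c(3, 3, 3, 1, 2, 1, 2, 3, 3, 2)

# Attach descriptive labels to the three levels
diag_labelled <- factor(diagnosis,
                        levels = 1:3,
                        labels = c("osteoporosis", "osteopenia", "normal"))

table(diag_labelled)   # counts per category
```

Counting with table is the natural summary for a factor, in place of the mean.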

We can of course compute the mean of diagnosis:

> mean(diagnosis)
[1] 2.3

but the result, 2.3, has no practical meaning.


5.7 Grouping data with cut2 (Hmisc)

In statistical analysis we sometimes need to divide a continuous variable into
several groups based on its distribution. For the variable bmd, for example, we can
cut the values into groups of roughly equal size with the function cut2 (in the
Hmisc package):

> # load Hmisc so that cut2 is available
> library(Hmisc)
> bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,-2.00,1.71,2.12,-2.11)
> # split bmd into 2 groups and store the result in the object group
> group <- cut2(bmd, g=2)
> table(group)
group
[-3.21,-0.92) [-0.92, 2.12]
            5             5

As the example shows, g = 2 means divide into 2 groups (g = group). R automatically
puts bmd values from -3.21 up to (but not including) -0.92 in group 1 and values
from -0.92 to 2.12 in group 2. Each group has 5 values.

We can of course divide the data into 3 groups:

> group <- cut2(bmd, g=3)

The table command then shows 3 groups, with 4 values in group 1 and 3 values in
each of groups 2 and 3:

> table(group)
group
[-3.21,-1.80) [-1.80, 0.21) [ 0.21, 2.12]
            4             3             3


6. Using R for simple calculations

One of R's advantages is that it can be used like a pocket calculator. Beyond that,
R can be used for matrix operations and programming. In this chapter I present only
some simple calculations that school or university students can try immediately
while reading.

6.1 Simple calculations

Adding two or more numbers:
> 15+2997
[1] 3012

Addition and subtraction:
> 15+2997-9768
[1] -6756

Multiplication and division:
> -27*12/21
[1] -15.42857

Powers: $(25 - 5)^3$
> (25 - 5)^3
[1] 8000

Square root: $\sqrt{10}$
> sqrt(10)
[1] 3.162278

The number pi ($\pi$):
> pi
[1] 3.141593
> 2+3*pi
[1] 11.42478

Natural logarithm: $\log_e$
> log(10)
[1] 2.302585

Base-10 logarithm: $\log_{10}$
> log10(100)
[1] 2

Exponential: $e^{2.7689}$
> exp(2.7689)
[1] 15.94109

> log10(2+3*pi)
[1] 1.057848

Trigonometric functions:
> cos(pi)
[1] -1


Vectors:
> x <- c(2,3,1,5,4,6,7,6,8)
> x
[1] 2 3 1 5 4 6 7 6 8

> sum(x)
[1] 42

> x*2
[1] 4 6 2 10 8 12 14 12 16

> exp(x/10)
[1] 1.221403 1.349859 1.105171 1.648721 1.491825 1.822119 2.013753 1.822119
[9] 2.225541

> exp(cos(x/10))
[1] 2.664634 2.599545 2.704736 2.405079 2.511954 2.282647 2.148655 2.282647
[9] 2.007132

Sum of squares: $1^2 + 2^2 + 3^2 + 4^2 + 5^2 = ?$
> x <- c(1,2,3,4,5)
> sum(x^2)
[1] 55

Adjusted sum of squares: $\sum_{i=1}^{n} (x_i - \bar{x})^2 = ?$
> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)
[1] 10

In the expression above, mean(x) is the mean of the vector x.

Mean square: $\sum_{i=1}^{n} (x_i - \bar{x})^2 / n = ?$
> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)/length(x)
[1] 2

In the expression above, length(x) is the number of elements in the vector x.

Variance: $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1) = ?$
> x <- c(1,2,3,4,5)
> var(x)
[1] 2.5

Standard deviation: $s = \sqrt{s^2}$
> sd(x)
[1] 1.581139
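As a quick check that var and sd use the divisor n - 1 rather than n, the formulas above can be computed by hand and compared with the built-in functions. The names n and ss are introduced only for this sketch.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

# Adjusted sum of squares, as in the formula above
ss <- sum((x - mean(x))^2)

ss / (n - 1)         # variance by hand
var(x)               # built-in variance

sqrt(ss / (n - 1))   # standard deviation by hand
sd(x)                # built-in standard deviation
```

Dividing ss by n instead gives the mean square shown earlier, which is smaller than var(x) for the same data.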



6.2 Using R for matrix operations

A matrix, simply put, consists of rows and columns. When we write A[m, n] we mean
that the matrix A has m rows and n columns. R uses the same notation. Suppose we
want to create a square matrix A with 3 rows and 3 columns whose elements are 1, 2,
3, 4, 5, 6, 7, 8, 9:

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$

In R:

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

But if we instead type:

> A <- matrix(y, nrow=3, byrow=TRUE)
> A

the result is:

[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9

That is, the transpose of the previous matrix. Another way to create a transposed
matrix is to use t(). For example:

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

and B = A' can be written in R as:

> B <- t(A)
> B
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9

A scalar matrix is a square matrix (the number of rows equals the number of columns) in
which all off-diagonal elements are 0 and all diagonal elements are equal. When the
diagonal elements are all 1, it is the identity matrix. We can create such a matrix in R
as follows:

> # create a 3 x 3 matrix with all elements equal to 0
> A <- matrix(0, 3, 3)

> # set the diagonal elements to 1
> diag(A) <- 1
> diag(A)
[1] 1 1 1

> # the matrix A is now:
> A
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
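
The two-step construction above can be shortened. As a small sketch (the names I3 and S are illustrative), diag() called with a single number builds the identity matrix directly, and multiplying it by a constant gives a scalar matrix:

```r
# diag(n) builds the n x n identity matrix in one call.
I3 <- diag(3)

# Multiplying by a constant gives a scalar matrix with that constant on the diagonal.
S <- 5 * diag(3)

# Same result as the two-step construction used in the text.
A <- matrix(0, 3, 3)
diag(A) <- 1
print(all.equal(A, I3))
print(S)
```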


6.2.1 Extracting elements from a matrix

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

> # column 1 of matrix A
> A[,1]
[1] 1 2 3

> # row 3 of matrix A
> A[3,]
[1] 3 6 9

> # row 1 of matrix A
> A[1,]
[1] 1 4 7

> # row 2, column 3 of matrix A
> A[2,3]
[1] 8

> # all rows of matrix A except row 2
> A[-2,]
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 3 6 9

> # all columns of matrix A except column 1
> A[,-1]
[,1] [,2]
[1,] 4 7
[2,] 5 8
[3,] 6 9

> # which elements are greater than 3?
> A>3
[,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE TRUE
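
The logical matrix produced by A>3 can itself be used as an index. A brief sketch, assuming the same matrix A as above (the names big and pos are illustrative):

```r
A <- matrix(1:9, nrow = 3)

big <- A[A > 3]                      # the elements greater than 3, in column order
pos <- which(A > 3, arr.ind = TRUE)  # their row and column positions
print(big)
print(pos)
print(dim(A))                        # number of rows and columns of A
```

Indexing with a logical matrix returns a plain vector, so the row and column structure is lost; which(..., arr.ind = TRUE) is one way to recover the positions.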


6.2.2 Matrix arithmetic

Adding and subtracting two matrices. Given two matrices A and B as follows:

> A <- matrix(1:12, 3, 4)
> A
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12

> B <- matrix(-1:-12, 3, 4)
> B
[,1] [,2] [,3] [,4]
[1,] -1 -4 -7 -10
[2,] -2 -5 -8 -11
[3,] -3 -6 -9 -12

We can add them, A+B:

> C <- A+B
> C
[,1] [,2] [,3] [,4]
[1,] 0 0 0 0
[2,] 0 0 0 0
[3,] 0 0 0 0

Or subtract them, A-B:

> D <- A-B
> D
[,1] [,2] [,3] [,4]
[1,] 2 8 14 20
[2,] 4 10 16 22
[3,] 6 12 18 24


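Matrix multiplication. Given the two matrices

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$

we want to compute AB, which can be done in R with the %*% operator as follows: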

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> B <- t(A)
> AB <- A%*%B
> AB
[,1] [,2] [,3]
[1,] 66 78 90
[2,] 78 93 108
[3,] 90 108 126


Or to compute BA, again with the %*% operator:

> BA <- B%*%A
> BA
[,1] [,2] [,3]
[1,] 14 32 50
[2,] 32 77 122
[3,] 50 122 194
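
A common slip is to use * where %*% is intended. The sketch below (variable names elementwise and product are illustrative) contrasts the two operators on the matrices from this example and shows crossprod(), a base R function that computes t(A) %*% A, which is the product BA above:

```r
A <- matrix(1:9, nrow = 3)
B <- t(A)

elementwise <- A * B      # multiplies matching elements; not a matrix product
product     <- A %*% B    # matrix product
print(elementwise[1, 2])  # 4 * 2
print(product[1, 2])      # 1*2 + 4*5 + 7*8

# crossprod(A) computes t(A) %*% A, i.e. BA in the text.
print(all.equal(crossprod(A), B %*% A))

# Matrix multiplication is not commutative in general.
print(identical(A %*% B, B %*% A))
```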

Matrix inversion and solving systems of equations. Suppose we have the following system
of equations:

$$\begin{aligned} 3x_1 + 4x_2 &= 4 \\ x_1 + 6x_2 &= 2 \end{aligned}$$

This system can be written in matrix notation as AX = Y, where:

$$A = \begin{pmatrix} 3 & 4 \\ 1 & 6 \end{pmatrix}, \quad X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad Y = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$$

The solution of this system is $X = A^{-1}Y$, or in R:

> A <- matrix(c(3,1,4,6), nrow=2)
> Y <- matrix(c(4,2), nrow=2)
> X <- solve(A)%*%Y
> X
[,1]
[1,] 1.1428571
[2,] 0.1428571

We can check the first equation:

> 3*X[1,1]+4*X[2,1]
[1] 4
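
As a hedged aside, solve() also accepts the right-hand side as a second argument, which solves AX = Y without forming the inverse explicitly. A minimal sketch with the same A and Y (the names X1 and X2 are illustrative):

```r
A <- matrix(c(3, 1, 4, 6), nrow = 2)
Y <- matrix(c(4, 2), nrow = 2)

X1 <- solve(A) %*% Y  # explicit inverse, as in the text
X2 <- solve(A, Y)     # solves A X = Y directly

print(all.equal(X1, X2))
print(all.equal(A %*% X2, Y))  # both equations are satisfied
```

For a single system the two approaches agree; the direct form is usually preferred because it avoids computing an inverse that is not otherwise needed.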

Eigenvalues and eigenvectors can be computed with the function eigen as follows:

> eigen(A)
$values
[1] 7 2

$vectors
[,1] [,2]
[1,] -0.7071068 -0.9701425
[2,] -0.7071068 0.2425356

Determinant. How do we determine whether a matrix can be inverted? A matrix whose
determinant is 0 is a singular matrix and cannot be inverted. To compute the
determinant, R provides det():

> E <- matrix((1:9), 3, 3)
> E
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> det(E)
[1] 0

The following matrix F, however, can be inverted:

> F <- matrix((1:9)^2, 3, 3)
> F
[,1] [,2] [,3]
[1,] 1 16 49
[2,] 4 25 64
[3,] 9 36 81
> det(F)
[1] -216

The inverse of F ($F^{-1}$) can be computed with the function solve() as follows:

> solve(F)
[,1] [,2] [,3]
[1,] 1.291667 -2.166667 0.9305556
[2,] -1.166667 1.666667 -0.6111111
[3,] 0.375000 -0.500000 0.1805556
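
One way to check an inverse is to multiply it by the original matrix and compare with the identity. The sketch below does this for the matrix above; it uses the name Fm rather than F, an assumption made here only to avoid masking F, which R also uses as shorthand for FALSE:

```r
Fm <- matrix((1:9)^2, 3, 3)  # same values as F in the text
Finv <- solve(Fm)

# The product should be the identity matrix, up to floating-point rounding.
print(round(Fm %*% Finv, 10))
print(all.equal(Fm %*% Finv, diag(3)))
print(all.equal(det(Fm), -216))
```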


Beyond these simple calculations, R can also be used for more complex ones. A notable
advantage of R is that users are free to write calculations tailored to each specific
problem. R has a package, Matrix, designed specifically for matrix computation. Readers
can download the package, install it, and use it if needed. The download address is
http://cran.au.r-project.org/bin/windows/contrib/r-release/Matrix_0.995-8.zip
and the accompanying user guide (about 80 pages long) is at
http://cran.au.r-project.org/doc/packages/Matrix.pdf.


7. Using R for probability calculations

7.1 Permutations

We know that 3! = 3 × 2 × 1 = 6 and that 0! = 1. In general, the number of permutations
of n items is $n! = n(n-1)(n-2)(n-3) \cdots 1$. In R this is easy to compute with
prod():

Find 3!:
> prod(3:1)
[1] 6

Find 10!:
> prod(10:1)
[1] 3628800

Find 10 × 9 × 8 × 7 × 6 × 5 × 4:
> prod(10:4)
[1] 604800

Find (10 × 9 × 8 × 7 × 6 × 5 × 4) / (40 × 39 × 38 × 37 × 36):
> prod(10:4) / prod(40:36)
[1] 0.007659481
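
Base R also provides factorial(), which gives the same results as the prod() calls above. A short sketch:

```r
print(factorial(3))                  # same as prod(3:1)
print(factorial(10))                 # same as prod(10:1)
print(factorial(10) / factorial(3))  # 10 * 9 * ... * 4, same as prod(10:4)
print(factorial(0))                  # 0! = 1
```

prod() remains convenient for partial products such as prod(10:4), where no division of factorials is needed.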

7.2 Combinations

The number of ways to choose k items from n items is
$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$. This is sometimes written $C^n_k$ instead of
$\binom{n}{k}$. In R, this is computed simply with the function choose(n, k). Here are
a few examples:

Find $\binom{5}{2}$:
> choose(5, 2)
[1] 10

Find the probability that one particular pair, A and B, out of 5 people is elected to the two positions:
> 1/choose(5, 2)
[1] 0.1
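
To connect choose() with the factorial formula, the following sketch computes the same number three ways. The candidate names C, D, and E are hypothetical (the text names only A and B), and combn() is the base R function that enumerates the combinations:

```r
n <- 5
k <- 2

by_formula <- factorial(n) / (factorial(k) * factorial(n - k))
print(by_formula)
print(choose(n, k))

# Enumerate every possible pair; each column of the result is one pair.
all_pairs <- combn(c("A", "B", "C", "D", "E"), k)
print(ncol(all_pairs))
print(1 / ncol(all_pairs))  # probability of one particular pair, e.g. A and B
```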

7.3 Random variables and distribution functions

The distribution of a variable refers to the values that the variable can take.
Distribution functions describe variables in a systematic way, that is, according to a
specific mathematical model with given parameters. Probability and statistics use many
distribution functions; here we look at the most important and most widely used ones:
the binomial, Poisson, and normal distributions. For each distribution there are 4
kinds of function that we need to know:

- the probability density function;
- the cumulative probability function;
- the quantile function; and
- the simulation function.

R has built-in functions for all of these. The name of each function consists of a
prefix indicating the kind of function, followed by an abbreviation of the
distribution's name. The prefixes are d (density or probability), p (cumulative
probability), q (quantile), and r (random numbers). The abbreviated names are norm
(normal), binom (binomial), pois (Poisson), and so on. The following table summarizes
the functions and the parameters of each:

Distribution      | Density                       | Cumulative                    | Quantile                      | Simulation
Normal            | dnorm(x, mean, sd)            | pnorm(q, mean, sd)            | qnorm(p, mean, sd)            | rnorm(n, mean, sd)
Binomial          | dbinom(x, size, prob)         | pbinom(q, size, prob)         | qbinom(p, size, prob)         | rbinom(n, size, prob)
Poisson           | dpois(x, lambda)              | ppois(q, lambda)              | qpois(p, lambda)              | rpois(n, lambda)
Uniform           | dunif(x, min, max)            | punif(q, min, max)            | qunif(p, min, max)            | runif(n, min, max)
Negative binomial | dnbinom(x, size, prob)        | pnbinom(q, size, prob)        | qnbinom(p, size, prob)        | rnbinom(n, size, prob)
Beta              | dbeta(x, shape1, shape2)      | pbeta(q, shape1, shape2)      | qbeta(p, shape1, shape2)      | rbeta(n, shape1, shape2)
Gamma             | dgamma(x, shape, rate, scale) | pgamma(q, shape, rate, scale) | qgamma(p, shape, rate, scale) | rgamma(n, shape, rate, scale)
Geometric         | dgeom(x, prob)                | pgeom(q, prob)                | qgeom(p, prob)                | rgeom(n, prob)
Exponential       | dexp(x, rate)                 | pexp(q, rate)                 | qexp(p, rate)                 | rexp(n, rate)
Weibull           | dweibull(x, shape, scale)     | pweibull(q, shape, scale)     | qweibull(p, shape, scale)     | rweibull(n, shape, scale)
Cauchy            | dcauchy(x, location, scale)   | pcauchy(q, location, scale)   | qcauchy(p, location, scale)   | rcauchy(n, location, scale)
F                 | df(x, df1, df2)               | pf(q, df1, df2)               | qf(p, df1, df2)               | rf(n, df1, df2)
t                 | dt(x, df)                     | pt(q, df)                     | qt(p, df)                     | rt(n, df)
Chi-squared       | dchisq(x, df)                 | pchisq(q, df)                 | qchisq(p, df)                 | rchisq(n, df)

Note: In the table above, df = degrees of freedom; prob = probability; n = sample size
(the number of values to generate). For the binomial functions, size is the number of
trials (written n in the formulas below) and prob is the probability of success
(written p). The other parameters are described in the documentation of each
distribution. The F, t, and chi-squared distributions also have a non-centrality
parameter (ncp), which is set to 0; users can supply another suitable value if needed.

7.3.1 Binomial distribution

As the name suggests, a binomial variable has only two values: male / female, alive /
dead, yes / no, and so on. The binomial distribution is stated as follows: if an
experiment is carried out n times, each time resulting in either success or failure,
and the probability of success p is known in advance, then the probability of k
successes is

$$P(k \mid n, p) = C^n_k \, p^k (1-p)^{n-k},$$

where k = 0, 1, 2, ..., n. In R, the function dbinom(k, n, p) evaluates this formula
quickly. For example, for k = 2 successes in n = 3 trials with p = 0.60, we simply
type:

> dbinom(2, 3, 0.60)
[1] 0.432

Example 2: Cumulative binomial probability. The probability that an anti-osteoporosis
drug is effective is about 70% (that is, p = 0.70). If we treat 10 patients, what is
the probability that at least 8 patients respond? In other words, if X is the number of
patients treated successfully, we want $P(X \ge 8)$. To answer this question we use
the function pbinom(k, n, p), which gives $P(X \le k)$. Hence
$P(X \ge 8) = 1 - P(X \le 7)$, and the answer in R is:

> 1-pbinom(7, 10, 0.70)
[1] 0.3827828
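
The complement rule used above can be cross-checked in two other ways. This is a minimal sketch with illustrative variable names; lower.tail = FALSE is a standard argument of pbinom() that returns P(X > k) directly:

```r
n <- 10
p <- 0.70

by_complement <- 1 - pbinom(7, n, p)                  # as in the text
by_summing    <- sum(dbinom(8:10, n, p))              # P(X=8) + P(X=9) + P(X=10)
by_upper_tail <- pbinom(7, n, p, lower.tail = FALSE)  # P(X > 7) directly
print(c(by_complement, by_summing, by_upper_tail))
```

All three values agree, which confirms that "at least 8" corresponds to a cut-off of 7 in pbinom().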

Example 3: Simulating a binomial variable. Suppose that about 20% of a population has
hypertension. If we draw 1000 samples, each of 20 people chosen at random from the
population, what will the distribution of the number of hypertensive patients look
like? To answer this question we can use the function rbinom(n, size, prob), with
n = 1000 samples, size = 20 people per sample, and prob = 0.20:

> b <- rbinom(1000, 20, 0.20)

In the command above, the simulation results are stored in an object named b. To see
what b contains, we tabulate it with table:

> table(b)
b
0 1 2 3 4 5 6 7 8 9 10
6 45 147 192 229 169 105 68 23 13 3

The first row (0, 1, 2, ..., 10) is the number of hypertensive patients among the 20
people in a sample. The second row is the number of samples, out of 1000, in which that
count occurred. So 6 samples contained no hypertensive patient, 45 samples contained
exactly 1 hypertensive patient, and so on. Perhaps the easiest way to see this is to
plot these frequencies with hist as follows:

> hist(b, main="Number of hypertensive patients")

[Figure: histogram titled "Number of hypertensive patients"; x-axis: b, y-axis: Frequency]

Figure 1. Distribution of the number of hypertensive patients among 20 people chosen at
random from a population in which 20% have hypertension, with the sampling repeated
1000 times.

The figure shows that samples with 4 hypertensive patients (out of 20 people) are the
most frequent (22.9%). This makes sense: because the prevalence of hypertension is 20%,
we expect on average 4 of the 20 people sampled to have hypertension. The important
point the figure makes, however, is that we sometimes observe as many as 10
hypertensive patients in a sample, even though the probability of such a sample is very
low (only 3/1000).
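
Because rbinom() draws random numbers, each run gives a slightly different table. The sketch below is one way to make a run repeatable and to compare it with the exact binomial probabilities; the seed value 123 is arbitrary, and the names observed and expected are illustrative:

```r
set.seed(123)  # arbitrary seed, chosen only so that the run can be repeated
b <- rbinom(1000, 20, 0.20)

observed <- table(factor(b, levels = 0:20)) / 1000  # simulated proportions
expected <- dbinom(0:20, 20, 0.20)                  # exact binomial probabilities

# Compare simulated and exact values for 0 to 10 patients per sample.
comparison <- rbind(observed = as.numeric(observed), expected = expected)
print(round(comparison[, 1:11], 3))
print(mean(b))  # close to the expected value 20 * 0.20 = 4
```

The exact probability of 4 patients, dbinom(4, 20, 0.20), is about 0.218, which is in line with the 22.9% seen in the simulated table above.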

7.3.2 Poisson distribution

The Poisson distribution is, in general, very similar to the binomial distribution,
except that the parameter p is usually very small and n is usually very large. The
Poisson distribution is therefore often used to describe rare events (such as the
number of people who develop cancer in a population). It has also been applied widely
and successfully in engineering and market research, for example to the number of
customers arriving at a restaurant each hour.

Example 4: Poisson probability function. After months of observation, the spelling
error rate of a typist is known: on average the typist makes 1 error in about every
2,000 words. What is the probability that the typist makes 2 errors, or more than 2
errors, in 2,000 words?

Because the frequency is quite low, we can assume that the number of spelling errors
(call it X) is a random variable following the Poisson distribution. Here the mean
number of errors is 1 ($\lambda = 1$). The Poisson distribution states that the
probability that X = k, given the mean $\lambda$, is:

$$P(X = k \mid \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}$$

The answer to the question above is therefore
$P(X = 2 \mid \lambda = 1) = \frac{e^{-1} 1^2}{2!} = 0.1839$. R gives this answer more
quickly with the function dpois:

> dpois(2, 1)
[1] 0.1839397

We can also compute the probability of exactly 1 error and the probability of no errors:

> dpois(1, 1)
[1] 0.3678794

> dpois(0, 1)
[1] 0.3678794

Note that in the first call, dpois(2, 1), we simply supplied k = 2 and $\lambda = 1$.
That is the probability that the typist makes exactly 2 errors. The probability of more
than 2 errors (that is, 3, 4, 5, ... errors) can be computed as:

$$\begin{aligned} P(X > 2) &= P(X = 3) + P(X = 4) + P(X = 5) + \cdots \\ &= 1 - P(X \le 2) \\ &= 1 - 0.3678 - 0.3678 - 0.1839 \\ &= 0.08 \end{aligned}$$

In R, we can compute this as follows:

> # P(X <= 2)
> ppois(2, 1)
[1] 0.9196986

> # 1 - P(X <= 2)
> 1-ppois(2, 1)
[1] 0.0803014
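
The same tail probability can be obtained in several equivalent ways, which is a useful sanity check. A minimal sketch (variable names are illustrative):

```r
lambda <- 1

by_complement <- 1 - sum(dpois(0:2, lambda))           # 1 - P(0) - P(1) - P(2)
by_upper_tail <- ppois(2, lambda, lower.tail = FALSE)  # P(X > 2) directly
by_formula    <- 1 - exp(-lambda) * (1 + lambda + lambda^2 / 2)
print(c(by_complement, by_upper_tail, by_formula))
```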

7.3.3 Normal distribution

The two distributions we have just examined belong to the family of discrete
distributions, in which the variable takes ordered or categorical values. For
continuous variables there are several other suitable distributions, the most important
of which is the normal distribution. The normal distribution is the most important
foundation of statistical analysis; it is no exaggeration to say that most of
statistical theory is built on it. The normal density function has two parameters: the
mean $\mu$ and the variance $\sigma^2$ (or the standard deviation $\sigma$). Let X be a
variable (height, for example). The normal density function states that the probability
(density) that X = x is:

$$P(X = x \mid \mu, \sigma^2) = f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$

Example 5: Normal probability density function. The current mean height of Vietnamese
women is 156 cm, with a standard deviation of 4.6 cm. Height is also known to follow a
normal distribution. With the two parameters $\mu = 156$ and $\sigma = 4.6$, we can
construct the height distribution for the whole population of Vietnamese women, which
has the following shape:

[Figure: density curve titled "Probability distribution of height in Vietnamese women"; x-axis: Height, from 130 to 200; y-axis: f(height)]

Figure 2. Distribution of height in Vietnamese women, with mean 156 cm and standard
deviation 4.6 cm. The horizontal axis is height and the vertical axis is the
probability density of each height.

The figure above was drawn with the following two commands. The first creates a
variable height with the values 130, 131, 132, ..., 200 cm. The second draws the curve
for a mean of 156 cm and a standard deviation of 4.6 cm.

> height <- seq(130, 200, 1)
> plot(height, dnorm(height, 156, 4.6),
       type="l",
       ylab="f(height)",
       xlab="Height",
       main="Probability distribution of height in Vietnamese women")


With these two parameters (and the figure), we can estimate the probability density for
any height. For example, for a Vietnamese woman with a height of 160 cm:

$$P(X = 160 \mid \mu = 156, \sigma = 4.6) = \frac{1}{4.6\sqrt{2 \times 3.1416}} \exp\left[-\frac{(160-156)^2}{2 \times 4.6^2}\right] = 0.0594$$

The R function dnorm(x, mean, sd) computes this value for us concisely:

> dnorm(160, mean=156, sd=4.6)
[1] 0.05942343
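
To see that dnorm() implements exactly the density formula above, the formula can be typed out directly. A minimal sketch (the names x, mu, sigma, and by_hand are illustrative):

```r
x     <- 160
mu    <- 156
sigma <- 4.6

# Normal density written out term by term.
by_hand <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
print(by_hand)
print(dnorm(x, mean = mu, sd = sigma))
```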

Cumulative normal probability function. Because height is a continuous variable, in
practice we rarely want the probability of one specific value x; we usually want the
probability of a range of values from a to b. For example, we may want the probability
that height is between 150 and 160 cm, that is $P(150 \le X \le 160)$, or the
probability that height is below 145 cm, that is $P(X < 145)$. To answer such questions
we need the cumulative normal probability function, defined as follows:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

Thus $P(150 \le X \le 160)$ is the area under the curve in Figure 2 between 150 and 160
on the horizontal axis. R provides the function pnorm(x, mean, sd) to compute
cumulative probabilities for a normal distribution:

$$\text{pnorm}(a, \text{mean}, \text{sd}) = \int_{-\infty}^{a} f(x)\,dx = P(X \le a \mid \text{mean}, \text{sd})$$

For example, the probability that a Vietnamese woman's height is 150 cm or less is 9.6%:

> pnorm(150, 156, 4.6)
[1] 0.0960575

Or the probability that a Vietnamese woman is 164 cm or taller:

> 1-pnorm(164, 156, 4.6)
[1] 0.04100591

In other words, only about 4.1% of Vietnamese women are 164 cm or taller.


Example 6: Applying the normal distribution. In a population, mean blood pressure is
known to be 100 mmHg with a standard deviation of 13 mmHg. What proportion of people in
this population have blood pressure of 120 mmHg or higher? The answer in R is:

> 1-pnorm(120, mean=100, sd=13)
[1] 0.0619679

That is, about 6.2% of people in this population have blood pressure of 120 mmHg or
higher.
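
The two interval questions raised earlier for height, P(150 ≤ X ≤ 160) and P(X < 145), can be answered by differencing pnorm() values. The sketch below also checks the first answer against a numerical integral of the density, using base R's integrate(); the variable names are illustrative:

```r
mu    <- 156
sigma <- 4.6

p_between <- pnorm(160, mu, sigma) - pnorm(150, mu, sigma)  # P(150 <= X <= 160)
p_below   <- pnorm(145, mu, sigma)                          # P(X < 145)

# Numerical integration of the density over [150, 160] should agree with p_between.
area <- integrate(dnorm, lower = 150, upper = 160, mean = mu, sd = sigma)$value
print(c(p_between = p_between, area = area, p_below = p_below))
```

Under these parameters roughly 71% of heights fall between 150 and 160 cm, and fewer than 1% fall below 145 cm.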


7.3.4 Standardized normal distribution

A variable X that follows a normal distribution with mean $\mu$ and variance $\sigma^2$
is usually written as:

$$X \sim N(\mu, \sigma^2)$$

Here $\mu$ and $\sigma^2$ depend on the unit of measurement of the variable. Height is
measured in cm (or m), blood pressure in mmHg, age in years, and so on, so variables
described in their original units are sometimes hard to compare. A simpler approach is
to standardize X so that its mean is 0 and its variance is 1. With a little algebra it
is easy to show that the transformation of X that meets this condition is:

$$Z = \frac{X - \mu}{\sigma}$$

In mathematical terms: if $X \sim N(\mu, \sigma^2)$, then
$(X - \mu)/\sigma \sim N(0, 1)$. So Z is essentially the difference between a value and
the mean, expressed in numbers of standard deviations. If Z = 0, X equals the mean
$\mu$. If Z = -1, X is exactly 1 standard deviation below $\mu$. Similarly, if Z = 2.5,
X is exactly 2.5 standard deviations above $\mu$, and so on.

The height distribution of Vietnamese women can be described in this new unit, the z
score, as follows:


[Figure: density curve titled "Probability distribution of height in Vietnamese women"; x-axis: z, from -4 to 4; y-axis: f(z)]

Figure 3. Standardized distribution of height in Vietnamese women.

The figure above was drawn with the following two commands:

> z <- seq(-4, 4, 0.1)
> plot(z, dnorm(z, 0, 1),
       type="l",
       ylab="f(z)",
       xlab="z",
       main="Probability distribution of height in Vietnamese women")

The standardized normal distribution is convenient because it can be used to describe
and compare the density of any variable, since all of them are converted to z scores.

In the figure above, the vertical axis is the probability density of z and the
horizontal axis is z. We can easily compute in R the probability that z is less than
some constant. For example, suppose we want $P(z \le -1.96)$ for a distribution with
mean 0 and standard deviation 1.

> pnorm(-1.96, mean=0, sd=1)
[1] 0.02499790

Or $P(z \le 1.96)$:

> pnorm(1.96, mean=0, sd=1)
[1] 0.9750021

Therefore P(-1.96 < z < 1.96) is:

> pnorm(1.96) - pnorm(-1.96)
[1] 0.9500042

In other words, there is a 95% probability that z lies between -1.96 and 1.96. (Note
that in the command above I did not supply mean=0, sd=1, because the default values of
pnorm's mean and sd arguments are 0 and 1.)

Example 5 (continued). To recap, the mean height of Vietnamese women is 156 cm and the
standard deviation is 4.6 cm. A woman who is 170 cm tall therefore has
z = (170 - 156) / 4.6 = 3.04 standard deviations, and the proportion of Vietnamese
women taller than 170 cm is very low, only about 0.1%.

> 1-pnorm(3.04)
[1] 0.001182891
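
Standardizing by hand and calling pnorm() on the original scale should give the same tail probability. The following sketch checks this for a height of 170 cm (the variable names are illustrative):

```r
mu    <- 156
sigma <- 4.6

z <- (170 - mu) / sigma  # unrounded z score
p_original     <- pnorm(170, mean = mu, sd = sigma, lower.tail = FALSE)
p_standardized <- pnorm(z, lower.tail = FALSE)
print(z)
print(c(p_original, p_standardized))
```

The two probabilities are identical. They are slightly smaller than the value obtained with 1-pnorm(3.04), because that call uses z rounded to two decimals.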

Finding quantiles of a normal distribution. Sometimes we need the reverse calculation.
For example: if the probability that Z is less than some constant z equals a given
value p, what is z? In probability notation, we want to find z such that:

P(Z < z) = p

To answer this question we use the function qnorm(p, mean=, sd=).

Example 7: Given that Z ~ N(0, 1) and P(Z < z) = 0.95, we want to find z.

> qnorm(0.95, mean=0, sd=1)
[1] 1.644854

Or P(Z < z) = 0.975 for a normal distribution with mean 0 and standard deviation 1:

> qnorm(0.975, mean=0, sd=1)
[1] 1.959964
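
qnorm() also works on the original scale of a variable. As a sketch using the height parameters from Example 5 (the names q_z and q_height are illustrative), the 97.5% quantile of height can be obtained directly or by transforming the z quantile back:

```r
mu    <- 156
sigma <- 4.6

q_z      <- qnorm(0.975)                         # quantile on the z scale
q_height <- qnorm(0.975, mean = mu, sd = sigma)  # quantile on the original scale (cm)

print(q_height)
print(all.equal(q_height, mu + q_z * sigma))  # back-transformation from z
print(pnorm(q_z))                             # qnorm and pnorm are inverses
```

Under these parameters, 97.5% of heights are below about 165 cm.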


7.4 Random sampling

In probability and statistics, random sampling is very important because it ensures the
validity of statistical methods of analysis and inference. In R we can draw a random
sample with the function sample.

Example 8: We have a population of 40 people (numbered 1, 2, 3, ..., 40). If we want to
select 5 subjects from this population, who will be selected? We can use sample() to
answer this question as follows:

> sample(1:40, 5)
[1] 32 26 6 18 9

The result shows that subjects 32, 26, 6, 18, and 9 were selected. Each time the command
is run, R selects a different sample rather than repeating the one above. For example:

> sample(1:40, 5)
[1] 5 22 35 19 4

> sample(1:40, 5)
[1] 24 26 12 6 22

> sample(1:40, 5)
[1] 22 38 11 6 18

and so on.

The command above performs random sampling without replacement: once a subject has been
selected, it is not put back into the population.

We may instead want sampling with replacement, in which each selected subject is put
back into the population before the next draw. For example, to select 10 people from a
population of 50 by random sampling with replacement, we only need to add the argument
replace = TRUE:

> sample(1:50, 10, replace=T)
[1] 31 44 6 8 47 50 10 16 29 23

Or consider tossing a coin 10 times; each toss has 2 possible outcomes, H and T, and the
result of the 10 tosses might be:

> sample(c("H", "T"), 10, replace=T)
[1] "H" "T" "H" "H" "H" "T" "H" "H" "T" "T"

We can also imagine a bag containing 5 blue balls (X) and 5 red balls (D). We draw 1
ball, record its colour, and put it back in the bag; then we draw another ball, record
its colour, and put it back again. Repeating this 20 times, the result might be:

> sample(c("X", "D"), 20, replace=T)
[1] "X" "D" "D" "D" "D" "D" "X" "X" "X" "X" "X" "D" "X" "X" "D" "X" "X" "X" "X"
[20] "D"

We can also sample with given probabilities. In the following call we draw 10 values
from the numbers 1 to 5, with unequal probabilities:

> sample(5, 10, prob=c(0.3, 0.4, 0.1, 0.1, 0.1), replace=T)
[1] 3 1 3 2 2 2 2 2 5 1

Value 1 was drawn 2 times, value 2 was drawn 5 times, value 3 was drawn 2 times, and so
on. Because the sample is small, the frequencies do not match the supplied
probabilities 0.3, 0.4, 0.1 exactly, but they are not far from what we expect.
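
Since sample() gives a different result on every call, it is sometimes useful to make a draw repeatable. One common approach, sketched here with an arbitrary seed value, is to call set.seed() first (the names first and second are illustrative):

```r
set.seed(2024)           # arbitrary seed value
first <- sample(1:40, 5)

set.seed(2024)           # resetting the same seed reproduces the same sample
second <- sample(1:40, 5)

print(first)
print(identical(first, second))
print(anyDuplicated(first) == 0)  # without replacement, no one is chosen twice
```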


8. Graphics

R offers many ways to produce neat and attractive graphics. Most graphics functions are
built into R, but some more sophisticated and complex plots can be produced with
specialised packages such as lattice or trellis, which can be downloaded from the R
website. In this chapter I show how to draw common plots using widely used R functions.


8.1 Data for the graphics examples

Having looked at the environment and the options for designing a plot, we can now use
some common functions to plot data. In my view, plots fall into 2 main types: plots
that describe a single variable, and plots of the relationship between two or more
variables. Since a variable can be continuous or discrete, in practice we have 4 kinds
of plot. In what follows I go through them, from simple to complex.

Perhaps the best way to learn graphics in R is with a real data set. I return to
example 2 (section 4.2). In that example we have a data set with 8 columns (variables):
id, sex, age, bmi, hdl, ldl, tc, and tg. (Here id is the identifier of the 50 study
subjects; sex is sex (male or female); age is age; bmi is body mass index; hdl is high
density cholesterol; ldl is low density cholesterol; tc is total cholesterol; and tg is
triglycerides.) The data are stored in the directory c:\works\insulin under the name
chol.txt. Before plotting, we start by reading the data into R.

> setwd("c:/works/stats")
> cong <- read.table("chol.txt", header=TRUE, na.strings=".")
> attach(cong)

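Alternatively, so that readers can follow along, the data can be entered with the
following commands:

sex <- c("Nam", "Nu", "Nu", "Nam", "Nam", "Nu", "Nam", "Nam", "Nam", "Nu",
         "Nu", "Nam", "Nu", "Nam", "Nam", "Nu", "Nu", "Nu", "Nu", "Nu",
         "Nu", "Nu", "Nu", "Nu", "Nam", "Nam", "Nu", "Nam", "Nu", "Nu",
         "Nu", "Nam", "Nam", "Nu", "Nu", "Nam", "Nu", "Nam", "Nu", "Nu",
         "Nam", "Nu", "Nam", "Nam", "Nam", "Nu", "Nam", "Nam", "Nu", "Nu")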

age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
51, 58, 60, 45, 63, 52, 64, 45, 64, 62)

bmi <- c( 17, 18, 18, 18, 18, 18, 19, 19, 19, 19, 20, 20, 20, 20, 20,
20, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22,
22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24,
24, 24, 24, 25, 25)

hdl <- c(5.000,4.380,3.360,5.920,6.250,4.150,0.737,7.170,6.942,5.000,
4.217,4.823,3.750,1.904,6.900,0.633,5.530,6.625,5.960,3.800,
5.375,3.360,5.000,2.608,4.130,5.000,6.235,3.600,5.625,5.360,
6.580,7.545,6.440,6.170,5.270,3.220,5.400,6.300,9.110,7.750,
6.200,7.050,6.300,5.450,5.000,3.360,7.170,7.880,7.360,7.750)

ldl <- c(2.0, 3.0, 3.0, 4.0, 2.1, 3.0, 3.0, 3.0, 3.0, 2.0,
5.0, 1.3, 1.2, 0.7, 4.0, 4.1, 4.3, 4.0, 4.3, 4.0,
3.1, 3.0, 1.7, 2.0, 2.1, 4.0, 4.1, 4.0, 4.2, 4.2,
4.4, 4.3, 2.3, 6.0, 3.0, 3.0, 2.6, 4.4, 4.3, 4.0,
3.0, 4.1, 4.4, 2.8, 3.0, 2.0, 1.0, 4.0, 4.6, 4.0)

tc <- c(4.0, 3.5, 4.7, 7.7, 5.0, 4.2, 5.9, 6.1, 5.9, 4.0,
6.2, 4.1, 3.0, 4.0, 6.9, 5.7, 5.7, 5.3, 7.1, 3.8,
4.3, 4.8, 4.0, 3.0, 3.1, 5.3, 5.3, 5.4, 4.5, 5.9,
5.6, 8.3, 5.8, 7.6, 5.8, 3.1, 5.4, 6.3, 8.2, 6.2,
6.2, 6.7, 6.3, 6.0, 4.0, 3.7, 6.1, 6.7, 8.1, 6.2)

tg <- c(1.1, 2.1, 0.8, 1.1, 2.1, 1.5, 2.6, 1.5, 5.4, 1.9,
1.7, 1.0, 1.6, 1.1, 1.5, 1.0, 2.7, 3.9, 3.0, 3.1,
2.2, 2.7, 1.1, 0.7, 1.0, 1.7, 2.9, 2.5, 6.2, 1.3,
3.3, 3.0, 1.0, 1.4, 2.5, 0.7, 2.4, 2.4, 1.4, 2.7,
2.4, 3.3, 2.0, 2.6, 1.8, 1.2, 1.9, 3.3, 4.0, 2.5)

cong <- data.frame(sex, age, bmi, hdl, ldl, tc, tg)


8.2 Plotting one discrete variable: barplot

The variable sex in the data above takes two values (Nam, male, and Nu, female), so it
is a discrete variable. We want to know the frequency of each sex (how many men and how
many women) and draw a simple plot. To do this, we first use the function table to
obtain the frequencies:

> sex.freq <- table(sex)
> sex.freq
sex
Nam Nu
22 28

There are 22 men and 28 women in the study. We then use the function barplot to display
these frequencies as follows:

> barplot(sex.freq, main="Frequency of males and females")

The same plot can be produced with a single, simpler command (Figure 8a):

> barplot(table(sex), main="Frequency of males and females")


[Figures 8a and 8b: bar charts titled "Frequency of males and females", with one bar each for Nam and Nu]

Figure 8a. Frequency of each sex shown as vertical bars.
Figure 8b. Frequency of each sex shown as horizontal bars.

Instead of showing the frequencies of men and women as 2 vertical bars, we can show
them as two horizontal bars with the argument horiz = TRUE, as follows (see the result
in Figure 8b):

> barplot(sex.freq,
          horiz = TRUE,
          col = rainbow(length(sex.freq)),
          main="Frequency of males and females")

8.3 Plotting two discrete variables: barplot

Age is a continuous variable. We can divide the patients into several groups according
to age. The function cut "cuts" a continuous variable into discrete groups. For
example:

> ageg <- cut(age, 3)
> table(ageg)
ageg
(42,54.7] (54.7,67.3] (67.3,80]
19 24 7

This divides the variable age into 3 groups: 42 to 54.7 years is group 1, 54.7 to 67.3
is group 2, and 67.3 to 80 years is group 3. Group 1 has 19 patients, and groups 2 and
3 have 24 and 7 patients.
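
When cut() is given only a number of groups, it chooses the break points itself, which leads to boundaries such as 54.7. If rounder boundaries are preferred, the breaks can be supplied explicitly. The sketch below uses the age data from section 8.1; the break points 40, 50, 60, 70, 80 and the name ageg10 are illustrative choices, not from the text:

```r
age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
         61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
         60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
         51, 58, 60, 45, 63, 52, 64, 45, 64, 62)

# Explicit breaks give the intervals (40,50], (50,60], (60,70], (70,80].
ageg10 <- cut(age, breaks = c(40, 50, 60, 70, 80))
print(table(ageg10))
```

Each interval is open on the left and closed on the right by default, so a patient aged exactly 50 falls into the first group.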

Now we want to know how many patients there are in each age group and each sex, using
table:

> age.sex <- table(sex, ageg)
> age.sex
ageg
sex (42,54.7] (54.7,67.3] (67.3,80]
Nam 10 10 2
Nu 9 14 5

The result shows 10 men and 9 women in the first age group, 10 men and 14 women in the
second age group, and so on. To display the frequencies of these two variables we again
use barplot:

> barplot(age.sex, main="Number of males and females in each age group")


[Figures 9a and 9b: bar charts of sex by age group (42,54.7], (54.7,67.3], (67.3,80]; 9a is titled "Number of males and females in each age group", 9b has the x-axis label "Age group"]

Figure 9a. Frequency of sex and age group shown as one bar per age group.
Figure 9b. Frequency of sex and age group shown as two bars per age group.

In Figure 9a, each bar represents one age group; the dark part of the bar is the
frequency of women and the light part is the frequency of men. Instead of showing men
and women in a single bar, we can show them as 2 separate bars with beside=TRUE, as
follows (Figure 9b):

> barplot(age.sex, beside=TRUE, xlab="Age group")


8.4 Pie charts

The frequencies of a discrete variable can also be shown in a pie chart. The following
example draws the frequencies of age. Figure 10a shows 3 age groups and Figure 10b
shows 5 age groups:

> pie(table(ageg))
> pie(table(cut(age,5)))

[Figures 10a and 10b: pie charts with slices (42,54.7], (54.7,67.3], (67.3,80] in 10a and (42,49.6], (49.6,57.2], (57.2,64.8], (64.8,72.4], (72.4,80] in 10b]

Figure 10a. Frequencies for 3 age groups.
Figure 10b. Frequencies for 5 age groups.


8.5 Plotting one continuous variable: stripchart and hist

8.5.1 Stripchart

A strip chart shows how the values of a variable are spread along a continuous scale.
For example, to examine the spread of triglycerides (tg), we can use the function
stripchart():

> stripchart(tg,
             main="Strip chart for triglycerides", xlab="mg/L")

[Figure: strip chart titled "Strip chart for triglycerides"; x-axis: mg/L, from 1 to 6]

We see gaps in the values of tg, particularly among subjects with high tg. While most
subjects have tg below 5, there are 2 subjects with very high tg (>5).

8.5.2 Histogram

Age is a continuous variable. To draw a frequency plot of age, we simply type
hist(age). As mentioned above, we can improve this plot by adding a main title (main)
and titles for the horizontal axis (xlab) and the vertical axis (ylab):

> hist(age)
> hist(age, main="Frequency distribution by age group", xlab="Age group",
       ylab="No of patients")

[Figures 11a and 11b: histograms of age; 11a is titled "Histogram of age" (x-axis: age; y-axis: Frequency), 11b is titled "Frequency distribution by age group" (x-axis: Age group; y-axis: No of patients)]

Figure 11a. The vertical axis is the number of patients (study subjects) and the
horizontal axis is age. For example, there are 6 patients aged 40 to 45 and 4 patients
aged 70 to 80.
Figure 11b. The plot title and the titles of the vertical and horizontal axes are added
with main, xlab, and ylab.


We can also turn the plot into a probability density plot with plot(density()), as
follows (the result is shown in Figure 12a):

> plot(density(age))

[Figures 12a and 12b: 12a is a density curve titled "density.default(x = age)" (N = 50, Bandwidth = 3.806; y-axis: Density); 12b is titled "Histogram of age" (x-axis: age; y-axis: Density)]

Figure 12a. Probability density of the variable age.
Figure 12b. Probability density of the variable age, with multiple interquartiles.

We can also draw the two plots on top of each other; see Figure 12b for the result.
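
The command for overlaying the two plots does not appear in the text. One common way to do it in base R, offered here only as an assumption about what was intended, is to draw the histogram on the density scale with freq = FALSE and then add the kernel density curve with lines():

```r
age <- c(57, 64, 60, 65, 47, 65, 76, 61, 59, 57,
         63, 51, 60, 42, 64, 49, 44, 45, 80, 48,
         61, 45, 70, 51, 63, 54, 57, 70, 47, 60,
         60, 50, 60, 55, 74, 48, 46, 49, 69, 72,
         51, 58, 60, 45, 63, 52, 64, 45, 64, 62)

# freq = FALSE puts the histogram on the density scale, so the bar areas sum to 1.
h <- hist(age, freq = FALSE, main = "Histogram of age", xlab = "age")

# Add the kernel density estimate on top of the histogram.
d <- density(age)
lines(d)
```

Both layers must be on the same scale for the overlay to make sense, which is why the histogram is drawn with densities rather than counts.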

8.6 Box plots

To draw a box plot of the variable tc, we simply type:

> boxplot(tc, main="Box plot of total cholesterol", ylab="mg/L")

[Figure: box plot titled "Box plot of total cholesterol"; y-axis: mg/L, from 3 to 8]

Figure 13. In this plot the median is about 5.6 mg/L; 25% of total cholesterol values
are below 4.1 and 75% are below 6.2. The lowest total cholesterol is about 3 and the
highest is above 8 mg/L.


In the following plot we compare tc between men and women:

> boxplot(tc ~ sex, main="Box plot of total cholesterol by sex",
          ylab="mg/L")

The result is shown in Figure 14a. We can change the look of the plot with the argument
horizontal=TRUE and change the colour with the argument col, as follows (Figure 14b):

> boxplot(tc~sex, horizontal=TRUE, main="Box plot of total cholesterol",
          ylab="mg/L", col = "pink")


[Figures 14a and 14b: box plots of total cholesterol (mg/L) for Nam and Nu; 14a is titled "Box plot of total cholesterol by sex", 14b is horizontal and titled "Box plot of total cholesterol"]

Figure 14a. In this plot the median total cholesterol of women is lower than that of
men, but the spread of the two groups does not differ much.
Figure 14b. Total cholesterol for each sex, with colour and horizontal boxes.


8.7 Plots for two continuous variables

8.7.1 Scatter plots

To examine the relationship between two variables we use a scatter plot. To draw a
scatter plot of the relationship between tc and hdl we use the function plot. The first
argument of plot is the horizontal axis (x-axis) and the second argument is the
vertical axis. To examine the relationship between tc and hdl we simply type:

> plot(tc, hdl)

[Figure: scatter plot; x-axis: tc, y-axis: hdl]

Figure 15. Relationship between tc and hdl. In this plot hdl is on the vertical axis
and tc is on the horizontal axis.

We want to distinguish the sexes (men and women) in the plot above. To do so we use the
function ifelse. In the following command, if sex=="Nam" the point is drawn with
plotting symbol 16 (a filled circle); otherwise it is drawn with symbol 22 (a square):

> plot(hdl, tc, pch=ifelse(sex=="Nam", 16, 22))

The result is Figure 16a. We can also replace the symbols with the letters M (male) and
F (female) (see Figure 16b):

> plot(hdl, tc, pch=ifelse(sex=="Nam", "M", "F"))

[Figures 16a and 16b: scatter plots of tc and hdl, with the sexes marked by two plotting symbols (16a) and by the letters M and F (16b)]

Figure 16a. Relationship between tc and hdl by sex, shown with two plotting symbols.
Figure 16b. Relationship between tc and hdl by sex, shown with two letters.


We can also draw a linear regression line through the points by continuing with the
following commands:

> plot(hdl ~ tc, pch=16, main="Total cholesterol and HDL cholesterol",
       xlab="Total cholesterol", ylab="HDL cholesterol", bty="l")
> reg <- lm(hdl ~ tc)
> abline(reg)

The result is Figure 17a below. We can also use a smoothing function to describe the
relationship between the two variables. The following plot uses lowess (one of the most
common smoothers) to smooth tc and hdl (Figure 17b).

> plot(hdl ~ tc, pch=16,
       main="Total cholesterol and HDL cholesterol with LOWESS smooth function",
       xlab="Total cholesterol", ylab="HDL cholesterol", bty="l")

> lines(lowess(tc, hdl, f=2/3, iter=3), col="red")


[Figures 17a and 17b: scatter plots of HDL cholesterol (y-axis) against total cholesterol (x-axis), 17a with the regression line and 17b with the lowess curve]

Figure 17a. In the commands above, reg <- lm(hdl~tc) fits the equation relating hdl to
tc with a linear model (lm) and stores the result in the object reg. The second
command, abline(reg), asks R to draw the straight line from the equation in reg.
Figure 17b. Instead of abline, we use the function lowess to show the relationship
between tc and hdl.

Readers can experiment with other values such as f=1/2, f=2/5, or even f=1/10 and will
see the curve change in interesting ways.

8.8 Plots for several variables: pairs

We can examine the relationships among variables such as age, bmi, hdl, ldl, and tc
with the function pairs. First, however, we must put these variables into a data.frame
containing only the variables to be plotted, and then call pairs.

> lipid <- data.frame(age,bmi,hdl,ldl,tc)
> pairs(lipid, pch=16)

The result is:

[Figure: scatter-plot matrix (pairs) of age, bmi, hdl, ldl and tc.]


8.9 Plots with standard errors

In the following plot we have 5 groups (the values are simulated, not real
data), and each group has a mean (mean) and a 95% confidence interval (lcl and ucl).
Usually lcl = mean - 1.96*SE and ucl = mean + 1.96*SE (SE is the standard
error). We want to plot the 5 groups with their error bars. The following commands and
functions are needed:

> group <- c(1,2,3,4,5)
> mean <- c(1.1, 2.3, 3.0, 3.9, 5.1)
> lcl <- c(0.9, 1.8, 2.7, 3.8, 5.0)
> ucl <- c(1.3, 2.4, 3.5, 4.1, 5.3)
> plot(group, mean, ylim=range(c(lcl, ucl)))
> arrows(group, ucl, group, lcl, length=0.5, angle=90, code=3)
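
As a rough sketch of where such limits might come from, the following computes group means and 95% limits as mean ± 1.96*SE from raw data and then draws the same kind of plot. The data are simulated, and the names g, y, m and se are illustrative assumptions rather than part of the original example.

```r
set.seed(1)

# Hypothetical simulated data: 5 groups of 20 observations each
g <- rep(1:5, each = 20)
y <- rnorm(100, mean = g, sd = 0.5)

# Mean and standard error (SD / sqrt(n)) within each group
m  <- tapply(y, g, mean)
se <- tapply(y, g, function(v) sd(v) / sqrt(length(v)))

# Approximate 95% confidence limits
lcl <- m - 1.96 * se
ucl <- m + 1.96 * se

# Plot the means and add error bars with flat caps
plot(1:5, m, ylim = range(c(lcl, ucl)), xlab = "group", ylab = "mean")
arrows(1:5, ucl, 1:5, lcl, length = 0.05, angle = 90, code = 3)
```

The mean is stored under the name m rather than mean here so that the built-in mean function is not shadowed.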

[Figure: group means (y-axis: mean) plotted against group (x-axis) with error bars from lcl to ucl.]



9. Descriptive statistical analysis

9.1 Descriptive statistics (summary)

To illustrate the use of R for descriptive statistics, I will use a research data set
named igfdata. In this study, besides variables on
sex, age, weight and height, we measured hormones related
to growth, namely igfi, igfbp3 and als, and markers of
bone turnover, namely pinp, ictp and p3np. There are 100 study subjects. The
data are stored in the directory c:\works\stats. First, we need to
read the data into R with the following commands (the text after the # sign is
commentary for the reader to follow):

> options(width=100)
# change the working directory
> setwd("c:/works/stats")

# read the data into R
> igfdata <- read.table("igf.txt", header=TRUE, na.strings=".")
> attach(igfdata)

# list the column names in the data
> names(igfdata)
[1] "id" "sex" "age" "weight" "height" "ethnicity"
[7] "igfi" "igfbp3" "als" "pinp" "ictp" "p3np"

> igfdata
id sex age weight height ethnicity igfi igfbp3 als pinp ictp p3np
1 1 Female 15 42 162 Asian 189.000 4.00000 323.667 353.970 11.2867 8.3367
2 2 Male 16 44 160 Caucasian 160.000 3.75000 333.750 375.885 10.4300 6.7450
3 3 Female 15 43 157 Asian 146.833 3.43333 248.333 199.507 8.3633 12.5000
4 4 Female 15 42 155 Asian 185.500 3.40000 251.000 483.607 13.3300 14.2767
5 5 Female 16 47 167 Asian 192.333 4.23333 322.000 105.430 7.9233 4.5033
6 6 Female 25 45 160 Asian 110.000 3.50000 284.667 76.487 4.9833 4.9367
7 7 Female 19 45 161 Asian 157.000 3.20000 274.000 75.880 6.3500 5.3200
8 8 Female 18 43 153 Asian 146.000 3.40000 303.000 86.360 7.3700 4.6700
9 9 Female 15 41 149 Asian 197.667 3.56667 308.500 254.803 11.8700 6.8200
10 10 Female 24 45 157 African 148.000 3.40000 273.000 44.720 3.7400 6.1600
...
...
97 97 Female 17 54 168 Caucasian 204.667 4.96667 441.333 64.130 5.1600 4.4367
98 98 Male 18 55 169 Asian 178.667 3.86667 273.000 185.913 7.5267 8.8333
99 99 Female 18 48 151 Asian 237.000 3.46667 324.333 105.127 5.9867 5.6600
100 100 Male 15 54 168 Asian 130.000 2.70000 259.333 325.840 10.2767 6.5933


The listing above shows only part of the data for the 100 subjects.

For a variable $x_1, x_2, x_3, \ldots, x_n$ we can compute several descriptive statistics
as follows:

Theory                                                                R function
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$                       mean(x)
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$       var(x)
Standard deviation: $s = \sqrt{s^2}$                                  sd(x)
Standard error: $SE = \frac{s}{\sqrt{n}}$                             None
Minimum                                                               min(x)
Maximum                                                               max(x)
Range                                                                 range(x)
Example 9: To find the mean age, we simply enter:

> mean(age)
[1] 19.17

Or the variance and standard deviation of age:

> var(age)
[1] 15.33444

> sd(age)
[1] 3.915922

However, R has the summary command, which gives the key descriptive statistics for a
variable at once:

> summary(age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.00 16.00 19.00 19.17 21.25 34.00

In general, this output is simple and the abbreviations are easy to understand. Note that
the output contains two statistics, 1st Qu. and 3rd Qu., meaning the first quartile (the
25% position) and the third quartile (the 75% position) of a variable.
First quartile = 16 means that 25% of the study subjects are aged 16 or
younger. Likewise, third quartile = 21.25 means that 75% of the subjects are aged
21.25 or younger. And the median of 19 means that 50% of the subjects are
aged 19 or younger (or 19 or older).
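
To make the meaning of the quartiles concrete, here is a minimal sketch using the quantile function. The vector x is a small hypothetical set of ages invented for illustration; it is not the igfdata sample.

```r
# Hypothetical ages, chosen only for illustration
x <- c(13, 15, 16, 16, 18, 19, 19, 20, 21, 22, 25, 34)

# First quartile, median and third quartile
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(q)

# Share of observations at or below each quartile
print(sapply(q, function(cut) mean(x <= cut)))
```

The shares printed by the last line should be at least 25%, 50% and 75% respectively, which is what the quartiles are meant to express.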

R has no built-in function for the standard error, and summary does not report the
standard deviation either. To get these numbers we can write a simple function (call it
desc) as follows:

desc <- function(x)
{
av <- mean(x)
sd <- sd(x)
se <- sd/sqrt(length(x))
c(MEAN=av, SD=sd, SE=se)
}

We can then call this function on any variable we like, for example als:

> desc(als)
MEAN SD SE
301.841120 58.987189 5.898719


To get an overview of the igfdata data set, we simply apply
summary as follows:

> summary(igfdata)
id sex age weight height ethnicity
Min. : 1.00 Female:69 Min. :13.00 Min. :41.00 Min. :149.0 African : 8
1st Qu.: 25.75 Male :31 1st Qu.:16.00 1st Qu.:47.00 1st Qu.:157.0 Asian :60
Median : 50.50 Median :19.00 Median :50.00 Median :162.0 Caucasian:30
Mean : 50.50 Mean :19.17 Mean :49.91 Mean :163.1 Others : 2
3rd Qu.: 75.25 3rd Qu.:21.25 3rd Qu.:53.00 3rd Qu.:168.0
Max. :100.00 Max. :34.00 Max. :60.00 Max. :196.0

igfi igfbp3 als pinp ictp
Min. : 85.71 Min. :2.000 Min. :192.7 Min. : 26.74 Min. : 2.697
1st Qu.:137.17 1st Qu.:3.292 1st Qu.:256.8 1st Qu.: 68.10 1st Qu.: 4.878
Median :161.50 Median :3.550 Median :292.5 Median :103.26 Median : 6.338
Mean :165.59 Mean :3.617 Mean :301.8 Mean :167.17 Mean : 7.420
3rd Qu.:186.46 3rd Qu.:3.875 3rd Qu.:331.2 3rd Qu.:196.45 3rd Qu.: 8.423
Max. :427.00 Max. :5.233 Max. :471.7 Max. :742.68 Max. :21.237

p3np
Min. : 2.343
1st Qu.: 4.433
Median : 5.445
Mean : 6.341
3rd Qu.: 7.150
Max. :16.303


R summarizes every variable that can be summarized! As a result, R even summarizes the id
column (the subject identifier), although we know that statistics on
id have no meaning. For categorical variables such as sex and
ethnicity, R reports only the frequency of each group.

The results above are for all study subjects. If we want separate results for
males and females, the by function in R is very useful. In the following command
we ask R to summarize igfdata by sex.

> by(igfdata, sex, summary)

sex: Female
id sex age weight height
Min. : 1.0 Female:69 Min. :13.00 Min. :41.00 Min. :149.0
1st Qu.:21.0 Male : 0 1st Qu.:17.00 1st Qu.:47.00 1st Qu.:156.0
Median :47.0 Median :19.00 Median :50.00 Median :162.0
Mean :48.2 Mean :19.59 Mean :49.35 Mean :161.9
3rd Qu.:75.0 3rd Qu.:22.00 3rd Qu.:52.00 3rd Qu.:166.0
Max. :99.0 Max. :34.00 Max. :60.00 Max. :196.0
ethnicity igfi igfbp3 als
African : 4 Min. : 85.71 Min. :2.767 Min. :204.3
Asian :43 1st Qu.:136.67 1st Qu.:3.333 1st Qu.:263.8
Caucasian:22 Median :163.33 Median :3.567 Median :302.7
Others : 0 Mean :167.97 Mean :3.695 Mean :311.5
3rd Qu.:186.17 3rd Qu.:3.933 3rd Qu.:361.7
Max. :427.00 Max. :5.233 Max. :471.7
pinp ictp p3np
Min. : 26.74 Min. : 2.697 Min. : 2.343
1st Qu.: 62.75 1st Qu.: 4.717 1st Qu.: 4.337
Median : 78.50 Median : 5.537 Median : 5.143
Mean :108.74 Mean : 6.183 Mean : 5.643
3rd Qu.:115.26 3rd Qu.: 7.320 3rd Qu.: 6.143
Max. :502.05 Max. :13.633 Max. :14.420
------------------------------------------------------------
sex: Male
id sex age weight height
Min. : 2.00 Female: 0 Min. :14.00 Min. :44.00 Min. :155.0
1st Qu.: 34.50 Male :31 1st Qu.:15.00 1st Qu.:48.50 1st Qu.:161.5
Median : 56.00 Median :17.00 Median :51.00 Median :164.0
Mean : 55.61 Mean :18.23 Mean :51.16 Mean :165.6
3rd Qu.: 75.00 3rd Qu.:20.00 3rd Qu.:53.50 3rd Qu.:169.0
Max. :100.00 Max. :27.00 Max. :59.00 Max. :191.0
ethnicity igfi igfbp3 als
African : 4 Min. : 94.67 Min. :2.000 Min. :192.7
Asian :17 1st Qu.:138.67 1st Qu.:3.183 1st Qu.:249.8
Caucasian: 8 Median :160.00 Median :3.500 Median :276.0
Others : 2 Mean :160.29 Mean :3.443 Mean :280.2
3rd Qu.:183.00 3rd Qu.:3.775 3rd Qu.:311.3
Max. :274.00 Max. :4.500 Max. :388.7
pinp ictp p3np
Min. : 56.28 Min. : 3.650 Min. : 3.390
1st Qu.:135.07 1st Qu.: 6.900 1st Qu.: 5.375
Median :245.92 Median : 9.513 Median : 7.140
Mean :297.21 Mean :10.173 Mean : 7.895
3rd Qu.:450.38 3rd Qu.:13.517 3rd Qu.:10.010
Max. :742.68 Max. :21.237 Max. :16.303


To look at the distributions of the hormones and biochemical markers at the same time, we
can plot all 6 variables. First, divide the screen into 6 panels (2 rows
by 3 columns); then draw each plot in turn:

> op <- par(mfrow=c(2,3))
> hist(igfi)
> hist(igfbp3)
> hist(als)
> hist(pinp)
> hist(ictp)
> hist(p3np)

[Figure: six histograms in a 2 x 3 layout, titled Histogram of igfi, igfbp3, als, pinp, ictp and p3np (y-axis: Frequency).]



9.2 Descriptive statistics by group

If we want the mean of a variable such as igfi for males and females
separately, the tapply function in R can do this:

> tapply(igfi, list(sex), mean)
Female Male
167.9741 160.2903

In the command above, igfi is the variable to be summarized, sex is the grouping variable,
and the statistic we want is the mean. The output shows that the mean
igfi for females (167.97) is higher than for males (160.29).

If we want the means by both sex and ethnicity, we only need to add one more
variable to the list function:

> tapply(igfi, list(ethnicity, sex), mean)
Female Male
African 145.1252 120.9168
Asian 165.6589 160.4999
Caucasian 176.6536 169.4790
Others NA 200.5000

In the output above, NA means "not available": there are no data for females in
the Others ethnic group.

9.3 The t test (t.test)


The t test rests on the assumption of a normal distribution. There are two kinds of t test:
the one-sample t-test and the two-sample t-test.
The one-sample t test asks whether the mean of a sample really equals
some specified value. The two-sample t test asks whether two samples
come from the same distribution or, more specifically, whether the two samples really have
the same mean. I will illustrate both tests in turn with the igfdata data.

9.3.1 One-sample t test

Example 10. From the analysis above, the mean age of the 100 subjects
in this study is 19.17 years. Suppose we knew beforehand that the mean age in this
population is 30 years. The question is whether our sample is representative
of the population. In other words, we want to know whether the mean of
19.17 really differs from the mean of 30.

To answer this question we use the t test. In statistical theory,
the t statistic is defined by the following formula:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

Here $\bar{x}$ is the sample mean, $\mu$ is the hypothesized mean (30 in this
case), $s$ is the standard deviation, and $n$ is the sample size (100). If the absolute
value of t exceeds the theoretical value of the t distribution at a chosen significance
level, such as 5%, we have grounds to declare the difference statistically significant.
For a sample of 100 this critical value can be computed with the qt function in R as
follows (strictly, the t distribution here has n - 1 = 99 degrees of freedom, which gives
almost the same value):

> qt(0.95, 100)
[1] 1.660234

But there is a quicker way to answer the question above, using the
t.test function as follows:

> t.test(age, mu=30)

One Sample t-test

data: age
t = -27.6563, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
18.39300 19.94700
sample estimates:
mean of x
19.17

In the command above, age is the variable being tested and mu=30 is the hypothesized value.
R reports t = -27.66 with 99 degrees of freedom and p < 2.2e-16 (that is, very small). R
also reports that the 95% confidence interval for the mean age is 18.4 to 19.9 years (30
years lies far outside this interval). In other words, we have grounds to state that the
mean age in this sample is truly lower than the population mean age.
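
The formula for t can be checked against t.test by hand. The sketch below uses a simulated vector x in place of age (the igf.txt file is not reproduced here), so the numbers will differ from the output above; the names t_manual and p_manual are illustrative.

```r
set.seed(42)

# Hypothetical sample standing in for age (simulated, not the study data)
x  <- rnorm(100, mean = 19, sd = 4)
mu <- 30                      # hypothesized population mean
n  <- length(x)

# t statistic from the formula: (sample mean - mu) / (s / sqrt(n))
t_manual <- (mean(x) - mu) / (sd(x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

res <- t.test(x, mu = mu)
print(c(manual = t_manual, t.test = unname(res$statistic)))
```

Both values should agree, which suggests that t.test is doing nothing more than applying the formula above and looking up the p-value.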

9.3.2 Two-sample t test

Example 11. In the descriptive analysis above (the summary function) we saw that females
have higher igfi than males (167.97 versus 160.29). The question is whether this is
a systematic difference or one produced by chance. To answer this question,
we need to consider the mean difference between the two groups and the standard error of
that difference:

$$t = \frac{\bar{x}_2 - \bar{x}_1}{SED}$$

Here $\bar{x}_1$ and $\bar{x}_2$ are the means of the male and female groups, and SED is
the standard error of the difference $(\bar{x}_2 - \bar{x}_1)$. In fact, SED can be
estimated by the formula:

$$SED = \sqrt{SE_1^2 + SE_2^2}$$

where $SE_1$ and $SE_2$ are the standard errors of the male and female groups. According
to probability theory, t follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom,
where $n_1$ and $n_2$ are the sample sizes of the two groups. We can use R to answer the
question above with the t.test function
as follows:

> t.test(igfi~ sex)

Welch Two Sample t-test

data: igfi by sex
t = 0.8412, df = 88.329, p-value = 0.4025
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.46855 25.83627
sample estimates:
mean in group Female mean in group Male
167.9741 160.2903


R presents the key values first:

t = 0.8412, df = 88.329, p-value = 0.4025

df is the degrees of freedom. The p-value of 0.4025 shows that the difference between
males and females is not statistically significant (because it is greater than 0.05, or 5%).

95 percent confidence interval:
-10.46855 25.83627

is the 95% confidence interval for the difference between the two groups. The result
indicates that igfi in females could be as much as 10.5 ng/L lower than in males or about
25.8 ng/L higher. Because this interval is wide and includes 0, it is further evidence
that there is no statistically significant difference between the two groups.

The test above assumes that the male and female groups have different variances. If
we have reason to believe that the two groups have the same variance, we change just one
argument of t.test, var.equal=TRUE, as follows:

> t.test(igfi~ sex, var.equal=TRUE)

Two Sample t-test

data: igfi by sex
t = 0.7071, df = 98, p-value = 0.4812
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.88137 29.24909
sample estimates:
mean in group Female mean in group Male
167.9741 160.2903


Numerically, these results differ slightly from those of the analysis that assumes
unequal variances, but the p-value leads to the same conclusion: the difference
between the two groups is not statistically significant.
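
The two formulas for t and SED can be reproduced by hand and compared with the default (Welch) t.test. This is a sketch on simulated data: the vectors a and b are hypothetical stand-ins for the female and male igfi values, and the fractional degrees of freedom use the Welch-Satterthwaite approximation, which is my reading of why R reports a non-integer df above.

```r
set.seed(7)

# Hypothetical groups (simulated, not the igfdata values)
a <- rnorm(69, mean = 168, sd = 50)
b <- rnorm(31, mean = 160, sd = 40)

# Standard error of each group mean
se1 <- sd(a) / sqrt(length(a))
se2 <- sd(b) / sqrt(length(b))

# Standard error of the difference and the t statistic
sed      <- sqrt(se1^2 + se2^2)
t_manual <- (mean(a) - mean(b)) / sed

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- sed^4 / (se1^4 / (length(a) - 1) + se2^4 / (length(b) - 1))

res <- t.test(a, b)   # var.equal = FALSE by default
print(c(t = t_manual, df = df_welch))
```

Note that R computes the difference as first group minus second group, so the sign of t depends on the order of the groups.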

9.4 Wilcoxon test for two samples (wilcox.test)

The t test assumes that the variable follows a normal
distribution. If this assumption does not hold, the results of the t test may not be valid.
To test the distribution of igfi, we can use the shapiro.test function
as follows:

> shapiro.test(igfi)

Shapiro-Wilk normality test

data: igfi
W = 0.8528, p-value = 1.504e-08

The p-value is far smaller than 0.05, so we can say that the distribution of igfi
does not follow a normal distribution. In this case, the comparison between the two
groups can be based on a non-parametric method called the
Wilcoxon test, because this test (unlike the t test) does not depend on the assumption of
normality.

> wilcox.test(igfi ~ sex)

Wilcoxon rank sum test with continuity correction

data: igfi by sex
W = 1125, p-value = 0.6819
alternative hypothesis: true mu is not equal to 0

The p-value of 0.682 shows that the difference in igfi between males and females is indeed
not statistically significant. This conclusion does not differ from that of the t
test.


9.5 t test for paired data (paired t-test, t.test)

The t test just described applies to studies with two independent groups
(such as males and females), but it cannot be applied to studies in which one
group of subjects is followed over time. I will call these
paired studies. In such studies we need a t test known as the
paired t-test.

Example 12. A group of 10 patients was treated with a drug
intended to lower blood pressure. Blood pressure was measured at the start of the study
(before treatment) and after treatment. The blood pressure data for the 10 patients are as
follows:

Before treatment (x_0):  180, 140, 160, 160, 220, 185, 145, 160, 160, 170
After treatment (x_1):   170, 145, 145, 125, 205, 185, 150, 150, 145, 155

The question is whether the change in blood pressure supports the conclusion that the drug
is effective in lowering blood pressure. To answer this question we use the paired t test
as follows:

> # enter the data
> before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
> after <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)
> bp <- data.frame(before, after)

> # paired t test
> t.test(before, after, paired=TRUE)

Paired t-test

data: before and after
t = 2.7924, df = 9, p-value = 0.02097
alternative hypothesis: true difference in means is not equal to
0
95 percent confidence interval:
1.993901 19.006099
sample estimates:
mean of the differences
10.5

The output shows that after treatment blood pressure fell by 10.5 mmHg, with a 95%
confidence interval of 2.0 mmHg to 19.0 mmHg and a p-value of 0.021. We therefore have
evidence that the reduction in blood pressure is statistically significant.

Note that if we wrongly analyze the data with the test for two independent groups, shown
below, the p-value of 0.32 suggests that the reduction in blood pressure is not statistically significant!

> t.test(before, after)

Welch Two Sample t-test

data: before and after
t = 1.0208, df = 17.998, p-value = 0.3209
alternative hypothesis: true difference in means is not equal to
0
95 percent confidence interval:
-11.11065 32.11065
sample estimates:
mean of x mean of y
168.0 157.5
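
One way to see what paired=TRUE does: a paired t test is the same as a one-sample t test on the within-subject differences. A small sketch using the blood pressure data from Example 12 (the name d is introduced here for illustration):

```r
before <- c(180, 140, 160, 160, 220, 185, 145, 160, 160, 170)
after  <- c(170, 145, 145, 125, 205, 185, 150, 150, 145, 155)

# Within-patient change in blood pressure
d <- before - after

# One-sample t test on the differences against a mean of 0
one_sample <- t.test(d, mu = 0)

# Paired t test on the two original vectors
paired <- t.test(before, after, paired = TRUE)

print(mean(d))
print(c(one_sample = unname(one_sample$statistic),
        paired     = unname(paired$statistic)))
```

Because each patient serves as his or her own control, the between-patient variation drops out of the differences, which is why the paired analysis detects the reduction while the independent-groups analysis does not.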


9.6 Wilcoxon test for paired data (wilcox.test)

Instead of the paired t test, we can also use the
wilcox.test function for the same purpose:

> wilcox.test(before, after, paired=TRUE)

Wilcoxon signed rank test with continuity correction

data: before and after
V = 42, p-value = 0.02291
alternative hypothesis: true mu is not equal to 0

This result confirms once more that the reduction in blood pressure is statistically
significant, with a p-value (p = 0.023) hardly different from that of the paired t test.


9.7 Frequencies

The table function in R reports the frequencies of a categorical variable
such as sex and ethnicity.

> table(sex)
sex
Female Male
69 31

> table(ethnicity)
ethnicity
African Asian Caucasian Others
8 60 30 2

A two-way table:


> table(sex, ethnicity)
ethnicity
sex African Asian Caucasian Others
Female 4 43 22 0
Male 4 17 8 2


Note that in the tables above, table does not give percentages.
To compute percentages we need the prop.table function, whose use can be
illustrated as follows:

# create an object named freq holding the frequency table
> freq <- table(sex, ethnicity)

# check the result
> freq
ethnicity
sex African Asian Caucasian Others
Female 4 43 22 0
Male 4 17 8 2

# use margin.table to see the row and column totals
> margin.table(freq, 1)
sex
Female Male
69 31

> margin.table(freq, 2)
ethnicity
African Asian Caucasian Others
8 60 30 2

# compute row proportions with prop.table
> prop.table(freq, 1)
ethnicity
sex African Asian Caucasian Others
Female 0.05797101 0.62318841 0.31884058 0.00000000
Male 0.12903226 0.54838710 0.25806452 0.06451613

In the table above, prop.table computes the ethnic proportions within each sex. For
example, among females, 5.8% are African, 62.3% are Asian, and 31.9% are
Caucasian, for a total of 100%. Likewise, among males the proportion of Africans is
12.9%, Asians 54.8%, and so on.

# compute column proportions with prop.table
> prop.table(freq, 2)
ethnicity
sex African Asian Caucasian Others
Female 0.5000000 0.7166667 0.7333333 0.0000000
Male 0.5000000 0.2833333 0.2666667 1.0000000

In the table above, prop.table computes the sex proportions within each ethnic group. For
example, among Asians, 71.7% are female and 28.3% are male.

# compute proportions of the whole table
> freq/sum(freq)
ethnicity
sex African Asian Caucasian Others
Female 0.04 0.43 0.22 0.00
Male 0.04 0.17 0.08 0.02


9.8 Test of a proportion (prop.test, binom.test)

A test of a single proportion is usually based on the binomial distribution.
For a sample of size n and proportion p, if n is large (say, more than 50), the binomial
distribution can be approximated by a normal distribution with mean np and variance np(1 - p).
Let x be the number of events of interest. The hypothesis $p = \pi$ can be tested with the
following statistic:

$$z = \frac{x - n\pi}{\sqrt{n\pi(1 - \pi)}}$$

Here z follows the normal distribution with mean 0 and variance 1. Equivalently,
$z^2$ follows the Chi-squared distribution with 1 degree of freedom.

Example 13. In the study above there are 69 females and 31 males, so the
proportion of females is 0.69 (or 69%). To test whether this proportion really differs
from 0.5, we can use the function prop.test(x, n, π) as follows:

> prop.test(69, 100, 0.50)

1-sample proportions test with continuity correction

data: 69 out of 100, null probability 0.5
X-squared = 13.69, df = 1, p-value = 0.0002156
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5885509 0.7766330
sample estimates:
p
0.69

In the output above, prop.test estimates the proportion of females as 0.69, with a 95%
confidence interval of 0.589 to 0.777. The Chi-squared value is 13.69, with p = 0.0002.
Thus the proportion of females in this study is higher than 50%.
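
The Chi-squared value can be reproduced by hand from the z statistic. The sketch below assumes that prop.test applies a continuity correction of 0.5 to |x - nπ| before squaring, which is how I read the "with continuity correction" line in the output; the names chi2 and pval are illustrative.

```r
x  <- 69     # number of females
n  <- 100    # sample size
p0 <- 0.5    # hypothesized proportion

# Normal-approximation statistic with a continuity correction of 0.5
chi2 <- (abs(x - n * p0) - 0.5)^2 / (n * p0 * (1 - p0))

# Upper-tail probability from the Chi-squared distribution with 1 df
pval <- pchisq(chi2, df = 1, lower.tail = FALSE)

res <- prop.test(x, n, p0)
print(c(chi2 = chi2, p = pval))
print(res$statistic)
```

Without the correction the statistic would be (69 - 50)^2 / 25 = 14.44, so the correction makes the test slightly more conservative.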

A more exact way to test a proportion is the binomial test, binom.test(x,
n, π), as follows:

> binom.test(69, 100, 0.50)

Exact binomial test

data: 69 and 100
number of successes = 69, number of trials = 100, p-value = 0.0001831
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5896854 0.7787112
sample estimates:
probability of success
0.69

In general, the result of the binomial test differs little from that of the Chi-squared
test. With p = 0.00018, we again have evidence to conclude that the proportion of females
in this study is truly higher than 50%.


9.9 Comparing two proportions (prop.test, binom.test)

The method for comparing two proportions follows directly from the theory for testing a
single proportion described above. Consider two samples with $n_1$ and $n_2$ subjects and
$x_1$ and $x_2$ events. From these we can estimate the two proportions $p_1$ and $p_2$.
Probability theory lets us state that, when the two true proportions are equal, the
difference $d = p_1 - p_2$ follows a normal distribution with mean 0 and variance:

$$V_d = \left( \frac{1}{n_1} + \frac{1}{n_2} \right) p (1 - p)$$

where:

$$p = \frac{x_1 + x_2}{n_1 + n_2}$$

Hence $z = d / \sqrt{V_d}$ follows the normal distribution with mean 0 and variance 1. In
other words, $z^2$ follows the Chi-squared distribution with 1 degree of freedom. We can
therefore also use prop.test to test two proportions.

Example 14. A study was conducted to compare the efficacy of an anti-fracture
drug. Patients were divided into two groups: group A, which received treatment, with 100
patients, and group B, which received no treatment, with 110 patients. After 12 months of
follow-up, 7 patients in group A and 20 patients in group B had sustained a fracture. The
question is whether the fracture rates in the two groups are equal (that is, the drug has
no effect). To test whether the two proportions really differ, we can use the function
prop.test(x, n) as follows:

> fracture <- c(7, 20)
> total <- c(100, 110)
> prop.test(fracture, total)

2-sample test for equality of proportions with continuity
correction

data: fracture out of total
X-squared = 4.8901, df = 1, p-value = 0.02701
alternative hypothesis: two.sided
95 percent confidence interval:
-0.20908963 -0.01454673
sample estimates:
prop 1 prop 2
0.0700000 0.1818182

The analysis shows that the fracture rate is 0.07 in group 1 and 0.18 in group 2.
It also gives a 95% confidence interval for the difference between the two groups of about
0.01 to 0.21 (that is, 1 to 21 percentage points). With p = 0.027, we can say that the
fracture rate in group A is indeed lower than in group B.
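
The pooled-variance formulas for comparing two proportions can be checked directly. In this sketch z is computed by hand from the fracture counts of Example 14, and z^2 is compared with prop.test run without the continuity correction (correct=FALSE), since the hand formula has no correction. The names p_hat, p_pool, v_d and z are illustrative.

```r
fracture <- c(7, 20)      # fractures in groups A and B
total    <- c(100, 110)   # patients in groups A and B

# Observed proportions and their difference
p_hat <- fracture / total
d     <- p_hat[1] - p_hat[2]

# Pooled proportion and the variance of the difference under equality
p_pool <- sum(fracture) / sum(total)
v_d    <- (1 / total[1] + 1 / total[2]) * p_pool * (1 - p_pool)

# Standardized difference
z <- d / sqrt(v_d)

res <- prop.test(fracture, total, correct = FALSE)
print(c(z = z, z_squared = z^2, X_squared = unname(res$statistic)))
```

The uncorrected statistic is somewhat larger than the 4.8901 reported above, because the default output includes the continuity correction.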


9.10 Comparing several proportions (prop.test, chisq.test)

prop.test can also be used to test several proportions at once.
In the study above we have 4 ethnic groups, with the following frequencies by sex:

> table(sex, ethnicity)
ethnicity
sex African Asian Caucasian Others
Female 4 43 22 0
Male 4 17 8 2

We want to know whether the proportion of females differs among the 4 ethnic groups. To
answer this question, we again use prop.test as follows:

> female <- c( 4, 43, 22, 0)
> total <- c(8, 60, 30, 2)
> prop.test(female, total)

4-sample test for equality of proportions without continuity
correction

data: female out of total
X-squared = 6.2646, df = 3, p-value = 0.09942
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.5000000 0.7166667 0.7333333 0.0000000

Warning message:
Chi-squared approximation may be incorrect in: prop.test(female, total)


Although the proportions of females appear to differ considerably among the groups (73% in
group 3 (Caucasian) versus 50% in group 1 (African) and 71.7% in the Asian group), the
Chi-squared test indicates that, statistically, these proportions do not differ, because
p = 0.099.


9.10.1 Chi-squared test (chisq.test)

In fact, the Chi-squared test can also be computed with the chisq.test function as
follows:

> chisq.test(sex, ethnicity)

Pearson's Chi-squared test

data: sex and ethnicity
X-squared = 6.2646, df = 3, p-value = 0.09942

Warning message:
Chi-squared approximation may be incorrect in: chisq.test(sex,
ethnicity)


This result is identical to the result from prop.test.


9.10.2 Fisher's exact test (fisher.test)

In the Chi-squared test above, note the warning:

Warning message:
Chi-squared approximation may be incorrect in: prop.test(female, total)

The reason is that group 4 contains no females, so its proportion is 0%. Moreover, this
group has only 2 subjects. Because the number of subjects is so small, the statistical
estimates may be unreliable. Another method suited to studies with low frequencies
like this one is Fisher's exact test. Readers can consult
the theory behind Fisher's test to better understand its logic; here
we are concerned only with how to compute the test in R. We
simply enter:

> fisher.test(sex, ethnicity)

Fisher's Exact Test for Count Data

data: sex and ethnicity
p-value = 0.1048
alternative hypothesis: two.sided

Note that the p-value from Fisher's test is 0.1048, very close to the p-value of the
Chi-squared test. We therefore have further evidence that the proportion of females
does not differ significantly among the ethnic groups.


10. Linear regression analysis

Example 15. To illustrate the problem, consider the following study, in
which the investigators measured blood cholesterol in 18 male subjects.
Body mass index (BMI) was also calculated for each subject as
weight (in kg) divided by height squared (m^2). The measurements
are as follows:

Age, body mass index and cholesterol

ID (id)   Age (age)   BMI (bmi)   Cholesterol (chol)
1         46          25.4        3.5
2         20          20.6        1.9
3         52          26.2        4.0
4         30          22.6        2.6
5         57          25.4        4.5
6         25          23.1        3.0
7         28          22.7        2.9
8         36          24.9        3.8
9         22          19.8        2.1
10        43          25.3        3.8
11        57          23.2        4.1
12        33          21.8        3.0
13        22          20.9        2.5
14        63          26.7        4.6
15        40          26.4        3.2
16        48          21.2        4.2
17        28          21.2        2.3
18        49          22.8        4.0

A quick look at the data suggests that the older the person, the higher the
cholesterol. Let us enter these data into R and draw a scatter plot as follows:

> age <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)

> bmi <-c(25.4,20.6,26.2,22.6,25.4,23.1,22.7,24.9,19.8,25.3,23.2,
21.8,20.9,26.7,26.4,21.2,21.2,22.8)

> chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
2.5,4.6,3.2, 4.2,2.3,4.0)

> data <- data.frame(age, bmi, chol)
> plot(chol ~ age, pch=16)

[Figure 18. Relationship between age (x-axis: age) and cholesterol (y-axis: chol).]

Figure 18 above suggests that the relationship between age and cholesterol is a
straight line (linear). To measure this relationship we can use the
coefficient of correlation.

10.1 Correlation coefficients

The correlation coefficient (r) is a statistic that measures the association between
two variables, such as age (x) and cholesterol (y). It takes values from -1 to
1. A correlation coefficient of 0 (or close to 0) means that the two variables have no
linear relationship; conversely, a coefficient of -1 or 1 means that the two variables
have a perfect relationship. A negative coefficient (r < 0) means that as x increases y
decreases (and vice versa, as x decreases y increases); a positive coefficient (r > 0)
means that as x increases y also increases, and as x decreases y also decreases.

There are in fact many correlation coefficients in statistics, but here I will present the
3 most commonly used: Pearson's r, Spearman's ρ, and Kendall's
τ.

10.1.1 Pearson correlation coefficient

For two variables x and y from a sample of n, the Pearson correlation coefficient is
estimated by the following formula:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Here, as defined earlier, $\bar{x}$ and $\bar{y}$ are the means of x and y. To estimate
the correlation between age (age) and cholesterol (chol), we can use the function
cor(x, y) as follows:

> cor(age, chol)
[1] 0.936726

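We can test the hypothesis that the correlation coefficient equals 0 (that is, x and y
are unrelated). R has a ready-made function, cor.test, for this. For the Pearson
coefficient it reports a t statistic with n - 2 degrees of freedom and uses Fisher's
transformation for the confidence interval.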

> cor.test(age, chol)

Pearson's product-moment correlation

data: age and chol
t = 10.7035, df = 16, p-value = 1.058e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8350463 0.9765306
sample estimates:
cor
0.936726
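
As a check on the formula, the Pearson coefficient can be computed term by term and compared with cor. The sketch below uses the age and chol data of Example 15; the names sxy, sxx, syy, r_manual and t_manual are illustrative. It also derives the t statistic from r as r*sqrt(n-2)/sqrt(1-r^2), which should match the t value in the cor.test output.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)
n <- length(age)

# Sums of cross-products and squares about the means
sxy <- sum((age - mean(age)) * (chol - mean(chol)))
sxx <- sum((age - mean(age))^2)
syy <- sum((chol - mean(chol))^2)

# Pearson correlation from the formula
r_manual <- sxy / sqrt(sxx * syy)

# t statistic for testing that the correlation is 0
t_manual <- r_manual * sqrt(n - 2) / sqrt(1 - r_manual^2)

print(c(r = r_manual, t = t_manual))
```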

10.1.2 Spearman correlation coefficient

The test for the Pearson coefficient is valid only if x and y follow a normal
distribution. If x and y do not follow a normal distribution, we should use another
correlation coefficient, Spearman's, a non-parametric method. It
is estimated by converting the two variables x and y to ranks and computing the
correlation between the two sets of ranks. Hence its English name, Spearman's rank
correlation. R estimates the Spearman coefficient with the cor.test function and the
argument method="spearman" as follows:

> cor.test(age, chol, method="spearman")

Spearman's rank correlation rho

data: age and chol
S = 51.1584, p-value = 2.57e-09
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.947205

Warning message:
Cannot compute exact p-values with ties in: cor.test.default(age,
chol, method = "spearman")
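
The description "convert to ranks, then correlate" can be verified in two lines. This sketch, again on the Example 15 data, applies the ordinary Pearson correlation to the ranks and compares it with the Spearman coefficient from cor (used here instead of cor.test simply to avoid the ties warning); the name rho_manual is illustrative.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

# Pearson correlation of the ranks (tied values receive average ranks)
rho_manual <- cor(rank(age), rank(chol))

# Spearman correlation computed directly
rho_direct <- cor(age, chol, method = "spearman")

print(c(manual = rho_manual, direct = rho_direct))
```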

10.1.3 Kendall correlation coefficient

The Kendall correlation coefficient (also a non-parametric method) is
estimated by finding the pairs of points (x, y) that are "concordant". Two points are
concordant when the difference along the horizontal axis has the same sign (positive or
negative) as the difference along the vertical axis. If x and y are unrelated, the number
of concordant pairs equals, or is close to, the number of discordant pairs.

Because many pairs must be examined, computing the Kendall coefficient
is fairly demanding in computer time. However, for a data set with fewer than 5000
subjects, a personal computer can do the computation quite easily. R uses cor.test with
the argument method="kendall" to estimate the Kendall coefficient:

> cor.test(age, chol, method="kendall")

Kendall's rank correlation tau

data: age and chol
z = 4.755, p-value = 1.984e-06
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.8333333

Warning message:
Cannot compute exact p-value with ties in: cor.test.default(age,
chol, method = "kendall")


10.2 The simple linear regression model

For ease of description, let the age of individual i be $x_i$ and the cholesterol be
$y_i$, where i = 1, 2, 3, ..., 18. The linear regression model states that:

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

In other words, the equation assumes that an individual's cholesterol equals a constant
$\alpha$ plus a coefficient $\beta$ times age, plus an error $\varepsilon_i$. In this
equation, $\alpha$ is the intercept (the value when $x_i = 0$) and $\beta$ is the slope
(or gradient). $\alpha$ and $\beta$ are parameters (also called regression coefficients),
and $\varepsilon_i$ is a random variable following a normal distribution with mean 0 and
variance $\sigma^2$.

The parameters $\alpha$, $\beta$ and $\sigma^2$ must be estimated from the data. The
method used is the least squares method. As its name suggests, it finds the values of
$\alpha$ and $\beta$ that make $\sum_{i=1}^{n} \left[ y_i - (\alpha + \beta x_i) \right]^2$
as small as possible. After some algebra, it can easily be shown that the estimates of
$\alpha$ and $\beta$ satisfying this condition are:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$

Here $\bar{x}$ and $\bar{y}$ are the means of x and y. Note that I write $\hat{\alpha}$
and $\hat{\beta}$ (with a hat) as a reminder that these are estimates of $\alpha$ and
$\beta$, not $\alpha$ and $\beta$ themselves (we do not know $\alpha$ and $\beta$ exactly;
we can only estimate them). Once we have the estimates $\hat{\alpha}$ and $\hat{\beta}$,
we can estimate the mean cholesterol for each age as follows:

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$

Of course, $\hat{y}_i$ is only the mean for age $x_i$, and the remainder
($y_i - \hat{y}_i$) is called the residual. The variance of the residuals can be estimated
as follows:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}$$

Here $s^2$ is the estimate of $\sigma^2$.

The lm function (short for linear model) in R computes $\hat{\alpha}$ and $\hat{\beta}$,
as well as $s^2$, quickly. We continue the example in R as follows:

> lm(chol ~ age)

Call:
lm(formula = chol ~ age)

Coefficients:
(Intercept) age
1.08922 0.05779


In the command above, chol ~ age describes chol as a function of age. The lm output shows
that $\hat{\alpha}$ = 1.0892 and $\hat{\beta}$ = 0.05779. In other words, with these two
parameters we can estimate cholesterol for any age within the age range of the sample
using the linear equation:

$$\hat{y}_i = 1.08922 + 0.05779 \times \text{age}$$

This equation means that for each additional year of age, cholesterol rises by about 0.058
mmol/L.
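
The least squares formulas for the slope and intercept can be applied by hand and compared with lm. A minimal sketch on the Example 15 data follows; the names beta_hat, alpha_hat and fit are illustrative, and the prediction at age 40 is just an example value inside the sample's age range.

```r
age  <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
          2.5,4.6,3.2,4.2,2.3,4.0)

# Slope: sum of cross-products divided by sum of squares of x
beta_hat <- sum((age - mean(age)) * (chol - mean(chol))) /
            sum((age - mean(age))^2)

# Intercept: mean of y minus slope times mean of x
alpha_hat <- mean(chol) - beta_hat * mean(age)

fit <- lm(chol ~ age)
print(c(alpha = alpha_hat, beta = beta_hat))
print(coef(fit))

# Estimated mean cholesterol at age 40
print(alpha_hat + beta_hat * 40)
```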

In fact, lm provides much more information, but we must store
it in an object. Calling the object reg, the commands are:

> reg <- lm(chol ~ age)
> summary(reg)

Call:
lm(formula = chol ~ age)

Residuals:
Min 1Q Median 3Q Max
-0.40729 -0.24133 -0.04522 0.17939 0.63040

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218 0.221466 4.918 0.000154 ***
age 0.057788 0.005399 10.704 1.06e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08


The second command, summary(reg), asks R to list the results stored in reg. The output has three parts:

(a) Part 1 describes the residuals of the regression model:

Residuals:
Min 1Q Median 3Q Max
-0.40729 -0.24133 -0.04522 0.17939 0.63040

We know that the mean of the residuals must be 0; here the median is -0.045, which is not far from 0. The 25% (1Q) and 75% (3Q) quantiles are also fairly symmetric around the median, showing that the residuals of this equation are reasonably symmetric.

(b) Part 2 presents the estimates $\hat{\alpha}$ and $\hat{\beta}$ together with their standard errors and t statistics. The t value for $\hat{\beta}$ is 10.70 with p = 1.06e-08, showing that $\beta$ is not 0. In other words, we have evidence of a relationship between cholesterol and age, and this relationship is statistically significant.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218 0.221466 4.918 0.000154 ***
age 0.057788 0.005399 10.704 1.06e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(c) Part 3 gives information on the residual variance (residual mean square). R reports the residual standard error, s = 0.3027, so $s^2 = 0.3027^2 \approx 0.0916$. The output also contains an F test, which simply tests whether $\beta$ is truly 0 and so has the same meaning as the t test above. In general, in simple linear regression (one predictor) we do not need to pay attention to the F test.

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08

Part 3 also gives an important quantity, $R^2$, the coefficient of determination. It equals the sum of squared deviations of the fitted values from the mean divided by the sum of squared deviations of the observed values from the mean:

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

In this example $R^2$ is 0.8775, meaning that the linear equation (with age as the only predictor) explains about 88% of the differences in cholesterol between individuals. $R^2$ ranges from 0 to 1 (0 to 100%). The higher $R^2$ is, the closer the relationship between age and cholesterol.

Another coefficient worth mentioning is the adjusted coefficient of determination (which R calls "Adjusted R-squared"). It tells us how much the residual variance is improved by having age in the linear model. In general it differs little from the coefficient of determination, and we need not dwell on it.
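The quantities in the third part of the output are linked by simple formulas. The sketch below uses only numbers copied from the summary(reg) output above; the formula for the adjusted $R^2$ is the standard one, with n observations and p predictors, and the variable names are my own.

```r
# Values copied from the summary(reg) output above
r2    <- 0.8775   # Multiple R-squared
n     <- 18       # 16 residual degrees of freedom + 2 estimated coefficients
p     <- 1        # number of predictors (age)
t_age <- 10.704   # t value for age
s     <- 0.3027   # residual standard error

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R-squared
f_stat <- t_age^2                                # with one predictor, F = t^2
s2     <- s^2                                    # residual variance

print(round(c(adj_r2 = adj_r2, f_stat = f_stat, s2 = s2), 4))
```

Up to rounding of the printed inputs, the results reproduce the Adjusted R-squared (0.8698) and the F-statistic (114.6) shown in the output.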

Assumptions of linear regression analysis

All the analyses above rest on several important assumptions:

(a) x is a fixed variable ("fixed" here means that it is measured without random error);

(b) $\varepsilon_i$ follows a normal distribution;

(c) $\varepsilon_i$ has mean 0;

(d) $\varepsilon_i$ has a constant variance $\sigma^2$ for all $x_i$; and

(e) successive values of $\varepsilon_i$ are uncorrelated (in other words, $\varepsilon_1$ and $\varepsilon_2$ are not related to each other).

If these assumptions are not met, the validity of the estimated equation is in question. Therefore, before presenting and interpreting the model, we need to check whether the assumptions hold. In this case assumption (a) is not an issue, because age is not a random variable and there is no error in determining a person's age.

For assumptions (b) to (e), the simplest yet most effective check is to examine the relationships among $\hat{y}_i$, $x_i$ and the residuals $e_i$ ($e_i = y_i - \hat{y}_i$) using scatter plots.

With fitted() we can compute $\hat{y}_i$ for each individual (for example, for individual 1, aged 46, the predicted cholesterol is 1.08922 + 0.05779 × 46 = 3.747):

> fitted(reg)
1 2 3 4 5 6 7 8
3.747483 2.244985 4.094214 2.822869 4.383156 2.533927 2.707292 3.169600
9 10 11 12 13 14 15 16
2.360562 3.574118 4.383156 2.996234 2.360562 4.729886 3.400753 3.863060
17 18
2.707292 3.920849

With resid() we can compute the residual $e_i$ for each individual (for subject 1, $e_1$ = 3.5 - 3.74748 = -0.24748):

> resid(reg)
1 2 3 4 5 6
-0.247483426 -0.344985415 -0.094213736 -0.222869265 0.116844338 0.466072660
7 8 9 10 11 12
0.192707505 0.630400424 -0.260562185 0.225881729 -0.283155662 0.003765579
13 14 15 16 17 18
0.139437815 -0.129885972 -0.200753116 0.336939804 -0.407292495 0.079151419

To check these assumptions we can draw a set of four plots, which I explain below:

> op <- par(mfrow=c(2,2)) # ask R to split the window into 4 panels
> plot(reg) # draw the diagnostic plots for reg

[Figure 19: the four diagnostic panels drawn by plot(reg): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 6, 8 and 17 are labelled in the first three panels; observations 2, 6 and 8 in the last.]

Figure 19. Residual analysis for checking the assumptions of linear regression.


(a) The left-hand plot in row 1 shows the residuals $e_i$ against the predicted cholesterol values $\hat{y}_i$. The residuals are scattered around the line y = 0, so assumption (c), that $\varepsilon_i$ has mean 0, is acceptable.

(b) The right-hand plot in row 1 shows the residuals against their expected values under a normal distribution. The residuals lie very close to the normal line, so assumption (b), that $\varepsilon_i$ follows a normal distribution, is also reasonable.

(c) The left-hand plot in row 2 shows the square root of the standardized residuals against $\hat{y}_i$. There is no systematic difference in the standardized residuals across the values of $\hat{y}_i$, so assumption (d), that $\varepsilon_i$ has a constant variance $\sigma^2$ for all $x_i$, is also reasonable.

Overall, the residual analysis lets us conclude that the linear regression model describes the relationship between age and cholesterol adequately and validly.
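The visual checks above can be complemented by a few numbers. The sketch below is only an illustration on simulated data (the data, the seed and all variable names are my own assumptions, not the cholesterol study); it uses `shapiro.test` from base R as one possible normality check, which the text itself does not use.

```r
# Simulated data, for illustration only (not the cholesterol data)
set.seed(1)
x <- seq(20, 65, length.out = 30)
y <- 1 + 0.06 * x + rnorm(30, sd = 0.3)
fit <- lm(y ~ x)
e   <- resid(fit)

# Assumption (c): least squares with an intercept forces the residuals
# to average zero, so this mainly confirms the fit is as expected
mean_resid <- mean(e)

# Assumption (b): Shapiro-Wilk test of normality of the residuals
sw <- shapiro.test(e)

# Assumption (d): a crude check, correlating |residuals| with fitted values;
# a value near 0 gives no sign that the spread changes with the fit
spread_cor <- cor(abs(e), fitted(fit))

print(mean_resid)
print(sw$p.value)
print(spread_cor)
```

A small Shapiro-Wilk p-value or a large correlation would be a reason to look more closely at the corresponding plot; neither replaces the plots themselves.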


Prediction model

Once the model for predicting cholesterol has been checked and its validity established, we can draw the line describing the relationship between age and cholesterol with abline (recall that the fitted object is reg):

> plot(chol ~ age, pch=16)
> abline(reg)

[Figure 20: scatter plot of chol against age with the fitted regression line.]

Figure 20. Fitted line for the relationship between age and cholesterol.


But each value $\hat{y}_i$ is computed from the estimates $\hat{\alpha}$ and $\hat{\beta}$, and these estimates have standard errors, so the predicted value $\hat{y}_i$ is also subject to error. In other words, $\hat{y}_i$ is only a mean; the true value may be higher or lower depending on the sample. Its 95% confidence interval can be estimated in R with the following commands:

> reg <- lm(chol ~ age)
> new <- data.frame(age = seq(15, 70, 5))
> pred.w.plim <- predict.lm(reg, new, interval="prediction")
> pred.w.clim <- predict.lm(reg, new, interval="confidence")
> resc <- cbind(pred.w.clim, new)
> resp <- cbind(pred.w.plim, new)
> plot(chol ~ age, pch=16)
> lines(resc$fit ~ resc$age)
> lines(resc$lwr ~ resc$age, col=2)
> lines(resc$upr ~ resc$age, col=2)
> lines(resp$lwr ~ resp$age, col=4)
> lines(resp$upr ~ resp$age, col=4)


[Figure 21: scatter plot of chol against age with the fitted line, the 95% confidence band and the 95% prediction band.]

Figure 21. Predicted values and 95% confidence interval.

The plot above shows the mean predicted value $\hat{y}_i$ (black line), with its 95% confidence interval drawn as the red lines. The blue lines are the interval for the predicted cholesterol of a new individual of a given age in the population (the prediction interval).
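The difference between the two kinds of interval can be seen numerically. The sketch below uses simulated data (all names and numbers here are assumptions made for illustration) and the generic `predict` function, which dispatches to predict.lm for an lm fit.

```r
# Simulated data, for illustration only
set.seed(2)
d <- data.frame(x = runif(25, 20, 65))
d$y <- 1 + 0.06 * d$x + rnorm(25, sd = 0.3)
fit <- lm(y ~ x, data = d)

new <- data.frame(x = seq(20, 65, by = 5))
ci <- predict(fit, new, interval = "confidence")  # interval for the mean response
pr <- predict(fit, new, interval = "prediction")  # interval for a new individual

ci_width <- ci[, "upr"] - ci[, "lwr"]
pr_width <- pr[, "upr"] - pr[, "lwr"]
print(cbind(new, ci_width, pr_width))
```

Both intervals are centred on the same fitted value, but the prediction interval is always wider because it also includes the residual variation of a single new observation.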


10.3 Multiple linear regression model

The model expressed by the equation $y_i = \alpha + \beta x_i + \varepsilon_i$ has a single predictor (x), and is therefore usually called the simple linear regression model. In practice we can extend it to several variables rather than just one, for example:

$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$$

Note that this equation has several x variables ($x_1$, $x_2$, ..., $x_k$), and each has a parameter $\beta_j$ (j = 1, 2, ..., k) that must be estimated. This model is therefore called the multiple linear regression model.

Example 16. We return to the study of the relationship between age, bmi and cholesterol. So far we have only considered the relationship between age and cholesterol, not the joint relationship of age and bmi with cholesterol. The following plot shows the relationships among the three variables:

> pairs(data)

[Figure 22: scatterplot matrix drawn by pairs(data), with panels for age, bmi and chol.]

Figure 22. Scatterplot matrix of age, bmi and chol.

As with age and cholesterol, the relationship between bmi and cholesterol is also close to a straight line. The plot above also shows that age and bmi are related to each other. Indeed, a simple linear regression of cholesterol on bmi shows that this relationship is statistically significant:

> summary(lm(chol ~ bmi))

Call:
lm(formula = chol ~ bmi)

Residuals:
Min 1Q Median 3Q Max
-0.9403 -0.3565 -0.1376 0.3040 1.4330

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.83187 1.60841 -1.761 0.09739 .
bmi 0.26410 0.06861 3.849 0.00142 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.623 on 16 degrees of freedom
Multiple R-Squared: 0.4808, Adjusted R-squared: 0.4483
F-statistic: 14.82 on 1 and 16 DF, p-value: 0.001418

BMI explains about 48% of the variation in cholesterol between individuals. But because BMI is also related to age, we want to know which factor is more important when the two are analysed together. We examine the effect of both age ($x_1$) and bmi ($x_2$) on cholesterol (y) through a multiple linear regression model:

$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$$

The same equation can be written in matrix notation as $Y = X\beta + \varepsilon$, as presented earlier. Here Y is an 18 × 1 vector, X is the 18 × 2 matrix of the two predictors (18 × 3 when a column of ones is included for the intercept $\alpha$), $\beta$ is the 2 × 1 vector of regression coefficients (3 × 1 with $\alpha$), and $\varepsilon$ is an 18 × 1 vector. To estimate the two regression coefficients $\beta_1$ and $\beta_2$ we again use lm() in R:

> mreg <- lm(chol ~ age + bmi)
> summary(mreg)

Call:
lm(formula = chol ~ age + bmi)

Residuals:
Min 1Q Median 3Q Max
-0.3762 -0.2259 -0.0534 0.1698 0.5679

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.455458 0.918230 0.496 0.627
age 0.054052 0.007591 7.120 3.50e-06 ***
bmi 0.033364 0.046866 0.712 0.487
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3074 on 15 degrees of freedom
Multiple R-Squared: 0.8815, Adjusted R-squared: 0.8657
F-statistic: 55.77 on 2 and 15 DF, p-value: 1.132e-07

The results show the estimates $\hat{\alpha}$ = 0.455, $\hat{\beta}_1$ = 0.054 and $\hat{\beta}_2$ = 0.0334. In other words, the equation for predicting cholesterol from age and bmi is:

Cholesterol = 0.455 + 0.054(age) + 0.0334(bmi)

The equation says that when age increases by 1 year, cholesterol increases by 0.054 mmol/L (an estimate not far from the 0.0578 in the equation with age alone), and that each 1 kg/m² increase in BMI is associated with a 0.0334 mmol/L increase in cholesterol. Together the two factors explain about 88.2% ($R^2$ = 0.8815) of the variation in cholesterol between individuals.

Recall that the equation with age alone (in the previous analysis) explained about 87.7% of the variation in cholesterol. Adding BMI raises this to 88.2%, an increase of less than half a percentage point. The question is whether this increase is statistically significant. The answer is given by the test for bmi, with p = 0.487. Thus bmi adds no information for predicting cholesterol beyond what we already have from age. In other words, once age is taken into account, the effect of bmi is no longer statistically significant. This is understandable, because Figure 22 shows that age and bmi are quite strongly related. Since the two variables are correlated, we do not need both in the equation. (However, this example is only meant to illustrate how to carry out a multiple linear regression in R; it is not intended to model the data along biological lines.)
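The question "does the second variable add anything?" can also be put to R as a comparison of two nested models. The sketch below uses simulated data with two correlated predictors (the data, seed and names `x1`, `x2`, `small`, `big` are assumptions for illustration, since the bmi values are not listed here). For a single added variable, the F test from `anova` gives the same p-value as the t test for that variable in the larger model.

```r
# Simulated data with two correlated predictors, for illustration only
set.seed(42)
n  <- 18
x1 <- runif(n, 20, 65)                 # plays the role of age
x2 <- 18 + 0.1 * x1 + rnorm(n)         # plays the role of bmi, correlated with x1
y  <- 1 + 0.05 * x1 + rnorm(n, sd = 0.3)

small <- lm(y ~ x1)        # model with the first predictor only
big   <- lm(y ~ x1 + x2)   # model with both predictors

cmp <- anova(small, big)   # F test for the added term x2
p_f <- cmp[2, "Pr(>F)"]
p_t <- summary(big)$coefficients["x2", "Pr(>|t|)"]

print(cor(x1, x2))
print(c(p_from_anova = p_f, p_from_t_test = p_t))
```

With the real data, the analogous call would be anova(reg, mreg), assuming both objects were fitted to the same 18 observations.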

[Figure 23: the four diagnostic panels for the multiple regression model: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 6, 8 and 16 are labelled in the first three panels; observations 8, 15 and 16 in the last.]

Figure 23. Residual analysis for checking the assumptions of multiple linear regression.

Although BMI is not statistically significant in this case, Figure 23 shows that the assumptions of the linear regression model are reasonably met.



11. Analysis of variance

11.1 One-way analysis of variance (one-way ANOVA)

Example 17. The table below compares galactose levels in 3 groups of patients: group 1 consists of 9 patients with Crohn's disease; group 2 of 11 patients with colitis; and group 3 of 20 subjects without disease (the control group). The question is whether galactose levels differ among the 3 groups.

Galactose levels in three groups: Crohn's disease, colitis and controls

Group 1:          Group 2:     Group 3:
Crohn's disease   colitis      controls
---------------   ----------   ------------
1343              1264         1809   2850
1393              1314         1926   2964
1420              1399         2283   2973
1641              1605         2384   3171
1897              2385         2447   3257
2160              2511         2479   3271
2169              2514         2495   3288
2279              2767         2525   3358
2890              2827         2541   3643
                  2895         2769   3657
                  3011
n = 9             n = 11       n = 20
Mean: 1910        Mean: 2226   Mean: 2804
SD: 516           SD: 727      SD: 527

Note: SD is the standard deviation.

Let the means of the three groups be $\mu_1$, $\mu_2$ and $\mu_3$. In the language of hypothesis testing, the null hypothesis is:

$$H_0: \mu_1 = \mu_2 = \mu_3$$

and the alternative hypothesis is:

$$H_A: \text{at least one of the three } \mu_j \ (j = 1, 2, 3) \text{ differs from the others}$$

At first the reader, having learned to compare two groups with the t test, may think that we need three t tests: group 1 versus 2, group 2 versus 3, and group 1 versus 3. But this approach is not appropriate, because it involves three different variances. The appropriate method is the analysis of variance, which can compare several groups simultaneously (simultaneous comparisons).

To illustrate the analysis of variance we need some notation. Let $x_{ij}$ be the galactose level of patient i in group j (j = 1, 2, 3). The analysis of variance model states that:

$$x_{ij} = \mu + \alpha_j + \varepsilon_{ij}$$

Or, more specifically:

$$x_{i1} = \mu + \alpha_1 + \varepsilon_{i1}$$
$$x_{i2} = \mu + \alpha_2 + \varepsilon_{i2}$$
$$x_{i3} = \mu + \alpha_3 + \varepsilon_{i3}$$


First we need to enter the data into R. The first step is to tell R that we have three groups of patients (1, 2 and 3): group 1 has 9 people, group 2 has 11 and group 3 has 20:

> group <- c(1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3)

For the analysis of variance, the variable group must be defined as a factor:

> group <- as.factor(group)

Next we enter the galactose values for each group in the order defined above (calling the object galactose):

> galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,2479,3271,2495,3288,
2525,3358,2541,3643,2769,3657)

Put the two variables group and galactose into a data frame called data:

> data <- data.frame(group, galactose)
> attach(data)

With the data ready, we use lm() to carry out the analysis of variance:

> analysis <- lm(galactose ~ group)

In this call we tell R that galactose is a function of group. The result is stored in the object analysis.

Results of the analysis of variance. We now use the anova command to see the results:

> anova(analysis)
Analysis of Variance Table

Response: galactose
Df Sum Sq Mean Sq F value Pr(>F)
group 2 5683620 2841810 8.6655 0.0008191 ***
Residuals 37 12133923 327944
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The columns of this table are: Df (degrees of freedom); Sum Sq (sum of squares); Mean Sq (mean square); F value (the F statistic); and Pr(>F), the p-value of the F test.
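To see where the numbers in the table come from, the sketch below recomputes the between-group and within-group sums of squares by hand from the galactose data of Example 17 and compares them with anova(). The variable names (`ss_between`, `ss_within` and so on) are my own.

```r
# Galactose data from Example 17, entered group by group
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)
group <- factor(rep(1:3, times = c(9, 11, 20)))

grand <- mean(galactose)
means <- sapply(split(galactose, group), mean)   # group means
sizes <- lengths(split(galactose, group))        # group sizes

# Between-group and within-group sums of squares
ss_between <- sum(sizes * (means - grand)^2)
ss_within  <- sum((galactose - means[as.integer(group)])^2)

df_between <- nlevels(group) - 1
df_within  <- length(galactose) - nlevels(group)

f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(ss_between = ss_between, ss_within = ss_within,
        f_value = f_value, p_value = p_value))
print(anova(lm(galactose ~ group)))
```

The F value is the ratio of the two mean squares, and its p-value comes from the F distribution with 2 and 37 degrees of freedom.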

11.2 Multiple comparisons and adjustment of p-values

With k groups there are k(k-1)/2 possible pairwise comparisons. The example above has 3 groups, so there are 3 possible comparisons (group 1 versus 2, group 1 versus 3, and group 2 versus 3). When k = 10 the number of comparisons rises to 45. As mentioned in Chapter 7, when many comparisons are made, the p-values from the individual tests no longer have their original meaning, because the tests can produce false positives (results with p < 0.05 when in reality there is no difference or effect). So when there are many comparisons we need to adjust the p-values appropriately.

There are many methods for adjusting p-values; the 4 most common are Bonferroni, Scheffé, Holm and Tukey (named after 4 well-known statisticians). Which is most appropriate? There is no definitive answer, but the following two points may help the reader decide:

(a) If k < 10, any of the methods can be used to adjust the p-values. Personally, I find the Tukey method very useful for comparisons.

(b) If k > 10, the Bonferroni method can become very "conservative". Conservative here means that it will rarely declare a comparison statistically significant even when the difference is real. In this case the Tukey, Holm and Scheffé methods can be used.

Returning to the example, the p-values above have not been adjusted for multiple comparisons. In the chapter on p-values I said that such values overstate statistical significance and do not reflect the original significance level (0.05). One way to adjust for multiple comparisons is the Bonferroni method.

We can use pairwise.t.test to obtain all the p-values for the comparisons among the three groups:

> pairwise.t.test(galactose, group, p.adj="bonferroni")

Pairwise comparisons using t tests with pooled SD

data: galactose and group

1 2
2 0.6805 -
3 0.0012 0.0321

P value adjustment method: bonferroni


The output shows that the p-value for group 1 (Crohn's) versus colitis is 0.6805 (not statistically significant); for Crohn's versus control it is 0.0012 (statistically significant); and for colitis versus control it is 0.0321 (also statistically significant).

Another method of adjusting p-values is the Holm method, which is the default of pairwise.t.test:

> pairwise.t.test(galactose, group)

Pairwise comparisons using t tests with pooled SD

data: galactose and group

1 2
2 0.2268 -
3 0.0012 0.0214

P value adjustment method: holm

This result leads to the same conclusions as the Bonferroni method.
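The two adjustments can be reproduced step by step with `p.adjust`. The sketch below first obtains the unadjusted pairwise p-values (by setting p.adjust.method = "none") and then applies the Bonferroni and Holm rules to them; the object names are my own.

```r
# Galactose data from Example 17
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)
group <- factor(rep(1:3, times = c(9, 11, 20)))

# Unadjusted pairwise p-values (pooled SD); the result is a lower-triangular matrix
raw_mat <- pairwise.t.test(galactose, group, p.adjust.method = "none")$p.value
raw <- raw_mat[!is.na(raw_mat)]   # order: 2 vs 1, 3 vs 1, 3 vs 2

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
bonf <- p.adjust(raw, method = "bonferroni")

# Holm: multiply the smallest p by 3, the next by 2, the largest by 1,
# then enforce monotonicity
holm <- p.adjust(raw, method = "holm")

print(round(rbind(raw = raw, bonferroni = bonf, holm = holm), 4))
```

Holm never gives a larger adjusted p-value than Bonferroni, which is why it is usually preferred when both are applicable.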

All the comparisons above use a single pooled standard deviation for the three groups. If we want each comparison to use the standard deviations of its own two groups, the following command (with pool.sd=FALSE) does so:

> pairwise.t.test(galactose, group, pool.sd=FALSE)

Pairwise comparisons using t tests with non-pooled SD

data: galactose and group

1 2
2 0.2557 -
3 0.0017 0.0544

P value adjustment method: holm

Once again the overall picture is similar, although the comparison between colitis and control is now borderline (p = 0.0544).

The methods above only give p-values for the comparisons between groups; they do not tell us the size of the differences or their 95% confidence intervals. To obtain these estimates we need another function, aov (short for analysis of variance), together with TukeyHSD (HSD stands for Honest Significant Difference):

> res <- aov(galactose ~ group)
> TukeyHSD (res)
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = galactose ~ group)

$group
diff lwr upr p adj
2-1 316.3232 -312.09857 944.745 0.4439821
3-1 894.2778 333.07916 1455.476 0.0011445
3-2 577.9545 53.11886 1102.790 0.0281768

The output shows that groups 3 and 1 differ by about 894 units, with a 95% confidence interval from 333 to 1455 units. Similarly, galactose in the colitis group is lower than in the control group (group 3) by about 578 units, with a 95% confidence interval from 53 to 1103.
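The interval for one comparison can be rebuilt by hand to show what TukeyHSD does. The sketch below uses the Tukey-Kramer form (a studentized range quantile from `qtukey`, the residual mean square, and the two group sizes) for the comparison of group 3 with group 1; the object names are my own.

```r
# Galactose data from Example 17
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)
group <- factor(rep(1:3, times = c(9, 11, 20)))

fit <- aov(galactose ~ group)
mse <- deviance(fit) / df.residual(fit)          # residual mean square
m   <- sapply(split(galactose, group), mean)     # group means
n   <- lengths(split(galactose, group))          # group sizes

# Critical value of the studentized range for 3 means and 37 residual df
q_crit <- qtukey(0.95, nmeans = 3, df = df.residual(fit))

diff31 <- m[["3"]] - m[["1"]]
half   <- q_crit / sqrt(2) * sqrt(mse * (1 / n[["3"]] + 1 / n[["1"]]))
lwr <- diff31 - half
upr <- diff31 + half

print(c(diff = diff31, lwr = lwr, upr = upr))
print(TukeyHSD(fit)$group["3-1", ])
```

The same recipe with the other pairs of means and sizes gives the remaining two rows of the TukeyHSD table.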

[Figure 24: plot of the Tukey intervals for the comparisons 2-1, 3-1 and 3-2; title "95% family-wise confidence level", x-axis "Differences in mean levels of group".]

Figure 24. Mean differences and 95% confidence intervals between groups 1 and 2, 1 and 3, and 3 and 2. The horizontal axis is the difference in galactose; the vertical axis lists the three comparisons.



11.3 Analysis by a non-parametric method

The non-parametric method for comparing several groups that corresponds to the analysis of variance is the Kruskal-Wallis test. Like the Wilcoxon test for comparing two groups non-parametrically, the Kruskal-Wallis test converts the data to ranks and analyses the differences in ranks between the groups. The function kruskal.test in R carries out this test:

> kruskal.test(galactose ~ group)

Kruskal-Wallis rank sum test

data: galactose by group
Kruskal-Wallis chi-squared = 12.1381, df = 2, p-value = 0.002313


The p-value from this test is low (p = 0.002313), showing a difference among the three groups, as in the analysis of variance with lm above. However, a drawback of the Kruskal-Wallis test is that it does not tell us which two groups differ; it gives only one overall p-value. In many cases non-parametric analyses such as the Kruskal-Wallis test are less efficient than parametric statistical methods.
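The rank-based idea can be made concrete by computing the Kruskal-Wallis statistic by hand. The sketch below uses the usual formula for H without a tie correction, which is appropriate here because the 40 galactose values contain no ties; the object names are my own.

```r
# Galactose data from Example 17
galactose <- c(1343,1393,1420,1641,1897,2160,2169,2279,2890,
               1264,1314,1399,1605,2385,2511,2514,2767,2827,2895,3011,
               1809,2850,1926,2964,2283,2973,2384,3171,2447,3257,
               2479,3271,2495,3288,2525,3358,2541,3643,2769,3657)
group <- factor(rep(1:3, times = c(9, 11, 20)))

r  <- rank(galactose)                 # ranks over the pooled sample
N  <- length(galactose)
Rj <- sapply(split(r, group), sum)    # rank sum in each group
nj <- lengths(split(r, group))        # group sizes

# Kruskal-Wallis statistic (no tie correction needed: no tied values)
H <- 12 / (N * (N + 1)) * sum(Rj^2 / nj) - 3 * (N + 1)
p <- pchisq(H, df = nlevels(group) - 1, lower.tail = FALSE)

kw <- kruskal.test(galactose ~ group)
print(c(H = H, p = p))
print(kw)
```

Because only ranks enter the formula, the result does not change under any monotone transformation of the galactose values.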


11.4 Two-way analysis of variance (two-way ANOVA)

One-way analysis of variance has a single factor. Two-way analysis of variance, as its name suggests, has two. The method is a simple extension of one-way analysis of variance: instead of estimating the variance due to one factor, it estimates the variance due to two factors.

Example 18. To evaluate a new painting technique, researchers applied the paint to 3 types of material (1, 2 and 3) under two conditions (1, 2). Each combination of condition and material was replicated 3 times. Durability was measured by a durability index (called score here). In total there are 18 observations:

Paint durability for 2 conditions and 3 materials

Condition (i)   Material (j)
                1               2               3
1               4.1, 3.9, 4.3   3.1, 2.8, 3.3   3.5, 3.2, 3.6
2               2.7, 3.1, 2.6   1.9, 2.2, 2.3   2.7, 2.3, 2.5

Let $x_{ij}$ be the score under condition i (i = 1, 2) for material j (j = 1, 2, 3). (To keep things simple, we ignore the replicate index k for now.) The two-way analysis of variance model states that:

$$x_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$$

Here $\mu$ is the overall population mean, and the coefficients $\alpha_i$ (the effect of condition i) and $\beta_j$ (the effect of material j) must be estimated from the data. $\varepsilon_{ij}$ is assumed to follow a normal distribution with mean 0 and variance $\sigma^2$.

To analyse the data in R, we need to organise them into 4 variables as follows:

Condition Material Subject Score
1 1 1 4.1
1 1 2 3.9
1 1 3 4.3
1 2 4 3.1
1 2 5 2.8
1 2 6 3.3
1 3 7 3.5
1 3 8 3.2
1 3 9 3.6
2 1 10 2.7
2 1 11 3.1
2 1 12 2.6
2 2 13 1.9
2 2 14 2.2
2 2 15 2.3
2 3 16 2.7
2 3 17 2.3
2 3 18 2.5

We can generate the factor levels with the function gl (generate levels):

> condition <- gl(2, 9, 18)
> material <- gl(3, 3, 18)

Then create 18 identification numbers (1 to 18):

> id <- 1:18

Finally, the score data:

> score <- c(4.1,3.9,4.3, 3.1,2.8,3.3, 3.5,3.2,3.6,
2.7,3.1,2.6, 1.9,2.2,2.3, 2.7,2.3,2.5)

All of these go into a data frame called data:

> data <- data.frame(condition, material, id, score)
> attach(data)

The data are now ready for analysis. For a two-way analysis of variance we again use lm, as follows:

> twoway <- lm(score ~ condition + material)
> anova(twoway)
Analysis of Variance Table

Response: score
Df Sum Sq Mean Sq F value Pr(>F)
condition 1 5.0139 5.0139 95.575 1.235e-07 ***
material 2 2.1811 1.0906 20.788 6.437e-05 ***
Residuals 14 0.7344 0.0525
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The table partitions the variation in score into three sources: condition, material and residual. Judging by the mean squares, the effect of condition appears more important than the effect of material. Both effects are nevertheless statistically significant, since the p-values for both factors are very low. We ask R to summarise the estimates with summary:

> summary(twoway)

Call:
lm(formula = score ~ condition + material)

Residuals:
Min 1Q Median 3Q Max
-0.32778 -0.16389 0.03333 0.16111 0.32222

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.9778 0.1080 36.841 2.43e-15 ***
condition2 -1.0556 0.1080 -9.776 1.24e-07 ***
material2 -0.8500 0.1322 -6.428 1.58e-05 ***
material3 -0.4833 0.1322 -3.655 0.0026 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.229 on 14 degrees of freedom
Multiple R-Squared: 0.9074, Adjusted R-squared: 0.8875
F-statistic: 45.72 on 3 and 14 DF, p-value: 1.761e-07


The output shows that, compared with condition 1, condition 2 has a score lower by about 1.056, with a standard error of 0.108 and p = 1.24e-07, which is statistically significant. In addition, compared with material 1, the scores for materials 2 and 3 are also markedly lower, the lowest being for material 2, and the effect of material is also statistically significant.
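Because this design is balanced (3 replicates in every cell) and the model is additive, the coefficients reported by summary are simply differences between marginal means. The sketch below checks this with the data of Example 18; the data frame name `paint` is my own choice, and the data are passed through the data argument rather than attach.

```r
# Score data from Example 18
score <- c(4.1,3.9,4.3, 3.1,2.8,3.3, 3.5,3.2,3.6,
           2.7,3.1,2.6, 1.9,2.2,2.3, 2.7,2.3,2.5)
paint <- data.frame(condition = gl(2, 9, 18),
                    material  = gl(3, 3, 18),
                    score     = score)

fit <- lm(score ~ condition + material, data = paint)

# Marginal means for each condition and each material
cond_means <- sapply(split(paint$score, paint$condition), mean)
mat_means  <- sapply(split(paint$score, paint$material), mean)

# Differences from the reference levels (condition 1, material 1)
diffs <- c(condition2 = cond_means[["2"]] - cond_means[["1"]],
           material2  = mat_means[["2"]]  - mat_means[["1"]],
           material3  = mat_means[["3"]]  - mat_means[["1"]])

print(round(diffs, 4))
print(round(coef(fit)[-1], 4))
```

This equality relies on the balance of the design; with unequal cell sizes the coefficients would no longer equal the raw differences in marginal means.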

The value labelled "Residual standard error" is computed from the residual mean square in the ANOVA table above: $\sqrt{0.0525} = 0.229$, which is the estimate of $\sigma$.

The coefficient of determination ($R^2$) shows that condition and material together explain about 91% of the variation in the whole sample. It is computed from the sums of squares in the ANOVA table as follows:

$$R^2 = \frac{5.0139 + 2.1811}{5.0139 + 2.1811 + 0.7344} = 0.9074$$

Finally, the adjusted $R^2$ reflects the improvement brought by the model. To understand it better, note that the variance of the whole sample is $s^2$ = (5.0139 + 2.1811 + 0.7344) / 17 = 0.4664. After adjusting for the effects of condition and material, this variance falls to 0.0525 (the residual mean square). The two factors thus reduce the variance by about 0.4664 - 0.0525 = 0.4139, and the adjusted $R^2$ is:

$$\text{Adj } R^2 = \frac{0.4139}{0.4664} = 0.8875$$

That is, after adjusting for condition and material, the variance of score is reduced by about 88%.

Comparisons between groups. We estimate the differences between the two conditions and among the three materials with TukeyHSD and aov:

> res <- aov(score ~ condition+ material+condition)
> TukeyHSD(res)
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = score ~ condition + material + condition)

$condition
diff lwr upr p adj
2-1 -1.055556 -1.287131 -0.8239797 1e-07

$material
diff lwr upr p adj
2-1 -0.8500000 -1.19610279 -0.5038972 0.0000442
3-1 -0.4833333 -0.82943612 -0.1372305 0.0068648
3-2 0.3666667 0.02056388 0.7127695 0.0374069

The following plot illustrates these results:

> plot(TukeyHSD(res))

[Figure 25: plot of the Tukey intervals for the material comparisons 2-1, 3-1 and 3-2; title "95% family-wise confidence level", x-axis "Differences in mean levels of material".]

Figure 25. Comparison of the 3 materials by the Tukey method.


12. Logistic regression analysis

In the previous sections on linear regression and analysis of variance, we modelled the relationship between a continuous dependent variable and one or more independent variables, which may be continuous or not. In many cases, however, the dependent variable is not continuous but binary: yes/no, diseased/not diseased, dead/alive, occurred/did not occur, and so on, while the independent variables may be continuous or not. We still want to study the relationship between the independent variables and the dependent variable.

Example 19. In a study I conducted to examine the relationship between fracture risk (fracture, abbreviated fx) and bone density together with several biochemical markers, 139 male patients (or, more accurately, study subjects) aged 60 and over took part. In 1990 the following data were collected for each subject: age, body mass index (BMI), bone mineral density (BMD), the bone resorption marker ICTP, and the bone formation marker PINP. The subjects were followed for 15 years, during which it was recorded whether or not each patient sustained a fracture. The initial question was whether there is a relationship between BMD and fracture risk. The data from this study are given at the end of the chapter; part of them is shown below to give the reader an idea of the problem.

Part of the data from the study of risk factors for fracture

id fx age bmi bmd ictp pinp
1 1 79 24.7252 0.818 9.170 37.383
2 1 89 25.9909 0.871 7.561 24.685
3 1 70 25.3934 1.358 5.347 40.620
4 1 88 23.2254 0.714 7.354 56.782
5 1 85 24.6097 0.748 6.760 58.358
6 0 68 25.0762 0.935 4.939 67.123
7 0 70 19.8839 1.040 4.321 26.399
8 0 69 25.0593 1.002 4.212 47.515
9 0 74 25.6544 0.987 5.605 26.132
10 0 79 19.9594 0.863 5.204 60.267
...

137 0 64 38.0762 1.086 5.043 32.835
138 1 80 23.3887 0.875 4.086 23.837
139 0 67 25.9455 0.983 4.328 71.334

Here, because the dependent variable (fracture) is not measured on a continuous scale (it is only yes or no), linear regression cannot be used to analyse the relationship between the dependent and independent variables. A relatively recent method (developed in the 1970s) called logistic regression analysis can be applied in this situation.

In this study, after 15 years of follow-up, 38 patients had sustained a fracture. As a proportion, the fracture rate is 38 / 139 = 0.273 (or 27.3%).

12.1 The logistic regression model

Given that an event is observed x times among n subjects, we can compute the probability of the event as:

$$p = \frac{x}{n}$$

p can be regarded as a measure of the risk of the event. Another way of expressing risk is the odds (a noun that, if I am not mistaken, exists only in English; even French, German and Spanish have no equivalent). I tentatively render odds in Vietnamese as "khả năng". The odds of an event are defined simply as the ratio of the probability that the event occurs to the probability that it does not:

$$\text{odds} = \frac{p}{1 - p}$$

The logit of p is the logarithm of the odds:

$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)$$

Given an independent variable x (which may be continuous or not), the logistic regression model states that:

$$\text{logit}(p) = \alpha + \beta x$$

As in the linear regression model, $\alpha$ and $\beta$ are two linear parameters to be estimated from the study data. But the meaning of these parameters, especially $\beta$, is very different from the one we are used to in linear regression. To understand them, I return to Example 19.
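The three quantities defined above are easy to compute for the fracture data of Example 19 (38 fractures among 139 subjects). The sketch below is only a numerical illustration; `qlogis` and `plogis` are the base R logit and inverse-logit functions, and the variable names are my own.

```r
# Observed fracture proportion from Example 19
p <- 38 / 139

odds  <- p / (1 - p)   # 38 fractures for every 101 non-fractures
logit <- log(odds)     # log odds

# Going back from the logit to the probability
back <- exp(logit) / (1 + exp(logit))

print(c(p = p, odds = odds, logit = logit, back = back))

# The same transformations with built-in functions
print(qlogis(p))       # logit
print(plogis(logit))   # inverse logit
```

Note that a probability below 0.5 corresponds to odds below 1 and therefore to a negative logit.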

What we want to know is the relationship between bone density (bmd) and fracture risk (fx). For convenience, call bmd x. The question can then be written in model terms as:

$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \alpha + \beta x$$

In other words:

$$\text{odds}(p) = \frac{p}{1 - p} = e^{\alpha + \beta x}$$

That is, the logistic regression model states that the relationship between the probability of fracture (p) and bone density is S-shaped. The model also shows that the probability of fracture p depends on the value of x. More precisely, then, the odds of fracture given x are:

$$\text{odds}(p \mid x) = e^{\alpha + \beta x}$$

When $x = x_0$, the odds of fracture are:

$$\text{odds}(p \mid x = x_0) = e^{\alpha + \beta x_0}$$

When $x = x_0 + 1$ (an increase of 1 unit from $x_0$), the odds of fracture are:

$$\text{odds}(p \mid x = x_0 + 1) = e^{\alpha + \beta (x_0 + 1)}$$

The ratio of the two odds of fracture is:

$$\frac{\text{odds}(p \mid x = x_0 + 1)}{\text{odds}(p \mid x = x_0)} = \frac{e^{\alpha + \beta (x_0 + 1)}}{e^{\alpha + \beta x_0}} = e^{\beta}$$

In epidemiology, $e^{\beta}$ is called the odds ratio: as the name says, a ratio of two odds. In other words, the coefficient $\beta$ in the logistic regression model is the logarithm of the odds ratio.
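The derivation says that the odds ratio for a one-unit increase in x is the same whatever the starting value. The sketch below illustrates this numerically; the values of `alpha` and `beta` are illustrative (they are chosen to match the estimates that the glm output in section 12.2 reports), and the helper functions `odds` and `prob` are my own.

```r
# Illustrative parameter values (matching the estimates reported in section 12.2)
alpha <- 1.063
beta  <- -2.270

odds <- function(x) exp(alpha + beta * x)      # odds of fracture at x
prob <- function(x) plogis(alpha + beta * x)   # probability of fracture at x

x0 <- c(0.7, 0.9, 1.1)                         # three arbitrary starting values

# Odds ratio for a one-unit increase, computed at each starting value
ratio <- odds(x0 + 1) / odds(x0)

print(ratio)
print(exp(beta))
print(prob(x0))
```

The ratio of odds is constant, but the ratio of probabilities is not: the same odds ratio corresponds to different changes in p at different values of x.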

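The method for estimating the parameters of model [3] is rather complex (it uses maximum likelihood) and is beyond the scope of this book, so I do not present it here (the reader may consult a textbook if needed). I will only mention briefly that maximum likelihood leads to the following system of equations:

$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \left(1 + e^{-(\hat{\alpha} + \hat{\beta} x_i)}\right)^{-1}$$

$$\sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{n} x_i \left(1 + e^{-(\hat{\alpha} + \hat{\beta} x_i)}\right)^{-1}$$

Here $y_i$ is the dependent variable (fracture, taking the value 0 or 1), $x_i$ is the independent variable (bone density), and n is the sample size. To find the estimates $\hat{\alpha}$ and $\hat{\beta}$, the algorithms commonly used are iteratively reweighted least squares and Newton-Raphson. R's glm uses iteratively reweighted least squares (reported as Fisher scoring in its output), which coincides with Newton-Raphson for the logit link.

Once we have $\hat{\alpha}$ and $\hat{\beta}$, we can estimate the probability p for any value of x as follows (after some algebra):

$$\hat{p} = \frac{e^{\hat{\alpha} + \hat{\beta} x}}{1 + e^{\hat{\alpha} + \hat{\beta} x}} = \frac{1}{1 + e^{-(\hat{\alpha} + \hat{\beta} x)}}$$

Note that I write $\hat{p}$ with a hat to denote the estimated (predicted) value, as opposed to p, the observed probability. If the model describes the data well and adequately, the difference between $\hat{p}$ and p is small; if the model is inappropriate or poor, the difference may be large. The discrepancy between $\hat{p}$ and p is measured by the deviance. The calculation of the deviance is rather complex and is not the topic here, so I only mention the concept. When we have several models describing one or more relationships, the deviance can be used to assess the adequacy of a model or to choose the best one.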
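To make the estimation step less abstract, the sketch below solves the two likelihood equations by Newton-Raphson on simulated data and compares the result with glm. Everything here (the simulated data, the seed, the starting values, the stopping rule and the variable names) is an assumption for illustration; it is not how glm is implemented internally, only an equivalent calculation for this model.

```r
# Simulated binary data, for illustration only (not the fracture study)
set.seed(123)
n <- 200
x <- rnorm(n, mean = 0.9, sd = 0.14)
y <- rbinom(n, 1, plogis(1 - 2 * x))

X <- cbind(1, x)     # design matrix: a column of ones and x
b <- c(0, 0)         # starting values for (alpha, beta)

for (iter in 1:25) {
  p     <- as.vector(plogis(X %*% b))         # current fitted probabilities
  score <- t(X) %*% (y - p)                   # gradient of the log-likelihood
  info  <- t(X) %*% (X * (p * (1 - p)))       # information matrix (minus the Hessian)
  step  <- solve(info, score)                 # Newton-Raphson step
  b     <- b + as.vector(step)
  if (max(abs(step)) < 1e-10) break           # stop when the update is negligible
}

fit <- glm(y ~ x, family = binomial)
print(b)
print(coef(fit))

# At the solution the likelihood equations hold:
# the sum of y equals the sum of the fitted probabilities
print(c(sum(y), sum(plogis(X %*% b))))
```

Setting the score to zero reproduces exactly the two equations shown above, which is why the observed and fitted totals agree at convergence.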

12.2 Logistic regression analysis in R

We now return to Example 19 and use the data in Table 12.1 to estimate the two parameters $\alpha$ and $\beta$ in R. First we must read all the data into a data frame and give it a name, say fracture. In my case the data are stored in the directory c:\works\stats under the name fracture.txt, so the following commands are needed to read them in:

# tell R where the data are stored

> setwd("c:/works/stats")

# read the data into a data frame called fracture

> fracture <- read.table("fracture.txt", header=TRUE, na.strings=".")

# list the variables in fracture

> names(fracture)
[1] "id" "fx" "age" "bmi" "bmd" "ictp" "pinp"

# keep only the patients with complete data for the analysis

> fulldata <- na.omit(fracture)
> attach(fulldata)

The two variables of interest in this example are fx (fracture) and bmd (bone density). We first check how many patients had a fracture:

> table(fx)
fx
0 1
101 38


Next, we look at bone density in the fracture and non-fracture groups:

> tapply(bmd, fx, mean)
0 1
0.9444851 0.9016667

> boxplot(bmd ~ fx,
xlab="Fracture: 1=yes, 0=no",
ylab="BMD")

[Figure: box plots of BMD by fracture status (x-axis: Fracture: 1=yes, 0=no; y-axis: BMD).]


The results show that bmd is lower in patients with a fracture than in those without (0.90 versus 0.94). However, the t test below shows that this difference is not statistically significant (p = 0.15).

> t.test(bmd~fx)

Welch Two Sample t-test

data: bmd by fx
t = 1.4572, df = 53.952, p-value = 0.1508
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.01609226 0.10172922
sample estimates:
mean in group 0 mean in group 1
0.9444851 0.9016667

To estimate the parameters of model [4], we can use the function glm (short for generalized linear model) in R, with the following syntax:

> logistic <- glm(fx ~ bmd, family=binomial)
> summary(logistic)

Call:
glm(formula = fx ~ bmd, family = "binomial")

Deviance Residuals:
Min 1Q Median 3Q Max
-1.0287 -0.8242 -0.7020 1.3780 2.0709

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.063 1.342 0.792 0.428
bmd -2.270 1.455 -1.560 0.119

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 157.81 on 136 degrees of freedom
Residual deviance: 155.27 on 135 degrees of freedom
AIC: 159.27

Number of Fisher Scoring iterations: 4

I will explain these results in turn:

(a) With the command logistic <- glm(fx ~ bmd, family=binomial) we ask R to fit
a model in which fx is a function of bmd, as in model [4]. glm supports several
distributions, of which the binomial is the standard one for logistic
regression. This is why family=binomial must be specified.

(b) Deviance: the first part of the output gives a brief summary of the
deviance residuals.

Deviance Residuals:
Min 1Q Median 3Q Max
-1.0287 -0.8242 -0.7020 1.3780 2.0709

As explained above, deviance reflects the discrepancy between the model and the
data (much like the residual mean square in linear regression). For a single
model such as this one, the value of the deviance by itself carries little
meaning.

(c) The next part gives the estimates of α (which R calls intercept) and β
(bmd), together with their standard errors.

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.063 1.342 0.792 0.428
bmd -2.270 1.455 -1.560 0.119

From this output we have α = 1.063 and β = -2.27. The estimate of β is negative,
which indicates an inverse relationship between fracture risk and bmd: the
probability of fracture increases as bmd decreases. However, the z test (the
estimate divided by its standard error) shows that the effect of bmd is not
statistically significant, since p = 0.119.

Recall that the odds ratio (OR) is $e^{-2.27} = 0.1033$. In other words, when
bmd increases by 1 g/cm² (bmd is measured in g/cm²) the odds of fracture are
multiplied by 0.1033, that is, they fall by 1 - 0.1033 = 0.8967 or 89.67%. But an
increase of 1 g/cm² is very large for bone density and is not realistic. An
alternative is therefore to express the effect per standard deviation of bmd.
We first find the standard deviation of bmd:

> sd(bmd)
[1] 0.1406543

The OR is therefore calculated per 0.14 g/cm². The OR per standard deviation is:

$e^{-2.27 \times 0.1406} = 0.7267$

That is, when bmd increases by one standard deviation the odds of fracture
decrease by about 27%. Equivalently, when bmd decreases by one standard
deviation the odds increase by a factor of $e^{2.27 \times 0.1406} = 1.376$, or
about 38%.
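
The odds ratios above can be checked with a few lines of R. This is only a minimal sketch: it uses the estimate and the standard deviation printed in the output above, and the names beta, sd_bmd, or_unit, or_sd and or_sd_inv are illustrative rather than part of the original analysis.

```r
# Estimate of the bmd coefficient and SD of bmd, copied from the output above
beta   <- -2.27
sd_bmd <- 0.1406

or_unit   <- exp(beta)            # OR per 1 g/cm2 increase in bmd
or_sd     <- exp(beta * sd_bmd)   # OR per 1 SD increase in bmd
or_sd_inv <- exp(-beta * sd_bmd)  # OR per 1 SD decrease in bmd

round(c(or_unit, or_sd, or_sd_inv), 4)
```

Working from the fitted object instead, exp(coef(logistic)) would give the per-unit odds ratios directly.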

Another way to gauge the effect of bmd is to estimate the probability of
fracture from the equation:

\[ p = \frac{e^{1.063 - 2.27\,bmd}}{1 + e^{1.063 - 2.27\,bmd}} \]

Accordingly, when bmd = 1.00, p = 0.23. When bmd = 0.86 (one standard deviation
lower), p = 0.291. That is, if BMD falls by one standard deviation, the
probability of fracture increases by a factor of 0.291/0.23 = 1.265, or 26.5%.
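
The same probabilities can be computed in R. The sketch below assumes only the two coefficients reported above; p_fracture is a hypothetical helper name, and plogis is the inverse logit function from base R.

```r
# Coefficients copied from the glm output above
alpha <- 1.063
beta  <- -2.27

# plogis(x) = exp(x) / (1 + exp(x)), the inverse of the logit
p_fracture <- function(bmd) plogis(alpha + beta * bmd)

p1 <- p_fracture(1.00)   # bmd of 1.00 g/cm2
p2 <- p_fracture(0.86)   # roughly one SD lower
round(c(p1, p2, p2 / p1), 3)
```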

(d) The last part of the output gives the deviance for two models: the model
with no independent variable (null deviance), and the model with the independent
variable, which is bmd in this example (residual deviance).

Null deviance: 157.81 on 136 degrees of freedom
Residual deviance: 155.27 on 135 degrees of freedom
AIC: 159.27

These two numbers show that bmd contributes very little to predicting fracture:
it only lowers the deviance from 157.81 to 155.27, and this reduction is not
statistically significant.
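
One way to put a number on "not statistically significant" is the usual likelihood ratio (deviance) test, which compares the drop in deviance with a chi-square distribution on one degree of freedom. The sketch below is an illustration under that assumption, using the two deviances printed above; the variable names are mine.

```r
null_dev  <- 157.81   # null deviance, 136 df
resid_dev <- 155.27   # residual deviance, 135 df

dev_drop <- null_dev - resid_dev   # reduction due to adding bmd
df_drop  <- 136 - 135              # one extra parameter

# Upper-tail chi-square probability for the drop in deviance
p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
p_value
```

With the fitted object at hand, anova(logistic, test="Chisq") carries out the same comparison.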

R also reports the AIC (Akaike Information Criterion), which is calculated from
the deviance and the degrees of freedom. I will come back to the meaning of AIC
in the coming part, on comparing models.
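
For a binary outcome such as fx, the AIC printed by R equals the residual deviance plus twice the number of estimated parameters. A short check with the numbers from the output above (variable names are illustrative):

```r
resid_dev <- 155.27  # residual deviance from the output above
n_par     <- 2       # intercept and bmd coefficient

aic <- resid_dev + 2 * n_par
aic
```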

12.3 Estimating probabilities with R

Recall that in the analysis above we stored the results in the object logistic.
This object holds a lot of useful information, but to see it we need commands
such as summary. In this section I present a few functions for examining the
information related to predicted probabilities.

predict lists the predicted values of the model

\[ \log\left(\frac{p}{1-p}\right) = \alpha + \beta x \]

for each patient.

> predict(logistic)

1 2 3 4 5 6
2.377576584 1.085694014 -2.141117756 1.492824115 0.965379946 -0.941253280
7 8 9 10 11 12
-1.733686514 -1.675645430 -0.665282957 -0.507046129 -0.941854868 -0.648740461
...

The numbers above are log(p / (1 - p)), that is, log odds, which have little
practical meaning. What we want is the predicted probability p from the equation

\[ p = \frac{e^{1.063 - 2.27\,bmd}}{1 + e^{1.063 - 2.27\,bmd}} \]

To obtain this value for each patient, we pass the argument type="response" to
predict as follows:

> predict(logistic, type="response")
1 2 3 4 5 6 7
0.91510135 0.74757001 0.10516416 0.81650178 0.72419767 0.28064726 0.15011664
8 9 10 11 12 13 14
0.15767295 0.33955387 0.37588624 0.28052582 0.34327343 0.44305196 0.23830776
...

In the output above (only part of which is printed), the estimated probability
of fracture is 0.915 for patient 1, 0.747 for patient 2, and so on.

We can examine these predicted values against bmd with the usual plot function:

> plot(bmd, fitted(glm(fx ~ bmd, family=binomial)))

[Figure: scatter plot of fitted(glm(fx ~ bmd, family = "binomial")) (vertical
axis) against bmd (horizontal axis).]

Predicted probability of fracture (vertical axis) against bmd (horizontal axis)
from the logistic regression model.

The plot above can be improved by using more closely spaced bmd values (for
example 0.50, 0.55, 0.60, ..., 1.20) and by drawing a line instead of points.
The following commands produce the improved plot.

> logistic <- glm(fx ~ bmd, family=binomial)
> fnbmd <- seq(0.5, 1.2, 0.05) # fnbmd takes the values 0.50, 0.55, 0.60, ..., 1.20
> new.data <- data.frame(bmd = fnbmd) # put them in a new data frame
> predicted <- predict(logistic, new.data, type="response")
> plot(predicted ~ fnbmd, type="l")

[Figure: line plot of predicted (vertical axis) against fnbmd (horizontal axis).]

Predicted probability of fracture (vertical axis) against bmd (horizontal axis)
from the logistic regression model.


13. Sample size estimation

A research study is usually based on a sample. One of the most important
questions before starting a study is how many samples, or how many subjects, are
needed. A subject here is the basic unit of the study: a patient, a volunteer, a
field plot, a plant, a device, and so on. Estimating the number of subjects
needed for a study is extremely important, because it can decide whether the
study succeeds or fails. If there are too few subjects, the conclusions drawn
from the study will not be very accurate, or no conclusion can be drawn at all.
Conversely, if there are more subjects than necessary, resources, money and time
are wasted. The key task before a study is therefore to estimate a number of
subjects that is just sufficient for the aims of the study. That number depends
on three main factors:

• the errors that the investigator is willing to accept, namely type I and type
  II errors;
• the variability of the measurement, specifically its standard deviation; and
• the size of the difference or effect that the investigator wants to detect.

Without data on these three factors, sample size cannot be estimated. In my
experience, many people who set out to do a study have no idea about these
quantities, so when they consult statisticians the only answer they get is: it
cannot be calculated! In this chapter I discuss these three factors.

13.1 The concept of power

Statistics is a scientific method whose aim is to discover, or search for,
things that can be summed up in the phrase "the unknown". The unknown here means
phenomena that we have not observed, or have observed only incompletely. It may
be an unknown quantity (such as the average height of Vietnamese people, or the
weight of a molecule), the efficacy of a treatment, a gene that makes leaves
green, human preferences, and so on. We can measure height, or run a trial to
learn about the efficacy of a drug, but such studies are carried out on a group
of subjects only, not on the whole population.

At the simplest level, the unknown comes in one of two forms: yes or no. For
example, a treatment is or is not effective against fracture; customers do or do
not like a soft drink. Because nobody knows the phenomenon completely, we have
to set up hypotheses. The simplest are the null hypothesis (the phenomenon does
not exist, denoted H-) and the main hypothesis (the phenomenon exists, denoted
H+).

We use statistical tests such as the t, F, z and χ² tests to assess how
plausible a hypothesis is. The result of a statistical test can be reduced to
two values: statistically significant or not statistically significant. As
discussed in Chapter 7, statistical significance is usually based on the P
value: if P < 0.05 we say the result is statistically significant; if P > 0.05
we say it is not. Significance and non-significance can also be thought of as
signal and no signal. Let T+ denote a statistically significant test result and
T- a non-significant one.

Consider a concrete example: to find out whether the drug risedronate is
effective in treating osteoporosis, we run a study with two groups of patients
(one treated with risedronate, the other given only placebo). We follow the
patients and record fractures, estimate the fracture rate in each group, and
compare the two rates with a statistical test. The test result is either
statistically significant (P<0.05) or not (P>0.05). Remember that we do not know
whether risedronate really prevents fracture; we can only put forward a
hypothesis H. So when we consider a hypothesis together with a test result,
there are four situations:

(a) Hypothesis H is true (risedronate is effective) and the test result is
P<0.05.

(b) Hypothesis H is true, but the test result is not statistically significant.

(c) Hypothesis H is false (risedronate is not effective), but the test result is
statistically significant.

(d) Hypothesis H is false and the test result is not statistically significant.

Cases (a) and (d) pose no problem, because the test result agrees with the true
state of affairs. In cases (b) and (c) we make an error, because the test result
does not agree with the hypothesis. Statistics has a few terms for these
situations:

• The probability of situation (b) is called the type II error, usually denoted
  β.

• The probability of situation (a) is called power. In other words, power is the
  probability that the test gives p<0.05 given that hypothesis H is true:
  power = 1 - β.

• The probability of situation (c) is called the type I error (or significance
  level), usually denoted α. In other words, α is the probability that the test
  gives p<0.05 given that hypothesis H is false.

• The probability of situation (d) is not a matter of concern and has no special
  term, although it may be called a true negative.

The four situations can be summarized in Table 1 below:

Situations that arise when testing a scientific hypothesis

                                      Hypothesis H
Statistical test result               True                      False
                                      (drug is effective)       (drug is not effective)

Statistically significant (p<0.05)    True positive (power)     Type I error
                                      1-β = P(s | H+)           α = P(s | H-)

Not statistically significant         Type II error             True negative
(p>0.05)                              β = P(ns | H+)            1-α = P(ns | H-)

Note: in this table s stands for significant and ns for non-significant; H+
means the hypothesis is true and H- means it is false. The four situations can
thus be written as conditional probabilities: Power = 1 - β = P(s | H+);
β = P(ns | H+); and α = P(s | H-).

13.2 Data needed to estimate sample size

As mentioned at the start of this chapter, to estimate the number of subjects
needed for a study we need three pieces of information: the probabilities of
type I and type II error, the variability of the measurement, and the effect
size.

• For the error probabilities, a study usually accepts a type I error of about
  1% or 5% (α = 0.01 or 0.05) and a type II error of about β = 0.1 to β = 0.2
  (that is, power should be 0.8 to 0.9).

• Variability is the standard deviation of the measurement on which the study
  bases its analysis. For example, a study of hypertension needs the standard
  deviation of blood pressure. We will denote variability by σ.

• The effect size, for a study comparing two groups, is the difference in means
  between the groups that the investigator wants to detect. For example, the
  investigator may hypothesize that patients treated with drug A have blood
  pressure 10 mmHg lower than the placebo group. Here 10 mmHg is the effect
  size. We will denote effect size by Δ.

A study may have one group of subjects or two (sometimes more than two) groups,
and the sample size calculation depends on which case applies.

With one group of subjects, the number of subjects (n) needed for the study can
be worked out by hand as follows:

\[ n = \frac{C}{(\Delta/\sigma)^2} \qquad [1] \]

With two groups of subjects, the number of subjects (n, per group) needed for
the study can be calculated as follows:

\[ n = \frac{2C}{(\Delta/\sigma)^2} \qquad [2] \]



Here the constant C is determined by the probabilities of type I and type II
error (or power), as follows:

Constant C in relation to type I and type II errors

α        β = 0.20          β = 0.10          β = 0.05
         (Power = 0.80)    (Power = 0.90)    (Power = 0.95)
0.10     6.15              8.53              10.79
0.05     7.85              10.51             13.00
0.01     11.68             14.88             17.81
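
The text does not say how C is obtained. A minimal sketch, assuming the standard definition C = (z_{α/2} + z_β)² with z taken from the standard normal distribution, reproduces the α = 0.05 row of the table; C_const is a hypothetical helper name.

```r
# C = (z_{alpha/2} + z_beta)^2 (assumed definition, not stated in the text)
C_const <- function(alpha, beta) {
  (qnorm(1 - alpha / 2) + qnorm(1 - beta))^2
}

# Values of C for alpha = 0.05 and beta = 0.20, 0.10, 0.05
c05 <- sapply(c(0.20, 0.10, 0.05), function(b) C_const(0.05, b))
round(c05, 2)
```

The same function can be called with other values of alpha and beta to fill in or check the remaining rows.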

13.4 Sample size estimation

13.4.1 Sample size for estimating a mean

Example 20: We want to estimate the height of Vietnamese men and accept an error
within 1 cm (d = 1), with 95% confidence (α = 0.05) and power = 0.8 (β = 0.2).
Earlier studies put the standard deviation of height in Vietnamese people at
about 4.6 cm. We can apply formula [1] to estimate the sample size needed:

\[ n = \frac{C}{(\Delta/\sigma)^2} = \frac{7.85}{(1/4.6)^2} = 166 \]


In other words, we need to measure the height of 166 subjects to estimate the
height of Vietnamese men to within 1 cm.

If the acceptable error is 0.5 cm (instead of 1 cm), the number of subjects
needed is $n = 7.85 / (0.5/4.6)^2 = 664$. If the acceptable error is 0.1 cm, the
number of subjects rises to 16610! These estimates make it clear that sample
size depends heavily on the error we are willing to accept: the more precise the
estimate we want, the more subjects we need.
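
The three hand calculations follow directly from formula [1]. A short sketch, with C fixed at 7.85 (α = 0.05, power = 0.80) and a hypothetical helper n_one_group:

```r
# Formula [1]: n = C / (delta / sigma)^2
n_one_group <- function(delta, sigma, C = 7.85) {
  C / (delta / sigma)^2
}

# Acceptable errors of 1, 0.5 and 0.1 cm with sigma = 4.6 cm
n_hand <- sapply(c(1, 0.5, 0.1), n_one_group, sigma = 4.6)
n_hand
```

Halving the acceptable error multiplies the required sample size by four, because the error enters the formula squared.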

The R function power.t.test can be used to estimate the sample size for the
example above, as follows. Note that we tell R this is a one-group problem with
type="one.sample":

# error of 1 cm, standard deviation 4.6, alpha=0.05, power=0.8
> power.t.test(delta=1, sd=4.6, sig.level=.05, power=.80,
type='one.sample')

One-sample t test power calculation

n = 168.0131
delta = 1
sd = 4.6
sig.level = 0.05
power = 0.8
alternative = two.sided

R gives 168, two subjects more than the hand calculation, mainly because
power.t.test works with the t distribution whereas the constant C comes from the
normal approximation. For an error of 0.5 cm:

# error of 0.5 cm, standard deviation 4.6, alpha=0.05, power=0.8
> power.t.test(delta=0.5, sd=4.6, sig.level=.05, power=.80,
type='one.sample')

One-sample t test power calculation

n = 666.2525
delta = 0.5
sd = 4.6
sig.level = 0.05
power = 0.8
alternative = two.sided


Example 21: A drug is able to raise alkaline phosphatase in patients with
osteoporosis. The standard deviation of alkaline phosphatase is 15 U/l. A new
study is to be run in a population of Vietnamese patients, and the investigators
want to know how many patients to recruit in order to show that the drug can
raise alkaline phosphatase from 60 to 65 U/l after 3 months of treatment, with
type I error = 0.05 and power = 0.8.

This is a before-after study, meaning that measurements are taken before and
after treatment. There is only one group of patients, but each patient is
measured twice (before and after taking the drug). The clinical endpoint for
judging the drug is the change in alkaline phosphatase. In this case the mean
increase is 5 U/l and the standard deviation is 15 U/l, or in R terms delta=5,
sd=15, sig.level=.05, power=.80. Note, however, that the command printed below
was run with delta=3 (a mean increase of 3 U/l), so the sample size it reports
applies to that smaller effect:

> power.t.test(delta=3, sd=15, sig.level=.05, power=.80,
type='one.sample')

One-sample t test power calculation

n = 198.1513
delta = 3
sd = 15
sig.level = 0.05
power = 0.8
alternative = two.sided


So we need 198 patients to detect a mean increase of 3 U/l (the delta used in
the command above) under these conditions.
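
Example 21 describes a mean increase of 5 U/l (from 60 to 65), whereas the command shown uses delta=3. For comparison, the same call with delta=5 could be written as in the sketch below; res is an illustrative name, and the required n will be considerably smaller than 198 because the effect to be detected is larger.

```r
# Same design as the command above, but with the 5 U/l increase from the text
res <- power.t.test(delta = 5, sd = 15, sig.level = 0.05, power = 0.80,
                    type = "one.sample")

ceiling(res$n)   # round up to a whole number of patients
```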


13.4.2 Sample size for comparing two means

In practice, many studies aim to compare two groups. Sample size estimation for
such studies relies mainly on formula [2], presented in section 13.2.

Example 22: A study is designed to test the drug alendronate in the treatment of
osteoporosis in postmenopausal women. Two groups of patients are recruited:
group 1 is the intervention group (treated with alendronate) and group 2 is the
control group (untreated). The criterion for judging the drug's efficacy is bone
mineral density (BMD). Epidemiological data show that the mean BMD in
postmenopausal women is 0.80 g/cm², with a standard deviation of 0.12 g/cm². The
question is how many subjects we need to study in order to show that after 12
months of treatment BMD in group 1 is about 5% higher than in group 2.

In this example, let μ2 be the mean of group 2 and μ1 the mean of group 1. We
have μ1 = 0.8 × 1.05 = 0.84 g/cm² (5% higher than group 2), so
Δ = 0.84 - 0.80 = 0.04 g/cm². The standard deviation is σ = 0.12 g/cm². With
power = 0.90 and α = 0.05, the sample size needed is:

\[ n = \frac{2C}{(\Delta/\sigma)^2} = \frac{2 \times 10.51}{(0.04/0.12)^2} = 189 \]


The solution from R, using power.t.test, is:

> power.t.test(delta=0.04, sd=0.12, sig.level=0.05, power=0.90,
type="two.sample")

Two-sample t test power calculation

n = 190.0991
delta = 0.04
sd = 0.12
sig.level = 0.05
power = 0.9
alternative = two.sided

NOTE: n is number in *each* group

Note that in power.t.test, besides the usual arguments delta (the hypothesized
effect or difference), sd (standard deviation), sig.level (type I error
probability) and power, we must state explicitly that the study has two groups,
with the argument type="two.sample".

The output shows that we need 190 patients per group (380 patients for the whole
study). What do power = 0.90 and α = 0.05 mean here? They mean that if we ran
the study a great many times (say 1000), each with 380 patients, and the true
difference were as hypothesized, about 90% of the studies (900) would give
p < 0.05.
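
This interpretation of power can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: BMD is taken to be normally distributed with the means and standard deviation of Example 22, and 1000 hypothetical studies of 190 patients per group are generated with a fixed random seed.

```r
set.seed(2006)
n_per_group <- 190    # sample size per group from power.t.test
n_studies   <- 1000   # number of simulated studies

p_values <- replicate(n_studies, {
  control   <- rnorm(n_per_group, mean = 0.80, sd = 0.12)
  treatment <- rnorm(n_per_group, mean = 0.84, sd = 0.12)
  t.test(treatment, control, var.equal = TRUE)$p.value
})

# Proportion of simulated studies with p < 0.05; expected to be close to 0.90
prop_significant <- mean(p_values < 0.05)
prop_significant
```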


13.4.3 Sample size for analysis of variance

The method for comparing two groups can be extended to estimate sample size when
more than two groups are compared. With several groups, as discussed in Chapter
11, the method of comparison is analysis of variance. In this method the
residual mean square (RMS) is the estimate of the variability of the measurement
within each group, and it plays a very important role in estimating sample size.

The theory behind sample size estimation for analysis of variance is fairly
complex and beyond the scope of this chapter, but the main principle is no
different from the two-group case. Let the means of the k groups be μ1, μ2, μ3,
..., μk. The between-group sum of squares is

\[ SS = \sum_{i=1}^{k} (\mu_i - \bar{\mu})^2, \qquad \text{where } \bar{\mu} = \frac{1}{k}\sum_{i=1}^{k} \mu_i . \]

Let

\[ \lambda = \frac{SS}{(k-1)\,RMS} . \]

The problem is then to find the sample size n (per group) such that $z_\beta$
meets the required power of 0.80 or 0.90, where

\[ z_\beta = \frac{\sqrt{k(n-1)\left[2(k-1)(1+n\lambda)^2 - (1+2n\lambda)\right]} - \sqrt{\left[2k(n-1)-1\right](k-1)(1+n\lambda)F}}{\sqrt{(k-1)(1+n\lambda)F + k(n-1)(1+2n\lambda)}} \]


Here F is the critical value of the F test. (See J. Fleiss, The Design and
Analysis of Clinical Experiments, John Wiley & Sons, New York 1986, page 373.)

Example 23. To compare the sweetness of a drink among 4 groups of subjects that
differ in sex and age (call them A, B, C and D), the investigators hypothesize
that sweetness in groups A, B, C and D is 4.5, 3.0, 5.6 and 1.3 respectively.
From a review of earlier studies they also know that the RMS of sweetness within
each group is about 8.7. The question is how many subjects are needed to detect
a statistically significant difference at α = 0.05 with power = 0.9.

The R function power.anova.test can be used for this problem. We simply supply
the 4 hypothesized means and the RMS as follows:

# first put the 4 means into a vector
> groupmeans <- c(4.5, 3.0, 5.6, 1.3)

# then call power.anova.test:
> power.anova.test(groups = length(groupmeans),
between.var=var(groupmeans),
within.var=8.7, power=0.90, sig.level=0.05)

Balanced one-way analysis of variance power calculation

groups = 4
n = 12.81152
between.var = 3.486667
within.var = 8.7
sig.level = 0.05
power = 0.9

NOTE: n is number in each group

The output shows that the investigators need about 13 subjects per group (52
subjects for the whole study).
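
Because 12.8 is rounded up to 13, the achieved power will be slightly above the target. One way to see this is to call power.anova.test again with n fixed and power left out, so that R solves for power instead. A minimal sketch (check is an illustrative name):

```r
groupmeans <- c(4.5, 3.0, 5.6, 1.3)

# Fix n = 13 per group and let R compute the resulting power
check <- power.anova.test(groups = length(groupmeans), n = 13,
                          between.var = var(groupmeans),
                          within.var = 8.7, sig.level = 0.05)
check$power
```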


13.4.4 Sample size for estimating a proportion

Many descriptive studies have the fairly simple aim of estimating a proportion.
For example, health workers often want to know the prevalence of a disease in
the community, and opinion or market researchers often want to know the
proportion of the population that likes a product. In these cases we have no
continuous measurements; the outcomes are binary values such as yes / no or
like / dislike. The way sample size is estimated therefore differs from the
three examples above.

In 1991 an opinion poll in the United States found that 45% of respondents were
willing to encourage their children to donate a kidney to patients in need. The
95% confidence interval of this proportion was 42% to 48%, a range of 6
percentage points! The result is [relatively] imprecise, even though as many as
1000 people took part. Why? To answer this question, we look briefly at the
theory of sample size estimation for a proportion.

We know from Chapters 6 and 9 that if $\hat{p}$ is estimated from n subjects,
the 95% confidence interval of the [population] proportion p is
$\hat{p} \pm 1.96 \times SE(\hat{p})$, where
$SE(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$.

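Now turn the problem around: we want to estimate p such that the margin of error
$1.96 \times SE(\hat{p})$ (half the width of the confidence interval) does not
exceed a constant m. In other words, we want:

\[ 1.96\sqrt{p(1-p)/n} \le m \]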

We want to find the number of subjects n that satisfies this requirement. From
the expression above it follows that:

\[ n \ge \left(\frac{1.96}{m}\right)^2 p(1-p) \]


The sample size therefore depends on the error m and on the proportion p we want
to estimate. The smaller the error, the larger the sample size.

Example 24: We want to estimate the proportion of men who smoke in Vietnam, such
that the estimate is no more than 2% above or below the true proportion in the
whole population. An earlier study suggests that the smoking rate among
Vietnamese men may be as high as 70%. The question is how many men we need to
study to meet this requirement.

In this example the error is m = 0.02 and p = 0.70, so the sample size needed
is:

\[ n \ge \left(\frac{1.96}{0.02}\right)^2 \times 0.7 \times 0.3 \approx 2017 \]


In other words, we need to study at least 2017 men.

If we want to reduce the error from 2% to 1% (m = 0.01), the number of subjects
becomes 8067! Gaining just 1% in precision adds more than 6000 people to the
sample. Sample size estimation must therefore be done with great care, balancing
the precision of the information to be collected against cost.

R has no function for estimating the sample size for a single proportion, but
with the formula above readers can easily write one.
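
One possible version of such a function is sketched below. The name n_proportion and its arguments are my own choices, not part of R; the function uses the exact normal quantile instead of the rounded 1.96 and rounds the result up to a whole subject.

```r
# Sample size so that the margin of error of a proportion does not exceed m
# p: anticipated proportion, m: margin of error, conf.level: confidence level
n_proportion <- function(p, m, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)   # about 1.96 for 95% confidence
  ceiling((z / m)^2 * p * (1 - p))       # round up to a whole subject
}

n_proportion(p = 0.70, m = 0.02)   # Example 24
```

If no prior estimate of p is available, p = 0.5 gives the largest (most conservative) sample size, since p(1 - p) is maximal there.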


13.4.5 Sample size for comparing two proportions

Many inferential studies have two [or more than two] groups to compare. In
section 13.4.2 we met the method for estimating sample size when comparing two
means with the t test. Those are studies whose endpoints are continuous
variables. But some studies have endpoints that are not continuous but binary,
as I discussed in the previous section. To compare two proportions, the most
commonly used tests are the binomial test and the Chi-square (χ²) test. In this
section I discuss how to calculate sample size for these two tests.


Let the two proportions [which we do not know but want to learn about] be p1 and
p2, and let Δ = p1 - p2. The hypothesis we want to test is Δ = 0. The theory
behind sample size estimation for testing this hypothesis is rather involved,
but it can be summarized in the following formula:

\[ n = \frac{\left( z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^2}{\Delta^2} \]


where $\bar{p} = (p_1 + p_2)/2$, $z_{\alpha/2}$ is the z value of the standard
normal distribution for probability α/2 (for example, $z_{\alpha/2}$ = 1.96 when
α = 0.05 and $z_{\alpha/2}$ = 2.57 when α = 0.01), and $z_\beta$ is the z value
of the standard normal distribution for probability β (for example, $z_\beta$ =
1.28 when β = 0.10 and $z_\beta$ = 0.84 when β = 0.20).

Example 25: A randomized controlled clinical trial is designed to evaluate the
efficacy of a drug against vertebral fracture. Two groups of patients will be
recruited. Group 1 is treated with the drug, and group 2 is the control group
(untreated). The investigators hypothesize that the fracture rate in group 2 is
about 10% and that the drug can lower it to about 6%. If the investigators want
to test this hypothesis with type I error α = 0.01 and power = 0.90, how many
patients need to be recruited?

Here we have Δ = 0.10 - 0.06 = 0.04 and $\bar{p}$ = (0.10 + 0.06)/2 = 0.08. With
α = 0.01, $z_{\alpha/2}$ = 2.57, and with power = 0.90, $z_\beta$ = 1.28. The
number of patients needed in each group is therefore:

\[ n = \frac{\left( 2.57\sqrt{2 \times 0.08 \times 0.92} + 1.28\sqrt{0.1 \times 0.90 + 0.06 \times 0.94} \right)^2}{0.04^2} = 1361 \]

So this study needs to recruit at least 2722 patients to test the hypothesis
above.
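
The hand calculation can be wrapped in a small R function. This is a sketch of the formula above, not an official routine: n_two_props is a hypothetical name, and it uses exact normal quantiles from qnorm rather than the rounded values 2.57 and 1.28, so its result comes out a few subjects larger than 1361.

```r
# Sample size per group for comparing two proportions (formula above)
n_two_props <- function(p1, p2, alpha = 0.05, power = 0.80) {
  p_bar   <- (p1 + p2) / 2
  z_alpha <- qnorm(1 - alpha / 2)   # z value for alpha/2
  z_beta  <- qnorm(power)           # z value for beta = 1 - power
  num <- (z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
          z_beta  * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2
  num / (p1 - p2)^2
}

n_ex25 <- n_two_props(0.10, 0.06, alpha = 0.01, power = 0.90)
ceiling(n_ex25)   # round up to a whole number of patients per group
```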

The R function power.prop.test can be used to calculate the sample size in this
case. It needs information such as power, sig.level, p1 and p2. For the example
above we can write:

> power.prop.test(p1=0.10, p2=0.06, power=0.90, sig.level=0.01)

Two-sample comparison of proportions power calculation

n = 1366.430
p1 = 0.1
p2 = 0.06
sig.level = 0.01
power = 0.9
alternative = two.sided

NOTE: n is number in *each* group

Note that the result from R is somewhat more precise (1366 subjects per group)
because R carries more decimal places in the calculation than the hand
calculation does.

Before leaving this chapter, I would like to stress once more that sample size
estimation is an extremely important step in designing a scientifically
meaningful study, because it can decide the success or failure of the study.
Before estimating sample size, the investigator needs prior knowledge (or at
least some specific assumptions) about the problem of interest. Sample size
estimation requires the quantities described at the start of the chapter, and
without them no estimate is possible. For an entirely new study, one that nobody
has done before, the effect size and the variability of the measurement may not
be available, and the investigator will need to run a simulation or a pilot
study to obtain them. Sample size estimation by simulation is a rather
specialized area that is outside the scope of this book, but readers can learn
more about it from more advanced statistics textbooks.

The above are some quick guidelines to help readers use R for data analysis and
graphics. This document is essentially a summary of the book Phân tích số liệu
và tạo biểu đồ bằng R: hướng dẫn và thực hành (Data analysis and graphics with
R: a guide with practice), published by the Ho Chi Minh City National University
Press in 2006. Details of the theory and of other methods such as event
analysis, statistical model building, simulation, programming, and so on can be
found in that book.


14. References

At present the library of books on R is still fairly modest compared with that
for commercial software such as SAS and SPSS. However, in the current age of
extraordinary progress in internet information and globalization, printed books
and books published on websites are no longer very different. Most guidance on
using R can be found scattered across websites of universities and individuals
around the world. In this section I list only some books that readers should
look up if they need further reference. In writing the book you are holding, I
also consulted a number of books and web pages, which I list below with a few
personal comments.

The main reference on R is the paper by the two creators of R: Ihaka R,
Gentleman R. R: A language for data analysis and graphics. Journal of
Computational and Graphical Statistics 1996; 5:299-314.

• Data Analysis and Graphics Using R - An Example Approach (Cambridge
  University Press, 2003) by John Maindonald, now reprinted in a second edition
  with a new co-author, John Braun. This is a very useful book for anyone who
  wants to learn R. The first five chapters are written for readers new to R,
  and the later chapters for readers who already use R fluently.

• Introductory Statistics With R (Springer, 2004) by Peter Dalgaard is a basic
  book on R aimed at readers who know nothing about R. The book is fairly short
  (only about 200 pages) but rather expensive!

• Linear Models with R (Chapman & Hall/CRC, 2004) by Julian Faraway. The book
  can currently be downloaded free of charge from
  http://www.stat.lsa.umich.edu/~faraway/book/pra.pdf or
  http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf. The document is 213
  pages long.

• R Graphics (Computer Science and Data Analysis) (Chapman & Hall/CRC, 2005) by
  Paul Murrell. This book is devoted to graphics in R. It contains a great deal
  of code that readers can use to design complex, attractive plots of their own.

• Modern Applied Statistics with S-Plus (Springer, 4th Edition, 2003) by W. N.
  Venables and B. D. Ripley is written for the S-Plus language, but all the
  commands and code in it can be used in R without change. (S-Plus is the
  forerunner of R, but S-Plus is commercial software whereas R is completely
  free!) This may be regarded as the reference book for anyone who wants to go
  further with R. The two authors are also leading authorities on the R
  language. The book is aimed at readers with an advanced background in
  computing and statistics.

Important or useful websites about R

Many reference documents can be downloaded from the official R website:
http://cran.R-project.org/other-docs.html

Among them are some important documents such as An Introduction to R by W. N.
Venables and B. D. Ripley.
Internet address: http://cran.r-project.org/doc/manuals/R-intro.pdf.

Some guides to using R that can be downloaded (free) and consulted are:

• R for Beginners (57 pages) by Emmanuel Paradis. Written for readers who are
  new to R.
  Internet address: http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf.

• Using R for Data Analysis and Graphics: Introduction, Code and Commentary
  (35 pages) by John Maindonald is a summary of the basic R commands and
  functions for data analysis and graphics. Its subject matter is very close to
  that of the book you are reading.
  Internet address: http://cran.r-project.org/doc/contrib/usingR.pdf

• Statistical Analysis with R - a quick start (46 pages) by Oleg Nenadic and
  Walter Zucchini. A guide to applying R to statistical analysis and graphics.
  Internet address: http://www.statoek.wiso.uni-goettingen.de/mitarbeiter/ogi/pub/r_workshop.pdf

• A Brief Guide to R for Beginners in Econometrics (31 pages) by M. Arai.
  Written mainly for econometricians.
  Internet address: http://people.su.se/~ma/R_intro

• Notes on the use of R for psychology experiments and questionnaires (39
  pages) by Jonathan Baron and Yuelin Li. Written for researchers in psychology
  and sociology. Includes examples of log-linear models and some analysis of
  variance models used in psychology.
  Internet address: http://www.psych.upenn.edu/~baron/rpsych/rpsych.html

• StatsRus, a collection of tips for using R more effectively (about 80 pages).
  Internet address: http://lark.cc.ukans.edu/pauljohn/R/statsRus.html

• And finally, a guide to using R for data analysis and graphics (about 50
  pages, regularly updated) that I wrote myself in Vietnamese. The website
  www.R.ykhoa.net is essentially a summary of the main chapters of this book.
  The site also holds all the data sets and code used in the book, which
  readers can download to their own computers.

15. Terms used in the book


English                                 Vietnamese
95% confidence interval                 Khoảng tin cậy 95%
Akaike Information criterion (AIC)      Tiêu chuẩn thông tin Akaike
Analysis of covariance                  Phân tích hiệp biến
Analysis of variance (ANOVA)            Phân tích phương sai
Bar chart                               Biểu đồ thanh
Binomial distribution                   Phân phối nhị phân
Box plot                                Biểu đồ hình hộp
Categorical variable                    Biến thứ bậc
Clock chart                             Biểu đồ đồng hồ
Coefficient of correlation              Hệ số tương quan
Coefficient of determination            Hệ số xác định bội
Coefficient of heterogeneity            Hệ số bất đồng nhất
Combination                             Tổ hợp
Continuous variable                     Biến liên tục
Correlation                             Tương quan
Covariance                              Hợp biến
Cross-over experiment                   Thí nghiệm giao chéo
Cumulative probability distribution     Hàm phân phối tích lũy
Degree of freedom                       Bậc tự do
Determinant                             Định thức
Discrete variable                       Biến rời rạc
Dot chart                               Biểu đồ điểm
Estimate                                Ước số
Estimator                               Hàm ước lượng thống kê
Factorial analysis of variance          Phân tích phương sai cho thí nghiệm giai thừa
Fixed effects                           Ảnh hưởng bất biến
Frequency                               Tần số
Function                                Hàm
Heterogeneity                           Bất đồng nhất
Histogram                               Biểu đồ tần số
Homogeneity                             Đồng nhất
Hypothesis test                         Kiểm định giả thiết
Inverse matrix                          Ma trận nghịch đảo
Latin square experiment                 Thí nghiệm hình vuông Latin
Least squares method                    Phương pháp bình phương nhỏ nhất
Linear Logistic regression analysis     Phân tích hồi qui tuyến tính logistic
Linear regression analysis              Phân tích hồi qui tuyến tính
Matrix                                  Ma trận
Maximum likelihood method               Phương pháp hợp lí cực đại
Mean                                    Số trung bình
Median                                  Số trung vị
Meta-analysis                           Phân tích tổng hợp
Missing value                           Giá trị không
Model                                   Mô hình
Multiple linear regression analysis     Phân tích hồi qui tuyến tính đa biến
Normal distribution                     Phân phối chuẩn
Object                                  Đối tượng
Parameter                               Thông số
Permutation                             Hoán vị
Pie chart                               Biểu đồ hình tròn
Poisson distribution                    Phân phối Poisson
Polynomial regression                   Hồi qui đa thức
Probability                             Xác suất
Probability density distribution        Hàm mật độ xác suất
P-value                                 Trị số P
Quantile                                Hàm định bậc
Random effects                          Ảnh hưởng ngẫu nhiên
Random variable                         Biến ngẫu nhiên
Relative risk                           Tỉ số nguy cơ tương đối
Repeated measure experiment             Thí nghiệm tái đo lường
Residual                                Phần dư
Residual mean square                    Trung bình bình phương phần dư
Residual sum of squares                 Tổng bình phương phần dư
Scalar matrix                           Ma trận vô hướng
Scatter plot                            Biểu đồ tán xạ
Significance                            Có ý nghĩa thống kê
Simulation                              Mô phỏng
Standard deviation                      Độ lệch chuẩn
Standard error                          Sai số chuẩn
Standardized normal distribution        Phân phối chuẩn chuẩn hóa
Survival analysis                       Phân tích biến cố
Transposed matrix                       Ma trận chuyển vị
Variable                                Biến (biến số)
Variance                                Phương sai
Weight                                  Trọng số
Weighted mean                           Trung bình trọng số
