Lesson Notes - Chapter 28: Comparing Multiple Means (ANOVA)
A researcher wishes to investigate the effectiveness of different treatments for removing bacteria from human hands: washing with water only, washing with regular soap, washing with antibacterial soap, and using an antibacterial spray containing 65% ethanol.
Each morning, she used one, randomly chosen, cleaning technique on her hands, and then placed her hand on a sterile media plate (used for growing, and counting, bacteria). She replicated this procedure 8 times for each of the 4 treatments. Media plates were incubated for 2 days at constant temperature, and the number of bacteria on each plate was recorded.
A side-by-side boxplot of the numbers of bacteria recorded for each treatment is shown:
[Side-by-side boxplots of bacteria counts for each treatment]
Are the treatments equally effective?
There is some variation in the mean number of bacteria for each treatment method.
But there is also variation from sample to sample within each group.
We expect some variation between means, but whether or not this is statistically significant really depends upon the variances within each of the groups. Consider these two sets of data:
[Two example data sets with identical group means but different within-group spreads]
The means for the 4 groups in each data set are the same (31, 36, 98, and 31), but we would conclude that in the left data set these means are all within what would be normally expected variation. In the right data set, these same differences between means are instead much larger than the expected variation.
For this reason, when comparing multiple means, the analysis is really an analysis of the variance, and this type of analysis is called Analysis of Variance (ANOVA).

We need a statistic and a distribution to use to determine a p-value. The statistic needs to compare the difference in the means of the groups to the variability expected (as determined by the variability within each group, somehow combined to include the variances of each of the groups).
This statistic should be larger if the difference between the group means is larger, and it should also become larger if the variability within the groups is smaller, which means the numerator of this statistic should address the difference between group means and the denominator of this statistic should address the variability within the groups.
The numerator of this statistic should address the difference between group means. If there were only two means to compare, we could just examine the difference. With multiple means, if we just treat each mean as a data value, then the variability can be measured by first finding the variance of this small set of mean values. For the hand-washing experiment:

s_ȳ² = 1245.08
But each data value is a mean of a sample, each of size n = 8, so we could determine the corresponding variance of the original observations from the sampling distribution standard deviation formula SD(ȳ) = σ/√n, which gives:

n · s_ȳ² = 8(1245.08) = 9960.67
This estimate of σ² addresses the difference between group means and is called the treatment mean square, written MS_T or MST.
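As a quick check, the MS_T computation can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original calculator workflow; the four group means used here (37.5, 92.5, 106.0, and 117.0) are taken from the summary table near the end of these notes.

```python
# Sketch of the MS_T computation for the hand-washing example.
# The four group means come from the summary table later in these notes.
from statistics import variance

group_means = [37.5, 92.5, 106.0, 117.0]
n = 8  # observations per treatment group

s2_ybar = variance(group_means)  # sample variance of the 4 group means
ms_t = n * s2_ybar               # scale up to the variance of the observations

print(round(s2_ybar, 2))  # 1245.08
print(round(ms_t, 2))     # 9960.67
```

Note that multiplying by n undoes the 1/√n shrinkage of the sampling distribution, turning a variance of means into an estimate of the variance of individual observations.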
The denominator of this statistic should address the variability within the groups. For this, we need to consider the variances of each group, and somehow combine these into one overall measure of the total variability in all the groups.
We started with subjects randomly assigned to treatments but all taken from the same population. When we do hypothesis testing, our null hypothesis will assume that there is no difference between the treatments, so it's reasonable to assume the variance won't be affected by the treatments. So each group variance is estimating the same common σ².
To combine, we'll simply pool the samples from the groups, and because n is the same for all groups, we can find this measure of variability within the groups by taking the average of the group variances. This estimate of σ² addresses the variability within the groups and is called the error mean square, written MS_E or MSE or s_p²:

MS_E = (s₁² + s₂² + s₃² + s₄²) / 4 = 1410.10
The F-statistic

The statistic is then the ratio of these two estimates of the population variance σ², and is called the F-statistic (in honor of Sir Ronald Fisher):

F = MS_T / MS_E

Hypotheses:
The effect we are testing is whether the difference in the means is significant, so the null hypothesis would be that there is no difference at all in the means:

H₀: μ₁ = μ₂ = μ₃ = μ₄
H_A: The group means are not all equal.
If H₀ is true, the variation in group means should be around the same as the variation within the groups, so the F-statistic should be around 1. If the variation in group means is large compared with the variation within the groups, the F-statistic will be higher than 1.
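A small simulation can illustrate this behavior. The sketch below uses hypothetical data drawn from a single common population, so H₀ is true by construction; computing the F-ratio many times shows its average landing near 1.

```python
# Simulation sketch: when H0 is true (all groups drawn from one population),
# the F-statistic hovers around 1.
import random
from statistics import mean, variance

random.seed(1)
k, n, reps = 4, 8, 2000
f_values = []
for _ in range(reps):
    groups = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]
    ms_t = n * variance([mean(g) for g in groups])  # between-group estimate
    ms_e = mean(variance(g) for g in groups)        # within-group estimate
    f_values.append(ms_t / ms_e)

print(round(mean(f_values), 2))  # close to 1
```

Both MS_T and MS_E estimate the same σ² under H₀, which is exactly why their ratio centers near 1.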
The F-distribution

In order to use the F-statistic to compute a p-value, we need a sampling distribution model for the F-statistic. This was investigated by Sir Ronald Fisher, and the resulting distribution is different from the Normal, t-, and Chi-square distributions we've already seen. It is called the F-distribution, and like the Chi-square distribution it is positive and one-sided. It also depends upon two different degrees of freedom, one for the numerator and one for the denominator of the F-statistic:
Numerator: df_num = k − 1, where k = # of groups
Denominator: df_denom = N − k, where N = kn (# of all samples combined)

Usually, these are just listed in order: df = k − 1 and N − k
Back to the hand-washing example.
[Summary statistics (mean, standard deviation, variance) for each of the 4 treatment groups]
• Entering the group means into L1, we can get the standard deviation with 1-Var Stats (35.2857), square it to get the variance (1245.08), and multiply by n = 8 to get MST = 9960.64.
• Taking the average of the group variances gives us MSE = 1410.10.
F = MS_T / MS_E = 9960.64 / 1410.10 = 7.06
• We need the two degrees of freedom:

df_num = k − 1 = 3
df_denom = N − k = 32 − 4 = 28
To compute the p-value, we use a calculator function:

p-value = Fcdf(7.06, 999, 3, 28) = 0.0011
With such a low p-value, we reject H₀. There is sufficient evidence that the group means are not all equal (the hand washing methods are not equally effective).
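The same arithmetic can be sketched in a few lines of Python, using the MS values computed above:

```python
# F-statistic and degrees of freedom for the hand-washing example.
ms_t, ms_e = 9960.64, 1410.10
k, n = 4, 8        # 4 treatments, 8 replicates each
N = k * n          # 32 observations in total

f_stat = ms_t / ms_e
df_num, df_denom = k - 1, N - k

print(round(f_stat, 2))   # 7.06
print(df_num, df_denom)   # 3 28
```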
Technology vs. tables
The F-distribution is particularly difficult to use without technology, because of the two degrees of freedom. For an F-distribution table, the rows and columns must each specify one of the degrees of freedom, so an entire page must be used for a single significance level (such as 0.05). Here is an excerpt from a portion of the table for the 0.05 level:
[Excerpt from an F-distribution table for the 0.05 significance level]
Much better to use a calculator or software which handles all the cases automatically.
Before we talk about conditions...

We need to consider two other ideas before we talk about the conditions for conducting F-statistic tests: that the ANOVA analysis can be thought of as a form of regression, and s_p as a standard deviation.
The ANOVA model

You can think of each individual data value (observation) as being a distance away from its group mean, and therefore made up of the group mean plus an error term called a residual. The i-th observation from the k-th group would be:

y_ik = ȳ_k + e_ik

Then e_ik would be the "error" or residual for this observation: e_ik = y_ik − ȳ_k

The MS_E is the variance of these errors, and the MS_T comes from the variance of the group means.
We can think of this as being similar to regression analysis, where we have a fitted or predicted value for each observation which is just the mean for the group: ŷ_ik = ȳ_k
So then the underlying "true" model assumes that the observations are samples varying around the true mean for the group, μ_k, and we could write this expression for the underlying ANOVA model:

y_ik = μ_k + ε_ik

(μ_k in place of ȳ_k, and ε in place of e for the population)

Residual Standard Deviation
We've been using variances because they are easy to combine, but variance is not in the units of the problem; standard deviation is. If we want a measure of variability of the data, this would be the standard deviation of the residuals of the ANOVA model. This is called the residual standard deviation and represents the pooled standard deviation:

s_p = √MS_E
This standard deviation should be representative of the standard deviation of each of the groups (because these should be approximately equal). In the handwashing example:

s_p = √1410.10 ≈ 37.6 bacteria colonies

which seems reasonable as a value representing the standard deviation of all of the groups.
[Side-by-side boxplots of the treatment groups, each with spread comparable to s_p]
Conditions

1) Plot the data first (show the side-by-side boxplots).

2) Independence / Randomization Condition:
• Groups must be independent of each other.
• Data within each treatment group must be independent too.
• Data collected with suitable randomization.
3) Equal variance assumption

We need a pooled variance for MS_E, so the variances of the treatment groups must be approximately equal. Ways to check:
• Look at side-by-side boxplots to see whether they have roughly the same spread, or look at side-by-side boxplots of the residuals (which moves all the centers together). If groups have different spreads it makes MS_E larger, reducing the F-statistic and making it less likely we reject the null, so ANOVA usually fails on the "safe side". Because of this, we only worry about this condition if the spreads are quite different from each other.
• Look for systematic changes in spread with change in center (wider spread for higher center values, etc.). Often, this can be fixed with re-expression, especially for skewed groups.
• Look at the residuals plotted against the predicted values. Look for larger predicted values leading to larger residuals (fanning).
4) Normal population assumption / Nearly Normal condition
• Check this with a histogram or NPP of all the residuals together (this should use data from all groups; otherwise, we may be misled by small skews in each group).
• Check for outliers within each group. Consider removing outliers within groups before proceeding.

The ANOVA table
Data for an ANOVA are often presented in the form of an "ANOVA table". The particular format is structured to facilitate hand computation of the F-statistic (from the old days when this was done by teams of people, by hand). Nonetheless, many modern software packages still produce ANOVA output in this form:
Analysis of Variance Table

Source   Sum of Squares   DF   Mean Square   F-ratio   P-value
Soaps         29882        3     9960.64     7.0636    0.0011
Error         39484       28     1410.14
Total         69366       31
The only pieces of information we would ever use from such a table:
• The F-ratio (F-statistic): 7.0636
• The Mean Square for Error (MSE): 1410.14
• The P-value, if provided: 0.0011
A complete example.
Sometimes we don't have access to statistical software, but we do have access to the raw data for each of multiple groups and wish to compare the means.
An experiment is conducted to compare three weight loss programs: low calorie, low fat, and low carbohydrate. 15 participants are randomly assigned to the groups, and weight is recorded (in pounds) at the beginning and end of an 8-week experiment period. The data considered are the weight loss values for each participant, and the following data were recorded:
[Table of weight loss values for the low calorie, low fat, and low carbohydrate groups]

First, enter the data into lists L1, L2, and L3, then do 1-Var Stats on each list to get summary statistics for each treatment:

[1-Var Stats output for each list]
L1 (low calorie) does appear to have a substantially higher mean weight loss compared to the other two treatments, but is this difference of means significant?
Hypotheses:

H₀: μ₁ = μ₂ = μ₃
H_A: The group means are not all equal.

Conditions:
Independence / Randomization Condition:
• Groups must be independent of each other.
• Data within each treatment group must be independent too.
• Data collected with suitable randomization.

We can assume these data were collected appropriately. We do not have any reason to believe that there are any dependencies between samples in the data groups or between the groups.
Equal variance assumption: look at the side-by-side boxplots to see whether the groups have roughly the same spread (see the ways to check this condition described earlier).

Normal population assumption / Nearly Normal condition: check with a histogram or NPP of all the residuals together, and check for outliers within each group.
These box plots look reasonably good (they are not drastically different widths and there are no outliers). The difference in widths is not an issue (see note above).
L3 does appear skewed right and seems less variable than L1 and L2, so there is some justification for further analysis, although many would proceed with the test at this point.
Unfortunately, investigating further is not very easy if you are using a calculator. One way to investigate is to make the sample size larger by combining the data together and looking at the residuals. Each data point's residual is just the difference between the value and its group mean, but even after performing an ANOVA test the calculator does not produce a histogram of the residuals. You can, however, use this procedure to plot the histogram or NPP of the combined residuals:
• Make new lists L4, L5, and L6 for each group's residuals:

L4 = L1 − x̄₁, L5 = L2 − x̄₂, L6 = L3 − x̄₃

• Use the augment command (under 2nd-STAT, OPS) to join the residual lists end to end, eventually putting all the residuals in L2:

augment(L4, L5) → L1   (→ is the store command, STO▶ above the ON button)
augment(L1, L6) → L2

You can then display a histogram or NPP stat plot of L2.
The histogram and Normal Probability Plot both show that the residuals (from all groups, taken as a single data set) are Nearly Normal and that the variances are equal enough to proceed with the ANOVA.
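The same residual-pooling idea is easy to express off the calculator. The sketch below uses hypothetical weight-loss values (the actual data table did not survive in these notes) purely to show the mechanics:

```python
# Pool residuals across groups: subtract each group's mean from its values,
# then join the residual lists (the Python analogue of augment() on the TI-84).
from statistics import mean

groups = {  # hypothetical weight-loss data, for illustration only
    "low calorie": [8.0, 9.0, 6.0, 7.0, 3.0],
    "low fat":     [2.0, 4.0, 3.0, 5.0, 1.0],
    "low carb":    [3.0, 7.0, 5.0, 4.0, 6.0],
}

residuals = []
for values in groups.values():
    g_mean = mean(values)
    residuals.extend(v - g_mean for v in values)

# Residuals sum to zero within each group, so the pooled list sums to zero too.
print(abs(round(sum(residuals), 10)))  # 0.0
```

With the pooled residual list in hand, a histogram or normal probability plot checks the Nearly Normal condition on all the data at once.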
Perform the ANOVA test:

With the data in L1, L2, and L3, on a TI-84:

STAT → TESTS → ANOVA(

In the command line, use 2nd-1, the comma key, 2nd-2, etc. to specify the lists, then hit Enter to execute:
This output indicates that the F-statistic is 6.4176 with a p-value of 0.0127. The other entries correspond to the items in the ANOVA table (in which we are generally not interested). With a significance level of 0.05, the p-value of 0.0127 is low, so we reject the null hypothesis. There is significant evidence that the group means are not all equal.
That means that the low calorie diet's mean of 6.6 lbs of weight loss (compared to 3 and 4 lbs for the other diets) is statistically significant.
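For readers using software instead of a TI-84: scipy's `f_oneway` performs the same test. The sketch below uses hypothetical data (the original lists are not reproduced in these notes) and cross-checks scipy's F-statistic against the balanced-design formulas developed earlier (MS_T = n × variance of the group means, MS_E = average of the group variances).

```python
# Cross-check: scipy's one-way ANOVA vs. the balanced-design formulas.
from statistics import mean, variance
from scipy import stats

l1 = [8.0, 9.0, 6.0, 7.0, 3.0]  # hypothetical group data
l2 = [2.0, 4.0, 3.0, 5.0, 1.0]
l3 = [3.0, 7.0, 5.0, 4.0, 6.0]
groups = [l1, l2, l3]
n = 5  # equal group sizes (balanced design)

ms_t = n * variance([mean(g) for g in groups])
ms_e = mean(variance(g) for g in groups)
f_manual = ms_t / ms_e

f_scipy, p_value = stats.f_oneway(l1, l2, l3)
print(abs(f_manual - f_scipy) < 1e-9)  # True
```

The agreement holds for any balanced design; for unequal group sizes the pooled-variance formula changes (see the unbalanced designs note below), but `f_oneway` handles that automatically.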
If you have only summary statistics...
Some problems or situations may provide summary statistics and ask that you complete the analysis. For example, they may provide the numerator and denominator mean squares and ask you to find the F-statistic:
If MS_T = 19.4667 and MS_E = 3.0333, what is the F-statistic?

F-statistic = MS_T / MS_E = 19.4667 / 3.0333 = 6.4177
(Note: in the TI-84 ANOVA output, these values appear as the Mean Square values for Factor and Error.)
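The same division, as a one-line sketch:

```python
# F-statistic from the given summary mean squares.
ms_t, ms_e = 19.4667, 3.0333
f_stat = ms_t / ms_e
print(round(f_stat, 4))  # 6.4177
```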
Other problems may give you an F-statistic value and ask you to determine the corresponding p-value. As with the Normal and Chi-squared distributions, you would use the Fcdf command for the F-distribution:

p-value = Fcdf(6.4177, 999, 2, 12) = 0.0127

(arguments: lower bound, upper bound, numerator df, denominator df)
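In Python, the calculator's Fcdf upper-tail area corresponds to the F-distribution's survival function, available in scipy as `scipy.stats.f.sf`:

```python
# Upper-tail F probability: the software analogue of Fcdf(F, 999, df1, df2).
from scipy import stats

p_value = stats.f.sf(6.4177, 2, 12)  # P(F > 6.4177) with df = 2 and 12
print(round(p_value, 4))  # 0.0127
```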
Unbalanced designs

The examples we've seen have each had treatment groups of the same sample size. This is called a balanced design. Even when we try for a balanced design, things happen in the real world that often make the data unbalanced: subjects drop out of the study, errors in the data are found and records must be taken out, and so on.

Most of what we've developed works for unbalanced designs, except that certain computations get more complicated: n now must become n_k because it can be different for each group, and we can no longer find the pooled variance with a simple average. Luckily, technology makes these adjustments automatically, so even using a TI-84, we are able to enter lists of different sample sizes and conduct the analysis in the same way.
Ith resut of an ANOVA testis that the group means are not athe same, which means a
diferent enough tobe considered signiteantly ferent? If we fallto reject Hy, ten we slop,
but if we reject Ha, Ure more we can say
If one of the groups could legitimately serve as a control (e.g. water for the hand washing experiment), then we could do a 2-sample difference of means test between each other group and that control group to see which differences in means are most significant.
In the hand washing experiment, we could also try comparing the means of all the soaps taken together to the mean of all the sprays taken together. More complicated combinations like these are called contrasts and are beyond the scope of an introductory statistics course.
But there is a way to do something like a 2-sample difference of means comparison between all the groups and a single group. For example, in the hand-washing experiment, the researcher noted that the bacteria count was very low for the alcohol treatment, but wanted to know if any of the other (more pleasant smelling) treatments were equally effective as alcohol. She really wanted to compare each treatment against the alcohol spray.
She could do separate difference of means tests on each group vs. alcohol, but each test poses a risk of a Type I error (mistakenly rejecting a true H₀: detecting a difference when, in fact, there was no significant difference between the means).
The problem with this approach is that if we have multiple groups, as we do more and more tests, the risk that we eventually make a Type I error increases and eventually is bigger than the alpha level of each individual test. If we have enough groups, it becomes very likely that we'll reject one of the null hypotheses by mistake, but we won't know which true H₀ we mistakenly rejected.
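A quick computation shows how fast the overall (familywise) risk would grow if the tests were independent, each run at alpha = 0.05:

```python
# Familywise Type I error rate for m independent tests at alpha = 0.05 each.
alpha = 0.05
for m in (1, 3, 6, 10):
    familywise = 1 - (1 - alpha) ** m
    print(m, round(familywise, 3))
# 1 0.05
# 3 0.143
# 6 0.265
# 10 0.401
```

With 10 comparisons, the chance of at least one false rejection is already about 40%, far above the nominal 5%.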
There are several defenses against this problem, all called methods for multiple comparisons. All these methods first require that we be able to reject the overall null hypothesis with an ANOVA F-test. Once we've rejected the overall H₀, then we can think about comparing some, or even all, pairs of group means with each other.
Instead of doing a t-test on the difference, we could calculate a 95% confidence interval for the difference in means. (This is equivalent to a test with alpha = .05). The margin of error for such a test would be:

ME = t* · s_p · √(1/n₁ + 1/n₂)

To reject the null hypothesis that two group means are equal, the difference between them must be larger than this ME. When we use the margin of error this way, we call it the least significant difference (LSD). If two group means differ by more than this amount, they are significantly different at alpha = .05.
For the hand washing example:

n = 8, s_p = √MS_E = √1410.14 = 37.55

For 95% confidence with df = 28, t* = 2.048, so:

LSD = 2.048 × 37.55 × √(2/8) = 38.45 bacteria colonies
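The LSD arithmetic as a short sketch, using the rounded s_p = 37.55 from above:

```python
# Least significant difference for the hand-washing example.
import math

s_p = 37.55     # pooled standard deviation (sqrt of MS_E, rounded)
t_star = 2.048  # t critical value for 95% confidence with df = 28
n = 8           # observations per group

lsd = t_star * s_p * math.sqrt(1 / n + 1 / n)  # sqrt(2/8) = 0.5
print(round(lsd, 2))  # 38.45
```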
To reject the null hypothesis that two group means are equal, the difference between them must be greater than the LSD, so any two washing methods whose means differ by more than 38.45 colonies could be said to be statistically significantly different at alpha = .05 by this method.

Bonferroni Multiple Comparisons
This is still a way to examine individual pairs. If we want to examine many pairs simultaneously, there are several methods that adjust the critical t* value so that the resulting confidence intervals provide appropriate coverage for all the pairs while keeping the overall Type I error rate at or below alpha.
One such method is called the Bonferroni method. This method adjusts the LSD to allow for making many comparisons. The result is a wider margin of error called the minimum significant difference (MSD), found by replacing t* with a slightly larger number.
This makes the confidence intervals wider for each contrast and the corresponding Type I error rates lower for each test, and keeps the overall Type I error rate at or below alpha. The Bonferroni method, more specifically, splits the error rate equally among the J confidence intervals, finding each at confidence level 1 − α/J instead of the original 1 − α.
To signal this adjustment, we label the critical value t** instead of t*.
For the hand-washing example: to make the 3 confidence intervals comparing the alcohol spray with the other 3 washing methods, and preserve our overall alpha risk at 5%, we'd construct each with a confidence level of:

1 − 0.05/3 = 0.9833
For a confidence level of 98.33% with the denominator degrees of freedom (28), we can use the invT function:

invT(0.99167, 28) = 2.5456
This is somewhat larger than the 2.048 value we would use for a single comparison, and gives the following adjusted LSD value:

LSD* = 2.5456 × 37.55 × √(2/8) = 47.79 bacteria colonies
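The Bonferroni adjustment as a sketch; the t** value is the one from the invT computation above:

```python
# Bonferroni-adjusted LSD for the 3 comparisons against the alcohol spray.
import math

alpha, J = 0.05, 3
conf_level = 1 - alpha / J   # confidence level for each interval
t_double_star = 2.5456       # invT(0.99167, 28), from above
s_p, n = 37.55, 8

lsd_adj = t_double_star * s_p * math.sqrt(1 / n + 1 / n)
print(round(conf_level, 4))  # 0.9833
print(round(lsd_adj, 2))     # 47.79
```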
We then compute the difference between each mean and the control (alcohol in this case), and if a difference in means is above the adjusted LSD, then that difference can be considered statistically significant.
Difference in means (compared to the alcohol spray, mean 37.5):

Method               Mean    Difference
Antibacterial soap    92.5      55.0
Soap                 106.0      68.5
Water                117.0      79.5
All of these are above the 47.79 adjusted LSD, so all are significantly different than alcohol. It seems that alcohol spray is in a class by itself, and no other washing method is equivalent.

Some statistical software packages assume you want to compare all groups against all other groups, and note which groups are equivalent (within statistical significance):
Groups

Alcohol spray        37.5    A
Antibacterial soap   92.5    B
Soap                106.0    B
Water               117.0    B
This output shows that the alcohol is in one group, and all three others are in another group, not statistically different from one another.