1
Ric Coe
ICRAF, Nairobi, Kenya
Contents
Introduction
Preliminaries
Descriptive Statistics
  2. Two variables
Descriptive statistics - common problems
Confirmatory analysis: estimation and hypothesis testing
  The problem
  Estimates, standard errors and confidence intervals
  Hypothesis tests: The logic
  Examples of calculations
  Limitations
  What should you do
Confirmatory Analysis - Regression
  Starting Regression
  Fitting the regression line
  Check the fit
  Interpretation
  Adding more variables - Multiple regression
Interpretation
References
Introduction
This guide summarises the use of simple statistical analyses in the interpretation of
survey data. It is aimed at the typical small surveys (up to a few hundred respondents)
carried out by researchers looking at the role and uptake of new agricultural technologies.
There are several common problems in the approaches to survey analysis used by many
researchers, probably a result of the research methods courses followed during training.
One is to concentrate attention on a few well known statistical techniques, such as
chi-squared tests in 2-way tables and regression analysis, and to place a naively simplistic
reliance on the results. This is the topic of this guide. A second problem is to treat
statistical analysis as a recipe that can be followed to a successful conclusion without
much thought or understanding along the way. This is the topic of a companion guide
"Steps in survey analysis" (Coe 2002). A third problem is to ignore the context in which
the survey was carried out, so ignoring many of the possibilities and limitations of the
statistical analysis. This is the topic of the guide "Approaches to analysis of survey data"
(SSC, 2001).
1
Modified from input to a course "Formal data analysis for bean researchers" organised by CIAT at CMRT,
Egerton University, February 1996. Thanks to Soniia David for permission to quote the example.
Example
The example used in this guide was a survey of farmers in two districts of Uganda. It
aimed to characterize the pattern of bean growing and understand the role of new bean
varieties in the household economy of new farmers. A few of the stated objectives were:
Overall: Provide a baseline against which to measure adoption and impact of improved
bean varieties.
Hypotheses:
1. Adoption.
a. There is no relationship between adoption of new varieties and wealth.
b. The rate of adoption for MCM5001 will be higher in Mbale than Mukono, due to
strong non-appreciation of small seeded varieties in Mukono.
2. Impact.
a. Adoption of new varieties will result in an increase in absolute quantities and
proportion of beans sold, hence increasing household income from beans.
b. Adoption of new varieties will not result in increased sales of fresh beans.
c. Adoption of new varieties will not change the amount of income from beans controlled
by women.
d. ...
The examples are based on a subset of just 50 households from the whole survey of 179.
The variables used in the example have been labeled so should be self-explanatory.
In this guide SPSS has been used for the statistical analysis. General points appear in
normal text. Computer output and other items relating specifically to the example are
boxed.
Preliminaries
Before starting analysis:
1. Make sure you are familiar with the data source and collection methods.
For example:
Was a random sampling scheme used?
Were individual questionnaires completed during a group meeting?
Who was the data collected by? Why and when?
2. Clarify objectives.
These should have been listed in detail when the survey was planned. If they were not, or
have changed, they must be listed now. It is impossible to analyze a survey if you do
not know what you are trying to find out.
3. Coding and Data entry.
4. Make sure you understand the data. You must understand the exact meaning of every
number and code.
Data that needs clarifying.
Variable WIVE (Question 3): Does "1" mean 1 wife or 2 wives? (conflict between
questionnaire and code book).
Variable ARRANGE (Question 4): Does "NA" mean there are no bean plots or no
husband/wife?
Variables OCCUPHDI and OCCUPHD2 (Question 8): Why are two occupations given
when the question asks for the main occupation?
Variable KAW94A (Question 21): What is the difference between "na" and "No"?
Variable AMKW94A (Question 21): What are the units?
Descriptive Statistics
1. Summarizing Single Variables
Qualitative ("Coded") variables.
Useful summaries are just frequencies and percentages.
MATOKE   Grows matoke
                                                      Valid      Cum
Value Label              Value  Frequency  Percent  Percent  Percent
Yes                          1         42     84.0     84.0     84.0
No                           2          8     16.0     16.0    100.0
                                  -------  -------  -------
Total                                  50    100.0    100.0
Valid cases 50    Missing cases 0

HHTYPE   Household type
                                                      Valid      Cum
Value Label              Value  Frequency  Percent  Percent  Percent
Male headed one wife         1         27     54.0     54.0     54.0
Male headed more tha         2          4      8.0      8.0     62.0
Female headed absent         3          3      6.0      6.0     68.0
Female headed, no hu         4         13     26.0     26.0     94.0
Single man                   5          2      4.0      4.0     98.0
Other                        7          1      2.0      2.0    100.0
                                  -------  -------  -------
Total                                  50    100.0    100.0
Valid cases 50    Missing cases 0
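The frequency tables above come from SPSS, but the same summary is easy to produce by hand. The following is a minimal sketch in Python, not the guide's own method: the list `hhtype` is rebuilt from the counts printed in the HHTYPE table, and the function name `frequency_table` is made up for illustration.

```python
from collections import Counter

# Household-type codes rebuilt from the counts in the HHTYPE table above
# (27 households with code 1, 4 with code 2, and so on).
hhtype = [1] * 27 + [2] * 4 + [3] * 3 + [4] * 13 + [5] * 2 + [7] * 1


def frequency_table(values):
    """Return (value, frequency, percent, cumulative percent) rows, sorted by value."""
    counts = Counter(values)
    total = len(values)
    rows = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value,
                     counts[value],
                     round(100 * counts[value] / total, 1),
                     round(100 * running / total, 1)))
    return rows


for row in frequency_table(hhtype):
    print("%5d %9d %8.1f %8.1f" % row)
# Always report the sample size alongside percentages.
print("Total sample size:", len(hhtype))
```

Printing the sample size with the percentages follows the advice given in the guide: a percentage without its baseline is hard to interpret.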
Note different emphasis of frequencies and percentages. Frequencies emphasize
the sample, percentages emphasize the population. Give total sample size with
percentages.
Take care with percentages: make sure you are using an appropriate baseline
(what is 100%) and remember that percentages might not have to add to 100, as in
the example below.
Edit the computer output for presentation!
Crop          % growing
Cassava       100
Beans         98
Matoke        84
Maize         78
Yams          20
Sample size   50
Look carefully at and identify rare cases. Such data points may be errors, or may
need special treatment.
The standard error of an estimate is the standard deviation of the possible estimates
that could be produced by different simple random samples of the same size.
The standard error is best interpreted via a confidence interval. A 95% confidence
interval for p is p ± 2 x se(p)
= 0.33 ± 2 x 0.07
= (0.19, 0.47)
This is interpreted as "We are 95% confident that the true percentage of female
headed households is between 19% and 47%". Hence the uncertainty in results due
to sampling error is quantified.
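The interval above can be checked with a few lines of code. This is a sketch under stated assumptions: it uses the usual simple-random-sampling formula se(p) = sqrt(p(1 - p)/n), it takes p = 0.33 from the text, and it assumes n = 50 (the size of the example subset; the guide does not restate n at this point). The function name `proportion_ci` is hypothetical.

```python
import math


def proportion_ci(p, n, multiplier=2.0):
    """Return (se, lower, upper) for a proportion p estimated from n respondents.

    Assumes a simple random sample; multiplier=2 gives an approximate 95% interval.
    """
    se = math.sqrt(p * (1 - p) / n)
    return se, p - multiplier * se, p + multiplier * se


# p = 0.33 is taken from the text; n = 50 is an ASSUMPTION (the subset size).
se, lower, upper = proportion_ci(0.33, 50)
print(f"se = {se:.3f}, interval = ({lower:.2f}, {upper:.2f})")
```

The standard error comes out close to the 0.07 quoted in the text. The guide rounds the standard error before doubling it, so the limits computed here can differ from (0.19, 0.47) in the second decimal place.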
Means
The mean amount of beans planted in 94a is 5.8 kg. The standard error of this mean is
$se(\text{mean}) = \sqrt{s^2 / n}$, where $s^2$ is the variance in amount of beans and $n$ the sample size.
$se(\text{mean}) = \sqrt{8.6^2 / 30} = 1.6$
The 95% confidence interval is
mean ± 2 x se(mean)
= 5.8 ± 2 x 1.6
= (2.6, 9.0)
The mean amount of beans planted is between 2.6 and 9.0 kg.
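The same arithmetic can be written as a short function. This is only an illustrative sketch using the summary figures quoted above (mean 5.8 kg, standard deviation 8.6 kg, n = 30); the name `mean_ci` is hypothetical.

```python
import math


def mean_ci(mean, s, n, multiplier=2.0):
    """Return (se, lower, upper) for a sample mean with standard deviation s and size n."""
    se = s / math.sqrt(n)  # identical to sqrt(s**2 / n)
    return se, mean - multiplier * se, mean + multiplier * se


# Summary figures taken from the text above.
se, lower, upper = mean_ci(5.8, 8.6, 30)
print(f"se(mean) = {se:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

Because the guide rounds the standard error to 1.6 before doubling, its limits of 2.6 and 9.0 are slightly wider than the unrounded values computed here.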
Differences
If interested in differences between subgroups we can similarly estimate the
difference and find a standard error of the estimate.
Difference in mean amount of beans planted by
males and females = 6.5 - 2.9
= 3.6 kg.
$se(\text{difference}) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{9.5^2}{24} + \frac{1.3^2}{6}} = 2.0$
95% confidence interval for difference is
3.6 ± 2 x 2.0
= (-0.4, 7.6)
The mean difference between amounts planted by males and females could be
anything between -0.4 kg and 7.6 kg.
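As a check on the calculation, here is a minimal sketch that recomputes the standard error and interval for the difference from the figures in the text (standard deviations 9.5 and 1.3, group sizes 24 and 6, difference 3.6 kg). The function name `difference_ci` is made up for this example.

```python
import math


def difference_ci(diff, s1, n1, s2, n2, multiplier=2.0):
    """Return (se, lower, upper) for a difference between two independent sample means."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return se, diff - multiplier * se, diff + multiplier * se


# Figures from the text: males (s = 9.5, n = 24), females (s = 1.3, n = 6).
se, lower, upper = difference_ci(3.6, 9.5, 24, 1.3, 6)
print(f"se(difference) = {se:.1f}, interval = ({lower:.1f}, {upper:.1f})")
```

Note that the interval includes zero, which is the same information a hypothesis test would give, but it also shows how large the difference could plausibly be.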
Hypothesis tests: The logic
The logic of all the tests commonly used depends on the fact that random samples from a
population behave in a predictable way. The mean amount of beans planted by female
households, 2.9 kg, is not the actual mean of all households in the districts where the study
took place. If a different sample had been randomly selected the mean would have been
different. The question is "How different?". If all households are very similar (low variation
between households) then it really does not matter which sample is selected. On the other
hand, high variation in the population will lead to very different sample means, and hence
less certainty in the results obtained. The mathematics of statistics allows quantification of
these ideas, and hence answers to the question of how certain we are of the results.
The logic of the hypothesis tests is as follows:
1. Assume some fact is true - the null hypothesis (e.g. There is no difference in mean
amount of beans planted by male and female headed households).
2. Deduce how the sample would behave if (1) is true (e.g. How big could the sample
differences between male and female headed households be?)
3. Compare the actual sample with the predictions in (2).
4. If (2) and (3) do not agree then (1) must be untrue - the null hypothesis is rejected.
If (2) and (3) do agree then there is no reason, in this data, not to believe (1).
The level of agreement is measured by the "significance level", explained in the examples
below.
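The four steps can be made concrete with a permutation test, which follows the same logic directly by reshuffling the data. This is a different technique from the chi-squared and t-tests used in this guide, shown here only to illustrate the reasoning. The data values are HYPOTHETICAL, not taken from the survey, and the function name `permutation_p_value` is made up.

```python
import random
from statistics import mean

# HYPOTHETICAL amounts of beans planted (kg); these are NOT the survey data.
male = [12.0, 3.5, 8.0, 1.0, 20.0, 5.0, 6.5, 2.0, 9.0, 4.0]
female = [2.0, 3.0, 1.5, 4.0, 2.5, 5.0]


def permutation_p_value(a, b, n_shuffles=5000, seed=1):
    """Estimate how often a difference at least as big as the observed one
    would arise if group labels were unrelated to the values (step 1)."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))          # the actual sample (step 3)
    pooled = a + b
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)                    # behaviour under the null (step 2)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_shuffles                # level of agreement (step 4)


print("Estimated significance level:", permutation_p_value(male, female))
```

A small value says the observed difference would rarely arise if the null hypothesis were true; a large value says the data give no reason to disbelieve it.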
Examples of calculations
Chi-squared test for no association in a 2 x 2 table
Taking Table A as an example, we want to test whether the proportion of
households hiring labour is the same in male and female headed households. The steps are:
1. Formulate the null hypothesis: the proportion is equal for both male and
female households.
2. If (1) is true, then the proportion never hiring labour is estimated by 36/49. Hence we
would expect numbers in each category to be:
              Male                 Female
Never hire    33 x 36/49 = 24.2    16 x 36/49 = 11.8
Hire          33 x 13/49 = 8.8     16 x 13/49 = 4.2
3. The difference between observed and expected frequencies is summarised as
$\chi^2 = \frac{(23 - 24.2)^2}{24.2} + \frac{(13 - 11.8)^2}{11.8} + \frac{(10 - 8.8)^2}{8.8} + \frac{(3 - 4.2)^2}{4.2} = 0.74$
(calculated using unrounded expected frequencies).
4. If (1) is valid then the value of $\chi^2$ should be an observation from a $\chi^2_1$
distribution. Comparison with tables shows that 0.74 is not an extreme observation. A
number at least as big as this would occur 39% of the time. The significance level is p =
0.39. Hence there is no strong reason not to believe the null hypothesis.
t-test to compare two means
In example A the steps needed are:
1. Formulate the null hypothesis: the difference in mean amount of beans
planted for male and female households is zero.
2, 3. If (1) is true, then the difference in means of 3.6 kg, scaled by its standard
error (= 2.0),
$t = \frac{3.6}{2.0} = 1.8$
is an observation from a $t_{28}$ distribution.
4. Comparison with tables shows that 1.8 is not an extreme observation. A
difference as big as this would occur 8% of the time if (1) is true. The significance level is
p = 0.08. Hence there is not much reason not to believe the null hypothesis.
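Both calculations above can be reproduced with a short script. This is a sketch, not the SPSS procedure used for the guide. The observed counts (23, 13, 10, 3) are read from the worked chi-squared formula; the function name `chi_squared_2x2` is hypothetical. For one degree of freedom the chi-squared tail probability equals erfc(sqrt(x/2)), which avoids the need for tables. The t statistic is computed but its p-value is left to tables.

```python
import math

# Observed counts as read from the worked calculation above.
# Rows: never hire, hire. Columns: male headed, female headed.
observed = [[23, 13],
            [10, 3]]


def chi_squared_2x2(table):
    """Return (chi-squared statistic, p-value) for no association in a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # expected count under the null
            chi2 += (obs - expected) ** 2 / expected
    # Valid for 1 degree of freedom only: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = chi_squared_2x2(observed)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.2f}")

# t statistic for the difference in means: difference divided by its standard error.
t = 3.6 / 2.0
degrees_of_freedom = 24 + 6 - 2
print(f"t = {t:.1f} on {degrees_of_freedom} degrees of freedom")
```

Working from the rounded expected frequencies shown in the table gives a slightly smaller chi-squared value than the 0.74 quoted; the script uses unrounded values.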
Limitations
Assumptions:
The calculations in both examples above (the chi-squared test and the t-test) are based
on a series of assumptions. The key ones are:
Independence. In both examples A and B we assume observations are independent.
Lack of independence is caused by:
(i) non-simple random samples. In this case we have used a stratified sample.
(ii) interference between observations. This would be the case if individuals
within the same household responded, or if data were collected at a group meeting.
Lack of bias due to non-response, interviewer effects, attempts to "please" the
researcher etc.
Equality of variance and normal distribution (t-test). These assumptions can be
checked. In example A the data is clearly not normally distributed.
Limits to interpretation:
(1) If the result is "significant" we can reject the null hypothesis, and conclude
that there is a real difference in the population. If the result is "not significant" we have not
proved there is no difference. It is never possible to prove the null hypothesis is true (it
almost never will be!). All we can say is this study has not produced evidence to make us
disbelieve the null hypothesis.
(2) At what level of significance should the null hypothesis be rejected? 5% is
commonly used but there is absolutely no reason why it should be treated as a rigid cut off.
6% and 4% significance levels are, for all real purposes, equivalent.
(3) Whether the null hypothesis is rejected depends as much on the sample size
and precision of the study, as on the "truth" of the null hypothesis. A small, imprecise survey
will not detect a difference that could be picked up by a larger study. Maybe we just did
not collect enough data!
(4) The whole logic of significance testing and the p-value rests on what would
happen in repeated surveys of the same design, using new randomisations. Is this sensible,
when we know the survey would not and can not ever be repeated?
(5) In most analysis exercises, differences which "look interesting" at the
exploratory stage are investigated further in the confirmatory analysis. If the tests to
perform have been selected because differences look large, all significance levels are
invalid.
(6) If a large number of tests are performed, as is often the case in analysis of a
study with many variables, then we would expect 5% of the tests to give "significant" results
at the p = 0.05 level even if all null hypotheses were true. Hence it can be difficult to
interpret the results of multiple tests.
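The multiple-testing point can be put into numbers. The sketch below computes the expected count of falsely "significant" results and the chance of at least one. The second figure assumes the tests are independent of one another, which is an ASSUMPTION that rarely holds exactly in a survey. The function name `false_positive_summary` is hypothetical.

```python
def false_positive_summary(n_tests, alpha=0.05):
    """Return (expected number of false positives, probability of at least one)
    when every null hypothesis is true.

    The probability assumes the tests are independent (an assumption).
    """
    expected = n_tests * alpha
    p_at_least_one = 1 - (1 - alpha) ** n_tests
    return expected, p_at_least_one


for k in (1, 10, 20, 50):
    expected, p_any = false_positive_summary(k)
    print(f"{k:3d} tests: expect {expected:.1f} 'significant', "
          f"P(at least one) = {p_any:.2f}")
```

With 20 tests, one "significant" result is expected by chance alone, so an isolated significant finding among many tests is weak evidence.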
What should you do
(1) Treat the significance level p as an indication of "strength of evidence"
against the null hypothesis, not as a Yes/No decision maker.
(2) Concentrate on estimating the size of differences, rather than just testing
whether they exist. Confidence intervals for differences will be much more useful than
hypothesis tests.
At the end of every significance test apply the SO WHAT? test. Ask yourself "So
what?". Has the significance test really improved your understanding of the situation
and helped you take a rational decision for future action? If not, forget it, and get on
with something more useful.
Confirmatory Analysis - Regression
Starting Regression
- Beware!
Even "simple" regression is not simple!
- Start by considering types of relationship that might exist. The most useful regression
analysis will be one that starts from understanding of the theory behind the process being
studied.
The example used here is rather artificial. It examines the proposition that the amount of
beans harvested in 94a depends only on land area.
- Plot the data to see if there is any evidence of the relationship.
[Figure: scatter plot of HVTOT94A (total beans harvested in 94a; vertical axis, -20 to 220) against LANDAREA (horizontal axis, -1 to 11).]
Fitting the regression line
- Software is widely available to do this.
- Understand the output!
* * * *  M U L T I P L E   R E G R E S S I O N  * * * *
Listwise Deletion of Missing Data
Equation Number 1    Dependent Variable..   HVTOT94A   total beans harvested 94a
Block Number  1.  Method:  Enter      LANDAREA
Variable(s) Entered on Step Number
   1..    LANDAREA
Multiple R            .54425
R Square              .29621
Adjusted R Square     .28057
Standard Error      29.01659
Analysis of Variance
                    DF      Sum of Squares      Mean Square
Regression           1         15946.10384      15946.10384
Residual            45         37888.31105        841.96247
F =       18.93921       Signif F =  .0001
------------------ Variables in the Equation ------------------
Variable              B        SE B       Beta         T   Sig T
LANDAREA       8.200238    1.884280    .544249     4.352   .0001
(Constant)    -2.863844    6.051297                -.473   .6383
End Block Number   1   All requested variables entered.
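One way to understand this output is to see how its entries relate to each other. The sketch below recomputes R Square, F, the residual standard error and the t value for LANDAREA from the sums of squares and coefficients printed above. The approximate 95% interval for the slope (estimate ± 2 x SE) is an addition that follows the guide's earlier advice to prefer confidence intervals; the variable names are made up.

```python
import math

# Values copied from the SPSS output above.
ss_regression = 15946.10384
ss_residual = 37888.31105
df_residual = 45
slope, se_slope = 8.200238, 1.884280

# Proportion of variation in HVTOT94A explained by LANDAREA.
r_squared = ss_regression / (ss_regression + ss_residual)

# Residual mean square, and F ratio (the regression has 1 degree of freedom).
ms_residual = ss_residual / df_residual
f_ratio = ss_regression / ms_residual

# "Standard Error" in the output is the square root of the residual mean square.
residual_sd = math.sqrt(ms_residual)

# t value for the slope, and an approximate 95% confidence interval for it.
t_slope = slope / se_slope
ci_slope = (slope - 2 * se_slope, slope + 2 * se_slope)

print(f"R squared = {r_squared:.5f}")
print(f"F = {f_ratio:.3f}, residual sd = {residual_sd:.3f}")
print(f"t = {t_slope:.3f}, slope interval = ({ci_slope[0]:.1f}, {ci_slope[1]:.1f})")
```

With a single x-variable, F is the square of the t value for the slope, which is why both carry the same significance level in the output.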
Check the fit
- Look for any unusual points or outliers. They could represent mistakes or cases that
require special treatment. They certainly require explanation.
- Look for influential points, which largely determine results. They are not a bad thing,
but you must be aware if your conclusions depend critically on one or two observations.
- Look at the residuals to determine:
1. Whether they satisfy the main assumptions that validate the analysis (constant
variance, independence, roughly normally distributed)
2. Whether they show patterns according to the value of other variables, indicating that
those other variables should be allowed for in the analysis.
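Residuals are simply observed values minus fitted values, so they can be inspected without special software. The following is a minimal least-squares sketch; the `land` and `beans` values are HYPOTHETICAL and are not the survey data, and the function names `fit_line` and `residuals` are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a straight line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed minus fitted value for each observation."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# HYPOTHETICAL land areas and bean harvests (kg); NOT the survey data.
land = [1.0, 2.0, 2.5, 3.0, 4.0, 5.0, 6.0, 8.0]
beans = [5.0, 20.0, 10.0, 30.0, 25.0, 60.0, 40.0, 150.0]

a, b = fit_line(land, beans)
res = residuals(land, beans, a, b)

# Flag the observation with the largest residual for closer inspection.
largest = max(range(len(res)), key=lambda i: abs(res[i]))
print(f"intercept = {a:.2f}, slope = {b:.2f}")
print(f"largest residual: observation {largest}, residual = {res[largest]:.1f}")
```

Refitting the line with the flagged observation left out and comparing the slopes is a simple way to see whether the conclusions depend on that one point.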
Interpretation
"Significance" does not tell you whether the fitted model is logically sound or if it fits
the data well.
"Significance" does not tell you whether the model is useful in explaining or
describing a relationship, or if the relationship has much predictive power.
A regression model derived from survey data can not tell you what would happen
when an "x-variable" is changed. For example we can not use it to predict the bean
harvest of a farmer whose land holding changes.
Existence of a regression relationship between two variables does not mean there is a
causal relationship.
Regression relationships become useful when similar relationships are found in a number
of different conditions. Look for "significant sameness" between regions, crops, farm
types, etc.
Adding more variables - Multiple regression
Multiple regression is a powerful tool for understanding the relationship of one
variable to several others. BUT.....
All the limitations to interpretation above apply, and are compounded by the
existence of several "x-variables".
It is hard to draw graphs that show the relationships and the way data depart from
them, so the analyst must rely more on numerical indicators of lack of fit, outliers,
and influential points. Multiple regression analysis will not be successful if these are
not understood.
"Stepwise" and similar variable selection techniques, so loved by social scientists,
have little theoretical basis and can produce answers which are very poor. Regression
modeling will be most successful if understanding of the underlying processes is
used to choose possible models, rather than relying on computer algorithms.
The sample size required for multiple regression analysis depends on the
"configuration" of the data (in particular the range of the x-variables and correlations
among them). The required sample size quickly becomes large as the number of
x-variables increases. If regression analysis is part of the principal objectives of the
survey, it might be possible to select the sample in a way that makes the analysis
more efficient.
[Figure: Raw residuals vs. HHTYPE2. Vertical axis: raw residuals (-80 to 160); horizontal axis: HHTYPE2 (values 1 and 2).]
Interpretation
Interpret results. This does not mean "understand which effects are significant" but
"understand and communicate what you now know about the problem". You should be
able to:
Meet the objectives of the study.
Clearly state what is the substantive new knowledge which has been generated.
Show how this new information and understanding builds on what was there
before. Does it:
o add more examples of something previously known?
o mean that general rules or principles can be stated with more confidence?
o allow predictions to be made for new and important situations?
o mean that current understanding or theory has to be substantially
modified?
Use the quantitative information you have generated to make quantitative
predictions about the larger picture.
The ultimate goal of the research is a development objective. Explain how your
results help you towards that objective, and what the next steps will be.
Your survey and its analysis cost thousands of dollars. Explain why this was a
good investment.
Answer the "So what?" question. What can we now do which we could not do
before you did your survey?
References
Coe R (2002) Steps in Survey Analysis. Nairobi: ICRAF. 15 pp.
SSC (2001) Approaches to analysis of survey data. Reading: Statistical Services Centre. 28 pp.