
Introduction to Statistics Using

LibreOffice.org Calc
Apache OpenOffice.org Calc
and Gnumeric
Edition 5.3

Dana Lee Ling

Statistics using open source software
College of Micronesia-FSM
Pohnpei, Federated States of Micronesia

QA276

[http://creativecommons.org/licenses/by/4.0/] 24 May 2015


Introduction to Statistics Using LibreOffice.org Calc, Apache OpenOffice.org Calc, and Gnumeric by Dana Lee Ling [http://www.comfsm.fm/~dleeling/leeling.html] is licensed
under a Creative Commons Attribution 4.0 International License [http://creativecommons.org/licenses/by/4.0/].



Table of Contents
i. Title
ii. Software notes
Chapters
1. Populations and samples
2. Measures of middle and spread
3. Visualizing data
4. Paired data and scatter diagrams
5. Probability
6. Probability distributions
7. Introduction to the normal distribution
8. Normal distribution and z-values
9. Confidence intervals for the mean
10. Hypothesis testing against a known population mean
11. Hypothesis testing two sample means
Software notes
This statistics text utilizes LibreOffice.org/Apache OpenOffice.org Calc and Gnome Gnumeric to make statistical calculations and box plots. Both Calc and Gnumeric
are open source, cross-platform software and can be downloaded from their respective web sites.
The text does not use any add-ins, add-ons, statistical extensions, or separate dedicated proprietary statistical packages. This choice is very deliberate. Course
alumni and readers of this text are most likely to encounter default installations of spreadsheet software without such additional software. Course alumni should
not feel that they cannot "do" statistics because they lack a special add-in or dedicated package function that may require administrative privileges to install. Given
an "out-of-the-box" installation of a spreadsheet, course alumni, or for that matter any reader of this text, should be able to generate and use the statistics
introduced by this text. With a few exceptions, Microsoft Excel can also generate the results in this textbook.
Technical note: Apache OpenOffice.org 4.0.1 uses semi-colons instead of commas in formulas. Where commas appear in spreadsheet formulas in this text, Apache
OpenOffice.org 4.0.1 users will have to substitute semi-colons.
This text utilizes HyperText Markup Language, Scalable Vector Graphics, and Mathematical Markup Language (HTML+SVG+MathML). A browser that can render
MathML as well as SVG in HTML, such as Mozilla Firefox, is required to properly display and print this text.
Apache OpenOffice.org [http://www.openoffice.org/]
LibreOffice.org [http://www.libreoffice.org]
The Document Foundation [http://www.documentfoundation.org/]
Gnome Gnumeric [http://projects.gnome.org/gnumeric/]
Preface

We all walk in an almost invisible sea of data. I walked into a school fair and noticed a jump rope contest. The number of jumps for
each jumper until they fouled out was being recorded on the wall. Numbers. With a mode, median, mean, and standard deviation.
Then I noticed that faster jumpers attained higher jump counts than slower jumpers. I saw that I could begin to predict jump
counts based on the starting rhythm of the jumper. I used my stopwatch to record the time and total jump count. I later find that a
linear correlation does exist, and I am able to show by a t-test that the faster jumpers have statistically significantly higher jump
counts. I later incorporated this data in the fall 2007 final. [http://www.comfsm.fm/~dleeling/statistics/s73/jump-rope.html]

I walked into a store back in 2003 and noticed that Yamasa soy sauce appeared to cost more than Kikkoman soy sauce. I recorded prices and volumes, working out
the cost per milliliter. I eventually showed that the mean price per milliliter for Yamasa is higher than Kikkoman. I also ran a survey of students and determined that
students prefer Kikkoman to Yamasa. Soy Sauce data. [http://www.comfsm.fm/~dleeling/statistics/s33/soy10.html]

My son likes articulated mining dump trucks. I find pictures of Terex dump trucks on the Internet. I write to Terex in Scotland and ask them about how the prices
vary for the dump trucks, explaining that I teach statistics. "Funny you should ask," a Terex sales representative replied in writing. "The dump trucks are basically
priced by a linear relationship between horsepower and price." The representative included a complete list of horsepower and price. [http://www.comfsm.fm/~dleeling/statistics/s23micronesia.html]

One term I learned that a new Cascading Style Sheets level 3 color specification [http://www.w3.org/TR/css3-color/#hsl-color] for hue, luminosity, and luminance was
available for HyperText Markup Language web pages. The hue was based on a color wheel with cyan at the 180 middle of the wheel. I knew that Newton had put
green in the middle of the red-orange-yellow-green-blue-indigo-violet rainbow, but green is at 120 on a hue color wheel. And there is no cyan in Newton's rainbow.
Could the middle of the rainbow actually be at 180 cyan, or was Newton correct to say the middle of the rainbow is at 120 green? I used a hue analysis tool to
analyze the image of an actual rainbow taken by a digital camera here on Pohnpei. This allowed an analysis of the true center of the rainbow. Far Away Rainbow.
[http://www.comfsm.fm/~dleeling/statistics/s53/farawayrainbow.html]

While researching sakau consumption in markets here on Pohnpei I found differences in means between markets, and I found a variation with distance from
Kolonia. I asked some of the markets to share their cup tally sheets with me, and a number of them obliged. The data proved interesting. [http://www.comfsm.fm/~dleeling/statistics/s63/farawaysakau.html]

The point is that data is all around us all the time. You might not go into statistics professionally, yet you will always live in a world filled with data. For one sixteen
week term period in your life I want you to walk with an awareness of the data around you.
Data flows all around you. A sea of data pours past your senses daily. A world of data and numbers. Watch for numbers to happen around you. See the matrix.
Curriculum note

The text and the curriculum are an evolving work. Some curriculum options are not specifically laid out in this text. One option is to reserve time at the end of the
course to engage in open data exploration. Time can be gained to do this by de-emphasizing chapter five probability, essentially omitting chapter six, and skipping
from the end of section 7.2 directly to chapter 8. This material has been retained as these choices should be up to the individual instructor.
01 Introduction: Samples and Levels of Measurement
1.1 Populations and Samples

Statistics studies groups of people, objects, or data measurements and produces summarizing mathematical information on the groups. The groups are usually not
all of the possible people, objects, or data measurements. The groups are called samples. The larger collection of people, objects, or data measurements is called the
population.
Statistics attempts to predict measurements for a population from measurements made on the smaller sample. For example, to determine the average weight of a
student at the college, a study might select a random sample of fifty students to weigh. Then the measured average weight could be used to estimate the average
weight for all students at the college. The fifty students would be the sample; all students at the college would be the population.
Population: The complete group of elements, objects, observations, or people.
Parameters: Measurements of the population.
Sample: A part of the population. A sample is usually more than five measurements, observations, objects, or people, and smaller than the complete population.
Statistics: Measurements of a sample.
Examples
We could use the ratio of females to males in a class to estimate the ratio of females to males on campus. The sample is the class. The intended population is all
students on campus. Whether the statistics class is a "good" sample - representative, unbiased, randomly selected - would be a concern.
We could use the average body fat index for a randomly selected group of females between the ages of 18 and 22 on campus to determine the average body fat
index for females in the FSM between the ages of 18 and 22. The sample is those females on campus that we've measured. The intended population is all females
between the ages of 18 and 22 in the FSM. Again, there would be concerns about how the sample was selected.
Measurements are made of individual elements in a sample or population. The elements could be objects, animals, or people.
Sample size n

The sample size is the number of elements or measurements in a sample. The lower case letter n is used for sample size. If the population size is being reported,
then an upper case N is used. The spreadsheet function for calculating the sample size is the COUNT function.
=COUNT(data)
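For readers working outside a spreadsheet, the same count can be sketched in Python. Python is not part of the course software; it is shown here only for comparison, using the sibling count data that appears in chapter two:

```python
# Sample size n: the number of measurements in the sample.
# This mirrors the spreadsheet =COUNT(data) function.
siblings = [1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]
n = len(siblings)
print(n)  # 21
```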
Types of measurement

Qualitative data refers to descriptive measurements, typically non-numerical.

Quantitative data refers to numerical measurements. Quantitative data can be discrete or continuous.
Discrete: A countable or limited number of possible numeric values.
Continuous: An infinite number of possible numeric values.
Levels of measurement

Level of measurement: Nominal
Type: Qualitative
Definition: In name only
Examples: Sorting by categories such as red, orange, yellow, green, blue, indigo, violet

Level of measurement: Ordinal
Type: Quantitative, discrete
Definition: In rank order; there exists an order but differences and ratios have no meaning
Examples: Grading systems: A, B, C, D, F. Sakau market rating system where ratings are ordered by the number of cups until one is pwopihda, from highest to lowest.

Level of measurement: Interval
Type: Quantitative, discrete
Definition: Differences have meaning, but not ratios. There is either no zero or the zero has no mathematical meaning.
Examples: The numbering of the years: 2001, 2000, 1999. The year 2000 is 1000 years after 1000 A.D. (the difference has meaning), but it is NOT twice as many years (the ratio has no meaning). Someone born in 1998 is eight years younger than someone born in 1990: 1998 - 1990. A vase made in 2000 B.C., however, is not twice as old as a vase made in 1000 B.C. The complication is subtle and basically can stem from two sources: either there is no zero or the zero is not a true zero. The Fahrenheit and Celsius temperature systems both suffer from the latter defect.

Level of measurement: Ratio
Type: Quantitative, continuous
Definition: Differences and ratios have meaning. There is a mathematically meaningful zero.
Examples: Physical quantities: distance, height, speed, velocity, time in seconds, altitude, acceleration, mass,... 100 kg is twice as heavy as 50 kg. Ten dollars is 1/10 of $100.

The levels of measurement can also be thought of as being nested. For example, ratio level data consists of numbers. Numbers can be put in order, hence ratio level
data is also orderable data and is thus also ordinal level data. To some extent, each level includes the ones below that level. The highest level at which the data
could be considered is said to be the data's level of measurement. There are instances where qualitative data might be placed in an order and thus be
considered ordinal data, thus ordinal level data may be either qualitative or quantitative. When a survey says, "Strongly agree, agree, disagree, strongly disagree"
the data technically consists of answers which are words. Yet these words have an order, and in some instances the answers are mapped to numbers and a median value
is then calculated. Above the ordinal level the data is quantitative, numeric data.
Nominal - qualitative: words, names, categories. Statistics: sample size n, mode.
Ordinal - orderable, rankable; qualitative or quantitative. Statistics: median.
Interval - quantitative, no true zero. Statistics: median.
Ratio - quantitative: numbers, a zero exists, fractional values. Statistics: mean, standard deviation.

Descriptive statistics: Numerical or graphical representations of samples or populations. Can include numerical measures such as mode, median, mean, standard
deviation. Also includes images such as graphs, charts, visual linear regressions.
Inferential statistics: Using descriptive statistics of a sample to predict the parameters or distribution of values for a population.
1.2 Simple random samples

The number of measurements, elements, objects, or people in a sample is the sample size n. A simple random sample of n measurements from a population is one
selected in a way that:
any member of the population is equally likely to be selected.
any sample of a given size is equally likely to be selected.
Ensuring that a sample is random is difficult. Suppose I want to study how many Pohnpeians own cars. Would people I meet/poll on main street Kolonia be a
random sample? Why? Why not?
Studies often use random numbers to help randomly select objects or subjects for a statistical study. Obtaining random numbers can be more difficult than one
might at first presume.
Computers can generate pseudo-random numbers. "Pseudo" means seemingly random but not truly random. Computer generated random numbers are very close
to random but are actually not necessarily random. Next we will learn to generate pseudo-random numbers using a computer. This section will also serve as an
introduction to functions in spreadsheets.
Coins and dice can be used to generate random numbers.
Using a spreadsheet to generate random numbers

This course presumes prior contact with a course such as CA 100 Computer Literacy where a basic introduction to spreadsheets is made.
The random function RAND generates numbers between 0 and 0.9999...
=rand()
The random number function consists of a function name, RAND, followed by parentheses. For the random function nothing goes between the parentheses, not
even a space.
To get other numbers the random function can be multiplied by a coefficient. To get whole numbers the integer function INT can be used to discard the decimal
portion.
=INT(argument)
The integer function takes an "argument." The argument is a computer term for an input to the function. Inputs could include a number, a function, a cell address or
a range of cell addresses. The following function, when typed into a spreadsheet, mimics the flipping of a coin. A 1 will be a head, a 0 will be a tail.
=INT(RAND()*2)
The spreadsheet can be made to display the word "head" or "tail" using the following code:
=CHOOSE(INT(RAND()*2+1),"head","tail")
A single die can also be simulated using the following function:
=INT(6*RAND()+1)
To randomly select among a set of student names, the following model can be built upon.
=CHOOSE(INT(RAND()*5+1),"Jan","Jen","Jin","Jon","Jun")
To generate another random choice, press the F9 key on the keyboard. F9 forces a spreadsheet to recalculate all formulas.
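For comparison only (Python is not part of the course software), the spreadsheet formulas above can be sketched as:

```python
import random

# =INT(RAND()*2): mimic a coin flip, 1 for a head, 0 for a tail
coin = int(random.random() * 2)

# =CHOOSE(...): display the word "head" or "tail"
word = random.choice(["head", "tail"])

# =INT(6*RAND()+1): simulate a single six-sided die
die = int(6 * random.random() + 1)

# =CHOOSE(INT(RAND()*5+1),...): randomly select among five student names
name = random.choice(["Jan", "Jen", "Jin", "Jon", "Jun"])

print(coin, word, die, name)
```

Rerunning the script plays the role of pressing F9 in a spreadsheet: each run draws fresh pseudo-random values.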
Methods of sampling

When practical, feasible, and worth both the cost and effort, measurements are done on the whole population. In many instances the population cannot be
measured. Sampling refers to the ways in which random subgroups of a population can be selected. Some of the ways are listed below.
Census: Measurements done on the whole population.
Sample: Measurements of a representative random sample of the population.
Simulation

Today this often refers to constructing a model of a system using mathematical equations and then using computers to run the model, gathering statistics as the
model runs.
Stratified sampling

To ensure a balanced sample: Suppose I want to do a study of the average body fat of young people in the FSM using students in the statistics course. The FSM
population is roughly half Chuukese, but in the statistics course only 12% of the students list Chuuk as their home state. Pohnpei is 35% of the national population,
but the statistics course is more than half Pohnpeian at 65%. If I choose as my sample students in the statistics course, then I am likely to wind up with Pohnpeians
being over represented relative to the actual national proportion of Pohnpeians.
State     2010 Population   Fractional share of national population (relative frequency)   Statistics students by state of origin spring 2011   Fractional share of statistics seats
Chuuk     48651     0.47    10    0.12
Kosrae    6616      0.06    7     0.09
Pohnpei   35981     0.35    53    0.65
Yap       11376     0.11    12    0.15
Sums:     102624    1.00    82    1.00

The solution is to use stratified sampling. I ensure that my sample subgroups reflect the national proportions. Given that the sample size is small, I could choose to
survey all ten Chuukese students, seven Pohnpeian students, two Yapese students, and one Kosraean student. There would still be statistical issues due to the small
subsample sizes from each state, but the ratios would be closer to those seen in the national population. Each state would be considered a single stratum.
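The proportional allocation described above can be sketched in Python (shown for comparison only; the shares come from the table above, and the total sample size of 21 is an assumption chosen so that all ten Chuukese students can be included):

```python
# Proportional allocation for a stratified sample: multiply each
# stratum's national share by the desired total sample size and round.
national_share = {"Chuuk": 0.47, "Kosrae": 0.06, "Pohnpei": 0.35, "Yap": 0.11}
sample_size = 21

allocation = {state: round(share * sample_size)
              for state, share in national_share.items()}
print(allocation)  # {'Chuuk': 10, 'Kosrae': 1, 'Pohnpei': 7, 'Yap': 2}
```

Rounding can make the allocated counts sum to slightly more or less than the target sample size; here they sum to 20.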
Systematic sampling

Used where a population is in some sequential order. A start point must be randomly chosen. Useful in measuring a timed event. Never used if there is a cyclic or
repetitive nature to a system: if the sample rate is roughly equal to the cycle rate, then the results are not going to be randomly distributed measurements. For
example, suppose one is studying whether the sidewalks on campus are crowded. If one measures during the time between class periods when students are moving
to their next class, then one would conclude the sidewalks are crowded. If one measured only when classes were in session, then one would conclude that there is
no sidewalk crowding problem. This type of problem in measurement occurs whenever a system behaves in a regular, cyclical manner. The solution would be to ensure
that the time interval between measurements is random.
Cluster sampling

The population is divided into naturally occurring subunits and then subunits are randomly selected for measurement. In this method it is important that subunits
(subgroups) are fairly interchangeable. Suppose we want to poll the people in Kitti's opinion on whether they would pay for water if water was guaranteed to be
clean and available 24 hours a day. We could cluster by breaking up the population by kosapw and then randomly choose a few kosapws and poll everyone in these
kosapws. The results could probably be generalized to all Kitti.
Convenience sampling

Results or data that are easily obtained are used. Highly unreliable as a method of getting a random sample. Examples would include a survey of one's friends and
family as a sample population, or the surveys that some newspapers and news programs produce where a reporter surveys people shopping in a store.
1.3 Experimental Design

In science, statistics are gathered by running an experiment and then repeating the experiment. The sample is the experiments that are conducted. The population
is the theoretically abstract concept of all possible runs of the experiment for all time.
The method behind experimentation is called the scientific method. In the scientific method, one forms a hypothesis, makes a prediction, formulates an
experiment, and runs the experiment.
Some experiments involve new treatments; these require the use of a control group and an experimental group, with the groups being chosen randomly and the
experiment run double blind. Double blind means that neither the experimenter nor the subjects know which treatment is the experimental treatment and which is
the control treatment. A third party keeps track of which is which, usually using number codes. Then the results are tested for a statistically significant difference
between the two groups.
Placebo effect: just believing you will improve can cause improvement in a medical condition.
Replication is also important in the world of science. If an experiment cannot be repeated and produce the same results, then the theory under test is rejected.
Some of the steps in an experiment are listed below:
1. Identify the population of interest
2. Specify the variables that will be measured. Consider protocols and procedures.
3. Decide on whether the population can be measured or if the measurements will have to be on a sample of the population. If the latter, determine a method
that ensures a random sample that is of sufficient size and representative of the population.
4. Collect the data (perform the experiment).
5. Analyze the data.
6. Write up the results and publish! Note directions for future research, note also any problems or complications that arose in the study.
Observational study

Observational studies gather statistics by observing a system in operation, or by observing people, animals, or plants. Data is recorded by the observer. Someone
sitting and counting the number of birds that land on or take off from a bird nesting islet on the reef is performing an observational study.
Surveys

Surveys are usually done by giving a questionnaire to a random sample. Voluntary responses tend to be negative. As a result, there may be a bias towards negative
findings. Hidden bias/unfair questions: Are you the only crazy person in your family?
Generalizing

The process of extending from sample results to a population. If a sample is a good random sample, representative of the population, then some sample statistics can
be used to estimate population parameters. Sample means and proportions can often be used as point estimates of a population parameter.
Although the mode and median, covered in chapter two, do not always well predict the population mode and median, there are situations in which a mode may
be used. If a good, random, and representative sample of students finds that the color blue is the favorite color for the sample, then blue is a best first estimate of
the favorite color of the population of students or any future student sample.
Favorite colors

Favorite color   Frequency f   Relative Frequency or p(color)
Blue             32            35%
Black            18            20%
White            10            11%
Green            9             10%
Red              6             7%
Pink             5             5%
Brown            4             4%
Gray             3             3%
Maroon           2             2%
Orange           1             1%
Yellow           1             1%
Sums:            91            100%

If the above sample of 91 students is a good random sample of the population of all students, then we could make a point estimate that roughly 35% of the students
in the population will prefer blue.
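The relative frequency column, and hence the point estimate, is just each frequency divided by the sample size. A sketch in Python (for comparison only) using the three largest counts from the table:

```python
# Relative frequency p(color) = frequency f / sample size n,
# using the three largest counts from the favorite-color table.
counts = {"Blue": 32, "Black": 18, "White": 10}
n = 91  # total sample size from the table

for color, f in counts.items():
    print(color, round(100 * f / n), "%")  # Blue 35 %, Black 20 %, White 11 %
```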

02 Measures of Middle and Spread


2.1 Measures of central tendency:
mode, median, mean, midrange
Mode

The mode is the value that occurs most frequently in the data. Spreadsheet programs can determine the mode with the function MODE.
=MODE(data)
In the Fall of 2000 the statistics class gathered data on the number of siblings for each member of the class. One student was an only child and had no siblings. One
student had 13 brothers and sisters. The complete data set is as follows:
1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13
The mode is 2 because 2 occurs more often than any other value. Where there is a tie there is no mode.
For the ages of students in that class


18, 19, 19, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 24, 24, 25, 25, 26
...there is no mode: there is a tie between 21 and 22, hence there is no single most frequent value. Spreadsheets will, however, usually report a mode of 21 in this
case. Spreadsheets often select the first mode in a multi-modal tie.
If all values appear only once, then there is no mode. Spreadsheets will display #N/A or #VALUE to indicate an error has occurred - there is no mode. No mode is
NOT the same as a mode of zero. A mode of zero means that zero is the most frequent data value. Do not put the number 0 (zero) for "no mode." An example of a
mode of zero might be the number of children for students in a statistics class.
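The spreadsheet behavior described above can be mirrored in Python's standard library (shown for comparison only): mode() reports a single mode and, like a spreadsheet, returns the first mode when there is a tie, while multimode() exposes the tie explicitly.

```python
from statistics import mode, multimode

# The sibling data: 2 occurs most often.
siblings = [1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]
print(mode(siblings))  # 2

# The ages data has a tie between 21 and 22; mode() reports only the
# first, multimode() reveals the tie.
ages = [18, 19, 19, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 24, 24, 25, 25, 26]
print(multimode(ages))  # [21, 22]
```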
Median

The median is the central (or middle) value in a data set. If a number sits at the middle, then it is the median. If the middle is between two numbers, then the
median is half way between the two middle numbers.
For the sibling data...
1, 2, 2, 2, 2, 2, 3, 3, 4, 4, |4|, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13
...the median is 4.
Note the data must be in order (sorted) before you can find the median. For the data 2, 4, 6, 8 the median is 5: (4+6)/2.
The median function in spreadsheets is MEDIAN.
=MEDIAN(data)
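For comparison, the median in Python's standard library follows the same rule as the spreadsheet MEDIAN function:

```python
from statistics import median

# Odd count: the middle value of the 21 sorted sibling values is the median.
siblings = [1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]
print(median(siblings))  # 4

# Even count: the median is halfway between the two middle values.
print(median([2, 4, 6, 8]))  # 5.0, i.e. (4 + 6) / 2
```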
Mean (average)

The mean, also called the arithmetic mean and also called the average, is calculated mathematically by adding the values and then dividing by the number of values
(the sample size n).
If the mean is the mean of a population, then it is called the population mean μ. The letter μ is a Greek lower case "m" and is pronounced "mu."
If the mean is the mean of a sample, then it is the sample mean x̄. The symbol x̄ is pronounced "x bar."

sample mean x̄ = (sum of the sample data) / (sample size n)

The sum of the data Σx can be determined using the function =SUM(data). The sample size n can be determined using =COUNT(data). Thus
=SUM(data)/COUNT(data) will calculate the mean. There is also a single function that calculates the mean. The function that directly calculates the mean is
AVERAGE.
=AVERAGE(data)
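The two spreadsheet routes to the mean, =SUM(data)/COUNT(data) and =AVERAGE(data), can be sketched in Python (for comparison only) as:

```python
from statistics import mean

data = [2, 4, 6, 8]
x_bar = sum(data) / len(data)  # sum of the sample data / sample size n
print(x_bar)  # 5.0
print(mean(data))  # the single-function route gives the same value
```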
Resistant measure: one that is not influenced by extremely high or extremely low data values. The median tends to be more resistant than the mean.
Population mean and sample mean

If the mean is measured using the whole population, then this would be the population mean. If the mean was calculated from a sample, then the mean is the
sample mean. Mathematically there is no difference in the way the population and sample mean are calculated.
Midrange

The midrange is the midway point between the minimum and the maximum in a set of data.
To calculate the minimum and maximum values, spreadsheets use the minimum value funcon MIN and maximum value funcon MAX.
=MIN(data)
=MAX(data)
The MIN and MAX functions can take a list of comma separated numbers or a range of cells in a spreadsheet. If the data is in cells A2 to A42, then the minimum and
maximum can be found from:
=MIN(A2:A42)
=MAX(A2:A42)
The midrange can then be calculated from:
midrange = (maximum + minimum)/2
In a spreadsheet use the following formula:
=(MAX(data)+MIN(data))/2
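A quick sketch of the midrange calculation using the sibling data (Python shown for comparison with the spreadsheet formula):

```python
# Midrange: midway point between the minimum and the maximum,
# mirroring =(MAX(data)+MIN(data))/2.
siblings = [1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 7, 8, 9, 10, 12, 12, 13]
midrange = (max(siblings) + min(siblings)) / 2
print(midrange)  # 7.0, i.e. (13 + 1) / 2
```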
2.2 Differences in the Distribution of Data


Range

The range is the maximum data value minus the minimum data value.
=MAX(data)-MIN(data)
The range is a useful basic statistic that provides information on the distance between the most extreme values in the data set.
The range does not show if the data is evenly spread out across the range or crowded together in just one part of the range. The way in which the data is either
spread out or crowded together in a range is referred to as the distribution of the data. One of the ways to understand the distribution of the data is to calculate the
position of the quartiles and make a chart based on the results.
Percentiles, Quartiles, Box and Whisker charts

The median is the value that is the middle value in a sorted list of values. At the median 50% of the data values are below and 50% are above. This is also called the
50th percentile for being 50% of the way "through" the data.
If one starts at the minimum, 25% of the way "through" the data, the point at which 25% of the values are smaller, is the 25th percentile. The value that is 25% of
the way "through" the data is also called the first quartile.
Moving on "through" the data to the median, the median is also called the second quartile.
Moving past the median, 75% of the way "through" the data is the 75th percentile, also known as the third quartile.
Note that the 0th percentile is the minimum and the 100th percentile is the maximum.
Spreadsheets can calculate the first, second, and third quartile for data using a function, the quartile function.
=QUARTILE(data,type)
Data is a range with data. Type represents the type of quartile: 0 = minimum, 1 = 25% or first quartile, 2 = 50% (median), 3 = 75% or third quartile, and 4 =
maximum. Thus if the data is in the cells A1:A20, the first quartile could be calculated using:
=QUARTILE(A1:A20,1)
There are some complex subtleties to calculating the quartile. For a full and thorough treatment of the subject refer to Eric Langford's Quartiles in Elementary
Statistics, Journal of Statistics Education Volume 14, Number 3 (2006) [http://www.amstat.org/publications/jse/v14n3/langford.html].
For the purposes of this course, the value produced by the spreadsheet function QUARTILE will be used for the first and third quartiles. LibreOffice.org, Gnumeric,
Google Docs, and Excel up through version 2007 concur in the values produced for Langford's "canonical" set. Excel 2010 separates the QUARTILE function into two
functions, QUARTILE.INC for the inclusive quartile and QUARTILE.EXC for the exclusive quartile. Consideration of the different results produced by these functions
goes beyond the scope and intent of this basic text. For further information and exploration, refer to Langford 2006 [http://www.amstat.org/publications/jse/v14n3/langford.html] and to Patrick Wessa's on line quartile calculator [http://www.wessa.net/quart.wasp].

Note that the function processing calculator Qalculate! produces a different result than the spreadsheets for the first quartile due to the use of an alternative
algorithm.
The minimum, first quartile, median, third quartile, and maximum provide a compact and informative five number summary of the distribution of a data set.
InterQuartile Range

The InterQuartile Range (IQR) is the range between the first and third quartile:
=QUARTILE(Data,3)-QUARTILE(Data,1)
There are some subtleties to calculating the IQR for sets with even versus odd sample sizes, but this text leaves those details to the spreadsheet software functions.
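As noted above, quartile algorithms differ between tools. The following Python sketch implements the inclusive, interpolating method that the spreadsheet QUARTILE function uses (an assumption based on the agreement among spreadsheets noted above; Python's own statistics.quantiles defaults to a different, exclusive method, which is why the calculation is written out here):

```python
def quartile(data, q):
    """Sketch of =QUARTILE(data, type): q = 0..4, linear interpolation
    on sorted data (the 'inclusive' method)."""
    xs = sorted(data)
    pos = (len(xs) - 1) * q / 4  # fractional position in the sorted list
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return xs[lo]
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
q1 = quartile(data, 1)
q3 = quartile(data, 3)
print(q1, q3, q3 - q1)  # 37.5 92.5 55.0  (the IQR is Q3 - Q1)
```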


Quartiles, Box and Whisker plots

The above is very abstract and hard to visualize. A box and whisker plot takes the above quartile information and plots a chart based on the quartiles. The table
below has four different data sets. The first consists of a single value, the second of values spread uniformly across the range, the third has values concentrated near
the middle of the range, and the last has most of the values at the minimum or maximum.
[Table of four example data sets - univalue, uniform, peaked symmetric, and bimodal - on a scale of 1 to 5.]

Box plots display how the data is spread across the range based on the quartile information above.

A box and whisker plot is built around a box that runs from the value at the 25th percentile (first quartile) to the value at the 75th percentile (third quartile). The
length of the box spans the distance from the value at the first quartile to the third quartile; this is called the Inter-Quartile Range (IQR). A line is drawn inside the
box at the location of the 50th percentile. The 50th percentile is also known as the second quartile and is the median for the data. Half the scores are above the
median, half are below the median. Note that the 50th percentile is the median, not the mean.
The basic box plot described above has lines that extend from the first quartile down to the minimum value and from the third quartile to the maximum
value. These lines are called "whiskers" and end with a cross-line called a "fence". If, however, the minimum is more than 1.5 IQR below the first
quartile, then the lower fence is put at 1.5 IQR below the first quartile and the values below the fence are marked with a round circle. These values are
referred to as potential outliers - the data is unusually far from the median in relation to the other data in the set.

Likewise, if the maximum is more than 1.5 IQR beyond the third quartile, then the upper fence is located at 1.5 IQR above the 3rd quartile. The
maximum is then plotted as a potential outlier along with any other data values beyond 1.5 IQR above the 3rd quartile.
There are actually two types of outliers. Potential outliers are between 1.5 IQR and 3.0 IQR beyond the fence. Extreme outliers are beyond 3.0 IQR. In
the program Gnome Gnumeric potential outliers are marked with a circle colored in with the color of the box. Extreme outliers are marked with an open
circle - a circle with no color inside.
An example with hypothetical data sets is given to illustrate box plots. The data consists of two samples. Sample one (s1) is a uniform distribution and
sample two (s2) is a highly skewed distribution.

s1    s2
10    11
20    11
30    12
40    13
50    15
60    18
70    23
80    31
90    44
100   65
110   99
120   154
While box and whisker plots can be generated by the Gnome Gnumeric program, the loss of support for Microsoft Windows and Apple OSX in 2015 has made
Gnumeric less broadly useful. This author no longer recommends the use of Gnumeric to generate box plots. That said, the first and third quartiles for Gnumeric,
OpenOffice, LibreOffice, and Excel (using the QUARTILE function) agree with each other.
The online tool BoxPlotR [http://boxplot.tyerslab.com] generates box plots including outliers. The first row should be the data label, the variable to
be plotted. Data can be copied and pasted into the second tab using the Paste data option. If copying and pasting multiple columns from a spreadsheet, preset the
separator to Tab. For advanced users, notches for the 95% confidence interval for the median can be displayed. The plot can also display the mean and the 95%
confidence interval for the mean. The tool is also able to generate violin and bean plots, and change whisker definitions from Tukey to Spear or Altman for advanced
users. As of 2016 this text recommends use of the BoxPlotR generator. If the tool grays out, reload the page and recopy the data.

Other online box plot generators such as Shodor Interactivate [http://www.shodor.org/interactivate/activities/BoxPlot/] can also be used, although the first and third quartiles may differ from those reported by spreadsheets. The Shodor generator is far more limited in capabilities than the BoxPlotR generator. The different algorithms for calculating the first and third quartiles will result in those quartiles differing between the spreadsheet and the online box plot generator. For an idea of the possible differences in the quartile values, see Patrick Wessa's online quartile calculator [http://www.wessa.net/quart.wasp].
The box and whisker plot is a useful tool for exploring data and determining whether the data is symmetrically distributed, skewed, and whether the data has potential outliers - values far from the rest of the data as measured by the InterQuartile Range. The distribution of the data often impacts what types of analysis can be done on the data.
The distribution is also important to determining whether a measurement that was done is performing as intended. For example, in education a "good" test is usually one that generates a symmetric distribution of scores with few outliers. A highly skewed distribution of scores would suggest that the test was either too easy or too difficult. Outliers would suggest unusual performances on the test.

Two data sets, one uniform, the other with one potential outlier and one extreme outlier.
Shodor Interactivate [http://www.shodor.org/interactivate/activities/BoxPlot/] produces box plots that can show outliers. The image displays a data set with a potential outlier (blue asterisk) and an extreme outlier (black asterisk) displayed. To display the outliers, click on "Uncover Outliers". The Horizontal Scale can also be adjusted. Note that the displayed image uses the Tukey method for an odd number of data points. The first and third quartiles will not necessarily agree with those found using a spreadsheet, and this will also affect whether a data value is calculated to be an outlier or not. Use of the Tukey method is recommended in the course taught by the author of the text.

Data can be copied and pasted directly from a spreadsheet into the data entry box in Shodor [http://www.shodor.org/interactivate/activities/BoxPlot/].
Standard Deviation

Consider the following data:


Data                     mode   median   mean   min   max   range   midrange
Data set 1: 5, 5, 5, 5   5      5        5      5     5     0       5
Data set 2: 2, 4, 6, 8   none   5        5      2     8     6       5
Data set 3: 2, 2, 8, 8   none   5        5      2     8     6       5

Neither the mode, median, nor the mean reveal clearly the differences in the distribution of the data above. The mean and the median are the same for each data set. The mode is the same as the mean and the median for the first data set and is unavailable for the last data set (spreadsheets will report a mode of 2 for the last data set). A single number that would characterize how much the data is spread out would be useful.
As noted earlier, the range is one way to capture the spread of the data. The range is calculated by subtracting the smallest value from the largest value. In a spreadsheet:
=MAX(data)-MIN(data)
The range still does not characterize the difference between set 2 and set 3: the last set has more data further away from the center of the data distribution. The range misses this difference.
To capture the spread of the data we use a measure related to the average distance of the data from the mean. We call this the standard deviation. If we have a population, we report this average distance as the population standard deviation. If we have a sample, then our average distance value may underestimate the actual population standard deviation. As a result the formula for sample standard deviation adjusts the result mathematically to be slightly larger. For our purposes these numbers are calculated using spreadsheet functions.
Standard deviation

One way to distinguish the difference in the distribution of the numbers in data set 2 and data set 3 above is to use the standard deviation.

Data                     mean   stdev
Data set 1: 5, 5, 5, 5   5      0.00
Data set 2: 2, 4, 6, 8   5      2.58
Data set 3: 2, 2, 8, 8   5      3.46

The function that calculates the sample standard deviation is:

=STDEV(data)
In this text the symbol for the sample standard deviation is usually sx.
In this text the symbol for the population standard deviation is usually σ.
The symbol sx usually refers to the standard deviation of single variable x data. If there is y data, the standard deviation of the y data is sy. Other symbols that are used for standard deviation include s and σx. Some calculators use the unusual and confusing notations σxn−1 and σxn for the sample and population standard deviations.
In this class we always use the sample standard deviation in our calculations. The sample standard deviation is calculated in a way such that the sample standard deviation is slightly larger than the result of the formula for the population standard deviation. This adjustment is needed because a population tends to have a slightly larger spread than a sample. There is a greater probability of outliers in the population data.
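As an illustrative sketch in Python (an assumption of this edition, not part of the original spreadsheet-based text), the sample and population standard deviations for data set 3 can be compared:

```python
import statistics

data3 = [2, 2, 8, 8]

# Sample standard deviation (spreadsheet STDEV): divides by n - 1
sx = statistics.stdev(data3)

# Population standard deviation (spreadsheet STDEVP): divides by n
sigma = statistics.pstdev(data3)

print(round(sx, 2), round(sigma, 2))  # the sample value is slightly larger
```

The sample value 3.46 matches the table above; the population formula gives the smaller 3.00.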
Coefficient of variation CV

The Coecient of Variaon is calculated by dividing the standard deviaon (usually the sample standard deviaon) by the mean.

12 de 56

=STDEV(data)/AVERAGE(data)
Note that the CV can be expressed as a percentage: Group 2 has a CV of 52% while group 3 has a CV of 69%. A deviaon of 3.46 is large for a mean of 5 (3.46/5 =
69%) but would be small if the mean were 50 (3.46/50 = 7%). So the CV can tell us how important the standard deviaon is relave to the mean.
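The CV calculation can be sketched in Python (illustrative, not part of the original text; the function name cv is this sketch's own):

```python
import statistics

data2 = [2, 4, 6, 8]
data3 = [2, 2, 8, 8]

def cv(data):
    """Coefficient of variation: sample standard deviation over the mean."""
    return statistics.stdev(data) / statistics.mean(data)

print(f"{cv(data2):.0%}  {cv(data3):.0%}")  # the 52% and 69% quoted above
```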
Rules of thumb regarding spread

As an approximation, the standard deviation for data that has a symmetrical, heap-like distribution is roughly one-quarter of the range. If given only minimum and maximum values for data, this rule of thumb can be used to estimate the standard deviation.
At least 75% of the data will be within two standard deviations of the mean, regardless of the shape of the distribution of the data.
At least 89% of the data will be within three standard deviations of the mean, regardless of the shape of the distribution of the data.
If the shape of the distribution of the data is a symmetrical heap, then as much as 95% of the data will be within two standard deviations of the mean.
Data beyond two standard deviations away from the mean is considered "unusual" data.
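The "at least 75% within two standard deviations" bound can be spot-checked in Python. This sketch (not part of the original text) uses the highly skewed sample s2 from the box plot example, precisely because the bound holds regardless of shape:

```python
import statistics

s2 = [11, 11, 12, 13, 15, 18, 23, 31, 44, 65, 99, 154]

mean = statistics.mean(s2)
sx = statistics.stdev(s2)

# Fraction of the data within two standard deviations of the mean
within_two = [x for x in s2 if mean - 2 * sx <= x <= mean + 2 * sx]
fraction = len(within_two) / len(s2)

print(fraction)  # guaranteed to be at least 0.75, whatever the shape
```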
Basic statistics and their interaction with the levels of measurement

Levels of measurement and appropriate measures


Level of measurement Appropriate measure of middle Appropriate measure of spread
nominal
mode
none or number of categories
ordinal

median

range

interval

median or mean

range or standard deviaon

rao

mean

standard deviaon

At the interval level of measurement either the median or mean may be more appropriate depending on the specific system being studied. If the median is more appropriate, then the range should be quoted as a measure of the spread of the data. If the mean is more appropriate, then the standard deviation should be used as a measure of the spread of the data.
Another way to understand the levels at which a particular type of measurement can be made is shown in the following table.
Levels at which a particular statistic or parameter has meaning

Statistic/Parameter        Nominal   Ordinal   Interval   Ratio
sample size                ✓         ✓         ✓          ✓
mode                       ✓         ✓         ✓          ✓
minimum                              ✓         ✓          ✓
maximum                              ✓         ✓          ✓
range                                ✓         ✓          ✓
median                               ✓         ✓          ✓
mean                                           ✓          ✓
standard deviation                             ✓          ✓
coefficient of variation                                  ✓

For example, a mode, median, and mean can be calculated for ratio level measures. Of those, the mean is usually considered the best measure of the middle for a random sample of ratio level data.
2.3 Variables
Discrete Variables

When there are a countable number of values that result from observations, we say the variable producing the results is discrete. The nominal and ordinal levels of measurement almost always measure a discrete variable.
The following examples are typical values for discrete variables:
true or false (2 values)
yes or no (2 values)
strongly agree | agree | neutral | disagree | strongly disagree (5 values)
The last example above is a typical result of a type of survey called a Likert survey [http://www.socialresearchmethods.net/kb/scallik.htm] developed by Rensis Likert [http://www.nwlink.com/~donclark/hrd/history/likert.html] in 1932.
When reporting the "middle value" for a discrete distribution at the ordinal level it is usually more appropriate to report the median. For further reading on the matter of using mean values with discrete distributions refer to the pages by Nora Mogey [http://www.icbl.hw.ac.uk/ltdi/cookbook/info_likert_scale/] and by the Canadian Psychiatric Association [http://www.cpa-apc.org/Publications/Archives/CJP/2000/Nov/Research2.asp].
Note that if the variable measures only the nominal level of measurement, then only the mode is likely to have any statistical "meaning"; the nominal level of measurement has no "middle" per se.
There may be rare instances in which looking at the mean value and standard deviation is useful for looking at comparative performance [../department/studentevals13.xls], but it is not a recommended practice to use the mean and standard deviation on a discrete distribution. The Canadian Psychiatric Association [http://www.cpa-apc.org/Publications/Archives/CJP/2000/Nov/Research2.asp] discusses when one may be able to "break" the rules and calculate a mean on a discrete distribution. Even then, bear in mind that ratios between means have no "meaning!"
For example, questionnaires often generate discrete results:

Response             Value
Never                0
About once a month   1
About once a week    2
A few times a week   3
Every day            4

How often do you drink caffeinated drinks such as coffee, tea, or cola?
How often do you chew tobacco without betelnut?
How often do you chew betelnut without tobacco?
How often do you chew betelnut with tobacco?
How often do you drink sakau en Pohnpei?
How often do you drink beer?
How often do you drink wine?
How often do you drink hard liquor (whisky, rum, vodka, tequila, etc.)?
How often do you smoke cigarettes?
How often do you smoke marijuana?
How often do you use controlled substances other than marijuana (methamphetamines, cocaine, crack, ice, shabu, etc.)?
The results of such a questionnaire are numeric values from 0 to 4.
Continuous Variables

When there is a innite (or uncountable) number of values that may result from observaons, we say that the variable is connuous. Physical measurements such
as height, weight, speed, and mass, are considered connuous measurements. Bear in mind that our measurement device might be accurate to only a certain
number of decimal places. The variable is connuous because beer measuring devices should produce more accurate results.
The following examples are connuous variables:
distance
me
mass
length
height
depth
weight
speed
body fat
When reporng the "middle value" for a connuous distribuon it is appropriate to report the mean and standard deviation. The mean and standard deviaon only
have "meaning" for the rao level of measurement.
Interactions between levels of measure, variable type, and measures of middle and spread

Level of measurement Typical variable type


Appropriate measure of middle
nominal
discrete
mode

Appropriate measure of variation


none

ordinal

discrete

median (can also report mode)

range

rao

connuous

mean (can also report median and mode) sample standard deviaon

2.4 Z: A Measure of Relative Standing

Z-scores are a useful way to combine scores from data that has different means and standard deviations. Z-scores are an application of the above measures of center and spread.
Remember that the mean is the result of adding all of the values in the data set and then dividing by the number of values in the data set. The words mean and average are used interchangeably in statistics.
Recall also that the standard deviation can be thought of as a mathematical calculation of the average distance of the data from the mean of the data. Note that although I use the words average and mean, the sentence could also be written "the mean distance of the data from the mean of the data."
Z-Scores

Z-scores simply indicate how many standard deviations away from the mean a particular score is. This is termed "relative standing" as it is a measure of where in the data the score is relative to the mean, "standardized" by the standard deviation.

If the population mean μ and population standard deviation σ are known, then the formula for the z-score for a data value x is:

z = (x − μ) / σ

Using the sample mean x̄ and sample standard deviation sx, the formula for a data value x is:

z = (x − x̄) / sx

Note the parentheses! When typing in a spreadsheet do not forget the parentheses.

=(value-AVERAGE(data))/STDEV(data)
Data that is two standard deviations below the mean will have a z-score of −2, data that is two standard deviations above the mean will have a z-score of +2. Data beyond two standard deviations away from the mean will have z-scores below −2 or above +2. A data value that has a z-score below −2 or above +2 is considered an unusual value, an extraordinary data value. These values may also be outliers on a box plot depending on the distribution. Box plot outliers and extraordinary z-scores are two ways to characterize unusually extreme data values. There is no simple relationship between box plot outliers and extraordinary z-scores.
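The spreadsheet formula above can be sketched as a small Python function (an illustrative assumption of this edition, not part of the original text; the name z_score is the sketch's own):

```python
import statistics

def z_score(value, data):
    """How many sample standard deviations the value lies from the mean."""
    return (value - statistics.mean(data)) / statistics.stdev(data)

data = [2, 4, 6, 8]
print(z_score(8, data))  # about 1.16: just over one standard deviation up
print(z_score(5, data))  # 0.0: the mean itself has a z-score of zero
```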

Why z-scores?

Suppose a test has a mean score of 10 and a standard deviation of 2 with a total possible of 20. Suppose a second test has the same mean of 10 and total possible of 20 but a standard deviation of 8.
On the first test a score of 18 would be rare, an unusual score. On the first test 89% of the students would have scored between 4 and 16 (three standard deviations below the mean and three standard deviations above the mean).
On the second test a score of 18 would only be one standard deviation above the mean. This would not be unusual; the second test had more spread.
Adding the two scores of 18 and saying the student had a score of 36 out of 40 devalues what is a phenomenal performance on the first test.
Converting to z-scores, the relative strength of the performance on test one is valued more strongly. The z-score on test one would be (18 − 10)/2 = 4, while on test two the z-score would be (18 − 10)/8 = 1. The unusually outstanding performance on test one is now reflected in the sum of the z-scores, where the first test contributes a value of 4 and the second test contributes a value of 1.
When values are converted to z-scores, the mean of the z-scores is zero. A student who scored a 10 on either of the tests above would have a z-score of 0. In the world of z-scores, a zero is average!
Z-scores also adjust for different means due to differing total possible points on different tests.
Consider again the first test that had a mean score of 10 and a standard deviation of 2 with a total possible of 20. Now consider a third test with a mean of 100 and standard deviation of 40 with a total possible of 200. On this third test a score of 140 would be high, but not unusually high.
Adding the scores and saying the student had a score of 158 out of 220 again devalues what is a phenomenal performance on test one. The score on test one is dwarfed by the total possible on test three. Put another way, the 18 points of test one are contributing only 11% of the 158 score. The other 89% is the test three score. We are giving an eight-fold greater weight to test three. The z-scores of 4 and 1 would add to five. This gives equal weight to each test and the resulting sum of the z-scores reflects the strong performance on test one with an equal weight to the ordinary performance on test three.
Z-scores only provide the relative standing. If a test is given again and all students who take the test do better the second time, then the mean rises and like a tide "lifts all the boats equally." Thus an individual student might do better, but because the mean rose, their z-score could remain the same. This is also the downside to using z-scores to compare performances between tests - changes in "sea level" are obscured. One would have to know the mean and standard deviation and whether they changed to properly interpret a z-score.
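The arithmetic in this section can be checked with a short Python sketch (illustrative, not part of the original text; the means and standard deviations are taken from the worked example above):

```python
def z(score, mean, sd):
    # z-score: distance from the mean in standard deviation units
    return (score - mean) / sd

# Test one: mean 10, sd 2; test three: mean 100, sd 40
z1 = z(18, 10, 2)     # phenomenal performance on test one
z3 = z(140, 100, 40)  # high but not unusual on test three

print(z1, z3, z1 + z3)  # → 4.0 1.0 5.0
```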
Supplementary discussion on quartile calculations

The issue of differences in quartile calculations alluded to above may have a differentially stronger impact in the statistics classroom, where small sets of data are presented as a part of quizzes and tests. For large sample sizes of continuous ratio level data that is smoothly, symmetrically distributed and has no outliers, the quartile functions will produce very similar results, or results that differ by an amount that is simply not significant to the analysis. For small sample sizes as might be presented in a testing situation, where students are being marked on an exactly correct answer, the differences can be significant. For example, for the data set [120, 127, 132, 133, 135, 143, 147] the IQR can vary from 9.5 to 16. The variety of possible results can be seen in the following image showing results from Gnumeric, Qalculate (first quartile only), Alcula [http://www.alcula.com/calculators/statistics/box-plot/], and Wessa [http://wessa.net/quart.wasp].
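The spread of possible answers can be reproduced in Python, whose statistics.quantiles function exposes two of the competing conventions (the method names are Python's, not the original text's; "inclusive" corresponds to the spreadsheet QUARTILE function, while "exclusive" uses the n + 1 position basis found in some calculators):

```python
import statistics

data = [120, 127, 132, 133, 135, 143, 147]

# "inclusive": matches the spreadsheet QUARTILE function
q1_inc, _, q3_inc = statistics.quantiles(data, n=4, method="inclusive")

# "exclusive": the n + 1 basis used by some other tools
q1_exc, _, q3_exc = statistics.quantiles(data, n=4, method="exclusive")

print(q3_inc - q1_inc, q3_exc - q1_exc)  # → 9.5 16.0
```

The same seven numbers yield an IQR of 9.5 or 16 depending solely on the convention, which is exactly the classroom hazard described above.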


03 Visualizing data
3.1 Graphs and Charts

The table below includes FSM census 2000 data and student seat numbers for the national site of COM-FSM circa 2004.

State     Population (2000)   Fractional share of national population (relative frequency)   Number of student seats   Fractional share of student seats
Chuuk     53595               0.50                                                           679                       0.20
Kosrae    7686                0.07                                                           316                       0.09
Pohnpei   34486               0.32                                                           2122                      0.62
Yap       11241               0.11                                                           287                       0.08
Sums:     107008                                                                             3404

Circle or pie charts

In a circle chart the whole circle is 100%. Used when data adds to a whole, e.g. state populations add to yield the national population.
A pie chart of the state populations:

The following table includes data from the 2010 FSM census as an update to the above data.

State     Population (2010)   Relative frequency
Chuuk     48651
Kosrae    6616
Pohnpei   35981
Yap       11376
Sum:      102624
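The blank relative frequency column can be filled in by dividing each state's population by the sum; a Python sketch (illustrative, not part of the original spreadsheet-based text):

```python
populations = {"Chuuk": 48651, "Kosrae": 6616, "Pohnpei": 35981, "Yap": 11376}

total = sum(populations.values())  # 102624

# Relative frequency: each state's fractional share of the national population
rel_freq = {state: pop / total for state, pop in populations.items()}

for state, share in rel_freq.items():
    print(f"{state}: {share:.2f}")
```

The shares necessarily add to one, which is what makes the data suitable for a pie chart.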

Column charts

Column charts are also called bar graphs. A column chart of the student seats held by each state at the national site:

Pareto chart

If a column chart is sorted so that the columns are in descending order, then it is called a Pareto chart [../admissions/html/entrance61.html#pareto_comet]. Descending order means the largest value is on the left and the values decrease as one moves to the right. Pareto charts are useful ways to convey rank order as well as numerical data.
Line graph

A line graph is a chart which plots data as a line. The horizontal axis is usually set up with equal intervals. Line graphs are not used in this course and should not be confused with xy scattergraphs.

XY Scatter graph

When you have two sets of continuous data (value versus value, no categories), use an xy graph. These will be covered in more detail in the chapter on linear regressions.
3.2 Histograms and Frequency Distributions

A distribution counts the number of elements of data in either a category or within a range of values. Plotting the count of the elements in each category or range as a column chart generates a chart called a histogram. The histogram shows the distribution of the data. The height of each column shows the frequency of an event. This distribution often provides insight into the data that the data itself does not reveal. In the histogram below, the distribution for male body fat among statistics students has two peaks. The two peaks suggest that there are two subgroups among the men in the statistics course, one subgroup that is at a healthy level of body fat and a second subgroup at a higher level of body fat.

The ranges into which values are gathered are called bins, classes, or intervals. This text tends to use classes or bins to describe the ranges into which the data values are grouped.
Nominal level of measurement

At the nominal level of measurement one can determine the frequency of elements in a category, such as students by state in a statistics course.

State     Frequency   Rel Freq
Chuuk     6           0.11
Kosrae    6           0.11
Pohnpei   31          0.57
Yap       11          0.20
Sums:     54          1.00

Ordinal level of measurement

Data values into classes comprised of each unique data value

At the ordinal level, a frequency distribution can be done using the rank order, counting the number of elements in each rank order to obtain a frequency. When the frequency data is calculated in this way, the distribution is not grouped into a smaller number of classes. Note that some classes could be empty - the classes must still be equal width.
Age    Frequency   Rel Freq
17     1           0.02
18     5           0.10
19     14          0.27
20     12          0.24
21     9           0.18
22     1           0.02
23     3           0.06
24     3           0.06
25     1           0.02
26     1           0.02
27     1           0.02
Sums:  51
Data gathered into a number of classes fewer than the number of unique data values

The ranks can be collected together, classed, to reduce the number of rank order categories. In the example below the age data is gathered into two-year cohorts.

Age    Frequency   Rel Freq
19     20          0.39
21     21          0.41
23     4           0.08
25     4           0.08
27     2           0.04
Sums:  51

3.3 Frequency tables and histograms at the ratio level of measurement

At the ratio level data is always gathered into ranges. At the ratio level, classed histograms are used. Ratio level data is not necessarily in a finite number of ranks as was ordinal data.
The ranges into which data is gathered are defined by a class lower limit and a class upper limit. The width is the class upper limit minus the class lower limit. The frequency function in spreadsheets uses class upper limits. In this text histograms are also generated using the class upper limits.
To calculate the class lower and upper limits the minimum and maximum value in a data set must be determined. Spreadsheets include functions to calculate the minimum value MIN and maximum value MAX in a data set.
=MIN(data)
=MAX(data)
In LibreOffice the MIN and MAX functions can take a list of comma separated numbers or a range of cells in a spreadsheet. In statistics a range of cells is the most common input for these functions. When a range of cells is the usual input, this text uses the word "data" to refer to the fact that the range of cells is usually your data! Ranges of cells use two cell addresses separated by a full colon. An example is shown below where the data is arranged in a vertical column from A2 to A42.
Sort the original data from smallest to largest before you begin!
=MIN(A2:A42)
How to make a frequency table at the ratio level
1. Find the minimum value of the data set using the MIN function
2. Find the maximum value of the data set using the MAX function
3. Calculate the range by subtracting the MIN from the MAX:
range = maximum value - minimum value
4. Decide on the number of classes you are going to use (also called bins or intervals)
5. Divide the range by the number of classes to calculate the class width (or bin width or interval width)
6. Calculate the class upper limits
7. Put the class upper limits into a column of cells
8. Manually tally the data into the frequency column to determine the frequencies for each class. The class upper limit is included in each tally. As a check, the sum of the frequencies must be equal to the sample size.
9. Create a column chart
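The steps above can be sketched in Python (an illustrative translation of the spreadsheet procedure, not part of the original text; the function names are this sketch's own):

```python
def class_upper_limits(data, classes=5):
    """Steps 1-6: compute equal-width class upper limits from the data."""
    low, high = min(data), max(data)
    width = (high - low) / classes
    return [low + width * i for i in range(1, classes + 1)]

def frequencies(data, limits):
    """Step 8: tally values into classes; each class includes its upper limit."""
    counts = []
    previous = min(data) - 1  # anything at or below the first lower limit counts
    for limit in limits:
        counts.append(sum(1 for x in data if previous < x <= limit))
        previous = limit
    return counts
```

For the integer example later in this chapter (values 0 through 10, five classes) this reproduces the class upper limits 2, 4, 6, 8, 10 and frequencies that sum to the sample size.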
Class Upper Limits (CUL)   Frequency
min + class width
+ class width
+ class width
+ class width
+ class width = max
For the female height data:
58, 58, 59.5, 59.5, 60, 60, 60, 60, 60, 61, 61, 61.2, 61.5, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 63, 63, 63, 63.5, 64, 64, 64, 64, 65, 65, 66, 66
Five classes would produce the following results:
Min = 58
Max = 66
Range = 66 - 58 = 8
Width = 8/5 = 1.6

Calculation    Height (CUL)   Frequency
58 + 1.6       59.6           4
59.6 + 1.6     61.2           8
61.2 + 1.6     62.8           13
62.8 + 1.6     64.4           8
64.4 + 1.6     66             4
Sum:                          37

Note that 61.2 is INCLUDED in the class that ends at 61.2. The class includes values at the class upper limit. In other words, a class includes all values up to and including the class upper limit.
Note too that the frequencies add to the sample size.
After making the column chart, double click on the columns to open the data series dialog box. Find the Options tab and set the spacing (or gap width) to zero.

Note that the spacing or gap width on the columns has been set to zero.
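The tally for the female height data can be double-checked with a short Python sketch (illustrative, not part of the original text):

```python
heights = [58, 58, 59.5, 59.5, 60, 60, 60, 60, 60, 61, 61, 61.2, 61.5,
           62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 63, 63, 63,
           63.5, 64, 64, 64, 64, 65, 65, 66, 66]

limits = [59.6, 61.2, 62.8, 64.4, 66]  # class upper limits, width 1.6

counts = []
lower = min(heights) - 1
for upper in limits:
    # each class includes its upper limit
    counts.append(sum(1 for h in heights if lower < h <= upper))
    lower = upper

print(counts, sum(counts))  # → [4, 8, 13, 8, 4] 37
```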
Relative Frequency

Relave frequency is one way to determine a probability.


Divide each frequency by the sum (the sample size) to get the relave frequency
Height CUL Frequency Relative Frequency f/n or P(x)
59.6
4
0.11
61.2

0.22

62.8

13

0.35

64.4

0.22

66

0.11

Sum:

37

1.00

The relave frequency always adds to one (rounding causes the above to add to 1.01, if all the decimal places were used the relave frequencies would add to one.

The area under the relave frequency columns is equal to one.


Another example using integers:

0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4.5, 5, 5, 5, 6, 6, 7, 8, 9, 10
Five classes
min = 0
max = 10
range = 10
width = 10/5 = 2

Class Num   Calculation   CUL   Frequency   Relative Frequency f/n or P(x)
1           min + width   2     4           0.20
2           + width       4     6           0.30
3           + width       6     6           0.30
4           + width       8     2           0.10
5           + width       10    2           0.10
            Sum:                20          1.00

The above method produces equal width classes and conforms to the inclusion of the class upper limit by spreadsheet packages.
Checking frequency tables

The final class upper limit must be equal to the maximum value in the data set. The frequencies must sum to the sample size n. The relative frequencies must add to 1.00.

CUL               Frequency       Relative Frequency f/n
min + width
+ width
+ width
+ width
+ width = MAX
Sum:              sample size n   1.00

Frequency function

For more advanced spreadsheet users, frequency data can be obtained using the frequency funcon FREQUENCY. This funcon is also very useful when working
with large data sets. The frequency funcon is:
=FREQUENCY(DATA,CLASSES)
DATA refers to the range of cells containing the data, CLASSES refers to the range of cells containing the class upper limits.
The data set seen below are the height measurements for 49 female students in stascs courses during two consecuve terms.
The frequency funcon built into spreadsheets works very dierently from all other funcons. The frequency funcon called an "array" funcon because the
funcon places values into an array of cells. For the funcon to do this, you must rst select the cells into which the funcon will place the frequency values.

With the cells sll highlighted, start typing the frequency funcon. Aer typing the opening parenthesis, drag and select the data to be classed. If the data is more
than can be selected by dragging, type the data range in by hand.

20 de 56

Type a comma, drag and select the class upper limits, then type the closing parenthesis.

Then press and hold down BOTH the CONTROL (Ctrl) key and the SHIFT key. With both the control and shi keys held down, press the Enter (or Return) key.

As noted above, the frequencies should add to the sample size. When working with spreadsheets, internal rounding errors can cause the maximum value in a data
set to not get included in the nal class. In the last class, use the value obtained by the MAX funcon and not the previous class + a width formula to generate that
class upper limit.
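A Python equivalent of the spreadsheet FREQUENCY behavior can be sketched with the standard library's bisect module (an illustrative assumption, not part of the original text). Because bisect_left returns the first class whose upper limit is greater than or equal to the value, each class includes its upper limit, matching the convention used throughout this chapter:

```python
import bisect

def frequency(data, class_upper_limits):
    """Count how many values fall in each class, including the upper limit.

    Assumes the last class upper limit equals the maximum of the data.
    """
    counts = [0] * len(class_upper_limits)
    for value in data:
        counts[bisect.bisect_left(class_upper_limits, value)] += 1
    return counts

scores = [0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4.5, 5, 5, 5, 6, 6, 7, 8, 9, 10]
print(frequency(scores, [2, 4, 6, 8, 10]))  # → [4, 6, 6, 2, 2]
```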
3.4 Shapes of Distributions

The shapes of distributions have names by which they are known.

One of the aspects of a sample that is often similar to the population is the shape of the distribution. If a good random sample of sufficient size has a symmetric distribution, then the population is likely to have a symmetric distribution. The process of projecting results from a sample to a population is called generalizing. Thus we can say that the shape of a sample distribution generalizes to a population.
Both box plots and frequency histograms show the distribution of the data. Box plots and frequency histograms are two different views of the distribution of the data. There is a relationship between the frequency histogram and the associated box plot. The following charts show the frequency histograms and box plots for three distributions: a uniform distribution, a peaked symmetric heap distribution, and a left skewed distribution.

The uniform data is evenly distributed across the range. The whiskers run from the maximum to minimum value and the InterQuartile Range is the largest of the three distributions.
The peaked symmetric data has the smallest InterQuartile Range; the bulk of the data is close to the middle of the distribution. In the box plot this can be seen in the small InterQuartile Range centered on the median. The peaked symmetric data has two potential outliers at the minimum and maximum values. For the peaked symmetric distribution data is usually found near the middle of the distribution.
The skewed data has the bulk of the data near the maximum. In the box plot this can be seen by the InterQuartile Range - the box - being "pushed" up towards the maximum value. The whiskers are also of an unequal length, another sign of a skewed distribution.

[Table: the data values for the three example distributions - uniform (the integers 10 through 28), peaked symmetric, and left skewed - were listed here in three columns; the column alignment was lost in extraction.]

Endnote: Creating histograms with spreadsheets


Making histograms with LibreOffice.org Calc

Select both the column with the class and the column with the frequencies.

22 de 56

Click on the chart wizard buon. If not selected by default, choose a column chart. Click on Next.

At the second step, the data range step, select "First column as label" as seen in the next image.

Click on Next. At step three there is usually nothing that needs to be done if one has correctly selected their columns prior to starng the chart wizard.

Click on Next. On the next screen ll in the appropriate tles. The legend can be "unchecked" as seen in the next image.

When done, click on Finish. Double click any column in the chart to open up the data series dialog box.

23 de 56

Click on the opons tab and set the Spacing to zero percent as seen in the previous image. In the data series dialog box one can alter the background color, add
column borders, or make other customizaons. Click on OK.
Gnumeric histogram notes

Select both the class upper limits and the frequencies. Choose the chart wizard. At the rst step of the chart wizard select the Column chart opon. Click on "Use
rst series as shared abscissa". The rst series is the rst column, the class upper limits. The abscissa is another word for x-axis.

At step two of two, select PlotBarCol1 and set the Gap to zero.

In step two a tle can be added to Chart1 by clicking on Chart1 and then clicking on the Add buon. The drop down menu includes the item "Title to Chart1" in
alphabec order on the list. To add a label to Y-Axis1, click on Add and then choose "Label to Y-Axis1". When one has made all desired modicaons, click on Insert
and then drag to size the chart.
As an anecdote, dragging to choose the size of the chart is the way Microso Excel 95 operated. While this may seem retro, this is an instructor's blessing. No two
students are going to execute the exact same drag, hence no two homework assignments should have exactly the same size chart in exactly the same locaon.
04 Paired Data and Scatter Diagrams
4.1 Best Fit Lines: Linear Regressions

A runner runs from the College of Micronesia-FSM National campus to PICS via the powerplant/Nahnpohnmal back road. The runner tracks his time and distance.

Location                   Time x (minutes)   Distance y (km)
College                    0                  0
Dolon Pass                 20                 3.3
Turn-off for Nahnpohnmal   25                 4.5
Bottom of the beast        33                 5.7
Top of the beast           34.5               5.9
Track West                 55                 9.7
PICS                       56                 10.1

Is there a relationship between the time and the distance? If there is a relationship, then the data will fall in a patterned fashion on an xy graph. If there is no
relationship, then there will be no shape to the pattern of the data on the graph.
If the relationship is linear, then the data will fall roughly along a line. Plotting the above data yields the following graph:


The data falls roughly along a line; the relationship appears to be linear. If we can find the equation of a line through the data, then we can use the equation to predict
how long it will take the runner to cover distances not included in the table above, such as five kilometers. In the next image a best fit line has been added to the
graph.

The best fit line is also called the least squares line because the mathematical process for determining the line minimizes the square of the vertical displacement of
the data points from the line. The process of determining the best fit line is also known as performing a linear regression. Sometimes the line is referred to as a
linear regression.
The graph of time versus distance for a runner is a line because a runner runs at roughly the same pace kilometer after kilometer.
Sample size n for paired data

For paired data the sample size n is the number of pairs. This is usually also the number of rows in the data table. Do NOT count both the x and y values; the (x, y)
data should be counted in pairs.
4.2 Slope and Intercept
Slope

A spreadsheet is used to find the slope and the y-intercept of the best fit line through the data.
To get the slope m use the function:
=SLOPE(y-values,x-values)
Note that the y-values are entered first and the x-values second. This is the reverse of traditional algebraic order, where coordinate pairs are listed in the
order (x, y). The x and y-values are usually arranged in columns. The column containing the x data is usually to the left of the column containing the y-values. An
example where the data is in the first two columns from row two to forty-two can be seen below.
=SLOPE(B2:B42,A2:A42)
Intercept

The intercept is usually the starting value for a function. Often this is the y data value at time zero, or distance zero.
To get the intercept:
=INTERCEPT(y-values,x-values)
Note that intercept also reverses the order of the x and y values!
For the runner data above the equation is:
distance = slope * time + y-intercept
distance = 0.18 * time - 0.13
y = 0.18 * x - 0.13
or
y = 0.18x - 0.13
where x is the time and y is the distance
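As a check on the spreadsheet results, the least-squares slope and intercept can be computed from their definitions. This is an illustrative Python sketch of what =SLOPE and =INTERCEPT calculate, using the runner data from the table above:

```python
# Least-squares slope and intercept, mirroring the spreadsheet
# functions =SLOPE(y-values, x-values) and =INTERCEPT(y-values, x-values).
times = [0, 20, 25, 33, 34.5, 55, 56]        # x: minutes
dists = [0, 3.3, 4.5, 5.7, 5.9, 9.7, 10.1]   # y: kilometers

n = len(times)
mean_x = sum(times) / n
mean_y = sum(dists) / n

# slope = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(times, dists)) \
        / sum((x - mean_x) ** 2 for x in times)
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))  # 0.18 -0.13
```

Rounded to two decimals, these match the 0.18 slope and -0.13 intercept quoted in the text.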

In algebra the equation of a line is written as y = m*x + b where m is the slope and b is the intercept. In statistics the equation of a line is written as y = a + b*x where
a is the intercept (the starting value) and b is the slope. The two fields have their own traditions, and the letters used for slope and intercept are a tradition that
differs between the field of mathematics and the field of statistics.


Using the y = mx + b equation we can make predictions about how far the runner will travel given a time, or how long a duration of time the runner will run given a
distance. For example, according to the equation above, a 45 minute run will result in the runner covering 0.18*45 - 0.13 = 7.97 kilometers. Using the inverse of the
equation we can predict that the runner will run a five kilometer distance in 28.5 minutes (28 minutes and 30 seconds).
Given any time, we can calculate the distance. Given any distance, we can solve for the time.
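As a quick check of these predictions, the equation and its inverse can be coded directly (an illustrative Python sketch using the rounded slope and intercept above):

```python
slope, intercept = 0.18, -0.13  # rounded values for the runner data

def distance(minutes):
    """Predict distance (km) from running time using y = m*x + b."""
    return slope * minutes + intercept

def time_for(km):
    """Invert the equation: solve km = slope*t + intercept for t."""
    return (km - intercept) / slope

print(distance(45))   # about 7.97 km for a 45 minute run
print(time_for(5))    # about 28.5 minutes for five kilometers
```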
4.3 Relationships between variables

After plotting the x and y data, the xy scattergraph helps determine the nature of the relationship between the x values and the y values. If the points lie along a
straight line, then the relationship is linear. If the points form a smooth curve, then the relationship is non-linear (not a line). If the points form no pattern, then the
relationship is random.
[Four example xy scattergraphs, each with axes marked from 10 to 100: "Linear: Positive relationship", "Linear: Negative relationship", "Non-linear relationship", and "No relationship: random correlation".]

Relationships between two sets of data can be positive: the larger x gets, the larger y gets.
Relationships between two sets of data can be negative: the larger x gets, the smaller y gets.
Relationships between two sets of data can be non-linear.
Relationships between two sets of data can be random: no relationship exists!
For the runner data above, the relationship is positive. The points lie along a line, therefore the relationship is linear.
An example of a negative relationship would be the number of beers consumed by a student versus a measure of the student's physical coordination. The more beers consumed,
the lower the coordination!
4.4 Correlation

For a linear relationship, the closer the points fall to a straight line, the stronger the relationship. The measurement that describes how closely the
points fit a line is called the correlation.
The following example explores the correlation between the distance of a business from a city center and the amount
of product sold per person. In this case the businesses are places that serve pounded Piper methysticum plant roots, known
elsewhere as kava but known locally as sakau. This business is unique in that customers self-limit their purchases, buying
only as many cups of sakau as necessary to get the warm, sleepy feeling that the drink induces. The businesses are
locally referred to as sakau markets. The local theory is that the further one travels from the main town (and thus deeper
into the countryside of Pohnpei) the stronger the sakau that is served. If this is the case, then the mean number of cups
should fall with distance from the main town on the island.
The following table uses actual data collected from these businesses; the names of the businesses have been changed.
Sakau Market     distance/km (x)   mean cups per person (y)
Upon the river   3.0               5.18
Try me first     13.5              3.93
At the bend      14.0              3.19
Falling down     15.5              2.62
The first question a statistician would ask is whether there is a relationship between the distance and mean cup data. Determining whether there is a relationship is
best done with an xy scattergraph of the data.
If we plot the points on an xy graph using a spreadsheet, the y-values can be seen to fall with increasing x-value. The data points, while not all exactly on one line,
are not far away from the best fit line. The best fit line indicates a negative relationship. The larger the distance, the smaller the mean number of cups consumed.

We use a number called the Pearson product-moment correlation coefficient r to tell us how well the data fits a straight line. The full name is long; in statistics this
number is called simply r. The value r can be calculated using a spreadsheet function.
The function for calculating r is:
=CORREL(y-values,x-values)

Note that the order does not technically matter: the correlation of x to y is the same as that of y to x. For consistency the y-data, x-data order is retained above.
The Pearson product-moment correlation coefficient r (or just correlation r) values that result from the formula are always between -1 and 1. One is perfect positive
linear correlation. Negative one is perfect negative linear correlation. If the correlation is zero or close to zero there is no linear relationship between the variables.
A guideline to r values:

Note that perfect has to be perfect: 0.99999 is very close, but not perfect. In real world systems perfect correlation, positive or negative, is rarely if ever seen. A
correlation of 0.0000 is also rare. Systems that are purely random are also rarely seen in the real world.
Spreadsheets usually round to two decimals when displaying decimal numbers. A correlation r of 0.999 is displayed as "1" by spreadsheets. Use the Format menu to
select the Cells item. In the Cells dialog box, click on the Numbers tab to increase the number of decimal places. When the correlation is not perfect, adjust the
decimal display and write out all the decimals.
The correlation r of -0.93 is a strong negative correlation. The relationship is strong and the relationship is negative. The equation of the best fit line is y = -0.18x +
5.8, where y is the mean number of cups and x is the distance from the main town. The equations that generated the slope, y-intercept, and correlation can be seen
in the earlier image.
The strong relationship means that the equation can be used to predict mean cup values, at least for distances between 3.0 and 15.5 kilometers from town.
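For readers who want to see what =CORREL computes, here is an illustrative Python sketch applying the definition of r to the sakau market data:

```python
import math

dist = [3.0, 13.5, 14.0, 15.5]     # x: distance from town (km)
cups = [5.18, 3.93, 3.19, 2.62]    # y: mean cups per person

n = len(dist)
mx, my = sum(dist) / n, sum(cups) / n

# Pearson r = sum((x-mx)(y-my)) / sqrt(sum((x-mx)^2) * sum((y-my)^2))
sxy = sum((x - mx) * (y - my) for x, y in zip(dist, cups))
sxx = sum((x - mx) ** 2 for x in dist)
syy = sum((y - my) ** 2 for y in cups)
r = sxy / math.sqrt(sxx * syy)

print(round(r, 2))  # -0.93: a strong negative correlation
```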
A second example is drawn from body fat data. The following chart plots the age in years of female statistics students against their body fat index.


Is there a relationship seen in the xy scattergraph between the age of a female statistics student and her body fat index? Can we use the equation to predict body fat
index based on age alone?
If we plot the points on an xy graph using a spreadsheet as seen above, the data does not appear to be linear. The data points do not form a discernible pattern.
The data appears to be scattered randomly about the graph. Although a spreadsheet is able to give us a best fit line (a linear regression or least squares line), that
equation will not be useful for predicting body fat index based on age.
In the example above the correlation r can be calculated and is found to be 0.06. Zero would be random correlation. This value is so close to zero that the correlation
is effectively random. The relationship is random. There is no relationship. The linear equation cannot be used to predict the body fat index given the age.
Limitations of linear regressions

We cannot usually predict values that are below the minimum x or above the maximum x values and expect meaningful predictions. In the example of the runner, we
could calculate how far the runner would run in 72 hours (three days and three nights), but it is unlikely the runner could run continuously for that length of time.
For some systems values can be predicted below the minimum x or above the maximum x value. When we do this it is called extrapolation. Very few systems can be
extrapolated, but some systems remain linear for values near the provided x values.
Coefficient of Determination r²

The coefficient of determination, r², is a measure of how much of the variation in the dependent y variable is explained by the variation in the independent x variable. This
does NOT imply causation. In spreadsheets the ^ symbol (shift-6) is exponentiation. In spreadsheets we can square the correlation with the following formula:
=(CORREL(y-values,x-values))^2
The result, which is between 0 and 1 inclusive, is often expressed as a percentage.
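Continuing the sakau example, squaring r gives the coefficient of determination (an illustrative sketch; the r value is the one computed for that data):

```python
r = -0.93221  # Pearson r for the sakau market data (computed earlier)
r_squared = r ** 2

# Express the result as a percentage, as the text suggests.
print(f"{r_squared:.0%}")  # about 87% of the variation in mean cups
                           # is accounted for by distance from town
```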
Imagine a Yamaha outboard motor fishing boat sitting out beyond the reef in an open ocean swell. The swell moves the boat gently up and down. Now suppose
there is a small boy moving around in the boat. The boat is rocked and swayed by the boy. The total motion of the boat is in part due to the swell and in part due to
the boy. Maybe the swell accounts for 70% of the boat's motion while the boy accounts for 30% of the motion. A model of the boat's motion that took into account
only the motion of the ocean would generate a coefficient of determination of about 70%.
Causality

Finding that a correlation exists does not mean that the x-values cause the y-values. A line does not imply causation: your age does not cause your pounds of body
fat, nor does time cause distance for the runner.
Studies of Micronesia in the mid 1800s would have shown an increase each year in both church attendance and sexually transmitted diseases (STDs). That does NOT mean
churches cause STDs! What the data is revealing is a common variable underlying the data: foreigners brought both STDs and churches. Any correlation is simply the
result of the common impact of the increasing influence of foreigners.
Calculator usage notes

Some calculators will generate a best fit line. Be careful. In algebra straight lines had the form y = mx + b where m was the slope and b was the y-intercept. In
statistics lines are described using the equation y = a + bx. Thus b is the slope! And a is the y-intercept! You would not normally need to know this, but your calculator will
likely use b for the slope and a for the y-intercept. The exception is some TI calculators that use SLP and INT for slope and intercept respectively.
Physical science note

A note only for those in physical science courses. In some physical systems the data point (0,0) is the most accurately known measurement in the system. In this
situation the physicist may choose to force the linear regression through the origin at (0,0). This forces the line to have an intercept of zero. There is another function
in spreadsheets which can force the intercept to be zero: the LINear ESTimator function. The following function uses time versus distance, common x and y values in
physical science.
=LINEST(distance (y) values,time (x) values,0)


Note that, as with the slope and intercept functions, the y-values are entered first and the x-values second.
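Forcing the intercept to zero changes the slope formula: with b = 0, minimizing the squared errors gives slope = Σxy/Σx² with no mean-centering. An illustrative Python sketch of what =LINEST(y, x, 0) computes, reusing the runner data as stand-in physical-science data:

```python
times = [0, 20, 25, 33, 34.5, 55, 56]        # x values (time)
dists = [0, 3.3, 4.5, 5.7, 5.9, 9.7, 10.1]   # y values (distance)

# With the intercept forced to zero, minimizing sum((y - m*x)^2)
# gives m = sum(x*y) / sum(x*x): no subtraction of the means.
slope_origin = sum(x * y for x, y in zip(times, dists)) \
               / sum(x * x for x in times)

print(round(slope_origin, 3))  # 0.176: close to, but not equal to,
                               # the unconstrained slope of about 0.179
```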
05 Probability
5.1 Ways to determine a probability

A probability is the likelihood of an event or outcome. Probabilities are specified mathematically by a number between 0 and 1, including 0 and 1.
0 is no likelihood an event will occur.
1 is absolute certainty an event will occur.
0.5 is an equal likelihood of occurrence or non-occurrence.
Any value between 0 and 1 can occur.
We use the notation P(eventLabel) = probability to report a probability.
There are three ways to assign probabilities.
1. Intuition or subjective estimate
2. Equally likely outcomes
3. Relative frequencies
Intuition

Intuition/subjective measure: an educated best guess, using available information to make a best estimate of a probability. This could be anything from a wild guess to
an educated and informed estimate by experts in the field.
Equally Likely Events or Outcomes

Equally Likely Events: Probabilities from mathematical formulas


In the following the word "event" and the word "outcome" are taken to have the same meaning.
Probabilities versus Statistics

The study of problems with equally likely outcomes is termed the study of probabilities. This is the realm of the mathematics
of probability. Using the mathematics of probability, the outcomes can be determined ahead of time. Mathematical formulas determine the probability of a particular outcome. All measures are
population parameters. The mathematics of probability determines the probabilities for coin tosses, dice, cards, lotteries, bingo, and other games of chance.
This course focuses not on probability but rather on statistics. In statistics, measurements are made on a sample taken from the population and used to estimate the
population's parameters. All possible outcomes are not usually known. The population mean μ is usually not known and might not be knowable. Relative frequencies will be used to
estimate population parameters.
Calculating Probabilities

Where each and every event is equally likely, the probability of an event occurring can be determined from:
probability = ways to get the desired event/total possible events
or
probability = ways to get the particular outcome/total possible outcomes
Dice and Coins
Binary probabilities: yes or no, up or down, heads or tails

A penny

P(head on a penny) = one way to get a head/two sides = 1/2 = 0.5 or 50%
That probability, 0.5, is the probability of getting a head or a tail prior to the toss. Once the toss is done, the coin is either a head or a tail, 1 or 0, all or nothing.
There is no 0.5 probability anymore.
Over any ten tosses there is no guarantee of five heads and five tails: probability does not work like that. Over any small sample the ratios of expected outcomes can
differ from the mathematically calculated ratios.
Over thousands of tosses, however, the ratio of outcomes such as the number of heads to the number of tails will approach the mathematically predicted amount.
We refer to this as the law of large numbers.
In effect, a few tosses is a sample from a population that consists, theoretically, of an infinite number of tosses. Thus we can speak about a population mean for an
infinite number of tosses. That population mean is the mathematically predicted probability.
Population mean = (number of ways to get a desired outcome)/(total possible outcomes)
Dice: Six-sided

A six-sided die. Six sides. Each side equally likely to appear. Six total possible outcomes. Only one way to roll a one: the side with a single pip must face up. 1 way to
get a one/6 possible outcomes = 0.1667 or 17%
P(1) = 0.17
Dice: Four, eight, twelve, and twenty sided

The formula remains the same: the number of possible ways to get a particular roll divided by the number of possible outcomes (that is, the number of sides!).
Think about this: what would a three-sided die look like? How about a two-sided die? What about a one-sided die? What shape would that be? Is there such a thing?


Two dice

Ways to get a five on two dice: 1 + 4 = 5, 2 + 3 = 5, 3 + 2 = 5, 4 + 1 = 5 (each die is unique). Four ways to get a five/36 total possibilities = 4/36 = 0.11 or 11%
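The 4/36 count can be double-checked by enumerating the entire two-dice sample space (an illustrative sketch, not part of the original text):

```python
from itertools import product

# Enumerate all 36 equally likely (die1, die2) outcomes.
outcomes = list(product(range(1, 7), repeat=2))
fives = [pair for pair in outcomes if sum(pair) == 5]

p_five = len(fives) / len(outcomes)
print(len(fives), len(outcomes), round(p_five, 2))  # 4 36 0.11
```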
Homework:
1. What is the probability of rolling a three on...
a. A four sided die?
b. A six sided die?
c. An eight sided die?
d. A twelve sided die?
e. A twenty sided die labeled 0-9 twice.
2. What is the probability of throwing two pennies and having both come up heads?
5.2 Sample space

The sample space is the set of all possible outcomes in an experiment or system.


Bear in mind that the following is an oversimplification of the complex biogenetics of achromatopsia for the sake of a statistics example. Achromatopsia is controlled
by a pair of genes, one from the mother and one from the father. A child is born an achromat when the child inherits a recessive gene from both the mother and the
father.
A is the dominant gene
a is the recessive gene
A person with the combination AA is "double dominant" and has "normal" vision.
A person with the combination Aa is termed a carrier and has "normal" vision.
A person with the combination aa has achromatopsia.
Suppose two carriers, Aa, marry and have children. The sample space for this situation is as follows:

                mother
                A    a
    father  A   AA   Aa
            a   Aa   aa

The above diagram of all four possible outcomes represents the sample space for this exercise. Note that for each and every child there is only one possible
outcome. The outcomes are said to be mutually exclusive and independent. Each outcome is as likely as any other individual outcome. All possible outcomes can be
calculated. The sample space is completely known. Therefore the above involves probability and not statistics.
The probability of these two parents bearing a child with achromatopsia is:
P(achromat) = one way for the child to inherit aa/four possible combinations = 1/4 = 0.25 or 25%
This does NOT mean one in every four children will necessarily be an achromat. Suppose they have eight children. While it could turn out that exactly two children
(25%) would have achromatopsia, other likely results are a single child with achromatopsia or three children with achromatopsia. Less likely, but possible, would be
results of no achromat children or four achromat children. If we decide to work from actual results and build a frequency table, then we would be dealing with
statistics.
The probability of bearing a carrier is:
P(carrier) = two ways for the child to inherit Aa/four possible combinations = 2/4 = 0.50
Note that while each outcome is equally likely, there are TWO ways to get a carrier, which results in a 50% probability of a child being a carrier.
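The same Punnett-square counting can be automated; this illustrative sketch simply enumerates the four equally likely gene combinations:

```python
mother, father = "Aa", "Aa"  # two carrier parents

# Each child inherits one gene from each parent; all four
# combinations are equally likely and mutually exclusive.
children = [m + f for m in mother for f in father]
print(children)  # ['AA', 'Aa', 'aA', 'aa']

p_achromat = children.count("aa") / len(children)
# 'Aa' and 'aA' are both carriers: the order of the genes does not matter.
p_carrier = sum(1 for c in children if set(c) == {"A", "a"}) / len(children)
print(p_achromat, p_carrier)  # 0.25 0.5
```

Changing the parent strings (for example an aa father and an Aa mother) answers the at-your-desk and homework questions below.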
At your desk: mate an achromat aa father and a carrier Aa mother.
1. What is the probability a child will be born an achromat? P(achromat) = ________
2. What is the probability a child will be born with "normal" vision? P("normal") = ______
Homework: mate an AA father and an achromat aa mother.
1. What is the probability a child will be born an achromat? P(achromat) = ________
2. What is the probability a child will be born with "normal" vision? P("normal") = ______
See: http://www.achromat.org/ [http://www.achromat.org/] for more information on achromatopsia.
Genetically linked schizophrenia is another genetic example:
Mol Psychiatry. 2003 Jul;8(7):695-705, 643. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12874606&dopt=Abstract] Genome-wide scan in a large complex
pedigree with predominantly male schizophrenics from the island of Kosrae: evidence for linkage to chromosome 2q. Wijsman EM, Rosenthal EA, Hall D, Blundell ML,
Sobin C, Heath SC, Williams R, Brownstein MJ, Gogos JA, Karayiorgou M. Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA,
USA. It is widely accepted that founder populations hold promise for mapping loci for complex traits. However, the outcome of these mapping efforts will most likely
depend on the individual demographic characteristics and historical circumstances surrounding the founding of a given genetic isolate. The 'ideal' features of a founder
population are currently unknown. The Micronesian islandic population of Kosrae, one of the four islands comprising the Federated States of Micronesia (FSM), was
founded by a small number of settlers and went through a secondary genetic 'bottleneck' in the mid-19th century. The potential for reduced etiological (genetic and
environmental) heterogeneity, as well as the opportunity to ascertain extended and statistically powerful pedigrees, makes the Kosraen population attractive for
mapping schizophrenia susceptibility genes. Our exhaustive case ascertainment from this islandic population identified 32 patients who met DSM-IV criteria for
schizophrenia or schizoaffective disorder. Three of these were siblings in one nuclear family, and 27 were from a single large and complex schizophrenia kindred that
includes a total of 251 individuals. One of the most startling findings in our ascertained sample was the great difference in male and female disease rates. A
genome-wide scan provided initial suggestive evidence for linkage to markers on chromosomes 1, 2, 3, 7, 13, 15, 19, and X. Follow-up multipoint analyses gave
additional support for a region on 2q37 that includes a schizophrenia locus previously identified in another small genetic isolate, with a well-established recent
genealogical history and a small number of founders, located on the eastern border of Finland. In addition to providing further support for a schizophrenia susceptibility
locus at 2q37, our results highlight the analytic challenges associated with extremely large and complex pedigrees, as well as the limitations associated with genetic
studies of complex traits in small islandic populations. PMID: 12874606 [PubMed - indexed for MEDLINE]

The above article is both fascinating and, at the same time, calls into question privacy issues. On the small island of Kosrae "three siblings from one nuclear family"
are identifiable people.
5.3 Relative Frequency

The third way to assign probabilities is from relative frequencies. Each relative frequency represents a probability of that event occurring for that sample space.
Body fat percentage data was gathered from 58 females here at the College since summer 2001. The data had the following characteristics:
count 59
mean 28.7
sx 7.1
min 15.6
max 50.1
A five class frequency and relative frequency table has the following results:
BFI = Body Fat Index (percentage*100)
CLL = Class (bin) Lower Limit
CUL = Class (bin) Upper Limit (the term Excel uses)
Note that the classes are not of equal width in this example.

Athlecally t*

BFI fem CUL Frequency Relative Frequency


x
f
f/n or P(x)
20
3
0.05

Physically t

24

15

0.25

Acceptable

31

24

0.41

Borderline obese (overfat) 39

12

0.20

Medically obese

0.08

Medical Category

51

Sample size n: 59

1.00

* body fat percentage category


This means there is a...
0.05 (five percent) probability of a female student in the sample having a body fat percentage between 12 and 20 (athletically fit)
0.25 (25%) probability of a female student in the sample having a body fat percentage between 20.1 (the Tanita unit only measured to the nearest tenth) and 24
(physically fit)
0.41 (41%) probability of a female student in the sample having a body fat percentage between 24.1 and 31 (acceptable but not a fit level of fat)
0.20 (20%) probability of a female student in the sample having a body fat percentage between 31.1 and 39 (on the borderline between acceptable and obese)
0.08 (8%) probability of a female student in the sample having a body fat percentage between 39.1 and 51 (medically obese)
The most probable result (most likely) is a body fat measurement between 24.1 and 31, with a 41% probability of a student being in that interval.
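The relative frequency column is each class frequency divided by the sample size n. An illustrative sketch using the female frequencies from the table above:

```python
# Class frequencies for the female body fat data (from the table above).
freqs = {"athletically fit": 3, "physically fit": 15, "acceptable": 24,
         "borderline obese": 12, "medically obese": 5}

n = sum(freqs.values())  # sample size: 59
rel_freq = {cat: round(f / n, 2) for cat, f in freqs.items()}

print(n)
print(rel_freq)
print(max(rel_freq, key=rel_freq.get))  # 'acceptable' is the most probable
```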
The same table, but for male students:

Medical Category             BFI male CUL (x)   Frequency (f)   Relative Frequency (f/n or P(x))
Athletically fit             13                 9               0.18
Physically fit               17                 11              0.22
Acceptable                   20                 10              0.20
Borderline obese (overfat)   25                 9               0.18
Medically obese              50                 12              0.24
Sample size n:                                  51              1.00

The male students have a higher probability of being obese than the female students!
Kosraens abroad: Another example

What is the probability that a Kosraen lives outside of Kosrae? An informal survey [../kosrae/20071225.html#diaspora] done on the 25th of December 2007 produced the
following data. The table also includes data gathered at Christmas 2003.
Kosraen population estimates

Location            2003 Conservative   2003 Possible   2007    Growth
Ebeye               -                   -               30      -
Guam                200                 300             300     50%
Honolulu            600                 1000            1000    67%
Kona                200                 200             800     300%
Maui                100                 100             60      -40%
Pohnpei             200                 200             300     50%
Seattle             200                 200             600     200%
Texas               200                 200             N/A     -
Virginia Beach      200                 200             N/A     -
USA Other           200                 -               N/A     -
Diaspora sums:      1700                2400            3090    -
Kosrae              7663                7663            8183    -
Est. Total Pop.:    9363                10063           11273   -
Percentage abroad:  18.2%               23.8%           27%     48%

The relative frequency of 27% is a point estimate for the probability that a Kosraen lives outside of Kosrae.
Law of Large Numbers

For relative frequency probability calculations, as the sample size increases the probabilities get closer and closer to the true population parameter (the actual
probability for the population). Bigger samples are more accurate.
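The law of large numbers can be illustrated with a simulated fair coin: with a handful of tosses the relative frequency of heads wanders, but with thousands it settles near 0.5 (an illustrative sketch; the seed is arbitrary):

```python
import random

random.seed(42)  # arbitrary seed so the run is repeatable

for n in (10, 100, 10_000):
    # Each toss is heads with probability 0.5.
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)  # the relative frequency drifts toward 0.5
```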
07 Introduction to the Normal Distribution
7.1 Distribution shape

Inferential statistics is all about measuring a sample and then using those values to predict the values for a population. The measurements of the sample are called
statistics; the measurements of the population are called parameters. Some sample statistics are good predictors of their corresponding population parameter.
Other sample statistics are not able to predict their population parameter.

The sample size will always be smaller than the population. The population size N cannot be predicted from the sample size n. The sample mode is not usually the
same as the population mode. The sample median is also not necessarily a good predictor of the population median.
The sample mean for a good, random sample is a good point estimate of the population mean μ. The sample standard deviation sx predicts the population
standard deviation σ. The shape of the distribution of the sample is a good predictor of the shape of the distribution of the population.
That the shape of the population distribution can be predicted by the shape of the distribution of a good random sample is important. Later in the course we will be
predicting the population mean μ. Instead of predicting a single value we will predict a range in which the population mean will likely be found.
Consider as an example the following question: "How long does it take to drive from Kolonia to the national campus on Pohnpei?" A typical answer would be "Ten to
twenty minutes." Everyone knows that the time varies, so a range is quoted. The average time to drive to the national campus is somewhere in that range.
Determining the appropriate range in which a population mean will be found depends on the shape of the distribution. A bimodal distribution is likely to need a
larger range than a symmetrical bell shaped distribution in order to be sure to capture the population mean.
As a result of the above, we need to understand the shape of distributions generated by different systems. The most important shape in statistics is the shape of a
purely random distribution, like that generated by tossing many pennies.
In class exercise: flipping seven pennies. Students flip seven pennies and record the number of heads. The data for a section is gathered and tabulated. The students
then prepare a relative frequency histogram of the number of heads and calculate the mean number of heads from x*P(x).
7.2 Seven Pennies

In the table below, seven pennies are tossed eight hundred and fifty eight times. For each toss of the seven pennies, the number of pennies landing heads up is
counted.
# of heads x   Frequency   Rel Freq P(x)
7              9           0.0105
6              112         0.1305
5              147         0.1713
4              228         0.2657
3              195         0.2273
2              120         0.1399
1              45          0.0524
0              2           0.0023
Sums:          858         1.00


The relative frequency histogram for a large number of pennies is usually a heap-like shape. For seven pennies the theoretic shape of an infinite number of tosses
can be calculated by considering the whole sample space for seven pennies:

HHHHHHH                              (seven heads: one way)
HHHHHHT  HHHHHTH  HHHHTHH  ...       (six heads, one tail)
HHHHHTT  HHHHTHT  HHHTHHT  ...       (five heads, two tails)
...
TTTTTTH  TTTTTHT  TTTTHTT  ...       (one head, six tails)
TTTTTTT                              (seven tails: one way)

If one works out all the possible combinations then one attains:
(two sides)^(7 pennies) = 128 total possibilities
1 way to get seven heads/128 total possible outcomes = 1/128 = 0.0078
7 ways to get six heads and one tail/128 possibilities = 7/128 = 0.0547
21 ways to get five heads and two tails/128 = 21/128 = 0.1641
35 ways to get four heads and three tails/128 = 35/128 = 0.2734
35 ways to get three heads and four tails/128 = 35/128 = 0.2734
21 ways to get two heads and five tails/128 = 21/128 = 0.1641
7 ways to get one head and six tails/128 possibilities = 7/128 = 0.0547
1 way to get seven tails/128 total possible outcomes = 1/128 = 0.0078
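These "ways" counts are the binomial coefficients C(7, k), so the whole theoretic column can be generated rather than tallied by hand (an illustrative sketch using Python's math.comb):

```python
from math import comb

total = 2 ** 7  # 128 equally likely head/tail sequences
for heads in range(7, -1, -1):
    # comb(7, heads): ways to choose which of the 7 pennies land heads
    ways = comb(7, heads)
    print(heads, ways, round(ways / total, 4))
# 7 heads: 1/128 = 0.0078, 6 heads: 7/128 = 0.0547, ...
# symmetric all the way down to 0 heads: 1/128 = 0.0078
```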
If the theoretic relative frequencies (probabilities) are added to our table:
# of heads x   Frequency   Rel Freq P(x)   Theoretic
7              9           0.0105          0.0078
6              112         0.1305          0.0547
5              147         0.1713          0.1641
4              228         0.2657          0.2734
3              195         0.2273          0.2734
2              120         0.1399          0.1641
1              45          0.0524          0.0547
0              2           0.0023          0.0078
Sums:          858         1.00            1.00

If the theoretic relative frequencies are added as a line to our graph, the following graph results:

The gray line represents the shape of the distribution for an infinite number of coin tosses. The shape of the distribution is symmetrical.
If both the number of pennies and the number of tosses are increased, then the graph becomes smoother and increasingly symmetrical. Below is a graph
for tens of thousands of tosses of 21 pennies.


The shape of the smooth curve is called the "normal distribution" in statistics.
08 Sampling Distribution of the Mean
8.1 Distribution of Statistics

As noted in earlier chapters, statistics are the measures of a sample. The measures are used to characterize the sample and to infer measures of the population,
termed parameters.
Parameter

A parameter is a numerical description of a population. Examples include the population mean μ and the population standard deviation σ.
Statistic

A statistic is a numerical description of a sample. Examples include a sample mean x̄ and the sample standard deviation sx.
Good samples are random samples where any member of the population is equally likely to be selected and any sample of any size n is equally likely to be selected.
Consider four samples selected from a population. The samples need not be mutually exclusive as shown; they may include elements of other samples.

The sample means x̄1, x̄2, x̄3, x̄4 can include a smallest sample mean and a largest sample mean. Choosing a number of classes can generate a histogram for the
sample means. The question this chapter answers is whether the shape of the distribution of sample means from a population is any shape or a specific shape.
Sampling Distribution of the Mean

The shape of the distribution of the sample mean is not any possible shape. The shape of the distribution of the sample mean, at least for good random samples
with a sample size larger than 30, is a normal distribution. That is, if you take random samples of 30 or more elements from a population, calculate the sample
mean, and then create a relative frequency distribution for the means, the resulting distribution will be normal.
In the following diagram the underlying data is bimodal and is depicted by the columns with thin outlines. Thirty data elements were sampled forty times and forty
sample means were calculated. A relative frequency histogram of the sample means is plotted with a heavy black outline. Note that though the underlying
distribution is bimodal, the distribution of the forty means is heaped and close to symmetrical. The distribution of the forty sample means is normal.
The center of the distribution of the sample means is, theoretically, the population mean. To put this another, simpler way: the average of the sample averages is the
population mean. Actually, the average of the sample averages approaches the population mean μ as the number of sample averages approaches infinity.
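This behavior can be checked with a short simulation. The sketch below (Python rather than a spreadsheet; the bimodal population here is invented for illustration and does not come from the text) draws forty random samples of thirty elements each and compares the average of the sample averages to the population mean:

```python
import random
import statistics

random.seed(42)

# Hypothetical bimodal population: two humps, centered at 10 and at 30
population = ([random.gauss(10, 2) for _ in range(5000)] +
              [random.gauss(30, 2) for _ in range(5000)])
pop_mean = statistics.mean(population)

# Draw 40 random samples of size n = 30 and record each sample mean
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(40)]

# Although the population is bimodal, the sample means heap up
# close to the population mean
print(round(statistics.mean(sample_means), 1))
print(round(pop_mean, 1))
```

A histogram of `sample_means` would show the heaped, near-symmetric shape the diagram describes, even though the population itself has two humps.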


The sample mean distribution is heap shaped, as in the shape of the normal distribution, and centered on the population mean.
If the sample size is 30 or more, then the sample means are NORMALLY distributed even when the underlying data is NOT normally distributed! If the sample size is
less than 30, then the distribution of the sample means is normal if and only if the underlying data is normally distributed.
The normal distribution of the sample means (averages) allows us to use our normal distribution probabilities to estimate a range for μ. The mean of the sample
means is a point estimate for the population mean μ.
The mean of the sample means can be written as:

μx̄

In this text the above is sometimes written as x̄.

The value of the mean of the sample means x̄ is, for a very large number of samples each of which has a very large sample size, the population mean μ. As a
practical matter we use the mean of a single large sample. How large? The sample size must be at least n = 30 in order for the sample mean (a statistic) to be a good
estimate for the population mean (a parameter). This requirement is necessary to ensure that the distribution of the sample means will be normal even when the
underlying data is not normal. If we are certain the data is normally distributed, then a sample size n of less than 30 is acceptable.
Later in the course we will modify the normal distribution to handle samples of sizes less than 30 for which the distribution of the underlying data is either unknown
or not normal. This modification will be called the student's t-distribution. The student's t-distribution is also heap-shaped.
The normal distribution, and later the student's t-distribution, will be used to quote a range of possible values for a population mean based on a single sample
mean. Knowing that the sample mean comes from a heap-shaped distribution of all possible means, we will center the normal distribution at the sample mean and
then use the area under the curve to estimate the probability (confidence) that we have "captured" the population mean in that range.
8.2 Central Limit Theorem

The Central Limit Theorem is the theory that says "for increasingly large sample sizes n, the sample mean approaches ever closer to the population mean."
Standard Error

The standard error is the standard deviation of the distribution of the sample means:

SE = σ/√(n)
There is one complication: the sample standard deviation of a single sample is not a good estimate of the standard deviation of the sample means. Note that the
distribution of the sample means is NARROWER than the sample in the above example. The shape of the distribution of the sample means is narrower and taller
than the shape of the underlying data. In the diagram, the shape of the underlying data is normal; the taller, narrower distribution is the distribution of all the
sample means for all possible samples.
The standard deviation of a single sample has to be reduced to reflect this. This reduction turns out to be
inversely related to the square root of the sample size. This is not proven here in this text.
The standard deviation of the distribution of the sample means is equal to the actual population standard
deviation σ divided by the square root of n.

The standard deviation divided by the square root of the sample size is called the standard error of the mean.
If σ is known, then the above formula can be used and the distribution of the sample mean is normal.
As a practical matter, since we rarely know the population standard deviation σ, we will use the sample standard deviation sx in class to estimate the standard
deviation of the sample means. This formula will then appear in various permutations in formulas used to estimate a population mean from a sample mean. When
we use the sample standard deviation sx we will use the student's t-distribution. The student's t-distribution looks like a normal distribution. The student's
t-distribution, however, is adjusted to be a more accurate predictor of the range for a population mean. Later we will learn to use the student's t-distribution. Until
that time we will play a little fast and loose and use sample standard deviations to calculate the standard error of the mean.
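As a small illustration of the arithmetic, a Python sketch of the standard error of the mean, using n = 51 and sx = 7.7 from the body fat sample quoted later in the text:

```python
import math

# Standard error of the mean: sample standard deviation over sqrt(n)
sx = 7.7   # sample standard deviation (body fat example, later in text)
n = 51     # sample size
se = sx / math.sqrt(n)
print(round(se, 4))  # prints 1.0782
```

The same result comes from a spreadsheet formula such as `=7.7/SQRT(51)`.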


Another way to think about the standard error is that the standard error captures the "fuzziness" of the mean. The mean is different from individual data points,
individual numbers. The mean is composed of a collection of data values, a sample of data values. Pick a different sample from the
population and you get a different mean. The change in the mean is only a random result; the change in the mean has no meaning. The standard error is a measure of
that fuzziness. In the next chapter that "fuzziness" will be expanded to two standard errors to either side of the mean. Later that "two standard errors" will be
adjusted for small sample sizes. Two standard errors, and the subsequent adjustment to that value of two, are ways of mathematically describing the fuzziness of
the mean.
09 Confidence Intervals
9.1 Inference and Point Estimates

Whenever we use a single statistic to estimate a parameter we refer to the estimate as a point estimate for that parameter. When we use a statistic to estimate a
parameter, the verb used is "to infer." We infer the population parameter from the sample statistic.

Some population parameters cannot be inferred from the statistic. The population size N cannot be inferred from the sample size n. The population minimum,
maximum, and range cannot be inferred from the sample minimum, maximum, and range. Populations are more likely to have single outliers than a smaller random
sample.
The population mode and median usually cannot be inferred from a smaller random sample. There are special circumstances under which a sample mode and
median might be good estimates of a population mode and median; those circumstances are not covered in this class.
The statistic we will focus on is the sample mean x̄. The normal distribution of sample means for many samples taken from a population provides a mathematical
way to calculate a range in which we expect to "capture" the population mean, and to state the level of confidence we have in that range's ability to capture the
population mean μ.
Point Estimate for the population mean μ and Error

The sample mean x̄ is a point estimate for the population mean μ.

The sample mean x̄ for a random sample will not be the exact same value as the true population mean μ.
The error of a point estimate is the magnitude of the estimate minus the actual parameter (where the magnitude is always positive). The error in using x̄ for μ is |x̄ - μ|.
Note that to obtain a positive value we need to use either the absolute value |x̄ - μ| or (x̄ - μ)².
Note that the error of an estimate is the distance of the statistic from the parameter.
Unfortunately, the whole reason we were using the sample mean x̄ to estimate the population mean μ is because we did not know the population mean μ.
For example, given that the mean body fat index (BFI) of 51 male students at the national campus is x̄ = 19.9 with a sample standard deviation of sx = 7.7, what is the
error |x̄ - μ| if μ is the average BFI for male COM-FSM students?
We cannot calculate this. We do not know μ! So we say x̄ is a point estimate for μ. That would make the error equal to (x̄ - x̄)² = zero. This is a silly and meaningless
answer.
Is x̄ really the exact value of μ for all the males at the national campus? No, the sample mean is not going to be the same as the true population mean.
Point estimate for the population standard deviation

The sample standard deviation sx is a reasonable point estimate for the population standard deviation σ. In more advanced statistics classes, concern over bias in
the sample standard deviation as an estimator for the population standard deviation is considered more carefully. In this class, and in many applications of statistics,
the sample standard deviation sx is used as the point estimate for the population standard deviation σ.
9.1.2 Introduction to Confidence Intervals

We might be more accurate if we were to say that the mean μ is somewhere between two values. We could estimate a range for the population mean by going
one standard error below the sample mean and one standard error above the sample mean. Remember, the standard error is σ/√(n). Note that the formula for
the standard error requires knowing the population standard deviation σ. We do not usually know this value. In fact, if we knew σ then we would probably also
know the population mean μ. In section 9.2 we will use the sample standard deviation, sx/√(n), and the student's t-distribution to calculate a range in which we
expect to find the population mean μ.
In the diagram the lower curve represents the distribution of data in a population with a normal distribution.
Remember, distribution simply means the shape of the frequency or relative frequency histogram, now charted
as a continuous line. The narrower and taller line is the distribution of all possible sample means from that
population.
For the population curve (lower, broader) the distance to each inflection point is one standard deviation: σ. For
the sample means (higher, narrower) the distance to each inflection point is one standard error of the mean:
σ/√(n).
The area from minus one standard error to plus one standard error is still 68.2%.
Here is a key point: If I set my estimate for μ to be between x̄ - σ/√(n) and x̄ + σ/√(n), then there is a 0.682 probability that μ will be included in that interval.
The "68.2% probability" is termed "the level of confidence."
Probability note: the reality is that the population mean μ is either inside or outside the range we have calculated. We are right or wrong, 100% or 0%. Thus saying
that there is a 68.2% probability that the population mean has been "captured" by the range is not actually correct. This is the main reason why we shift to calling
the range for the mean a confidence interval. We start saying things such as "I am 68.2 percent confident the mean is in the range quoted." Statisticians assert that
over the course of a lifetime, if one always uses a 68.2% confidence interval, one will be right 68.2% of the time in life. This is small comfort when an individual
experimental result might be very important to you.
95% Confidence Intervals

In many fields of inquiry a common level of confidence used is a 95% level of confidence. For the purposes of this course a 95% confidence interval is often used.

Note that when a confidence interval is not 95%, then specific reference to the chosen confidence level must be stated. Stating the level of confidence is always
good form. While many studies are done at a 95% level of confidence, in some fields higher or lower levels of confidence may be common. Scientific studies often
use 99% or higher levels of confidence.
There is always, however, a chance that one will be wrong. In Florida an election was "called" in favor of candidate Al Gore in the year 2000 in the United States
based on a 99.5% level of confidence. Hours later the news organizations said George Bush had won Florida. A few hours later the news organizations would retract
this second estimate and decide that the race was too close to call. The news organizations decided they had been wrong two times in a row. Eventually a court case
finally settled who had won the state of Florida. Even at half a percent chance of being wrong, one can still be wrong, even two times in a row.
The 95% confidence interval is roughly the sample mean plus and minus two standard errors. If the sample size is large, then plus or minus two standard
errors will produce a reasonable estimate of the 95% confidence interval. If the sample size is small, less than 30, then the confidence interval generated by plus and
minus two standard errors will be too small. The problem is the factor of "two": this has to be adjusted for small sample sizes.
Neglecting the issue of sample size, the following images show a 95% confidence interval in LibreOffice.org, the corresponding min-max chart in Gnumeric, and a
box plot of heart rate data in Gnumeric for comparison purposes. The confidence intervals are based on twice the standard error and do not take the sample size
into account.

The LibreOffice.org chart in the image above is a stock chart. Arrange the data in the order header row, lower bound, upper bound, and the mean, as seen in the image.
Select the header row (data titles), the bounds, and the mean. Use a stock chart. In the second dialog box of the chart wizard switch to Data series in rows. Finish
the chart wizard.


9.2 Confidence intervals for n > 5 using sx

When using the sample standard deviation sx to generate a confidence interval for the population mean, a distribution called the Student's t-distribution is used.
The Student's t-distribution looks like the normal distribution, but the t-distribution changes shape slightly as the sample size n changes: the shape "flattens" as n
decreases. As the sample size decreases, the t-distribution becomes flatter and wider, spreading out the
confidence interval and "pushing" the lower and upper limits away from the center. For n > 30 the Student's t-distribution is almost identical to the normal
distribution. When we sketch the Student's t-distribution we draw the same heap shape with two inflection points.
To use the Student's t-distribution the sample must be a good, random sample. The sample size can be as small as n = 5. For n ≤ 10 the t-distribution will generate
very large ranges for the population mean. The range can be so large that the estimate is without useful meaning. A basic rule in statistics is "the bigger the sample
size, the better."
The spreadsheet function used to find limits from the Student's t-distribution does not calculate the lower and upper limits directly. The function calculates a value
called "t-critical," which is written as tc. t-critical multiplied by the standard error of the mean SE generates the margin of error for the mean E.
Do not confuse the standard error of the mean with the margin of error for the mean. The standard error of the mean is sx/√(n). The margin of error for the mean
(E) is the distance from either end of the confidence interval to the middle of the confidence interval. The margin of error is produced from the standard error:
Margin of Error for the mean = tc*standard error of the mean
Margin of Error for the mean = tc*sx/√(n)
The confidence interval will be:
x̄ - E ≤ μ ≤ x̄ + E
Calculating tc

The t-critical value will be calculated using the spreadsheet function TINV. TINV uses the area in the tails to calculate t-critical. The area under the whole curve is
100%, so the area in the tails is 100% - confidence level c. Remember that in decimal notation 100% is just 1. If the confidence level c is in decimal form, use the
spreadsheet function below to calculate tc:
=TINV(1-c,n-1)
If the confidence level c is entered as a percentage with the percent sign, then make sure the 1 is written as 100%:
=TINV(100%-c%,n-1)
Degrees of Freedom

The TINV function adjusts t-critical for the sample size n. The formula uses n - 1. This n - 1 is termed the "degrees of freedom." For confidence intervals of one
variable the degrees of freedom are n - 1.
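Spreadsheets provide TINV directly. For readers without a spreadsheet at hand, the sketch below reproduces the two-tailed TINV in Python; the integration-plus-bisection approach is my own illustration of what TINV returns, not how any spreadsheet actually computes it:

```python
import math

def t_pdf(t, df):
    """Student's t probability density function."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t) by Simpson-rule integration from 0 to t, plus 0.5 by symmetry."""
    if t < 0:
        return 1 - t_cdf(-t, df, steps)
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def tinv(alpha, df):
    """Two-tailed inverse: the tc with P(|T| > tc) = alpha,
    matching the spreadsheet's =TINV(alpha, df)."""
    target = 1 - alpha / 2      # upper-tail cumulative probability
    lo, hi = 0.0, 100.0
    for _ in range(50):         # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(tinv(1 - 0.95, 10 - 1), 4))  # 2.2622, matching Example 9.2.1
```

The same value is returned by the spreadsheet formula `=TINV(1-0.95,10-1)` used in the examples below.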

Example 9.2.1

Runners run at a very regular and consistent pace. As a result, over a fixed distance a runner should be able to repeat their time consistently. While
individual times over a given distance will vary slightly, the long term average should remain approximately the same. The average should remain
within the 95% confidence interval.
For a sample size of n = 10 runs from the college in Palikir to Kolonia town, a runner has a sample mean x̄ time of 61 minutes with a sample
standard deviation sx of 7 minutes. Construct a 95% confidence interval for my population mean run time.
Step 1: Determine the basic sample statistics
sample size n = 10
sample mean x̄ = 61


[61 is also the point estimate for the population mean μ]


sample standard deviation sx = 7
Step 2: Calculate degrees of freedom, tc, standard error SE
degrees of freedom = 10 - 1 = 9
tc = TINV(1-0.95,10-1) = 2.2622
Standard Error of the mean sx/√(n) = 7/sqrt(10) = 2.2136
Keeping four decimal places in intermediate calculations can help reduce rounding errors. Alternatively use a spreadsheet and cell
references for all calculations.
Step 3: Determine margin of error E
Margin of error E for the mean
= tc*sx/√(n) = 2.2622*7/√10 = 5.01
Given that x̄ - E ≤ μ ≤ x̄ + E, we can substitute the values for x̄ and E to obtain the 95% confidence interval for the population mean μ:
Step 4: Calculate the confidence interval for the mean
61 - 5.01 ≤ μ ≤ 61 + 5.01
55.99 ≤ μ ≤ 66.01
I can be 95% confident that my population mean run time should be between 56 and 66 minutes.
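The four steps can be sketched outside a spreadsheet. The Python fragment below reuses the text's tc value rather than recomputing it:

```python
import math

# Example 9.2.1 redone in Python; tc is taken from the text's
# spreadsheet result =TINV(1-0.95,10-1)
n, xbar, sx = 10, 61, 7
tc = 2.2622
se = sx / math.sqrt(n)   # standard error of the mean
e = tc * se              # margin of error E
print(round(xbar - e, 2), round(xbar + e, 2))  # prints 55.99 66.01
```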

Example 9.2.2

Jumps
102   79   68
 66   61   69
 42   45   79
 22   43   13
 24   10   11
107   17   34
  8   20   58
 26   45   40
111  105  213
On Thursday 08 November 2007 a jump rope contest was held at a local elementary school festival. Contestants jumped with their feet together, a
double-foot jump. The data seen in the table is the number of jumps for twenty-seven female jumpers. Calculate a 95% confidence interval for the
population mean number of jumps.
The sample mean x̄ for the data is 56.22 with a sample standard deviation of 44.65. The sample size n is 27. You should try to make these
calculations yourself. With those three numbers we can proceed to calculate the 95% confidence interval for the population mean μ:
Step 1: Determine the basic sample statistics
sample size n = 27
sample mean x̄ = 56.22
sample standard deviation sx = 44.65
Step 2: Calculate degrees of freedom, tc, standard error SE
The degrees of freedom are n - 1 = 26
Therefore tc = TINV(1-0.95,27-1) = 2.0555
The Standard Error of the mean SE = sx/√27 = 8.5924
Step 3: Determine margin of error E
Therefore the Margin of error for the mean E
= tc*SE = 2.0555*8.5924 = 17.66
The 95% confidence interval for the mean is x̄ - E ≤ μ ≤ x̄ + E
Step 4: Calculate the confidence interval for the mean
56.22 - 17.66 ≤ μ ≤ 56.22 + 17.66
38.56 ≤ μ ≤ 73.88
The population mean μ for the jump rope jumpers is estimated to be between 38.56 and 73.88 jumps.
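A sketch of the same computation starting from the raw jump counts (Python, reusing the text's tc value for 26 degrees of freedom):

```python
import math
import statistics

# Example 9.2.2 from the raw data; tc from =TINV(1-0.95,27-1)
jumps = [102, 79, 68, 66, 61, 69, 42, 45, 79, 22, 43, 13, 24, 10,
         11, 107, 17, 34, 8, 20, 58, 26, 45, 40, 111, 105, 213]
n = len(jumps)
xbar = statistics.mean(jumps)
sx = statistics.stdev(jumps)   # sample standard deviation, about 44.65
tc = 2.0555
e = tc * sx / math.sqrt(n)     # margin of error E
print(round(xbar - e, 2), round(xbar + e, 2))
```

The printed interval agrees with the hand calculation to within rounding of the intermediate values.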

9.3 Confidence intervals for a proportion

In 2003 a staffer at the Marshall Islands department of education noted in a newspaper article that the Marshall Islands public school system was not the weakest in
Micronesia. The staffer noted that the Marshalls were second weakest, commenting that education metrics in the Marshalls outperform those in Chuuk's public schools.
In 2004 fifty students at Marshall Islands High School took the entrance test. Ten students achieved admission to regular college programs. In Chuuk state 7% of the
public high school students gain admission to the regular college programs. If the 95% confidence interval for the Marshall Islands proportion includes 7%, then the
Marshallese students are not academically more capable than the Chuukese students, not statistically significantly so. If the 95% confidence interval does not
include 7%, then the Marshallese students are statistically significantly stronger in their admissions rate.
Finding the 95% confidence interval for a proportion involves estimating the population proportion P. The fifty students at Marshall Islands High School are taken as


a sample. The proportion who gained admission, 10/50 or 20%, is the sample proportion. The population proportion is treated as unknown, and the sample
proportion is used as the point estimate for the population proportion.
Note: In this text the letter p is used for the sample proportion of successes instead of "p hat." A capital P is used to refer to the population proportion.
The letter n refers to the sample size. The letter p is the sample proportion of successes. The letter q is the sample proportion of failures. In the above example n is
50, p is 10/50 or 0.20, and q is 40/50 or 0.80.
Estimating the population proportion P can only be done if the following conditions are met:
np > 5
nq > 5
In the example np = (50)(0.20) = 10, which is > 5, and nq = (50)(0.80) = 40, which is also > 5.
The standard error of a proportion is:
SE = √(pq/n)
For the example above the standard error is:
=sqrt(0.2*0.8/50)
For the calculation of the confidence interval of a proportion, only the standard error calculation is new. The rest of the steps are the same as in the preceding
section.
The standard error for the proportion is 0.0566. The margin of error E is then calculated in much the same way as in the section above, by multiplying tc by the
standard error. tc is still found from the TINV function. The degrees of freedom remain n-1.
The margin of error E is:
E = tc*√(pq/n)
=TINV(1-0.95,50-1)*sqrt((0.2)*(0.8)/50)
The margin of error E is 0.1137
The confidence interval for the population proportion P is:
p - E ≤ P ≤ p + E
0.20 - 0.1137 ≤ P ≤ 0.20 + 0.1137
0.0863 ≤ P ≤ 0.3137
The result is that the expected population proportion for Marshall Islands High School is between 8.6% and 31.4%. The 95% confidence interval does not include the 7%
rate of the Chuuk public high schools. While the college entrance test is not a measure of overall academic capability, there are few common measures that can be
used across the two nations. The result does not contradict the staffer's assertion that MIHS outperformed the Chuuk public high schools. This lack of contradiction
acts as support for the original statement that MIHS outperformed the public high schools of Chuuk in 2004.
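The proportion calculation above can be sketched as (Python, with tc hard-coded from the spreadsheet result of `=TINV(1-0.95,50-1)`):

```python
import math

# 95% confidence interval for the MIHS admission proportion
n = 50
p = 10 / 50                      # sample proportion of successes
q = 1 - p                        # sample proportion of failures
assert n * p > 5 and n * q > 5   # conditions for estimating P
tc = 2.0096                      # =TINV(1-0.95,50-1)
se = math.sqrt(p * q / n)        # standard error of a proportion
e = tc * se                      # margin of error E
print(round(p - e, 4), round(p + e, 4))  # prints 0.0863 0.3137
```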
Homework: In twelve sumo matches Hakuho bested Tochiazuma seven times. What is the 90% confidence interval for the population proportion of wins by Hakuho
over Tochiazuma? Does the interval extend below 50%? A commentator noted that Tochiazuma is not evenly matched. If the interval includes 50%, however, then
we cannot rule out the possibility that the two-win margin is random and that the rikishi (wrestlers) are indeed evenly matched.
Hakuho won that night, upping the ratio to 8 wins to 5 losses against Tochiazuma. Is Hakuho now statistically more likely to win, or could they still be evenly matched at a
confidence level of 90%?
9.4 Deciding on a sample size

Suppose you are designing a study and you have in mind a particular margin of error E you do not want to exceed. You can determine the sample size n you'll need if you have
prior knowledge of the standard deviation sx. How would you know the sample standard deviation in advance of the study? One way is to do a small "pre-study" to
obtain an estimate of the standard deviation. These are often called "pilot studies."
If we have an estimate of the standard deviation, then we can estimate the sample size needed to obtain the desired margin of error E.
Since E = tc*sx/√(n), solving for n yields n = (tc*sx/E)²
Note that this is not a proper mathematical solution because tc is also dependent on n. While many texts use zc from the normal distribution in the formula, we
have not learned to calculate zc.
In the "real world" what often happens is that a result is found to not be statistically significant as the result of an initial study. Statistical significance will be covered
in more detail later. The researchers may have gotten "close" to statistical significance and wish to shrink the confidence interval by increasing the sample size. A
larger sample size means a smaller standard error (√n is in the denominator!) and this in turn yields a smaller margin of error E. The question is how big a sample
would be needed to get a particular margin of error E.
The value for tc from the pilot study can be used to estimate the new sample size n. The resulting sample size n will be slightly overestimated versus the traditional
calculation made with the normal distribution. This overestimate, while slightly unorthodox, provides some assurance that the error E will indeed shrink as much as
needed.
In a study of body fat for 51 male students a sample mean x̄ of 19.9 with a standard deviation of 7.7 was measured. This led to a margin of error E of 2.17 and a
confidence interval 17.73 ≤ μ ≤ 22.07
Suppose we want a margin of error E = 1.0 at a confidence level of 0.95 in this study of male student body fat. We can use the sx from the sample of 51 students to
estimate my necessary sample size:


n = (2.0086*7.7/1)² = 239.19, or 239 students. Thus I estimate that I will need 239 male students to reduce my margin of error E to 1 in my body fat study.
Other texts, which use zc, would obtain the result of 227.77 or 228 students. The eleven additional students would provide assurance that the margin of error E does
fall to 1.0.
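A quick check of the arithmetic (Python; tc is the text's value for 50 degrees of freedom):

```python
# Sample size for a target margin of error E, using tc and sx from
# the 51-student pilot sample as in the text
tc = 2.0086   # =TINV(1-0.95,51-1)
sx = 7.7
E = 1.0
n = (tc * sx / E) ** 2
# about 239 students; the text reports 239.19 using an unrounded tc
print(round(n))
```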
That one can calculate a sample size n necessary to reduce a margin of error E to a particular level means that for any hypothesis test (chapter ten) in which the
means have a mathematical difference, statistical significance can eventually be attained by sufficiently increasing the sample size. This may sound appealing to
the researcher trying to prove a difference exists, but philosophically it leaves open the concept that all things can be proven true for sufficiently large samples.
10 Hypothesis Testing
10.1 Confidence Interval Testing

In this chapter we explore whether a sample has a sample mean x̄ that could have come from a population with a known population mean μ. There are two
possibilities. In Case I below, the sample mean x̄ comes from the population with a known mean μ. In Case II, on the right, the sample mean x̄ does not come from
the population with a known mean μ. For our purposes the population mean μ could be a pre-existing mean, an expected mean, or a mean against which we intend
to run the hypothesis test. In the next chapter we will consider how to compare two samples to each other to see if they come from the same population.

Suppose we want to do a study of whether the female students at the national campus gain body fat with age during their years at COM-FSM. Suppose we already
know that the population mean body fat percentage for the new freshmen females 18 and 19 years old is μ = 25.4.
We measure a sample of size n = 12 female students at the national campus who are 21 years old and older and determine that their sample mean body fat
percentage is x̄ = 30.5 percent with a sample standard deviation of sx = 8.7.
Can we conclude that the female students at the national campus gain body fat as they age during their years at the College?
Not necessarily. Samples taken from a population with a population mean of μ = 25.4 will not necessarily have a sample mean of 25.4. If we take many different
samples from the population, the sample means will distribute normally about the population mean, but each individual sample mean is likely to differ from the
population mean.
In other words, we have to consider the likelihood of drawing a sample that is 30.5 - 25.4 = 5.1 units away from the population mean for a sample size of 12. If
we knew more about the population distribution we would be able to determine the likelihood of a 12 element sample being drawn from the population with a
sample mean 5.1 units away from the actual population mean.
In this case we know more about our sample and the distribution of the sample mean. The distribution of the sample mean follows the student's t-distribution. So
we shift from centering the distribution on the population mean and instead center the distribution on the sample mean. Then we determine whether the confidence
interval includes the population mean or not. We construct a confidence interval for the range of the population mean for the sample.
If this confidence interval includes the known population mean for the 18 to 19 year olds, then we cannot rule out the possibility that our 12 student sample is
from that same population. In this instance we cannot conclude that the women gain body fat.

If the confidence interval does NOT include the known population mean for the 18 to 19 year old students, then we can say that the older students come from a
different population: a population with a higher population mean body fat. In this instance we can conclude that the older women have a different, and probably
higher, body fat level.


One of the decisions we obviously have to make is the level of confidence we will use in the problem. Here we enter a contentious area. The level of confidence we
choose, our level of bravery or temerity, will determine whether or not we conclude that the older females have a different body fat content. For a detailed if
somewhat advanced discussion of this issue see The Fallacy of the Null-Hypothesis Significance Test [http://psy.ed.asu.edu/~classics/Rozeboom/] by William Rozeboom.
In education and the social sciences there is a tradition of using a 95% confidence interval. In some fields three different confidence intervals are reported, typically
a 90%, 95%, and 99% confidence interval. Why not use a 100% confidence interval? The normal and t-distributions are asymptotic to the x-axis. A 100% confidence
interval would run to plus and minus infinity. We can never be 100% confident.
In the above example a 95% confidence interval would be calculated in the following way:
n = 12
x̄ = 30.53
sx = 8.67
c = 0.95
degrees of freedom = 12 - 1 = 11
tc = TINV(1-0.95,11) = 2.20
E = tc*sx/sqrt(12) = 5.51
x̄ - E < μ < x̄ + E
25.02 < μ < 36.04
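The whole test can be sketched as (Python, with tc hard-coded from the spreadsheet result of `=TINV(1-0.95,11)`):

```python
import math

# The body fat hypothesis test as a confidence interval check
n, xbar, sx = 12, 30.53, 8.67
mu = 25.4                        # known population mean, 18-19 year olds
tc = 2.2010                      # =TINV(1-0.95,11)
e = tc * sx / math.sqrt(n)       # margin of error E
lower, upper = xbar - e, xbar + e
print(round(lower, 2), round(upper, 2))  # prints 25.02 36.04
print("fail to reject H0" if lower <= mu <= upper else "reject H0")
```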

The 95% confidence interval for our n = 12 sample includes the population mean 25.4. We CANNOT conclude at the 95% confidence level that this sample DID NOT
come from a population with a population mean of 25.4.
Another way of thinking of this is to say that 30.5 is not sufficiently separated from 25.4 for the difference to be statistically significant at a confidence level of 95% in
the above example.
In common language, the women are not gaining body fat.
The above process is reduced to a formulaic structure in hypothesis testing. Hypothesis testing is the process of determining whether a confidence interval includes
a previously known population mean value. If the population mean value is included, then we do not have a statistically significant result. If the mean is not
encompassed by the confidence interval, then we have a statistically significant result to report.
Homework
If I expand my study of female students 21+ to n = 24 and find a sample mean x̄ = 28.7 and an sx = 7, is the new sample mean statistically significantly different from a
population mean of 25.4 at a confidence level of c = 0.90?
10.2 Hypothesis Testing
The null hypothesis H0

The null hypothesis is the supposition that there is no change in a value from some pre-existing, historical, or expected value. The null hypothesis literally supposes
that the change is null, non-existent: there is no change.
In the previous example the null hypothesis would have been H0: μ = 25.4
The alternate hypothesis H1

The alternate hypothesis [http://www.statistics.com/content/glossary/althypo.html] is the supposition that there is a change in the value from some pre-existing, historical, or
expected value. Note that the alternate hypothesis does NOT say the "new" value is the correct value, just that whatever the mean might be, it is not that given by
the null hypothesis.
H1: μ ≠ 25.4
Statistical hypothesis testing


We run hypothesis test to determine if new data conrms or rejects the null hypothesis.
If the new data falls within the condence interval, then the new data does not contradict the null hypothesis. In this instance we say that "we fail to reject the null
hypothesis." Note that we do not actually arm the null hypothesis. This is really lile more than semanc shenanigans that stascians use to protect their
derriers. Although we run around saying we failed to reject the null hypothesis, in pracce it means we le the null hypothesis standing: we de facto accepted the
null hypothesis.
If the new data falls outside the condence interval, then the new data would cause us to reject the null hypothesis. In this instance we say "we reject the null
hypothesis." Note that we never say that we accept the alternate hypothesis. Accepng the alternate hypothesis would be asserng that the populaon mean is the
sample mean value. The test does not prove this, it only shows that the sample could not have the populaon mean given in the null hypothesis.
For two-tailed tests, the results are identical to a confidence interval test. Note that a confidence interval never asserts the exact population mean, only the range of possible means. Hypothesis testing theory is built on confidence interval theory. The confidence interval does not prove a particular value for the population mean μ, so neither can hypothesis testing.
In our example above we failed to reject the null hypothesis H0 that the population mean for the older students was 25.4, the same population mean as the younger students.
In the example above a 95% confidence interval was used. At this point in your statistical development and this course you can think of this as a 5% chance we have reached the wrong conclusion.
Imagine that the 18 to 19 year old students had a body fat percentage of 24 in the previous example. We would have rejected the null hypothesis and said that the older students have a different and probably larger body fat percentage.

There is, however, a small probability (less than 5%) that a 12 element sample with a mean of 30.5 and a standard deviation of 8.7 could come from a population with a population mean of 24. This risk of rejecting the null hypothesis when we should not reject it is called alpha α. Alpha is 1 minus the confidence level, or α = 1 − c. In hypothesis testing we use α instead of the confidence level c.
Suppose        We accept H0 as true                    We reject H0 as false
H0 is true     Correct decision. Probability: 1 − α    Type I error. Probability: α
H0 is false    Type II error. Probability: β           Correct decision. Probability: 1 − β
Hypothesis testing seeks to control alpha α. We cannot determine β (beta) with the statistical tools you learn in this course.
Alpha α is called the level of significance. 1 − β is called the "power" of the test.
The regions beyond the confidence interval are called the "tails" or critical regions of the test. In the above example there are two tails, each with an area of 0.025. Alpha α = 0.05.
A type I error, the risk of which is characterized by alpha α, is also known as a false positive. A type I error is finding that a change has happened, finding that a difference is significant, when it is not.
A type II error, the risk of which is characterized by beta β, is also known as a false negative. A type II error is the failure to find that a change has happened, finding that a difference is not significant, when it is.
If you increase the confidence level c, then alpha α decreases and beta β increases. High levels of confidence in a result, small alpha values, small risks of a type I error, lead to higher risks of committing a type II error. Thus in hypothesis testing there is a tendency to use an alpha of 0.05 or 0.01 as a compromise that keeps the risk of committing a type II error in check.
Another take on type I and type II errors:

Source information: Jim Thornton [https://twitter.com/jimgthornton] via Flowing Data [http://flowingdata.com/2014/05/09/type-i-and-ii-errors-simplified/]


For hypothesis testing it is simply safest to always use the t-distribution. In the example further below we will run a two-tail test.
Steps


1. Write down H0, the null hypothesis
2. Write down H1, the alternate hypothesis
3. If not given, decide on a level of risk of rejecting a true null hypothesis H0 by choosing an alpha α.
4. Determine the t-critical values from TINV(α, df).
5. Determine the t-statistic from:
   t = (x̄ − μ) / (sx / √n)
6. Make a sketch
7. If the t-statistic is "beyond" the t-critical values then reject the null hypothesis. By "beyond" we mean larger in absolute value. Otherwise we fail to reject the null hypothesis.
Put another way, if the absolute value of the t-statistic is larger than t-critical, then the result is statistically significant and we reject the null hypothesis.
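The steps above can be sketched as a short calculation, shown here in Python rather than a spreadsheet. The TINV function is not reproduced; the t-critical value is assumed to have been read from the spreadsheet's =TINV(α, df) and is passed in as a plain number.

```python
from math import sqrt

def t_statistic(xbar, mu, sx, n):
    """Step 5: t = (sample mean - population mean) / (sx / sqrt(n))."""
    return (xbar - mu) / (sx / sqrt(n))

def decision(t, t_critical):
    """Step 7: reject H0 only when |t| is beyond t-critical (larger in absolute value)."""
    return "reject H0" if abs(t) > t_critical else "fail to reject H0"

# Body fat data from the example below: x-bar = 30.53, mu = 25.4, sx = 8.67, n = 12
t = t_statistic(30.53, 25.4, 8.67, 12)
print(round(t, 2))            # ≈ 2.05
print(decision(t, 2.20))      # t-critical taken from =TINV(0.05, 11)
```

Note that only the decision rule is coded; the sketch in step 6 still has to be drawn by hand.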

Example 10.2.1

Using the data from the first section of these notes:

1. H0: μ = 25.4
2. H1: μ ≠ 25.4
3. Alpha α = 0.05 (α = 1 − c, c = 0.95)
4. Determine the t-critical values: degrees of freedom: n − 1 = 12 − 1 = 11; tc = TINV(α, df) = TINV(0.05, 11) = 2.20
5. Determine the t-statistic:
   t = (x̄ − μ) / (sx / √n) = (30.53 − 25.4)/(8.67/sqrt(12)) = 2.05
6. Make a sketch:
7. The t-statistic t is NOT "beyond" the t-critical values. We FAIL to reject the null hypothesis H0. We cannot say the older female students came from a different population than the younger students with a population mean of 25.4. Why not now accept H0: μ = 25.4 as the population mean for the 21 year old and older female students? We risk making a Type II error: failing to reject a false null hypothesis. We are not trying to prove H0 as being correct, we are only in the business of trying to "knock it down."
More simply, the t-statistic is NOT bigger in absolute value than t-critical.
More simply, the t-stasc is NOT bigger in absolute value than t-crical.
Note the changes in the above sketch from the confidence interval work. Now the distribution is centered on μ with the distribution curve described by a t-distribution with eleven degrees of freedom. In our confidence interval work we centered our t-distribution on the sample mean. The result is, however, the same due to the symmetry of the problems and the curve. If our distribution were not symmetric we could not perform this sleight of hand.
The hypothesis test process reduces decision making to the question, "Is the t-statistic t greater than the t-critical value tc?" If t > tc, then we reject the null hypothesis. If t < tc, then we fail to reject the null hypothesis. Note that t and tc are irrational numbers and thus unlikely to ever be exactly equal.

Another example

I have a previously known population mean running pace of 6'09" (6.15). In 2001 I've been too busy to run regularly. On my five most recent runs I've averaged a 6'23" (6.38) pace with a standard deviation of 1'00". At an alpha α = 0.05, am I really running differently this year?
H0: μ = 6.15
H1: μ ≠ 6.15
Pay close attention to the above! We DO NOT write H1: μ = 6.23. This is a common beginning mistake.
1. H0: μ = 6.15
2. H1: μ ≠ 6.15
3. Alpha α = 0.05 (α = 1 − c, c = 0.95)
4. Determine the t-critical values: degrees of freedom: n − 1 = 5 − 1 = 4; tc = TINV(α, df) = TINV(0.05, 4) = 2.78
5. Determine the t-statistic:
   t = (x̄ − μ) / (sx / √n) = (6.38 − 6.15)/(1.00/sqrt(5)) = 0.51
6. Make a sketch:


7. The t-statistic t is NOT "beyond" the t-critical values. We FAIL to reject the null hypothesis H0.
Note that in my sketch I am centering my distribution on the population mean and looking at the distribution of sample means for sample sizes of 5 based on that population mean. Then I look at where my actual sample mean falls with respect to that distribution.
Note that my t-statistic t does not fall "beyond" the critical values. I do not have enough separation from my population mean: I cannot reject H0. So I fail to reject H0. I am not performing differently than last year. The implication is that I am not slower.

10.3 P-value

Return to our first example in these notes, where the body fat percentage of 12 female students 21 years old and older was x̄ = 30.53 with a standard deviation sx = 8.67, tested against a null hypothesis H0 that the population mean body fat for 18 to 19 year old students was μ = 25.4. We failed to reject the null hypothesis at an alpha of 0.05. What if we are willing to take a larger risk? What if we are willing to risk a type I error rate of 10%? This would be an alpha α of 0.10.
1. H0: μ = 25.4
2. H1: μ ≠ 25.4
3. Alpha α = 0.10 (α = 1 − c, c = 0.90)
4. Determine the t-critical values: degrees of freedom: n − 1 = 12 − 1 = 11; tc = TINV(α, df) = TINV(0.10, 11) = 1.796
5. Determine the t-statistic:
   t = (x̄ − μ) / (sx / √n) = (30.53 − 25.4)/(8.67/sqrt(12)) = 2.05
6. Make a sketch:

7. The t-statistic is "beyond" the t-critical value. We reject the null hypothesis H0. We can say the older female students came from a different population than the younger students with a population mean of 25.4. Why not now accept an H1: μ = 30.53 as the population mean for the 21 year old and older female students? We do not actually know the population mean for the 21+ year old female students unless we measure ALL of the 21+ year old students.
With an alpha of 0.10 (a confidence level c of 0.90) our results are statistically significant. These same results were NOT statistically significant at an alpha of 0.05. So which is correct?
We FAIL to reject H0 because the t-statistic based on x̄ = 30.53, μ = 25.4, sx = 8.67 is NOT beyond the critical value for alpha α = 0.05, OR
We reject H0 because the t-statistic based on x̄ = 30.53, μ = 25.4, sx = 8.67 is beyond the critical value for alpha α = 0.10.
Note how we would have said this in confidence interval language:
We FAIL to reject H0 because μ = 25.4 is within the 95% confidence interval for x̄ = 30.53, sx = 8.67, OR
We reject H0 because μ = 25.4 is NOT within the 90% confidence interval for x̄ = 30.53, sx = 8.67.
The answer is that it depends on how much risk you are willing to take: a 5% chance of committing a Type I error (rejecting a null hypothesis that is true) or a larger 10% chance of committing a Type I error. The result depends on your own personal level of aversion to risk. That's a heck of a mathematical mess: the answer depends on your personal willingness to take a particular risk.
Consider what happens if someone decides they only want to be wrong 1 in 15 times: that corresponds to an alpha of α = 0.067. They cannot use either of the above examples to decide whether to reject the null hypothesis. We need a system to indicate the boundary at which alpha changes from failure to reject the null hypothesis to rejection of the null hypothesis.
Consider what it would mean if t-critical were equal to the t-statistic. The alpha at which t-critical equals the t-statistic would be that boundary value for alpha α. We will call that boundary value the p-value.
The p-value is the alpha α for which TINV(α, df) = (x̄ − μ) / (sx / √n).
But how to solve for α?


The solution is to calculate the area in the tails under the t-distribution using the TDIST function. The p-value is calculated using the formula:
=TDIST(ABS(t),degrees of freedom,number of tails)
For a single variable sample and a two-tailed distribution, the spreadsheet equation becomes:
=TDIST(ABS(t),n−1,2)
The degrees of freedom are n − 1 for comparison of a sample mean to a known or pre-existing population mean μ.
Note that TDIST can only handle positive values for the t-statistic, hence the absolute value function.
p-value = TDIST(ABS(2.05),11,2) = 0.06501
The p-value represents the SMALLEST alpha for which the test is deemed "statistically significant" or, perhaps, "worthy of note."
The p-value is the SMALLEST alpha for which we reject the null hypothesis.
Thus for all alpha greater than 0.065 we reject the null hypothesis. The "one in fifteen" person would reject the null hypothesis (0.0667 > 0.065). The alpha = 0.05 person would not reject the null hypothesis.
If the pre-chosen alpha is more than the p-value, then we reject the null hypothesis. If the pre-chosen alpha is less than the p-value, then we fail to reject the null hypothesis.
The p-value lets each person decide on their own level of risk and removes the arbitrariness of personal risk choices.
Because many studies in education and the social sciences are done at an alpha of 0.05, a p-value at or below 0.05 is used to reject the null hypothesis.
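What TDIST computes can be checked from first principles. The sketch below is a from-scratch numerical check, not how spreadsheets actually implement TDIST: it integrates the t-distribution density over both tails beyond |t| with the trapezoidal rule, reproducing the p-value of roughly 0.065 for t = 2.05 with 11 degrees of freedom.

```python
from math import gamma, sqrt, pi

def two_tail_p(t, df, steps=20000, upper=60.0):
    """Area under the t-distribution density in both tails beyond |t|:
    the quantity the spreadsheet returns for =TDIST(ABS(t), df, 2)."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    a = abs(t)
    h = (upper - a) / steps
    # trapezoidal rule over [|t|, upper]; the density beyond x = 60 is negligible
    area = 0.5 * (pdf(a) + pdf(upper)) + sum(pdf(a + i * h) for i in range(1, steps))
    return 2 * area * h

print(round(two_tail_p(2.05, 11), 3))   # ≈ 0.065, matching TDIST(ABS(2.05), 11, 2)
```

The same function reproduces the other TDIST results in this chapter, for example roughly 0.64 for t = 0.51 with 4 degrees of freedom.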
1 − p-value is the confidence interval for which the new value does not include the pre-existing population mean. Another way to say this is that 1 − p-value is the maximum confidence level c we can have that the difference (change) is significant. We usually look for a maximum confidence level c of 0.95 (95%) or higher.
The p-value is often misunderstood [http://news.sciencemag.org/2009/10/mission-improbable-concise-and-precise-definition-p-value] and misinterpreted [http://www.statisticsdonewrong.com]. The p-value should be thought of as a measure of whether one should be surprised by a result. If the p-value is less than a pre-chosen alpha, usually 0.05, that would be a surprising result. If the p-value is greater than the pre-chosen alpha, usually 0.05, then that would NOT be a surprising result.
The p-value is also a much abused concept. In March 2016 the American Statistical Association issued the following six principles, which address misconceptions and misuse of the p-value:
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
American Statistical Association (ASA) statement on statistical significance and P-Values [http://www.amstat.org/newsroom/pressreleases/P-ValueStatement.pdf]. See also Statisticians Found One Thing They Can Agree On: It's Time To Stop Misusing P-Values [http://fivethirtyeight.com/features/statisticians-found-one-thing-they-can-agree-on-its-time-to-stop-misusing-p-values/] and The mismeasure of scientific significance [http://www.stats.org/mismeasure-scientific-significance/]. The full ASA manuscript is at The ASA's statement on p-values: context, process, and purpose [http://amstat.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108].
The American Statistical Association settled on the following informal definition of the p-value: "Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value."
10.4 One Tailed Tests

All of the work above in confidence intervals and hypothesis testing has been with two-tailed confidence intervals and two-tailed hypothesis tests. There are statisticians who feel one should never leave the realm of two-tailed intervals and tests.
Unfortunately, the practice by scientists, business, educators, and many of the fields in social science is to use one-tailed tests when one is fairly certain that the sample has changed in a particular direction. The effect of moving to a one-tailed test is to increase one's risk of committing a Type I error.
One-tailed tests, however, are popular with researchers because they increase the probability of rejecting the null hypothesis (which is what most researchers are hoping to do).
The complication is that starting with a one-tailed test presumes that a change, that is, ANY change in ANY direction, has occurred. The proper way to use a one-tailed test is to first do a two-tailed test for change in any direction. If change has occurred, then one can do a one-tailed test in the direction of the change.


Returning to the earlier example of whether I am running slower, suppose I decide I want to test to see if I am not just performing differently (≠), but am actually slower (>, a larger pace number in minutes). I can do a one-tail test at the 95% confidence level. Here alpha α will again be 0.05. In order to put all of the area into one tail I will have to use the spreadsheet function TINV(2*α,df).
H0: μ = 6.15
H1: μ > 6.15
μ = 6.15
x̄ = 6.38
sx = 1.00
n = 5
degrees of freedom (df) = 4
tc = TINV(2*α,df) = TINV(2*0.05,4) = 2.13
t-statistic t = (x̄ − μ) / (sx / √n) = (6.38 − 6.15)/(1.00/sqrt(5)) = 0.51

Note that the t-statistic calculation is unaffected by this change in the problem.

Note that my t-statistic would have to exceed only 2.13 instead of 2.78 in order to achieve statistical significance. Still, 0.51 is not beyond 2.13, so I still DO NOT reject the null hypothesis. I am not really slower, not based on this data.
Thus one-tailed tests are identical to two-tailed tests except that the formula for tc is TINV(2*α,df) and the formula for the p-value is =TDIST(ABS(t),n−1,1).
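Putting the two critical values side by side makes the one-tail versus two-tail difference concrete. A quick Python check for the running pace example; the critical values are taken from the text's TINV results, not recomputed:

```python
from math import sqrt

# Running pace example: x-bar = 6.38, mu = 6.15, sx = 1.00, n = 5
t = (6.38 - 6.15) / (1.00 / sqrt(5))   # the t-statistic is unchanged, ≈ 0.51

t_crit_two_tail = 2.78    # =TINV(0.05, 4): alpha split between two tails
t_crit_one_tail = 2.13    # =TINV(2*0.05, 4): all of alpha in one tail

# Either way 0.51 is not beyond the critical value: fail to reject H0
print(round(t, 2), t > t_crit_one_tail, t > t_crit_two_tail)
```

The one-tail test lowers the bar from 2.78 to 2.13, but the decision here is the same.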
Suppose we decide that the 30.53 body fat percentage for females 21+ at the college definitely represents an increase. We could opt to run a one-tailed test at an alpha α of 0.05.
1. H0: μ = 25.4
2. H1: μ > 25.4
3. Alpha α = 0.05 (α = 1 − c, c = 0.95)
4. Determine the t-critical value: degrees of freedom: n − 1 = 12 − 1 = 11; tc = TINV(2*α, df) = TINV(2*0.05, 11) = 1.796
5. Determine the t-statistic:
   t = (x̄ − μ) / (sx / √n) = (30.53 − 25.4)/(8.67/sqrt(12)) = 2.05

6. Make a sketch:
7. The t-statistic is "beyond" the t-critical value. We reject the null hypothesis H0. We can say the older female students came from a different population than the younger students with a population mean of 25.4. Why not now accept an H1: μ = 30.53 as the population mean for the 21 year old and older female students? We do not actually know the population mean for the 21+ year old female students unless we measure ALL of the 21+ year old students.
8. The p-value is =TDIST(ABS(2.05),11,1) = 0.033
This result should look familiar: it is the result of the two-tail test at alpha α = 0.10, only now we are claiming we have halved the Type I error rate (α) to 0.05. Some statisticians object to this, saying we are attempting to artificially reduce our Type I error rate by pre-deciding the direction of the change. Either that or we are making a post-hoc decision based on the experimental results. Either way we are allowing assumptions into an otherwise mathematical process. Allowing personal decisions into the process, including those involving α, always involves some controversy in the field of statistics.
For comparison, the two-tailed p-value for the running pace example was =TDIST(0.51,4,2) = 0.64
10.5 Hypothesis test for a proportion

For a sample proporon p and a known or pre-exisng populaon proporon P, a hypothesis can be done to determine if the sample with a sample proporon p
could have come from a populaon with a proporon P. Note that in this text, due to typeseng issues, a lower-case p is used for the sample proporon while an
upper case P is used for the populaon proporon.
In another departure from other texts, this text uses the student's t-distribuon for t c providing a more conservave determinaon of whether a change is
signicant in smaller samples sizes. Rather than label the test stasc as a z-stasc, to avoid confusing the students and to conform to usage in earlier secons the

47 de 56

test stasc is referred to as a t-stasc.


A survey of college students found 18 of 32 had sexual intercourse. An April 2007 study of abstinence education programs [http://www.mathematica-mpr.com/publications/PDFs/impactabstinence.pdf] in the United States reported that 51% of the youth, primarily students, surveyed had sexual intercourse. Is the proportion of sexually active students in the college different from that reported in the abstinence education program study at a confidence level of 95%?
The null and alternate hypotheses are written using the population proportion, in this case the value reported in the study.
H0: P = 0.51
H1: P ≠ 0.51
sample proportion p = 18/32 = 0.5625
sample proportion q = 1 − p = 0.4375
Note that n*p must be > 5 and n*q must also be > 5, just as was the case in constructing a confidence interval.
Confidence level c = 0.95
The t-critical value is still calculated using alpha α along with the degrees of freedom:
=TINV(0.05,32−1)
=2.04
The only "new" calculation is the t-statistic t:
t = (p − P) / √(p·q/n)
Note that the form is still (sample statistic − population parameter)/standard error for the statistic.
=(0.5625−0.51)/SQRT(0.5625*0.4375/32)
=0.599
The t-statistic t does not exceed the t-critical value, so the difference is not statistically significant. We fail to reject the null hypothesis of no change.
The p-value is calculated as above using the absolute value of the t-statistic.
=TDIST(ABS(0.599),32-1,2)
=0.55
The maximum level of confidence c we can have that this difference is significant is only 45%, far too low to say there is a difference.
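The proportion test mirrors the single-mean test, with only the standard error formula changing. A minimal Python sketch of the calculation above; the t-critical value 2.04 is taken from the text's =TINV(0.05, 31):

```python
from math import sqrt

def prop_t_statistic(p, P, n):
    """(sample proportion - population proportion) / standard error sqrt(p*q/n)."""
    return (p - P) / sqrt(p * (1 - p) / n)

t = prop_t_statistic(18 / 32, 0.51, 32)
print(round(t, 3))       # ≈ 0.599
print(abs(t) > 2.04)     # False: not beyond t-critical, fail to reject H0
```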
11 Testing the two sample means with the t-test
11.1 Paired differences: Dependent samples

Many studies invesgate systems where there are measurements taken before and aer. Usually there is an experimental treatment or process between the two
measurements. A typical such system would be a pre-test and a post-test. Inbetween the pre-test and the post-test would typically be an educaonal or training
event. One could examine each student's score on the pre-test and the post-test. Even if everyone did beer on the post-test, one would have to prove that the
dierence was stascally signicant and not just a random event.
These studies are called "paired t-tests" or "inferences from matched pairs". Each element in the sample is considered as a pair of scores. The null hypothesis would
be that the average dierence for all the pairs is zero: there is no dierence. For a condence interval test, the condence interval for the mean dierences would
include zero if there is no stascally signicant dierence.

If the difference for each data pair is referred to as d, then the mean difference could be written x̄d. The hypothesis test is whether this mean difference x̄d could come from a population with a mean difference μd equal to zero (the null hypothesis). If the mean difference x̄d could not come from a population with a mean difference μd equal to zero, then the change is statistically significant. In the diagram above the mean difference x̄d is equal to x̄after − x̄before.
Confidence interval test

Consider the paired data below. The rst column are female body fat measurements from the beginning of a term. The second column are the body fat
measurements sixteen weeks later. The third column is the dierence d for each pair.
Body fat before   Body fat after   Body fat difference d
23.5              20.8             -2.7
28.9              27.5             -1.4
29.2              28.4             -0.8
24.7              24.1             -0.6
26.4              26.1             -0.3
23.7              24.0              0.3
46.9              47.2              0.3
23.6              24.0              0.4
26.4              27.1              0.7
15.9              17.0              1.1
30.3              31.5              1.2
28.0              29.3              1.3
36.2              37.6              1.4
31.3              32.8              1.5
31.5              33.2              1.7
26.7              28.6              1.9
26.5              29.0              2.5

The confidence interval is calculated on the differences d (third column above) using the sample size n, sample mean difference x̄d, and the sample standard deviation of the difference data sd. The following table includes calculations using a 95% confidence interval.
Count of the differences n: 17
Sample mean difference x̄d: 0.50
Standard deviation of the difference data sd: 1.33
Standard error for the mean of the difference data: 0.32
tc for confidence level c = 0.95: 2.12
Margin of error E for the mean: 0.68
Lower bound for the 95% confidence interval: -0.18
Upper bound for the 95% confidence interval: 1.18

The 95% confidence interval includes a possible population mean of zero. The population mean difference μd could be equal to zero.
This means that "no change" is a possible population mean. To use the double negative, we cannot rule out the possibility of no change. We fail to reject the null hypothesis of no change. The women have not statistically significantly gained body fat over the sixteen weeks of the term.
Hypothesis test

Spreadsheets provide a function to calculate the p-value for paired data using the student's t-distribution. This function is the TTEST function. If the p-value is less than your chosen risk of a type I error, then the difference is significant. The function does not require generating the difference column d as seen above; only the original data is used in this function.
The function takes as inputs the before data (data_range_pre), the after data (data_range_post), the number of tails, and a final variable that specifies the type of test. A paired t-test is test type number one.
=TTEST(data_range_pre,data_range_post,2,1)
To ensure that the spreadsheet calculates the p-value correctly, delete any data missing the pre or post value in the pair. Data missing a pre or post value is not paired data!
Note too that while many paired t-tests for a difference of sample means involve pre and post data, there are situations in which the paired data is not pre and post in terms of time.
The smallest alpha for which we could say the difference is statistically significant is the p-value; equivalently, 1 − p-value is the maximum confidence level c. That said, alpha should be chosen prior to running the hypothesis test.
p-value: 0.14
Maximum confidence level c: 0.86


The p-value confirms the confidence interval analysis: we fail to reject the null hypothesis. At a 5% risk of a type I error we would fail to reject the null hypothesis. We can have a maximum confidence of only 86%, not the 95% standard typically employed. Some would argue that our concern for limiting the risk of rejecting a true null hypothesis (a type I error) has led to a higher risk of failing to reject a false null hypothesis (a type II error). Some would argue that because of other known factors, the high rates of diabetes, high blood pressure, heart disease, and other non-communicable diseases, one should accept a higher risk of a type I error. The average shows an increase in body fat. Given the short time frame (a single term), some might argue for reacting to this number and intervening to reduce body fat. They would argue that given other information about this population's propensity towards obesity, 86% is "good enough" to show a developing problem. Ultimately these debates cannot be resolved by statisticians.
The TTEST function allows one to calculate the p-value directly from two samples. One does not even have to calculate the means in order to use the TTEST function.
If one has chosen to use an alpha of 5%, then a p-value of less than 0.05 indicates that the means are statistically significantly different, and we would reject the null hypothesis of no difference between the means. The means are not statistically equal.
If the p-value is larger than 0.05, then the means are not statistically significantly different, and we would fail to reject the null hypothesis. The means are statistically equal.
11.2 T-test for means for independent samples

One of the more common situations is when one is seeking to compare two independent samples to determine if the means for each sample are statistically significantly different. In this case the samples may differ in sample size n, sample mean, and sample standard deviation.
In this text the two samples are referred to as the x1 data and the x2 data. The use of the same variable, x, reflects that a comparison of sample means is a comparison between two samples of the same variable. The test is to see whether the two samples could both come from the same population X. The sample size for the x1 data is nx1. The sample mean for the x1 data is x̄1. The sample standard deviation for the x1 data is sx1. For the x2 data, the sample size is nx2, the sample mean is x̄2, and the sample standard deviation is sx2.

Two possibilities exist. Either the two samples come from the same population and the population mean difference is statistically zero, or the two samples come from different populations where the population mean difference is statistically not zero.
Note the sample mean tests are predicated on the two samples coming from populations X1 and X2 with population standard deviations σ1 = σ2, where the capital letters refer to the populations from which the x1 and x2 samples were drawn respectively. "Fortunately it can usually be assumed in practice that since we most often wish to test the hypothesis that μ1 = μ2, it is rather unlikely that the two distributions should have the same means but different variances" (where the variance is the square of the standard deviation). [M. G. Bulmer, Principles of Statistics (Dover Books on Mathematics), Dover Publications (April 26, 2012)]. That said, knowledge of the system being studied and an understanding of the population distribution would be important to a formal analysis. In this introductory text the focus is on basic tools and operations, providing a foundation on which to potentially build a more nuanced understanding of statistics.
11.21 Confidence Interval tests

When working with two independent samples, tesng for a dierence of means can also be explored using condence intervals for each sample. Condence
intervals for each sample provide more informaon than a p-value and the declaraon of a signicant dierence is more conservave. Condence intervals for each
sample cannot sort out the indeterminate case where the intervals overlap each other but not the other sample mean. The following diagrams show three dierent
possible relaonships between the condence intervals and the mean. There are more possibilies, these are meant only as samples for guidance. Sample one has a
sample mean x1, sample two has a sample mean x2.


[Figure: three diagrams of 95% confidence intervals for the means x̄1 and x̄2, using the t-distribution at a significance level alpha of 0.05. The three cases: no significant difference in the means; significant difference in the means; indeterminate difference in the means: run a t-test.]
11.22 Confidence interval for the mean difference

The following is another condence interval approach to determining whether two samples have dierent means. Where the approach above charts the condence
intervals separately, this approach looks at whether the condence interval for the dierence in the means could include a populaon mean dierence of zero. Note
that this approach would not lead to proving that the populaon mean dierence is zero. That is not being proved, a populaon mean dierence of zero is taken as
a given by the null hypothesis. The test is whether that null model can be rejected, whether the null model is false, not whether the null model is true.
Each sample has a range of probable values for their populaon mean . If the condence interval for the sample mean dierences includes zero, then there is no
stascally signicant dierence in the means between the two samples. If the condence interval does not include zero, then the dierence in the means is
stascally signicant.
Note that the margin of error E for the mean dierence is sll tc mulplied by the standard error. The standard error formula changes to account for the dierences
in sample size and standard deviaon.
standard error SE = √( (sx1²/nx1) + (sx2²/nx2) )
Thus the margin of error E can be calculated using:
margin of error E = tc · √( (sx1²/nx1) + (sx2²/nx2) )
For the degrees of freedom in the t-critical tc calculation, use n − 1 for the sample with the smaller size. This produces a conservative estimate of the degrees of freedom. Advanced statistical software uses another, more complex formula to determine the degrees of freedom.
The confidence interval is calculated from:
(x̄1 − x̄2) − E < μ1 − μ2 < (x̄1 − x̄2) + E
Where x̄1 is the sample mean of one data set and x̄2 is the sample mean of the other data set. Some texts use the symbol x̄d for this difference and μd for the hypothesized difference in the population means. This leads to the more familiar looking formulation:
x̄d − E < μd < x̄d + E
Where:
μd = μ1 − μ2 and
x̄d = x̄1 − x̄2
Remember, μ1 and μ2 are not known. These are left as symbols. After calculating the interval, check to see if the confidence interval includes zero. If zero is inside the interval, then the sample means are not significantly different and we fail to reject the null hypothesis.
The following table uses a local business example. Data was recorded as to how many cups of sakau were consumed per customer in a single night at two sakau markets on Pohnpei. The variable is the number of cups of sakau consumed per customer per night. Each column is measuring the same variable. Here on Pohnpei the implication is that the lower the mean (average), the stronger the sakau. Even if there is a difference in the mean, that difference is not necessarily significant. Statistical tests can help determine whether a difference is significant.
Song mahs (x1) Rush Hour (x2)
2
2
3

10

1.5


3
5.5
3.5

4.5

7.5

5.5

2.5

4.5

5.5

10

2
4
5
5
5.5
15
14
2
2
4

Sample statistics   Song mahs (x1)   Rush Hour (x2)
sample size n       16               25
sample mean x̄       3.44             5.6
sample stdev sx     1.77             3.73

Confidence interval statistics


standard error
0.87
t-crical tc

2.13

margin of error E

1.85

dierence of means

-2.16

lower bound ci

-4.01

upper bound ci

-0.31

Note that 15 was used for the degrees of freedom in the t-critical calculation. Sixteen is the sample size of the smaller sample.

Note that the confidence interval does not include zero. The confidence interval indicates that whatever the population mean difference μd might be, the population mean difference μd cannot be zero. This means that the sample means are statistically significantly different. We would reject a null hypothesis of no difference between the two markets. The implication is that Song Mahs is stronger than Rush Hour, at least on these two nights. Bear in mind that while the difference in the sample means is significant for the chosen risk of a type I error alpha, the difference may or may not be important. Whether a difference is a small, medium, or large difference - how "important" the difference might be - cannot be determined from a hypothesis test alone. Effect size will need to be considered, and an understanding of the nature of the system that generated the data is also required. For a sakau drinker paying by the cup on a tight budget, six cups are twice as expensive as three cups.
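The arithmetic in the confidence interval table can also be checked outside a spreadsheet. Below is a minimal sketch using the sample statistics from the text; because Python's standard library has no inverse t distribution, the t-critical value 2.13 for 15 degrees of freedom is taken from the table above rather than computed.

```python
from math import sqrt

# Sample statistics from the text: Song mahs (x1) and Rush Hour (x2)
n1, mean1, sd1 = 16, 3.44, 1.77
n2, mean2, sd2 = 25, 5.60, 3.73

tc = 2.13  # t-critical for df = 15 (smaller sample size minus one), from the text

se = sqrt(sd1**2 / n1 + sd2**2 / n2)  # standard error of the difference
e = tc * se                           # margin of error E
diff = mean1 - mean2                  # difference of sample means
lower, upper = diff - e, diff + e     # confidence interval bounds

print(round(se, 2), round(e, 2), round(diff, 2), round(lower, 2), round(upper, 2))
```

Running this reproduces the table: standard error 0.87, margin of error 1.85, difference of means -2.16, and an interval from -4.01 to -0.31. Zero is not inside the interval, so the null hypothesis of no difference would be rejected.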
11.23 T-test for difference in independent sample means

As noted above, spreadsheets provide a function to calculate p-values. If the p-value is less than your chosen risk of a type I error, then the difference is significant.
The function takes as inputs the data for one of the two samples (data_range_x1), the data for the other sample (data_range_x2), the number of tails, and a final variable that specifies the type of test. A t-test for means from independent samples is test type number three.
=TTEST(data_range_x1,data_range_x2,number of tails,3)
For the above data, the p-value is given in the following table:
p-value                       0.02
Maximum confidence level c    0.98

The TTEST function does not use the smaller sample size to determine the degrees of freedom. The TTEST function uses a different formula that calculates a larger number of degrees of freedom, which has the effect of reducing the p-value. Thus the confidence interval result could produce a failure to reject the null hypothesis while the TTEST could produce a rejection of the null hypothesis. This only occurs when the p-value is close to your chosen alpha.
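The "different formula" is typically the Welch-Satterthwaite approximation for unequal-variance tests; whether a given spreadsheet uses exactly this formula varies by implementation, so the sketch below is illustrative. For the sakau sample statistics it yields far more than the 15 degrees of freedom used in the confidence interval approach.

```python
# Welch-Satterthwaite approximate degrees of freedom,
# using the sample statistics from the text.
n1, sd1 = 16, 1.77  # Song mahs
n2, sd2 = 25, 3.73  # Rush Hour

a = sd1**2 / n1
b = sd2**2 / n2

df = (a + b)**2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

print(round(df, 1))  # → 36.6, well above the 15 used earlier
```

A larger degrees of freedom gives a slightly smaller t-critical value and a smaller p-value, which is why the two approaches can disagree near the chosen alpha.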
11.24 Calculating the t-statistic [Optional material]

If you have doubts and want to explore further, take the difference of the means and divide by the standard error to obtain the t-statistic t. Then use the TDIST function to determine the p-value, using the smaller sample size minus one to calculate the degrees of freedom.

t = (x̄1 − x̄2) / sqrt(s1²/n1 + s2²/n2)

Note that (μ1 − μ2) is presumed to be equal to zero. Thus the formula is the difference of the means divided by the standard error (given further above).
t = x̄d / (standard error)
Once t is calculated, use the TDIST function to determine the p-value.
=TDIST(ABS(t),n1-1,2)
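A minimal sketch of this calculation with the sample statistics from the text (the p-value step itself requires a t distribution, which the Python standard library lacks, so only t is computed here):

```python
from math import sqrt

# Sample statistics from the text
n1, mean1, sd1 = 16, 3.44, 1.77
n2, mean2, sd2 = 25, 5.60, 3.73

se = sqrt(sd1**2 / n1 + sd2**2 / n2)  # standard error of the difference
t = (mean1 - mean2) / se              # t-statistic

print(round(t, 2))  # → -2.49
```

Feeding ABS(t) of about 2.49 into TDIST with 15 degrees of freedom and two tails gives a p-value in the neighborhood of the 0.02 reported above.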
11.3 Effect size

If there is a significant difference in the means, then the effect size of that difference can be calculated. Note that if there is no significant difference in the means, then there is no effect size. If the result was a failure to reject the null hypothesis, then the effect size is meaningless and should not be reported.
The p-value provides information on how often a difference in means will be seen that is as large or larger for two samples that have no actual difference in the means. The p-value does not tell one whether the difference is meaningful. A very small difference that might be meaningless in the context of the experiment can be shown to be significant if the sample sizes are large enough. For two sample means, the effect size provides an estimate of the standardized mean difference between two sample means. The effect size is mathematically related to z-scores. The effect size for a difference of independent sample means is referred to as Cohen's effect size d.
The effect size for two sample means can be calculated from:

d = (mean of sample one − mean of sample two) / (pooled standard deviation sp)

where sp is the pooled standard deviation:

sp = sqrt(((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2))
Interpreting whether the difference in sample means has "meaning" in terms of the experiment is complex. Cohen provided some general guidelines. He also cautioned that these interpretations should be used with care. That said, in a beginning statistics course the guidelines provide a way to start thinking about effect size.
Cohen suggested that in the behavioral sciences an effect size of 0.2 is small, an effect size of 0.5 is medium, and an effect size of 0.8 is large. These values may not be correct for other fields of study. Educators in particular have noted that "small" effect sizes may still be important in education studies. The effect size is also affected by whether the data is normally distributed and is free of bias.
Think of effect size as a way to begin looking at whether the difference has real meaning, not just whether the difference is "surprising" from a p-value perspective. For more information on effect size start with It's the Effect Size, Stupid: What effect size is and why it is important [http://www.leeds.ac.uk/educol/documents/00002182.htm].

Cohen's effect size d calculation in a spreadsheet.
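The spreadsheet calculation in the figure can also be checked by hand. A sketch using the sample statistics from the sakau example above:

```python
from math import sqrt

# Sample statistics from the sakau example
n1, mean1, sd1 = 16, 3.44, 1.77
n2, mean2, sd2 = 25, 5.60, 3.73

# Pooled standard deviation
sp = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

# Cohen's effect size d
d = (mean1 - mean2) / sp

print(round(sp, 2), round(d, 2))  # → 3.13 -0.69
```

An absolute effect size of roughly 0.69 sits between Cohen's "medium" (0.5) and "large" (0.8) guidelines.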


12 Data Exploration

Instructor provides data of their choice for open data exploration.

12.1 Open data exploration questions to ask and seek answers to

Is the sample representative of the population?
What is the level of measurement?
What statistics can you report?
If ratio level data, what does a box plot of the quartiles reveal?
What are the measures of the middle?


What are the measures of spread?
What is the shape of the distribution?
If the shape is a normal distribution, is the variation due only to random processes?
Are there outliers?
What, if anything, do the outliers mean? If they are errors, can they be/should they be removed?
If appropriate to the data, does the data show a trend?
Can you generate a confidence interval for the mean?
If there is more than one sample, are you looking at a hypothesis test situation? Paired? Independent? Confidence interval test? A relationship between the variables?
Note that the above list of questions is appropriate for a student in an introductory statistics course for use in exploring data in ways that demonstrate knowledge of basic statistical functions. If one is a researcher with some knowledge of statistics, then the questions to be asked will differ. For guidance to a researcher looking to engage in effective statistical practice, the following guidelines were suggested by Kass, Cao, Davidian, Meng, Yu, and Reid in 2016:
[http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004961]
Ten Simple Rules for Effective Statistical Practice

1. Statistical Methods Should Enable Data to Answer Scientific Questions
2. Signals Always Come with Noise
3. Plan Ahead, Really Ahead
4. Worry about Data Quality
5. Statistical Analysis Is More Than a Set of Computations
6. Keep it Simple
7. Provide Assessments of Variability
8. Check Your Assumptions
9. When Possible, Replicate!
10. Make Your Analysis Reproducible

For details on how to implement these recommendations, see Ten Simple Rules for Effective Statistical Practice [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004961]
12.2 A variables analysis approach to open data exploration

Another way to tackle analysis of the data is to explore the number and nature of the variables being presented. How many variables? What level of measurement? In introductory statistics one is usually either exploring basic statistics, running correlations, or comparing means.
Data is often organized into tables. In statistics, columns are often variables while rows are individual data values. This is not always true, but in introductory statistics this is almost always true. If there is a single data column, then there is one variable. If there are two data columns, then there are two variables. The variable name and the units, if any, are usually listed in the first row of the table.
What can be analyzed, what can be done, depends in part on how many variables are present and the level of measurement. The following chart is for ratio level data. Note that basic statistics can be calculated for any ratio level variable. Remember that columns are variables.
There is a caveat in using this approach, one best captured by the article Ten Simple Rules for Effective Statistical Practice [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004961]:

While it is obvious that experiments generate data to answer scientific questions, inexperienced users of statistics tend to take for granted the link between data and scientific issues and, as a result, may jump directly to a technique based on data structure rather than scientific goal. For example, if the data were in a table, as for microarray gene expression data, they might look for a method by asking, "Which test should I use?" while a more experienced person would, instead, start with the underlying question, such as, "Where are the differentiated genes?" and, from there, would consider multiple ways the data might provide answers. Perhaps a formal statistical test would be useful, but other approaches might be applied as alternatives, such as heat maps or clustering techniques.

With that in mind, for the student in an introductory statistics course where the objective is to practice statistical operations, a data structures approach is arguably appropriate. The data structures do sometimes provide information on what can be done with the data.


Single variable
Single data column: Var

Basic statistics:
sample size n
min, max
range
measures of middle, spread
quartiles
boxplot
IQR outliers?
z outliers?
frequency distribution
histogram chart
shape of distribution
95% confidence interval for mean
If μ is known, then the hypothesis test Ho: μ = value can be run

Two variables
Two data columns
Equal sample size n: Var 1, Var 2

xy paired data: scattergraph; linear, non-linear, random; if linear: slope, intercept, correlation, coef. det.; (x, y); y = mx + b; variables often different; relationships between variables explored

paired t-test: "before" and "after" data; variables are the same, measured twice; paired t-test for a difference of means, p-value, effect size

Two variables
Two data columns
Unequal sample size n: Var 1, Var 2

independent t-test: variables are the same, different samples measured; independent samples t-test for difference of means, conf intervals, p-value, effect size

id:odxt
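The "IQR outliers?" entry in the chart above can be sketched as follows. The data column is made up purely for illustration; note that Python's `statistics.quantiles` defaults to the exclusive quartile method, while spreadsheet QUARTILE functions typically use the inclusive method, so the cut points may differ slightly from a spreadsheet's.

```python
import statistics as st

# Hypothetical single ratio-level data column, for illustration only
data = [2, 2.5, 3, 3, 3.5, 4, 4.5, 5, 5.5, 15]

q1, q2, q3 = st.quantiles(data, n=4)   # quartiles (exclusive method)
iqr = q3 - q1                          # interquartile range
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual 1.5*IQR fences

outliers = [x for x in data if x < low or x > high]
print(q1, q2, q3, outliers)  # the value 15 falls outside the fences
```
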
In the chart above there are three data layouts: single variable, two variables with the same sample sizes n, and two variables with different sample sizes n.
Any and all variables can be analyzed by the basic statistics - each column can be analyzed for measures of the middle and spread. The measures of the middle and spread that are appropriate will depend on the level of measurement.
The single variable can be explored for outliers. For a single variable boxplots, frequency tables, and histograms may be appropriate. A 95% confidence interval for the mean can be calculated. If there is an expected population mean μ, then a hypothesis test can be run to test whether the sample mean is significantly different from that known population mean μ.
If there are two columns with different sample sizes n, then there is a strong probability in an introductory statistics class that a t-test for a difference of means will be called for. Basic statistics for each variable can also be calculated.
When there are two variables with equal sample sizes, then there are three possibilities. The data could be xy coordinate pairs where one is testing to see if the y variable is correlated with the x variable. In this situation the variables are usually different. A second possibility is that the data represents a "before-and-after" set of measurements. A paired t-test for a difference of means is often called for. The variables will be the same, and the elements in each row will be something that was measured twice. The data is called paired data. The third possibility is that the same variable was measured for different elements, not something that was measured twice. In this situation a likely test is an independent samples t-test for a difference of means.
For both the paired data and independent samples data there is also the possibility that one could be testing for a difference of medians or a difference of variances (standard deviations). There are other tests such as the sign test, Wilcoxon Signed Rank test, Wilcoxon-Mann-Whitney test, and the F-test for generating p-values in those situations. At present these tests are beyond the scope of this introductory text.

Multiple columns of the same variable from different samples

At the introductory level multiple columns where the variables use the same units may be analyzed by basic statistical analyses of each column unless ANOVA and other multi-sample approaches have been taught. If the variables use different units, then an analysis of relationships and correlations may be appropriate. These are intended as general guidelines to help frame one's thinking about data. These recommendations and suggestions are guidelines, not rules.


Multiple variables with the same units
Multiple data columns: Var 1 (x1), Var 2 (x2), Var 3 (x3), Var 4 (x4)

Basic statistics:
sample sizes n
mins, maxs
ranges
measures of middle, spread
quartiles
boxplots
IQR outliers?
z-score outliers for each variable
frequency distributions
histogram charts
shape of distributions
95% confidence interval for means

id:odxt2
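A sketch of the per-column basic statistics listed in the chart above, using made-up columns that share the same units (for illustration only):

```python
import statistics as st

# Hypothetical data columns, same units in each column
columns = {"x1": [2, 3, 3.5, 5, 6],
           "x2": [1, 2, 2.5, 3, 9],
           "x3": [4, 4, 5, 6, 7]}

# Basic statistics for each column: n, min, max, mean, sample stdev
for name, data in columns.items():
    print(name, len(data), min(data), max(data),
          round(st.mean(data), 2), round(st.stdev(data), 2))
```

The same loop could be extended with `st.median` and `st.quantiles` for the measures of middle and the quartiles.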
In the above chart one has multiple variables with the same units for each column. In an advanced course one might be running an analysis of variance, but in this introductory course only basic metrics are likely to be examined. The data can still "tell a story" that can be supported by the citing of the appropriate statistics.
Consider whether rows have a meaning. Are rows measuring something from different samples with the same units in each column? At the introductory course level basic statistics, and searches for outliers, might be most appropriate. Or are the rows each a single "data point" with each column being in different units? Then there is a greater likelihood that looking for correlations among the columns might be a useful approach.

Multiple columns of different variables from the same sample

Where there are multiple columns of data and each column contains a different variable, typically from a single sample, then there is the possibility that a correlation analysis will produce useful information on whether the variables are related to each other or not.
Multiple variables with different units
Multiple data columns: Var 1 (x1), Var 2 (y1), Var 3 (y2), Var 4 (y3)
Possibly x1, y1, y2, y3 data

Make a scattergraph
Look for patterns: linear, non-linear, random
If linear: slopes, intercepts, correlations for each column

id:odxt3
Note that the above variables analysis presumes that the first column will be treated as "x" data and the subsequent columns as "y" data. Data does not have to be arranged this way, but in an introduction to statistics this arrangement is rather likely. Depending on the questions asked of the data, running correlations between the first and each subsequent column in a pairwise fashion might provide insight into whether relationships exist between the data columns.
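That pairwise approach can be sketched as follows. The table values are made up purely for illustration; only the layout (one x column, several y columns) follows the arrangement described above.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical table: first column x1, subsequent columns y1..y3
x1 = [1, 2, 3, 4, 5]
ys = {"y1": [2, 4, 5, 4, 5],
      "y2": [10, 8, 7, 5, 2],
      "y3": [3, 1, 4, 1, 5]}

# Correlate the first column with each subsequent column pairwise
for name, y in ys.items():
    print(name, round(pearson_r(x1, y), 2))
```

In a spreadsheet the same pairwise work is done with =CORREL(x_range, y_range) against each y column in turn.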
12.3 A Tools Approach

A third way to tackle open data exploration in an introductory statistics course is to consider the statistical tools one has learned to work with during the course. One can be 95% confident that the instructor has chosen a problem that can be resolved by the tools taught in the course. In the "wild" there are many more tools to consider: F-tests for a difference of variances (standard deviations), confidence intervals for a slope, tests for differences of medians, tests for normality. All of these are beyond the scope of this particular text. Thus the student is left with basic statistics (chapters one, two, three), correlations (chapter four), confidence intervals (chapter nine), hypothesis tests against a known mean (chapter ten), and tests for a difference in two sample means (chapter eleven). Those are the tools that have been covered in the course, and an open data exploration exercise is likely to utilize those same tools.
