
Outline

Populations, Samples, and Census


Some Sampling Concepts
Random Variables and Statistical Populations
Basic Graphics for Data Visualization
Proportions, Averages, Variances and Percentiles
Lesson 1
Chapter 1: Basic Statistical Concepts
Michael Akritas
Department of Statistics
The Pennsylvania State University
Michael Akritas Lesson 1 Chapter 1: Basic Statistical Concepts
Outline
1. Populations, Samples, and Census
2. Some Sampling Concepts
3. Random Variables and Statistical Populations
4. Basic Graphics for Data Visualization
5. Proportions, Averages, Variances and Percentiles
   Proportions: Population- and Sample-
   Averages: Population- and Sample-
   Variance: Population- and Sample-
   Sample Percentiles and the Box Plot
Introduction to R
R is a GNU project. The GNU (recursive acronym for
"GNU's Not Unix") project, sponsored by the Free Software
Foundation, was launched in 1984 to develop a complete
Unix-like operating system which is free software.
To find out about R go to http://www.R-project.org/ .
See also the NY Times article
http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all
To download R go to http://cran.r-project.org/.
You can start using R as a calculator: 2*4; 2**3; sqrt(16);
sin(pi); cos(2*pi); log(exp(1)); log(10,base=10)
Try some simple commands: 1:10, seq(1,10), seq(1,10,1),
seq(2,10,2). Also, rep(1,5), rep("a",5), rep(seq(1,4),2) or
rep(1:4,2), c(rep(0,5),rep(1,7)).
You can store the numbers in objects: x=c(rep(0,5),rep(1,7)).
Try x=seq(2,10,2); sum(x); mean(x). Try also x/2; x**2; sqrt(x)
Define functions: f=function(x){x**2}. Try f(2); f(c(2,3))
Integrate a function: integrate(f, 0, 3). Try also
g=function(x){x**(-2)}; integrate(g, 1, Inf)
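As a quick check of the two integration commands, the following minimal sketch assumes a standard R installation. The names `area_f` and `area_g` are introduced here for illustration only; `integrate` returns an object whose `value` component holds the numerical estimate.

```r
# The two functions from the slide, written with ^ (R treats ** as the same operator)
f <- function(x) { x^2 }
g <- function(x) { x^(-2) }

# integrate() returns an object; $value is the numerical estimate
area_f <- integrate(f, 0, 3)$value    # exact answer is 3^3 / 3 = 9
area_g <- integrate(g, 1, Inf)$value  # exact answer is 1

print(c(area_f, area_g))
```

The printed values agree with the exact answers up to numerical error.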
Why Statistics?
Example (Examples of Engineering/Scientific Studies)
Comparing the compressive strength of two or more
cement mixtures.
Comparing the effectiveness of three cleaning products in
removing four different types of stains.
Predicting failure time on the basis of stress applied.
Assessing the effectiveness of a new traffic regulatory
measure in reducing the weekly rate of accidents.
Testing a manufacturer's claim regarding a product's
quality.
Studying the relation between salary increases and
employee productivity in a large corporation.
These studies require Statistics due to the intrinsic variability:
The compressive strength of different preparations of the
same cement mixture will differ. The figure in
http://sites.stat.psu.edu/~mga/401/fig/HistComprStrCement.pdf
shows 32 compressive strength measurements (in MegaPascal
units) of test cylinders (6 in. diameter, 12 in. high), using
a water/cement ratio of 0.4, measured on the 28th day after
they are made.
Under the same stress, two beams fail at different times.
The proportion of defective items of a certain product will
differ from batch to batch.
Intrinsic variability renders the objectives of the case studies, as
stated, ambiguous.
The objectives of the case studies can be made precise if
stated in terms of averages or means.
Comparing the average hardness of two different cement
mixtures.
Predicting the average failure time on the basis of stress
applied.
Estimation of the average coefficient of thermal expansion.
Estimation of the average proportion of defective items.
Moreover, because of variability, the words "average" and
"mean" have a technical meaning which can be made clear
through the concepts of population and sample.
Definition
A population is a well-defined collection of objects or subjects, of
relevance to a particular study, which are exposed to the same
treatment or method.
Population members are called units.
The objective of a study is to investigate certain
characteristic(s) of the units of the population(s) of
interest.
Example (Populations and Unit Characteristics)
All water samples taken from a lake. Characteristics:
Mercury concentration; Concentration of other pollutants.
All items of a certain manufactured product (that have been, or
will be, produced). Characteristic: Proportion of defective
items.
All students enrolled in Big Ten universities during the
2013-14 academic year. Characteristics: Favorite type of
music; Political affiliation.
Two types of cleaning products. Characteristic: cleaning
effectiveness.
Populations that consist of the same type of units but differ
in the treatment, or method, applied to them are called
treatment populations.
Example (Treatment Populations)
The concentration of pollutants in water samples is
analyzed by two different labs. Water samples sent to Lab
1 constitute population 1, and those sent to Lab 2
constitute population 2.
The time to failure of beams is studied under different
stress conditions. The beams subjected to each stress
condition constitute different populations.
Full (i.e. population-level) understanding of a characteristic
requires the examination of all population units, i.e. a
census.
For example, full understanding of the relation between
salary and productivity of a corporation's employees
requires obtaining these two characteristics from all
employees.
However, taking a census can be time consuming and expensive:
The 2000 U.S. Census cost $6.5 billion, while the 2010
Census cost $13 billion.
Moreover, a census is not feasible if the population is
hypothetical or conceptual, i.e. not all members are
available for examination.
Because of the above, we typically settle for examining all
units in a sample, which is a subset of the population.
Due to the intrinsic variability, the sample properties/attributes
of the characteristic of interest will differ from those of the
population. For example:
The average mercury concentration in 25 water samples
will differ from the overall average mercury concentration in
the lake.
The proportion in a sample of 100 PSU students who favor
the use of solar energy will differ from the corresponding
proportion of all PSU students.
The relation between bears' chest girth and weight in a
sample of 10 bears will differ from the corresponding
relation in the entire population of 50 bears in a forested
region.
The GOOD NEWS is that, if the sample is suitably drawn, then
sample properties approximate the population properties.
Figure: Population and sample relationships between chest girth and
weight of black bears (scatterplot of Weight against Chest Girth).
Sampling Variability
Sample properties of the characteristic of interest also
differ from sample to sample. For example:
1. The number of US citizens, in a sample of size 20, who
favor expanding solar energy will (most likely) be different
from the corresponding number in a different sample of 20
US citizens.
2. The average mercury concentration in two sets of 25 water
samples drawn from a lake will differ.
The term sampling variability is used to describe such
differences in the characteristic of interest from sample to
sample.
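Sampling variability can be seen in a small simulation. The sketch below assumes a hypothetical statistical population of 10,000 mercury concentrations; the values are invented (drawn from a gamma distribution chosen only for illustration) and do not come from any real lake.

```r
set.seed(1)  # makes the illustration reproducible

# Hypothetical statistical population: 10,000 invented mercury concentrations
population <- rgamma(10000, shape = 4, rate = 8)

# Two simple random samples of 25 water samples each
sample_a <- sample(population, 25)
sample_b <- sample(population, 25)

# The two sample means differ from each other and from the population mean
print(c(population = mean(population),
        sample_a   = mean(sample_a),
        sample_b   = mean(sample_b)))
```

Changing the seed gives different samples and different sample means, while the population mean stays fixed.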
Figure: Illustration of Sampling Variability (scatterplot of Weight
against Chest Girth).
Population-level properties/attributes of characteristic(s) of
interest are called (population) parameters.
Examples of parameters include averages, proportions,
percentiles, and correlation coefficients.
The corresponding sample properties/attributes of
characteristics are called statistics.
Sample statistics approximate the corresponding
population parameters but are not equal to them.
Statistical inference deals with the uncertainty issues
which arise in approximating parameters by statistics.
The tools of statistical inference include point and interval
estimation, hypothesis testing and prediction.
Example (Examples of Estimation, Hypothesis Testing and
Prediction)
Estimation (point and interval) would be used in the task of
estimating the coefficient of thermal expansion of a metal,
or the air pollution level.
Hypothesis testing would be used for deciding whether to
take corrective action to bring the air pollution level down,
or whether a manufacturer's claim regarding the quality of
a product is false.
Prediction arises in cases where we would like to predict
the failure time on the basis of the stress applied, or the
age of a tree on the basis of its trunk diameter.
For valid statistical inference the sample must be
representative of the population. For example, a sample
of PSU basketball players is not representative of PSU
students, if the characteristic of interest is height.
Typically it is hard to tell whether a sample is
representative of the population. So, we define a sample to
be representative if . . . (circular definition!!)
it allows for valid statistical inference.
The only guarantee for that comes from the method used
to select the sample (sampling method).
The good news is that there are several sampling methods
that guarantee representativeness.
Definition
A sample of size n is a simple random sample (s.r.s.) if the selection
process ensures that every sample of size n has an equal chance
of being selected.
In simple random sampling every member of the
population has the same chance of being included in the
sample. The reverse, however, is not true.
Example
To select a sample of 2 students from a population of 20 male
and 20 female students, one selects at random one male and
one female student. Is this a s.r.s.? (Does every student have
the same chance of being included in the sample?)
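One way to examine the question in the example is to count. The sketch below compares the one-male-one-female scheme with simple random sampling of 2 students from the 40; all variable names are introduced here for illustration.

```r
N_male <- 20
N_female <- 20
N <- N_male + N_female

# Under simple random sampling, each of the choose(40, 2) pairs is equally likely
n_pairs_srs <- choose(N, 2)           # 780 possible samples
# Under the scheme in the example, only male-female pairs can occur
n_pairs_scheme <- N_male * N_female   # 400 possible samples

# Chance that a given student is included in the sample
p_include_srs <- 2 / N                # 0.05
p_include_scheme <- 1 / N_male        # 0.05 as well

# Chance that the sample consists of two males
p_two_males_srs <- choose(N_male, 2) / n_pairs_srs   # 190/780, positive
p_two_males_scheme <- 0                              # impossible under the scheme
```

Every student has the same chance (1/20) of being included under both schemes, yet the scheme in the example is not a s.r.s., because samples of two males (or two females) can never be selected.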
Another sampling method for obtaining a representative sample
is called stratified sampling.
Definition
A stratified sample consists of simple random samples from
each of a number of groups (which are non-overlapping and
make up the entire population) called strata.
Examples of strata include: ethnic groups, age groups, and
production facilities.
If the units in the different strata differ in terms of the
characteristic under study, stratified sampling is preferable
to s.r.s. For example, if different production facilities differ
in terms of the proportion of defective products, a stratified
sample is preferable.
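A minimal sketch of stratified sampling in base R follows. The population of 300 items, the three facilities, and the choice to sample 10% of each stratum are all assumptions made for illustration; the slides do not prescribe how many units to take from each stratum.

```r
set.seed(2)

# Hypothetical population: 300 items from three production facilities (strata)
population <- data.frame(
  id = 1:300,
  facility = rep(c("A", "B", "C"), times = c(150, 100, 50))
)

# Split into strata, then draw a simple random sample of 10% within each stratum
strata <- split(population, population$facility)
stratified <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), size = nrow(s) / 10), ]
}))

print(table(stratified$facility))  # 15 from A, 10 from B, 5 from C
```

Each facility is guaranteed to be represented, which a single s.r.s. of 30 items from the whole population would not ensure.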
How do we select a s.r.s. of size n from a population of N
units?
STEP 1: Assign to each unit a number from 1 to N.
STEP 2: Write each number on a slip of paper, place the
N slips of paper in an urn, and shuffle them.
STEP 3: Select n slips of paper at random, one at a time.
Alternatively, the entire process can be performed in software
like R. We will see this in the next lab session.
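As one possible illustration of the software route, the sketch below uses base R's `sample` function; the values N = 50 and n = 10 are hypothetical, and the lab session may well use a different approach.

```r
set.seed(3)

N <- 50   # hypothetical population size
n <- 10   # desired sample size

# STEP 1: units are numbered 1, ..., N
units <- 1:N

# STEPS 2-3: sample() plays the role of shuffling the urn and drawing
# n slips one at a time, without replacement (the default)
srs <- sample(units, size = n)
print(srs)
```

The selected numbers identify which population units enter the sample.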
Sampling without replacement simply means that a
population unit can be included in a sample at most once.
For example, a simple random sample is obtained by
sampling without replacement: Once a unit's slip of paper
is drawn, it is not placed back into the urn.
Sampling with replacement means that after a unit's slip of
paper is chosen, it is put back in the urn. Thus a
population unit could be included in the sample anywhere
between 0 and n times. Rolling a die can be thought of as
sampling with replacement from the numbers 1, 2, . . . , 6.
Though conceptually undesirable, sampling with
replacement is easier to work with from a mathematical
point of view.
When a population is very large, sampling with and without
replacement are practically equivalent.
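The difference between the two schemes, and why it fades for large populations, can be sketched in R. The helper `p_no_repeat` is a hypothetical name introduced here; it computes the chance that n draws with replacement from N units contain no repeated unit.

```r
set.seed(4)

# Ten rolls of a die: sampling with replacement from 1, ..., 6
rolls <- sample(1:6, size = 10, replace = TRUE)

# Without replacement, at most 6 draws are possible and no value repeats
no_repeat <- sample(1:6, size = 6, replace = FALSE)

# Chance that n draws with replacement from N units show no repeated unit
p_no_repeat <- function(N, n) prod((N - 0:(n - 1)) / N)

p_small <- p_no_repeat(6, 3)      # (6/6) * (5/6) * (4/6), about 0.56
p_large <- p_no_repeat(1e6, 10)   # very close to 1
print(c(p_small, p_large))
```

When the population is very large relative to the sample, a sample drawn with replacement almost never contains a repeat, so it looks like one drawn without replacement.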
Non-representative samples arise whenever the sampling
plan is such that a part, or parts, of the population of
interest are either excluded from, or systematically
under-represented in, the sample. This is called selection
bias.
Two examples of non-representative samples are
self-selected and convenience samples.
A self-selected sample often occurs when people are
asked to send in their opinions in surveys or
questionnaires. For example, in a political survey, often
those who feel that things are running smoothly or who
support an incumbent will (apathetically) not respond,
whereas those activists who strongly desire change will
voice their opinions.
Convenience samples are made up from the most easily
accessed units. For example, randomly selecting students
from your classes will not result in a sample that is
representative of all PSU students since your classes are
mostly comprised of students with the same major as you.
Example (The Literary Digest poll of 1936)
The magazine had been extremely successful in predicting the
results in US presidential elections, but in 1936 it predicted a
3-to-2 victory for Republican Alf Landon over the Democratic
incumbent Franklin Delano Roosevelt. Worth noting is that this
prediction was based on 2.3 million responses (out of 10 million
questionnaires sent). On the other hand, Gallup correctly
predicted the outcome of that election by surveying only 50,000
people.
Variable = a Numerical Characteristic
If the characteristic of interest can be expressed as a
number, e.g. thermal expansion of a metal, hardness of
cement, mercury concentration, or number of accidents, it is
called quantitative.
Examples of non-quantitative characteristics are gender, make
of car, eye color, strength category, political affiliation. Such
characteristics are called categorical or qualitative.
Because statistical procedures are applied to numerical data
sets, the categories of a categorical characteristic are labeled
with arbitrarily chosen numbers (e.g. male = -1, female = +1).
A characteristic expressed as a number is called a variable.
Types of Variables
Qualitative variables are a particular kind of discrete
variables. Quantitative variables can also be discrete.
All variables expressing counts, such as the number of
earthquakes, the number of fish caught, etc., are discrete.
Quantitative variables expressing measurements on a
continuous scale are examples of continuous variables.
Measurements of length, strength, weight, or time to failure
are examples of continuous variables.
When two or more characteristics are measured on each
population unit, we have bivariate or multivariate
variables.
Example of bivariate: Salary increase and productivity.
Example of multivariate: Age, income, education level.
Random Variables
When a unit is randomly sampled from a population, the
value of its variable will be denoted by X (or Y, or Z, etc).
Because of the intrinsic variability, X is not known a priori
and thus it is called a random variable (r.v.).
The population from which a random variable is drawn is
called the underlying population of the r.v.
The collection of the variable values of all population
units is called the statistical population.
The statistical population of a r.v. is NOT the same as the
set of values a variable can take.
Example
1. A list of the weight of every PSU student is the statistical
population of the r.v. weight.
2. A list of 1s and 0s representing every student's opinion on
whether solar energy should be expanded is the statistical
population of the r.v. expressing opinion on solar energy.
Sampling from the Statistical Population
It should be intuitively clear that taking a sample of n units from
some population and recording the variable of each sampled
unit, is equivalent to taking a sample of n units from the
statistical population of the random variable and its underlying
population.
Henceforth, the word sample will also be used to denote a
sample from the statistical population. Such a sample:
1. Consists of units of the statistical population, i.e. numbers.
2. The numbers are not known a priori, so they are r.v.s.
3. A sample of size n will be denoted by $X_1, X_2, \ldots, X_n$.
Histograms and Stem and Leaf Plots
In histograms the range of the data is divided into bins, and
a box is constructed above each bin.
The height of each box is the bin's frequency. Alternatively,
the heights can be adjusted so the histogram's area is one.
R will automatically choose the number of bins but it also
allows user specified intervals. Moreover, R offers the
option of constructing a smooth histogram.
In stem and leaf plots each observation gets split into its
stem, which is the beginning digit(s), and its leaf, which is
the first of the remaining digits.
They retain more information about the original data but do
not offer as much flexibility in selecting the bins.
The R data set faithful
The histogram, with superimposed smooth histogram, for a
sample of 272 eruption durations from the Old Faithful
geyser is shown in
http://stat.psu.edu/~mga/401/fig/HistOldFaith1.pdf
The stem and leaf plot for the same data set is shown in
http://stat.psu.edu/~mga/401/fig/StemLeaf.pdf
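Plots of this kind can be produced with a few base R commands. The sketch below assumes that the "smooth histogram" is a kernel density estimate (computed here with `density`); the linked figures may have been produced with different settings.

```r
# faithful ships with base R; eruptions holds the durations in minutes
durations <- faithful$eruptions
n_obs <- length(durations)   # 272 observations

# Histogram rescaled so its total area is one, with a kernel density
# estimate superimposed as the smooth histogram
hist(durations, freq = FALSE,
     main = "Old Faithful eruption durations",
     xlab = "Duration (minutes)")
lines(density(durations))

# Stem and leaf plot of the same data
stem(durations)
```

The `breaks` argument of `hist` accepts user specified intervals, for example `breaks = seq(1.5, 5.5, by = 0.25)`.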
Scatterplots
The basic scatterplot is useful for exploring the
relationship between two variables. An enhanced version
identifies subclasses of data. See
http://stat.psu.edu/~mga/401/fig/BearsChG_W_by_S.pdf
A scatterplot matrix is a matrix of scatterplots for all pairs
of variables in a data set. See
http://stat.psu.edu/~mga/401/fig/BearMeas_by_S.pdf. It helps identify
the best single predictor of weight.
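The bear measurements are not distributed with R, so the sketch below uses the built-in `iris` data set as a stand-in (an assumption made purely for illustration): species plays the role of the subclass, and the four flower measurements play the role of the body measurements.

```r
# Stand-in data: built-in iris (four measurements plus a Species factor)
species_code <- as.integer(iris$Species)   # 1, 2, 3 used as plotting colors

# Enhanced scatterplot: points colored by subclass
plot(iris$Petal.Length, iris$Petal.Width, col = species_code,
     xlab = "Petal length", ylab = "Petal width")
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 1)

# Scatterplot matrix for all pairs of the four measurements
pairs(iris[, 1:4], col = species_code)
```

In the scatterplot matrix, the panel showing the tightest relationship with the response suggests the best single predictor.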
Scatterplots Continued
Scatterplots with marginal histograms show
histograms of the two variables in the margins of the
scatterplot. See
http://stat.psu.edu/~mga/401/fig/BearMeas_with_MarginalHist.pdf
3D Scatterplots are useful for exploring the relationship
between three variables. For example,
http://stat.psu.edu/~mga/401/fig/TempProdElect2.pdf gives
a three dimensional view of the joint effect of temperature
and production volume on electricity consumed.
Pie Charts and Bar Graphs
Pie charts and bar graphs are used with count data to
display the proportion of each category in a sample.
The pie chart is popular in the mass media and one of the
most widely used statistical charts in the business world.
It is a circular chart, where the circle is divided into
sections whose areas represent proportions.
The pie chart in
http://www.stat.psu.edu/~mga/401/fig/LvMsPie.pdf
displays information on the November 2011 light vehicle
market share of car companies (source:
http://wardsauto.com/keydata/USSalesSummary0702.xls).
According to Stevens' power law, bar lengths are better than
section areas for comparing the different proportions.
Bar graphs resemble histograms with the heights of the
bars equal to the proportion of each category. The bar
graph display of the November 2011 light vehicle market
share data is shown in
http://stat.psu.edu/~mga/401/fig/LvMsBar2.pdf.
Remark: When the heights of the bars are arranged in
decreasing order, the bar graph is also called a Pareto chart. The
Pareto chart is one of the key tools used in quality control, where it
is often used to represent the most common sources of defects in a
manufacturing process, or the most frequent reasons for customer
complaints, etc.
[Google "Pareto principle"]
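The three displays can be sketched in base R as follows. The market shares are invented for illustration and are not the November 2011 data; the company names are placeholders.

```r
# Hypothetical market shares (percent), invented for illustration only
share <- c(CompanyA = 25, CompanyB = 40, CompanyC = 15, Other = 20)

pie(share, main = "Pie chart")
barplot(share, ylab = "Percent", main = "Bar graph")

# Pareto chart: the same bars arranged in decreasing order of height
pareto <- sort(share, decreasing = TRUE)
barplot(pareto, ylab = "Percent", main = "Pareto chart")
```

Comparing the pie chart with the bar graph of the same numbers shows why differences between similar proportions are easier to judge from bar lengths.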
The Most Common Parameters
For a univariate statistical population these are:
The proportion. For example, the proportion of Honda
Accords that will require warranty repair work in 36,000
miles.
The average. For example, the average failure time at a
given stress level.
The variance and standard deviation. These parameters
quantify the intrinsic variability.
The median and other percentiles. Can be used to quantify
both location and variability.
Proportions are relevant whenever the variable of interest is
categorical, or has been categorized.
Definition
1. If the population has N units, and $N_i$ units are in category i,
then the population proportion for category i is
$$p_i = \frac{\#\{\text{population units of category } i\}}{\#\{\text{population units}\}} = \frac{N_i}{N}.$$
2. If a sample of size n is taken, and $n_i$ sample units are in
category i, then the sample proportion for category i is
$$\hat{p}_i = \frac{\#\{\text{sample units of category } i\}}{\#\{\text{sample units}\}} = \frac{n_i}{n}.$$
Example
1. In a sample of 1000 adults, 72% favor tougher penalties for
drunk driving. Is the correct notation for 0.72 $p$ or $\hat{p}$?
2. In a population of 80 engineering majors taking a required
statistics class, 40 are enthusiastic about having computer
labs. In a s.r. sample of 20 of these students, 8 are
enthusiastic. What is the correct notation for 40/80 = 0.5
and for 8/20 = 2/5?
Always remember that, under s.r. sampling, $\hat{p}$
approximates, but in general is different from, $p$.
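For the second part of the example, the definitions of population and sample proportion can be applied directly: 40/80 is computed from all population units, and 8/20 from the sampled units only. The sketch below encodes "enthusiastic" as 1 and "not enthusiastic" as 0, a coding chosen here for illustration.

```r
# Population: 80 engineering majors, 40 of them enthusiastic (coded 1)
population <- rep(c(1, 0), times = c(40, 40))
p <- mean(population)     # population proportion p = 40/80 = 0.5

# The sample reported in the example: 20 students, 8 of them enthusiastic
observed <- rep(c(1, 0), times = c(8, 12))
p_hat <- mean(observed)   # sample proportion p-hat = 8/20 = 0.4

# Another simple random sample of 20 gives another value of p-hat
set.seed(5)
p_hat_new <- mean(sample(population, 20))
print(c(p = p, p_hat = p_hat, p_hat_new = p_hat_new))
```

The value of p is fixed, while the sample proportion changes from one sample to the next.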
Consider a population of N units, and let $v_1, v_2, \ldots, v_N$ denote
the statistical population corresponding to some variable. Then
the population average or population mean, denoted by $\mu$, is
the arithmetic average of all values in the statistical population.
Thus,
$$\mu = \frac{1}{N}\sum_{i=1}^{N} v_i.$$
If the random variable X denotes the value of the variable of a
randomly selected population unit, then a synonymous
terminology for the population mean is expected value of X, or
mean value of X, and is denoted by $\mu_X$ or $E(X)$.
Example
In a population of 500 tin plates, the number of plates with 0, 1 and 2 scratches is $N_0 = 190$, $N_1 = 160$ and $N_2 = 150$. Thus, in the statistical population $v_1, \ldots, v_{500}$, 190 $v_i$ equal 0, 160 equal 1, and 150 equal 2. The population mean is
$$\mu = \frac{1}{500}\sum_{i=1}^{500} v_i = \frac{0 \cdot N_0}{500} + \frac{1 \cdot N_1}{500} + \frac{2 \cdot N_2}{500} = 0.92.$$
If a tin plate is selected at random and $X$ is the rv denoting the number of scratches, the mean value of $X$ is 0.92 and we write $\mu_X = 0.92$, or $E(X) = 0.92$.
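The population mean of the tin plate example can be checked in R. The sketch below rebuilds the statistical population from the counts stated in the example; the object names (values, counts, v, mu, mu_weighted) are illustrative choices and do not come from the slides.

```r
# Scratch counts from the tin plate example: N0 = 190, N1 = 160, N2 = 150
values <- c(0, 1, 2)
counts <- c(190, 160, 150)

# Statistical population v_1, ..., v_500: each value repeated by its count
v <- rep(values, times = counts)

# Population mean as the arithmetic average of all 500 values
mu <- mean(v)

# Equivalent weighted form: sum of value * (count / N)
mu_weighted <- sum(values * counts / sum(counts))

print(c(mu = mu, mu_weighted = mu_weighted))
```

Both forms give the same number, since the weighted sum only groups equal values together.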
If a sample of size $n$ is taken, and $x_1, x_2, \ldots, x_n$ denote the variable values of the sample units, then the sample average or sample mean, denoted by $\bar{x}$, is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Under s.r. sampling, a sample mean approximates, but in general is different from, the population mean.
Example
If a s.r. sample of $n = 100$ is taken from the 500 tin plates, it could be that there are $n_0 = 40$, $n_1 = 34$ and $n_2 = 26$ plates with 0, 1 and 2 scratches. In this case, $\bar{x} = 0.86$.
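A short R sketch of the same idea follows. It computes the sample mean of the sample described in the example and then draws one further s.r. sample from the 500 plates. The seed and the object names are arbitrary assumptions made only for illustration; a different seed gives a different sample mean.

```r
# The sample described in the example: n0 = 40, n1 = 34, n2 = 26
x <- rep(c(0, 1, 2), times = c(40, 34, 26))
xbar <- mean(x)          # sample mean of this particular sample

# Another s.r. sample (drawn without replacement) from the 500 plates;
# the seed is arbitrary and only makes the draw reproducible
v <- rep(c(0, 1, 2), times = c(190, 160, 150))
set.seed(1)
x_new <- sample(v, size = 100)
xbar_new <- mean(x_new)  # typically near, but not equal to, the population mean 0.92

print(c(xbar = xbar, xbar_new = xbar_new))
```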
Proportions are Averages!
A proportion is a special case of a mean. To see this:
Consider the example with the tin plates, where $N_1 = 160$ out of $N = 500$ have one scratch, and let the variable $X$ take the value 1 if a tin plate has one scratch and the value 0 otherwise.
Note that for the statistical population, $v_1, \ldots, v_{500}$, of this variable, 160 $v_i$ are equal to 1 and 340 are equal to 0.
Thus, $\mu_X = \frac{160}{500} = 0.32$, which equals $p = \frac{N_1}{N}$.
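In R this amounts to averaging a 0/1 indicator. A minimal sketch, with illustrative object names:

```r
# Tin plate population: 190 plates with 0, 160 with 1, 150 with 2 scratches
v <- rep(c(0, 1, 2), times = c(190, 160, 150))

# Indicator variable: 1 if the plate has exactly one scratch, 0 otherwise
ind <- as.numeric(v == 1)

# The mean of the indicator is the population proportion p = N1 / N
p <- mean(ind)
print(p)
```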
Let $v_1, v_2, \ldots, v_N$ be a statistical population with mean $\mu$.
Definition
The population variance, $\sigma^2$, is defined as
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (v_i - \mu)^2.$$
The standard deviation is the positive square root of the variance: $\sigma = \sqrt{\sigma^2}$.
If the rv $X$ denotes a randomly selected value from the statistical population, then a synonymous terminology for the population variance is variance of $X$, and is denoted by $\sigma_X^2$, or $\mathrm{Var}(X)$. The standard deviation of $X$ is $\sigma_X = \sqrt{\sigma_X^2}$.
A simpler computational formula for the variance is
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} v_i^2 - \mu^2.$$
Example
Consider the tin plate example, so the statistical population $v_1, \ldots, v_{500}$ has 190 $v_i$ equal 0, 160 equal 1, 150 equal 2, and $\mu = 0.92$. Then,
$$\sigma^2 = \frac{190 \cdot 0}{500} + \frac{1 \cdot 160}{500} + \frac{4 \cdot 150}{500} - 0.92^2 = 0.6736.$$
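Both forms of the population variance can be evaluated in R. Note that R's var() divides by the number of values minus one, so it is not used here; the sketch below computes the population variance directly from the two formulas. Object names are illustrative.

```r
# Tin plate population: 190 plates with 0, 160 with 1, 150 with 2 scratches
v <- rep(c(0, 1, 2), times = c(190, 160, 150))
mu <- mean(v)

# Definition: average squared deviation from the population mean (divide by N)
sigma2_def <- mean((v - mu)^2)

# Computational formula: average of the squares minus the squared mean
sigma2_comp <- mean(v^2) - mu^2

# Population standard deviation
sigma <- sqrt(sigma2_def)

print(c(sigma2_def = sigma2_def, sigma2_comp = sigma2_comp, sigma = sigma))
```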
If $x_1, x_2, \ldots, x_n$ denotes a sample from the statistical population, the sample variance and its computational formula are:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right].$$
The sample standard deviation is $S = \sqrt{S^2}$. Under s.r. sampling, $S^2$ approximates, but in general is different from, $\sigma^2$.
Example
Consider the s.r. sample of $n = 100$ tin plates, which has 40, 34 and 26 plates with 0, 1 and 2 scratches. Then,
$$S^2 = \frac{1}{99}\left[138 - 73.96\right] = 0.647.$$
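The same numbers can be reproduced in R, where var() and sd() use the divisor $n - 1$. The sketch below compares the built-in result with the computational formula; the object names are illustrative.

```r
# The s.r. sample from the example: 40, 34 and 26 plates with 0, 1 and 2 scratches
x <- rep(c(0, 1, 2), times = c(40, 34, 26))
n <- length(x)

# Computational formula: [sum of squares - (sum)^2 / n] / (n - 1)
S2_comp <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

# Built-in sample variance and standard deviation (both use n - 1)
S2 <- var(x)
S <- sd(x)

print(c(sum_sq = sum(x^2), sq_sum_over_n = sum(x)^2 / n, S2 = S2, S = S))
```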
Why Divide by $n - 1$?
Because this assures that the average of the sample variances resulting from all possible samples is equal to the population variance.
Example
The variance of the population {0, 1}, which corresponds to
tossing a fair coin, is 0.25 (why?). The possible samples of size
two, taken with replacement, are {0, 0}, {0, 1}, {1, 0}, {1, 1}.
Verify that the four sample variances average to 0.25.
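One way to carry out this verification in R is sketched below. It enumerates the four samples with expand.grid(), computes each sample variance with the divisor $n - 1$, and, for contrast, with the divisor $n$. The object names are illustrative.

```r
# All samples of size two, drawn with replacement from the population {0, 1}
samples <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))

# Sample variance of each row with divisor n - 1 (what var() uses)
s2 <- apply(samples, 1, var)

# The same sum of squares divided by n instead of n - 1
s2_n <- apply(samples, 1, function(x) mean((x - mean(x))^2))

print(mean(s2))    # average of the four sample variances
print(mean(s2_n))  # average when dividing by n
```

With the divisor $n - 1$ the four values average to the population variance 0.25; with the divisor $n$ the average is smaller.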
Roughly speaking, the $(1-\alpha)100$th sample percentile separates the part having the $(1-\alpha)100\%$ smaller values, from that which has the $\alpha 100\%$ larger values. Thus:
The 90th sample percentile separates the largest 10% from
the lower 90% values in the data set.
The 50th sample percentile is also called the sample
median. The 25th, the 50th and the 75th sample
percentiles are also called sample quartiles. The 25th
and 75th percentiles are the lower quartile and upper
quartile, respectively.
The distance between the lower and upper quartiles is
called the interquartile range or IQR.
Order Statistics as Sample Percentiles
Let $X_1, \ldots, X_n$ be a s.r. sample from a continuous distribution. The ordered sample values are denoted $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$.
Thus, $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$.
$X_{(i)}$, the $i$th smallest sample value, is defined to be the $100\left(\frac{i - 0.5}{n}\right)$-th sample percentile.
Example
A s.r. sample of the weights of 10 black bears is: 154 158 356 446 40 154 90 94 150 142. Give the order statistics, and state the population percentiles they estimate.
Solution: The R command
sort( c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142) )
returns the order statistics: 40, 90, 94, 142, 150, 154, 154, 158, 356, 446. These order statistics estimate the 5th, 15th, 25th, 35th, 45th, 55th, 65th, 75th, 85th and 95th population percentiles, respectively. For example, $X_{(3)} = 94$ is the $100(3 - 0.5)/10 = 25$th percentile and estimates the corresponding population percentile. [In R the percentiles are obtained with: 100*(1:10 - 0.5)/10.]
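The two commands from the solution can be combined into one small table that pairs each order statistic with the percentile it represents. This is only a sketch; the object and column names are illustrative.

```r
# Black bear weights from the example
w <- c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142)
n <- length(w)

# Order statistics X_(1), ..., X_(n)
ord <- sort(w)

# X_(i) is the 100 * (i - 0.5) / n -th sample percentile
pct <- 100 * (seq_len(n) - 0.5) / n

print(data.frame(i = seq_len(n), order_statistic = ord, percentile = pct))
```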
In the above example none of the order statistics corresponds to the median or the 90th percentile. In general, if $n$ is even, none of the order statistics corresponds to the median. For example,
If $n = 5$ then $X_{(3)}$, the 3rd smallest value, is the $100\,\frac{2.5}{5} = 50$th sample percentile or median.
If $n = 4$ then $X_{(2)}$ is the $100\,\frac{1.5}{4} = 37.5$th sample percentile, while $X_{(3)}$ is the $100\,\frac{2.5}{4} = 62.5$th sample percentile. Thus, none of the ordered values is the median.
Depending on $n$, the above definition may not identify other percentiles of interest. In such cases, we use interpolations.
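The percentile levels attached to the order statistics for a given $n$ can be listed in R. In the sketch below, percentile_levels is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical helper: percentile levels 100 * (i - 0.5) / n for i = 1, ..., n
percentile_levels <- function(n) 100 * (seq_len(n) - 0.5) / n

lev5 <- percentile_levels(5)   # odd n: the middle order statistic sits at 50
lev4 <- percentile_levels(4)   # even n: no order statistic sits at 50

print(lev5)
print(lev4)
```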
Percentiles in R
R uses a different interpolation algorithm for evaluating sample percentiles from a given data set. With the data set in the object x, the commands
median(x)
quantile(x,0.25)
quantile(x,c(0.3,0.7,0.9))
summary(x)
give, respectively, the median, the 25th percentile, the 30th, 70th and 90th percentiles, and a five number summary of the data consisting of $x_{(1)}$, $q_1$, $\tilde{x}$, $q_3$, and $x_{(n)}$ (the output of summary(x) also reports the sample mean).
Example
Using the previous sample of 10 black bear weights, estimate the
population median, 70th, 80th and 90th percentiles.
Solution: With the sample values in the object w, i.e.
w=c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142)
the R command
quantile(w,c(0.5, 0.7, 0.8, 0.9))
returns 152.0, 155.2, 197.6, 365.0 for the sample median, 70th, 80th
and 90th percentiles, respectively.
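To see where numbers such as 155.2 come from, the sketch below reimplements by hand what is, to the best of our understanding of the quantile() documentation, its default rule ("type 7"): linear interpolation between order statistics at position $h = (n-1)p + 1$. The helper name q7 is hypothetical, and the sketch is meant only for probabilities between 0 and 1.

```r
w <- c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142)
p <- c(0.5, 0.7, 0.8, 0.9)

# Hypothetical helper: interpolate linearly between the order statistics
# that surround position h = (n - 1) * p + 1
q7 <- function(x, p) {
  xs <- sort(x)
  n <- length(xs)
  h <- (n - 1) * p + 1
  lo <- floor(h)
  hi <- pmin(lo + 1, n)   # guard so that p = 1 does not index past n
  xs[lo] + (h - lo) * (xs[hi] - xs[lo])
}

by_hand <- q7(w, p)
built_in <- unname(quantile(w, p))

print(rbind(by_hand, built_in))
```

For example, with $p = 0.7$ the position is $h = 7.3$, so the value lies 30% of the way from $X_{(7)} = 154$ to $X_{(8)} = 158$.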
The five number summary of the data given by the summary(x) command in R is the basis for the boxplot.
A boxplot displays the central 50% of the data with a box whose lower and upper edges are at $q_1$ and $q_3$, respectively; a line inside the box represents the median.
The lower 25% and upper 25% of the data are represented by lines (or whiskers) which extend from each edge of the box.
The lower (upper) whisker extends from $q_1$ ($q_3$) until the smallest (largest) observation within 1.5 interquartile ranges from $q_1$ ($q_3$).
Observations further from the box than the whisker ends (i.e. smaller than $q_1 - 1.5 \times \mathrm{IQR}$ or larger than $q_3 + 1.5 \times \mathrm{IQR}$) are called outliers, and are plotted individually.
See http://sites.stat.psu.edu/mga/401/fig/BoxplotOzoneR.pdf
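The whisker and outlier rules above can be traced numerically. The sketch below applies them to the black bear weights from the earlier example, taking $q_1$ and $q_3$ from quantile(). The object names are illustrative, and R's own boxplot() computes the box edges in a slightly different way (through hinges), so its box may not coincide exactly with these quartiles.

```r
w <- c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142)

# Quartiles and interquartile range
q1 <- unname(quantile(w, 0.25))
q3 <- unname(quantile(w, 0.75))
iqr <- q3 - q1

# Limits beyond which observations are flagged as outliers
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Outliers, and the whisker ends (extreme observations inside the limits)
outliers <- w[w < lower_limit | w > upper_limit]
whiskers <- range(w[w >= lower_limit & w <= upper_limit])

print(c(q1 = q1, q3 = q3, IQR = iqr))
print(outliers)
print(whiskers)
```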
Example
Scientists have been monitoring the ozone hole since 1980. See the images shown in http://ozonewatch.gsfc.nasa.gov/. The 14 ozone measurements (Dobson units) given in http://stat.psu.edu/mga/401/Data/OzoneData.txt were taken in 2002 from the lower stratosphere, between 9 and 12 miles altitude. Give the five number summary of this data and construct the box plot.
Solution: Read the data in the R object oz using
oz=read.table("http://stat.psu.edu/mga/401/Data/OzoneData.txt", header=T)
Then, use the command summary(oz) (or summary(oz$OzoneData)) to get the five number summary of this data. For the boxplot use boxplot(oz, col="grey"), or boxplot(oz$OzoneData, col="grey").
Hand Calculation of Sample Median
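Definition
Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ denote the ordered sample values in a sample of size $n$. The sample median is defined as
$$\tilde{X} = \begin{cases} X_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is odd} \\[1ex] \dfrac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2}, & \text{if } n \text{ is even} \end{cases}$$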
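The definition translates directly into R. In the sketch below, hand_median is a hypothetical helper written only to mirror the two cases of the definition; it is compared with the built-in median() on the black bear weights used earlier.

```r
# Hypothetical helper mirroring the definition of the sample median
hand_median <- function(x) {
  xs <- sort(x)                      # ordered sample values X_(1), ..., X_(n)
  n <- length(xs)
  if (n %% 2 == 1) {
    xs[(n + 1) / 2]                  # n odd: the middle order statistic
  } else {
    (xs[n / 2] + xs[n / 2 + 1]) / 2  # n even: average of the two middle ones
  }
}

w <- c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142)   # n = 10 (even)
m_hand <- hand_median(w)
m_builtin <- median(w)

print(c(hand = m_hand, builtin = m_builtin))
```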
Example (Relation Between $\tilde{X}$ and $\bar{X}$)
Find the sample median of $X_1 = 2.3$, $X_2 = 3.2$, $X_3 = 1.8$, $X_4 = 2.5$, $X_5 = 2.7$.
Solution. We first order the values from smallest to largest: $X_{(1)} = 1.8$, $X_{(2)} = 2.3$, $X_{(3)} = 2.5$, $X_{(4)} = 2.7$, $X_{(5)} = 3.2$.
Since the sample size is odd, $\tilde{X} = X_{\left(\frac{n+1}{2}\right)} = X_{(3)} = 2.5$.
For this data, $\bar{X} = \tilde{X} = 2.5$.
If $X_{(5)}$ is changed to 4.2, then $\bar{X} = 2.7$ but $\tilde{X} = 2.5$. Thus $\bar{X}$ is affected by outliers, whereas $\tilde{X}$ is not.
In general, if the histogram of the data is positively skewed $\bar{X} > \tilde{X}$, and if it is negatively skewed $\bar{X} < \tilde{X}$.
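The effect of the changed observation can be checked in R. This is a minimal sketch with illustrative object names; it replaces the largest value by 4.2 and recomputes both summaries.

```r
x <- c(2.3, 3.2, 1.8, 2.5, 2.7)

# Replace the largest observation (3.2) by 4.2
x_out <- replace(x, which.max(x), 4.2)

before <- c(mean = mean(x), median = median(x))
after <- c(mean = mean(x_out), median = median(x_out))

print(rbind(before, after))
```

Only the mean moves; the median stays at the middle order statistic.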
Hand Calculation of Sample Quartiles and Sample IQR
Definition
The sample lower quartile or SLQ is defined as
the median of the smallest n/2 values, if n is even
the median of the smallest (n + 1)/2 values, if n is odd
The sample upper quartile or SUQ is defined as
the median of the largest n/2 values, if n is even
the median of the largest (n + 1)/2 values, if n is odd
The sample interquartile range, or SIQR, is defined as
SIQR = SUQ - SLQ
Example
Find the lower and upper quartiles of the n = 9 observations
9.39, 7.04, 7.17, 13.28, 9.00, 7.46, 21.06, 15.19, 7.50.
Solution. Since n is odd, the SLQ is the median of the
Smallest 5(= (n + 1)/2) values: 7.04, 7.17, 7.46, 7.50, 9.00
and the SUQ is the median of the
Largest 5(= (n + 1)/2) values: 9.00, 9.39, 13.28, 15.19, 21.06.
Thus SLQ = 7.46, and SUQ = 13.28.
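The hand calculation can be mirrored in R. In the sketch below, hand_quartiles is a hypothetical helper that follows the definition of SLQ, SUQ and SIQR given above; quantile() uses its own interpolation rule and need not agree with it on every data set.

```r
# Hypothetical helper following the hand-calculation definition
hand_quartiles <- function(x) {
  xs <- sort(x)
  n <- length(xs)
  # Number of values in each half: n/2 if n is even, (n + 1)/2 if n is odd
  k <- if (n %% 2 == 0) n / 2 else (n + 1) / 2
  slq <- median(head(xs, k))   # median of the smallest k values
  suq <- median(tail(xs, k))   # median of the largest k values
  c(SLQ = slq, SUQ = suq, SIQR = suq - slq)
}

obs <- c(9.39, 7.04, 7.17, 13.28, 9.00, 7.46, 21.06, 15.19, 7.50)
res <- hand_quartiles(obs)
print(res)
```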
Go to next lesson http://stat.psu.edu/mga/401/course.info/lesson2.pdf
Go to the Stat 401 home page http://stat.psu.edu/mga/401/course.info/