Professional Documents
Culture Documents
Whenever we want to make claims about the distribution of data or whether one set of results are different
from another set of results in applied machine learning, we must rely on statistical hypothesis tests.
In this tutorial, you will discover statistical hypothesis testing and how to interpret and carefully state the
results from statistical tests.
Statistical hypothesis tests are important for quantifying answers to questions about samples of data.
The interpretation of a statistical hypothesis test requires a correct understanding of p-values and critical
values.
Regardless of the significance level, the finding of hypothesis tests may still contain errors.
Update May/2018: Added note about “reject” vs “failure to reject”, improved language on this issue.
https://machinelearningmastery.com/statistical-hypothesis-tests/ 1/11
6/20/2018 A Gentle Introduction to Statistical Hypothesis Tests
Tutorial Overview
This tutorial is divided into 3 parts; they are:
Click to sign-up and also get a free PDF Ebook version of the course.
In statistics, when we wish to start asking questions about the data and interpret the results, we use
statistical methods that provide a confidence or likelihood about the answers. In general, this class of
methods is called statistical hypothesis testing, or significance tests.
The term “hypothesis” may make you think about science, where we investigate a hypothesis. This is along
the right track.
In statistics, a hypothesis test calculates some quantity under a given assumption. The result of the test
allows us to interpret whether the assumption holds or whether the assumption has been violated.
Two concrete examples that we will use a lot in machine learning are:
The assumption of a statistical test is called the null hypothesis, or hypothesis 0 (H0 for short). It is often
called the default assumption, or the assumption that nothing has changed.
A violation of the test’s assumption is often called the first hypothesis, hypothesis 1 or H1 for short. H1 is
really a short hand for “some other hypothesis,” as all we know is that the evidence suggests that the H0 can
be rejected.
Hypothesis 0 (H0): Assumption of the test holds and is failed to be rejected at some level of
significance.
Hypothesis 1 (H1): Assumption of the test does not hold and is rejected at some level of significance.
Before we can reject or fail to reject the null hypothesis, we must interpret the result of the test.
https://machinelearningmastery.com/statistical-hypothesis-tests/ 2/11
6/20/2018 A Gentle Introduction to Statistical Hypothesis Tests
This is a point that may cause a lot of confusion for beginners and experienced practitioners alike.
There are two common forms that a result from a statistical hypothesis test may take, and they must be
interpreted in different ways. They are the p-value and critical values.
For example, we may perform a normality test on a data sample and find that it is unlikely that sample of data
deviates from a Gaussian distribution, failing to reject the null hypothesis.
A statistical hypothesis test may return a value called p or the p-value. This is a quantity that we can use to
interpret or quantify the result of the test and either reject or fail to reject the null hypothesis. This is done by
comparing the p-value to a threshold value chosen beforehand called the significance level.
The significance level is often referred to by the Greek lower case letter alpha.
A common value used for alpha is 5% or 0.05. A smaller alpha value suggests a more robust interpretation of
the null hypothesis, such as 1% or 0.1%.
The p-value is compared to the pre-chosen alpha value. A result is statistically significant when the p-value is
less than alpha. This signifies a change was detected: that the default hypothesis can be rejected.
If p-value > alpha: Fail to reject the null hypothesis (i.e. not signifiant result).
If p-value <= alpha: Reject the null hypothesis (i.e. significant result).
For example, if we were performing a test of whether a data sample was normal and we calculated a p-value
of .07, we could state something like:
The test found that the data sample was normal, failing to reject the null hypothesis at a 5%
significance level.
The significance level can be inverted by subtracting it from 1 to give a confidence level of the hypothesis
given the observed sample data.
The test found that the data was normal, failing to reject the null hypothesis at a 95% confidence
level.
https://machinelearningmastery.com/statistical-hypothesis-tests/ 3/11
6/20/2018 A Gentle Introduction to Statistical Hypothesis Tests
This means that when we interpret the result of a statistical test, we do not know what is true or false, only
what is likely.
Rejecting the null hypothesis means that there is sufficient statistical evidence that the null hypothesis does
not look likely. Otherwise, it means that there is not sufficient statistical evidence to reject the null hypothesis.
We may think about the statistical test in terms of the dichotomy of rejecting and accepting the null
hypothesis. The danger is that if we say that we “accept” the null hypothesis, the language suggests that the
null hypothesis is true. Instead, it is safer to say that we “fail to reject” the null hypothesis, as in, there is
insufficient statistical evidence to reject it.
When reading “reject” vs “fail to reject” for the first time, it is confusing to beginners. You can think of it as
“reject” vs “accept” in your mind, as long as you remind yourself that the result is probabilistic and that even
an “accepted” null hypothesis still has a small probability of being wrong.
It does mean that we have chosen to reject or fail to reject the null hypothesis at a specific statistical
significance level based on empirical evidence and the chosen statistical test.
You are limited to making probabilistic claims, not crisp binary or true/false claims about the result.
p-value as Probability
A common misunderstanding is that the p-value is a probability of the null hypothesis being true or false
given the data.
1 Pr(hypothesis | data)
This is incorrect.
Instead, the p-value can be thought of as the probability of the data given the pre-specified assumption
embedded in the statistical test.
1 Pr(data | hypothesis)
It allows us to reason about whether or not the data fits the hypothesis. Not the other way around.
The p-value is a measure of how likely the data sample would be observed if the null hypothesis were true.
Post-Hoc Tuning
It does not mean that you can re-sample your domain or tune your data sample and re-run the statistical test
until you achieve a desired result.
https://machinelearningmastery.com/statistical-hypothesis-tests/ 4/11
6/20/2018 A Gentle Introduction to Statistical Hypothesis Tests
It also does not mean that you can choose your p-value after you run the test.
This is called p-hacking or hill climbing and will mean that the result you present will be fragile and not
representative. In science, this is at best unethical, and at worst fraud.
Instead, they might return a list of critical values and their associated significance levels, as well as a
test statistic.
The choice of returning a p-value or a list of critical values is really an implementation choice.
The results are interpreted in a similar way. Instead of comparing a single p-value to a pre-specified
significance level, the test statistic is compared to the critical value at a chosen significance level.
If test statistic < critical value: Fail to reject the null hypothesis.
If test statistic >= critical value: Reject the null hypothesis.
Again, the meaning of the result is similar in that the chosen significance level is a probabilistic decision on
rejection or fail to reject the base assumption of the test given the data.
Results are presented in the same way as with a p-value, as either significance level or confidence level. For
example, if a normality test was calculated and the test statistic was compared to the critical value at the 5%
significance level, results could be stated as:
The test found that the data sample was normal, failing to reject the null hypothesis at a 5%
significance level.
Or:
The test found that the data was normal, failing to reject the null hypothesis at a 95% confidence
level.
That means that the evidence of the test may suggest an outcome and be mistaken.
For example, if alpha was 5%, it suggests that (at most) 1 time in 20 that the null hypothesis would be
mistakenly rejected or failed to be rejected because of the statistical noise in the data sample.
Given a high p-value, it may mean that the null hypothesis is true (we got it right) or that the null hypothesis is
false and some unlikely event occurred (we made a mistake). If this type of error is made, it is called a false
positive. We falsely believe the null hypothesis or assumption of the statistical test.
https://machinelearningmastery.com/statistical-hypothesis-tests/ 5/11
6/20/2018 A Gentle Introduction to Statistical Hypothesis Tests
Alternately, a low p-value either means that the null hypothesis false (we got it right) or it is true and some
rare and unlikely event has been observed (we made a mistake). If this type of error is made, it is called a
false negative. We falsely believe the rejection of the null hypothesis.
Type I Error: The incorrect rejection of a true null hypothesis or a false positive.
Type II Error: The incorrect failure of rejection of a false null hypothesis or a false negative.
All statistical hypothesis tests have a chance of making either of these types of errors. False findings or false
disoveries are more than possible; they are probable.
Ideally, we want to choose a significance level that minimizes the likelihood of one of these errors. E.g. a very
small significance level. Although significance levels such as 0.05 and 0.01 are common in many fields of
science, harder sciences, such as physics, are more aggressive.
It is common to use a significance level of (3 * 10)^-7 or 0.0000003, often referred to as 5-sigma. This means
that the finding was due to chance with a probability of 1 in 3.5 million independent repeats of the
experiments. To use a threshold like this may require a much large data sample.
Nevertheless, these types of errors are always present and must be kept in mind when presenting and
interpreting the results of statistical tests. It is also a reason why it is important to have findings independently
verified.
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
Find an example of a research paper that does not present results using p-values.
Find an example of a research paper that presents results with statistical significance, but makes one of
the common misinterpretations of p-values.
Find an example of a research paper that presents results with statistical significance and correctly
interprets and presents the p-value and findings.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Articles
Statistical hypothesis testing on Wikipedia
Statistical significance on Wikipedia
p-value on Wikipedia
Critical value on Wikipedia
Type I and type II errors on Wikipedia
Data dredging on Wikipedia
Misunderstandings of p-values on Wikipedia
What does the 5 sigma mean?
https://machinelearningmastery.com/statistical-hypothesis-tests/ 6/11
6/20/2018 A Gentle Introduction to Statistical Hypothesis Tests
Summary
In this tutorial, you discovered statistical hypothesis testing and how to interpret and carefully state the
results from statistical tests.
Statistical hypothesis tests are important for quantifying answers to questions about samples of data.
The interpretation of a statistical hypothesis test requires a correct understanding of p-values.
Regardless of the significance level, the finding of hypothesis tests may still contain errors.
A Gentle Introduction to Normality Tests in Python Introduction to Nonparametric Statistical Significance Tests in Python
REPLY
DC May 14, 2018 at 2:18 pm #
https://machinelearningmastery.com/statistical-hypothesis-tests/ 7/11
6/20/2018 A Gentle Introduction to Statistical Hypothesis Tests
REPLY
Jason Brownlee May 14, 2018 at
2:25 pm #
REPLY
Jason Brownlee May
18, 2018 at 2:42 pm #
Agreed. We do not
accept, but fail to reject.
https://machinelearningmastery.com/statistical-hypothesis-tests/ 8/11
6/20/2018 A Gentle Introduction to Statistical Hypothesis Tests
REPLY
Jason Brownlee May 15, 2018 at
7:51 am #
Thanks.
REPLY
Jason Brownlee May 15, 2018 at
8:03 am #
REPLY
Scot May 15, 2018 at 3:57 am #
https://machinelearningmastery.com/statistical-hypothesis-tests/ 9/11
6/20/2018 A Gentle Introduction to Statistical Hypothesis Tests
REPLY
Jason Brownlee May 15, 2018 at
8:05 am #
Thanks, fixed.
REPLY
MM May 18, 2018 at 12:58 pm #
REPLY
Jason Brownlee May 18, 2018 at
2:38 pm #
REPLY
Anthony The Koala May 18, 2018 at
1:25 pm #
Dear Dr Jason,
Have you encountered situations where if you
reject the null hypothesis, there is a chance of
accepting it. Conversely if you accept the null
hypothesis there is a chance that it is
rejected?
Thank you,
Anthony of Belfield
REPLY
Jason Brownlee May 18, 2018 at
2:40 pm #
https://machinelearningmastery.com/statistical-hypothesis-tests/ 10/11
6/20/2018 A Gentle Introduction to Statistical Hypothesis Tests
https://machinelearningmastery.com/statistical-hypothesis-tests/ 11/11