
Dispersion Tests for Poisson and Binomial

Poisson Distribution
Fitting the Poisson Distribution to Data
The Poisson distribution is often used for modelling counts. A classic example is the number of deaths by horse kick in 14 corps of the Prussian Army from 1875 to 1894, which is available in the file:
HorseKick.txt

The probability function for the Poisson distribution with parameter $\lambda$ may be written,

$$p_k = e^{-\lambda} \frac{\lambda^k}{k!}$$

for k = 0, 1, ... . In general, the parameter $\lambda$ can be estimated by the sample mean of the data.
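As a minimal sketch of the fitting step, written in Python rather than S-Plus, the code below estimates $\lambda$ by the sample mean and evaluates the fitted probabilities $p_k$. The frequency table is an assumption: it is the commonly quoted tabulation of the horse-kick data (280 corps-years) and is not read from HorseKick.txt. The names `counts`, `poisson_pmf`, and `expected` are illustrative only.

```python
import math

# Assumed tabulation of the horse-kick data: deaths per corps-year -> number
# of corps-years with that many deaths. Not read from HorseKick.txt.
counts = {0: 144, 1: 91, 2: 32, 3: 11, 4: 2}

n = sum(counts.values())                          # number of observations
lam = sum(k * f for k, f in counts.items()) / n   # sample mean estimates lambda


def poisson_pmf(k, lam):
    """Poisson probability p_k = exp(-lam) * lam**k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Expected number of corps-years with k deaths under the fitted model.
expected = {k: n * poisson_pmf(k, lam) for k in counts}

for k in sorted(counts):
    print(f"k={k}: observed {counts[k]:3d}, expected {expected[k]:6.1f}")
```

Comparing the observed and expected columns gives an informal first impression of the fit before any formal test is applied.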

Testing the Model Adequacy


The Poisson dispersion test provides the most powerful omnibus method for testing the adequacy of the fitted Poisson distribution. The statistic for this test is based on the fact that, under the Poisson distribution, the mean and variance are equal. Furthermore, R.A. Fisher showed that if the data $X_1, ..., X_n$ are generated by a Poisson distribution with some parameter $\lambda$, then the statistic,

$$D = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\bar{X}}$$

where $\bar{X}$ is the sample mean, has approximately a $\chi^2$-distribution with $n - 1$ degrees of freedom. The p-value under a one-sided upper-tail test is given by $p = 1 - F(D)$, where $F(D)$ is the value of the cumulative distribution function of the $\chi^2$-distribution with $n - 1$ degrees of freedom. The cumulative distribution function of the $\chi^2$-distribution is available in S-Plus as the function pchisq. An upper-tail test is often used in practice because, with actual data, the variance is frequently significantly larger than the mean. This phenomenon is known as overdispersion and has been researched for over 150 years.
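It may help to see why the statistic measures agreement between the mean and the variance. Writing $S^2$ for the usual sample variance (a symbol introduced here for illustration, not used in the original), the statistic can be rearranged as:

```latex
D \;=\; \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\bar{X}}
  \;=\; (n - 1)\,\frac{S^2}{\bar{X}},
\qquad
S^2 \;=\; \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2 .
```

So $D/(n - 1)$ is the ratio of the sample variance to the sample mean. Under the Poisson model this ratio should be close to 1, and ratios well above 1 push $D$ into the upper tail of the reference distribution.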

Application to Horse Kick Data


A script file that generates the output, HorseKickDispersionTest.ssc, is available on our course homepage.

DispersionTests.nb

> x <- scan("n:/259b/data/horsekick.txt")
> mu <- mean(x)
> mu
[1] 0.7
> D <- sum((x - mu)^2)/mu
> df <- length(x) - 1
> pval <- 1 - pchisq(D, df)
> ans <- c(D, df, pval)
> names(ans) <- c("D", "df", "p-value")
> ans
        D  df   p-value
      304 279 0.1454152

We conclude that there is no evidence against the null hypothesis that the true distribution is Poisson with a mean of about 0.7 deaths per year per Corps.
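The same calculation can be sketched in Python using only the standard library. Two things here are assumptions rather than part of the original: the frequency table is the commonly quoted tabulation of the horse-kick data (not read from horsekick.txt), and the hypothetical helper `chi2_upper_tail_wh` replaces `1 - pchisq(D, df)` with the Wilson-Hilferty cube-root normal approximation, because the Python standard library has no chi-square distribution function.

```python
import math

# Assumed tabulation of the horse-kick data (deaths per corps-year -> frequency).
# It is not read from horsekick.txt.
counts = {0: 144, 1: 91, 2: 32, 3: 11, 4: 2}


def chi2_upper_tail_wh(stat, df):
    """Approximate P(chi-square with df degrees of freedom > stat) using the
    Wilson-Hilferty cube-root normal approximation (not an exact CDF)."""
    c = 2.0 / (9.0 * df)
    z = ((stat / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of N(0, 1)


n = sum(counts.values())
mu = sum(k * f for k, f in counts.items()) / n

# Sum of squared deviations, weighting each distinct value by its frequency.
ss = sum(f * (k - mu) ** 2 for k, f in counts.items())

D = ss / mu            # Poisson dispersion statistic
df = n - 1
pval = chi2_upper_tail_wh(D, df)

print(f"D = {D:.2f}, df = {df}, approximate p-value = {pval:.4f}")
```

With these counts the sketch gives D = 304 on 279 degrees of freedom and an approximate p-value close to the 0.145 reported in the S-Plus output; any small difference comes from the approximation.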

Binomial Dispersion Test


Let $x_1, ..., x_n$ be a random sample. To test whether the data are from a binomial distribution with parameters $N$ and $p$, Fisher recommended the binomial dispersion statistic,

$$D = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{N \hat{p} (1 - \hat{p})}$$

where $\bar{x}$ denotes the sample mean and $\hat{p} = \bar{x}/N$. If the data were generated by a binomial distribution, then $D$ has approximately a $\chi^2$-distribution with $n - 1$ degrees of freedom. Large values of $D$ suggest that the data have a dispersion or variance greater than that given by the binomial distribution. In practice, an upper one-tailed test is normally done.

Weldon Dice Data


W.F.R. Weldon (1860-1906) reportedly threw 12 dice a total of 26,306 times. The observed frequencies for the number of dice showing a 5 or 6 are shown in the following table:


Number of Dice With 5 or 6    Observed Frequency
             0                        185
             1                       1149
             2                       3265
             3                       5475
             4                       6114
             5                       5194
             6                       3067
             7                       1331
             8                        403
             9                        105
            10                         14
            11                          4
            12                          0

Source: R.A. Fisher (1925), Statistical Methods for Research Workers, p.64.

For convenience, the second column of the above table is available in the file: n:/259b/data/Weldon.txt

Input these observed frequencies. Compute $\hat{p}$ and compute the Fisher dispersion test for testing whether these data were generated by a binomial distribution. Test whether the dice appear to be fair, that is, whether the probability of a 5 or 6 is equal to p = 1/3. A script file that generates the output, WeldonBinomialDispersionTest.ssc, is available on our course homepage.
> weldon <- scan("n:/259B/data/weldon.txt")
> x <- rep(0:12, weldon)
> p.hat <- mean(x)/12
> p.hat
[1] 0.3376986
> FisherD <- sum((x - mean(x))^2)/(p.hat * (1 - p.hat) * 12)
> df <- length(x) - 1
> pvalue <- 1 - pchisq(FisherD, df)
> ans <- c(FisherD, df, pvalue)
> names(ans) <- c("Fisher D", "df", "p-value")
> ans
 Fisher D    df   p-value
 26445.78 26305 0.2690794
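The transcript above covers the dispersion test but does not show the fairness check. The Python sketch below recomputes the dispersion statistic directly from the frequency table and then adds one possible fairness check. Both the method and the helper names are assumptions, not taken from the source: the fairness check is a normal-approximation z test of $\hat{p}$ against 1/3, treating all 12 x 26,306 dice as independent trials, and the chi-square tail uses the Wilson-Hilferty approximation in place of pchisq.

```python
import math

# Observed frequencies from the table: index k = number of dice showing 5 or 6.
weldon = [185, 1149, 3265, 5475, 6114, 5194, 3067, 1331, 403, 105, 14, 4, 0]
N = 12  # dice per throw


def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def chi2_upper_tail_wh(stat, df):
    """Wilson-Hilferty approximation to P(chi-square_df > stat); not exact."""
    c = 2.0 / (9.0 * df)
    z = ((stat / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    return normal_upper_tail(z)


n = sum(weldon)                                        # number of throws
xbar = sum(k * f for k, f in enumerate(weldon)) / n    # sample mean
p_hat = xbar / N

# Binomial dispersion statistic, weighting each value by its frequency.
ss = sum(f * (k - xbar) ** 2 for k, f in enumerate(weldon))
fisher_d = ss / (N * p_hat * (1.0 - p_hat))
df = n - 1
pvalue = chi2_upper_tail_wh(fisher_d, df)

# Assumed fairness check: z test of p_hat against p0 = 1/3 over n * N trials.
p0 = 1.0 / 3.0
z_fair = (p_hat - p0) / math.sqrt(p0 * (1.0 - p0) / (n * N))
p_fair = 2.0 * normal_upper_tail(abs(z_fair))          # two-sided p-value

print(f"p_hat = {p_hat:.7f}, D = {fisher_d:.2f}, df = {df}, p ~ {pvalue:.4f}")
print(f"fairness z = {z_fair:.2f}, two-sided p ~ {p_fair:.2g}")
```

Under these assumptions the dispersion statistic agrees with the S-Plus value, while the fairness statistic comes out at roughly z = 5.2. So the binomial shape is not rejected, but the estimated probability of a 5 or 6 appears to differ from 1/3.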

Solution

Remarks:
1. Note how using a vector form of the rep function avoids using a for loop.
2. Notice that the table function gives you back the frequency tabulations. That is:

> table(x)
   0    1    2    3    4    5    6    7   8   9 10 11
 185 1149 3265 5475 6114 5194 3067 1331 403 105 14  4
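For readers who want to check the two remarks outside S-Plus, here is a rough Python analogue. The mapping is an assumption of this sketch, not a claim about equivalent library functions: a nested comprehension plays the role of `rep(0:12, weldon)` and `collections.Counter` plays the role of `table(x)`.

```python
from collections import Counter

# Observed frequencies: index k = number of dice showing 5 or 6.
weldon = [185, 1149, 3265, 5475, 6114, 5194, 3067, 1331, 403, 105, 14, 4, 0]

# Analogue of rep(0:12, weldon): repeat each value k exactly weldon[k] times.
x = [k for k, f in enumerate(weldon) for _ in range(f)]

# Analogue of table(x): count how often each distinct value occurs.
freq = Counter(x)

for k in sorted(freq):
    print(k, freq[k])
```

As in the S-Plus output, the value 12 does not appear in the tabulation because its observed frequency is zero.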
