Statistical Analysis of Environmental Quality Data

Malcolm Sutherland
(student matriculation no. 0204783)
A coursework report submitted for the requirements of the module in Environmental Management (WW533) at the University of Abertay Dundee, December 2002 REVISED MAY 2013
Copyright of LabSearch, a working title of Dr Malcolm Sutherland 2013
CONTENTS
PART A: a statistical analysis of water quality data from two sewage treatment works
Introduction
The statistical methods used
Outliers
An overview of the data
Basic statistics
Comparing the BOD and TSS levels
The performance of the 2 STWs
Conclusion
PART B: an appraisal of the statistical methods used in a journal paper
References
Appendices
Appendix 1: BOD and TSS data for the 2 STWs
Appendix 2: histograms for the BOD and TSS data
Appendix 3: graphs (BOD against TSS)
Appendix 4: box-plots for BOD and TSS data
Appendix 5: t-test and F values
Appendix 6: histograms of data collected in summer and winter months
Appendix 7: the 2-sample t-test on Minitab
PART A A statistical analysis of water quality data from two sewage treatment works
INTRODUCTION
Monthly water sampling data were collected at 2 STWs (sewage treatment works) between March 1998 and August 2000. Two parameters are discussed in this report: BOD (Biochemical Oxygen Demand) and TSS (Total Suspended Solids). Under EU regulations, BOD and TSS levels in the effluent must not exceed 25mg/L and 30mg/L, respectively (Harrison, 2001). The data from the STWs are given in Appendix 1.

Not all of the figures given for each sample are actual numbers. In addition to occasional gaps in the data columns, some BOD and TSS levels are noted as "<", or "less than" a value. These cannot be used in a statistical analysis, as the actual value cannot be determined, and any substituted value would affect the patterns being analysed. These entries have therefore been omitted from the analysis, and incomplete rows are also ignored when comparing data columns.

Other features also affect the statistical methods being used. The data for each STW were collected on different dates, and for STW B, more than one sample was taken on some dates. As a result, the BOD or TSS levels in effluent from one plant should not be correlated, or plotted on a graph, against those from the other. Furthermore, the data columns from the two plants are of different lengths, so this is not possible in any case.

The purpose of this section is to decide: (1) which of the 2 STWs is performing better than the other (in meeting legal requirements); (2) whether the BOD and TSS levels in the effluent are improving or worsening; (3) whether any seasonal changes are occurring; and (4) whether any relationship exists between the BOD and TSS levels.
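The omission rule described above can be sketched in Python as follows. This is a minimal illustration rather than the procedure actually carried out in Minitab: the function names `parse_level` and `complete_rows` are hypothetical, and the example rows are made-up values, not readings from Appendix 1.

```python
from typing import Optional


def parse_level(raw: str) -> Optional[float]:
    """Return the reading as a float, or None if it is blank or censored ("<")."""
    text = raw.strip()
    if not text or text.startswith("<"):
        return None
    return float(text)


def complete_rows(rows):
    """Keep only the rows in which both BOD and TSS are usable numbers."""
    kept = []
    for bod_raw, tss_raw in rows:
        bod, tss = parse_level(bod_raw), parse_level(tss_raw)
        if bod is not None and tss is not None:
            kept.append((bod, tss))
    return kept


# Hypothetical (BOD, TSS) rows in mg/L, for illustration only.
example_rows = [("12", "18"), ("<5", "9"), ("", "14"), ("21.5", "30")]
print(complete_rows(example_rows))
```

Dropping censored entries is the simplest option and matches the approach taken in this report; it does mean that the lowest readings are under-represented in the remaining data.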
THE STATISTICAL METHODS USED (Mendenhall et al, 1999)

The methods used (with Minitab V13) are described as follows:

Basic statistics (mean, median, standard deviation and standard error)
The mean ($\bar{x}$) is the average value in a set of data; the median is the central value in a set of numbers laid out in increasing order. The standard deviation ($s$) is the square root of the variance of the data, which for a set of N values is:

$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$$
The standard deviation reflects how widely the data are spread above and below the mean value. The greater the standard deviation, the less precisely the mean represents the data. The standard error (SE) reflects the reliability of the mean, which increases with the size of the data set, and decreases with increasing spread of the data. Thus it is directly proportional to the standard deviation, and inversely proportional to the square root of N (Skoog et al, 1996):

$$SE = \frac{s}{\sqrt{N}}$$
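The basic statistics described above can be sketched with the Python standard library. This is an illustration only: the function name `basic_statistics` and the readings are hypothetical, and the sample form of the standard deviation (dividing by N - 1) is assumed.

```python
import math
import statistics


def basic_statistics(values):
    """Return N, mean, median, sample standard deviation and standard error."""
    n = len(values)
    mean = statistics.mean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)      # sample form, divides by (N - 1)
    std_error = stdev / math.sqrt(n)      # SE = s / sqrt(N)
    return {"N": n, "mean": mean, "median": median, "stdev": stdev, "se": std_error}


# Hypothetical BOD readings in mg/L, for illustration only.
example = [8.0, 10.0, 12.0, 14.0, 16.0]
print(basic_statistics(example))
```

Doubling the number of readings while keeping the same spread would reduce the standard error by a factor of about 1.4, which is why larger data sets give a more reliable mean.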
The 2-sample t-test
Minitab V13 performs an independent, 2-tailed, two-sample t-test, and calculates a confidence interval for the difference between two population means, even when the sets of data are of different lengths. The test compares the difference between the two sample means with the variance of the data sets, in order to determine whether or not they are of the same sampling population. The Minitab program truncates the df (degrees of freedom), and the t value is looked up in the t table (Appendix 5) at (df - 1). If the magnitude of the calculated t value is greater than that in the tables, it is unlikely that the sets of data originate from the same population, i.e., the mean (μ) values will be different from one another. Appendix 8 explains this method in more detail.

Linear regression (Mann, 1995)
Correlation and regression both measure the relationship between 2 variables graphically, although regression is the more thorough analysis. Linear regression is based on the assumption that:

y = Bx + c + ε (ε is a symbol for random error).

(e) is an estimator for the random error, which is (Value for y) - (Predicted y) (see Figure 1 over-page). The Minitab output for a linear regression analysis produces 2 rows of data, beside the "Regression" (the predicted line on the graph) and the "Residual Error" (the unexplained variations seen). The results include the SS (Sum of Squares), the MS (Mean Square), the R-Sq, the F, t, and the p values, all of which provide an insight into the relationship.

The aim of linear regression is to produce a line for which the squared values of (e) from the data add up to the smallest possible total. The graphs in Appendix 3 show the predicted line. The (e) value indicates the residual error in the prediction (or how much the plotted data deviate above and below the line on the graph).
Figure 1: the residual error in linear regression
The analysis includes the R-Sq percentage value, which shows the certainty of a relationship between the y and x variables (in this report, BOD against TSS):

R-Sq (r²) = (Regression SS / Total SS)

This is therefore a measure of how important the relationship between y and x is in producing the line on the graph. The table of F values (Appendix 5) contains values listed along two axes: the (df) for the numerator, and the (df) for the denominator. The former is (k - 1), which is 1; the latter is (N - 2), which for STW A would be (28 - 2), or 26. The p value ranges from 0.000 to 1.000 (from a significant relationship, to no relationship) (Mann, 1995; Mendenhall et al, 1999).

Box plot and histogram charts
The box-plot and histogram graphically display the range of the data, and how frequently the entries occur at a particular value. The histogram is a bar chart, which groups the data into ranges of numbers (e.g., the number of entries with the values 0-9, 10-19, 20-29, etc.). The box plots in Appendix 4 are drawn against a scale, alongside which the data is split into sections, in order to detect skewness and possible outliers:
Figure 2: the lines at each end of the bar show the spread of data which is likely to occur. The asterisk (*) is a suspected outlier, and the circle is a definite outlier
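The way a box plot separates suspected from definite outliers can be sketched as below. This rests on assumptions that the report does not state: quartiles are taken with the (N + 1) position rule, suspected outliers lie more than 1.5 times the interquartile range beyond the quartiles, and definite outliers lie more than 3 times that range beyond them. The function name and readings are hypothetical.

```python
import statistics


def classify_outliers(values):
    """Split values into suspected and definite outliers using quartile fences."""
    # statistics.quantiles defaults to the "exclusive" (N + 1) position rule.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # assumed inner fences
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # assumed outer fences
    suspected, definite = [], []
    for x in values:
        if x < outer_low or x > outer_high:
            definite.append(x)
        elif x < inner_low or x > inner_high:
            suspected.append(x)
    return suspected, definite


# Hypothetical TSS readings in mg/L, for illustration only.
example = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 35, 60]
print(classify_outliers(example))
```

Because the fences are built from the quartiles rather than the mean, a single extreme reading has little effect on where they fall.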
AN OVERVIEW OF THE DATA

The first 2 graphs over-page show the BOD and TSS levels against time, for each of the sewage treatment works.

Graph 1
The BOD levels for effluent from STW A exceed legal requirements on 3 occasions (9th April 1998, 9th August 1999 and 9th July 2000), with a near infringement on 9th July 1999. During 1998, the levels otherwise remain well below 20mg/L. During 1999 the levels fluctuate (particularly around summer), then generally fall during the winter months, and fluctuate during the spring months in 2000. It appears that the BOD levels fluctuate in unison with the TSS levels, which may indicate that these variables are not independent of one another. The TSS levels exceed legal requirements on 4 occasions, with one extreme value being recorded on 9 April 1998 (123mg/L). None of the other infringements occur on the same date as the unacceptably high BOD levels, although all 3 are concentrated between April and October 1999.

Graph 2
The BOD levels for STW B fluctuate throughout the period, with five infringements being recorded. Three unacceptably high values for the TSS are recorded: two in 1998 (August and November), beside values which are within legal limits, and a third in July 2000. The TSS levels follow a similar trend to the BOD levels.

Dates where more than one sample was taken
On some dates, more than one water sample was taken. The averaged numbers will not be used during the statistical analysis which follows, as all the individual numbers are relevant. However, it is worth looking at the data being averaged in Table 1:
Table 1: average values for sample values, on dates where more than one sample was taken
The percentage values show the difference between the average value and the greatest and lowest values in each data set. These tend to be large, which may suggest that errors occurred during the sampling procedure (particularly with the 123mg/L value), or that the TSS and BOD levels fluctuate regularly. The averaged values need to be treated with caution, as do the trends in Graphs 3 and 4, which show the BOD and TSS levels for each STW with the averaged values (over-page).
Graphs 3 and 4 show that there is a moderately consistent relationship between BOD and TSS levels, with occasional exceptions such as the outlying TSS value for STW A (14th April 1998).

Outliers
It is important to identify these before conducting a thorough analysis of the data:
Figure 3: a printout from Minitab V13, showing the general statistics for the BOD and TSS in effluents from STWs A and B (BOD/TSS and BOD2/TSS2 respectively). These do not include averaged values. All data with numbers expressed as (<) "less than", and with gaps (e.g. where a TSS value but not a BOD value for a sample was given) has been omitted.
The z-test can be used to decide if an outlying number should be rejected from the data set used in an analysis. For the z-test, the sample z-score can be used (Mendenhall et al, 1999):

$$z = \frac{x - \bar{x}}{s}$$
The z value for x (which is any number in the data set), is the number of standard deviations between x, and the mean. An x value which is more than (3s) away from the mean, is considered to be an outlier (Mendenhall et al, 1999). The results are provided in Table 2:
Table 2: the minimum and maximum x values which are likely to occur in each data set
In the data for STW A, the values produced for the BOD and TSS on 14th April 1998 can therefore be rejected. For STW B, the BOD value for the 24th August 1998 can also be excluded. In addition, the BOD value recorded on 25th March 1998 and the TSS value recorded on 9th August 2000 are also rejected. Data from the rows in Appendix 1 containing these outliers are excluded from the data being analysed further.
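The rejection rule used above (a value more than 3 standard deviations from the mean) can be sketched as follows. The function names and the readings are hypothetical; the sample standard deviation is assumed, and the threshold of 3 follows the rule quoted from Mendenhall et al (1999).

```python
import statistics


def z_scores(values):
    """Return the sample z-score of every value: (x - mean) / s."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


def split_outliers(values, limit=3.0):
    """Separate values whose |z| exceeds the limit from those that are kept."""
    kept, rejected = [], []
    for x, z in zip(values, z_scores(values)):
        if abs(z) > limit:
            rejected.append(x)
        else:
            kept.append(x)
    return kept, rejected


# Hypothetical TSS readings in mg/L with one extreme value, for illustration only.
example = [10, 12, 11, 9, 13, 10, 12, 11, 9, 13,
           10, 12, 11, 9, 13, 10, 12, 11, 9, 123]
kept, rejected = split_outliers(example)
print(rejected)
```

One limitation worth noting is that the extreme value itself inflates both the mean and the standard deviation, so in a small data set no value can reach a z-score of 3 however far out it lies.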
BASIC STATISTICS
The descriptive statistics are shown below (with the outliers having been omitted). "NewBOD" and "NewTSS" are the values from STW A, and "nBOD2" and "nTSS2" are the values from STW B:
Figure 4: descriptive statistics (NewBOD, NewTSS, nBOD2, nTSS2)
The skewness of each edited data set is listed below:

                STW A    STW B
BOD data set    1.413    1.398
TSS data set    1.504    0.700
Histograms, skewness and basic statistics are graphically displayed in Appendix 2. Both the mean and the median for all 4 sets of data are well within legal limits. For both the BOD and TSS levels, the mean and median values for STW B are slightly higher than those for STW A. The standard deviation values are generally the same for all variables (between 5 and 7), except for the TSS levels in the STW A effluent, which produces a slightly higher value (nearly 10). This indicates that the TSS levels in the STW B effluent are more variable, but otherwise, BOD and TSS levels from the STW plants are very stable. The histograms in Appendix 2 are all positively skewed, and so it appears that the occurrence of TSS or BOD levels above legal limits is unlikely, as there are only some isolated values towards the right hand side of the histograms. There may be a small possibility of the TSS levels in the STW B effluent approaching legal limits.
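A skewness value of the kind listed above can be calculated as in the sketch below. The report does not state which formula Minitab applied, so the adjusted Fisher-Pearson coefficient is assumed here; the function name and readings are hypothetical.

```python
import statistics


def skewness(values):
    """Adjusted Fisher-Pearson skewness (assumed formula, see the note above)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


# Hypothetical readings in mg/L, for illustration only.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 10.0]
print(skewness(symmetric), skewness(right_skewed))
```

A positive result corresponds to a histogram with a long tail towards the higher values, which is the pattern described for all four data sets.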
COMPARING THE BOD AND THE TSS LEVELS

Linear regression can be used to test for any relationship between the two parameters. The fitted-line graphs comparing the BOD levels against those of TSS are presented in Appendix 3. As described, the R-squared, F and p values all indicate the certainty of a relationship between the two parameters. For the STW A values, N (number of entries) = 28, and for STW B, N = 40. F and p values are given in Table 3:

Table 3
There is a small certainty of a relationship between the BOD and the TSS levels in the effluent produced by STW A. The F value for the graph is slightly larger than its equivalent in the statistical tables in Appendix 5, and the p value shows a small probability of other errors affecting the predicted line. In comparison, there is a much stronger relationship between the TSS and BOD levels in the STW B effluent, with negligible probability of the predicted line being affected by errors, and a substantially larger F value. It therefore appears that the BOD and TSS levels may fluctuate in unison, and it could be that the suspended solids are an important source of the BOD.
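The fitted line, residual error, R-Sq and F value discussed in this section can be reproduced by hand for a simple least-squares fit. The sketch below is illustrative only: the function name and the TSS and BOD readings are hypothetical rather than data from Appendix 1, and the t and p values that Minitab also reports are not computed here.

```python
def linear_regression(x, y):
    """Least-squares fit of y = Bx + c, with the sums of squares, R-Sq and F."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                          # B
    intercept = mean_y - slope * mean_x        # c
    predicted = [slope * xi + intercept for xi in x]
    # Residual error (e) = (value for y) - (predicted y), squared and summed.
    residual_ss = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    regression_ss = total_ss - residual_ss
    r_sq = regression_ss / total_ss
    # Numerator df = 1, denominator df = N - 2.
    f_value = (regression_ss / 1) / (residual_ss / (n - 2))
    return {"B": slope, "c": intercept, "r_sq": r_sq, "F": f_value}


# Hypothetical TSS (x) and BOD (y) readings in mg/L, for illustration only.
tss = [1.0, 2.0, 3.0, 4.0, 5.0]
bod = [2.1, 3.9, 6.2, 7.8, 10.1]
print(linear_regression(tss, bod))
```

The calculated F value would then be compared with the tabulated value for 1 and (N - 2) degrees of freedom, as was done with Appendix 5.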
THE PERFORMANCE OF THE TWO SEWAGE TREATMENT WORKS

Which STW plant performs better?
Appendix 4 contains the box plots produced alongside the 2-sample t-test, for comparing the BOD and TSS data sets for each STW plant. The degrees of freedom will vary according to the variance, as discussed in Appendix 8. Appendix 5 shows the table of t-values used here.

The t-values for (df = 52 - 1) are (approximately) 2.021 (at the 95% confidence level), and 2.704 (at the 99% confidence level). It may be that the two populations are significantly different, as the confidence interval limits are both below zero, which indicates that the μ value for BOD levels in the STW A effluent is lower than the value for the BOD in the STW B effluent. The box plots (Appendix 4) for the two BOD levels are both narrow and set apart, and so it can be said that STW A performs slightly better than STW B in producing a low BOD effluent.

The t-values for (44 - 1) df are 2.021 (at the 95% confidence level), and 2.704 (at the 99% confidence level). The t value in Figure 6 is -1.88, and so the two populations may be, but are not significantly, different from one another. The box plots for the TSS levels are more spread out (Appendix 4). The confidence interval includes the value of zero, and the p value (0.063) is greater than that for the BOD levels (0.031), and so it is less certain that the null hypothesis (H0: μ STW A = μ STW B) should be rejected, or that the TSS levels in the plants are significantly different from each other. It may therefore be said that STW B performs slightly better than STW A in treating Total Suspended Solids in wastewater, although this is not significantly proven.
Figure 5
Figure 6
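The two-sample comparison reported in Figures 5 and 6 can be sketched by hand as follows. The unpooled (Welch) form of the test is assumed, which is consistent with the truncated degrees of freedom described earlier; the function name and the readings are hypothetical, and the p value is not computed because that would need the t distribution itself.

```python
import math
import statistics


def welch_t_test(sample_a, sample_b):
    """Return the t value and truncated df for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard error of each mean: sample variance / N.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_value = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation, truncated to a whole number.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_value, math.floor(df)


# Hypothetical BOD readings in mg/L for two plants, for illustration only.
plant_a = [10, 12, 14, 16, 18]
plant_b = [15, 17, 19, 21, 23, 25]
print(welch_t_test(plant_a, plant_b))
```

A negative t value here simply means that the first sample has the lower mean; it is the magnitude that is compared with the tabulated value.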
DO THE T.S.S. AND B.O.D. LEVELS VARY WITH THE TIME OF YEAR?
One pattern that was observed in Graphs 3 and 4 was the change in the levels and variation between the summer and winter months. The data from the four sets are split into 2 groups: "Cold" (October - March) and "Warm" (April - September). The values for each category for the four data sets are listed in Figure 7 over-page. The descriptive statistics are given, and the histograms are provided in Appendix 6. Only the descriptive statistics will be used here, as this is only a brief insight, which deals with small numbers of samples from each time of the year for each of the three years (1998 - 2000). The outliers (except the extreme outliers on 14/04/98, for STW A) have been included, as they may have occurred as a result of climatic conditions (Figure 8).
Figure 7: the TSS and BOD levels in STWs A and B during cold and warm months
Figure 8: descriptive statistics for the data in Figure 7
For all four data sets, the average (mean) and central (median) levels are lower during the colder months. The standard deviation is also (slightly) larger for data gathered during the warmer months, in comparison to that taken in colder months. The histograms (C1, C2) for the BOD levels from STW A show a normal distribution for the colder month samples, and a skewed histogram for samples in the warmer months, which suggests that the BOD is slightly influenced by the seasons. Histograms C3 and C4 show that the TSS levels from STW A are almost fixed at below 15mg/L in the colder months, whereas this changes during the warmer months. For STW B, the histograms for the BOD levels (C5, C6) show almost no change; the histograms for the TSS levels (C7, C8) are also similar, although the levels during warmer months may range up to 50mg/L, even though they are focused at around 15mg/L.

It may be that TSS and BOD levels in the effluent (particularly for STW A) rise during warmer temperatures, although without data for the temperature, this cannot be proven. The descriptive statistics in Figure 8 give the impression that the time of year may influence these parameters, although the histograms do not give conclusive evidence that this is the case.
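The seasonal grouping used in this section can be sketched as below. The month boundaries follow the "Cold" (October - March) and "Warm" (April - September) definitions given above; the function names, sampling dates and readings are hypothetical.

```python
import statistics
from datetime import date

WARM_MONTHS = {4, 5, 6, 7, 8, 9}   # April - September


def season_of(sample_date):
    """Label a sampling date as "Warm" or "Cold"."""
    return "Warm" if sample_date.month in WARM_MONTHS else "Cold"


def seasonal_summary(samples):
    """Group (date, level) pairs by season and describe each group."""
    groups = {"Cold": [], "Warm": []}
    for sample_date, level in samples:
        groups[season_of(sample_date)].append(level)
    summary = {}
    for season, levels in groups.items():
        if len(levels) >= 2:   # the standard deviation needs at least 2 values
            summary[season] = {
                "N": len(levels),
                "mean": statistics.mean(levels),
                "median": statistics.median(levels),
                "stdev": statistics.stdev(levels),
            }
    return summary


# Hypothetical (date, BOD in mg/L) samples, for illustration only.
example = [
    (date(1999, 1, 12), 8.0),
    (date(1999, 2, 9), 10.0),
    (date(1999, 7, 13), 14.0),
    (date(1999, 8, 10), 20.0),
    (date(1999, 11, 9), 9.0),
]
print(seasonal_summary(example))
```

With so few samples per season, differences between the two groups should be read as indicative only, as noted in the text.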
CONCLUSION
The BOD and TSS levels in the effluent from both works are generally below the legal limit, although a few infringements have occurred. Occasionally there are samples which contain very high results, and a few of these outliers were rejected. The remaining data showed that STW A performed slightly better than STW B in controlling the BOD levels. It is not certain which plant performs best in controlling the TSS levels. The performance of STW A is partly determined by the time of year, although this is not definite. The BOD and TSS levels share a weak relationship in the STW A effluent, but there is a very certain relationship between these pollutants in the STW B effluent. From the graphs, it can be said that the performance of the STW plants has not worsened or improved considerably between 1998 and 2000.
PART B Appraisal of the statistical methods used in a journal paper

SOURCE: Kunito, T; Saeki, K; Oyaizu, H; Matsumoto, S. Influences of Copper Forms on the Toxicity to Microorganisms in Soils. Journal of Ecotoxicology and Environmental Safety 44 (1999) pp174-181 (this can be purchased online; the author is not licensed to reproduce the paper in this report)
This article focuses on the relationships between soil microorganism populations and the types of copper in the soil, alongside relevant parameters such as IC50 (the conditions under which 50% of the microbial population dies), soil pH, and organic carbon. The authors chose to analyse the relationships between the microbial population and the soil copper "fractions" using correlation graphs, multiple regression, and Principal Component Analysis.
DATA ASSEMBLY
They collected 11 soil samples, and used triplicate samples from each one for extraction and analysis. All the values used in their analyses are based on the averaged data alone, but they do not mention the variation in the underlying data. Chemists usually carry between 2 and 5 replicates of a sample through a laboratory procedure, and 3 replicates is a common choice: increasing the number of replicates will increase the time spent in the laboratory, and the financial cost (Skoog et al, 1996). The total number of samples analysed is rather small, and may require more than one statistical method in order to confirm any trends in the data.

Methods such as the analysis of variance and t-tests are used to decide whether or not 2 or more sets of data are significantly different from one another. This article concerns the relationship between different kinds of data (e.g. soil pH and copper concentration), and thus the use of methods such as linear correlation and multiple regression is more appropriate. Several parameters are used. The chemical principles behind these will not be explained, but they are listed (and italicised in Times font):

Soil pH
Org-C (carbon in organic molecules)

(Types of copper in the soil)
Ex-Cu (soluble and exchangeable copper)
Pyro-Cu (organically bound copper)
HOAc-Cu (specifically absorbed copper)
T-Cu (total copper levels in the soil)

(Soil nutrients)
Total-N (nitrogen in soil)
Fe (iron)
Mn (manganese)

(Microbial measurements)
cfu (microbial populations in soil samples)
IC50
Cmic (carbon in microbial biomass)
The authors are using a wide range of information in their study, and so their analyses involve studying the multiple relationships amongst these parameters. Most of these are described in the paper, although the authors do not discuss how any of the 3 soil nutrients influence the relationship between the soil copper fractions, and microbial populations.
LINEAR CORRELATION
In order to study the relationship between the IC50 and each copper fraction (to investigate whether or not the concentration of each type of copper in the soil contributed to the death of the microbes), the authors used linear correlation, which involves comparing values of one parameter against another on a graph. This is a simpler method than regression, and is best used to decide if two parameters are directly, or inversely, proportional to each other. The authors are investigating the influence of copper on the microbes, and linear regression could have been used as well, to give an indication of the residual error (i.e. the variations on a graph which are not explained by the effects of copper on microbes) (Mann, 1995; Mendenhall, 1999).
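The strength and direction of a linear correlation can be summarised by the correlation coefficient r, as in the sketch below. The function name and the paired values are hypothetical and are not taken from Kunito et al (1999); they only illustrate a case where one parameter falls as the other rises.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical copper fraction (x) and IC50 (y) values, for illustration only.
copper = [1.0, 2.0, 3.0, 4.0]
ic50 = [8.0, 6.0, 4.0, 2.0]
print(pearson_r(copper, ic50))
```

A value near +1 indicates a direct relationship, a value near -1 an inverse one, and a value near 0 no linear relationship; unlike regression, r says nothing about the size of the residual error for individual points.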
MULTIPLE REGRESSION
Using one method of analysing data gives a limited and biased insight when comparing data sets, and the authors wisely use other methods. The multiple regression model considers the relationship between several components (Mendenhall et al, 1999). Not only does this model analyse the several possible relationships between the components (in this article, the copper fractions), but it also measures the residual error overall. The usefulness of the multiple regression model can be tested using the Analysis of Variance F test, which is mentioned on page 176 of the paper. The F value gives an indication of how much the several components affect one another.
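For reference, the Analysis of Variance F test for a multiple regression model is usually written in the textbook form below. The symbols are assumptions introduced here and are not taken from the paper: SSR is the regression sum of squares, SSE the residual error sum of squares, n the number of observations, and k the number of predictor variables only (so that simple linear regression has k = 1).

$$F = \frac{SSR / k}{SSE / (n - k - 1)} = \frac{MSR}{MSE}$$

With the n = 11 soil samples used by the authors, the denominator would have (10 - k) degrees of freedom, so each extra predictor leaves fewer degrees of freedom for estimating the residual error.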
THE PRINCIPAL COMPONENT ANALYSIS

The authors have analysed the relationship between the soil microorganisms and each data set alone (linear correlation), but also with all the copper fractions together (multiple regression). However, the patterns observed by these methods are not always conclusive. The linear correlation only considers one variable at a time. The multiple regression method produces a model which can be affected by poor relationships among the parameters being analysed. The PCA is a highly sophisticated tool, which analyses all the possible patterns among several parameters, and can identify significant relationships (Thermo-Galactic, 2002; Hyvarinen, 1999; Oieroset, 1999). This is a complicated computer-based technique, and so the authors demonstrate caution in making their interpretations.
IN CONCLUSION
Kunito et al (1999) chose appropriate methods of analysing their data, to investigate whether or not a few soil parameters had any influence on the microbial population. They also took note of some soil characteristics (by measuring the soil pH, and analysing its organic matter content). Although their statistical analysis was thorough, the number of soil samples and replicates used was small, and so this analysis may have been more appropriate for the same type of soil (in terms of pH and organic matter content, which affect microbial population). Altogether, their methods do take variations of each parameter into account, although the authors themselves recommend further research on their laboratory work at the end of the paper.
REFERENCES
Harrison, R.M. Pollution: Causes, Effects and Control (4th Ed.). Ch.5: Sewage and Sewage Sludge Treatment, p116. 2001 The Royal Society of Chemistry.
Mann, P.M. Statistics for Business and Economics. Ch.12: Analysis of Variance, pp628-635. Ch.13: Simple Linear Regression, pp662-667, 676-679. Copyright 1995, John Wiley & Sons, Inc.
Mendenhall, W., Beaver, R.J., Beaver, B.M. Introduction to Probability and Statistics. Ch.2: Describing Data with Numerical Measures, pp77-84. Ch.5: Several Useful Discrete Distributions, p183. Ch.12: Linear Regression and Correlation, p548. Ch.13: Multiple Regression Analysis, pp565-568. Copyright 1999 Brooks/Cole Publishing Company.
Skoog, D.A., West, D.M., Holler, F.J. Fundamentals of Analytical Chemistry. Ch.3: Random Errors in Analysis, pp27-33. 1996 Saunders College Publishing.

Websites (these were viewed in November 2002, and are no longer available)
Thermo Galactic, 2002. Algorithms; Principal Component Analysis Methods: Optimization. Copyright 2002 Thermo Galactic.
Hyvarinen, A. Principal Component Analysis (1999).
Oieroset, M. Principal Component Analysis (1999).
