Malcolm Sutherland
(student matriculation no. 0204783)
A coursework report submitted for the requirements of the module in Environmental Management (WW533) at the University of Abertay Dundee, December 2002 REVISED MAY 2013
CONTENTS
PART A: a statistical analysis of water quality data from two sewage treatment works
Introduction
The statistical methods used
Outliers
An overview of the data
Basic statistics
Comparing the BOD and TSS levels
The performance of the 2 STWs
Conclusion
PART B: an appraisal of the statistical methods used in a journal paper
References
Appendices
Appendix 1: BOD and TSS data for the 2 STWs
Appendix 2: histograms for the BOD and TSS data
Appendix 3: graphs (BOD against TSS)
Appendix 4: box-plots for BOD and TSS data
Appendix 5: t-test and F values
Appendix 6: histograms of data collected in summer and winter months
Appendix 7: the 2-sample t-test on Minitab
PART A A statistical analysis of water quality data from two sewage treatment works
INTRODUCTION
Monthly water sampling data were collected at 2 STWs (sewage treatment works) between March 1998 and August 2000. Two parameters are discussed in this report: BOD (Biochemical Oxygen Demand) and TSS (Total Suspended Solids). Under EU regulations, the BOD and TSS levels in effluent must not exceed 25mg/L and 30mg/L, respectively (Harrison, 2001). The data from the STWs are given in Appendix 1.
Not all of the figures given for each sample are actual numbers. In addition to occasional gaps in the data columns, some BOD and TSS levels are noted as "<", or "less than" a value. These cannot be used in a statistical analysis, as there is no way of determining the actual value, and any substituted value would affect the patterns being analysed. These entries have therefore been omitted from the analysis, and incomplete rows will also be ignored when comparing data columns.
Other features of the data also affect the statistical methods being used. The data for each STW were collected on different dates, and for STW B, more than one sample was taken on some dates. As a result, the BOD or TSS levels in effluent from one plant should not be correlated with, or plotted on a graph against, those from the other. Furthermore, the data columns from the two plants are of different lengths, so this would not be possible in any case.
The purpose of this section is to decide: (1) which of the 2 STWs is performing better than the other (in meeting legal requirements); (2) whether the BOD and TSS levels in the effluent are improving or worsening; (3) whether any seasonal changes are occurring; and (4) whether any relationship exists between the BOD and TSS levels.
The standard deviation (s) reflects how widely the data are spread above and below the mean value. The greater the standard deviation, the less precise is the mean. The standard error (SE) reflects the reliability of the mean: the SE becomes smaller as the size of the data set increases, and larger as the spread of the data increases. The SE is thus directly proportional to the standard deviation, and inversely proportional to the square root of N (Skoog et al, 1996):
$$SE = \frac{s}{\sqrt{N}}$$
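As a rough illustration of how the standard deviation and the standard error relate, the following Python sketch computes both for a small data set. The readings and the function name `standard_error` are hypothetical and chosen only for illustration; they are not the Appendix 1 data, and the report itself used Minitab rather than Python.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(N)."""
    s = statistics.stdev(values)  # sample standard deviation (N - 1 denominator)
    return s / math.sqrt(len(values))


# Hypothetical BOD readings in mg/L, for illustration only (not the Appendix 1 data).
bod = [8.0, 12.0, 10.0, 15.0, 9.0, 11.0]

print(f"N    = {len(bod)}")
print(f"mean = {statistics.mean(bod):.2f} mg/L")
print(f"s    = {statistics.stdev(bod):.2f} mg/L")
print(f"SE   = {standard_error(bod):.2f} mg/L")
```

Because N appears under a square root, quadrupling the number of samples only halves the standard error if the spread of the data stays the same.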
The 2-sample t-test
Minitab V13 performs an independent, 2-tailed, two-sample t-test, which calculates a confidence interval for the difference between two population means, even when the sets of data are of different lengths. The test compares the difference between the sample means with the variability within the data sets, in order to determine whether or not they come from the same sampling population. The Minitab program truncates the df (degrees of freedom), and the t value is read from the t table (Appendix 5) at (df - 1). If the calculated t value is greater than that in the tables, it is unlikely that the sets of data originate from the same population, i.e., the mean ($\mu$) values will be different from one another. Appendix 8 explains this method in more detail.
Linear regression (Mann, 1995)
Linear correlation and linear regression both graphically measure the relationship between 2 variables, although regression is the more thorough analysis. Linear regression is based on the assumption that: $y = Bx + c + \varepsilon$ ($\varepsilon$ is the symbol for random error). (e) is an estimator for the random error, which is (Value for y) - (Predicted y) (see Figure 1 over-page). The Minitab output for a linear regression analysis produces 2 rows of data, beside the "Regression" (the predicted line on the graph) and the "Residual Error" (the unexplained variations seen). The results include the SS (Sum of Squares), the MS (Mean Square), the (R-Sq), and the F, t, and p values, all of which provide an insight into the relationship. The aim of linear regression is to produce the line for which the squared values of (e) from the data add up to the smallest possible total. The graphs in Appendix 3 show the predicted line. The (e) value indicates the residual error in the prediction (or how much the plotted data deviates above and below the line on the graph).
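To make the 2-sample t-test described earlier in this section more concrete, here is a minimal Python sketch. It assumes the unequal-variance (Welch) form of the test, with the degrees of freedom truncated to a whole number as the report says Minitab does. The function name `welch_t` and the sample values are hypothetical, and the sketch returns only t and df; it does not reproduce Minitab's confidence interval or p value.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, df) for a two-sample t-test without assuming equal variances.

    df is truncated to a whole number, following the report's description
    of the Minitab output.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (variance / N).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b

    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t = mean_diff / math.sqrt(se2_a + se2_b)

    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, math.floor(df)


# Hypothetical effluent readings in mg/L; the two samples may differ in length.
plant_a = [1.0, 2.0, 3.0, 4.0, 5.0]
plant_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

t_value, df = welch_t(plant_a, plant_b)
print(f"t = {t_value:.3f}, df = {df}")
```

The absolute value of t would then be compared with the tabulated value for the chosen significance level, as described in the text.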
The analysis includes the R-Sq percentage value, which shows how much of the variation in y is accounted for by its relationship with x (in this report, BOD against TSS):
R-Sq ($r^2$) = (Regression SS / Total SS)
This is therefore a measure of how important the relationship between y and x is in producing the line on the graph. The table of F values (Appendix 5) lists values along two axes: the (df) for the numerator, and the (df) for the denominator. The former is (k - 1), which is 1; the latter is (N - 2), which for STW A would be (28 - 2), or 26. The p value ranges from 0.000 to 1.000 (from a significant relationship, to no relationship) (Mann, 1995; Mendenhall, 1999).
Box plot and histogram charts
The box-plot and histogram graphically display the range of the data, and how frequently the entries occur at a particular value. The histogram is a bar chart, which groups the data into ranges of numbers (e.g., the number of entries with the values 0-9, 10-19, 20-29, etc.). The box plots in Appendix 4 are drawn alongside a scale, and split the data into sections, in order to detect skewness and possible outliers:
Figure 2: the lines at each end of the bar show the spread of data which is likely to occur. The asterisk (*) is a suspected outlier, and the circle is a definite outlier
The percentage values show the difference between the average value and the greatest and lowest values in each data set. These tend to be large, which may suggest that errors occurred during the sampling procedure (particularly with the 123mg/L value), or that the TSS and BOD levels fluctuate regularly. The averaged values need to be observed with caution, as do the trends in Graphs 3 and 4, showing the BOD and TSS levels for each STW with the averaged values (over-page).
Graphs 3 and 4 show that there is a moderately consistent relationship between BOD and TSS levels, apart from the outlier TSS value for STW A (14th April 1998) and occasional other exceptions.
OUTLIERS
It is important to identify outliers before conducting a thorough analysis of the data:
Figure 3: a printout from Minitab V13, showing the general statistics for the BOD and TSS in effluents from STWs A and B (BOD/TSS and BOD2/TSS2 respectively). These do not include averaged values. All data with numbers expressed as (<) "less than", and with gaps (e.g. where a TSS value but not a BOD value for a sample was given) has been omitted.
The z-test can be used to decide if an outlying number should be rejected from the data set used in an analysis. For the z-test, the sample z-score is used (Mendenhall et al, 1999):
$$z = \frac{x - \bar{x}}{s}$$
The z value for x (which is any number in the data set), is the number of standard deviations between x, and the mean. An x value which is more than (3s) away from the mean, is considered to be an outlier (Mendenhall et al, 1999). The results are provided in Table 2:
Table 2: the minimum and maximum x values which are likely to occur in each data set
In the data for STW A, the values produced for the BOD and TSS on 14th April 1998 can therefore be rejected. For STW B, the BOD value for the 24th August 1998 can also be excluded. In addition, the BOD value recorded on 25th March 1998 and the TSS value recorded on 9th August 2000 are also rejected. Data from the rows in Appendix 1 containing these outliers are excluded from the data being analysed further.
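The outlier rule described above (reject any value more than 3 standard deviations from the mean) can be sketched in Python as follows. The function names and the readings are hypothetical and serve only to illustrate the rule; they are not the values from Appendix 1 or Table 2.

```python
import statistics


def three_sigma_limits(values):
    """Minimum and maximum values likely to occur: mean - 3s and mean + 3s."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    return mean - 3 * s, mean + 3 * s


def split_outliers(values, limit=3.0):
    """Separate values into (kept, rejected) using the sample z-score."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    kept, rejected = [], []
    for x in values:
        z = (x - mean) / s  # number of standard deviations between x and the mean
        if abs(z) > limit:
            rejected.append(x)
        else:
            kept.append(x)
    return kept, rejected


# Hypothetical TSS readings in mg/L with one very high value.
tss = [8.0, 9.0, 10.0, 11.0, 12.0, 9.0, 10.0, 11.0, 10.0, 9.0,
       11.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 10.0, 10.0, 123.0]

low, high = three_sigma_limits(tss)
kept, rejected = split_outliers(tss)
print(f"likely range: {low:.1f} to {high:.1f} mg/L")
print(f"rejected: {rejected}")
```

One limitation of this rule is that the outlier itself inflates both the mean and the standard deviation, so in small data sets an extreme value may not reach 3 standard deviations.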
BASIC STATISTICS
The descriptive statistics are shown below (with the outliers having been omitted). "NewBOD" and "NewTSS" are the values from STW A, and "nBOD2" and "nTSS2" are the values for STW B:
The skewness of each edited data set is listed below:
STW A: BOD data set = 1.413; TSS data set = 1.504
STW B: BOD data set = 1.398; TSS data set = 0.700
Histograms, skewness and basic statistics are displayed graphically in Appendix 2. Both the mean and the median for all 4 sets of data are well within legal limits. For both the BOD and TSS levels, the mean and median values for STW B are slightly higher than those for STW A. The standard deviation values are generally the same for all variables (between 5 and 7), except for the TSS levels in the STW A effluent, which produce a slightly higher value (nearly 10). This indicates that the TSS levels in the STW B effluent are more variable, but otherwise, BOD and TSS levels from the STW plants are very stable. The histograms in Appendix 2 are all positively skewed, and so it appears that the occurrence of TSS or BOD levels above legal limits is unlikely, as there are only some isolated values towards the right hand side of the histograms. There may be a small possibility of the TSS levels in the STW B effluent approaching legal limits.
There is only a weak indication of a relationship between the BOD and the TSS levels in the effluent produced by STW A. The F value for the graph is slightly larger than its equivalent in the statistical tables in Appendix 5, and the p value shows a small probability of other errors affecting the predicted line. In comparison, there is a much stronger relationship between the TSS and BOD levels in the STW B effluent, with a negligible probability of the predicted line being affected by errors, and a substantially larger F value. It therefore appears that the BOD and TSS levels may fluctuate in unison, and it could be that the suspended solids are an important source of the BOD.
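The regression quantities discussed above (the fitted line, R-Sq and the F value) can be reproduced with a short least-squares calculation. The Python sketch below is only an illustration under stated assumptions: the function name `fit_line` and the TSS and BOD values are hypothetical, and it does not reproduce the Minitab output or the p value.

```python
def fit_line(x, y):
    """Least-squares fit of y = B*x + c, returning slope, intercept, R-Sq and F."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual (e) = observed y - predicted y.
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    residual_ss = sum(e ** 2 for e in residuals)
    regression_ss = total_ss - residual_ss

    r_sq = regression_ss / total_ss
    # F = Regression MS / Residual MS, with 1 and (N - 2) degrees of freedom.
    f_value = (regression_ss / 1) / (residual_ss / (n - 2))
    return {"slope": slope, "intercept": intercept, "r_sq": r_sq, "f": f_value}


# Hypothetical paired readings (x = TSS, y = BOD), for illustration only.
tss = [1.0, 2.0, 3.0, 4.0, 5.0]
bod = [2.0, 4.0, 5.0, 4.0, 5.0]

result = fit_line(tss, bod)
print(f"BOD = {result['slope']:.2f} * TSS + {result['intercept']:.2f}")
print(f"R-Sq = {100 * result['r_sq']:.1f}%, F = {result['f']:.2f}")
```

The F value would then be compared with the tabulated value for 1 and (N - 2) degrees of freedom, as described for Appendix 5.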
STW B
= H0)
...should be rejected, or that the TSS levels in the plants are significantly different from each other. It may therefore be said, that STW B performs slightly better than STW A in treating
Copyright of LabSearch, a working title of Dr Malcolm Sutherland 2013
Figure 5
Figure 6
DO THE T.S.S. AND B.O.D. LEVELS VARY WITH THE TIME OF YEAR?
One pattern that was observed in Graphs 3 and 4 was the change in the levels and variation between the summer and winter months. The data from the four sets is split into 2 groups: "Cold" (October - March) and "Warm" (April - September). The values for each category for the four data sets are listed in Figure 7 over-page. The descriptive statistics are given, and the histograms are provided in Appendix 6. Only the descriptive statistics will be used here, as this is only a brief insight, which deals with small numbers of samples from each time of the year for each of the three years (1998 - 2000). The outliers (except the extreme outliers on 14/04/98, for STW A) have been included, as they may have occurred as a result of climatic conditions (Figure 8).
Figure 7: the TSS and BOD levels in STWs A and B during cold and warm months
For all four data sets, the average (mean) and middle (median) levels are lower during the colder months. The standard deviation is also (slightly) larger for data gathered during the warmer months, in comparison to that taken in colder months. The histograms (C1, C2) for the BOD levels from STW A show a normal distribution for the colder month samples, and a skewed histogram for samples in the warmer months, which suggests that the BOD is slightly influenced by the seasons. Histograms C3 and C4 show that the TSS levels from STW A are almost fixed at below 15mg/L during the colder months, whereas this changes during the warmer months. For STW B, the histograms for the BOD levels (C5, C6) show almost no change; the histograms for the TSS levels (C7, C8) are also similar, although the levels during warmer months may range up to 50mg/L, even though they are focused at around 15mg/L. It may be that TSS and BOD levels in the effluent (particularly for STW A) rise during warmer temperatures, although without data for the temperature, this cannot be proven. The descriptive statistics in Figure 8 give the impression that the time of year may influence these parameters, although the histograms do not give conclusive evidence that this is the case.
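The seasonal grouping used in this section ("Cold" for October - March, "Warm" for April - September) can be sketched in Python as follows. The sample dates and values are hypothetical, and the function names `season` and `seasonal_summary` are illustrative only; the report's own figures came from Minitab.

```python
import statistics
from datetime import date

# Months treated as "Cold" in the report: October to March.
COLD_MONTHS = {10, 11, 12, 1, 2, 3}


def season(sample_date):
    """Classify a sampling date as "Cold" (Oct - Mar) or "Warm" (Apr - Sep)."""
    return "Cold" if sample_date.month in COLD_MONTHS else "Warm"


def seasonal_summary(samples):
    """Group (date, value) pairs by season and return descriptive statistics.

    Each season needs at least 2 values for the standard deviation.
    """
    groups = {"Cold": [], "Warm": []}
    for sample_date, value in samples:
        groups[season(sample_date)].append(value)
    return {
        name: {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "stdev": statistics.stdev(values),
        }
        for name, values in groups.items()
    }


# Hypothetical BOD readings in mg/L, for illustration only.
samples = [
    (date(1998, 1, 15), 8.0),
    (date(1998, 2, 12), 10.0),
    (date(1998, 7, 9), 14.0),
    (date(1998, 8, 20), 20.0),
    (date(1998, 11, 5), 9.0),
]

for name, stats in seasonal_summary(samples).items():
    print(name, stats)
```

Splitting by calendar month is only a proxy for temperature, which matches the report's caution that the seasonal effect cannot be proven without temperature data.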
CONCLUSION
The BOD and TSS levels in the effluent from both works are generally below the legal limit, although a few infringements have occurred. Occasionally there are samples which contain very high results, and a few of these outliers were rejected. The remaining data showed that STW A performed slightly better than STW B in controlling the BOD levels. It is not certain which plant performs best in controlling the TSS levels. The performance of STW A is partly determined by the time of year, although this is not definite. The BOD and TSS levels share a weak relationship in the STW A effluent, but there is a very certain relationship between these pollutants in the STW B effluent. From the graphs, it can be said that the performance of the STW plants has not worsened or improved considerably between 1998 and 2000.
PART B An appraisal of the statistical methods used in a journal paper
This article focuses on the relationships between soil microorganism populations and the types of copper in the soil, alongside relevant parameters such as IC50 (the conditions under which 50% of the microbial population dies), soil pH, and organic carbon. The authors chose to analyse the relationships between the microbial population and the soil copper "fractions", using correlation graphs, multiple regression, and Principal Component Analysis.
DATA ASSEMBLY
They collected 11 soil samples, and used triplicate samples from each one for extraction and analysis. All the values used in their analyses are based on the averaged data alone, and they do not mention the variation in the underlying data. Chemists usually carry between 2 and 5 replicates of a sample through a laboratory procedure, and 3 replicates is a common choice: increasing the number of replicates will increase the time spent in the laboratory, and the financial cost (Skoog et al, 1996). The total number of samples analysed is rather small, and may require more than one statistical method in order to confirm any trends in the data.
Methods such as the analysis of variance and t-tests are used to decide whether or not 2 or more sets of data are significantly different from one another. This article concerns the relationship between different kinds of data (e.g. soil pH and copper concentration), and thus the use of methods such as linear correlation and multiple regression is more appropriate. Several parameters are used. The chemical principles behind these will not be explained, but they are listed (and italicised in Times font):
Soil pH; Org-C (carbon in organic molecules)
(Types of copper in the soil) Ex-Cu (soluble and exchangeable copper); Pyro-Cu (organically bound copper); HOAc-Cu (specifically absorbed copper); T-Cu (total copper levels in the soil)
(Soil nutrients) Total-N (nitrogen in soil); Fe (iron); Mn (manganese)
(Microbial measurements) cfu (microbial populations in soil samples); IC50; Cmic (carbon in microbial biomass)
The authors are using a wide range of information in their study, and so their analyses involve studying the multiple relationships amongst these parameters. Most of these are described in the paper, although the authors do not discuss how any of the 3 soil nutrients influence the relationship between the soil copper fractions, and microbial populations.
LINEAR CORRELATION
In order to study the relationship between the IC50 and each copper fraction (i.e. to investigate whether or not the concentration of each type of copper in the soil contributed to the death of the microbes), the authors used linear correlation, which involves comparing values of one parameter against another on a graph. This is a simpler method than regression, and it is best used to decide if two parameters are directly, or inversely, proportional to each other. The authors are investigating the influence of copper on the microbes, and linear regression could have been used as well, to give an indication of the residual error (i.e. the variations on a graph which are not explained by the effects of copper on microbes) (Mann, 1995; Mendenhall, 1999).
MULTIPLE REGRESSION
Using only one method of analysing data gives a limited and biased insight when comparing data sets, and the authors wisely use other methods. The multiple regression model considers the relationship between several components (Mendenhall et al, 1999). Not only does this model analyse the several possible relationships between the components (in this article, the copper fractions), but it also measures the residual error overall. The usefulness of the multiple regression model can be tested using the Analysis of Variance F test, which is mentioned on page 176 of the paper. The F value gives an indication of how much the several components affect one another.
IN CONCLUSION
Kunito et al (1999) chose appropriate methods of analysing their data, to investigate whether or not a few soil parameters had any influence on the microbial population. They also took a note of some soil characteristics (by measuring the soil pH, and analysing its organic matter content). Although their statistical analysis was thorough, the number of soil samples and replicates used was small, and so this analysis may have been more appropriate for the same type of soil (in terms of pH and organic matter content, which affect microbial population). Altogether, their methods do take variations of each parameter into account, although the authors themselves recommend further research on their laboratory work at the end of the journal paper.
REFERENCES
Harrison, R.M. Pollution: Causes, Effects and Control (4th Ed.). Ch.5: Sewage and Sewage Sludge Treatment, p116. 2001 The Royal Society of Chemistry.
Mann, P.M. Statistics for Business and Economics. Ch.12: Analysis of Variance, pp628-635. Ch.13: Simple Linear Regression, pp662-667, 676-679. Copyright 1995, John Wiley & Sons, Inc.
Mendenhall, W., Beaver, R.J., Beaver, B.M. Introduction to Probability and Statistics. Ch.2: Describing Data with Numerical Measures, pp77-84. Ch.5: Several Useful Discrete Distributions, p183. Ch.12: Linear Regression and Correlation, p548. Ch.13: Multiple Regression Analysis, pp565-568. Copyright 1999 Brooks/Cole Publishing Company.
Skoog, D.A., West, D.M., Holler, F.J. Fundamentals of Analytical Chemistry. Ch.3: Random Errors in Analysis, pp27-33. 1996 Saunders College Publishing.
Websites (these were viewed in November 2002, and are no longer available)
Thermo Galactic, 2002. Algorithms; Principal Component Analysis Methods: Optimization. Copyright 2002 Thermo Galactic.
Hyvarinen, A. Principal Component Analysis (1999).
Oieroset, M. Principal Component Analysis (1999).