
Class 9 Regression & Correlation Analysis

Regression

http://www.math.csusb.edu/faculty/stanton/probstat/regression.html
http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/

If you have two continuous variables in a sample, for example life satisfaction (y-axis) and stress level (x-axis), you can plot their values and observe whether they form a pattern. Try the applets above. They show how a regression line (a straight line fitted to the plot of the values of two variables in a sample) can be drawn and how it is influenced by the locations of the values in the plot.

Curve Fitting

The simplest type is linear regression: y = a + bx

Other common (non-linear) forms:
quadratic: y = b0 + b1x + b2x²
exponential: y = a·b^x
logarithmic: y = a + b log x

Multiple regression is a kind of linear regression with more than one independent variable: y = b0 + b1x1 + b2x2 + ... + bkxk

In social work research studies, we usually deal with phenomena which are multi-causational, that is, more than one factor (independent variable) affects one dependent variable. Thus, we usually use multiple regression instead of simple regression. The interpretation of multiple regression is very similar to that of simple regression.

Furthermore, we can usually transform the other types of regression into linear regression. For instance, where y = a + b log x, we can use y = a + bZ, with Z = log x.

Linear Equation: y = a + bx, sometimes written as f(x) = a + bx, where "a" represents the intercept of the line on the y-axis and "b" represents the slope of the line relative to the x-axis, i.e. b = tan θ, where θ is the angle between the line and the x-axis, as shown in Figure 1.
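As a minimal sketch of the transformation idea above, the following Python example fits the logarithmic form y = a + b log x by first computing Z = log x and then fitting an ordinary straight line to (Z, y). The data values and the helper name `fit_line` are made up for illustration and are not part of the notes; the line is fitted with the least-squares formulas b = Sxy / Sxx and a = mean(y) - b·mean(x).

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

# Hypothetical data generated exactly from y = 2 + 3*log(x)
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [2.0 + 3.0 * math.log(x) for x in xs]

# Transform x, then fit a straight line to (Z, y)
zs = [math.log(x) for x in xs]
a, b = fit_line(zs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because the made-up data follow the logarithmic curve exactly, the straight-line fit on the transformed variable recovers a = 2 and b = 3.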

The interpretation of "b"

Sometimes the value of "b" is used to give us some idea of how much knowing the value of x helps us in predicting the value of y. In Figure 2, where b = 0, knowing the value of x does not improve our ability to predict the value of y.

On the other hand, we should also note that the value of b depends on the units of measurement of x and y. For example, if we hypothesize that the examination score (y) is a linear function of the time spent studying (x), we have y = a + bx. The value of "b" when x is measured in weeks will be 7 times the value of "b" when x is measured in days.

As a convention, in research reports (α, β) are used to represent the coefficients when the values of both x and y are standardized (note: rather than representing the true parameters of the linear equation described in the subsequent sections of these notes), whereas (a, b) are used to represent the coefficients when the values of x and y are not standardized.
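The dependence of "b" on the unit of measurement can be checked numerically. The sketch below uses made-up study times and examination scores (not data from the notes) and hypothetical helper names `slope` and `standardize`. It fits the slope once with x in days and once with x in weeks, then repeats the fit on standardized values.

```python
def slope(xs, ys):
    """Least-squares slope b = Sxy / Sxx."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / sxx

def standardize(values):
    """Convert values to z-scores (mean 0, standard deviation 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

# Hypothetical data: study time in days and examination scores
days = [7, 14, 21, 28, 35, 42]
scores = [52, 55, 61, 60, 68, 71]
weeks = [d / 7 for d in days]

b_day = slope(days, scores)
b_week = slope(weeks, scores)
beta_day = slope(standardize(days), standardize(scores))
beta_week = slope(standardize(weeks), standardize(scores))

print(f"b per day = {b_day:.4f}, b per week = {b_week:.4f}")
print(f"standardized slope (days) = {beta_day:.4f}, (weeks) = {beta_week:.4f}")
```

The slope per week comes out as 7 times the slope per day, while the standardized slope is the same under either unit, which is why standardized coefficients are easier to compare across variables.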


Least Square

Say we have a set of observations for x and y, e.g. x representing age and y representing income for a sample of persons, and we try to predict income from age by fitting a straight line (y = a + bx). The question is how to find the values of a and b. The objective is to find the values of a and b that best fit the set of observed (x, y). In Figure 3, we try to find the straight line that has the least deviation from the observed values.

Given a straight line (y = a + bx) and the set of observed x, we can use the equation to estimate the value of y. Let ŷ be the estimate and y be the observed value. What we are trying to do is find the straight line (y = a + bx) such that the value of Σ(ŷ - y)² is minimized.

Mathematically, Σ(ŷ - y)² is at its minimum when
b = Sxy / Sxx
a = [Σy - b(Σx)] / n

Notation
Sxx stands for the sum of squares of x: Σ(x - x̄)² or Σdx² [where x̄ is the mean of x]
Sxy stands for the sum of cross-products of x and y: Σ(x - x̄)(y - ȳ) or Σdx dy [where ȳ is the mean of y]; Sxy / n is the covariance of x and y
Syy stands for the sum of squares of y: Σ(y - ȳ)² or Σdy²

t test for significance of (a, b)

For the regression equation y = a + bx: if there is a significant linear relationship between the independent variable x and the dependent variable y, the slope will not equal zero.

H0: b = 0
Ha: b ≠ 0

Null hypothesis: The slope of the regression line fitting the two variables in the population is equal to zero.

The t test calculates the t value by the formula
t = b / SE
where SE stands for the standard error of the slope and b is the slope estimated from the sample.

SE = sb = sqrt[ Σ(yi - ŷi)² / (n - 2) ] / sqrt[ Σ(xi - x̄)² ]

where yi is the value of the dependent variable for observation i, ŷi is the estimated value of the dependent variable for observation i, xi is the observed value of the independent variable for observation i, x̄ is the mean of the independent variable, and n is the number of observations.

The degrees of freedom (df) are equal to:
df = n - 2
where n is the number of observations in the sample.
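The least-squares formulas and the t test for the slope described above can be sketched in a few lines of Python. The five (x, y) pairs and the function name `simple_regression` are made up for illustration; the code follows the formulas b = Sxy / Sxx, a = [Σy - b(Σx)] / n, and t = b / SE with df = n - 2.

```python
import math

def simple_regression(xs, ys):
    """Return (a, b, se_b, t, df) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = (sum(ys) - b * sum(xs)) / n
    # Sum of squared deviations of the observed y from the estimated y
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = math.sqrt(residual_ss / (n - 2)) / math.sqrt(sxx)
    t = b / se_b
    return a, b, se_b, t, n - 2

# Hypothetical observations
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b, se_b, t, df = simple_regression(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}, SE = {se_b:.3f}, t = {t:.3f}, df = {df}")
# prints: a = 2.200, b = 0.600, SE = 0.283, t = 2.121, df = 3
```

The resulting t value would then be compared with the critical value of the t distribution with n - 2 degrees of freedom to decide whether to reject H0: b = 0.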

Regression Sum Square



Total Sum Square = Regression Sum Square + Residual Sum Square
or
Total Variation = Explained Variation + Unexplained Variation

Σ(y - ȳ)² = Σ(ŷ - ȳ)² + Σ(y - ŷ)²

where
ŷ - predicted value based on y = a + bx
y - observed value
ȳ - mean value of y

"Regression Sum Square" represents the amount of variation in y that is explained by the variation in x.

Coefficient of Determination

r² = Σ(ŷ - ȳ)² / Σ(y - ȳ)² = bSxy / Syy ................ (1)
   = Regression Sum Square / Total Sum Square
   = Explained Variation / Total Variation

r² = Sxy² / (Sxx · Syy), which follows from (1) by substituting b = Sxy / Sxx into bSxy / Syy.

In fact, it can be proved that
r = Cov(x, y) / (σx · σy)
where Cov(x, y) stands for the covariance of the two variables x and y, which can be expressed as Sxy / n, where n is the number of cases; σx and σy are the standard deviations of x and y.

Coefficient of Correlation (Pearson Product Moment r)

The coefficient of correlation ranges from -1 to 1. "0" means no correlation, "+1" means perfectly positively correlated and "-1" means perfectly negatively correlated. In social work studies, the value of r is usually not very high. Anything above 0.5 (i.e. r² = 0.25) is considered to be high. Anything from 0.2 to below 0.5 is sometimes referred to as "medium", whereas anything below 0.2 is considered a low correlation.

Inferences about "r"

The Fisher Z transformation, Z = ½ ln[(1 + r)/(1 - r)], gives an approximately normal distribution with mean μz = ½ ln[(1 + ρ)/(1 - ρ)], where "ρ" represents the population correlation coefficient, and standard deviation σz = 1/√(n - 3). Thus, we can use the z-test with:
z = (Z - μz)/σz = (Z - μz)√(n - 3)

Regression and ANOVA

Mathematically, regression and ANOVA are very similar. Both analyses require an interval scale, normally distributed dependent variable. When the independent variable is a nominal or ordinal scale variable, we use ANOVA. When the independent variable is an interval scale variable, we use regression. When there is more than one independent variable, and some of the variables are nominal scale and some are interval scale variables, we use either ANOVA or Logistic regression. If we use ANOVA, the interval scale variables are called "covariates".

Standard requirements for simple linear regression:
• The dependent variable Y has a linear relationship to the independent variable X.
• For each value of X, the probability distribution of Y has the same standard deviation σ.
• For any given value of X,
  o The Y values are independent.
  o The Y values are roughly normally distributed (i.e., symmetric and unimodal). A little skewness is ok if the sample size is large.

Null hypothesis for the ANOVA test in regression: There is no linear relationship in the population between the dependent variable and the independent variables.
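The sum-of-squares decomposition, the coefficient of determination and the Fisher Z test from the preceding sections can be illustrated with a short Python sketch. The data and the function names `regression_anova` and `fisher_z_statistic` are made up for illustration. The F ratio shown is the usual ANOVA statistic for simple regression, Regression SS / (Residual SS / (n - 2)); it is not spelled out in these notes and is included here as an assumption about how the ANOVA test is computed.

```python
import math

def regression_anova(xs, ys):
    """Return (total_ss, regression_ss, residual_ss, r_squared, f_ratio)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    predicted = [a + b * x for x in xs]
    total_ss = sum((y - y_mean) ** 2 for y in ys)
    regression_ss = sum((p - y_mean) ** 2 for p in predicted)
    residual_ss = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    r_squared = regression_ss / total_ss
    # Assumed ANOVA F ratio for one independent variable (1 and n - 2 df)
    f_ratio = regression_ss / (residual_ss / (n - 2))
    return total_ss, regression_ss, residual_ss, r_squared, f_ratio

def fisher_z_statistic(r, n, rho0=0.0):
    """z statistic for H0: population correlation equals rho0."""
    z_r = 0.5 * math.log((1 + r) / (1 - r))
    mu_z = 0.5 * math.log((1 + rho0) / (1 - rho0))
    sigma_z = 1 / math.sqrt(n - 3)
    return (z_r - mu_z) / sigma_z

# Hypothetical observations
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
total_ss, regression_ss, residual_ss, r_squared, f_ratio = regression_anova(xs, ys)
print(f"Total SS = {total_ss:.2f}, Regression SS = {regression_ss:.2f}, "
      f"Residual SS = {residual_ss:.2f}")
print(f"r squared = {r_squared:.2f}, F = {f_ratio:.2f}")

# Hypothetical sample: r = 0.5 observed in n = 103 cases, tested against rho = 0
z = fisher_z_statistic(0.5, 103)
print(f"Fisher z = {z:.3f}")
```

For these made-up data the total sum of squares (6.0) splits into 3.6 explained and 2.4 unexplained, so r² = 0.6. With a single independent variable, the F ratio equals the square of the t value for the slope.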

[Figure: scatterplot with fitted linear regression line. Y-axis: Doctor; x-axis: Nurse. Fitted line: Doctor = 3.30 + 0.55 * q5, R-Square = 0.25]

Ordinal Correlations

Two variables are positively correlated if cases with low values for one variable tend to have low values for the other variable, and cases with high values for one tend to have high values for the other.

The computation of these statistics relies on the concept of concordance in pairs of cases. A pair of cases is concordant if the case with the higher value on variable A also has the higher value on variable B. A pair is discordant if the case with the higher value on variable A has the lower value on variable B. When the two cases have identical values on one or both variables, they are tied.

For example, consider the following data set with 6 records:

Case                           1   2   3   4   5   6
Quality of life (Dependent)    6   6   7   7   8   8
Stress level (Independent)     4   7   5   6   5   7

There can be 6C2 = (6 x 5) / (2 x 1) = 15 pairs of records. Consider the following pairs of records:

                       Pair of cases   Quality of life (Dependent)   Stress level (Independent)
Concordance            6, 1            8, 6                          7, 4
Discordance            5, 2            8, 6                          5, 7
Ties (on dependent)    3, 4            7, 7                          5, 6

In computing such statistics, all possible pairs of cases are taken into consideration. For example, if there are 10 cases, there would be 45 possible pairs, and if there are 100 cases, there would be 4,950 pairs.

Gamma (Goodman and Kruskal's Gamma): G = (P - Q) / (P + Q), where P = number of concordant pairs and Q = number of discordant pairs. The major limitation of G is that it disregards ties.

Somers' d: Dy = (P - Q) / (P + Q + Ty), where Dy is the proportionate excess of concordant pairs over discordant pairs among pairs not tied on the independent variable. (Ty is the number of pairs tied on Y, the dependent variable, and not on X, the independent variable.) The above formulation treats one variable as the independent variable and the other as the dependent variable, and hence it is asymmetrical. The symmetrical Somers' d is computed by replacing Ty with half of (Ty + Tx). Somers' d is an improvement on Gamma.
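Counting the pairs by hand is tedious, so here is a minimal Python sketch that classifies all 15 pairs of the six-record example (quality of life as the dependent variable Y, stress level as the independent variable X) and then computes Gamma and Somers' d from the formulas above. The function name `count_pairs` is made up for illustration.

```python
from itertools import combinations

def count_pairs(xs, ys):
    """Return (P, Q, Tx, Ty, Txy): concordant, discordant, tied on X only,
    tied on Y only, and tied on both variables."""
    p = q = tx = ty = txy = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            txy += 1
        elif dx == 0:
            tx += 1
        elif dy == 0:
            ty += 1
        elif dx * dy > 0:
            p += 1      # same ordering on both variables
        else:
            q += 1      # opposite ordering
    return p, q, tx, ty, txy

stress = [4, 7, 5, 6, 5, 7]            # X, independent
quality_of_life = [6, 6, 7, 7, 8, 8]   # Y, dependent

p, q, tx, ty, txy = count_pairs(stress, quality_of_life)
gamma = (p - q) / (p + q)
somers_dy = (p - q) / (p + q + ty)
somers_symmetric = (p - q) / (p + q + (tx + ty) / 2)

print(f"P = {p}, Q = {q}, Tx = {tx}, Ty = {ty}")
print(f"Gamma = {gamma:.3f}, Somers' d (Y dependent) = {somers_dy:.3f}, "
      f"symmetric = {somers_symmetric:.3f}")
```

For this data set there are 6 concordant pairs, 4 discordant pairs, 3 pairs tied on quality of life only and 2 pairs tied on stress level only. Gamma (0.2) is therefore larger than Somers' d (2/13, about 0.154), because Gamma ignores the tied pairs.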

Kendall's Tau-b: τb = (P - Q) / √[(P + Q + Tx)(P + Q + Ty)]

Kendall's Tau-c: τc = 2m(P - Q) / [N²(m - 1)]
where m is the smaller of the number of rows and columns, and N is the number of cases.

τc is better than τb for rectangular tables (different numbers of ordinal values for the two variables). Kendall's Tau-b and Kendall's Tau-c are superior to other measures of ordinal correlation when a test of significance is required. Kendall's Tau-c is more suitable if the dependent and independent variables have different numbers of points of measurement. For example:

                     QoL
                     1    2    3    4    5
Stress level   1     5   10   20   40   50
               2    10   20   30   20   15
               3    50   40   25   10    5

Note: m = 3
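As a rough sketch, Kendall's Tau-b and Tau-c for the 3 x 5 example table above can be computed in Python as follows. The function names are made up for illustration. The denominators of τb are obtained as "all pairs minus pairs tied on one variable", which equals P + Q plus the pairs tied only on the other variable, as in the formula above.

```python
import math

def concordant_discordant(table):
    """Count concordant (P) and discordant (Q) pairs in a contingency table."""
    rows, cols = len(table), len(table[0])
    p = q = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):
                for l in range(cols):
                    if l > j:
                        p += table[i][j] * table[k][l]
                    elif l < j:
                        q += table[i][j] * table[k][l]
    return p, q

def kendall_taus(table):
    """Return (tau_b, tau_c) for a table of counts."""
    p, q = concordant_discordant(table)
    n = sum(sum(row) for row in table)
    m = min(len(table), len(table[0]))
    total_pairs = n * (n - 1) // 2
    tied_on_rows = sum(r * (r - 1) // 2 for r in (sum(row) for row in table))
    tied_on_cols = sum(c * (c - 1) // 2 for c in (sum(col) for col in zip(*table)))
    tau_b = (p - q) / math.sqrt(
        (total_pairs - tied_on_rows) * (total_pairs - tied_on_cols)
    )
    tau_c = 2 * m * (p - q) / (n ** 2 * (m - 1))
    return tau_b, tau_c

# Rows: stress level 1 to 3; columns: quality of life 1 to 5
table = [
    [5, 10, 20, 40, 50],
    [10, 20, 30, 20, 15],
    [50, 40, 25, 10, 5],
]

p, q = concordant_discordant(table)
tau_b, tau_c = kendall_taus(table)
print(f"P = {p}, Q = {q}")
print(f"tau-b = {tau_b:.3f}, tau-c = {tau_c:.3f}")
```

Both values come out negative (about -0.50 and -0.55), reflecting that higher stress levels go with lower quality of life in this example table.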

Cohen's kappa

When both variables use the same rating scale, e.g. both use a 4-point scale, we can make use of this additional information to measure their association. Cohen's kappa compares the percentage of agreement between the two scales with the agreement expected by chance if there is no association between the two (the null hypothesis). The resulting value ranges from -1 to +1. A value of 1 indicates perfect agreement, while a value of -1 indicates perfect disagreement. A value of 0 indicates that the agreement observed is what you would expect by chance.

Spearman Rank Correlation Coefficient

This statistic is not based on the concordant-pair concept. It is most useful when there are a large number of possible values for both variables. In the Spearman Rank, all cases are taken into consideration. The values of all cases on one variable are ranked (i.e. creating a new ordinal variable), and the same is done for the other variable. Then the Pearson's r between the ranks of the two variables is computed.

NOMINAL SCALE - INTERVAL SCALE

Correlation ratio = Explained SS / Total SS, derived from the ANOVA.
Eta² = Correlation ratio

THE CHOICE OF MEASURES

With so many possible measures of correlation, the choice can be confusing to researchers who are not trained in statistics. The basic rules in choosing measures are:

1) Choose the measure which is appropriate for the level of measurement of the variables. Never use Pearson r for nominal scale data.
a) Statistics appropriate for a lower level of measurement can be used for a higher level of measurement, e.g. the contingency coefficient can be used to describe interval scale data.
b) However, by using statistics designed for a lower level of measurement we under-utilize the information available in the data. If we use ordinal measures of correlation for interval scale data, the data will be treated as ordinal only and the interval nature of the data will be ignored.

2) If we want to compare correlation or association across different pairs of variables, try to use measures which give intuitive meaning, such as lambda, uncertainty coefficient, gamma, Eta, Spearman Rank and Pearson's r. Comparing one Kendall's tau b with another tau b obtained from a different pair of variables with a different number of ranks will be misleading.

If we want to test the significance of association or correlation, try to use statistics which give significance statistics, such as Eta, Kendall's tau b or c, Spearman Rank, Pearson r, etc.

Theoretically, for different types of measures there would be different types of correlation statistics that would be most suitable. However, in a single research study report, it would not be advisable to use many different types of correlation statistics even though each is the most suitable for its case. Thus, the best strategy would be to choose a few correlation statistics that are more generally applicable, though not necessarily the most suitable. Cramer's V, Kendall's tau c, and Pearson r would be the most generally applicable correlation statistics for nominal, ordinal, and interval scale measures respectively.

Note: to obtain the analyses from SPSS, use Analyze / Descriptive Statistics / Crosstabs > Statistics

http://www.unesco.org/webworld/idams/advguide/Chapt4_2.htm
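The description of the Spearman Rank Correlation Coefficient given earlier (rank each variable, then compute Pearson's r between the ranks) can be sketched in Python as below. The helper names are made up for illustration, and tied values are given the average of the ranks they occupy, which is a common convention assumed here rather than stated in the notes. The data are the six records from the ordinal correlation example.

```python
import math

def average_ranks(values):
    """Rank values from 1 upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1   # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson product moment correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson's r between the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

stress = [4, 7, 5, 6, 5, 7]
quality_of_life = [6, 6, 7, 7, 8, 8]
rho = spearman_rho(stress, quality_of_life)
print(f"Spearman rho = {rho:.3f}")
```

Because only the ranks enter the calculation, any strictly increasing relationship (for example y = x²) gives a Spearman coefficient of 1 even though the relationship is not linear.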


Social workers empower clients * Social workers bring hopes to those in averse situation Crosstabulation

Count
                              Social workers bring hopes to those in averse situation
Social workers                Strongly                                     Strongly
empower clients               disagree   Disagree   Neutral   Agree       agree      Total
Strongly disagree                 11         2          2        0           0         15
Disagree                           2        11         16        3           0         32
Neutral                            0        10         67       45           7        129
Agree                              1         4         33      163          22        223
Strongly agree                     1         3          8       32          52         96
Total                             15        30        126      243          81        495

Directional Measures

Ordinal by Ordinal: Somers' d
                                                                    Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Symmetric                                                           .551    .035                   14.416         .000
Social workers empower clients Dependent                            .541    .034                   14.416         .000
Social workers bring hopes to those in averse situation Dependent   .560    .036                   14.416         .000

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

Symmetric Measures

                                              Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Ordinal by Ordinal     Kendall's tau-b        .551    .035                   14.416         .000
                       Kendall's tau-c        .464    .032                   14.416         .000
                       Gamma                  .730    .039                   14.416         .000
                       Spearman Correlation   .596    .037                   16.501         .000(c)
Interval by Interval   Pearson's R            .642    .038                   18.576         .000(c)
Measure of Agreement   Kappa                  .429    .032                   15.436         .000
N of Valid Cases                              495

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.
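As a cross-check on the SPSS output above, the following Python sketch recomputes Cohen's kappa, Gamma and Kendall's Tau-c directly from the crosstabulation counts, read as a 5 x 5 table with "Social workers empower clients" in the rows and "Social workers bring hopes to those in averse situation" in the columns. The arrangement of the counts into rows and columns is a reconstruction from the flattened output, and the variable names are made up for illustration.

```python
# Rows and columns both run from "Strongly disagree" to "Strongly agree"
table = [
    [11, 2, 2, 0, 0],
    [2, 11, 16, 3, 0],
    [0, 10, 67, 45, 7],
    [1, 4, 33, 163, 22],
    [1, 3, 8, 32, 52],
]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
size = len(table)

# Cohen's kappa: observed agreement versus agreement expected by chance
p_observed = sum(table[i][i] for i in range(size)) / n
p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
kappa = (p_observed - p_expected) / (1 - p_expected)

# Concordant (P) and discordant (Q) pairs
p = q = 0
for i in range(size):
    for j in range(size):
        for k in range(i + 1, size):
            for l in range(size):
                if l > j:
                    p += table[i][j] * table[k][l]
                elif l < j:
                    q += table[i][j] * table[k][l]

gamma = (p - q) / (p + q)
m = size  # smaller of the number of rows and columns
tau_c = 2 * m * (p - q) / (n ** 2 * (m - 1))

print(f"N = {n}, kappa = {kappa:.3f}, gamma = {gamma:.3f}, tau-c = {tau_c:.3f}")
```

The three values agree with the Kappa (.429), Gamma (.730) and Kendall's tau-c (.464) figures in the Symmetric Measures table.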

Assignment

1. Use the dataset (social workers image) and, using linear regression, identify the opinion statement (Q1a to Q1n) that best predicts the level of trust in social workers. Discuss the findings.
2. Choose any one non-parametric test for ordinal correlation and see if the same opinion statement is identified.
