
DEPAUL UNIVERSITY

HW 5
MAT 448 Burnett
Peter Drogos 2/10/2014

Problems 5.1 and 5.5 in Cody and Smith text, pages 179 and 180, respectively

Purpose of problem: Given the following data, write a SAS program and compute the Pearson correlation coefficient between X and Y, and between X and Z. What is the significance of each? Compute the regression line (Y on X), where Y is the dependent variable and X the independent variable. What are the slope and intercept? Are they significantly different from 0?

Sample Info and Variables collected: Information was collected from 5 data points that included x, y, and z.

SAS routines used: proc corr, proc reg

Summary of results: The sample of 5 data points had a Pearson correlation coefficient between x and y of 0.96509 (p=0.0078), while the Pearson correlation coefficient between x and z is -0.97525 (p=0.0047). Both correlation coefficients are significant. The corresponding coefficients of determination are 0.931399 and 0.951113, respectively. This means that about 93% of the variation in y can be explained by the variation in x, and about 95% of the variation in z can be explained by the variation in x. The following table summarizes these correlation coefficients.
Pearson Correlation Coefficients, N = 5
Prob > |r| under H0: Rho=0

             y          z
x      0.96509   -0.97525
        0.0078     0.0047
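The p-values in the table come from a t test of H0: Rho=0. As a rough check, the data step below recomputes that test from the reported r and N alone. The data set name r_test and its variable names are illustrative assumptions, not part of the original program; for x and y the statistic should come out near 6.38, the same value proc reg reports for the slope.

```sas
/* Sketch: t test for H0: rho = 0 from a reported r and n.      */
/* Data set and variable names are illustrative assumptions.    */
data r_test;
  r   = 0.96509;                          /* reported r for x and y  */
  n   = 5;                                /* number of data points   */
  rsq = r**2;                             /* coefficient of determination */
  t   = r * sqrt(n - 2) / sqrt(1 - rsq);  /* t statistic, n - 2 df   */
  p   = 2 * (1 - probt(abs(t), n - 2));   /* two-sided p-value       */
run;

proc print data=r_test noobs;
run;
```

Replacing r with -0.97525 gives the corresponding check for x and z.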

The slope for the regression line is b=1.52410, while the intercept is a=0.78916. The following table summarizes both parameter estimates.

Parameter Estimates

                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|
Intercept    1      0.78916    1.25920      0.63     0.5753
x            1      1.52410    0.23882      6.38     0.0078

Summary and concluding remarks: It appears that x and y have a strong positive correlation, suggesting an upward trend: as one variable increases, the other increases as well. x and z appear to have a strong negative correlation; that is, as one variable increases, the other decreases. In this sample, both correlations are strong and statistically significant. The intercept of the regression line is not significantly different from 0 (p=0.5753), while the slope of the regression line is significantly different from 0 (p=0.0078).

Recommendations: A more extensive study with a larger sample size is recommended.

Limitations: Results are not extensive enough because of the small sample size.

See SAS program and output on the following pages

SAS program
/* Problems 5.1 and 5.5: correlation and regression of y on x */
data corrprob;
  input x y z;
  datalines;
1 3 15
7 13 7
8 12 5
3 4 14
4 7 10
;
ods graphics on;
ods rtf file='corrprob.rtf';
/* correlations of x with y and with z */
proc corr data=corrprob;
  var y z;
  with x;
run;
/* regression of y on x */
proc reg data=corrprob;
  model y=x;
run;
ods graphics off;
ods rtf close;
run;

Problem 5.3 in Cody and Smith text, page 180

Purpose of problem: Given the following data, how much of the variance of SBP (systolic blood pressure) can be explained by the fact that there is variability in AGE? (Use SAS to compute the correlation between SBP and AGE.)

Sample Info and Variables collected: Information was collected from 6 individuals that included age and sbp (systolic blood pressure).

SAS routines used: proc corr

Summary of results: The study was interested in the significance of the correlation coefficient. The following table summarizes the sample Pearson correlation coefficients.
Pearson Correlation Coefficients, N = 6
Prob > |r| under H0: Rho=0

            age        sbp
age     1.00000    0.95258
                    0.0033
sbp     0.95258    1.00000
         0.0033

Summary and concluding remarks: The correlation coefficient between age and sbp is 0.95258 (p=0.0033), so the coefficient of determination (R^2) is 0.907409. This means that about 91% of the variation in sbp can be explained by variability in age.

Recommendations: A more extensive study is recommended with a larger sample size.

Limitations: Results are not conclusive due to the small sample size.

See SAS program and output on the following pages

SAS program
/* Problem 5.3: correlation between age and systolic blood pressure */
data sbp;
  input age sbp;
  datalines;
15 116
20 120
25 130
30 132
40 150
50 148
;
ods graphics on;
ods rtf file='sbp.rtf';
proc corr data=sbp;
  var age sbp;
run;
ods graphics off;
ods rtf close;
run;

Problem 1 in Assignment for Session 5, page 23

Purpose of problem: The daily attendance at a local ball park and the number of hot dog sales are studied over a period of games. Given the following data: a) plot the data; b) find the correlation coefficient and test its significance; c) find the regression line to predict hot dog sales based on attendance; d) find the standard error of estimate; e) predict hot dog sales when attendance is 7000; f) are assumptions of model met? Explain briefly from ods output; g) summarize results of this problem in a paragraph.

Interpretation of SAS output: A study was done to investigate the relationship between daily ball park attendance and the number of hot dog sales over a period of games. The scatter plot of the data points is given below.

[Scatter plot of hdsales (vertical axis, 3000 to 7500) against attend (horizontal axis, 4500 to 10000) with fitted line hdsales = 179.42 + 0.6936 attend; N = 10, Rsq = 0.8789, AdjRsq = 0.8637, RMSE = 458.64]

The Pearson correlation coefficient, r, of 0.93748 (p<0.0001) showed a significant positive correlation between the daily attendance at the ball park and the number of hot dog sales. This indicates that games with higher attendance tended to have higher hot dog sales. The following table summarizes the Pearson correlation coefficients.

Pearson Correlation Coefficients, N = 10
Prob > |r| under H0: Rho=0

             attend    hdsales
attend      1.00000    0.93748
                        <.0001
hdsales     0.93748    1.00000
             <.0001

The regression model proved to be significant. The number of hot dog sales appears to be a function of the number attending the ball park. The estimated regression equation is hot dog sales = 179.41977 + 0.69362(number attending the ball park), valid over the domain 4534 to 9821 for the independent variable. The s.e.e. is 458.641, indicating that predictions could be off by as many as about 917 hot dogs (twice the s.e.e.) at roughly the 95% level. The following tables summarize the parameter estimates for the regression model and the fit statistics reported with the Analysis of Variance table.
Parameter Estimates

                  Parameter    Standard
Variable    DF     Estimate       Error   t Value   Pr > |t|
Intercept    1    179.41977   672.68015      0.27     0.7964
attend       1      0.69362     0.09104      7.62     <.0001

Root MSE           458.64114    R-Square    0.8789
Dependent Mean    5183.80000    Adj R-Sq    0.8637
Coeff Var            8.84759
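The estimates and Root MSE above are enough to produce a point prediction and the rough band of plus or minus two s.e.e. used in this write-up. The data step below is a minimal sketch of that arithmetic; the data set name predict_hd and its variable names are assumptions for illustration, and the band is only an approximation, not the exact prediction interval that proc reg can compute.

```sas
/* Sketch: point prediction and rough +/- 2 s.e.e. band.        */
/* Data set and variable names are illustrative assumptions.    */
data predict_hd;
  b0     = 179.41977;     /* intercept from Parameter Estimates */
  b1     = 0.69362;       /* slope for attend                   */
  rmse   = 458.64114;     /* Root MSE (s.e.e.)                  */
  attend = 7000;          /* attendance to predict for          */
  hdsales_hat = b0 + b1 * attend;     /* predicted hot dog sales */
  lower = hdsales_hat - 2 * rmse;     /* rough lower bound       */
  upper = hdsales_hat + 2 * rmse;     /* rough upper bound       */
run;

proc print data=predict_hd noobs;
  var attend hdsales_hat lower upper;
run;
```

Note that 7000 lies inside the observed attendance range of 4534 to 9821, so the prediction is an interpolation rather than an extrapolation.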

Hot dog sales when attendance is 7000 = 179.41977 + (7000*0.69362) = 5035 hot dog sales.

Yes, the assumptions of the model are met. We can see this by looking at the diagnostic tests run for hot dog sales. The scatter plot of residuals against predicted values shows a random pattern of ordered pairs, which supports linearity and constant variance. The residual vs. quantile plot shows the ordered pairs close to the line, which suggests the residuals are approximately normal. Finally, we see the distribution of the residuals is somewhat normal, although the sample size is small. The following table summarizes the diagnostics tests for hot dog sales.

See SAS program and output on the following pages

SAS program
/* Problem 1: hot dog sales versus ball park attendance */
data hotdogs;
  input attend hdsales;
  datalines;
8747 6845
5857 4168
8360 5348
6945 5687
8688 6007
4534 3216
7450 5018
5874 4652
9821 7001
5873 3896
;
ods graphics on;
ods rtf file='hotdogs.rtf';
proc corr data=hotdogs;
  var attend hdsales;
run;
/* regression of hot dog sales on attendance, with data and residual plots */
proc reg data=hotdogs;
  model hdsales=attend;
  plot hdsales*attend residual.*attend;
run;
ods graphics off;
ods rtf close;
run;

Problem 2 in Assignment for Session 5, page 23

Purpose of problem: Although the income tax system is structured so that people with higher incomes should pay a higher percentage of their income in taxes, there are many loopholes and tax shelters available for individuals with higher incomes. A sample of individual 2009 tax returns gave the data listed in the table. Given the following data: a) compute r and interpret results; b) compute r-squared and interpret results; c) is r significant at alpha=0.05 (state t-value and reasoning); d) compute the line of regression; e) compute the s.e.e., and interpret its results relevant to the problem; f) are assumptions of the model met? Explain briefly from ods output; g) predict the percentage of income paid in taxes by individuals with gross income of $80,000 with 95% accuracy.

Interpretation of SAS output: A study was done to investigate the relationship between individuals' gross income (in $1000s) and the taxes they paid (as a percentage of their total income). The Pearson correlation coefficient between gross income and taxes paid was 0.83416 and proved to be significant (p=0.0014<0.05). This indicates that the higher someone's gross income, the higher the percentage of income they paid in taxes. The coefficient of determination is 0.6958, meaning that about 70% of the variation in taxes paid can be explained by variation in gross income. The following table summarizes the correlation coefficient results for the two variables.
Pearson Correlation Coefficients, N = 11
Prob > |r| under H0: Rho=0

              grossinc   taxespaid
grossinc       1.00000     0.83416
                            0.0014
taxespaid      0.83416     1.00000
                0.0014

The line of regression proved to be significant. The percentage of income paid in taxes by an individual appears to be a function of the gross income they earned. The estimated regression equation is taxes paid = 12.03382 + 0.12606(gross income), valid over a domain of 9.1 to 150.7 (in $1000s) for an individual's gross income. The s.e.e. is 4.23, indicating that predictions could be off by as much as 8.47 percentage points of total income at roughly the 95% level. The following table summarizes the parameter estimates.
Parameter Estimates

                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|
Intercept    1     12.03382    2.44923      4.91     0.0008
grossinc     1      0.12606    0.02778      4.54     0.0014
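The write-up approximates the 95% band for a prediction as twice the s.e.e. One way to obtain a formal 95% prediction interval is sketched below: append a row with the gross income of interest and a missing response, then request the P and CLI options on the MODEL statement of proc reg. The data set name taxes_score is an assumption for illustration; the row with the missing response is not used to fit the line but still receives a predicted value and limits.

```sas
/* Sketch: formal 95% prediction interval at grossinc = 80.     */
/* taxes_score is an illustrative name, not from the original.  */
data taxes_score;
  input grossinc taxespaid;
  datalines;
38.7 16.0
80.5 20.1
14.8 11.1
47.3 24.3
9.1 10.2
150.7 30.4
55.9 27.3
110.2 27.9
73.2 16.2
146.8 29.8
100.4 23.4
80 .
;
run;

/* P prints predicted values; CLI adds 95% limits for an individual prediction */
proc reg data=taxes_score;
  model taxespaid = grossinc / p cli;
run;
quit;
```

With only 11 returns, the interval from CLI will be somewhat wider than the two-s.e.e. band, because it uses the t critical value with 9 degrees of freedom and accounts for uncertainty in the fitted line.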

Assumptions of the model are met after examining the diagnostics tests for taxes paid. From the first column, we see that the residual vs. predicted value scatter plot is random, which supports linearity and constant variance, while the residual vs. quantile plot shows that the ordered pairs are close to the line. The residuals also roughly demonstrate normality. The following table summarizes the diagnostics tests results.

For individuals with a gross income of $80,000, the predicted percentage of income paid in taxes is 22.12, +/- 8.47 percentage points at roughly the 95% level.

See SAS program and output on the following pages

SAS program
/*Problem 2, assignment for session 5, green handout*/
data taxes;
  input grossinc taxespaid;
  datalines;
38.7 16.0
80.5 20.1
14.8 11.1
47.3 24.3
9.1 10.2
150.7 30.4
55.9 27.3
110.2 27.9
73.2 16.2
146.8 29.8
100.4 23.4
;
ods graphics on;
ods rtf file='taxes.rtf';
proc corr data=taxes;
  var grossinc taxespaid;
run;
/* regression of percentage of income paid in taxes on gross income */
proc reg data=taxes;
  model taxespaid=grossinc;
  plot taxespaid*grossinc;
run;
ods graphics off;
ods rtf close;
run;

Problem 3 in Assignment for Session 5, page 24

Purpose of problem: A random sample of professional business women with MBAs is drawn and their age at the birth of their first child was recorded below. Define the sample subjects and variable of interest. State the sample mean, median, and standard deviation. Compute the 95% CI for the mean age these professional women have their first child and interpret its meaning relevant to the problem.

Interpretation of SAS output: A sample of 30 professional business women with MBAs was drawn and their age at the birth of their first child was recorded. The ages of the women ranged from 23.0 years to 38.0 years, with a mean of 30.4 years, a median of 29.5 years, and a standard deviation of 4.2 years. Based on this sample, we are 95% confident that the mean age at which these professional women have their first child lies between 28.9 and 32.0 years. The following table summarizes the sample statistics for the women's ages.

Analysis Variable : agef

                                                  Lower 95%     Upper 95%     Lower      Upper
 N   Minimum   Maximum   Mean   Std Dev   Median  CL for Mean   CL for Mean   Quartile   Quartile
30      23.0      38.0   30.4       4.2     29.5         28.9          32.0       27.0       34.0
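The confidence limits in the table follow the usual t-based formula, mean plus or minus t times the standard deviation divided by the square root of N. The data step below is a sketch of that calculation from the rounded summary statistics; the data set name ci_mean and its variable names are assumptions for illustration. Because it starts from rounded inputs (30.4 and 4.2), its limits can differ in the last digit from the 28.9 and 32.0 that proc means computes from the raw ages.

```sas
/* Sketch: t-based 95% CI for a mean from summary statistics.   */
/* Data set and variable names are illustrative assumptions.    */
data ci_mean;
  nobs  = 30;                       /* sample size                 */
  xbar  = 30.4;                     /* sample mean (rounded)       */
  sd    = 4.2;                      /* sample std dev (rounded)    */
  se    = sd / sqrt(nobs);          /* standard error of the mean  */
  tcrit = tinv(0.975, nobs - 1);    /* two-sided 95% critical value */
  lower = xbar - tcrit * se;
  upper = xbar + tcrit * se;
run;

proc print data=ci_mean noobs;
  var xbar se tcrit lower upper;
run;
```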

See SAS program and output on the following pages

SAS program
/* Problem 3: age at birth of first child, 30 women */
data women;
  input agef @@;
  datalines;
32 29 30 23 34
28 24 27 33 36
26 28 33 25 38
33 28 34 37 27
35 34 28 35 29
34 29 25 33 26
;
ods graphics on;
ods rtf file='women.rtf';
/* summary statistics with 95% confidence limits for the mean */
proc means n min max mean stddev median clm Q1 Q3 maxdec=1;
  id agef;
  var agef;
run;
ods graphics off;
ods rtf close;
run;

Problem 4 in Assignment for Session 5, page 24

Purpose of problem: The scores on a preliminary aptitude test given to new applicants at a company were studied. Perform a two-sample t-test on SAS and summarize results. The personnel department believed that the performance scores for a test given to new interviewees may differ by the age of the applicant. Below are the scores for applicants under 40 and applicants 40 or older. Does there appear to be any significant difference in their mean performance score on this test between the two age groups? Test to see if differences exist between the mean performance scores of the two groups. Use alpha=0.05. Use SAS to help answer the following. Insert relevant output where needed. Do a short write-up similar to that discussed in class tonight. State the populations and variables of interest. Find the means and standard deviations and medians of both groups. Run the appropriate test to see if the mean performance scores for the two groups are significantly different. Use p-value approach to answer. Explain what the p-value on SAS means relevant to the problem. Summarize findings as done in class. Compare results of the test to the confidence interval results given on SAS.

Interpretation of SAS output: A study was done to see if the mean performance score for new interviewees under 40 differs from the mean performance score for new interviewees 40 or older. 19 interviewees under 40 years old and 17 interviewees ages 40 years and older were selected and their scores were measured. The scores for the under 40 group ranged from 57.0 to 80.0, with a mean of 66.4 and s=6.2. The scores for the 40 or older group ranged from 64.0 to 82.0, with a mean of 73.2 and s=5.3. The following table summarizes the sample statistics for the two groups.

Analysis Variable : score

group                 N Obs    N    Mean   Std Dev
Under 40 Group           19   19    66.4       6.2
40 or Older Group        17   17    73.2       5.3

The pooled two-sample t-test was conducted on the sample data because the variances of the two groups are treated as equal. This test shows that the mean difference of 6.8669 was significant (t=3.54, 34 df, p=0.0012). Since the p-value is significant at both the 0.05 and 0.01 levels of significance, we reject the null hypothesis H0 that the two group means are equal. This indicates that the mean scores for the two groups are significantly different and that the mean score for those in the under 40 group is lower than the mean score for those in the 40 or older group. The following table summarizes the t-test mean difference between the two groups.
group                  N      Mean   Std Dev   Std Err   Minimum   Maximum
40 or Older Group     17   73.2353    5.2859    1.2820   64.0000   82.0000
Under 40 Group        19   66.3684    6.2469    1.4331   57.0000   80.0000
Diff (1-2)                  6.8669    5.8145    1.9412
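The Diff (1-2) row can be rebuilt from the two group rows, which makes the pooled test easier to follow. The data step below is a sketch of that arithmetic using the group sizes, means, and standard deviations from the table; the data set name pooled_t and its variable names are assumptions for illustration. It should come out close to the reported pooled standard deviation of 5.8145, standard error of 1.9412, and t of 3.54 on 34 df.

```sas
/* Sketch: pooled two-sample t test from summary statistics.    */
/* Data set and variable names are illustrative assumptions.    */
data pooled_t;
  n1 = 17; mean1 = 73.2353; s1 = 5.2859;   /* 40 or Older Group */
  n2 = 19; mean2 = 66.3684; s2 = 6.2469;   /* Under 40 Group    */
  df   = n1 + n2 - 2;                      /* degrees of freedom */
  sp   = sqrt(((n1 - 1)*s1**2 + (n2 - 1)*s2**2) / df);  /* pooled std dev */
  se   = sp * sqrt(1/n1 + 1/n2);           /* std error of the difference */
  diff = mean1 - mean2;                    /* mean difference (1-2) */
  t    = diff / se;
  p    = 2 * (1 - probt(abs(t), df));      /* two-sided p-value */
run;

proc print data=pooled_t noobs;
  var diff sp se t df p;
run;
```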

See SAS program and output on the following pages

SAS program
/* Problem 4: aptitude scores by age group */
proc format;
  value $groupfmt 'A' = 'Under 40 Group'
                  'B' = '40 or Older Group';
run;
data aptitude;
  input group$ score @@;
  format group $groupfmt.;
  datalines;
A 58 A 65 A 71 A 57 A 70 A 69 A 66 A 64 A 63 A 61
A 70 A 67 A 61 A 68 A 74 A 76 A 80 A 59 A 62
B 80 B 69 B 79 B 82 B 76 B 76 B 72 B 64 B 71 B 68
B 67 B 69 B 71 B 75 B 79 B 78 B 69
;
ods graphics on;
ods rtf file='aptitude.rtf';
proc means data=aptitude maxdec=1;
  class group;
  var score;
run;
/* two-sample t test comparing mean scores of the two age groups */
proc ttest data=aptitude;
  class group;
  var score;
run;
ods graphics off;
ods rtf close;
run;
