This is a closed-book exam. You are allowed to use a calculator and one page (8.5 × 11 inches
or A4, both sides) of handwritten notes. No use of cellular telephones or other portable
electronics is permitted.
You have two hours for the exam. The computer output associated with one or more items
should be considered an essential part of the questions. The multiple-choice questions are
equally weighted; your grade will be based on the number of correct answers you provide.
Choose the one best answer by marking the item on the answer form.
STOP
DO NOT TURN THE PAGE UNTIL YOU ARE
INSTRUCTED TO PROCEED.
Statistics 621, Final Exam -2- Q1, 2008
(1) Mark answer b for question #1 on your answer form. This identifies your exam.
(2) Assuming the MRM holds, if the 95% confidence interval for the intercept β0 in a multiple
regression model is the interval [-18, 54], then the
a. Intercept β0 is zero.
b. Intercept is statistically significantly different from zero.
c. Standard error of the estimated intercept is about 18.
d. p-value for the test of H0: β0 = 0 is less than 0.05.
e. Value of R² for the model is close to zero.
(3) The intercept is often an extrapolation in a regression model because
a. The population intercept β0 is 0.
b. The sample size used to fit the estimated model is too small.
c. The explanatory variables in the fitted model are collinear.
d. The fitted model has a highly leveraged outlier.
e. Zero lies outside the range of the explanatory variables.
(4) A narrow cluster of observations at the center of the leverage plot of an explanatory
variable in a multiple regression suggests
a. The effect of one or more highly leveraged observations.
b. That this variable is collinear with other explanatory variables.
c. That this variable is a statistically significant predictor of the response.
d. That this variable has a normal distribution.
e. The presence of a nonlinear relationship in the regression model.
(5) In order to build a regression model that estimates a constant marginal elasticity of sales
with respect to price, we must
a. Fit a simple regression of sales on log price.
b. Fit a linear regression of sales on price.
c. Fit a simple regression of log sales on log price.
d. Compute the derivative of sales with respect to price.
e. Form a categorical variable to represent different levels of price.
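As a hedged illustration of the idea behind item (5): when both variables are logged, the fitted slope estimates an elasticity that is the same at every price. The sketch below uses made-up, noise-free data (the constant 1000 and the elasticity -1.5 are assumptions, not values from this exam) and a small hand-written least-squares helper.

```python
import math

def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical demand curve: sales = 1000 * price^(-1.5), with no noise.
prices = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
sales = [1000.0 * p ** -1.5 for p in prices]

# Regress log sales on log price; the slope is the estimated elasticity.
log_price = [math.log(p) for p in prices]
log_sales = [math.log(s) for s in sales]
elasticity, log_intercept = ols_slope_intercept(log_price, log_sales)

print(f"estimated elasticity: {elasticity:.3f}")
```

Because the made-up data follow a power law exactly, the fitted slope recovers the assumed elasticity; with real data the slope would only estimate it.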
(6) The explanatory variables in a multiple regression are assumed by the multiple regression
model to
a. Be uncorrelated with each other.
b. Have normal distributions.
c. Be linearly related to the response.
d. Have equal variation.
e. Each consists of an iid sample of n cases from the studied population.
(7) The overall F-ratio in a multiple regression with K explanatory variables (found in the
analysis of variance table)
a. Tests the statistical significance of an added categorical variable.
b. Grows larger with a larger sample size.
c. Increases with the addition of explanatory variables to the model.
d. Tests H0: β1 = β2 = ... = βK = 0.
e. Measures the impact of collinearity on a fitted multiple regression.
[Figure: scatterplot of Wheat (vertical axis, 10,000 to 60,000) versus Acres (horizontal axis, 600 to 1,400).]
RSquare 0.512
Root Mean Square Error 6700
Mean of Response 29430
Observations 60
Parameter Estimates
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept    2950.04     4236.71      0.70     0.4893
Acres          37.68        4.83      7.81     <.0001
[Figure: plot of Residual (vertical axis, -15,000 to 15,000) versus Acres (horizontal axis, 600 to 1,400).]
(12) Based on the fitted model, a farmer who plants 1,000 acres of wheat can expect to harvest,
on average, about
a) 2,950 bushels of wheat.
b) 37,680 bushels of wheat.
c) 40,630 bushels of wheat.
d) 6,718 bushels of wheat.
e) 29,430 bushels of wheat.
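A minimal sketch of the arithmetic behind item (12), assuming the estimates in the Parameter Estimates table are read as bushels (intercept) and bushels per acre (slope). The function name is hypothetical.

```python
# Estimates copied from the Parameter Estimates table for Wheat on Acres.
intercept = 2950.04
slope = 37.68  # estimated change in Wheat per additional acre

def predicted_wheat(acres):
    """Average Wheat predicted by the fitted line at a given acreage."""
    return intercept + slope * acres

print(round(predicted_wheat(1000)))
```

The fitted value combines the intercept with the slope times the acreage; neither term alone gives the prediction.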
(13) The positive estimated intercept along with the regression output indicates
a) That the relationship between Acres and Wheat is nonlinear near the origin.
b) That land that is not in active production nonetheless yields wheat.
c) Outliers have distorted the fit of the line for smaller farms.
d) No serious problem; the intercept is not significantly different from zero.
e) Most of the farms in this sample are relatively small.
(14) The fitted model implies that, on average,
a) Each additional acre planted yields 37.68 bushels of wheat.
b) Farms require about 37.68 acres to produce a bushel of wheat.
c) Each additional acre planted yields 2,950 bushels of wheat.
d) Farms require about 2,950 acres to produce a bushel of wheat.
e) Acreage planted does not significantly affect the amount of wheat produced.
(15) If a farmer were to increase the amount of land planted with wheat from 1,200 to 1,300
acres, then we would expect, on average, that the amount of wheat produced would (assuming
the SRM)
a) Increase between about 3,285 and 4,251 bushels, with 95% confidence.
b) Increase between about 3,758 and 3,778 bushels, with 95% confidence.
c) Increase between about 2,802 and 4,734 bushels, with 95% confidence.
d) Increase between about 3,763 and 3,773 bushels, with 95% confidence.
e) Remain the same.
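One way to approach item (15) is to scale the confidence interval for the slope by the change in acreage. The sketch below assumes the rule-of-thumb multiplier of 2 standard errors for a 95% interval; the exact t quantile with 58 degrees of freedom is slightly larger.

```python
# Slope estimate and standard error from the Parameter Estimates table.
slope = 37.68
se_slope = 4.83
t_multiplier = 2.0  # assumption: rule-of-thumb value for a 95% interval

extra_acres = 1300 - 1200

# Interval for the average change in Wheat over the added acreage.
low = extra_acres * (slope - t_multiplier * se_slope)
high = extra_acres * (slope + t_multiplier * se_slope)

print(f"{low:.0f} to {high:.0f} bushels")
```

The interval concerns the average change in production, so it depends only on the slope and its standard error, not on the intercept or the RMSE.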
(16) Most of the farms in this survey have fewer than 1000 acres of wheat planted, while a few
others are much larger (up to about 1400 acres). This skewness in the number of acres planted
a) Violates the normality assumption of the simple regression model.
b) Produces a lack of constant variance in the response.
c) Introduces dependence in the underlying model errors.
d) Implies that the simple regression model omits an important predictor.
e) Has produced a few leveraged observations.
(17) An extension of this model developed by the Department of Agriculture adds 1 other
predictor, increasing R² to 0.75. This revised model, assuming it satisfies the MRM, would be
able to predict the wheat production of a typical individual farm with 95% probability to within
about
a) Plus or minus 4800 bushels.
b) Plus or minus 6700 bushels.
c) Plus or minus 9600 bushels.
d) Plus or minus 11,300 bushels.
e) Plus or minus 13,400 bushels.
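A hedged sketch of the reasoning behind item (17): the total variation in Wheat is unchanged by adding a predictor, so the new residual variation follows from the new R². The calculation assumes n = 60 from the shown output and approximates the 95% prediction range as plus or minus 2 RMSE, ignoring the uncertainty in the estimated coefficients.

```python
import math

n = 60                 # observations in the shown output
r2_old = 0.512
rmse_old = 6700.0
r2_new = 0.75          # stated in the question
k_old, k_new = 1, 2    # number of predictors before and after

# Recover the total sum of squares from the simple regression summary.
sse_old = rmse_old ** 2 * (n - 1 - k_old)
sst = sse_old / (1 - r2_old)

# Residual sum of squares and RMSE implied by the larger R-squared.
sse_new = (1 - r2_new) * sst
rmse_new = math.sqrt(sse_new / (n - 1 - k_new))

half_width = 2 * rmse_new  # assumption: +/- 2 RMSE approximates the 95% range
print(f"new RMSE about {rmse_new:.0f}, half-width about {half_width:.0f}")
```

The result is approximate; rounding the RMSE before doubling gives a slightly different figure.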
[Figure: histogram (Count) and normal quantile plot of the residuals, vertical axis from -20,000 to 15,000.]
(18) The diagnostic plot shown above graphs the residuals from the fitted model of Wheat on
Acres. This diagnostic plot indicates that
a) The simple regression omits an important predictor.
b) The relationship between Wheat and Acres is not linear.
c) The underlying model errors are apparently dependent.
d) The underlying model errors apparently have constant variance.
e) The underlying model errors are approximately normally distributed.
(19) Suppose that the Department of Agriculture were to conduct a second survey, only this
time sampling 400 farms rather than 100 farms from the same population as the current survey.
In a simple regression of Wheat on Acres using the data from this larger survey we should
expect to find that, in comparison to the shown simple regression,
a) The R² summary statistic will be larger than 0.512.
b) The estimated intercept will be statistically significant.
c) The estimated slope will not be statistically significant.
d) The 95% confidence interval for the estimated slope will be longer.
e) The t-ratio for the estimated slope will be larger.
(20) When expanding the scope of the shown survey, if the Department of Agriculture accepts
the validity of the simple regression model for all US farms and wants to improve the accuracy
of the estimated slope by the most possible, it should seek to
a) Remove small farms that plant less than 650 acres of wheat.
b) Add more typical farms with about 1,000 acres of wheat planted.
c) Add more farms that plant the average number of acres of wheat.
d) Add farms that plant a wider range of acres of wheat than those in this survey.
e) Add farms that are adjacent to and of similar size as those in the shown model.
(21) Analysis indicated that a drought seriously affected farms in the middle of the US, but not those
along the Pacific coast. A drought reduces the yield of wheat per acre. If this expanded sample of 400
farms mixes farms from the central US and Pacific coast, then we should
a) Do a two-sample t-test to see if the effect of the drought is noticeable.
b) Add a categorical variable for location to the model.
c) Add a categorical variable for location and its interaction with size to the model.
d) Use a log transformation to capture the diminishing returns brought by drought.
e) Discard this sample since this mixing distorts the use of the data in regression.
(22) If the wind direction remains constant, then the fitted model estimates that an increase in the wind
speed at MS1 by 10 mph produces, approximately,
a) No effect on the wind speed at the proposed site.
b) An increase of 7.1 miles per hour in the wind speed at the proposed site.
c) An increase of 13.1 miles per hour in the wind speed at the proposed site if the wind is from
the north or east direction.
d) An increase of 13.1 miles per hour in the wind speed at the proposed site if the wind is from
the south or west direction.
e) A decrease of 1.3 miles per hour in the wind speed at the proposed site if the wind is from the
south or west direction.
(29) Were there collinearity between MS1 and Direction, which of the following statements would
provide the best interpretation?
a) The impact of wind speed at MS1 on the development site depends on the wind direction.
b) The speed at MS1 is associated with the direction the wind is blowing.
c) On windless days the wind always blows from the North.
d) Knowing the wind speed on a Monday at MS1 provides information about the wind speed for
the following Tuesday.
e) Collinearity is a meaningless concept when a categorical variable is involved.
(30) Comment on the residual plot shown with the summary of the estimated model:
a) As this data is observed over time, the residuals show autocorrelation.
b) The apparent U shape in the residual plot shows the model's lack of fit.
c) The plot does not indicate violations of the underlying regression assumptions.
d) There is clear evidence of heteroscedasticity in the plot.
e) This plot shows that outliers have distorted the model and made it unreliable.
(31) The following graph is the leverage plot for MS1.
Questions 33-37. The analysts responsible for the project decided to add an interaction term
between Direction and the wind speed at MS1 to the model. Output from this model follows.
Crosses in the plots below indicate a day when Direction was SorW and dots represent when
the Direction was NorE.
Summary of Fit
RSquare 0.523
Root Mean Square Error 2.95
Mean of Response 15.66
Observations (or Sum Wgts) 61
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model       3         546.1316       182.044   20.8666     <.0001
Error      57         497.2791         8.724
C. Total   60        1043.4107
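The entries in the Summary of Fit and Analysis of Variance tables are tied together by a few identities. The following sketch recomputes the F ratio, R² and RMSE from the sums of squares and degrees of freedom shown above; small discrepancies come from rounding in the printed output.

```python
import math

# Values copied from the Analysis of Variance table.
ss_model, df_model = 546.1316, 3
ss_error, df_error = 497.2791, 57
ss_total = ss_model + ss_error

ms_model = ss_model / df_model      # mean square for the model
ms_error = ss_error / df_error      # mean square error
f_ratio = ms_model / ms_error       # overall F ratio
r_squared = ss_model / ss_total     # share of variation explained
rmse = math.sqrt(ms_error)          # root mean square error

print(f"F = {f_ratio:.2f}, R^2 = {r_squared:.3f}, RMSE = {rmse:.2f}")
```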
(33) In the fitted multiple regression model, the terms denoted MS1*Direction imply that
a) MS1 and Direction are correlated (confounded).
b) The association between changes in wind speeds at MS1 and the development site depends
upon the wind direction.
c) It is always windy at the development site when it is windy at MS1.
d) Interaction terms are meaningless for categorical variables.
e) The day of the week improves predictive accuracy of wind speed at the development site.
(34) Assuming the MRM, does the addition of the term MS1*Direction significantly improve the
accuracy of the model when predicting new observations?
a) No, because R² must always increase.
b) Yes, because the 95% confidence interval for the interaction term does not contain zero.
c) No, because the residual plot shows a clear breakdown in model assumptions.
d) Yes, because the overall F-ratio is significant.
e) No, because the value of the RMSE is too large.
(35) On a day when the wind direction is NorE and the wind speed at MS1 is 10 mph, the shown
model predicts wind speed at the development site to be, approximately,
a) 15.66 mph.
b) 9.28 mph.
c) 1.62 mph.
d) 11.04 mph.
e) 5.45 mph.
(36) The plot of the residuals from the model shown on the previous page implies that
a) Outliers have made the estimated model unreliable.
b) There is more variability in predicted wind speed at the development site on days when the
wind is from the NorE direction.
c) The variance of the residuals depends on the wind direction.
d) The data are dependent observations due to the presence of autocorrelation.
e) There is evidence of a non-linear effect of MS1 on wind speed.
(37) On calm days, the wind speed at MS1 is less than 10 miles per hour; on windy days, it blows at
more than 25 miles per hour at MS1. Based on the fitted model, the highest estimated average wind
speeds at the proposed site are observed when the wind direction at MS1 is
a) On calm days from the North or East, and on windy days from the South and West.
b) On calm days from the South or West, and on windy days from the North and East.
c) On calm days from North and East, and on windy days from the North and East.
d) On calm days from South and West and on windy days from South and West.
e) On calm days from North, East, South and West and on windy days from the North and East.
Questions 38-44. A company makes cash advances to small businesses. One of the outcome
metrics that the company uses to judge the quality of an advance is the performance of the
advance after the first six months. Performance is defined as the ratio of the actual amount paid
back to the expected amount paid back after six months at the time that the loan is made.
Ideally, this performance metric (called PRiSM, for performance ratio in six months) should be
1, or close to 1.
The company is interested in identifying predictors of PRiSM. They have developed a
regression model of PRiSM using several explanatory variables. The explanatory variables that
appear in the regression model are defined in the following table. A summary of the regression
model follows the table. All logarithms refer to natural logarithms (base e).
Variable                          Definition
Loan type                         Either a first [O for original] loan made to this merchant or a renewal [R] loan.
FICO                              Fair-Isaac credit score of the individual guaranteeing the loan.
Years in business                 The number of years that the merchant has been in business.
Current delinquent credit lines   Number of current delinquent or derogatory lines of credit.
Business type                     Legal entity, including corporation (Corp), limited liability corporation (LLC), sole proprietor, or partnership.
Median income                     The income per household in the zip code where the merchant does business (recorded in dollars).
ISO identifier                    Two values, identifying whether the ISO is Hawthorne (H) or a different ISO.
Payment stress                    Ratio of the expected monthly payment to the typical volume of monthly credit card transactions.
Summary of Fit
RSquare 0.276602
Root Mean Square Error 0.146335
Mean of Response 0.937145
Observations (or Sum Wgts) 990
Analysis of Variance
Source     DF    Sum of Squares   Mean Square   F Ratio   Prob > F
Model       14         7.983223      0.570230   26.6289     <.0001
Error      975        20.878589      0.021414
C. Total   989        28.861812
Effect Tests
Source                             Nparm   Sum of Squares    F Ratio   Prob > F
Log(Years Business)                    1        2.2888226   106.7739    <.0001*
Loan.Type                              1        1.0054689    46.9053    <.0001*
Business Type                          3        0.8628517    13.4174    <.0001*
ISO Identifier                         1        0.4845248    22.6032    <.0001*
FICO                                   1        0.4672788    21.7986    <.0001*
Loan.Type*FICO                         1        1.0332935    48.2033    <.0001*
Log(Years Business)*BusinessType       3        0.4802592     7.4681    <.0001*
Current.Delinquent.Credit.Lines        1        0.0530002     2.4725     0.1162
Payment Stress                         1        0.3281044    15.3061    <.0001*
Log Median Income                      1        0.7113217    33.2177    <.0001
(38) The shown model predicted the PRiSM score for a possible loan to be ŷ = 1.15. Based on the
summary of the fitted model and the assumptions of the MRM, we can conclude that the probability
that the actual PRiSM for this loan is larger than 1 is approximately
a) 83%.
b) 95%.
c) 5%.
d) 50%.
e) 67%.
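A rough sketch of the calculation behind item (38), assuming the prediction errors are approximately normal with standard deviation equal to the RMSE from the Summary of Fit (this ignores the extra uncertainty from estimating the coefficients).

```python
from statistics import NormalDist

predicted = 1.15
rmse = 0.146335    # Root Mean Square Error from the Summary of Fit
threshold = 1.0

# Number of RMSEs by which the threshold sits below the prediction.
z = (threshold - predicted) / rmse

# Probability that the actual value exceeds the threshold under normality.
prob_above = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, P(actual > 1) = {prob_above:.2f}")
```

The threshold lies about one RMSE below the prediction, so the empirical rule gives a similar figure without software.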
(39) When comparing loan applications from businesses that are identical but for the fact that one is a
renewal and one is an original loan, the estimated model implies that a 10-point increase in the FICO
score
a) Has no effect on expected PRiSM for either application.
b) Increases the expected PRiSM for the original loan by about 0.0010.
c) Decreases the expected PRiSM for either loan by about -0.0088.
d) Increases the expected PRiSM for the original loan by about 0.00977.
e) Increases the expected PRiSM for the original loan by about 0.0105.
(40) The p-value of the intercept implies that, assuming the MRM,
a) 12.42% of loans result in a negative PRiSM score.
b) There is a 12.42% chance that these data are a sample from a population in which β0 = 0.
c) We should refit the model without the intercept, which is a large extrapolation.
d) The estimated intercept is not statistically significantly different from zero.
e) The probability that the population intercept is zero is 0.1242.
(41) The best interpretation of the estimated coefficient for the logarithm of median income is,
given fixed levels of the other explanatory variables, that
a) A $100 increase in median income increases the estimated PRiSM by about 7.4.
b) A 1% increase in median income increases the estimated PRiSM by about 0.074%.
c) A 1% increase in median income increases the estimated PRiSM by about 0.00074.
d) A $100 increase in median income increases the estimated PRiSM by about 0.074%.
e) A 1% increase in median income increases the estimated PRiSM by about 0.074.
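To illustrate how a coefficient on a logged predictor is read, the sketch below converts a 1% increase in median income into a change in the response. The coefficient table is not reproduced in this section, so the value 0.074 is an assumption inferred from the answer choices, not a figure taken from shown output.

```python
import math

# Assumption: coefficient of Log Median Income, inferred from the answer choices.
b_log_income = 0.074

# A 1% increase multiplies income by 1.01, so its natural log rises by ln(1.01).
change_in_log = math.log(1.01)

# Estimated change in PRiSM, holding the other explanatory variables fixed.
change_in_prism = b_log_income * change_in_log

print(f"change in estimated PRiSM: {change_in_prism:.5f}")
```

Since ln(1.01) is close to 0.01, the change is roughly the coefficient divided by 100, measured in units of PRiSM rather than in percent.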
(42) A reasonable next step to take in the development of this model would be to
a) Remove the intercept from the model since it is not statistically significant.
b) Remove Log(Years Business)*Business.Type[Partnership] because it has the largest p-value.
c) Reduce the number of types of businesses distinguished in the fitted model.
d) Remove the number of delinquent credit lines from the model.
e) Check that the distribution of estimated PRiSM scores is approximately normal.
(43) The estimated PRiSM score for an original loan application from ISO Hawthorne for a 1-year-old
corporation with no delinquent credit lines, a FICO score of 600, in a community with $60,000 median
income, and stress level 0.10 is approximately
a) 0.83
b) 0.60
c) 0.12
d) 1.42
e) 0.44
(44) The following two graphs are leverage plots for the fitted model.
[Figure: leverage plot of PRiSM Leverage Residuals (vertical axis, 0.4 to 1.5) versus FICO Leverage (horizontal axis, 400 to 800), P<.0001.]