Predicting Prices in the Iowa Housing Market (Regularized Linear Regression)

Erik Bebernes
Introduction

This project asks a common question in the field of predictive analytics: what are houses worth?
Identifying the true price of a home is important in preventing a housing bubble, such as the one
that plagued our country in 2008 and ultimately led to a recession. The data I'm using comes
from Kaggle, and looks specifically at houses in Ames, Iowa. There are 81 variables, with the
response variable being Sale Price. I worked on a problem similar to this as an undergraduate
student in an econometrics class, and although I really enjoyed it, I hadn't the slightest clue what
I was doing. Now that I am more knowledgeable about multiple-regression analysis, I should
be able to come up with some fairly accurate predictions. Before I begin, here is a look at the 81
variables I'll be working with.

My plan of attack on this project is as follows:


1.) Identify any missing data (both missing at random and not at random) and impute new
data accordingly.
2.) Remove any outliers to reduce model complexity and avoid overfitting.
3.) Run a multiple-regression model, using backward selection to drop variables until every
remaining predictor has a p-value below .05.
4.) Try a regularized linear model.

Identifying Missing Data and Cleaning It

The first thing I like to do in a lot of my projects is to run a missmap (from the Amelia package)
on the dataset to see how much of the data is NA.
A handful of variables are nearly completely missing; let's see what they are and why.
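A minimal sketch of this first step, assuming the Kaggle training file is named "train.csv":

```r
# Load the Amelia package for its missingness map
library(Amelia)

# Read the Kaggle training data (filename is an assumption)
train <- read.csv("train.csv", stringsAsFactors = TRUE)

# Visualize which cells of the dataset are NA
missmap(train, main = "Missingness in the Ames housing data")
```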

The variables with the most missing values are Alley (type of alley access), PoolQC (pool
quality), FireplaceQu (fireplace quality), Fence (fence quality), LotFrontage (linear feet
of street connected to the property) and MiscFeature (miscellaneous feature not covered in other
categories). The descriptions of these variables make it obvious that the data is not missing at
random, because the values are conditional on whether or not the house has that feature to begin
with. The same can be said for all of the missing values related to garages and basements: these
are the houses that don't have garages or basements. It's also worth noting that the number of
NAs is equal across related categories (i.e., all of the garage variables have 81 missing values).
There is an easy fix for this. I'm going to replace NAs in factor variables with "None" and NAs
in numeric variables with 0.
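The imputation described above can be sketched in a short loop over the columns:

```r
# Replace NAs: the level "None" for factor columns, 0 for numeric columns
for (col in names(train)) {
  if (is.factor(train[[col]])) {
    # Add "None" as a legal level before assigning it
    levels(train[[col]]) <- c(levels(train[[col]]), "None")
    train[[col]][is.na(train[[col]])] <- "None"
  } else if (is.numeric(train[[col]])) {
    train[[col]][is.na(train[[col]])] <- 0
  }
}
```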
Removing Outliers
By making scatterplots of the numeric variables against Sale Price, I'll be able to identify any
outliers and remove them from the dataset. This will simplify the model and reduce
overfitting when it comes to making predictions.
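As a sketch of this step for one numeric variable (the choice of GrLivArea and the 4,000 sq ft cutoff are illustrative, not the exact thresholds used here):

```r
# Plot one numeric predictor against the response to spot outliers
plot(train$GrLivArea, train$SalePrice,
     xlab = "GrLivArea", ylab = "SalePrice")

# Drop the extreme points (cutoff chosen by eye from the plot)
train <- train[train$GrLivArea < 4000, ]
```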
Multi-Regression Model
In developing my linear model, I used a backward selection method: I started by including
all of the independent variables and gradually removed the insignificant ones (those with a
p-value greater than .05).
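A sketch of one backward-selection step in base R (note that R's built-in step() function selects by AIC rather than p-values, so dropping by p-value is done manually; the variable dropped below is illustrative):

```r
# Fit the full model with every predictor
full <- lm(SalePrice ~ ., data = train)
summary(full)  # inspect coefficient p-values

# One manual backward step: refit without the weakest predictor
reduced <- update(full, . ~ . - PoolArea)
```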

The adjusted R-squared (which penalizes the model for carrying extra variables) is .8718,
meaning roughly 87% of the variance in Sale Price can be explained by the model. The model as
a whole has a p-value of < 2.2e-16, making it significant. Time to make my prediction and see
how it stands up in the Kaggle rankings.

After submitting my prediction, I was only in the 13th percentile for accuracy. This is due to
the high variable-to-observation ratio, which leads to overfitting. To account for this I will
attempt to build a regularized linear model using the caret package in R, but in order to do so I
need to convert the factor variables into dummy (0/1) variables.
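The dummy-variable conversion can be done with caret's dummyVars helper (the variable names below are placeholders for the cleaned training set):

```r
library(caret)

# dummyVars expands every factor column into 0/1 indicator columns
dmy <- dummyVars(~ ., data = train)
train_num <- data.frame(predict(dmy, newdata = train))
```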
Regularized Linear Model

Regularizing my model greatly improved my accuracy (now I'm in the 67th percentile on
Kaggle's leaderboard).
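A sketch of fitting a regularized (elastic-net) model with caret's train function; the cross-validation setup and tuning grid here are illustrative choices, not the exact settings used:

```r
library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 5)

# glmnet fits ridge (alpha = 0), lasso (alpha = 1), and blends in between;
# lambda controls the strength of the penalty
fit <- train(SalePrice ~ ., data = train_num,
             method = "glmnet", trControl = ctrl,
             tuneGrid = expand.grid(alpha  = c(0, 0.5, 1),
                                    lambda = 10^seq(-3, 0, length = 20)))

fit$bestTune  # the alpha/lambda pair chosen by cross-validation
```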
