
Lab 3 Machine Learning

Nikhilesh Prabhakar
16BCE1158

Datasets used

Apart from the usual House Prices dataset used in previous lab
submissions, the dataset I worked on for this one was the Housing
dataset presented in Sebastian Raschka's "Python Machine Learning".

Link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data.
There are 79 variables describing almost every aspect of residential
homes for sale in Iowa (at the time the data was collected).

Link for the second dataset:
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data

Methodology

Step 1: Importing the required Packages


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline  # render plots inline in the notebook

Step 2: Cleaning the data


This was covered in Lab 1's submission. Additionally, all string
columns were converted to numeric form using pandas' get_dummies()
function, which one-hot encodes a categorical column by creating a
separate 0/1 indicator column for each unique string value.
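A minimal sketch of this step, assuming the raw data is loaded into a
DataFrame named data, with fdata being the cleaned frame used in the
rest of this report:

data = pd.read_csv("train.csv")

# One-hot encode every object (string) column: each unique value
# becomes its own 0/1 indicator column; numeric columns are untouched.
fdata = pd.get_dummies(data)

# fdata now contains only numeric columns.
print(fdata.dtypes.value_counts())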

Step 3: Using a heatmap


The Pearson correlation coefficient is computed between SalePrice
and every other column in the dataset. The columns with an absolute
correlation above 0.5 were plotted on the heatmap as shown.

corrmat = fdata.corr()
top_corr_features = corrmat.index[abs(corrmat["SalePrice"]) > 0.50]
plt.figure(figsize=(10, 10))
g = sb.heatmap(fdata[top_corr_features].corr(), annot=True)
# SalePrice is most correlated with OverallQual, GrLivArea,
# GarageCars, GarageArea, TotalBsmtSF and 1stFlrSF
Step 4: The Linear Regression Model
There are 3 methods covered in this lecture for fitting a linear
regression model.

Method 1: OLS (Ordinary Least Squares)

from statsmodels.formula.api import ols
from IPython.display import HTML, display

housing_model = ols("SalePrice ~ OverallQual", data=fdata).fit()
The Adj. R-squared value indicates that about 62% of the variation in
housing prices can be explained by our predictor variable.
The standard error measures the accuracy of OverallQual's coefficient
by estimating how much the coefficient would vary if the same test
were run on a different sample of our population. Our standard error,
5756.407, is extremely high, and therefore a simple linear model does
not suit our data well.
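These figures come from the fitted model's summary table. A minimal
sketch of how it can be displayed in the notebook (summary() and its
as_html() rendering are standard statsmodels calls; the HTML/display
imports above suggest this is how it was shown):

# Full regression table: coefficients, standard errors,
# R-squared / Adj. R-squared, t-statistics and p-values.
display(HTML(housing_model.summary().as_html()))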
More experiments with OLS can be seen in the Python notebook; the
statistical output of a multiple regression can be viewed in the
same way.
Method 2: sklearn’s Linear Regression Function

from sklearn.linear_model import LinearRegression

X = fdata[["OverallQual"]]
Y = fdata[["SalePrice"]]
clf = LinearRegression()
clf.fit(X, Y)
Out[174]: LinearRegression(copy_X=True, fit_intercept=True,
          n_jobs=1, normalize=False)

data_test = pd.read_csv("train.csv")
X_test = data_test[["OverallQual"]]
Y_test = data_test[["SalePrice"]]
clf.score(X_test, Y_test)
Out[179]: 0.62544678976769652

The score() method returns the R-squared value, similar to the Adj.
R-squared value seen in OLS.


Y_pred = clf.predict(X_test)
plt.scatter(X_test, Y_test, color='black')
plt.plot(X_test, Y_pred, color='blue', linewidth=3)
# OverallQual is a discrete attribute
The same code was tried with another attribute, "GrLivArea", and a
similar plot was produced.
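A minimal sketch of that repeated fit (the variable names X_area and
clf_area are mine; everything else follows the OverallQual code above):

X_area = fdata[["GrLivArea"]]
clf_area = LinearRegression()
clf_area.fit(X_area, Y)

plt.scatter(X_area, Y, color='black')
plt.plot(X_area, clf_area.predict(X_area), color='blue', linewidth=3)
# Unlike OverallQual, GrLivArea is continuous, so the points spread
# along the x-axis instead of forming vertical bands.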
Method 3: User-defined function given in Sebastian Raschka's
book

class LinearRegressionGD(object):
    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta        # learning rate
        self.n_iter = n_iter  # number of passes over the data

    def fit(self, X, y):
        # w_[0] is the bias term, w_[1:] the feature weights
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            # Batch gradient-descent update on weights and bias
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            # Sum-of-squared-errors cost, halved for a cleaner gradient
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)

def lin_regplot(X, y, model):
    plt.scatter(X, y, c='blue')
    plt.plot(X, model.predict(X), color='red')
    return None
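Before lin_regplot can be called below, the Housing data must be
standardized and the model trained. A minimal sketch, assuming the UCI
Housing file has been loaded into a DataFrame df with the column names
used in the book (RM, MEDV); the names X_rm and y_medv are mine, while
X_std, y_std and lr match the usage that follows:

from sklearn.preprocessing import StandardScaler

X_rm = df[['RM']].values
y_medv = df['MEDV'].values

sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(X_rm)
# StandardScaler expects a 2-D array, hence the reshape/flatten
y_std = sc_y.fit_transform(y_medv.reshape(-1, 1)).flatten()

lr = LinearRegressionGD()
lr.fit(X_std, y_std)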
lin_regplot(X_std, y_std, lr)
plt.xlabel('Average number of rooms [RM] (standardized)')
plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
plt.show()
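Since the class records the sum-of-squared-errors cost at every epoch,
a quick convergence check (not in the original notebook, but it uses
only the cost_ list defined above) is to plot it:

# The cost should fall steadily and flatten out if eta is well chosen.
plt.plot(range(1, lr.n_iter + 1), lr.cost_)
plt.xlabel('Epoch')
plt.ylabel('SSE')
plt.show()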
# For comparison: sklearn's closed-form LinearRegression
slr = LinearRegression()
slr.fit(X, y)
print('Slope: %.3f' % slr.coef_[0])
print('Intercept: %.3f' % slr.intercept_)
Slope: 107.130
Intercept: 18569.026

The same was done with "GrLivArea".


The same code was also tried on the example given in the textbook.
A Python notebook, along with the snippets below, is provided as
proof of execution.
