
Ordinary Least Squares Regression Analysis

Multivariate Regression Analysis using Microsoft Excel 2010


This guide shows how to use the regression tool in Microsoft Excel 2010 to predict the value of one dependent variable from two or more independent variables. It gives a brief overview of ordinary least squares analysis and multivariate regression coefficients, followed by step-by-step directions for running a regression analysis and for working out predicted values by hand. The guide has five parts: an introduction, a description of the equipment, a list of materials needed, in-depth instructions with accompanying graphics, and a troubleshooting section.

Trey Duane Kjono Toshiba

I. Introduction

This brief manual introduces Ordinary Least Squares regression analysis and shows how to run a very basic multivariate regression in Microsoft Excel 2010. Ordinary Least Squares (abbreviated from here on as OLS) is one of the most widely used statistical techniques for finding relationships between variables. Regression analysis, in which OLS is used, models and examines relationships between variables, helps explain the factors behind observed patterns, and aids in making predictions. Regression analysis is used in many fields. Some examples:

• Modeling automobile accidents as a function of speed, road conditions, weather, time of day, etc. in order to create a policy intended to decrease accidents.
• Forecasting the dollar amount banks should expect to give out in automobile loans as a function of an area's unemployment rate, labor force, current prime interest rates, cost of a gallon of gasoline, etc. to predict how much money customers would like to borrow from the bank.
• Testing whether high school grade point average and SAT scores help predict a student's collegiate performance, and then estimating a particular prospective student's ending college GPA based on the three independent variables discussed below.

For ease, this manual uses the third example as its hypothesis. We will look at a collection of randomly chosen high school math and verbal (now known as Critical Reading) SAT scores and ending high school grade point averages as the independent variables (x), and ending college grade point averages as the dependent variable (y). The goal is to determine whether a correlation exists between a student's ending college GPA and their math and verbal SAT scores and high school GPA. If there is a correlation, we can then build a mathematical equation to predict other prospective students' ending college grade point averages from their SAT scores and high school GPA.

Upon completion, readers will have an entry-level idea of what OLS regression analysis is and how to perform beginner-level tests using a common spreadsheet program, Microsoft Excel 2010. Readers will also learn the terms used in regression analysis and how they are useful. The manual is broken down into five parts, with headings and subheadings to aid the process: I) Introduction, II) Description of Equipment, III) Materials Needed, IV) Directions, and V) Troubleshooting.

II. Description of Equipment

Microsoft Excel is a widely used spreadsheet program. Among its many built-in tools is the regression tool used in this manual, which is available under the Data Analysis feature. More detailed descriptions, with accompanying graphics, are provided in the Directions section.

III. Materials Needed

Materials needed for running regression analysis through Microsoft Excel:

• A data set containing numerical values.
  o In this instruction manual a data set is provided by the author for readers to follow.
• A laptop or desktop PC.
  o This manual is not intended for Apple operating systems.
• Microsoft Excel 2010 spreadsheet software, installed and operable.
  o If Excel is not already installed, a one-month free trial including Word and PowerPoint is available at http://office.microsoft.com/en-us/excel/
• A writing utensil and scratch paper for notes.

Figure 1: Completed Regression Analysis using MS Excel 2010

IV. Directions

A. Opening Microsoft Excel 2010

To open Microsoft Excel, move your cursor to the Start menu and click it. Type Excel into the search field that appears. This brings up a list of suggestions and previous files, including a link that opens Excel. The image to the right shows how the process appears. Open Excel.

B. Setting up Regression Capabilities


Once Microsoft Excel is open, make sure that the regression tool is installed and turned on. If it is not, we can install and activate it. Go to the File tab and select Options, as demonstrated to the left. Clicking Options opens a new window titled Excel Options. Selecting Add-Ins in that window brings up a list of available add-in programs. Analysis ToolPak will be listed either under Active Application Add-ins or under Inactive Application Add-ins. It is likely to be under Inactive Application Add-ins if this is the first time the tool has been used. Please note that the screen in the demonstration is slightly different because the author has already installed and activated the add-in.

Select the Analysis ToolPak add-in and click the button that reads Go, as shown in the graphic to the left. In the Add-Ins window that opens, make sure the Analysis ToolPak box is checked, then select OK. In the author's case all that is required is to select OK, as the add-in has already been installed.

C. Inserting Data Points


A blank spreadsheet is now shown on the screen. The next step in creating our linear regression is to insert the data points. For ease of instruction I will provide a data set as an example. In this entry-level test we will look at a collection of 30 observations containing ending college and high school grade point averages, as well as SAT scores in math and verbal skills. The numerical values and the label for each variable are shown to the right. Column A holds high school GPAs, Column B holds math SAT scores, Column C holds verbal SAT scores, and Column D holds the dependent variable, ending college grade point averages. Enter the figures exactly as shown to the right, in the same format, so they can be used in this regression test. GPA-High represents the ending high school grade point average, while GPA-Univ represents the ending college grade point average. SAT-Math and SAT-Verb label the scores obtained on the mathematics and verbal [Critical Reading] portions of the SAT.

D. Building the Regression


Once the data points are entered, the program can calculate the regression for us. To do this, select the Data tab, then move the cursor to the far right of the ribbon and select Data Analysis. A new, small window will appear on the screen. Among the choices, listed in alphabetical order, is an option titled Regression. This is the option we will use for our linear regression, which Excel estimates by Ordinary Least Squares. Because we have three independent variables, this is a multiple rather than a simple linear regression. Once the Regression option is highlighted, press OK. A graphic is provided below for guidance.

E. Designing the Output


After you select OK in the previous step, a new dialogue box titled Regression will appear. The first field we need to fill out is Input Y Range, which specifies the dependent variable and its data. The dependent variable, ending college GPA, is what we are testing for. We hypothesize that high school GPA and math and verbal SAT scores are positively correlated with ending college GPA.

In the Input Y Range location, select the small button next to the field; this collapses the dialogue box to a small bar. Click and drag down all of the data in the GPA-Univ column, including the label GPA-Univ. This covers cells D1 to D31. The final appearance is shown in the figure above and to the right.

The Input X Range specifies the independent variable(s). We will select high school GPAs, math SAT scores, and verbal SAT scores for this field. As in the previous step, click the small button next to the empty field and drag across the ranges that contain our independent variables. (For this data set, that is A1 to C31.) Please note that once the ranges are selected, the dialogue box displays them as absolute cell references, with a dollar sign before each column letter and row number (for example, $D$1:$D$31). This can be seen in the graphic above.

Select the Labels option. This tells the program that the first row of each selected range contains a label rather than data, and that these labels should be used to name the variables in the output. The labeled output is much easier to read, and we avoid the extra steps that would be needed had we not selected the option.

The Output options category determines how the results are displayed. Under Output options, select New Worksheet Ply. This opens our regression results in a new sheet, reducing clutter and confusion. Note: when the new sheet opens, we will not be able to see our data set alongside it. To access the data set, go to the bottom left of Excel and select Sheet1. The graphic on the next page shows a downward-pointing arrow marking where to switch between sheets.

The Residuals category allows graphs and additional values to be displayed along with our results. Under Residuals, check the box labeled Line Fit Plots. This option creates one graph for each independent variable, with that independent variable along the x-axis and the dependent variable along the y-axis. Check the box Standardized Residuals as well. For this experiment we do not need any other boxes checked under Residuals. The Normal Probability Plots box below does not need to be checked; it is not necessary for this test. Press OK.
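For readers curious about what the Regression tool calculates once OK is pressed, the following is a minimal Python sketch of an ordinary least squares fit. It is an illustration only, not Excel's actual implementation. The function name `fit_ols` and the toy numbers are hypothetical and are not the manual's 30-observation data set. The sketch solves the normal equations directly, which is adequate for a small, well-behaved example.

```python
def fit_ols(x_rows, y_values):
    """Return [b0, b1, ..., bk]: the intercept and slopes that minimize
    the sum of squared differences between observed and predicted y."""
    # Design matrix: a leading 1.0 in every row stands for the intercept.
    design = [[1.0] + [float(v) for v in row] for row in x_rows]
    k = len(design[0])

    # Build the normal equations (X'X) b = X'y as an augmented matrix.
    aug = []
    for i in range(k):
        left = [sum(r[i] * r[j] for r in design) for j in range(k)]
        right = sum(r[i] * y for r, y in zip(design, y_values))
        aug.append(left + [right])

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("independent variables are linearly dependent")
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]

    return [aug[i][k] / aug[i][i] for i in range(k)]


# Hypothetical toy data (NOT the manual's data set), generated exactly
# from y = 1 + 2*x1 - 0.5*x2 + 0.25*x3 so the fit can be checked by eye.
toy_x = [(1, 2, 3), (2, 1, 0), (0, 1, 1), (3, 0, 2), (1, 1, 4), (2, 3, 1)]
toy_y = [1 + 2 * a - 0.5 * b + 0.25 * c for a, b, c in toy_x]
coefficients = fit_ols(toy_x, toy_y)  # approximately [1, 2, -0.5, 0.25]
```

The first value returned corresponds to the Intercept row of Excel's output, and the remaining values correspond to the rows for each independent variable, in column order.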

F. Data Results and Modification


1. Column/Row Width and Height

If the previous steps were followed correctly, a new sheet should appear on your screen. The data set is still in Excel, as noted previously, but it is on the old sheet. The yellow arrow in the graphic below shows where to find it if you have not already seen it.

Now, for better viewing, we need to adjust the row and column sizes to fit all the given information. A good size for this example is a width of 17.3 for columns A and B, and 13 for columns C to I. Row height can remain at the default setting. To adjust a column's size, right-click the letter of the column you wish to adjust, select Column Width, and enter the desired size.

2. Adjusting the Graphs

The next step in understanding our regression analysis is adjusting the graphs for easier viewing. To start, double-click the x-axis of one of the three graphs created. As an example, we will demonstrate the change for the SAT-Verb Line Fit Plot graph. A new window called Axis Options will appear. Within the Axis Options window, the settings labeled Minimum and Maximum each offer a choice called Fixed. Select Fixed for both. In the Minimum field, enter a number slightly smaller than the lowest verbal SAT score in our data; in the Maximum field, enter a number slightly larger than the highest verbal SAT score. A good starting point is 485 for the minimum and 735 for the maximum. Do the same for the y-axis, but this time use the lowest and highest university GPA. The author suggests 2.0 and 4.0 for the minimum and maximum values. The process for adjusting the x-axis is shown to the right.

Follow these same steps to resize the other two graphs, using the lowest and highest values for their respective variables. For the SAT-Math Line Fit Plot graph, the author recommends an x-axis minimum and maximum of 550 and 720. The y-axis will be 2.0 and 4.0, the same as the previous graph. In the GPA-High Line Fit Plot graph, the x-axis and y-axis both have the same minimum and maximum values: 2.0 and 4.0. Press Close.

3. Adding a Trendline

After making the graphs' axes manageable and organized, the next step in visualizing our data is to add a trendline and an individual regression estimate. To do this, left-click one of the red data points shown on a graph. If done correctly, all the predicted-value data points in that graph will be highlighted. Now right-click on the same spot and select Add Trendline. This process can be seen in the small graphic to the right.

In the new window displayed, select Linear. Under the Trendline Name category, select Automatic; entering a specific name is unnecessary for our basic test. This window and the options chosen can be seen in the larger graphic immediately to the right on this page. At the bottom of the window, select Display R-squared value on chart. This displays an R-squared value for the trendline, but that value involves just one of the independent variables, so it will differ from the R-Square of the full regression. Select Close. You should now see a straight line passing through your predicted-value data points, along with a number that indicates how closely the line fits them. Repeat these steps for the other two graphs and then arrange the graphs on the spreadsheet for easier viewing. The author's example is shown below.

G. Examining the Regression Equation


Now that the graphs are organized and manageable, we can start our econometrics lesson. If you are seeing some of these terms for the first time, you may be confused. Do not worry; we will cover what they mean.

R-Square is called the coefficient of determination, and it is a central statistic in OLS analysis. In our example the R-Square value is .797. We can interpret this as follows: for these observations, the regression model explains 79.7% of the variation in the dependent variable. The Adjusted R-Square is slightly lower. It is never higher than R-Square, because it applies a penalty for each independent variable included in the model. We combined three independent variables, which increased the complexity of the function.

Looking lower on the spreadsheet, as well as at the graphic at the top of the page, you will come across an unnamed table whose four rows begin with Intercept. This table names and calculates the coefficients, written mathematically as $b_0$ through $b_3$. Each coefficient estimates how strongly one explanatory (independent) variable is related to the dependent variable. In this case the model is linear, but in other analyses it could be a polynomial equation, a logarithmic equation, or a spatial regression equation.

The four rows of this table are:

• Intercept: what statisticians and economists identify as the $b_0$ term. This is where the regression line crosses the y-axis, that is, the predicted value of the dependent variable when every independent variable equals zero.
• GPA-High, or $b_1$: the first regression coefficient. It is the slope for high school GPA, or the difference in the predicted value of the dependent variable following a one-unit increase in high school GPA when all other independent variables are held constant.
• SAT-Math, or $b_2$: the second regression coefficient. Like $b_1$, it is the slope for that particular variable, in this case the difference in predicted college GPA following a one-point increase in math SAT score when all other independent variables are held constant.
• SAT-Verb, or $b_3$: the same as $b_2$, except that we are examining the critical reading SAT score and how a one-point increase affects the dependent variable.

The ANOVA table, short for Analysis of Variance, is not used directly in this manual, though it contains some terms that readers should be aware of:

• Regression: the regression sum of squares. For each observation, this takes the difference between the value predicted by the line of best fit and the mean of the dependent variable, squares it, and totals the results. It measures the variation that the model explains. These numbers were calculated automatically when we ran the regression.
• Residual: the residual sum of squares, which measures the variation in our data set that is not explained by the regression model. For each observation, it takes the error between the value predicted by the regression function and the actual value, squares it, and totals the results.
• Total: simply the regression and residual sums of squares added together.
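To make the relationship between the ANOVA terms, R-Square, and Adjusted R-Square concrete, here is a small Python sketch. The function names and the four toy observations are hypothetical and are not taken from the manual's data set. The adjusted formula shown is the standard one, 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables; that Excel reports exactly this quantity is assumed here rather than taken from the manual.

```python
def adjusted_r_square(r_square, n_observations, n_predictors):
    """Penalize R-Square for the number of independent variables used."""
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1 - (1 - r_square) * penalty


def regression_summary(observed, predicted, n_predictors):
    """Compute the ANOVA sums of squares and both R-Square statistics."""
    n = len(observed)
    mean_y = sum(observed) / n
    # Explained variation: predicted values versus the mean of y.
    ss_regression = sum((p - mean_y) ** 2 for p in predicted)
    # Unexplained variation: observed values versus predicted values.
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total variation: observed values versus the mean of y.
    ss_total = sum((o - mean_y) ** 2 for o in observed)
    r_square = 1 - ss_residual / ss_total
    return {
        "ss_regression": ss_regression,
        "ss_residual": ss_residual,
        "ss_total": ss_total,
        "r_square": r_square,
        "adjusted_r_square": adjusted_r_square(r_square, n, n_predictors),
    }


# Hypothetical toy example: y observed at x = 0, 1, 2, 3, with predictions
# from the least-squares line y = 1.3 + 0.8x fitted to those four points.
summary = regression_summary([1, 3, 2, 4], [1.3, 2.1, 2.9, 3.7], 1)
```

Applying `adjusted_r_square` to the figures reported in this manual (an R-Square of .797, 30 observations, and three independent variables) gives roughly .774. The value in Excel's output may differ slightly in the later decimals because .797 is itself rounded.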

H. Interpreting the Results


With the information that was calculated for us, we can now build a formula to predict a student's ending college grade point average from their high school GPA and SAT scores. The model explains nearly 80% of the variation in college GPA among our observations, a relatively strong fit. The equation used is:

$$Y_i = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \varepsilon_i$$

$Y_i$ is the college GPA; $b_0$ is the y-intercept; $b_1$, $b_2$, and $b_3$ are the regression coefficients that multiply high school GPA ($x_1$), math SAT score ($x_2$), and verbal SAT score ($x_3$). Finally, $\varepsilon_i$ is the stochastic error term: the statistician's Get Out of Jail Free card. The stochastic error is the difference between the predicted value and the observed value. When we forecast, as we will in this example, the error term is left out. When we instead compare predictions with observed values, the error term represents variables that were overlooked, a lack of observations, human error, and slight anomalies that are not factored into the regression function.

Now, to finally demonstrate OLS linear regression in a real-world, practical example, let's say that you are in charge of acceptances at a semi-prestigious university. You have been told by the university higher-ups that the school is choosing to be less focused on high enrollment numbers and freshman retention rates. Instead, the goal is to admit qualified students who are expected not only to graduate from college but to do so with a respectable grade point average of at least 3.1; quality over quantity. Using the regression analysis we developed previously, also shown to the left, we can insert values for the independent variables and estimate how a prospective student will perform at the university, with a model that explains just shy of 80% of the variation in college GPA.

An application is received from Macie Murphy. Ms. Murphy's letter of application includes her high school transcript as well as her SAT scores. Ms. Murphy is a very astute student who hands in great assignments and seems never to have things hindering her academically, like broken computer chargers that leave her two weeks behind the class syllabus and restlessly scrambling to catch up on homework. Macie's high school GPA is 3.62 and her SAT scores are also something for her to be proud of: 642 in mathematics and 661 in the verbal [Critical Reading] section. Using our regression function and the values determined from the 30 previous observations, the equation to forecast Macie's college GPA is as follows:

$$Y_i = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$$

Note: because we are forecasting, a stochastic error term is not necessary.

Estimated College GPA = .459423967 + .73143021(3.62) - .002513024(642) + .00339953(661)
Estimated College GPA = .459423967 + 2.64778 - 1.61336 + 2.24709
Estimated College GPA = 3.74

Given the university's new policy of an expected college GPA of 3.10 or higher, Macie's astounding expected GPA of 3.74 would make her a shoo-in.
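The forecast above can also be checked with a few lines of Python. This is only a sketch for verifying the arithmetic: the constant and function names are made up for illustration, while the coefficient values and Macie's scores are the ones given in this section. Note that the SAT-Math coefficient is negative, which is why its term is subtracted.

```python
# Regression coefficients as reported in this section.
INTERCEPT = 0.459423967    # b0
B_GPA_HIGH = 0.73143021    # b1, high school GPA
B_SAT_MATH = -0.002513024  # b2, math SAT score (negative)
B_SAT_VERB = 0.00339953    # b3, verbal SAT score

MINIMUM_EXPECTED_GPA = 3.10  # the university's policy threshold


def predict_college_gpa(gpa_high, sat_math, sat_verb):
    """Forecast ending college GPA; the error term is omitted when forecasting."""
    return (INTERCEPT
            + B_GPA_HIGH * gpa_high
            + B_SAT_MATH * sat_math
            + B_SAT_VERB * sat_verb)


macie_gpa = predict_college_gpa(3.62, 642, 661)
meets_policy = macie_gpa >= MINIMUM_EXPECTED_GPA
print(f"Estimated College GPA = {macie_gpa:.2f}")  # prints 3.74
```

Writing the formula this way lets the language apply the order of operations, so each coefficient is multiplied by its value before the terms are added.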

V. Troubleshooting

If the instructions are followed correctly, there should not be any error messages or differing results. If you do encounter a problem, it can often be fixed easily.

One error that could appear is Data contains non-numeric data. The likely cause is that one or more of the data cells on Sheet1 below the first row (A1-D1) is blank or contains a letter where a number belongs. Recheck the data set and make the correction using the value(s) the author has provided.

Another problem that could occur is that the windows and boxes that open are not the same as the author's. This is most likely to happen in Section IV. Directions; subheading F. Data Results and Modification; in process 2. Adjusting the Graphs and process 3. Adding a Trendline. The reason is that these processes require a lot of precise clicking.

For process 2. Adjusting the Graphs, the author recommends clicking once on the middle number displayed on the axis, be it the y-axis or the x-axis; if done correctly, a light gray box should surround the axis in question. Once the gray box is visible, double-click without moving the mouse cursor away from the spot of the first click. If done correctly, a window will open that is identical to the author's graphic of the step.

For process 3. Adding a Trendline, if your screen differs from the graphic provided, it is likely because not all of the data points are selected, or because you have selected the entire graph rather than the data points. Hover your mouse over one of the red data points. The red data points represent the predicted values, not the actual values. Left-click once; all red data points should now be highlighted. Without moving your cursor, right-click to open the small window seen in the graphic for that step.
The key to solving this problem is to be patient and to avoid double-clicking or right-clicking when left-clicking is required, and vice versa.

If your numerical results are off, this is another problem, especially if you are trying to reproduce the author's regression. To fix it, return to Sheet1, the spreadsheet that contains the data set. Examine the numbers in each cell carefully, as a missing or wrongly placed decimal point will throw the calculations completely off. The same goes for missing or improperly placed digits. If the problem still occurs after the data set has been checked, you are likely having some trouble with Section IV. Directions; subheading E. Designing the Output. Review and compare the Input Y Range and Input X Range fields. If they do not match the example the author has provided, repeat the process shown for that step, making sure the correct cells have been selected.

Finally, if all the graphics match the author's but the GPA you calculated for Macie doesn't match the author's calculation, it is likely a mathematical error. To fix this, review the first equation given, the one written with symbols rather than numbers, and examine how the multiplication, addition, and subtraction fit together. Use the scratch paper and writing utensil to jot down how the process should go. Next, review the regression coefficient values and check that they match the computer's regression output as well as the author's inputs. If those match, reread Macie's values (her high school GPA and her math and reading SAT scores). Place those values into the equation just as the author did. If your numbers still don't match up, remember to multiply each coefficient by its value first, before doing the addition (and in this case a subtraction) between the terms. If need be, do each of the multiplications separately, as the author did, rather than entering the whole formula into a calculator, as some basic calculators do not apply the order of operations for multiplication and division.
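For readers who keep a copy of the data set outside Excel, the non-numeric data check described above can be sketched in Python. This is a hypothetical helper, not part of Excel or of the author's procedure. It assumes the sheet has been read into a list of rows of text, with the labels in the first row and no more than 26 columns, and the sample values shown are made up for illustration.

```python
def find_non_numeric_cells(rows, has_labels=True):
    """Return spreadsheet-style references (e.g. "B3") for every data cell
    that is blank or cannot be read as a number."""
    problems = []
    first_data_index = 1 if has_labels else 0
    # Spreadsheet rows are numbered from 1, so index 0 is row 1.
    for row_number, row in enumerate(rows[first_data_index:],
                                     start=first_data_index + 1):
        for col_index, cell in enumerate(row):
            try:
                float(str(cell).strip())
            except ValueError:
                column_letter = chr(ord("A") + col_index)  # assumes columns A-Z
                problems.append(f"{column_letter}{row_number}")
    return problems


# Made-up sample sheet: B3 is blank, and C4 contains a letter "l" typed
# in place of the digit "1".
sample_sheet = [
    ["GPA-High", "SAT-Math", "SAT-Verb", "GPA-Univ"],
    ["3.5", "600", "610", "3.2"],
    ["3.1", "", "580", "2.9"],
    ["2.9", "640", "6l0", "3.0"],
]
bad_cells = find_non_numeric_cells(sample_sheet)  # ["B3", "C4"]
```

A check like this catches blanks and stray letters, but it will not catch a misplaced decimal point, so comparing the cells against the author's values by eye is still necessary.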
