
Module - 3

Data Analysis And Presentation

The Data Preparation Process

Preparing a preliminary plan of data analysis
Questionnaire checking
Editing
Coding
Transcribing
Data cleaning
Statistically adjusting the data
Selecting a data analysis strategy

Questionnaire Checking
Involves a check of all questionnaires for completeness and interviewing quality. A questionnaire returned from the field may be unacceptable for several reasons:
Part of the questionnaire may be incomplete
The questionnaire is received after the pre-established cutoff date
It was answered by someone who does not qualify for participation
The questionnaire is physically incomplete

Editing
Editing is done on filled-in questionnaires. Editing is the review of the questionnaires with the objective of increasing their accuracy and precision. It consists of screening the questionnaires to identify illegible, incomplete, inconsistent or ambiguous responses.

Editing
Treatment of unsatisfactory responses
Returning to the field
Assigning missing values
Discarding unsatisfactory respondents

Coding
Coding means assigning a code to each possible response to each question.

Coding example:
Do you currently have a valid passport?
1. Yes 2. No

Here the Yes response is coded as 1 and the No response is coded as 2. Codes allow the data to be easily entered and processed in a computer.
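As a minimal illustration of coding in practice (not part of the original slide), the snippet below assumes the pandas library and a hypothetical column of Yes/No answers, and maps them to the numeric codes 1 and 2 described above.

```python
import pandas as pd

# Hypothetical responses to "Do you currently have a valid passport?"
responses = pd.DataFrame({"passport": ["Yes", "No", "Yes", "Yes", "No"]})

# Assign the codes used above: 1 = Yes, 2 = No
responses["passport_code"] = responses["passport"].map({"Yes": 1, "No": 2})
print(responses)
```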

Coding
Researchers organize the coded data into fields, records and files.

Field
A collection of characters that represents a single type of data

Record
A collection of all related fields

File
A collection of all related records

Coding
The data matrix
A rectangular arrangement of data in rows and columns. Each column is a field and each row is a record.

| State (Record)           | Population (in millions) | Average Age | Cars per 1000 |
|--------------------------|--------------------------|-------------|---------------|
| GUJ (Gujarat, Row 1)     | 4                        | 29.3        | 543           |
| MAH (Maharashtra, Row 2) |                          | 26.1        | 550           |
| RAJ (Rajasthan, Row 3)   | 3.6                      | 29.2        | 460           |
| DEL (Delhi, Row 4)       | 2.4                      | 32.5        | 560           |

Coding
Code Book
A book containing coding instructions and the necessary information about the variables in the data set. It is used by researchers to promote more accurate and more efficient data entry.

Coding
Pre-coding
Codes are assigned before the field work. Used in the case of closed-ended questions. It is easy to code closed-ended questions.

Post-coding
Codes are assigned after the field work. Used in the case of open-ended questions. It is very complex to code open-ended questions.

Coding
Coding Rules
Appropriate to the research problem and purpose
Exhaustive

Transcribing
Transcribing involves transferring the coded data from the questionnaires to a computer. Various techniques used for transcribing the data include:
Computer Aided Telephonic Interviews (CATI)
Keypunching
Optical recognition
Digital technologies
Bar codes
Other technologies

Data Cleaning
Data cleaning includes consistency checks and the treatment of missing responses. Although preliminary consistency checks are made during the editing stage, the checks at this stage are more thorough and extensive because they are made by computer.

Data Cleaning
Consistency checks
A part of the data cleaning process that identifies the data that are out of range, logically inconsistent, or have extreme values

Missing responses
Represent values of variables that are unknown either because respondents provided ambiguous answers or because their answers were not properly recorded.

Data Cleaning
Treatment of missing responses
Substitute a neutral value
A neutral value, typically the mean response, is substituted for the missing responses.

Substitute an imputed response


The respondent's pattern of responses to other questions is used to impute a suitable response to the missing question.

Data Cleaning
Treatment of missing responses
Casewise deletion
Cases or respondents with any missing responses are discarded from the analysis.

Pairwise deletion
Instead of discarding all cases with any missing values, the researcher uses only the cases with complete responses for the variables involved in each analysis.
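A minimal Python sketch of these treatments (assuming pandas/numpy and a hypothetical survey DataFrame with missing answers recorded as NaN); it is illustrative, not a prescription from the slides.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"q1": [4, 5, np.nan, 3, 4],
                   "q2": [2, np.nan, 3, 4, 5]})

neutral = df.fillna(df.mean())   # substitute a neutral value (the item mean)
casewise = df.dropna()           # casewise deletion: drop respondents with any missing answer
pairwise = df.corr()             # pairwise deletion: pandas correlations use all
                                 # complete pairs available for each pair of variables
```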

Statistically adjusting the data


Procedures for statistically adjusting the data consist of:
Weighting
Variable re-specification
Scale transformation

Statistically adjusting the data


Weighting
Each case or respondent in the database is assigned a weight to reflect its importance relative to other cases or respondents

Variable re-specification
Involves the transformation of data to create new variables or modify existing variables.

Statistically adjusting the data


Scale transformation
Involves modification of the assigned scale to make it more efficient and valuable. Another objective is to increase the comparability between scales.
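As an illustration of one common scale transformation (standardization to z-scores), here is a minimal numpy sketch; the ratings data are hypothetical.

```python
import numpy as np

ratings = np.array([2, 4, 5, 3, 4, 5], dtype=float)

# Standardize: the transformed scale has mean 0 and standard deviation 1,
# which makes differently scaled questions directly comparable.
z_scores = (ratings - ratings.mean()) / ratings.std(ddof=1)
print(z_scores)
```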

Selecting a data analysis strategy


While selecting a data analysis strategy, the following should be taken into consideration:
Earlier steps of marketing research
Problem identification, research design etc.

Characteristics of data
Measurement scales used

Properties of statistical technique


A technique suitable for your research should be selected

Background and philosophy of researcher

A classification of statistical techniques

Classification of statistical techniques


Statistical techniques can be classified as univariate and multivariate.

Univariate techniques
Appropriate when there is a single measurement of each element in the sample, or when there are several measurements of each element but each variable is analyzed in isolation.

Multivariate techniques
Suitable for analyzing the data when there are two or more measurements of each element and the variables are analyzed simultaneously.

Univariate Techniques
Univariate techniques are applied either to metric data or to nonmetric data.

Metric data
Data that are measured on an interval or ratio scale

Nonmetric data
Data that are measured on a nominal or ordinal scale

Univariate Techniques
Univariate techniques for metric data:
One sample: z test, t test
Two or more samples:
  Independent samples: two-group t test, two-group z test, one-way ANOVA
  Dependent (paired) samples: paired t test

Univariate Techniques
Univariate techniques for nonmetric data:
One sample: frequency, chi-square, Kolmogorov-Smirnov, runs test, binomial test
Two or more samples:
  Independent samples: chi-square, Mann-Whitney, median test, Kolmogorov-Smirnov, Kruskal-Wallis ANOVA
  Related samples: sign test, Wilcoxon, McNemar, chi-square

Multivariate Techniques
Multivariate techniques are divided into dependence techniques and interdependence techniques.

Dependence techniques
Appropriate when one or more variables can be identified as dependent and the remaining variables as independent.

Interdependence techniques
The variables are not classified as dependent or independent; rather, the whole set of interdependent relationships is examined.

Frequency Distribution
A mathematical distribution whose objective is to obtain a count of the number of responses associated with different values of one variable and to express these counts in percentage terms.

Frequency Distribution
42 30 53 50 52 30 55 49 61 74 26 58 40 40 28 36 30 33 31 37 32 37 30 32 23 32 58 43 30 29 34 50 47 31 35 26 64 46 40 43 57 30 49 40 25 50 52 32 60 54

Ages of a Sample of Managers from Urban Child Care Centers in the United States

Frequency Distribution
| Class Interval | Frequency |
|----------------|-----------|
| 20-under 30    | 6         |
| 30-under 40    | 18        |
| 40-under 50    | 11        |
| 50-under 60    | 11        |
| 60-under 70    | 3         |
| 70-under 80    | 1         |

Frequency Distribution
| Class Interval | Frequency | Relative Frequency | Cumulative Frequency |
|----------------|-----------|--------------------|----------------------|
| 20-under 30    | 6         | .12                | 6                    |
| 30-under 40    | 18        | .36                | 24                   |
| 40-under 50    | 11        | .22                | 35                   |
| 50-under 60    | 11        | .22                | 46                   |
| 60-under 70    | 3         | .06                | 49                   |
| 70-under 80    | 1         | .02                | 50                   |
| Total          | 50        | 1.00               |                      |
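The frequency table above can be reproduced from the raw ages with a few lines of pandas; this is a minimal sketch assuming the pandas library is available.

```python
import pandas as pd

ages = [42, 30, 53, 50, 52, 30, 55, 49, 61, 74, 26, 58, 40, 40, 28, 36, 30,
        33, 31, 37, 32, 37, 30, 32, 23, 32, 58, 43, 30, 29, 34, 50, 47, 31,
        35, 26, 64, 46, 40, 43, 57, 30, 49, 40, 25, 50, 52, 32, 60, 54]

bins = [20, 30, 40, 50, 60, 70, 80]    # "20-under 30", "30-under 40", ...
freq = pd.cut(ages, bins=bins, right=False).value_counts().sort_index()

table = pd.DataFrame({"Frequency": freq,
                      "Relative Frequency": freq / len(ages),
                      "Cumulative Frequency": freq.cumsum()})
print(table)
```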

Graphical Presentation of data

Graphical Presentation of data


Once your data have been entered and checked for errors, you are ready for analysis. Data can be presented using various diagrams, graphs and tables.

Graphical Presentation of data


The following checklist should be considered while using diagrams and tables.

For diagrams:
Does it have clear axis labels?
Are the bars and their components in a logical sequence?
Is a key or legend included?

For tables:
Does it have clear row and column headings?
Are the columns and rows in a logical sequence?

Graphical Presentation of data


The following checklist should be considered while using diagrams and tables.

For both diagrams and tables:
Does it have a brief but clear and descriptive title?
Are the units of measurement clearly stated?
Are the sources of the data clearly stated?
Are there notes to explain abbreviations and unusual terminology?

Graphical Presentation of data


Tables and graphs can be used to present the following types of information:
Specific values
Highest and lowest values
Trends over time
Proportions
Distributions
Conjunctions
Totals
Interdependence and relationships

To show specific values


The simplest ways of summarizing the data for individual variables so that specific values can be read are:
Tables
Frequency distributions

To show highest and lowest values


Tables attach no visual significance to the highest or lowest values. Diagrams, charts and graphs do this job in a better manner. The following types of diagrams can be used for this purpose:
Bar chart
Histogram
Line graph / frequency polygon

To show highest and lowest values


Bar chart

[Bar chart: "Employees by Grade" – Number of employees (y-axis, 0 to 70) by grade (x-axis: Production, Marketing, Finance, HR)]
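A minimal matplotlib sketch of such a bar chart; the employee counts are illustrative assumptions, not data from the slide.

```python
import matplotlib.pyplot as plt

grades = ["Production", "Marketing", "Finance", "HR"]
employees = [60, 35, 25, 15]          # hypothetical counts

plt.bar(grades, employees)
plt.title("Employees by Grade")
plt.xlabel("Grade")
plt.ylabel("Number")
plt.show()
```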

To show highest and lowest values


Histogram:
A histogram is a series of rectangles, each proportional in width to the range of values within a class and proportional in height to the number of items falling in that class.

Advantages:
Each rectangle clearly shows a separate class in the distribution
The area of each rectangle, relative to all the other rectangles, shows the proportion of the total number of observations that occur in that class

To show highest and lowest values


Histogram
[Histogram: Frequency (y-axis) against Years (x-axis, 10 to 80)]

To show highest and lowest values


Histogram
| Class Interval | Frequency |
|----------------|-----------|
| 20-under 30    | 6         |
| 30-under 40    | 18        |
| 40-under 50    | 11        |
| 50-under 60    | 11        |
| 60-under 70    | 3         |
| 70-under 80    | 1         |

[Histogram of the above distribution: Frequency (y-axis) against Years (x-axis, 10 to 80)]

To show highest and lowest values


Line graph / Frequency polygon
A frequency polygon is a line graph of frequencies. We plot each class frequency by drawing a dot above the class midpoint and connect the successive dots with straight lines to form a polygon.

Advantages of the polygon:
The frequency polygon is simpler than its histogram counterpart
It sketches an outline of the data pattern more clearly

To show highest and lowest values


Frequency polygon
| Class Interval | Frequency |
|----------------|-----------|
| 20-under 30    | 6         |
| 30-under 40    | 18        |
| 40-under 50    | 11        |
| 50-under 60    | 11        |
| 60-under 70    | 3         |
| 70-under 80    | 1         |

[Frequency polygon of the above distribution: Frequency (y-axis) against Years (x-axis, 10 to 80)]
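A minimal matplotlib sketch that draws both the histogram and the frequency polygon for the managers' ages used in the earlier frequency table.

```python
import matplotlib.pyplot as plt

ages = [42, 30, 53, 50, 52, 30, 55, 49, 61, 74, 26, 58, 40, 40, 28, 36, 30,
        33, 31, 37, 32, 37, 30, 32, 23, 32, 58, 43, 30, 29, 34, 50, 47, 31,
        35, 26, 64, 46, 40, 43, 57, 30, 49, 40, 25, 50, 52, 32, 60, 54]
bins = [20, 30, 40, 50, 60, 70, 80]

counts, edges, _ = plt.hist(ages, bins=bins, edgecolor="black")   # histogram
midpoints = [(edges[i] + edges[i + 1]) / 2 for i in range(len(counts))]
plt.plot(midpoints, counts, marker="o")                           # frequency polygon
plt.xlabel("Years")
plt.ylabel("Frequency")
plt.show()
```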

To Show Proportions
Pie chart
The most frequently used diagram to show a proportion or share is the pie chart. A pie chart is a circular chart divided into sectors, each illustrating a proportion. The pie chart is perhaps the most ubiquitous statistical chart in the business world and the mass media.

[Pie chart example showing proportions]

To Show Distribution of values


To show the distribution of values among the variables, the following diagrams can be used:
Pie chart
Frequency polygon
Bar chart
Box and whisker plot

To Show Distribution of values


Box and whisker plot (box plot)
It summarizes the following statistical measures:
Median
Upper and lower quartiles
Minimum and maximum data values

The box itself contains the middle 50% of the data
The upper edge of the box indicates the 75th percentile
The lower edge indicates the 25th percentile

Comparing Variables
To show specific values and interdependence
Contingency tables or cross tabulation
A contingency table is often used to record and analyze the relation between two or more categorical variables. It displays the frequency distribution of the variables in a matrix format.

|         | Right-handed | Left-handed | Totals |
|---------|--------------|-------------|--------|
| Males   | 43           | 9           | 52     |
| Females | 44           | 4           | 48     |
| Totals  | 87           | 13          | 100    |

Comparing Variables
To compare highest and lowest values
Comparisons of highest and lowest values are best explored using a multiple bar chart or a compound bar chart.

[Multiple bar chart: "Amount of sales during 1999 and 2010" – Sales in lakhs (y-axis) for Company A, Company B and Company C, with separate bars for 1999 and 2010]

Comparing Variables
To compare proportions
Comparisons of proportions between variables use either a percentage component bar chart or two or more pie charts.

[Percentage component bar chart: "Would you purchase this product again?" – Percentage of respondents (y-axis) for Products A to D, with segments for No, Unsure, Probably and Definitely]

Comparing Variables
To compare trends and conjunctions
The most suitable diagram to compare trends between two or more variables is the multiple line graph.

[Multiple line graph: "Trends in sale of ice cream" – Sales in '000 litres (y-axis) by month (Jan–Dec) for Chocolate, Strawberry and Vanilla]

Comparing Variables
To compare distribution of values
Often it is useful to compare the distribution of values for two or more variables. Plotting multiple frequency polygons or bar charts is useful in such situations.

[Multiple bar chart: "Amount of sales during 1999 and 2010" – Sales in lakhs (y-axis) for Company A, Company B and Company C]

Comparing Variables
To show the relationship between cases for variables
Often it is useful to examine the relationship between cases for two variables. A scatter graph (scatter plot) is useful in such situations.

Statistics Associated With Frequency Distribution

Statistics Associated With Frequency Distribution


The most commonly used statistics associated with frequency distribution are,
Measures of location
Mean, Median and Mode

Measures of variability
Range, Interquartile range, Standard deviation and Coefficient of variation

Measures of Shape
Skewness and Kurtosis

Measures of Location
Mean
The average: the value obtained by summing all elements in a set and dividing by the number of elements. It is the most commonly used measure of central tendency and is affected by every value in the data set, including extreme values.

Measures of Location
Mean

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{57 + 86 + 42 + 38 + 90 + 66}{6} = \frac{379}{6} = 63.167$$

Measures of Location
Mode
The most frequently occurring value in a data set. It represents the highest peak of the distribution. Multimodal: data sets that contain more than two modes.

Measures of Location
Mode

The mode is 44: there are more 44s than any other value.

Data (ordered): 35 37 37 39 40 40 41 41 43 43 43 43 44 44 44 44 44 45 45 46 46 46 46 48

Measures of Location
Median
The middle value in an ordered array of numbers. It is unaffected by extremely large and extremely small values.

First Procedure
Arrange the observations in an ordered array. If there is an odd number of terms, the median is the middle term of the ordered array. If there is an even number of terms, the median is the average of the middle two terms.

Second Procedure
The median's position in an ordered array is given by (n+1)/2, where n = number of items in the array.

Measures of Location
Median
Ordered array: 3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22

There are 17 terms in the ordered array.
Position of the median = (n+1)/2 = (17+1)/2 = 9
The median is the 9th term, 15.
If the 22 is replaced by 100, the median is still 15.
If the 3 is replaced by -103, the median is still 15.
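As a quick check of the three worked examples above, here is a minimal sketch using Python's standard statistics module.

```python
import statistics

print(statistics.mean([57, 86, 42, 38, 90, 66]))      # 63.1666... (the mean example)

ordered = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
print(statistics.median(ordered))                     # 15 (the median example)

print(statistics.mode([35, 37, 37, 44, 44, 44, 45]))  # 44 (most frequently occurring value)
```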

Measures of shape
Skewness
Absence of symmetry: extreme values lie on one side of the distribution.

[Distribution shapes: negatively skewed, symmetric (not skewed), positively skewed]

Measures of Shape
Skewness

Negatively skewed: Mean < Median < Mode
Symmetric (not skewed): Mean = Median = Mode
Positively skewed: Mode < Median < Mean

Measures of Shape
Kurtosis
A measure of the relative peakedness or flatness of the curve defined by the frequency distribution. The (excess) kurtosis of a normal distribution is zero. If the kurtosis is positive, the distribution is more peaked than the normal distribution.

Measures of variability
A statistic that indicates the distribution's dispersion; it measures the spread of the data in the distribution.

Common measures of variability:
Range
Interquartile range
Mean absolute deviation
Variance
Standard deviation
Z scores
Coefficient of variation

Measures of variability
Range
The difference between the largest and the smallest values in a set of data. It is simple to compute but ignores all data points except the two extremes.

Example: Range = Largest − Smallest = 48 − 35 = 13
Data: 35 37 37 39 40 40 41 41 43 43 43 43 44 44 44 44 44 45 45 46 46 46 46 48

Measures of variability
Interquartile Range
The range of values between the first and third quartiles; the range of the middle half of the data. It is less influenced by extremes.

$$\text{Interquartile Range} = Q_3 - Q_1$$

Measures of variability
Variance
The mean squared deviation of all the values from the mean. When the data points are clustered around the mean, the variance is small.

Data set: 5, 9, 16, 17, 18

$$\mu = \frac{\sum X}{N} = \frac{65}{5} = 13$$

Deviations from the mean: −8, −4, 3, 4, 5

Measures of variability
Standard Deviation
The square root of the variance, expressed in the same units as the data rather than in squared units.

Population standard deviation:
$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

Sample standard deviation:
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

Measures of variability
Coefficient of variation
The coefficient of variation is the ratio of the standard deviation to the mean, expressed as a percentage:

$$CV = \frac{S}{\bar{X}} \times 100\%$$
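A minimal numpy sketch computing these variability measures for the data set 5, 9, 16, 17, 18 used in the variance example above (sample formulas, i.e. ddof=1).

```python
import numpy as np

x = np.array([5, 9, 16, 17, 18], dtype=float)

data_range = x.max() - x.min()                        # range
iqr = np.percentile(x, 75) - np.percentile(x, 25)     # interquartile range
variance = x.var(ddof=1)                              # sample variance
std_dev = x.std(ddof=1)                               # sample standard deviation
cv = std_dev / x.mean() * 100                         # coefficient of variation (%)
print(data_range, iqr, variance, std_dev, cv)
```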

Introduction to hypothesis testing

Hypothesis
A hypothesis is an assumption. Hypothesis testing is concerned with making assumptions about population parameters. These assumptions may be right or wrong, and hypothesis testing is concerned with determining their correctness.

Procedure for hypothesis testing


1. Formulate the null and alternative hypotheses
2. Select an appropriate statistical technique
3. Choose the level of significance
4. Determine the sample size, collect the data and calculate the value of the test statistic
5. Determine the probability associated with the test statistic and the critical value of the test statistic that divides the rejection and non-rejection regions

Procedure for hypothesis testing


6. Compare the probability with the level of significance, and determine whether the test statistic falls in the rejection or the non-rejection region
7. Reject or do not reject the null hypothesis
8. Draw a marketing research conclusion

Procedure for hypothesis testing


Step 1: Formulate the hypotheses

Null hypothesis
The assumption we wish to test is called the null hypothesis, denoted H0. For example, if the assumption is that the population mean is 500, the null hypothesis is written as:

$$H_0: \mu = 500$$

Procedure for hypothesis testing


Step 1: Formulate the hypotheses

Alternative hypothesis
When we reject the null hypothesis, the conclusion we accept instead is called the alternative hypothesis. For example, if the null hypothesis is $H_0: \mu = 500$, the alternative hypothesis could take one of the following forms:

$$1.\; H_1: \mu \neq 500 \qquad 2.\; H_1: \mu > 500 \qquad 3.\; H_1: \mu < 500$$

Step 1 Formulate the hypothesis One tail test

Procedure for hypothesis testing

A one-tailed test will reject the null hypothesis only if the sample mean is either
significantly higher than the hypothesized population mean, or
significantly lower than the hypothesized population mean (but not both).

So there is only one rejection area

Step 1 Formulate the hypothesis One tail test


A one-tailed test is appropriate when the null hypothesis is

$$H_0: \mu = \mu_0$$

and the alternative hypothesis is

$$H_1: \mu > \mu_0 \quad \text{or} \quad H_1: \mu < \mu_0$$

Step 1 Formulate the hypothesis Two tail test

Procedure for hypothesis testing

A two-tailed test will reject the null hypothesis if the sample mean is significantly higher or significantly lower than the hypothesized population mean, so there are two rejection regions. It is appropriate when the null hypothesis is $H_0: \mu = \mu_0$ and the alternative hypothesis is $H_1: \mu \neq \mu_0$.

Procedure for hypothesis testing


Step 2: Select an appropriate test. The test statistic measures how close the sample has come to the null hypothesis. It often follows a well-known distribution such as the normal, t or chi-square distribution.

Procedure for hypothesis testing


Step 2: Select an appropriate test

|        | Population standard deviation is known | Population standard deviation is unknown |
|--------|----------------------------------------|------------------------------------------|
| n > 30 | Normal distribution (z table)          | Normal distribution (z table)            |
| n < 30 | Normal distribution (z table)          | t distribution (t table)                 |

Procedure for hypothesis testing


Step 3: Choose a level of significance. The purpose of hypothesis testing is to make a judgment about the difference between the sample statistic (e.g. the mean) and the hypothesized population parameter. The significance level relates to the level of acceptable difference between the sample mean and the hypothesized population mean.

Procedure for hypothesis testing


Step 3: Choose a level of significance. A higher significance level gives a larger rejection region: the higher the significance level used for testing a hypothesis, the higher the probability of rejecting the null hypothesis when it is true.

Procedure for hypothesis testing


Step 3: Choose a level of significance

Type I error
The error of rejecting the null hypothesis when it is true. Its probability is denoted by α. It is more likely at a high significance level.

Type II error
The error of accepting the null hypothesis when it is false. Its probability is denoted by β. It is more likely when the significance level is low. There is a trade-off between the two types of error.

Procedure for hypothesis testing


Step 4: Collect the data and calculate the test statistic. The value of the test statistic is computed from the collected data, for example:

z test:
$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$$

t test:
$$t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$$

Procedure for hypothesis testing


Step 5: Determine the probability (or critical value). Determine the probability associated with the z value or t value using the standard normal (or t) table. The critical value is determined by taking the significance level into consideration.

Procedure for hypothesis testing


Step 6: Compare the probability and make the decision. See whether the value of the sample statistic falls in the acceptance region or the rejection region. Step 7: Marketing research conclusion. Finally, draw the marketing research conclusion based on the hypothesis tested.

Cross tabulation

Introduction to cross tabulation


If you want to better understand how two different survey items inter-relate, cross-tabulation analysis is the answer. It is a statistical technique that describes two or more variables simultaneously and results in a table that reflects the joint distribution of the variables.

Introduction to cross tabulation


Examples
 Relationship between gender and internet use
| Internet Use | Male | Female | Total |
|--------------|------|--------|-------|
| Light        | 5    | 10     | 15    |
| Heavy        | 10   | 5      | 15    |
| Total        | 15   | 15     | 30    |

Contingency table

Introduction to cross tabulation


Examples
How many brand-loyal users are male? Is familiarity with a new product related to age and education level? Is product ownership related to income?

Cross tabulation - Advantages


Cross-tabulation analysis and its results can be easily understood and interpreted by managers. The clarity of interpretation provides a stronger link between research results and managerial action. It is simple to conduct and appealing to less sophisticated researchers.

Cross tabulation With Two Variables


Cross tabulation with two variables is also known as bivariate cross tabulation. It shows the relationship between two variables.

| Internet Use | Male | Female | Total |
|--------------|------|--------|-------|
| Light        | 5    | 10     | 15    |
| Heavy        | 10   | 5      | 15    |
| Total        | 15   | 15     | 30    |

Cross tabulation With Two Variables


Column percentages (internet use within each gender):

| Internet Use | Male   | Female |
|--------------|--------|--------|
| Light        | 33.3%  | 66.7%  |
| Heavy        | 66.7%  | 33.3%  |
| Total        | 100%   | 100%   |
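A minimal pandas/scipy sketch that rebuilds the gender vs. internet-use cross-tabulation and runs a chi-square test of association; the record-level data are reconstructed from the cell counts above and the column names are illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

data = pd.DataFrame({
    "gender": ["Male"] * 15 + ["Female"] * 15,
    "usage":  ["Light"] * 5 + ["Heavy"] * 10 + ["Light"] * 10 + ["Heavy"] * 5,
})

table = pd.crosstab(data["usage"], data["gender"], margins=True)
print(table)

chi2, p, dof, expected = chi2_contingency(pd.crosstab(data["usage"], data["gender"]))
print(chi2, p)   # a small p-value suggests gender and internet use are associated
```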

Cross tabulation With Three Variables


Cross tabulation with three variables
Cross tabulation with three variables reflects the relationship between three variables, for example the purchase of fashion clothing by marital status, examined separately for each gender.

Males – purchase of fashion clothing by marital status:

| Purchase      | Married | Unmarried |
|---------------|---------|-----------|
| High          | 35%     | 40%       |
| Low           | 65%     | 60%       |
| Column totals | 100%    | 100%      |

Females – purchase of fashion clothing by marital status:

| Purchase      | Married | Unmarried |
|---------------|---------|-----------|
| High          | 25%     | 60%       |
| Low           | 75%     | 40%       |
| Column totals | 100%    | 100%      |

Statistics associated with cross tabulation


The following are the statistical tests associated with cross tabulation
Chi-square
Phi coefficient
Contingency coefficient
Cramer's V
Lambda coefficient

Hypothesis Testing Related to Differences

Hypothesis tests

Parametric Tests

Non parametric Tests

Univariate Techniques
Hypothesis tests are classified into parametric and non-parametric tests.

Parametric tests:
One sample: z test, t test
Two or more samples:
  Independent samples: two-group t test, two-group z test, F test
  Dependent (paired) samples: paired t test

Univariate Techniques
Non-parametric tests:
One sample: chi-square, Kolmogorov-Smirnov, runs test, binomial test
Two samples:
  Independent samples: chi-square, Mann-Whitney, median test, Kolmogorov-Smirnov
  Related samples: sign test, Wilcoxon, McNemar, chi-square

Parametric V/S Nonparametric Tests


Parametric tests
Hypothesis testing procedures that assume the variables are measured on at least an interval scale. They also assume that the sampling distribution is normally distributed.

Parametric V/S Nonparametric Tests

Non-parametric tests
They assume the variables are measured on a nominal or ordinal scale. They are useful when the sampling distribution is not normally distributed. These tests can be carried out without the assumption of normality, approximate normality, or symmetry, and they do not require a mean and standard deviation.

Parametric V/S Nonparametric Tests

| Parametric test | Goal of parametric test | Non-parametric test | Goal of non-parametric test |
|-----------------|-------------------------|---------------------|-----------------------------|
| Two-sample t test | To see if two samples have identical population means | Wilcoxon rank-sum test | To see if two samples have identical population medians |
| One-sample t test | To test a hypothesis about the mean of the population a sample was taken from | Wilcoxon signed-ranks test | To test a hypothesis about the median of the population a sample was taken from |
| Chi-squared goodness-of-fit test | To see if a sample fits a theoretical distribution, such as the normal curve | Kolmogorov-Smirnov test | To see if a sample could have come from a certain distribution |
| ANOVA | To see if two or more sample means are significantly different | Kruskal-Wallis test | To test if two or more sample medians are significantly different |

Parametric V/S Nonparametric Tests

|                        | Parametric                 | Non-parametric                          |
|------------------------|----------------------------|-----------------------------------------|
| Assumed distribution   | Normal                     | Not normal                              |
| Typical data           | Ratio or interval          | Ordinal or nominal                      |
| Data set relationships | Independent                | Any                                     |
| Usual central measure  | Mean                       | Median                                  |
| Benefits               | Can draw more conclusions  | Simplicity; less affected by outliers   |

Parametric Test One Sample


z test
The z test compares the sample and population means to determine whether there is a significant difference. It requires a simple random sample from a normally distributed population whose standard deviation is known. It is used when the sample size is large (more than 30) and uses the z distribution.

Parametric Test One Sample


z test

$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

Parametric Test One Sample


t test
Used to test a hypothesis about the mean of the population a sample was taken from. It is used when the population standard deviation is unknown and the sample size is small (n < 30). It uses the t distribution.

Parametric Test One Sample


t test
Characteristics of the t distribution:
It is flatter than the normal distribution
It is lower at the mean and higher in the tails than the normal distribution (more values in the tails)
As n increases, the t distribution approximates the normal distribution

Parametric Test One Sample


t test

$$t = \frac{\bar{x} - \mu}{s_{\bar{x}}}, \qquad s_{\bar{x}} = \frac{s}{\sqrt{n}}$$

t test – procedure for hypothesis testing when the t test is used:

Formulate the null and alternative hypotheses
Select the appropriate formula for the t statistic
Select a significance level
Take one or two samples and compute the mean and standard deviation
Calculate the degrees of freedom and find the critical value from the t table
Compare the t value with the critical value and, based on this, accept or reject the null hypothesis
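A minimal scipy sketch of this procedure for the one-sample case, assuming a hypothesized population mean of 500 and hypothetical sample data.

```python
from scipy import stats

sample = [512, 480, 530, 495, 470, 508, 525, 490, 515, 502]   # hypothetical data
t_stat, p_value = stats.ttest_1samp(sample, popmean=500)

alpha = 0.05
print("Reject H0" if p_value < alpha else "Do not reject H0", t_stat, p_value)
```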

Parametric Test One Sample

Several hypotheses in marketing relate parameters from two different populations. For example:

Parametric Test Two Independent Samples

Brand perceptions of users and non-users
Spending habits of high-income consumers and low-income consumers, etc.

Parametric Test Two Independent Samples


Two group z test
Used to see if two samples have identical population means. It is used when the sample sizes are large. In this case the hypotheses take the following form:

$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \neq \mu_2$$

Two group z test

Parametric Test Two Independent Samples

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}}, \qquad \sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

Two group t test

Parametric Test Two Independent Samples

Used to see if two samples have identical population means. It is used when the sample sizes are small. The t statistic is computed as follows:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_{\bar{x}_1 - \bar{x}_2}}, \qquad s_{\bar{x}_1 - \bar{x}_2} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Two group t test

Parametric Test Two Independent Samples


sp ! s
2
2 p

( n1  1)( s1 )  ( n2  1)( s2 ) n1  n2  2
DOF= n1 + n1
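A minimal scipy sketch of the two-group t test with pooled (equal) variances; the ratings of users and non-users are hypothetical.

```python
from scipy import stats

users = [7, 8, 6, 9, 7, 8, 7]         # hypothetical brand ratings by users
non_users = [5, 6, 5, 7, 6, 5, 6]     # hypothetical ratings by non-users

t_stat, p_value = stats.ttest_ind(users, non_users, equal_var=True)
print(t_stat, p_value)   # compare p_value with the chosen significance level
```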

F test

Parametric Test Two Independent Samples

An F test of sample variances can be performed to determine whether the two populations have equal variances. In this case the hypotheses take the following form:

$$H_0: \sigma_1^2 = \sigma_2^2 \qquad H_1: \sigma_1^2 \neq \sigma_2^2$$

F test

Parametric Test Two Independent Samples


The F statistic is computed as:

$$F_{n_1 - 1,\; n_2 - 1} = \frac{s_1^2}{s_2^2}$$

where
n1 = size of sample 1
n2 = size of sample 2
n1 − 1 = degrees of freedom for sample 1
n2 − 1 = degrees of freedom for sample 2
s1² = sample variance for sample 1
s2² = sample variance for sample 2

Parametric Test Paired Samples


In many marketing research applications the observations for the two groups are not selected from independent samples. Rather, the observations relate to paired samples, in that the two sets of observations come from the same respondents. For example, a sample of respondents may rate two competing brands. The difference in this case is examined by a paired t test.

Parametric Test Paired Samples


In this case the hypotheses take the following form:

$$H_0: \mu_D = 0 \qquad H_1: \mu_D \neq 0$$

$$t_{n-1} = \frac{\bar{D} - \mu_D}{s_D / \sqrt{n}}$$
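A minimal scipy sketch of the paired t test, where the same (hypothetical) respondents rate two competing brands.

```python
from scipy import stats

brand_a = [7, 6, 8, 5, 7, 9, 6]
brand_b = [6, 5, 7, 5, 6, 8, 5]

t_stat, p_value = stats.ttest_rel(brand_a, brand_b)   # tests H0: mean difference = 0
print(t_stat, p_value)
```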

Non Parametric One sample


Kolmogorov-Smirnov test (K-S test)
A fully non-parametric test for comparing two distributions. It is used to see if a sample could have come from a certain distribution (normal, uniform, Poisson). It is a goodness-of-fit test.

Non Parametric One sample


Kolmogorov-Smirnov test (K-S test)

$$K = \max_i \left| A_i - O_i \right|$$

where
A_i = cumulative relative frequency for each category of the theoretical distribution
O_i = comparable value of the sample (observed) cumulative relative frequency
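A minimal scipy sketch of a one-sample K-S goodness-of-fit test against a normal distribution; the sample is hypothetical, and estimating the normal parameters from the same sample is a simplification (strictly, that calls for the Lilliefors variant).

```python
import numpy as np
from scipy import stats

sample = np.array([42, 30, 53, 50, 52, 30, 55, 49, 61, 74, 26, 58, 40, 40, 28])
d_stat, p_value = stats.kstest(sample, "norm",
                               args=(sample.mean(), sample.std(ddof=1)))
print(d_stat, p_value)   # a large p-value means the normal fit is not rejected
```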

One sample test of runs


A test for the randomness of the order of occurrences.

A run is a sequence of identical occurrences that are followed and preceded by different occurrences.
Example: The list of X's and O's below consists of 7 runs.

xxxooooxxooooxxxxoox

Suppose r is the number of runs, n1 is the number of type 1 occurrences and n2 is the number of type 2 occurrences.

The mean number of runs is
$$\mu_r = \frac{2 n_1 n_2}{n_1 + n_2} + 1$$

The standard deviation of the number of runs is
$$\sigma_r = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}}$$

If n1 and n2 are each at least 10, then r is approximately normal, so
$$Z = \frac{r - \mu_r}{\sigma_r}$$
is a standard normal variable.

Example: A stock exhibits a pattern of price increases (+) and decreases (−) over 25 business days. Test at the 1% level whether the pattern is random.

r = 16, n1 (number of +) = 13, n2 (number of −) = 12

$$\mu_r = \frac{2 n_1 n_2}{n_1 + n_2} + 1 = \frac{2(13)(12)}{13 + 12} + 1 = 13.48$$

$$\sigma_r = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}} = \sqrt{\frac{2(13)(12)\,[2(13)(12) - 13 - 12]}{(13 + 12)^2 (13 + 12 - 1)}} = 2.44$$

$$Z = \frac{r - \mu_r}{\sigma_r} = \frac{16 - 13.48}{2.44} = 1.03$$

Since the critical values for a two-tailed 1% test are 2.575 and −2.575 (rejection regions of .005 in each tail), we accept H0 that the pattern is random.

Mann-Whitney U test

Non Parametric Two independent samples

Sometimes the distributions of variables are not normal, or the samples taken are so small that one cannot tell whether they come from a normal distribution. Using the t test to tell whether there is a significant difference between the samples is not appropriate here. The Mann-Whitney U test can be used in these situations. This test can be used for very small samples (between 5 and 20).

Non Parametric Two independent samples


Mann-Whitney U test

$$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$$

Non Parametric Two independent samples


Mann-Whitney U test – Example
Two-tailed null hypothesis that there is no difference between the heights of male and female students:
H0: Male and female students are the same height
HA: Male and female students are not the same height

Non Parametric Two independent samples


Mann-Whitney U test – Example

Heights of males (cm): 193, 188, 185, 183, 180, 178, 170 (n1 = 7); ranks: 1, 2, 3, 4, 5, 6, 9; R1 = 30
Heights of females (cm): 175, 173, 168, 165, 163 (n2 = 5); ranks: 7, 8, 10, 11, 12; R2 = 48

Non Parametric Two independent samples


Mann-Whitney U test – Example

U = n1 n2 + n1(n1 + 1)/2 − R1
U = (7)(5) + (7)(8)/2 − 30 = 35 + 28 − 30 = 33
U′ = n1 n2 − U = (7)(5) − 33 = 2
Critical value: U0.05(2),7,5 = U0.05(2),5,7 = 30
As 33 > 30, H0 is rejected.
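The same example can be run in a couple of lines with scipy (a minimal sketch; scipy reports the U statistic and an exact or normal-approximation p-value rather than a table lookup).

```python
from scipy import stats

males = [193, 188, 185, 183, 180, 178, 170]
females = [175, 173, 168, 165, 163]

u_stat, p_value = stats.mannwhitneyu(males, females, alternative="two-sided")
print(u_stat, p_value)
```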

Non Parametric Two Independent Samples


Wilcoxon Rank Sum test
This test is used to test whether two independent samples have been drawn from populations with the same mean. It is a non-parametric substitute for the t test on the difference between two means.

Non Parametric Two Independent Samples


Wilcoxon Rank Sum test – Example

University A grades: 50, 52, 56, 60, 64, 68, 71, 74, 89, 95
University B grades: 70, 73, 77, 80, 83, 85, 87, 88, 96, 99

Based on the following samples from two universities, test at the 10% level whether graduates from the two schools have the same average grade on an aptitude test.

First merge and rank the grades, then sum the ranks for each sample.
Rank sum for university A: 74
Rank sum for university B: 136

| Rank | Grade | University | | Rank | Grade | University |
|------|-------|------------|-|------|-------|------------|
| 1    | 50    | A          | | 11   | 77    | B          |
| 2    | 52    | A          | | 12   | 80    | B          |
| 3    | 56    | A          | | 13   | 83    | B          |
| 4    | 60    | A          | | 14   | 85    | B          |
| 5    | 64    | A          | | 15   | 87    | B          |
| 6    | 68    | A          | | 16   | 88    | B          |
| 7    | 70    | B          | | 17   | 89    | A          |
| 8    | 71    | A          | | 18   | 95    | A          |
| 9    | 73    | B          | | 19   | 96    | B          |
| 10   | 74    | A          | | 20   | 99    | B          |

Note: If there are ties, each value gets the average rank. For example, if 2 values tie for 7th and 8th place, both are ranked 7.5.

Here, the group from university A is considered the 1st sample. When the samples differ in size, designate the smaller of the two samples as the 1st sample. Define W = the sum of the ranks for the 1st sample.

The mean of W is
$$\mu_W = \frac{n_1 (n_1 + n_2 + 1)}{2}$$

and the standard deviation is
$$\sigma_W = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$$

If n1 and n2 are each at least 10, W is approximately normal, so
$$Z = \frac{W - \mu_W}{\sigma_W}$$
has a standard normal distribution.

For our example, W = 74.

$$\mu_W = \frac{n_1 (n_1 + n_2 + 1)}{2} = \frac{10(20 + 1)}{2} = 105$$

$$\sigma_W = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{(10)(10)(20 + 1)}{12}} = 13.229$$

$$Z = \frac{W - \mu_W}{\sigma_W} = \frac{74 - 105}{13.229} = -2.343$$

Since the critical values for a two-tailed Z test at the 10% level are 1.645 and −1.645 (rejection regions of .05 in each tail), we reject H0 that the means are the same and accept H1 that the means are different.

Non Parametric Several Independent Samples


The Kruskal-Wallis Test
This test is used to test whether several populations have the same mean. It is a non-parametric substitute for a one-factor ANOVA.

The test statistic is
$$K = \frac{12}{n(n+1)} \sum_j \frac{R_j^2}{n_j} - 3(n+1)$$

where n_j is the number of observations in the jth sample, n is the total number of observations, and R_j is the sum of the ranks for the jth sample.

If each n_j ≥ 5 and the null hypothesis is true, then the distribution of K is χ² with dof = c − 1, where c is the number of sample groups.

In the case of ties, a corrected statistic should be computed:


$$K_c = \frac{K}{1 - \dfrac{\sum_j (t_j^3 - t_j)}{n^3 - n}}$$

where tj is the number of ties in the jth sample.

Kruskal-Wallis Test Example: Test at the 5% level whether average employee performance is the same at 3 firms, using the following standardized test scores for 20 employees.

We rank all 20 scores, then sum the ranks for each firm, then calculate the K statistic.

| Firm 1 score | Rank | Firm 2 score | Rank | Firm 3 score | Rank |
|--------------|------|--------------|------|--------------|------|
| 78           | 12   | 68           | 6    | 82           | 14   |
| 95           | 20   | 77           | 11   | 65           | 5    |
| 85           | 16   | 84           | 15   | 50           | 1    |
| 87           | 17   | 61           | 3    | 93           | 19   |
| 75           | 10   | 62           | 4    | 70           | 7    |
| 90           | 18   | 72           | 8    | 60           | 2    |
| 80           | 13   |              |      | 73           | 9    |
| n1 = 7       | R1 = 106 | n2 = 6   | R2 = 47 | n3 = 7    | R3 = 57 |

$$K = \frac{12}{n(n+1)} \sum_j \frac{R_j^2}{n_j} - 3(n+1) = \frac{12}{20(21)}\left[\frac{106^2}{7} + \frac{47^2}{6} + \frac{57^2}{7}\right] - 3(21) = 6.641$$

From the χ² table, the 5% critical value for a χ² distribution with 2 dof is 5.991. Since our value of K, 6.641, falls in the rejection region, we reject H0 that the means are the same and accept H1 that the means are different.
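A minimal scipy sketch of the same Kruskal-Wallis example; scipy returns the K statistic (about 6.64 here, matching the hand calculation) together with a p-value.

```python
from scipy import stats

firm1 = [78, 95, 85, 87, 75, 90, 80]
firm2 = [68, 77, 84, 61, 62, 72]
firm3 = [82, 65, 50, 93, 70, 60, 73]

k_stat, p_value = stats.kruskal(firm1, firm2, firm3)
print(k_stat, p_value)   # p < 0.05, so H0 of equal locations is rejected
```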

Sign Test
The earliest form of non-parametric testing, developed in 1710. It compares matched pairs of data and looks at the direction of the difference for each pair, not the size of the difference.

Each pair is expressed as:
0 if there is no difference
+ if the 1st of the pair is higher (i.e. +1)
− if the 1st of the pair is lower (i.e. −1)

Ignore the 0s. Let the test statistic S = the number of + signs. H0: there is no difference in the population. The sampling distribution of S is binomial, and the probability of success (getting a + sign) is 0.5.

Sampling distribution of S:
Mean: $\mu_S = np$
Standard deviation: $\sigma_S = \sqrt{npq}$
When n ≥ 10 we can use the normal distribution as a good approximation to the binomial. Hence the test statistic is
$$Z = \frac{S - \mu_S}{\sigma_S}$$

Example
A soft drink distributor wants to see whether people prefer an old diet cola to a new diet cola (e.g. Coke Zero).
n = 11 customers; 7 prefer the new brand, 4 still like the old brand. Let the new brand = +.

Hypothesis test
H0: Preferences for the two brands are similar (p = 0.5)
Ha: Preferences for the two brands are not similar
Test at α = 0.05, so Zcrit = 1.96
Let S = 7 = number of + (new-brand) preferences
Then μS = np = 11 × 0.5 = 5.5 and σS = 0.5 × √11 = 1.66
So Zcalc = (7 − 5.5)/1.66 = 0.9
Since 0.9 < 1.96, H0 cannot be rejected. Customers are indifferent between the brands of cola; if they are to be sold at the same price, equal quantities should be stocked.
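A minimal scipy sketch of the same sign test using the exact binomial distribution instead of the normal approximation (scipy.stats.binomtest, available in recent scipy versions).

```python
from scipy import stats

result = stats.binomtest(7, n=11, p=0.5)   # 7 of 11 customers prefer the new brand
print(result.pvalue)                       # well above 0.05, so H0 is not rejected
```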


Simple Linear Regression and correlation


In order to understand the relationship between two variables regression and correlation studies are carried out

Correlation
A measure of association between two numerical variables. For example, typically in the summer, as the temperature increases, people become thirstier.

It measures a linear relationship between two variables. It does not imply that either variable causes the other.

Correlation
For seven random summer days, a person recorded the temperature and their water consumption during a three-hour period spent outside.

| Temperature (°F)            | 75 | 83 | 85 | 85 | 92 | 97 | 99 |
|-----------------------------|----|----|----|----|----|----|----|
| Water consumption (ounces)  | 16 | 20 | 25 | 27 | 32 | 48 | 48 |

Correlation

Correlation
Pearson's sample correlation coefficient, r
Measures the direction and the strength of the linear association between two numerical paired variables.

Correlation
| r value | Interpretation                        |
|---------|---------------------------------------|
| 1       | perfect positive linear relationship  |
| 0       | no linear relationship                |
| −1      | perfect negative linear relationship  |

Correlation
Direction of correlation Positive Correlation Negative Correlation

Correlation
Strength of correlation
| r value (absolute) | Interpretation       |
|--------------------|----------------------|
| 0.9                | strong association   |
| 0.5                | moderate association |
| 0.25               | weak association     |

Correlation
Strength of correlation

Correlation Coefficient Formula

$$r = \frac{\mathrm{COV}_{XY}}{S_X S_Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$$

where
n = number of paired items
x_i = input variable, x̄ = mean of the x's, s_x = standard deviation of the x's
y_i = output variable, ȳ = mean of the y's, s_y = standard deviation of the y's
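A minimal scipy sketch computing Pearson's r for the temperature / water-consumption data above.

```python
from scipy import stats

temperature = [75, 83, 85, 85, 92, 97, 99]
water = [16, 20, 25, 27, 32, 48, 48]

r, p_value = stats.pearsonr(temperature, water)
print(r, p_value)   # r is close to +1, indicating a strong positive association
```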

Hypothesis testing for correlation


Inferences are made from the sample correlation coefficient (r) to the population correlation coefficient (ρ); r can be used as an estimate of ρ.
