
DATA MINER GUIDE

Descriptive Statistics - Factorial Analyses - Clustering
Linear Models - Discriminant Analyses
Scoring - Decision Trees

22 quai Gallieni - 92150 Suresnes - France

Tel: +33 1 57 32 60 60 - Fax: +33 1 57 32 62 00
spad@coheris.com - www.coheris.com
Siret: 399 467 927 00105 - APE: 5829C
Training register number: 11-92-1522492

Data Miner Guide


Copyright 1996, 2008 SPAD. All rights reserved.

For any further information about the SPAD software, training and consulting activities, please visit
us at www.coheris.com or contact us by email:

About               E-mail
SPAD Software       info-spad@coheris.com
SPAD Hot line       support-spad@coheris.com
Training            formation-spad@coheris.com
Consulting          consulting-spad@coheris.com
Books               publication-spad@coheris.com

For further information about the COHERIS Group offer (CRM, BI, Data Mining, Data Quality
Management, Merchandising, SFA), visit us at www.coheris.com


Table of contents

DESCRIPTIVE STATISTICS WITH SPAD
   STATS - MARGINAL DISTRIBUTIONS, HISTOGRAMS                               5
   DEMOD - AUTOMATIC CHARACTERIZATION OF A QUALITATIVE VARIABLE            16
   DESCO - AUTOMATIC CHARACTERIZATION OF A CONTINUOUS VARIABLE             21
   TABLE - CROSS TABLES                                                    25
   BIVAR - BIVARIATE ANALYSIS                                              28
FACTORIAL ANALYSES WITH SPAD                                               30
   PCA - PRINCIPAL COMPONENT ANALYSIS                                      32
   SCA - SIMPLE CORRESPONDENCE ANALYSIS                                    45
   MCA - MULTIPLE CORRESPONDENCE ANALYSIS                                  50
CLUSTERING WITH SPAD                                                       62
   RECIP / SEMIS - CLUSTERING ON FACTOR SCORES                             63
   PARTI - DECLA - CUT OF THE TREE AND CLUSTERS DESCRIPTION                69
   CLASS - MINER - CLUSTERS DESCRIPTION                                    78
   ESCAL - STORING THE FACTORIAL AXES AND THE PARTITIONS                   79
THE LINEAR MODEL AND ITS APPLICATIONS                                      80
   REGRESSION AND ANALYSIS OF VARIANCE, GENERAL LINEAR MODEL               80
   OPTIMAL REGRESSIONS RESEARCH                                            85
   LOGISTIC REGRESSION                                                     94
THE DISCRIMINANT AND ITS METHODS                                          105
   FUWILD - OPTIMAL DISCRIMINANT ANALYSIS                                 105
   DIS2GD - LINEAR DISCRIMINANT ANALYSIS BASED ON CONTINUOUS VARIABLES    117
   DIS2GFP - LINEAR DISCRIMINANT ANALYSIS BASED ON PRINCIPAL FACTORS      126
   DISCO - DISCRIMINANT ANALYSIS BASED ON QUALITATIVE VARIABLES           134
   SCORE - SCORING FUNCTION                                               134
   IDT 1 - INTERACTIVE DECISION TREE 1                                    154
   IDT 2 - INTERACTIVE DECISION TREE 2                                    154

DESCRIPTIVE STATISTICS WITH SPAD

STATS : marginal distributions, histograms, matrix plot, box plot

DEMOD : automatic characterization of a qualitative variable

DESCO : automatic characterization of a continuous variable

TABLE : cross tables

BIVAR : bivariate analysis


STATS - MARGINAL DISTRIBUTIONS, HISTOGRAMS

This procedure supplies a rapid and automatic description of your nominal and
continuous variables.
The Survey.sba base is an opinion survey file, which will be used for this example. The file is
supplied with the application and installed automatically on your PC.

SET THE PARAMETERS FOR A METHOD


Before it can be executed, a method must have its parameters set.
To access the parameter settings of a method, right-click on the method and choose the Set the
method command, or double-click on the method icon.
The calculation rules and parameter settings of each method are available online.

The Cases, Weighting and Parameters tabs are available for almost all SPAD methods.
Cases: the Cases tab lets you select the cases used for the method
Weighting: the weighting tab allows you to adjust the distribution of the cases in the sample
Parameters: options and settings of the method
5

STATS - marginal distributions, Histograms

The Cases tab


The Cases tab lets you select the cases with one of the following methods:

All the available cases
One or more logical filters (selection criteria combined with AND/OR)
A list of case names
A selection made on one or more intervals
A random draw

Apply a logical filter

1. Click on Logical filter.
2. Select the chosen variable.
3. Click on the operator.
4. Click on the operand.
5. Check the global definition of the filter.
6. Click on Validate.

In case of error, you can delete an expression from the filter by selecting the expression to discard
and clicking on Delete.
The cases satisfying the filter are considered as active, while the others are supplementary.
Select the individuals from a list

1. Select List as the selection method.
2. Select the status of the cases.
3. Choose your cases in the Available list and use the transfer buttons to select them.
6

Descriptive Statistics with SPAD

Select cases by interval

1. Select By interval as the selection method.
2. Select the status of the cases.
3. Define the interval as a function of its rank in the SPAD base.
4. Click on the arrow button to move your choice to the case status window.
You can save the definition of the selection by clicking on the Save button, which allows you
to re-use it later.
Do a random draw

1. Click on the Yes radio button to run a random draw.
2. Click on Define to set the parameters for the random draw.
3. Indicate the number of preliminary requests for the random draw. On another execution of the
   selection, you do not need to change this number unless you want to generate different draws.
4. Enter the percentage of the random draw, or the sample size after the draw.
5. Click on OK.
This selection lets you apply the method to a sample before applying it to the entire SPAD base.
It also lets you test the stability of the results of the method by executing it several times,
taking care to change the number of preliminary requests each time.

STATS - marginal distributions, Histograms

The Weighting tab


The weighting tab allows you to adjust the distribution of the cases in the sample:

According to a Weighting variable already in the file.


As a function of one or more theoretical percentages (calculation by adjustment).

1. Select the weighting type.
2. In the case of calculation by adjustment, choose the correcting variable in the available
   variables window and click on the Define button.
3. For each category, enter the theoretical percentage and hit Enter.
4. Once the theoretical percentage of every category is entered, click on OK.
You can repeat this operation for another variable. In this way you get an adjustment as a function
of several variables with a single weighting variable. This requires a calculation by successive
approximations, as shown in the window below:
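This successive-approximations calculation is essentially iterative proportional fitting (raking): each variable's categories are rescaled in turn until the weighted shares match the theoretical percentages. Below is a minimal sketch of the idea in Python; the arrays and target percentages are hypothetical, and SPAD's actual algorithm and stopping rules may differ.

# Minimal sketch of weight adjustment by successive approximations
# (iterative proportional fitting). Data and targets are made up.
import numpy as np

sex    = np.array(["m", "f", "f", "m", "f", "f", "m", "f"])
region = np.array(["N", "N", "S", "S", "N", "S", "N", "S"])
targets = {
    "sex":    (sex,    {"m": 0.49, "f": 0.51}),
    "region": (region, {"N": 0.40, "S": 0.60}),
}

w = np.ones(len(sex))                  # start from uniform weights
for _ in range(50):                    # successive approximations
    for values, pct in targets.values():
        total = w.sum()
        for cat, p in pct.items():
            mask = values == cat
            # rescale this category so its weighted share equals p
            w[mask] *= p * total / w[mask].sum()

for name, (values, pct) in targets.items():
    for cat in pct:
        print(name, cat, round(w[values == cat].sum() / w.sum(), 3))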

Click on Options in the first window to access the options window for the weighting system.
You can use the default options, or change the fitting options.
8

Descriptive Statistics with SPAD

Attention: the weighting calculated in the Weighting tab of a method is temporary (the
weighting variable is not saved). This approach lets you make quick tests and measure the
influence of the weighting on the results of the method. When a satisfactory weighting variable
has been obtained, it is preferable to create a permanent weighting variable with the Tools -
Weighting menu of the main menu (Data Management Manual, paragraph 4.3).
Then, in the Weighting tab of a method, select this variable as the weight variable.
The Marginal distributions tab
We select the categorical variables in the list below.

The Parameters button lets you choose whether to display the categories without any
respondent, and whether to display the missing data as an additional category.
The Statistics button displays summary statistics on the selected variables. For example,
select the Region where the respondent lives (V1), then click on the Statistics button. A
window opens with statistics on the variable: for categorical variables, it shows the count
and the percentage associated with each category; for continuous variables, it shows the
count, the mean, the standard deviation, as well as the minimum and maximum.

STATS - marginal distributions, Histograms

The Histograms - Categorization tab


This tab allows you to select continuous variables both for histograms/summary statistics
and for categorization (marginal distributions of the variables' values).

The Parameters button allows you to set global or specific parameters for the histogram
characteristics, such as the number of classes, the min and max bounds and the histogram
bar width.
You can also select continuous variables for categorization. As a result, each distinct value
is displayed with its frequency. This is a useful preliminary step before splitting a
continuous variable into classes.
Note that you cannot request both a histogram and a categorization for the same variable.

10

Descriptive Statistics with SPAD

The Marginal distributions by categories tab


This tab is useful for variables that are based on the same categories. The categories of
these variables must have the same labels and must be ranked in the same order (this can be
checked with the Marginal distributions tab).

The Parameters tab


This tab allows you to choose whether to export the results to Excel.

11

STATS - marginal distributions, Histograms

Once you have specified your request, validate the method by clicking on the OK button.

RESULTS
Results are accessible in the Execution view or by right-clicking on the method and
choosing the Results command. Then, depending on the method, different choices are
available between the results editor, the Graphics gallery and Excel results.
The results editor
The Result Editor opens up in a new window.

The information list has a tree structure. Clicking on a closed node opens a branch of the
tree, and clicking on an open node closes it. You can use the mouse to navigate through the
tree.
By double-clicking on a title, you display the relevant results in a new window.
The Layout option of the File menu allows you to customize results display on the screen.
The results can be printed or copied into your word processor, but they cannot be changed
in this editor.

12

Descriptive Statistics with SPAD

THE RESULTS OF THE STATS METHOD


SUMMARY STATISTICS OF THE VARIABLES

MARGINAL DISTRIBUTIONS OF CATEGORICAL VARIABLES

                                COUNTS
CATEGORY                        ACTUAL  %/TOTAL  %/EXPR.  HISTOGRAM OF WEIGHTS
--------------------------------------------------------------------------------
1 . Region where the respondent lives
Rg1 - Paris region                  56    17.78    17.78  *********
Rg2 - Paris Basin                   51    16.19    16.19  ********
Rg3 - north                         24     7.62     7.62  ****
Rg4 - east                          29     9.21     9.21  *****
Rg5 - west                          45    14.29    14.29  *******
Rg6 - south-west                    38    12.06    12.06  ******
Rg7 - center east                   36    11.43    11.43  ******
Rg8 - mediterranean                 36    11.43    11.43  ******
OVERALL                            315   100.00   100.00
--------------------------------------------------------------------------------
2 . Urban area size (number of inhabitants)
Agg1 - less than 2000               84    26.67    26.67  *************
Agg2 - 2001 to 5000                 18     5.71     5.71  ***
Agg3 - 5001 to 10000                18     5.71     5.71  ***
Agg4 - 10001 to 20000               12     3.81     3.81  **
Agg5 - 20001 to 50000               23     7.30     7.30  ****
Agg6 - 50001 to 100000              18     5.71     5.71  ***
Agg7 - 100001 to 200000             28     8.89     8.89  *****
Agg8 - more than 200000             68    21.59    21.59  **********
Agg9 - paris,paris.agglo            46    14.60    14.60  *******
OVERALL                            315   100.00   100.00
--------------------------------------------------------------------------------
3 . Sex of respondent
Sex1 - male                        138    43.81    43.81  *********************
Sex2 - female                      177    56.19    56.19  **************************
OVERALL                            315   100.00   100.00
--------------------------------------------------------------------------------

MARGINAL DISTRIBUTIONS OF CATEGORIZED VARIABLES

                                COUNTS
VALUE                           ACTUAL  %/TOTAL  %/EXPR.  % CUM.  HISTOGRAM OF WEIGHTS
----------------------------------------------------------------------------------------



26 . Number of persons in a housing
1.000                               38    12.06    12.06   12.06  ******
2.000                               90    28.57    28.57   40.63  *************
3.000                               69    21.90    21.90   62.54  **********
4.000                               71    22.54    22.54   85.08  **********
5.000                               34    10.79    10.79   95.87  *****
6.000                                7     2.22     2.22   98.10  *
7.000                                4     1.27     1.27   99.37  *
8.000                                2     0.63     0.63  100.00  *
OVERALL                            315   100.00   100.00
----------------------------------------------------------------------------------------
28 . Number of children
0.000                               70    22.22    22.22   22.22  **********
1.000                               67    21.27    21.27   43.49  **********
2.000                               94    29.84    29.84   73.33  *************
3.000                               54    17.14    17.14   90.48  ********
4.000                                9     2.86     2.86   93.33  **
5.000                               11     3.49     3.49   96.83  **
6.000                                2     0.63     0.63   97.46  *
7.000                                2     0.63     0.63   98.10  *
8.000                                2     0.63     0.63   98.73  *
9.000                                4     1.27     1.27  100.00  *
OVERALL                            315   100.00   100.00
----------------------------------------------------------------------------------------

SUMMARY STATISTICS OF CONTINUOUS VARIABLES

TOTAL COUNT : 315     TOTAL WEIGHT : 315.00
+------------------------------+----------------+---------------------+----------------------+--------------------+
| NUM . LABEL                  | COUNT   WEIGHT |     MEAN  STD. DEV. |   MINIMUM    MAXIMUM |    MIN.2     MAX.2 |
+------------------------------+----------------+---------------------+----------------------+--------------------+
|  4 . Age of respondent       |   315   315.00 |   43.756     16.581 |    18.000     86.000 |   19.000    83.000 |
| 41 . Family, children : i    |   315   315.00 |    6.651      1.062 |     1.000      7.000 |    2.000     6.000 |
| 42 . Work, profession : i    |   315   315.00 |    5.956      1.544 |     1.000      7.000 |    2.000     6.000 |
| 43 . Free time, relax: im    |   315   315.00 |    5.295      1.454 |     0.000      7.000 |    1.000     6.000 |
| 44 . Friends, acquaintanc    |   315   315.00 |    5.190      1.424 |     1.000      7.000 |    2.000     6.000 |
| 45 . Relatives, brothers,    |   315   315.00 |    5.629      1.436 |     1.000      7.000 |    2.000     6.000 |
| 46 . Religion : importanc    |   315   315.00 |    3.241      2.022 |     0.000      7.000 |    1.000     6.000 |
| 47 . Politic, political l    |   315   315.00 |    3.111      1.770 |     0.000      7.000 |    1.000     6.000 |
| 50 . State benefits : ave    |   283   283.00 |  533.795    926.899 |     0.000   5100.000 |   15.000  4980.000 |
| 51 . Salary of the respon    |   267   267.00 | 4408.547   4575.339 |     0.000  40000.000 |  300.000 24000.000 |
+------------------------------+----------------+---------------------+----------------------+--------------------+

HISTOGRAMS OF CONTINUOUS VARIABLES

VARIABLE 4 : Age of respondent
LOW. LIMIT |   MEAN | WEIGHT | HISTOGRAM (BETWEEN 16.00 INCLUDED AND 88.00 EXCLUDED, BAR INTERVAL WIDTH = 2.00)
-----------+--------+--------+--------------------------------------------------------------------------------
     16.00 |  20.93 |     28 | XXXXXXXXXXXXXX
     24.00 |  27.85 |     68 | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
     32.00 |  35.31 |     58 | XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
     40.00 |  43.35 |     37 | XXXXXXXXXXXXXXXXXX
     48.00 |  52.08 |     39 | XXXXXXXXXXXXXXXXXXX
     56.00 |  59.06 |     33 | XXXXXXXXXXXXXXXX
     64.00 |  67.09 |     33 | XXXXXXXXXXXXXXXX
     72.00 |  74.71 |     14 | XXXXXXX
     80.00 |  82.20 |      5 | XX

+------------+---------------------------+---------------------------+
|            |          OVERALL          |         HISTOGRAM         |
|            |  (FROM 18.00 TO 86.00)    |  (FROM 16.00 TO 88.00)    |
+------------+---------------------------+---------------------------+
| WEIGHT     |          315.00           |          315.00           |
| MEAN       |          43.756           |          43.756           |
| STD. DEV.  |          16.581           |          16.440           |
+------------+---------------------------+---------------------------+
WEIGHTS OF REMAINING CASES : STRICTLY LESS THAN 16.00 : 0.00 - GREATER THAN OR EQUAL TO 88.00 : 0.00



MARGINAL DISTRIBUTIONS OF GROUPED VARIABLES
COMMAND NUMBER 1

                                                    COUNTS
                                                    ACTUAL  %/TOTAL  %/EXPR.
DISTRIBUTION OF ANSWER : yes  FOR VARIABLES
Have you recently been nervous                      155.00    49.21    49.21
Have you recently had backaches                     149.00    47.30    47.30
Have you recently had headaches                     115.00    36.51    36.51
Have you recently been depressed                     50.00    15.87    15.87

DISTRIBUTION OF ANSWER : no  FOR VARIABLES
Have you recently been depressed                    265.00    84.13    84.13
Have you recently had headaches                     200.00    63.49    63.49
Have you recently had backaches                     166.00    52.70    52.70
Have you recently been nervous                      160.00    50.79    50.79


DEMOD - AUTOMATIC CHARACTERIZATION OF A QUALITATIVE VARIABLE

This extremely powerful procedure provides the automatic characterization of any


categorical variable.
This is the IDEAL procedure to find out everything about a variable in one question. The
well-structured outputs form comprehensive study reports.
One can characterize either each category of a variable, or the variable itself globally. All
the available elements (active and illustrative) may participate in the characterization: the
categories of the categorical variables, the categorical variables themselves, and the
continuous variables.
The following table summarizes all the capabilities of the DEMOD procedure:

Elements to characterize                                             Characterizing elements
-------------------------------------------------------------------  -----------------------
Groups of cases (defined by the categories of the variable to        categories
characterize). We describe each category with all its significant    categorical variables
characterizing elements.                                              continuous variables

The categorical variable to characterize. We cross the variable      categories
with all the characterizing elements and display only the elements   categorical variables
that are dependent on the variable to characterize.                   continuous variables

A group of cases is defined by a category of the variable to characterize: there are as many
groups of cases as the variable to characterize has categories.
Double-click on the DEMOD icon in order to access the settings of the method.


THE VARIABLES TAB


The scrolling menu allows you to select the variables to characterize and the characterizing
elements.

In this example, the variable to characterize is V8 - The family is the only place where you
feel well. All the other variables, whether categorical or continuous, are selected as
characterizing elements.


THE PARAMETERS TAB


This tab allows you to modify the default parameters for the DEMOD method.

Once you have set the parameters, validate the method by clicking on the OK button and run
the chain.


THE DEMOD RESULTS


THE DEMOD-5 EXCEL SHEET
Characterisation by categories of groups of
"The family is the only place where you feel well"

Group: Yes (Count: 230 - Percentage: 73.02)

Variable label                                              Characteristic        % of category  % of category  % of group   Test-   Probability  Weight
                                                            categories                in group         in set  in category   value
--------------------------------------------------------------------------------------------------------------------------------------------------------
Marital status                                              married                       78.26          70.79        80.72    4.55         0.000     223
Do you watch TV                                             every day                     62.61          55.87        81.82    3.83         0.000     176
Opinion about marriage                                      indissoluble                  31.30          25.71        88.89    3.79         0.000      81
Are you worried about the risk of a nuclear plant accident  a lot                         32.61          28.25        84.27    2.76         0.003      89
Do you have children                                        yes                           81.30          77.14        76.95    2.68         0.004     243
Are you worried about the risk of a road accident           a lot                         40.87          36.51        81.74    2.55         0.005     115
Educational level of the respondent                         primary school                20.43          17.14        87.04    2.50         0.006      54
Current situation of the respondent                         retired people                20.43          17.14        87.04    2.50         0.006      54
Are you worried about the risk of a mugging                 a lot                         33.04          29.21        82.61    2.38         0.009      92
Do you think the society needs to change                    I do not know                 11.30           9.21        89.66    2.01         0.022      29
Current situation of the respondent                         unemployed person              5.22           7.30        52.17   -2.02         0.022      23
Are you worried about the risk of a mugging                 not at all                    23.04          26.35        63.86   -2.02         0.022      83
Current situation of the respondent                         student                        2.17           3.81        41.67   -2.06         0.020      12
Educational level of the respondent                         technical and GCSE             3.48           5.40        47.06   -2.10         0.018      17
Marital status                                              cohabitation                   3.04           5.08        43.75   -2.30         0.011      16
Do you have work-personal life problems                     yes                           20.43          24.13        61.84   -2.33         0.010      76
Urban area size (number of inhabitants)                     more than 200000              17.83          21.59        60.29   -2.46         0.007      68
Your opinion on the life conditions in the future           improving a lot                3.91           6.67        42.86   -2.81         0.002      21
Do you watch TV                                             quite often                   19.57          24.13        59.21   -2.90         0.002      76
Marital status                                              single                         9.57          13.33        52.38   -2.93         0.002      42
Do you have children                                        no                            17.39          21.90        57.97   -2.96         0.002      69
Opinion about marriage                                      dissolved if agreem           30.87          36.19        62.28   -3.07         0.001     114
Are you worried about the risk of a road accident           a little                      15.65          20.32        56.25   -3.13         0.001      64
Educational level of the respondent                         more high school               9.13          13.65        48.84   -3.49         0.000      43

% of category in group:
Frequency of the category in the group divided by the frequency of the group.
% of category in set:
Frequency of the category in the whole population.
% of group in category:
Frequency of the group in the category divided by the frequency of the category.
Test-value:
When the test-value is greater than zero, the category is over-represented in the group;
when it is negative, the category is under-represented. By default, SPAD displays only the
characterizing elements with a test-value greater than or equal to 1.96 in absolute value
(i.e. a probability of 0.025 for a one-sided test).
Probability:
The probability evaluates the scale of the difference between the percentage of the category
in the group and the percentage of the category in the population. The lower the probability,
the more significant the difference and the greater the related test-value (the test-value is
the fractile of the normal distribution that corresponds to the same probability).
Weight:
Weight of the cases in the category.
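For unit weights, the order of magnitude of a test-value can be reproduced from the counts alone: under random sampling, the count of a category inside a group follows a hypergeometric distribution, and the test-value is the normal fractile matching the tail probability. Below is a minimal sketch (assuming scipy is available) applied to the "married" row above; SPAD's exact weighting and continuity conventions may make its printed value differ slightly.

# Sketch of the test-value for one characterizing category, unit weights.
from scipy.stats import hypergeom, norm

N = 315   # population size
K = 223   # cases with the category in the population ("married")
n = 230   # size of the group ("Yes")
k = 180   # cases with the category inside the group (78.26% of 230)

p = hypergeom.sf(k - 1, N, K, n)   # P(count >= k): over-representation
test_value = norm.ppf(1 - p)       # normal fractile of the same probability
print(test_value)                  # large and positive, close to the 4.55 printed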

THE DEMOD-13 EXCEL SHEET


Characterisation by continuous variables of categories of
"The family is the only place where you feel well"

Yes (Weight = 230.00  Count = 230)

Characteristic variables                             Category   Overall   Category   Overall  Test-value  Probability
                                                         mean      mean  Std. dev. Std. dev.
Age of respondent                                      46.100    43.756     16.752    16.581        4.12        0.000
Religion : importance given                             3.383     3.241      2.081     2.022        2.04        0.021
Relatives, brothers, sisters ... : importance given     5.726     5.629      1.380     1.436        1.98        0.024
Salary of the respondent                             4044.990  4408.550   3690.140  4575.340       -2.09        0.018

No (Weight = 83.00  Count = 83)

Characteristic variables                             Category   Overall   Category   Overall  Test-value  Probability
                                                         mean      mean  Std. dev. Std. dev.
Salary of the respondent                             5377.780  4408.550   6311.000  4575.340        2.10        0.018
Number of children                                      1.542     1.860      1.772     1.671       -2.02        0.022
Age of respondent                                      36.855    43.756     13.971    16.581       -4.41        0.000

Category mean:
Weighted mean of the variable in the category.
Overall mean:
Weighted mean of the variable in the overall population.
Interpretation:
One can see that the Age of respondent is the most characterizing continuous variable of the
group who answered yes to the question "The family is the only place where you feel well".
This group is significantly older than the average respondent, with a mean age of 46.1 years
compared to 43.76 years for the overall population.
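For a continuous variable, the test-value compares the group mean to the overall mean, scaled by the standard error of the mean of a random group of the same size drawn without replacement. The short sketch below, under this standard assumption, reproduces the Age of respondent row of the Yes group.

# Sketch of the test-value for a continuous variable (Age, group Yes).
import math

N, n = 315, 230                 # population size, group size
mean_all, std_all = 43.756, 16.581
mean_grp = 46.100

# variance of the mean of n cases drawn without replacement among N
var_mean = (std_all ** 2 / n) * (N - n) / (N - 1)
test_value = (mean_grp - mean_all) / math.sqrt(var_mean)
print(round(test_value, 2))     # -> 4.12, as in the sheet above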


DESCO - AUTOMATIC CHARACTERIZATION OF A CONTINUOUS VARIABLE

This procedure provides the statistical characterization of one or more continuous
variables by:

The other continuous variables, using correlations.
The categories of the categorical variables, by comparison of means.
The categorical variables themselves, with the help of Fisher's statistic (see the sketch below).
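As an illustration, the three statistics can be computed with standard tools. The sketch below uses Python with scipy on made-up arrays; the data and variable names are hypothetical, and DESCO's test-value presentation is not reproduced here.

# Sketch of the three statistics DESCO relies on, on hypothetical data.
import numpy as np
from scipy import stats

salary = np.array([2100., 3500., 2800., 5200., 4100., 6000., 1500., 3900.])
age    = np.array([  23.,   41.,   35.,   52.,   44.,   58.,   21.,   37.])
sex    = np.array(["f", "m", "f", "m", "f", "m", "f", "m"])

# 1) characterization by another continuous variable: correlation
r, p_r = stats.pearsonr(salary, age)

# 2) characterization by a category: comparison of means
mean_female = salary[sex == "f"].mean()
mean_male   = salary[sex == "m"].mean()

# 3) characterization by the categorical variable itself:
#    Fisher's statistic (one-way ANOVA F)
F, p_F = stats.f_oneway(salary[sex == "f"], salary[sex == "m"])

print(round(r, 2), mean_female, mean_male, round(F, 2))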

THE VARIABLES TAB


A continuous variable can be characterized by the other variables, whether categorical or
continuous, called characterizing variables.
The scrolling menu allows you to select the variables to characterize and the characterizing
elements.


THE PARAMETERS TAB


The parameter Minimum relative weight of characterizing elements is useful if you do not
want to display characterizing categories whose frequency in the population is lower than
2% (default threshold).

Only the categories whose related probabilities are lower than or equal to 0.025 are
displayed; this corresponds to a test-value of 1.96.


THE DESCO RESULTS


CHARACTERISATION OF CONTINUOUS VARIABLES

DESCRIPTION OF : Salary of the respondent


DESCRIPTION BY CATEGORIES
OF CONTINUOUS VARIABLE : Salary of the respondent
ON 267.0 ACTIVE CASES   MEAN = 4408.547   STD. DEV. = 4575.339
+-------+-------+----------+----------+---------------------+------------------------------------------------------------+--------+
| TEST  | PROB. |   MEAN   | STD.DEV. | CATEGORIES          | VARIABLE LABEL                                             | WEIGHT |
| VALUE |       |          |          |                     |                                                            |        |
+-------+-------+----------+----------+---------------------+------------------------------------------------------------+--------+
|  8.16 | 0.000 |  7060.53 |  4921.82 | yes, full time      | At the moment, do you have a professional activity         | 114.00 |
|  7.58 | 0.000 |  6496.32 |  4736.16 | employed            | Current situation of the respondent                        | 136.00 |
|  7.28 | 0.000 |  6617.07 |  4883.30 | no                  | Have you been unemployed during the last twelve months     | 123.00 |
|  6.69 | 0.000 |  6533.19 |  5486.12 | male                | Sex of respondent                                          | 117.00 |
|  4.60 | 0.000 |  6452.63 |  5414.05 | no                  | Do you have work-personal life problems                    |  76.00 |
|  4.25 | 0.000 |  6698.25 |  6784.83 | quite often         | Do you watch TV                                            |  57.00 |
|  3.73 | 0.000 |  6331.15 |  3880.83 | yes                 | Do you have work-personal life problems                    |  61.00 |
|  3.47 | 0.000 |  6797.37 |  6049.03 | more high school    | Educational level of the respondent                        |  38.00 |
|  3.35 | 0.000 |  4860.06 |  4834.30 | no                  | Have you recently been depressed                           | 217.00 |
|  3.18 | 0.001 |  5291.85 |  5418.67 | no                  | Have you recently been nervous                             | 135.00 |
|  3.10 | 0.001 |  6950.00 |  5579.71 | yes                 | Do you have a piano                                        |  28.00 |
|  2.89 | 0.002 |  6529.41 |  5935.61 | yes                 | Do you have a second house                                 |  34.00 |
|  2.88 | 0.002 |  6330.00 |  7536.22 | yes                 | Do you have a video-tape                                   |  40.00 |
|  2.65 | 0.004 |  5937.26 |  6786.27 | Paris region        | Region where the respondent lives                          |  51.00 |
|  2.43 | 0.008 |  5179.34 |  5246.40 | a lot               | Has the respondent been interested by the survey           | 117.00 |
|  2.17 | 0.015 |  6906.67 |  4638.46 | a lot better        | Your opinion on the evolution of the daily personal life   |  15.00 |
|  2.10 | 0.018 |  5377.78 |  6311.00 | No                  | The family is the only place where you feel well           |  72.00 |
| -2.01 | 0.022 |  3301.51 |  2735.77 | quite agree         | Persons like me often feel alone                           |  55.00 |
| -2.09 | 0.018 |  4044.99 |  3690.14 | Yes                 | The family is the only place where you feel well           | 193.00 |
| -2.14 | 0.016 |  3769.06 |  3573.01 | a lot               | Are you worried about the risk of having a serious illness | 125.00 |
| -2.23 | 0.013 |  3196.12 |  3440.69 | a lot worse         | Your opinion on the evolution of French people life level  |  56.00 |
| -2.47 | 0.007 |  3319.48 |  2735.76 | a lot               | Are you worried about the risk of a nuclear plant accident |  77.00 |
| -2.54 | 0.006 |  1971.43 |  1864.75 | unemployed person   | Current situation of the respondent                        |  21.00 |
| -2.57 | 0.005 |   760.00 |  1356.61 | student             | Current situation of the respondent                        |  10.00 |
| -2.66 | 0.004 |  2606.41 |  3255.77 | a lot worse         | Your opinion on the evolution of the daily personal life   |  39.00 |
| -2.86 | 0.002 |  3726.34 |  3277.03 | every day           | Do you watch TV                                            | 155.00 |
| -2.88 | 0.002 |  4069.97 |  3721.48 | no                  | Do you have a video-tape                                   | 227.00 |
| -2.89 | 0.002 |  4099.07 |  4253.85 | no                  | Do you have a second house                                 | 233.00 |
| -3.10 | 0.001 |  4110.81 |  4346.66 | no                  | Do you have a piano                                        | 239.00 |
| -3.18 | 0.001 |  3505.18 |  3271.07 | yes                 | Have you recently been nervous                             | 132.00 |
| -3.35 | 0.000 |  2449.00 |  2373.53 | yes                 | Have you recently been depressed                           |  50.00 |
| -3.49 | 0.000 |  2263.04 |  2043.80 | no qualifications   | Educational level of the respondent                        |  46.00 |
| -4.36 | 0.000 |   832.14 |  1563.89 | I have never worked | At the moment, do you have a professional activity         |  28.00 |
| -4.85 | 0.000 |  2691.10 |  3397.40 | no                  | At the moment, do you have a professional activity         | 103.00 |
| -6.54 | 0.000 |   488.54 |  1396.02 | housewife w/o prof. | Current situation of the respondent                        |  48.00 |
| -6.69 | 0.000 |  2751.33 |  2742.02 | female              | Sex of respondent                                          | 150.00 |
| -7.28 | 0.000 |  2311.41 |  3196.29 | missing category    | Do you have work-personal life problems                    | 130.00 |
| -7.28 | 0.000 |  2311.41 |  3196.29 | missing category    | Have you been unemployed during the last twelve months     | 130.00 |
+-------+-------+----------+----------+---------------------+------------------------------------------------------------+--------+
|       |       |  4408.55 |  4575.34 | OVERALL             |                                                            | 267.00 |
+-------+-------+----------+----------+---------------------+------------------------------------------------------------+--------+

DESCRIPTION BY CATEGORICAL VARIABLES
OF VARIABLE : Salary of the respondent
+------------+--------+-----------------------------------------------------------------+-----------------+--------+
| TEST-VALUE | PROBA. | NUM . VARIABLE LABEL                                            | DEN. DEG. FREE. | FISHER |
+------------+--------+-----------------------------------------------------------------+-----------------+--------+
|    8.56    | 0.000  |  5 . Current situation of the respondent                        |       261       |  21.44 |
|    8.48    | 0.000  | 18 . At the moment, do you have a professional activity         |       263       |  31.95 |
|    7.50    | 0.000  | 20 . Have you been unemployed during the last twelve months     |       264       |  35.01 |
|    7.28    | 0.000  | 19 . Do you have work-personal life problems                    |       264       |  32.89 |
|    6.98    | 0.000  |  3 . Sex of respondent                                          |       265       |  53.58 |
|    3.48    | 0.000  |  7 . Educational level of the respondent                        |       258       |   3.87 |
|    3.47    | 0.000  | 33 . Do you watch TV                                            |       263       |   6.57 |
|    3.38    | 0.001  | 24 . Have you recently been depressed                           |       265       |  11.69 |
|    3.21    | 0.001  | 23 . Have you recently been nervous                             |       265       |  10.50 |
|    3.12    | 0.002  | 16 . Do you have a piano                                        |       265       |   9.94 |
|    2.90    | 0.004  | 17 . Do you have a second house                                 |       265       |   8.58 |
|    2.89    | 0.004  | 15 . Do you have a video-tape                                   |       265       |   8.50 |
|    2.04    | 0.021  | 52 . Has the respondent been interested by the survey           |       264       |   3.92 |
|    1.92    | 0.054  | 21 . Have you recently had headaches                            |       265       |   3.74 |
|    1.77    | 0.039  | 30 . Your opinion on the evolution of the daily personal life   |       261       |   2.38 |
|    1.56    | 0.059  | 25 . Are you satisfied of your health                           |       263       |   2.51 |
|    1.33    | 0.092  | 40 . Are you worried about the risk of a nuclear plant accident |       263       |   2.16 |
|    1.31    | 0.189  | 29 . Do you regularly impose restrictions                       |       265       |   1.73 |
|    1.24    | 0.107  |  8 . The family is the only place where you feel well           |       264       |   2.24 |
|    1.12    | 0.132  |  1 . Region where the respondent lives                          |       259       |   1.61 |
|    1.07    | 0.143  | 39 . Are you worried about the risk of unemployment             |       263       |   1.82 |
|    1.03    | 0.151  | 35 . The computer science diffusion is...                       |       263       |   1.78 |
|    1.02    | 0.154  | 34 . Do you think the society needs to change                   |       264       |   1.86 |
|    0.92    | 0.179  | 49 . Persons like me often feel alone                           |       263       |   1.64 |
|    0.89    | 0.186  | 31 . Your opinion on the evolution of French people life level  |       260       |   1.48 |
|    0.86    | 0.194  | 36 . Are you worried about the risk of having a serious illness |       263       |   1.58 |
|    0.79    | 0.428  | 22 . Have you recently had backaches                            |       265       |   0.63 |
|    0.78    | 0.217  | 11 . Are you satisfied of your housing                          |       263       |   1.49 |
|    0.65    | 0.257  | 37 . Are you worried about the risk of a mugging                |       263       |   1.35 |
|    0.45    | 0.327  | 13 . Occupation status of housing                               |       262       |   1.16 |
|    0.22    | 0.412  | 27 . Do you have children                                       |       264       |   0.88 |
|    0.13    | 0.446  | 38 . Are you worried about the risk of a road accident          |       263       |   0.89 |
|    0.10    | 0.459  |  6 . Marital status                                             |       262       |   0.91 |
|    0.08    | 0.469  |  9 . Opinion about marriage                                     |       263       |   0.85 |
|   -0.15    | 0.561  | 32 . Your opinion on the life conditions in the future          |       261       |   0.79 |
|   -0.21    | 0.585  | 12 . Are you satisfied of your daily life                       |       263       |   0.65 |
|   -0.23    | 0.591  | 14 . The housing expenses are for you                           |       260       |   0.77 |
|   -0.53    | 0.702  | 10 . Housekeeping works, take care of children...               |       263       |   0.47 |
|   -0.59    | 0.724  |  2 . Urban area size (number of inhabitants)                    |       258       |   0.66 |
|   -0.64    | 0.740  | 48 . Your opinion on the justice running in 1986                |       261       |   0.55 |
+------------+--------+-----------------------------------------------------------------+-----------------+--------+

SUMMARY STATISTICS OF CONTINUOUS VARIABLES

TOTAL COUNT : 315     TOTAL WEIGHT : 315.00
+----------------------------------+----------------+---------------------+---------------------+
| NUM . IDEN - LABEL               | COUNT   WEIGHT |     MEAN  STD. DEV. |  MINIMUM    MAXIMUM |
+----------------------------------+----------------+---------------------+---------------------+
|  4 . Age  - Age of respondent    |   267   267.00 |    43.61      16.88 |    18.00      83.00 |
| 26 . Nbpr - Number of persons in |   267   267.00 |     3.04       1.43 |     1.00       8.00 |
| 28 . Nbef - Number of children   |   267   267.00 |     1.85       1.69 |     0.00       9.00 |
| 41 . Fami - Family, children : i |   267   267.00 |     6.65       1.07 |     1.00       7.00 |
| 42 . Trav - Work, profession : i |   267   267.00 |     5.90       1.57 |     1.00       7.00 |
| 43 . Lois - Free time, relax: im |   267   267.00 |     5.30       1.43 |     0.00       7.00 |
| 44 . Amis - Friends, acquaintanc |   267   267.00 |     5.18       1.41 |     1.00       7.00 |
| 45 . Part - Relatives, brothers, |   267   267.00 |     5.63       1.44 |     1.00       7.00 |
| 46 . Reli - Religion : importanc |   267   267.00 |     3.15       1.96 |     1.00       7.00 |
| 47 . Poli - Politic, political l |   267   267.00 |     3.15       1.79 |     1.00       7.00 |
| 50 . PrFm - State benefits : ave |   244   244.00 |   583.10     966.04 |     0.00    5100.00 |
| 51 . Salr - Salary of the respon |   267   267.00 |  4408.55    4575.34 |     0.00   40000.00 |
+----------------------------------+----------------+---------------------+---------------------+

CORRELATIONS WITH CONTINUOUS VARIABLES
OF VARIABLE : Salary of the respondent
+------------+-------+-------------+----------------------------------------------+---------+
| TEST-VALUE | PROB. | CORRELATION | NUM . VARIABLE LABEL                         |  WEIGHT |
+------------+-------+-------------+----------------------------------------------+---------+
|   99.90    | 0.000 |    1.000    | 51 . Salary of the respondent                | 267.000 |
|   -2.53    | 0.006 |   -0.162    | 50 . State benefits : average monthly amount | 244.000 |
+------------+-------+-------------+----------------------------------------------+---------+


TABLE - CROSS TABLES


With this procedure, you can obtain in one go an unlimited number of tables of counts,
means or frequencies.

THE TABLES TAB


This tab allows you to define the cross tables to create.
The table cells can display weights, % row, % column, averages and standard deviations,
depending on the parameters and settings.
The scrolling menu allows you to define the cross tables you want to display, with or
without supplementary information such as the mean or frequency of another variable.

If a variable appears in the Means column, each cell of the cross table will display the
weighted average of that variable for the cases of the cell.
If a variable appears in the Frequencies column, each cell of the cross table will display
the weighted sum of the values of that variable for the cases of the cell.
By clicking on Local filter, you can define a specific filter for each command.


THE PARAMETERS TAB


THE TABLE RESULTS


CROSS-TABS

LIST OF COMMANDS
COMMAND 1 : TABLE 1  BY ROW : 9 . Opinion about marriage   BY COLUMN : 3 . Sex of respondent
COMMAND 2 : TABLE 2  BY ROW : 9 . Opinion about marriage   BY COLUMN : 3 . Sex of respondent
                     MEANS OF : 4 . Age of respondent

LIST OF CROSS-TABS

TABLE 1  BY ROW : Opinion about marriage   BY COLUMN : Sex of respondent   TOTAL WEIGHT : 315.

WEIGHT               |   male   |  female  |  OVERALL
COLUMN PERC.         |          |          |
ROW PERC.            |          |          |
---------------------+----------+----------+----------
indissoluble         |       41 |       40 |       81
                     |    29.71 |    22.60 |    25.71
                     |    50.62 |    49.38 |   100.00
---------------------+----------+----------+----------
dissolved serious pb |       39 |       69 |      108
                     |    28.26 |    38.98 |    34.29
                     |    36.11 |    63.89 |   100.00
---------------------+----------+----------+----------
dissolved if agreem  |       50 |       64 |      114
                     |    36.23 |    36.16 |    36.19
                     |    43.86 |    56.14 |   100.00
---------------------+----------+----------+----------
I do not know        |        8 |        4 |       12
                     |     5.80 |     2.26 |     3.81
                     |    66.67 |    33.33 |   100.00
---------------------+----------+----------+----------
OVERALL              |      138 |      177 |      315
                     |   100.00 |   100.00 |   100.00
                     |    43.81 |    56.19 |   100.00
------------------------------------------------------
KHI2 = 6.67 / 3 DEGREES OF FREEDOM / 0 EXPECTED FREQUENCIES LESS THAN 5
PROB. ( KHI2 > 6.67 ) = 0.083 / TEST-VALUE = 1.38
------------------------------------------------------

TABLE 2  BY ROW : Opinion about marriage   BY COLUMN : Sex of respondent   TOTAL WEIGHT : 315.
MEANS OF : Age of respondent

WEIGHT               |   male   |  female  |  OVERALL
MEAN                 |          |          |
STD. DEV.            |          |          |
---------------------+----------+----------+----------
indissoluble         |       41 |       40 |       81
                     |   45.829 |   48.325 |   47.062
                     |   17.234 |   17.084 |   17.206
---------------------+----------+----------+----------
dissolved serious pb |       39 |       69 |      108
                     |   43.000 |   46.362 |   45.148
                     |   14.739 |   18.260 |   17.148
---------------------+----------+----------+----------
dissolved if agreem  |       50 |       64 |      114
                     |   41.300 |   38.484 |   39.719
                     |   15.442 |   14.330 |   14.893
---------------------+----------+----------+----------
I do not know        |        8 |        4 |       12
                     |   50.250 |   41.250 |   47.250
                     |   15.618 |    8.842 |   14.377
---------------------+----------+----------+----------
OVERALL              |      138 |      177 |      315
                     |   43.645 |   43.842 |   43.756
                     |   16.007 |   17.015 |   16.581
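As a quick check, the KHI2 statistic of TABLE 1 can be reproduced from the weights of the table with scipy:

# Sketch reproducing KHI2 for TABLE 1 (rows: opinion about marriage,
# columns: male / female), using the weights printed above.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[41, 40],    # indissoluble
                  [39, 69],    # dissolved serious pb
                  [50, 64],    # dissolved if agreem
                  [ 8,  4]])   # I do not know

chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 2), dof, round(p, 3))   # -> 6.67, 3 df, p = 0.083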


BIVAR - BIVARIATE ANALYSIS


The BIVAR procedure lets you characterize a sample from the viewpoint of two particular
continuous variables (AXES variables or base variables). The sample can be described by
categorical variables and by other continuous variables.

THE VARIABLES TAB


With this tab, the SPAD user selects the two continuous variables for the bivariate
analysis.
It is possible to include in the analysis some supplementary variables (whether continuous
or categorical).

The graph editor of the BIVAR method is the same as the one used for the factorial analyses.
The capabilities of the graph editor are described in the Factorial analyses section.


FACTORIAL ANALYSES WITH SPAD

PCA : Principal Component Analysis (PCA)

SCA : Simple Correspondence Analysis (SCA)

MCA : Multiple Correspondence Analysis (MCA)

DEFAC : Factors description

SPAD provides the main techniques in multidimensional exploratory analysis, combined
with procedures for clustering. One area of application concerns the processing of
large-scale surveys in market research and socio-economic research.
The main applications of factorial analyses are: (1) to reduce the number of dimensions
and (2) to detect structure in the relationships between variables. Factor analysis is
therefore applied as a data reduction or structure detection method.


VOCABULARY

Active variables
Variables used to perform the factorial analysis.

Supplementary variables
Variables that are not used to perform the original analysis, but are used to illustrate
the main results of the analysis.

Contribution
Criterion that measures the contribution of an element (category, variable, frequency or
case) to the inertia (total inertia, dimension inertia).

Cosines
Criterion that measures the quality of representation of an element (category, variable,
case or frequency) on each dimension.

Axes, factors, dimensions
These terms correspond to the factors computed or extracted by the analysis. Consecutive
factors are uncorrelated, or orthogonal to each other. Factors are extracted successively
by maximizing the remaining variability in the active data.


PCA - PRINCIPAL COMPONENT ANALYSIS


This method performs the principal component analysis of a sample of cases described by
continuous variables. The analysis can be performed on the original variables or on normed
variables (centered and standardized), depending on whether or not the active variables are
on the same scale.
It is possible to introduce supplementary elements such as cases, other continuous
variables or categorical variables.

Import the Sba dataset Cars.sba.


Drag and drop the PCA method on the Cars dataset as follows.

The two goals of the analysis are:

Capture the main interrelationships between correlated variables in a small number of
summary characteristics: dimension reduction.
Identify automobile models with similar attributes: a useful step for developing a
clustering or classification model.
The dataset contains measurements on 6 variables for 24 models: cubic capacity, power,
speed, weight, length and width.
Due to strong differences in measurement scales, we will perform a PCA on normed
variables.
KIDEN                Cubic capacity  Power  Speed  Weight  Length  Width
Honda civic                    1396     90    174     850     369    166
Peugeot 205 Rallye             1294    103    189     805     370    157
Seat Ibiza SX I                1461    100    181     925     363    161
Citroën AX Sport               1294     95    184     730     350    160
Renault 19                     1721     92    180     965     415    169
Fiat Tipo                      1580     83    170     970     395    170
Peugeot 405                    1769     90    180    1080     440    169
Renault 21                     2068     88    180    1135     446    170
Citroën BX                     1769     90    182    1060     424    168
Opel Omega                     1998    122    190    1255     473    177
Peugeot 405 Break              1905    125    194    1120     439    171
Ford Sierra                    1993    115    185    1190     451    172
Renault Espace                 1995    120    177    1265     436    177
Nissan Vanette                 1952     87    144    1430     436    169
VW Caravelle                   2109    112    149    1320     457    184
Audi 90 Quattro                1994    160    214    1220     439    169
BMW 530i                       2986    188    226    1510     472    175
Rover 827i                     2675    177    222    1365     469    175
Renault 25                     2548    182    226    1350     471    180
BMW 325iX                      2494    171    208    1600     432    164
Ford Scorpio                   2933    150    200    1345     466    176
Fiat Uno                       1116     58    145     780     364    155
Peugeot 205                    1580     80    159     880     370    156
Ford Fiesta                    1117     50    135     810     371    162

The matrix plot, performed with the STATS method, gives a good overview of the pairwise
relationships between the variables.


THE SETTING OPTIONS


THE VARIABLES TAB
This tab allows the SPAD user to define the following elements:
Active continuous variables
Supplementary continuous variables
Supplementary categorical variables
In our example, we select all the available continuous variables as active. No other
variables remain available as supplementary elements.


THE CASES TAB


The Cases tab allows you to define the role of the cases in the analysis.
The retained cases are the ACTIVE cases; those not retained are called ILLUSTRATIVE
or SUPPLEMENTARY. By using the selections by list or by interval, we can also define
ABANDONED cases (which are neither active nor illustrative).
All the calculations that lead to the factorial planes, to the hierarchical classification
tree and to the final partitions are carried out only on the active cases. The illustrative
cases may be projected onto the factorial planes and re-assigned, during the partition into
classes, to the class they are closest to, or they may form a missing-data class.
The abandoned cases are completely ignored in the calculations and automatically assigned
to a missing-data class in the partitions.
If you conduct many analyses on a particular sub-population, it may be preferable to
create a BASE corresponding to it. To do this, use the Recoding chain in the Tools menu.
In the Cars example, we select all the cases as active.

THE PARAMETERS TAB

Note: case coordinates are not displayed by default.

NORMED PCA AND NOT NORMED PCA



Normed PCA means that all the active variables are previously centered and standardized
by SPAD. The consequence is that all the variables are assigned the same contribution to
the overall inertia.
When the PCA is not normed (variables only centered), the distance between a variable and
the origin is equal to the variance of the variable.
Most of the time, it is advisable to perform a normed analysis in order to assign the same
importance to each active variable. It is particularly recommended when the measurement
scales are different.
In our example, the measurement scales are strongly different, so we will perform a
normed PCA.
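The distinction can be summarized numerically: a normed PCA diagonalizes the correlation matrix, while a not normed PCA diagonalizes the covariance matrix. A minimal sketch in Python follows, on hypothetical data whose columns mimic the very different scales of the Cars variables.

# Sketch: normed PCA = eigenvalues of the correlation matrix,
# not normed PCA = eigenvalues of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(24, 6)) * [500, 38, 25, 243, 40, 7]  # unequal scales

cov_eig = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
cor_eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

print(cov_eig.round(1))  # dominated by the variables with large scales
print(cor_eig.round(2))  # each variable contributes 1; eigenvalues sum to 6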
RETAINED COORDINATES
The number of retained coordinates is useful for the methods that follow the PCA in the
chain. These methods can be DEFAC (factors description) and RECIP/SEMIS (clustering).


THE PCA RESULTS


PRINCIPAL COMPONENTS ANALYSIS

SUMMARY STATISTICS OF CONTINUOUS VARIABLES
TOTAL COUNT : 24     TOTAL WEIGHT : 24.00
+------------------------------+----------------+--------------------+----------------------+
| NUM . IDEN - LABEL           | COUNT   WEIGHT |    MEAN   STD.DEV. |   MINIMUM    MAXIMUM |
+------------------------------+----------------+--------------------+----------------------+
|  1 . CYLI - Cubic capacity   |    24    24.00 | 1906.13     516.79 |   1116.00    2986.00 |
|  2 . PUIS - Power            |    24    24.00 |  113.67      37.97 |     50.00     188.00 |
|  3 . VITE - Speed            |    24    24.00 |  183.08      24.68 |    135.00     226.00 |
|  4 . POID - Weight           |    24    24.00 | 1123.33     243.20 |    730.00    1600.00 |
|  5 . LONG - Length           |    24    24.00 |  421.58      40.47 |    350.00     473.00 |
|  6 . LARG - Width            |    24    24.00 |  168.83       7.49 |    155.00     184.00 |
+------------------------------+----------------+--------------------+----------------------+

CORRELATION MATRIX
     |  CYLI  PUIS  VITE  POID  LONG  LARG
-----+-------------------------------------
CYLI |  1.00
PUIS |  0.86  1.00
VITE |  0.69  0.89  1.00
POID |  0.90  0.77  0.51  1.00
LONG |  0.86  0.69  0.53  0.86  1.00
LARG |  0.71  0.55  0.36  0.70  0.86  1.00
-----+-------------------------------------

The linear correlation coefficient points out the intensity of the relationship between two
continuous variables. The correlation coefficient ranges from -1 to +1. The closer the
correlation coefficient is to +1 or -1, the more closely the two variables are related.
TEST-VALUES MATRIX
     |  CYLI  PUIS  VITE  POID  LONG  LARG
-----+-------------------------------------
CYLI | 99.99
PUIS |  6.35 99.99
VITE |  4.19  7.06 99.99
POID |  7.14  4.99  2.74 99.99
LONG |  6.42  4.14  2.90  6.40 99.99
LARG |  4.34  3.05  1.86  4.25  6.41 99.99
-----+-------------------------------------

This matrix is related to the previous one: SPAD translates the correlation test into a
test-value. The higher the test-value, the more closely related the two variables. A
test-value lower than 2 can be considered to indicate no significant linear relationship
between the two variables.
EIGENVALUES
COMPUTATIONS PRECISION SUMMARY : TRACE BEFORE DIAGONALISATION.. 6.0000
                                 SUM OF EIGENVALUES............ 6.0000
HISTOGRAM OF THE FIRST 6 EIGENVALUES
+--------+------------+------------+------------+----------------------------------------------------------------------------------+
| NUMBER | EIGENVALUE | PERCENTAGE | CUMULATED  |                                                                                  |
|        |            |            | PERCENTAGE |                                                                                  |
+--------+------------+------------+------------+----------------------------------------------------------------------------------+
|    1   |   4.6173   |    76.96   |    76.96   | ******************************************************************************** |
|    2   |   0.8788   |    14.65   |    91.60   | ****************                                                                 |
|    3   |   0.3035   |     5.06   |    96.66   | ******                                                                           |
|    4   |   0.1055   |     1.76   |    98.42   | **                                                                               |
|    5   |   0.0732   |     1.22   |    99.64   | **                                                                               |
|    6   |   0.0216   |     0.36   |   100.00   | *                                                                                |
+--------+------------+------------+------------+----------------------------------------------------------------------------------+

In the second column (Eigenvalue) above, we find the variance on the new factors that
were successively extracted. In the third column, these values are expressed as a
percentage of the total variance. As we can see, factor 1 accounts for 77 percent of the
variance, factor 2 for 15 percent, and so on. As expected, the sum of the eigenvalues is
equal to the number of variables. The fourth column contains the cumulative variance
extracted. The variances extracted by the factors are called the eigenvalues; this name
derives from the computational issues involved.
Eigenvalues and the Number-of-Factors Problem
Now that we have a measure of how much variance each successive factor extracts, we can
return to the question of how many factors to retain. By its nature this is an arbitrary
decision. However, there are some guidelines that are commonly used, and that, in
practice, seem to yield the best results.
The Kaiser criterion. First, we can retain only factors with eigenvalues greater than 1. In
essence this is like saying that, unless a factor extracts at least as much as the equivalent of
one original variable, we drop it. This criterion was proposed by Kaiser (1960), and is
probably the one most widely used. In our example above, using this criterion, we would
retain 1 factor (principal component).
The scree test. A graphical method is the scree test first proposed by Cattell (1966). We can
plot the eigenvalues shown above in a simple line plot.
[Scree plot: the six eigenvalues plotted in decreasing order.]
Cattell suggests finding the place where the smooth decrease of eigenvalues appears to
level off to the right of the plot. To the right of this point, presumably, one finds only
"factorial scree" ("scree" is the geological term for the debris that collects on the
lower part of a rocky slope). According to this criterion, we would probably retain 1 or 2
factors in our example.
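For reference, here is a sketch of how such a scree plot can be drawn outside SPAD, using the eigenvalues listed above (matplotlib assumed available):

# Sketch of the scree plot for the Cars PCA eigenvalues.
import matplotlib.pyplot as plt

eigenvalues = [4.6173, 0.8788, 0.3035, 0.1055, 0.0732, 0.0216]
plt.plot(range(1, 7), eigenvalues, marker="o")
plt.axhline(1.0, linestyle="--")   # Kaiser criterion threshold
plt.xlabel("Factor number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot - Cars PCA")
plt.show()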
RESEARCH OF IRREGULARITIES (THIRD DIFFERENCES)
+--------------+--------------+------------------------------------------------------+
| IRREGULARITY | IRREGULARITY |                                                      |
|   BETWEEN    |    VALUE     |                                                      |
+--------------+--------------+------------------------------------------------------+
|    1 -- 2    |   -2785.86   | **************************************************** |
+--------------+--------------+------------------------------------------------------+

RESEARCH OF IRREGULARITIES (SECOND DIFFERENCES)
+--------------+--------------+------------------------------------------------------+
| IRREGULARITY | IRREGULARITY |                                                      |
|   BETWEEN    |    VALUE     |                                                      |
+--------------+--------------+------------------------------------------------------+
|    1 -- 2    |    3163.20   | **************************************************** |
|    2 -- 3    |     377.34   | *******                                              |
+--------------+--------------+------------------------------------------------------+



ANDERSON'S LAPLACE INTERVALS WITH 0.95 THRESHOLD
+--------+-------------+------------+-------------+
| NUMBER | LOWER LIMIT | EIGENVALUE | UPPER LIMIT |
+--------+-------------+------------+-------------+
|    1   |    1.9486   |   4.6173   |    7.2860   |
|    2   |    0.3709   |   0.8788   |    1.3868   |
|    3   |    0.1281   |   0.3035   |    0.4789   |
|    4   |    0.0445   |   0.1055   |    0.1665   |
|    5   |    0.0309   |   0.0732   |    0.1154   |
+--------+-------------+------------+-------------+
[ASCII chart: length and relative position of the five confidence intervals.]

Third and second differences, as well as Anderson's Laplace intervals, are other guidelines
that help the SPAD user choose the number of dimensions to retain for further analyses.
LOADINGS OF VARIABLES ON AXES 1 TO 5
ACTIVE VARIABLES
----------------------+--------------------------------+--------------------------------+--------------------------------
VARIABLES             |            LOADINGS            |  VARIABLE-FACTOR CORRELATIONS  |       NORMED EIGENVECTORS
----------------------+--------------------------------+--------------------------------+--------------------------------
IDEN - SHORT LABEL    |    1     2     3     4     5   |    1     2     3     4     5   |    1     2     3     4     5
----------------------+--------------------------------+--------------------------------+--------------------------------
CYLI - Cubic capacity |  0.96  0.01 -0.15  0.04 -0.23  |  0.96  0.01 -0.15  0.04 -0.23  |  0.45  0.01 -0.27  0.11 -0.84
PUIS - Power          |  0.90  0.38 -0.02 -0.16  0.04  |  0.90  0.38 -0.02 -0.16  0.04  |  0.42  0.41 -0.03 -0.49  0.15
VITE - Speed          |  0.75  0.62  0.20  0.08  0.04  |  0.75  0.62  0.20  0.08  0.04  |  0.35  0.66  0.37  0.26  0.13
POID - Weight         |  0.91 -0.18 -0.35 -0.06  0.11  |  0.91 -0.18 -0.35 -0.06  0.11  |  0.42 -0.19 -0.63 -0.18  0.42
LONG - Length         |  0.92 -0.30  0.05  0.22  0.07  |  0.92 -0.30  0.05  0.22  0.07  |  0.43 -0.32  0.10  0.69  0.26
LARG - Width          |  0.80 -0.48  0.34 -0.14 -0.02  |  0.80 -0.48  0.34 -0.14 -0.02  |  0.37 -0.51  0.62 -0.42 -0.06
----------------------+--------------------------------+--------------------------------+--------------------------------

For a normed PCA, the variable-factor correlations and the loadings are equivalent.
Apparently, the first factor is generally more highly correlated with the variables than the
second factor. This is to be expected because, as previously described, these factors are
extracted successively and will account for less and less variance overall.
Normed eigenvectors are the coefficients that describe the linear relationship between the
active normed variables and the factors. In this example, we have:

Factor1 = 0.45 * (CYLI - Mean(CYLI)) / STDEV(CYLI) + 0.42 * (PUIS - Mean(PUIS)) / STDEV(PUIS) + 0.35 * ...
Note:
SPAD prints out neither the contributions nor the cosines for the active variables.
However, they can be calculated as follows:

Cos(j, a) = Loading(j, a) for a normed PCA
Cos(j, a) = Correlation(j, a) for both normed and not normed PCA
Contribution(j, a) = NormedEigenVector(j, a)^2
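These relations can be checked numerically. The sketch below diagonalizes the correlation matrix printed earlier; because that matrix is rounded to two decimals, the eigenvalues and loadings only approximately match the SPAD listing.

# Sketch: eigenvalues, loadings and contributions from the Cars
# correlation matrix (values rounded to 2 decimals in the listing).
import numpy as np

R = np.array([[1.00, 0.86, 0.69, 0.90, 0.86, 0.71],
              [0.86, 1.00, 0.89, 0.77, 0.69, 0.55],
              [0.69, 0.89, 1.00, 0.51, 0.53, 0.36],
              [0.90, 0.77, 0.51, 1.00, 0.86, 0.70],
              [0.86, 0.69, 0.53, 0.86, 1.00, 0.86],
              [0.71, 0.55, 0.36, 0.70, 0.86, 1.00]])

vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]          # sort axes by decreasing eigenvalue
vals, vecs = vals[order], vecs[:, order]
vecs = vecs * np.sign(vecs.sum(axis=0)) # fix the arbitrary sign convention

print(vals.round(4))                    # ~ 4.6173, 0.8788, 0.3035, ...
loadings = vecs * np.sqrt(vals)         # = variable-factor correlations
print(loadings[:, 0].round(2))          # axis 1: ~ 0.96, 0.90, 0.75, ...
print((100 * vecs[:, 0] ** 2).round(1)) # contributions to axis 1 (in %)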



FACTOR SCORES, CONTRIBUTIONS AND SQUARED COSINES OF CASES
AXES 1 TO 5
+--------------------+---------+-------+-------------------------------+--------------------------+--------------------------+
| IDENTIFIER         | REL.WT. | DISTO |         FACTOR SCORES         |      CONTRIBUTIONS       |     SQUARED COSINES      |
|                    |         |       |    1     2     3     4     5  |    1    2    3    4    5 |    1    2    3    4    5 |
+--------------------+---------+-------+-------------------------------+--------------------------+--------------------------+
| Honda civic        |   4.17  |  4.59 | -2.01  0.32  0.50 -0.44 -0.10 |  3.6  0.5  3.4  7.6  0.6 | 0.88 0.02 0.05 0.04 0.00 |
| Peugeot 205 Rallye |   4.17  |  7.37 | -2.25  1.49  0.14  0.09  0.19 |  4.6 10.6  0.3  0.3  2.1 | 0.69 0.30 0.00 0.00 0.00 |
| Seat Ibiza SX I    |   4.17  |  4.73 | -1.92  0.94 -0.06 -0.36  0.00 |  3.3  4.2  0.1  5.0  0.0 | 0.78 0.19 0.00 0.03 0.00 |
| Citroën AX Sport   |   4.17  |  8.78 | -2.60  1.29  0.47 -0.32 -0.15 |  6.1  7.9  3.0  4.0  1.2 | 0.77 0.19 0.02 0.01 0.00 |
| Renault 19         |   4.17  |  0.92 | -0.78 -0.16  0.48  0.20 -0.12 |  0.6  0.1  3.1  1.6  0.8 | 0.66 0.03 0.25 0.04 0.01 |
| Fiat Tipo          |   4.17  |  2.18 | -1.30 -0.43  0.43 -0.22 -0.10 |  1.5  0.9  2.5  2.0  0.6 | 0.77 0.09 0.08 0.02 0.00 |
| Peugeot 405        |   4.17  |  0.71 | -0.30 -0.46  0.21  0.58  0.16 |  0.1  1.0  0.6 13.1  1.4 | 0.12 0.30 0.06 0.47 0.04 |
| Renault 21         |   4.17  |  0.96 |  0.15 -0.64  0.01  0.67 -0.21 |  0.0  1.9  0.0 17.8  2.5 | 0.02 0.42 0.00 0.47 0.05 |
| Citroën BX         |   4.17  |  0.54 | -0.52 -0.20  0.17  0.40  0.04 |  0.2  0.2  0.4  6.2  0.1 | 0.50 0.07 0.06 0.29 0.00 |
| Opel Omega         |   4.17  |  3.25 |  1.45 -0.79  0.51  0.31  0.42 |  1.9  3.0  3.5  3.7 10.0 | 0.64 0.19 0.08 0.03 0.05 |
| Peugeot 405 Break  |   4.17  |  0.55 |  0.57  0.13  0.39  0.15  0.19 |  0.3  0.1  2.0  0.9  2.1 | 0.58 0.03 0.27 0.04 0.07 |
| Ford Sierra        |   4.17  |  0.82 |  0.70 -0.43  0.14  0.30  0.16 |  0.4  0.9  0.3  3.5  1.4 | 0.60 0.23 0.02 0.11 0.03 |
| Renault Espace     |   4.17  |  1.77 |  0.86 -0.87  0.20 -0.44  0.13 |  0.7  3.6  0.5  7.7  0.9 | 0.42 0.43 0.02 0.11 0.01 |
| Nissan Vanette     |   4.17  |  4.73 | -0.11 -1.69 -1.33 -0.05  0.24 |  0.0 13.6 24.4  0.1  3.3 | 0.00 0.61 0.38 0.00 0.01 |
| VW Caravelle       |   4.17  |  7.58 |  1.14 -2.39  0.21 -0.69 -0.06 |  1.2 27.1  0.6 18.7  0.2 | 0.17 0.75 0.01 0.06 0.00 |
| Audi 90 Quattro    |   4.17  |  3.43 |  1.39  1.10  0.19 -0.03  0.48 |  1.7  5.7  0.5  0.0 13.0 | 0.56 0.35 0.01 0.00 0.07 |
| BMW 530i           |   4.17  | 15.98 |  3.88  0.85 -0.35 -0.04 -0.30 | 13.6  3.4  1.7  0.1  5.1 | 0.94 0.04 0.01 0.00 0.01 |
| Rover 827i         |   4.17  | 10.52 |  3.15  0.75  0.13  0.05 -0.13 |  8.9  2.7  0.2  0.1  0.9 | 0.94 0.05 0.00 0.00 0.00 |
| Renault 25         |   4.17  | 12.39 |  3.39  0.57  0.71 -0.23  0.07 | 10.4  1.5  6.9  2.1  0.3 | 0.93 0.03 0.04 0.00 0.00 |
| BMW 325iX          |   4.17  |  8.92 |  2.20  1.17 -1.59 -0.24  0.32 |  4.4  6.5 34.6  2.3  6.0 | 0.54 0.15 0.28 0.01 0.01 |
| Ford Scorpio       |   4.17  |  8.28 |  2.74 -0.15 -0.19  0.13 -0.83 |  6.8  0.1  0.5  0.6 39.1 | 0.91 0.00 0.00 0.00 0.08 |
| Fiat Uno           |   4.17  | 14.29 | -3.73  0.03 -0.50  0.19  0.01 | 12.6  0.0  3.5  1.4  0.0 | 0.97 0.00 0.02 0.00 0.00 |
| Peugeot 205        |   4.17  |  7.70 | -2.60  0.46 -0.72  0.12 -0.39 |  6.1  1.0  7.1  0.6  8.4 | 0.88 0.03 0.07 0.00 0.02 |
| Ford Fiesta        |   4.17  | 12.99 | -3.49 -0.87 -0.13 -0.11 -0.03 | 11.0  3.6  0.2  0.5  0.1 | 0.94 0.06 0.00 0.00 0.00 |
+--------------------+---------+-------+-------------------------------+--------------------------+--------------------------+

DISTO: the squared distance between the case and the center of gravity of the overall
sample. It helps to identify the "average" cars, close to the center of gravity, and the
more specific ones, which are far from the center of gravity.
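A small sketch of how DISTO and the squared cosines fit together, using the Honda civic row of the listing (scores are rounded to two decimals, and only the five printed axes are summed, so the result matches DISTO up to rounding):

# Sketch: DISTO and squared cosines from the factor scores (Honda civic).
import numpy as np

scores = np.array([-2.01, 0.32, 0.50, -0.44, -0.10])  # axes 1 to 5
disto = (scores ** 2).sum()
print(round(disto, 2))                  # ~ 4.6, vs 4.59 in the listing
print((scores ** 2 / disto).round(2))   # ~ 0.88 0.02 0.05 0.04 0.00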


THE FACTORIAL GRAPH EDITOR


To access the factorial graph editor, click on the graph editor icon.

To create a new factorial graph, select Graph - New; the following window appears:

The preselection step allows you to select the different elements to display in the graph:
Active or supplementary cases
Active or supplementary variables

If you forget to select an element, you have to create a new graph and redo the
preselection.
THE TOOL BAR OF THE GRAPH EDITOR

The tool bar gives access to the following commands:
Points selection
Factors selection
Total unselection
Framing selection
Write the labels
Delete the labels
Set as ghost
Cancel the ghosts
Information on points
Correlation circle
Horizontal symmetric view
Vertical symmetric view
Refresh
SAVE A GRAPH
An internal save is dependent on the chain: if the chain is re-executed, or if the user
deletes the results of the chain, the internal saves are deleted too.
This type of save uses the commands Save and Save as - Internal save of the Graphics
menu. When you save in internal format, you give a TITLE to the saved graphic.
You can later reload this save with the command Open - Internal save of the Graphics
menu. The benefit of the internal format is that all the annotation functions and the
properties of the factorial planes remain available.

The save in archive format is a save which is independent of the chain.

This type of save is made with the command Save as - Save archive of the Graphics
menu. When saving in archive format, you give a NAME to the graphic, saved with the
mandatory extension .GFA.
You can later recover this save with the command Open - Save archive of the Graphics
menu.
Because this save is independent of the chain, some formattings are no longer available,
in particular the formatting of cases.

The editor for the factorial planes also lets you save the graphics in .BMP or .PCX format;
these images can then be inserted into a word-processor document.
The EMF Metafile format gives the best image quality.
This type of save is made with the command Save as - Screen Image BMP/PCX.


GENERAL PRINCIPLES
The construction of a graphic after an analysis follows these general principles:
Go to Graph - New in the Graphics menu, which opens the pre-selection dialogue box.
For a single analysis, you can open several graphics at once through the Graphics menu
and make different pre-selections. All the graphics you create can be saved in the internal
or the archive format.
To modify your graph, apply the following rule:
Select the points with the tool bar or the selection menu
Format them with the Format menu
Deselect to see the effect of the embellishments.

IMPORTANT
To manipulate (move, change, etc.) the labels and the texts on a graphic, enlarge the frame.
For this you have to be in standard mode, that is: no selection mode button is highlighted,
and the status bar is empty.


SCA - SIMPLE CORRESPONDENCE ANALYSIS


This procedure performs a simple correspondence analysis (SCA) on a contingency table
or a table with non-negative numbers.
Simple correspondence analysis is a powerful statistical tool for the graphical analysis of
contingency tables.
The result of a simple correspondence analysis is a two-dimensional graphical
representation of the association between rows and columns of the table.
The plot contains a point for each row and each column of the table. Rows with similar
patterns of counts produce points that are close together, and columns with similar
patterns of counts produce points that are close together.
Simple correspondence analysis analyzes a contingency table made up of one or more
column variables and one or more row variables.
To illustrate this method, consider the following dataset, a typical two-dimensional
contingency table. The data deal with the perception of different kinds of alcohol.

Like the taste


With friends
To relax oneself
Become expensive
Refreshing
Not elegant
Friendly product
Good before meals
Good during the day
Good during evening
For all year long
Liked by youngs
Good for guests
Oldy, not trendy
As well for men as for women
Close to me
By habits
Make snobish
We can mix it
For night life / bars / nightclubs

PASTIS WHISKY MARTINI


49
50
42
83
83
76
61
61
51
60
88
42
78
22
18
26
11
13
64
64
56
88
79
85
24
21
12
7
61
12
83
87
85
45
77
36
88
92
87
12
4
13
50
62
69
38
41
27
36
30
24
3
35
9
43
87
29
12
91
27

SUZE
18
60
32
41
19
17
34
64
10
11
79
16
60
38
43
11
16
8
32
16

Select the SPAD dataset ALCOOL.SBA and import it.

45

VODKA
25
69
38
75
17
13
45
45
13
53
83
65
70
5
49
16
19
28
82
84

GIN
23
68
39
70
19
11
42
46
12
50
82
69
67
6
51
18
19
25
80
81

MALIBU
25
69
39
61
14
13
46
37
13
48
80
76
67
8
61
17
17
21
43
72

BEER
59
74
72
19
80
29
68
41
85
54
90
89
81
7
60
49
40
4
40
67


The SETTING OPTIONS


THE COLUMNS TAB

Active frequencies: all

THE ROWS TAB


This tab is exactly similar to the Cases tab available for the descriptive statistics
methods.


THE PARAMETERS TAB

In order to display the rows results in Excel sheets, click on the Options button and
select Yes.


THE SCA RESULTS


SIMPLE CORRESPONDENCE ANALYSIS
EIGENVALUES
COMPUTATIONS PRECISION SUMMARY :
TRACE BEFORE DIAGONALISATION.. 0.1345
SUM OF EIGENVALUES............ 0.1345

HISTOGRAM OF THE FIRST 7 EIGENVALUES

+--------+------------+------------+------------+----------------------------------------------------------------------------------+
| NUMBER | EIGENVALUE | PERCENTAGE | CUMULATED  |                                                                                  |
|        |            |            | PERCENTAGE |                                                                                  |
+--------+------------+------------+------------+----------------------------------------------------------------------------------+
|    1   |   0.0664   |   49.37    |    49.37   | ******************************************************************************** |
|    2   |   0.0449   |   33.34    |    82.72   | *******************************************************                          |
|    3   |   0.0124   |    9.24    |    91.96   | ***************                                                                  |
|    4   |   0.0069   |    5.14    |    97.09   | *********                                                                        |
|    5   |   0.0029   |    2.18    |    99.27   | ****                                                                             |
|    6   |   0.0008   |    0.63    |    99.90   | **                                                                               |
|    7   |   0.0001   |    0.10    |   100.00   | *                                                                                |
+--------+------------+------------+------------+----------------------------------------------------------------------------------+
COORDINATES, CONTRIBUTIONS OF FREQUENCIES ON AXES 1 TO 5
ACTIVE FREQUENCIES
+---------------------------------+-------------------------------+--------------------------+--------------------------+
|          FREQUENCIES            |         COORDINATES           |      CONTRIBUTIONS       |     SQUARED COSINES      |
|---------------------------------+-------------------------------+--------------------------+--------------------------|
| IDEN - LABEL     REL.WT   DISTO |     1     2     3     4     5 |    1    2    3    4    5 |    1    2    3    4    5 |
+---------------------------------+-------------------------------+--------------------------+--------------------------+
| PAST - PASTIS     13.12    0.17 | -0.36 -0.05  0.16  0.11 -0.04 | 26.3  0.6 26.5 23.5  8.3 | 0.76 0.01 0.14 0.07 0.01 |
| WHIS - WHISKY     15.83    0.05 |  0.19  0.02  0.09 -0.02  0.09 |  8.4  0.1  9.7  0.6 39.5 | 0.67 0.01 0.15 0.00 0.14 |
| MART - MARTINI    11.23    0.11 | -0.17 -0.21  0.09 -0.17  0.00 |  4.9 10.5  7.2 49.7  0.0 | 0.26 0.38 0.07 0.28 0.00 |
| SUZE - SUZE        8.63    0.30 | -0.22 -0.43 -0.24  0.05  0.04 |  6.3 35.6 40.7  3.2  3.9 | 0.16 0.62 0.20 0.01 0.00 |
| VODK - VODKA      12.35    0.10 |  0.30  0.00 -0.01  0.06  0.00 | 16.8  0.0  0.0  7.2  0.0 | 0.94 0.00 0.00 0.04 0.00 |
| GIN  - GIN        12.13    0.08 |  0.28  0.00 -0.01  0.06 -0.01 | 14.3  0.0  0.1  5.9  0.7 | 0.94 0.00 0.00 0.04 0.00 |
| MALI - MALIBU     11.42    0.07 |  0.21  0.02 -0.06 -0.07 -0.11 |  7.9  0.1  3.0  8.7 45.9 | 0.67 0.00 0.05 0.08 0.17 |
| BIER - BEER       15.30    0.23 | -0.26  0.39 -0.10 -0.02  0.02 | 15.2 53.1 12.7  1.1  1.7 | 0.28 0.67 0.04 0.00 0.00 |
+---------------------------------+-------------------------------+--------------------------+--------------------------+

COORDINATES, CONTRIBUTIONS AND SQUARED COSINES OF CASES
AXES 1 TO 5
+----------------------------------------+-------------------------------+--------------------------+--------------------------+
|                CASES                   |         COORDINATES           |      CONTRIBUTIONS       |     SQUARED COSINES      |
|----------------------------------------+-------------------------------+--------------------------+--------------------------|
| IDENTIFIER              REL.WT.  DISTO |     1     2     3     4     5 |    1    2    3    4    5 |    1    2    3    4    5 |
+----------------------------------------+-------------------------------+--------------------------+--------------------------+
| Like the taste             4.02   0.08 | -0.21  0.10  0.12 -0.08  0.06 |  2.6  0.9  4.4  3.9  5.2 | 0.55 0.13 0.18 0.09 0.05 |
| With friends               8.04   0.01 | -0.04 -0.10  0.00 -0.01 -0.04 |  0.1  1.9  0.0  0.2  3.9 | 0.09 0.79 0.00 0.01 0.10 |
| To relax oneself           5.43   0.03 | -0.14  0.04  0.04 -0.04  0.02 |  1.6  0.2  0.7  1.1  0.7 | 0.79 0.07 0.06 0.06 0.02 |
| Become expensive           6.30   0.12 |  0.25 -0.19  0.09  0.11 -0.03 |  5.7  5.0  4.2 10.1  1.8 | 0.51 0.30 0.07 0.09 0.01 |
| Refreshing                 3.69   0.48 | -0.56  0.30  0.07  0.25 -0.07 | 17.5  7.3  1.5 33.0  6.9 | 0.66 0.19 0.01 0.13 0.01 |
| Not elegant                1.84   0.14 | -0.32  0.03 -0.12  0.11 -0.08 |  2.9  0.0  2.0  3.0  3.9 | 0.76 0.01 0.10 0.08 0.05 |
| Friendly product           5.79   0.01 | -0.10  0.00  0.05 -0.04 -0.01 |  0.8  0.0  1.2  1.6  0.2 | 0.67 0.00 0.18 0.13 0.01 |
| Good before meals          6.70   0.14 | -0.18 -0.30  0.11 -0.03  0.06 |  3.1 13.0  6.7  0.8  8.5 | 0.23 0.64 0.09 0.01 0.03 |
| Good during the day        2.62   0.69 | -0.43  0.66 -0.25 -0.04  0.11 |  7.2 25.1 13.0  0.5 10.6 | 0.26 0.63 0.09 0.00 0.02 |
| Good during evening        4.09   0.25 |  0.40  0.26 -0.12 -0.01  0.03 | 10.0  6.0  5.1  0.0  0.9 | 0.66 0.27 0.06 0.00 0.00 |
| For all year long          9.24   0.02 | -0.02 -0.11 -0.08 -0.01 -0.03 |  0.1  2.7  4.2  0.3  3.7 | 0.02 0.60 0.26 0.01 0.05 |
| Liked by youngs            6.53   0.09 |  0.17  0.22 -0.02 -0.03 -0.09 |  2.8  7.0  0.2  0.7 17.5 | 0.33 0.55 0.01 0.01 0.09 |
| Good for guests            8.45   0.02 | -0.06 -0.10  0.03 -0.04 -0.01 |  0.5  1.7  0.7  2.2  0.2 | 0.23 0.57 0.07 0.11 0.00 |
| Oldy, not trendy           1.28   1.41 | -0.46 -0.84 -0.68  0.11  0.08 |  4.1 20.2 47.5  2.3  2.9 | 0.15 0.50 0.33 0.01 0.00 |
| As well for men as for w   6.15   0.03 | -0.01 -0.09 -0.02 -0.14 -0.06 |  0.0  1.2  0.3 16.3  6.6 | 0.00 0.28 0.02 0.59 0.10 |
| Close to me                3.00   0.11 | -0.22  0.19  0.13 -0.05  0.10 |  2.2  2.3  4.1  1.0  9.6 | 0.42 0.30 0.15 0.02 0.08 |
| By habits                  2.78   0.05 | -0.21  0.08  0.06  0.02  0.03 |  1.8  0.4  0.8  0.2  0.6 | 0.80 0.11 0.06 0.01 0.01 |
| Make snobish               1.84   0.40 |  0.61 -0.09  0.03  0.02  0.09 | 10.4  0.4  0.1  0.1  4.6 | 0.95 0.02 0.00 0.00 0.02 |
| We can mix it              6.02   0.13 |  0.31 -0.03  0.03  0.16  0.07 |  8.6  0.1  0.5 22.3 11.4 | 0.72 0.01 0.01 0.19 0.04 |
| For night life / bars /    6.21   0.23 |  0.44  0.18 -0.07 -0.02  0.01 | 17.9  4.4  2.7  0.3  0.1 | 0.84 0.14 0.02 0.00 0.00 |
+----------------------------------------+-------------------------------+--------------------------+--------------------------+


The following graph has been designed with the SPAD AMADO procedure.
Using the SCA results, rows and columns are ranked by decreasing first factor
coordinates: it gives a visual structure to the table. The width of a column is proportional
to its frequency.

[AMADO graph of the alcohol table, not reproduced here.]


MCA - MULTIPLE CORRESPONDENCE ANALYSIS


Multiple correspondence analysis extends the simple correspondence analysis
properties to n-way tables.
The procedure requires more than two active categorical variables, observed on a set of cases.
As with the other factorial analyses, it is possible to add supplementary
elements such as illustrative cases and illustrative continuous or categorical variables.
We will perform the MCA on the ASPI1000.SBA dataset.
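
Like the SCA sketch earlier, MCA can be reproduced outside SPAD as the correspondence analysis of the complete disjunctive (indicator) table. The sketch below is an illustration, not SPAD's code; `df` stands for a hypothetical pandas DataFrame holding the 7 active questions, one row per case.

```python
import numpy as np
import pandas as pd

def mca_eigenvalues(df: pd.DataFrame) -> np.ndarray:
    """MCA as the CA of the complete disjunctive (indicator) table."""
    Z = pd.get_dummies(df).to_numpy(dtype=float)   # complete disjunctive table
    P = Z / Z.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    sv = np.linalg.svd(S, compute_uv=False)
    return sv ** 2
```

With Q questions and K categories, the eigenvalues sum to (K - Q) / Q. The trace reported in the results below, 2.8571, equals (27 - 7) / 7, which is consistent with one of the 28 categories (emp4) being randomly reassigned during the cleaning step.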

VARIABLES DESCRIPTION OF THE ASPI1000.SBA DATASET


ACTIVE CATEGORICAL VARIABLES - 7 VARIABLES - 28 CATEGORIES
11 . Gender                                            ( 2 categories )
29 . Do you own securities ?                           ( 2 categories )
39 . Urban area size (number of inhabitants)           ( 5 categories )
49 . Job category                                      ( 5 categories )
51 . Diploma in 5 categories                           ( 5 categories )
52 . Occupation status of housing in 4 categories      ( 4 categories )
53 . Age in 5 categories                               ( 5 categories )

SUPPLEMENTARY CATEGORICAL VARIABLES - 35 VARIABLES - 152 CATEGORIES


All available categorical variables
SUPPLEMENTARY CONTINUOUS VARIABLES - 8 VARIABLES
All available continuous variables


The SETTING OPTIONS


THE VARIABLES TAB


THE PARAMETERS TAB

By default, the cases coordinates are not displayed.

c Random assignment of active categories inferior to (in %)

To ensure the robustness of the analysis, it may be useful to take into account, for the
definition of the axes, only the categories of sufficient weight.
For each question, the cases that belong to a category of weak total weight will be
assigned at random to one of the other categories of the same question that has a
sufficient weight. This cleaning operation allows the data table to keep its completely
disjunctive property (see the sketch below).
The parameter PCMIN fixes the percentage of the total weight of the active cases
below which a category is considered too weak. If all the cases have the weight 1,
PCMIN is the percentage of the number of active cases below which a category will be
broken down.
If all the categories of a question (or all except one) have too weak a weight, the
question itself is made illustrative for the calculation of the axes.
The default value (2%) is suitable for most analyses. If the parameter is set to 0.0,
only the categories with a null weight are eliminated.
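
As an illustration only, the sketch below mimics this cleaning rule in Python. The reassignment probabilities (proportional to the weights of the remaining categories) are an assumption; SPAD's actual drawing rule is not documented here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def clean_question(col: pd.Series, weights: pd.Series,
                   pcmin: float = 2.0) -> pd.Series:
    """Reassign cases of too-weak categories to the remaining categories."""
    w = weights.groupby(col).sum()                       # weight per category
    weak = w[w < pcmin / 100 * weights.sum()].index      # below the PCMIN threshold
    strong = w.drop(weak)
    if len(strong) <= 1:
        return col        # SPAD would make the whole question illustrative
    out = col.copy()
    mask = col.isin(weak)
    out[mask] = rng.choice(strong.index, size=mask.sum(),
                           p=strong / strong.sum())      # assumed drawing rule
    return out
```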

d Retained coordinates
The number of retained coordinates is useful for the methods that follow the MCA
in the chain. These methods can be DEFAC (factors description) and RECIP/SEMIS
(clustering).

52

Factorial Analyses with SPAD

THE MCA RESULTS

MULTIPLE CORRESPONDENCE ANALYSIS

ELIMINATION OF ACTIVE CATEGORIES WITH SMALL WEIGHTS
THRESHOLD (PCMIN) :  2.00 %     WEIGHT :   20.00
BEFORE CLEANING   :  7 ACTIVE QUESTIONS     28 ASSOCIATE CATEGORIES
AFTER CLEANING    :  7 ACTIVE QUESTIONS     28 ASSOCIATE CATEGORIES
TOTAL WEIGHT OF ACTIVE CASES : 1000.00

MARGINAL DISTRIBUTIONS OF ACTIVE QUESTIONS

----------------------------+-----------------+-----------------+--------------------------------------------------
  CATEGORIES                | BEFORE CLEANING |  AFTER CLEANING |
  IDENT - LABEL             |  COUNT   WEIGHT |  COUNT   WEIGHT |  HISTOGRAM OF RELATIVE WEIGHTS
----------------------------+-----------------+-----------------+--------------------------------------------------
 11 . Gender
  masc - male               |   469    469.00 |   469    469.00 | *****************************
  fémi - female             |   531    531.00 |   531    531.00 | ********************************
----------------------------+-----------------+-----------------+--------------------------------------------------
 29 . Do you own some securities ?
  vmo1 - Yes                |   121    121.00 |   121    121.00 | ********
  vmo2 - No                 |   879    879.00 |   879    879.00 | *************************************************
----------------------------+-----------------+-----------------+--------------------------------------------------
 39 . Urban area size (number of inhabitants)
  agg1 - Lower than 2.000   |    83     83.00 |    83     83.00 | *****
  agg2 - 2.000 - 20.000     |    87     87.00 |    87     87.00 | ******
  agg3 - 20.000 - 100.000   |   175    175.00 |   175    175.00 | ***********
  agg4 - greater than 100.000 | 329    329.00 |   329    329.00 | ********************
  agg5 - Paris              |   326    326.00 |   326    326.00 | ********************
----------------------------+-----------------+-----------------+--------------------------------------------------
 49 . Job category
  emp1 - Worker             |   263    263.00 |   263    263.00 | ****************
  emp2 - Employee           |   335    335.00 |   335    335.00 | *********************
  emp3 - Manager            |   229    229.00 |   229    229.00 | **************
  emp4 - Other              |    48     48.00 |    48     48.00 | ==RAND.ASSIGN.==
  49_  - missing category   |   125    125.00 |   125    125.00 | ********
----------------------------+-----------------+-----------------+--------------------------------------------------
 51 . Diploma in 5 categories
  die1 - No one             |   189    189.00 |   189    189.00 | ************
  die2 - CEP                |   321    321.00 |   321    321.00 | ********************
  die3 - BEPC-BE-BEPS       |   158    158.00 |   158    158.00 | **********
  die4 - Bac - Brevet sup.  |   182    182.00 |   182    182.00 | ***********
  die5 - University         |   150    150.00 |   150    150.00 | **********
----------------------------+-----------------+-----------------+--------------------------------------------------
 52 . Occupation status of housing in 4 categories
  slo1 - homeowner          |   120    120.00 |   120    120.00 | ********
  slo2 - owner              |   290    290.00 |   290    290.00 | ******************
  slo3 - tenant             |   523    523.00 |   523    523.00 | ********************************
  slo4 - free housing, other |   67     67.00 |    67     67.00 | *****
----------------------------+-----------------+-----------------+--------------------------------------------------
 53 . Age in 5 categories
  agc1 - Lower than 25 yo   |   150    150.00 |   150    150.00 | **********
  agc2 - 25 to 34 yo        |   284    284.00 |   284    284.00 | ******************
  agc3 - 35 to 49 yo        |   209    209.00 |   209    209.00 | *************
  agc4 - 50 to 64 yo        |   188    188.00 |   188    188.00 | ************
  agc5 - 65 yo and more     |   169    169.00 |   169    169.00 | ***********
----------------------------+-----------------+-----------------+--------------------------------------------------


EIGENVALUES
COMPUTATIONS PRECISION SUMMARY :
TRACE BEFORE DIAGONALISATION.. 2.8571
SUM OF EIGENVALUES............ 2.8571

HISTOGRAM OF THE FIRST 20 EIGENVALUES

+--------+------------+------------+------------+----------------------------------------------------------------------------------+
| NUMBER | EIGENVALUE | PERCENTAGE | CUMULATED  |                                                                                  |
|        |            |            | PERCENTAGE |                                                                                  |
+--------+------------+------------+------------+----------------------------------------------------------------------------------+
|    1   |   0.2703   |    9.46    |     9.46   | ******************************************************************************** |
|    2   |   0.2369   |    8.29    |    17.75   | ***********************************************************************          |
|    3   |   0.2084   |    7.29    |    25.05   | **************************************************************                   |
|    4   |   0.1922   |    6.73    |    31.77   | *********************************************************                        |
|    5   |   0.1846   |    6.46    |    38.23   | *******************************************************                          |
|    6   |   0.1578   |    5.52    |    43.76   | ***********************************************                                  |
|    7   |   0.1534   |    5.37    |    49.13   | **********************************************                                   |
|    8   |   0.1493   |    5.23    |    54.35   | *********************************************                                    |
|    9   |   0.1441   |    5.04    |    59.40   | *******************************************                                      |
|   10   |   0.1398   |    4.89    |    64.29   | ******************************************                                       |
|   11   |   0.1326   |    4.64    |    68.93   | ****************************************                                         |
|   12   |   0.1300   |    4.55    |    73.48   | ***************************************                                          |
|   13   |   0.1284   |    4.49    |    77.97   | **************************************                                           |
|   14   |   0.1222   |    4.28    |    82.25   | *************************************                                            |
|   15   |   0.1070   |    3.74    |    86.00   | ********************************                                                 |
|   16   |   0.1015   |    3.55    |    89.55   | *******************************                                                  |
|   17   |   0.0954   |    3.34    |    92.89   | *****************************                                                    |
|   18   |   0.0821   |    2.87    |    95.76   | *************************                                                        |
|   19   |   0.0748   |    2.62    |    98.38   | ***********************                                                          |
|   20   |   0.0462   |    1.62    |   100.00   | **************                                                                   |
+--------+------------+------------+------------+----------------------------------------------------------------------------------+

RESEARCH OF IRREGULARITIES (THIRD DIFFERENCES)

+--------------+--------------+------------------------------------------------------+
| IRREGULARITY | IRREGULARITY |                                                      |
|   BETWEEN    |    VALUE     |                                                      |
+--------------+--------------+------------------------------------------------------+
|    5 --  6   |    -27.77    | **************************************************** |
|   14 -- 15   |    -10.42    | ********************                                 |
|   17 -- 18   |     -6.67    | *************                                        |
|   13 -- 14   |     -5.44    | ***********                                          |
|   10 -- 11   |     -3.77    | ********                                             |
|    2 --  3   |     -3.66    | *******                                              |
|    8 --  9   |     -1.53    | ***                                                  |
+--------------+--------------+------------------------------------------------------+

RESEARCH OF IRREGULARITIES (SECOND DIFFERENCES)

+--------------+--------------+------------------------------------------------------+
| IRREGULARITY | IRREGULARITY |                                                      |
|   BETWEEN    |    VALUE     |                                                      |
+--------------+--------------+------------------------------------------------------+
|    5 --  6   |     22.31    | **************************************************** |
|    2 --  3   |     12.28    | *****************************                        |
|   14 -- 15   |      9.83    | ***********************                              |
|    3 --  4   |      8.62    | *********************                                |
|    1 --  2   |      4.94    | ************                                         |
|   10 -- 11   |      4.67    | ***********                                          |
|   11 -- 12   |      0.90    | ***                                                  |
|    8 --  9   |      0.81    | **                                                   |
|    6 --  7   |      0.40    | *                                                    |
+--------------+--------------+------------------------------------------------------+

Irregularity (2nd diff.) between 5 and 6 = [ (λ7 - λ6) - (λ6 - λ5) ] * 1000

The two tables above are the equivalent of the scree test (or Cattell test).
This procedure detects the main irregularities in the decrease of the eigenvalues and
ranks them by decreasing importance.
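
The second differences can be recomputed from the eigenvalue listing above; the sketch below is an illustration using the 20 MCA eigenvalues (the third-difference table is obtained analogously from differences of one order higher).

```python
import numpy as np

# The 20 MCA eigenvalues from the histogram above.
lam = np.array([0.2703, 0.2369, 0.2084, 0.1922, 0.1846, 0.1578, 0.1534,
                0.1493, 0.1441, 0.1398, 0.1326, 0.1300, 0.1284, 0.1222,
                0.1070, 0.1015, 0.0954, 0.0821, 0.0748, 0.0462])

# second[i] = [(lam[i+2] - lam[i+1]) - (lam[i+1] - lam[i])] * 1000,
# i.e. the irregularity between eigenvalues i+1 and i+2 (1-based).
second = np.diff(lam, 2) * 1000
for i in np.argsort(-second):          # rank by decreasing importance
    print(f"{i + 1} -- {i + 2} : {second[i]:+.1f}")
```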

54

Factorial Analyses with SPAD


LOADINGS, CONTRIBUTIONS AND SQUARED COSINES OF ACTIVE CATEGORIES
AXES 1 TO 5
+------------------------------------------+-------------------------------+--------------------------+--------------------------+
|               CATEGORIES                 |           LOADINGS            |      CONTRIBUTIONS       |     SQUARED COSINES      |
|------------------------------------------+-------------------------------+--------------------------+--------------------------|
| IDEN - LABEL            REL. WT.   DISTO |     1     2     3     4     5 |    1    2    3    4    5 |    1    2    3    4    5 |
+------------------------------------------+-------------------------------+--------------------------+--------------------------+
| 11 . Gender                              |
| masc - male                 6.70    1.13 | -0.29  0.08  0.43 -0.47 -0.25 |  2.1  0.2  6.0  7.6  2.3 | 0.07 0.01 0.16 0.19 0.06 |
| fémi - female               7.59    0.88 |  0.26 -0.07 -0.38  0.41  0.22 |  1.8  0.2  5.3  6.7  2.0 | 0.07 0.01 0.16 0.19 0.06 |
+------------------- CUMULATED CONTRIBUTION =                                 3.9  0.3 11.2 14.4  4.3 +--------------------------+
| 29 . Do you own some securities ?        |
| vmo1 - Yes                  1.73    7.26 |  0.69  1.46 -0.25 -0.23  0.06 |  3.1 15.5  0.5  0.5  0.0 | 0.07 0.29 0.01 0.01 0.00 |
| vmo2 - No                  12.56    0.14 | -0.10 -0.20  0.03  0.03 -0.01 |  0.4  2.1  0.1  0.1  0.0 | 0.07 0.29 0.01 0.01 0.00 |
+------------------- CUMULATED CONTRIBUTION =                                 3.5 17.6  0.6  0.6  0.0 +--------------------------+
| 39 . Urban area size (number of inhabitants) |
| agg1 - Lower than 2.000     1.19   11.05 | -1.06  0.83 -1.06  0.75 -0.06 |  5.0  3.4  6.4  3.5  0.0 | 0.10 0.06 0.10 0.05 0.00 |
| agg2 - 2.000 - 20.000       1.24   10.49 | -0.55  0.26  0.28  0.80 -0.61 |  1.4  0.3  0.5  4.2  2.5 | 0.03 0.01 0.01 0.06 0.04 |
| agg3 - 20.000 - 100.000     2.50    4.71 | -0.27  0.07 -0.17  0.07 -0.12 |  0.7  0.1  0.3  0.1  0.2 | 0.02 0.00 0.01 0.00 0.00 |
| agg4 - greater than 100.000 4.70    2.04 | -0.04 -0.40  0.05 -0.22 -0.27 |  0.0  3.2  0.0  1.2  1.9 | 0.00 0.08 0.00 0.02 0.04 |
| agg5 - Paris                4.66    2.07 |  0.60  0.08  0.24 -0.22  0.52 |  6.2  0.1  1.3  1.2  6.7 | 0.18 0.00 0.03 0.02 0.13 |
+------------------- CUMULATED CONTRIBUTION =                                13.3  7.1  8.5 10.1 11.3 +--------------------------+
| 49 . Job category                        |
| emp1 - Worker               3.94    2.62 | -0.88 -0.47  0.54 -0.66 -0.20 | 11.2  3.6  5.6  8.9  0.8 | 0.29 0.08 0.11 0.17 0.01 |
| emp2 - Employee             4.91    1.91 | -0.19 -0.20 -0.38  0.67  0.63 |  0.6  0.8  3.5 11.4 10.5 | 0.02 0.02 0.08 0.23 0.21 |
| emp3 - Manager              3.44    3.15 |  0.80  0.89  0.74  0.02 -0.14 |  8.2 11.4  9.0  0.0  0.4 | 0.21 0.25 0.17 0.00 0.01 |
| 49_  - missing category     1.99    6.19 |  0.80 -0.12 -1.41 -0.38 -0.91 |  4.7  0.1 18.9  1.5  9.0 | 0.10 0.00 0.32 0.02 0.13 |
+------------------- CUMULATED CONTRIBUTION =                                24.8 16.0 36.9 21.8 20.6 +--------------------------+
| 51 . Diploma in 5 categories             |
| die1 - No one               2.70    4.29 | -0.70 -0.23 -0.23 -0.93  0.34 |  5.0  0.6  0.7 12.1  1.7 | 0.12 0.01 0.01 0.20 0.03 |
| die2 - CEP                  4.59    2.12 | -0.80  0.08  0.05  0.29 -0.07 | 10.9  0.1  0.1  2.0  0.1 | 0.30 0.00 0.00 0.04 0.00 |
| die3 - BEPC-BE-BEPS         2.26    5.33 |  0.23 -0.62 -0.17  0.47  0.56 |  0.4  3.7  0.3  2.6  3.8 | 0.01 0.07 0.01 0.04 0.06 |
| die4 - Bac - Brevet sup.    2.60    4.49 |  0.93 -0.06 -0.32  0.26 -0.95 |  8.3  0.0  1.3  0.9 12.6 | 0.19 0.00 0.02 0.01 0.20 |
| die5 - University           2.14    5.67 |  1.23  0.84  0.73 -0.26  0.27 | 12.1  6.4  5.5  0.8  0.8 | 0.27 0.13 0.10 0.01 0.01 |
+------------------- CUMULATED CONTRIBUTION =                                36.6 10.9  7.9 18.4 19.2 +--------------------------+
| 52 . Occupation status of housing in 4 categories |
| slo1 - homeowner            1.71    7.33 | -0.31 -0.06  0.85  1.02 -1.30 |  0.6  0.0  5.9  9.2 15.7 | 0.01 0.00 0.10 0.14 0.23 |
| slo2 - owner                4.14    2.45 | -0.44  1.00 -0.51 -0.07 -0.01 |  3.0 17.6  5.2  0.1  0.0 | 0.08 0.41 0.11 0.00 0.00 |
| slo3 - tenant               7.47    0.91 |  0.27 -0.51  0.15 -0.15  0.33 |  2.0  8.2  0.8  0.9  4.4 | 0.08 0.28 0.03 0.03 0.12 |
| slo4 - free housing, other  0.96   13.93 |  0.34 -0.25 -0.50 -0.33 -0.20 |  0.4  0.3  1.2  0.6  0.2 | 0.01 0.00 0.02 0.01 0.00 |
+------------------- CUMULATED CONTRIBUTION =                                 6.0 26.0 13.0 10.8 20.3 +--------------------------+
| 53 . Age in 5 categories                 |
| agc1 - Lower than 25 yo     2.14    5.67 |  0.81 -0.98 -0.89 -0.68 -0.80 |  5.2  8.7  8.2  5.2  7.4 | 0.12 0.17 0.14 0.08 0.11 |
| agc2 - 25 to 34 yo          4.06    2.52 |  0.35 -0.45  0.63  0.47  0.41 |  1.9  3.4  7.8  4.8  3.7 | 0.05 0.08 0.16 0.09 0.07 |
| agc3 - 35 to 49 yo          2.99    3.78 | -0.33  0.36  0.41  0.41 -0.69 |  1.2  1.6  2.5  2.6  7.6 | 0.03 0.03 0.05 0.04 0.12 |
| agc4 - 50 to 64 yo          2.69    4.32 | -0.51  0.30 -0.42  0.21  0.25 |  2.6  1.0  2.3  0.6  0.9 | 0.06 0.02 0.04 0.01 0.01 |
| agc5 - 65 yo and more       2.41    4.92 | -0.34  0.84 -0.32 -0.93  0.59 |  1.0  7.2  1.2 10.8  4.6 | 0.02 0.14 0.02 0.17 0.07 |
+------------------- CUMULATED CONTRIBUTION =                                11.8 22.0 21.9 23.9 24.2 +--------------------------+

P.REL : the relative weight of the category.
P.REL = ( nq * 100 ) / ( n * Q ) where nq is the weight of the category, n the overall weight
and Q the number of active questions.
For example, for the male category, P.REL = ( 469 * 100 ) / ( 1000 * 7 ) = 6.70.
DISTO : the squared distance between the category and the center of gravity. This criterion
only depends on the weight of the category:
d²(j,G) = ( n / nj ) - 1 where nj is the weight of the category j and n the overall weight.
For example, for the male category, d² = 1000 / 469 - 1 = 1.13.
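
Both quantities are easy to check by hand, for example for the male category:

```python
# P.REL and DISTO for the 'male' category of the ASPI1000 MCA
# (n = 1000 active cases, Q = 7 active questions, nq = 469).
n, Q, nq = 1000, 7, 469
p_rel = nq * 100 / (n * Q)    # 6.70
disto = n / nq - 1            # 1.13, matching the table above
print(f"P.REL = {p_rel:.2f}  DISTO = {disto:.2f}")
```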
LOADINGS AND TEST-VALUES OF CATEGORIES
AXES 1 TO 5
+---------------------------------------------+-------------------------------+-------------------------------+----------+
|             CATEGORIES                      |          TEST-VALUES          |           LOADINGS            |          |
|---------------------------------------------+-------------------------------+-------------------------------+----------|
| IDEN - LABEL                 COUNT   ABS.WT |     1     2     3     4     5 |     1     2     3     4     5 |  DISTO.  |
+---------------------------------------------+-------------------------------+-------------------------------+----------+
| 11 . Gender                                 |
| masc - male                    469   469.00 |  -8.6   2.3  12.8 -13.9  -7.5 | -0.29  0.08  0.43 -0.47 -0.25 |    1.13  |
| fémi - female                  531   531.00 |   8.6  -2.3 -12.8  13.9   7.5 |  0.26 -0.07 -0.38  0.41  0.22 |    0.88  |
| 29 . Do you own some securities ?           |
| vmo1 - Yes                     121   121.00 |   8.1  17.1  -2.9  -2.7   0.7 |  0.69  1.46 -0.25 -0.23  0.06 |    7.26  |
| vmo2 - No                      879   879.00 |  -8.1 -17.1   2.9   2.7  -0.7 | -0.10 -0.20  0.03  0.03 -0.01 |    0.14  |
| 39 . Urban area size (number of inhabitants) |
| agg1 - Lower than 2.000         83    83.00 | -10.1   7.9 -10.1   7.2  -0.6 | -1.06  0.83 -1.06  0.75 -0.06 |   11.05  |
| agg2 - 2.000 - 20.000           87    87.00 |  -5.4   2.5   2.7   7.8  -5.9 | -0.55  0.26  0.28  0.80 -0.61 |   10.49  |
| agg3 - 20.000 - 100.000        175   175.00 |  -3.9   1.1  -2.4   1.0  -1.7 | -0.27  0.07 -0.17  0.07 -0.12 |    4.71  |
| agg4 - greater than 100.000    329   329.00 |  -0.9  -8.8   1.0  -4.8  -6.0 | -0.04 -0.40  0.05 -0.22 -0.27 |    2.04  |
| agg5 - Paris                   326   326.00 |  13.2   1.8   5.2  -4.9  11.3 |  0.60  0.08  0.24 -0.22  0.52 |    2.07  |
| 49 . Job category                           |
| emp1 - Worker                  263   263.00 | -16.1  -9.7  10.7 -12.6  -3.5 | -0.86 -0.51  0.57 -0.67 -0.18 |    2.80  |
| emp2 - Employee                335   335.00 |  -3.6  -5.0  -8.5  15.2  14.2 | -0.16 -0.22 -0.38  0.68  0.63 |    1.99  |
| emp3 - Manager                 229   229.00 |  14.6  14.9  13.2   0.2  -2.1 |  0.85  0.86  0.77  0.01 -0.12 |    3.37  |
| emp4 - Other                    48    48.00 |  -5.2   5.3  -3.5  -0.2  -3.3 | -0.73  0.75 -0.50 -0.03 -0.47 |   19.83  |
| 49_  - missing category        125   125.00 |  11.4  -2.4 -16.6  -5.0 -10.9 |  0.96 -0.20 -1.39 -0.42 -0.91 |    7.00  |
| 51 . Diploma in 5 categories                |
| die1 - No one                  189   189.00 | -10.8  -3.5  -3.5 -14.2   5.3 | -0.70 -0.23 -0.23 -0.93  0.34 |    4.29  |
| die2 - CEP                     321   321.00 | -17.4   1.8   1.2   6.3  -1.5 | -0.80  0.08  0.05  0.29 -0.07 |    2.12  |
| die3 - BEPC-BE-BEPS            158   158.00 |   3.1  -8.5  -2.3   6.5   7.7 |  0.23 -0.62 -0.17  0.47  0.56 |    5.33  |
| die4 - Bac - Brevet sup.       182   182.00 |  13.9  -0.9  -4.8   3.8 -14.1 |  0.93 -0.06 -0.32  0.26 -0.95 |    4.49  |
| die5 - University              150   150.00 |  16.4  11.2   9.7  -3.5   3.6 |  1.23  0.84  0.73 -0.26  0.27 |    5.67  |
| 52 . Occupation status of housing in 4 categories |
| slo1 - homeowner               120   120.00 |  -3.6  -0.7   9.9  11.9 -15.2 | -0.31 -0.06  0.85  1.02 -1.30 |    7.33  |
| slo2 - owner                   290   290.00 |  -8.9  20.2 -10.3  -1.4  -0.2 | -0.44  1.00 -0.51 -0.07 -0.01 |    2.45  |
| slo3 - tenant                  523   523.00 |   9.0 -16.9   5.1  -5.0  10.9 |  0.27 -0.51  0.15 -0.15  0.33 |    0.91  |
| slo4 - free housing, other      67    67.00 |   2.8  -2.1  -4.3  -2.8  -1.7 |  0.34 -0.25 -0.50 -0.33 -0.20 |   13.93  |
| 53 . Age in 5 categories                    |
| agc1 - Lower than 25 yo        150   150.00 |  10.7 -13.0 -11.8  -9.1 -10.6 |  0.81 -0.98 -0.89 -0.68 -0.80 |    5.67  |
| agc2 - 25 to 34 yo             284   284.00 |   7.0  -8.9  12.6   9.5   8.1 |  0.35 -0.45  0.63  0.47  0.41 |    2.52  |
| agc3 - 35 to 49 yo             209   209.00 |  -5.3   5.9   6.7   6.6 -11.2 | -0.33  0.36  0.41  0.41 -0.69 |    3.78  |
| agc4 - 50 to 64 yo             188   188.00 |  -7.7   4.6  -6.4   3.1   3.9 | -0.51  0.30 -0.42  0.21  0.25 |    4.32  |
| agc5 - 65 yo and more          169   169.00 |  -4.8  12.0  -4.5 -13.2   8.4 | -0.34  0.84 -0.32 -0.93  0.59 |    4.92  |
| 1 . The family is the only place where you feel well |
| fbi1 - Yes                     561   561.00 | -14.5   4.4  -3.6   0.6   0.5 | -0.40  0.12 -0.10  0.02  0.02 |    0.78  |
| fbi2 - No                      431   431.00 |  14.6  -4.5   3.6  -0.4  -0.8 |  0.53 -0.16  0.13 -0.02 -0.03 |    1.32  |
| 1_   - missing category          8     8.00 |  -0.3   0.6  -0.4  -1.0   1.5 | -0.11  0.20 -0.13 -0.34  0.54 |  124.00  |
| 2 . Opinion about wedding                   |
| Mar1 - indissoluble            231   231.00 |  -7.9   4.1  -3.3  -3.2   0.3 | -0.46  0.23 -0.19 -0.19  0.02 |    3.33  |
| Mar2 - dissolved serious pb    342   342.00 |  -1.8   3.4  -1.8   3.2  -0.7 | -0.08  0.15 -0.08  0.14 -0.03 |    1.92  |
| Mar3 - dissolved if agreem     387   387.00 |   8.7  -6.4   4.7  -0.2   0.2 |  0.35 -0.25  0.19 -0.01  0.01 |    1.58  |
| Mar4 - I do not know            39    39.00 |  -0.3  -1.3   0.0  -0.4   0.5 | -0.05 -0.21  0.00 -0.06  0.08 |   24.64  |
| 2_   - missing category          1     1.00 |   0.8   0.8  -0.8   0.1  -0.4 |  0.79  0.77 -0.81  0.09 -0.42 |  999.00  |
| 3 . Housekeeping works, take care of children... |
| Mn1  - only women do it         42    42.00 |  -3.5  -0.6  -0.9  -0.9  -0.4 | -0.52 -0.08 -0.14 -0.14 -0.06 |   22.81  |
| Mn2  - usually the women       336   336.00 |  -2.4   4.9  -1.4  -2.3   2.0 | -0.11  0.22 -0.06 -0.10  0.09 |    1.98  |
| Mn3  - men and women           599   599.00 |   3.6  -4.3   2.1   2.9  -2.1 |  0.09 -0.11  0.05  0.07 -0.05 |    0.67  |
| Mn4  - I do not know            19    19.00 |   0.7  -0.3  -2.1  -0.7   0.1 |  0.15 -0.07 -0.47 -0.15  0.02 |   51.63  |
| 3_   - missing category          4     4.00 |   0.2  -1.3   1.2  -0.9   1.9 |  0.11 -0.64  0.62 -0.43  0.93 |  249.00  |
| 4 . Are you satisfied of your daily life    |
| Cad1 - a lot                   259   259.00 |  -0.8   5.3  -3.4   1.7  -0.9 | -0.04  0.28 -0.18  0.09 -0.05 |    2.86  |
| Cad2 - enough                  549   549.00 |  -0.9   0.1   1.2   0.1   0.2 | -0.03  0.00  0.03  0.00  0.00 |    0.82  |
| Cad3 - a little                145   145.00 |   1.9  -4.8   1.3  -1.3   1.1 |  0.14 -0.37  0.10 -0.10  0.08 |    5.90  |
| Cad4 - not at all               46    46.00 |   0.6  -3.3   2.0  -1.6  -0.3 |  0.08 -0.47  0.29 -0.23 -0.04 |   20.74  |
| 4_   - missing category          1     1.00 |   0.4   1.5   0.7  -0.6  -0.1 |  0.35  1.52  0.72 -0.56 -0.12 |  999.00  |
| 5 . The environmental protection and maintenance is... |
| env1 - very important          657   657.00 |   8.0   0.0   1.8   0.8  -1.4 |  0.18  0.00  0.04  0.02 -0.03 |    0.52  |
| env2 - quite important         298   298.00 |  -7.1  -0.1  -0.7   0.3   0.6 | -0.34  0.00 -0.04  0.02  0.03 |    2.36  |
| env3 - not important            36    36.00 |  -3.0  -0.1  -2.8  -1.7   2.4 | -0.49 -0.01 -0.46 -0.27  0.39 |   26.78  |
| env4 - not at all important      7     7.00 |  -0.4   0.1  -0.1  -2.5  -0.3 | -0.16  0.05 -0.02 -0.94 -0.13 |  141.86  |
| 5_   - missing category          2     2.00 |   0.3   0.8  -0.1  -0.3  -0.5 |  0.20  0.59 -0.10 -0.24 -0.37 |  499.00  |
| 6 . Do scientific discoveries ameliorate the quality of life ? |
| sci1 - Yes, a little           509   509.00 |  -1.9  -0.2   0.3   0.5  -0.6 | -0.06  0.00  0.01  0.02 -0.02 |    0.96  |
| sci2 - Yes, a lot              383   383.00 |   3.1   1.8  -1.2   0.8  -0.3 |  0.12  0.07 -0.05  0.03 -0.01 |    1.61  |
| sci3 - Not at all              105   105.00 |  -1.6  -2.3   1.3  -2.0   1.6 | -0.15 -0.22  0.12 -0.19  0.15 |    8.52  |
| 6_   - missing category          3     3.00 |  -1.1  -1.5   0.1  -0.8  -0.6 | -0.65 -0.89  0.07 -0.49 -0.36 |  332.33  |
| 7 . Are you satisfied of your health        |
| Snt1 - a lot                   267   267.00 |   3.8   0.3   0.4  -1.1  -2.1 |  0.20  0.02  0.02 -0.06 -0.11 |    2.75  |
| Snt2 - satisfied               600   600.00 |  -2.7   0.4   0.4   2.0   2.3 | -0.07  0.01  0.01  0.05  0.06 |    0.67  |
| Snt3 - a little                115   115.00 |  -0.6  -0.8  -1.1  -1.3  -1.2 | -0.05 -0.07 -0.09 -0.12 -0.10 |    7.70  |
| Snt4 - not at all               18    18.00 |  -1.1  -0.5  -0.2  -0.3   1.3 | -0.25 -0.11 -0.05 -0.06  0.30 |   54.56  |
| 8 . Evolution of your daily life for the last 10 years |
| Ftr1 - improving a lot         102   102.00 |   1.7   0.9   0.5   1.8  -0.8 |  0.16  0.08  0.04  0.17 -0.08 |    8.80  |
| Ftr2 - improving a little      316   316.00 |  -1.2  -1.5   1.8   4.2  -0.9 | -0.05 -0.07  0.08  0.20 -0.04 |    2.16  |
| Ftr3 - the same                250   250.00 |   0.8   2.3  -2.6  -3.0  -2.1 |  0.05  0.12 -0.14 -0.16 -0.11 |    3.00  |
| Ftr4 - a little worse          190   190.00 |  -2.2   0.3   0.8  -2.1   3.7 | -0.14  0.02  0.05 -0.14  0.24 |    4.26  |
| Ftr5 - a lot worse             114   114.00 |   0.3  -0.1   1.3  -0.1   1.6 |  0.03 -0.01  0.12 -0.01  0.14 |    7.77  |
| Ftr6 - I do not know            26    26.00 |   2.9  -4.0  -3.2  -1.9  -2.3 |  0.55 -0.78 -0.61 -0.36 -0.45 |   37.46  |
| 8_   - missing category          2     2.00 |  -0.7  -0.3  -1.2  -1.8  -1.0 | -0.47 -0.23 -0.83 -1.30 -0.73 |  499.00  |
| 9 . Your opinion on the justice running in 1986 |
| Jus1 - very well                13    13.00 |   0.0   1.7  -2.2  -2.1   0.2 |  0.01  0.47 -0.60 -0.57  0.06 |   75.92  |
| Jus2 - quite well              243   243.00 |  -0.8   3.4  -0.1  -0.2  -0.8 | -0.05  0.19 -0.01 -0.01 -0.04 |    3.12  |
| Jus3 - quite bad               398   398.00 |   0.6  -1.0  -1.7   1.2  -1.8 |  0.02 -0.04 -0.06  0.05 -0.07 |    1.51  |
| Jus4 - very bad                256   256.00 |   1.3  -2.9   3.9  -1.3   1.1 |  0.07 -0.16  0.21 -0.07  0.06 |    2.91  |
| Jus5 - I do not know            65    65.00 |  -3.3   0.5  -2.1   0.0   0.6 | -0.40  0.05 -0.26  0.00  0.07 |   14.38  |
| Jus6 - do not answer            25    25.00 |   2.2  -0.1  -0.4   1.9   3.4 |  0.43 -0.02 -0.09  0.37  0.68 |   39.00  |
| 10 . Do you think the society needs to change |
| Soc1 - yes                     759   759.00 |   1.8  -4.8   3.1  -0.3  -0.4 |  0.03 -0.08  0.05 -0.01 -0.01 |    0.32  |
| Soc2 - no                      170   170.00 |  -0.6   4.4  -2.3   0.9  -0.6 | -0.04  0.31 -0.16  0.06 -0.04 |    4.88  |
| Soc3 - I do not know            71    71.00 |  -2.1   1.5  -1.7  -0.8   1.7 | -0.24  0.17 -0.20 -0.09  0.19 |   13.08  |
| 12 . Educational level of the respondent    |
| dip1 - No one                  189   189.00 | -10.8  -3.5  -3.5 -14.2   5.3 | -0.70 -0.23 -0.23 -0.93  0.34 |    4.29  |
| dip2 - CEP                     321   321.00 | -17.4   1.8   1.2   6.3  -1.5 | -0.80  0.08  0.05  0.29 -0.07 |    2.12  |
| dip3 - BEPC-BE-BEPS            158   158.00 |   3.1  -8.5  -2.3   6.5   7.7 |  0.23 -0.62 -0.17  0.47  0.56 |    5.33  |
| dip4 - Bac                     162   162.00 |  13.2  -1.7  -5.1   3.7 -14.0 |  0.95 -0.12 -0.37  0.26 -1.01 |    5.17  |
| dip5 - brevet sup.              20    20.00 |   3.4   2.2   0.2   0.9  -2.1 |  0.76  0.48  0.05  0.19 -0.46 |   49.00  |
| dip6 - University              142   142.00 |  15.8  10.9   9.8  -3.5   3.3 |  1.23  0.85  0.76 -0.27  0.26 |    6.04  |
| dip7 - other                     8     8.00 |   3.7   2.2   0.6  -0.2   1.4 |  1.31  0.77  0.21 -0.07  0.50 |  124.00  |
| 13 . What do you think about public nurseries |
| cre1 - very satisfying         139   139.00 |   1.6  -3.6   1.5   1.8   2.3 |  0.13 -0.28  0.12  0.14  0.18 |    6.19  |
| cre2 - quite satisfying        386   386.00 |   1.8   2.8   1.8   0.7  -0.7 |  0.07  0.11  0.07  0.03 -0.03 |    1.59  |
| cre3 - not very satisfying     242   242.00 |   1.6   0.4  -0.4  -1.3  -1.5 |  0.09  0.02 -0.02 -0.08 -0.08 |    3.13  |
| cre4 - not at all satisf.       92    92.00 |  -0.9  -1.8  -0.8  -1.1   0.8 | -0.09 -0.18 -0.08 -0.11  0.08 |    9.87  |
| cre5 - does not know           139   139.00 |  -5.8   0.6  -2.7  -0.1   0.0 | -0.45  0.05 -0.21 -0.01  0.00 |    6.19  |
| 13_  - missing category          2     2.00 |   2.0   0.5  -1.6  -0.9  -1.4 |  1.40  0.37 -1.11 -0.66 -0.98 |  499.00  |
| 14 . What do you think about at-home mothers |
| cre1 - very satisfying         786   786.00 |  -6.8   2.9  -4.5   3.5  -0.3 | -0.11  0.05 -0.07  0.06 -0.01 |    0.27  |
| cre2 - quite satisfying        129   129.00 |   6.0  -1.7   1.9  -1.8  -0.7 |  0.50 -0.14  0.16 -0.14 -0.06 |    6.75  |
| cre3 - not very satisfying      35    35.00 |   2.7  -1.5   2.8  -1.5   1.3 |  0.45 -0.25  0.47 -0.25  0.21 |   27.57  |
| cre4 - not at all satisf.       20    20.00 |   2.8  -1.0   2.4  -1.3   1.0 |  0.63 -0.22  0.53 -0.29  0.21 |   49.00  |
| cre5 - does not know            29    29.00 |  -1.0  -1.5   2.1  -2.3   0.2 | -0.19 -0.27  0.38 -0.43  0.03 |   33.48  |
| 14_  - missing category          1     1.00 |   0.8   0.8  -0.8   0.1  -0.4 |  0.79  0.77 -0.81  0.09 -0.42 |  999.00  |
| 16 . Do you like your landscape view        |
| Log1 - a lot                   516   516.00 |  -4.8   4.7  -4.1   2.2   0.0 | -0.15  0.14 -0.13  0.07  0.00 |    0.94  |
| Log2 - enough                  296   296.00 |   3.2  -0.4   3.7  -0.3  -0.5 |  0.16 -0.02  0.18 -0.01 -0.02 |    2.38  |
| Log3 - a little                 82    82.00 |   1.3  -2.6   1.8  -1.0   1.5 |  0.14 -0.27  0.19 -0.10  0.16 |   11.20  |
| Log4 - not at all              104   104.00 |   1.7  -4.7  -0.4  -2.0  -0.5 |  0.15 -0.44 -0.03 -0.19 -0.05 |    8.62  |
| 16_  - missing category          2     2.00 |   1.0   0.3   0.1  -2.4   0.1 |  0.69  0.23  0.05 -1.68  0.05 |  499.00  |
| 17 . Do you own a dish washing machine ?    |
| lav1 - Yes                     211   211.00 |   4.6   7.4   1.0   2.9  -6.0 |  0.28  0.45  0.06  0.18 -0.37 |    3.74  |
| lav2 - No                      789   789.00 |  -4.6  -7.4  -1.0  -2.9   6.0 | -0.07 -0.12 -0.02 -0.05  0.10 |    0.27  |
| 18 . Do you own a color TV ?                |
| tco1 - Yes                     373   373.00 |  -2.5   3.8  -0.6   0.4   0.2 | -0.10  0.16 -0.02  0.02  0.01 |    1.68  |
| tco2 - No                      624   624.00 |   2.6  -3.7   0.5  -0.4  -0.4 |  0.06 -0.09  0.01 -0.01 -0.01 |    0.60  |
| 18_  - missing category          3     3.00 |  -1.0  -0.3   0.8   0.1   1.0 | -0.59 -0.17  0.45  0.08  0.59 |  332.33  |
| 20 . Occupation status of housing           |
| Occ1 - homeowner               120   120.00 |  -3.6  -0.7   9.9  11.9 -15.2 | -0.31 -0.06  0.85  1.02 -1.30 |    7.33  |
| Occ2 - owner                   290   290.00 |  -8.9  20.2 -10.3  -1.4  -0.2 | -0.44  1.00 -0.51 -0.07 -0.01 |    2.45  |
| Occ3 - tenant                  523   523.00 |   9.0 -16.9   5.1  -5.0  10.9 |  0.27 -0.51  0.15 -0.15  0.33 |    0.91  |
| Occ4 - free housing             58    58.00 |   2.5  -2.2  -3.3  -2.6  -0.7 |  0.32 -0.28 -0.42 -0.33 -0.09 |   16.24  |
| Occ5 - other                     9     9.00 |   1.3  -0.2  -3.2  -1.1  -2.9 |  0.44 -0.06 -1.05 -0.38 -0.96 |  110.11  |
| 21 . The housing expenses are for you       |
| Dp1  - unimportant             113   113.00 |   0.1   2.8  -4.0  -2.1   0.9 |  0.01  0.24 -0.36 -0.19  0.08 |    7.85  |
| Dp2  - without big problem     444   444.00 |  -2.0   2.6   1.9  -0.4  -1.6 | -0.07  0.09  0.07 -0.01 -0.06 |    1.25  |
| Dp3  - a big problem           352   352.00 |   1.1  -2.9   1.9   2.8   1.9 |  0.05 -0.12  0.08  0.12  0.08 |    1.84  |
| Dp4  - a very big problem       55    55.00 |   0.2  -3.0   1.1   0.1   0.6 |  0.03 -0.39  0.14  0.01  0.07 |   17.18  |
| Dp5  - do not face with          6     6.00 |  -0.2   1.2   0.8  -0.8  -0.1 | -0.10  0.47  0.32 -0.32 -0.03 |  165.67  |
| Dp6  - I do not know            22    22.00 |   2.0  -1.2  -4.1  -1.6  -2.3 |  0.42 -0.25 -0.86 -0.34 -0.48 |   44.45  |
| 21_  - missing category          8     8.00 |   0.9  -0.1  -3.2  -1.9  -1.8 |  0.33 -0.05 -1.13 -0.66 -0.62 |  124.00  |
| 22 . Are you embarrassed with the noise ?   |
| bru1 - a little                196   196.00 |   1.7   0.4   0.6  -0.7  -0.2 |  0.11  0.03  0.04 -0.05 -0.02 |    4.10  |
| bru2 - a lot                   197   197.00 |   2.8  -3.3   0.9  -3.8   1.8 |  0.18 -0.21  0.06 -0.25  0.11 |    4.08  |
| bru3 - not at all              606   606.00 |  -3.6   2.3  -1.2   3.7  -1.3 | -0.09  0.06 -0.03  0.09 -0.03 |    0.65  |
| 22_  - missing category          1     1.00 |  -1.0   0.9  -1.1   0.4   1.0 | -0.99  0.92 -1.15  0.36  0.95 |  999.00  |
| 23 . Do you participate to the environmental protection ? |
| df1  - Yes                     126   126.00 |   6.5   0.9   1.9  -1.3  -3.6 |  0.54  0.07  0.16 -0.11 -0.30 |    6.94  |
| df2  - No                      874   874.00 |  -6.5  -0.9  -1.9   1.3   3.6 | -0.08 -0.01 -0.02  0.02  0.04 |    0.14  |
| 24 . Last job                               |
| cs01 - manoeuvre                13    13.00 |  -3.1  -1.6   1.7  -1.6  -0.9 | -0.87 -0.44  0.46 -0.43 -0.26 |   75.92  |
| cs02 - ouvrier spécialisé       98    98.00 |  -9.1  -5.6   5.6  -6.0  -1.6 | -0.87 -0.54  0.53 -0.58 -0.16 |    9.20  |
| cs03 - ouvrier qualifié        152   152.00 | -11.3  -6.7   8.0 -10.0  -2.6 | -0.85 -0.50  0.60 -0.75 -0.19 |    5.58  |
| cs04 - employé de commerce      39    39.00 |  -0.6  -2.6  -2.0   5.5   4.7 | -0.09 -0.41 -0.31  0.86  0.74 |   24.64  |
| cs05 - autre employé qual.      68    68.00 |   2.3  -3.9  -2.5   7.2   5.1 |  0.27 -0.45 -0.30  0.84  0.59 |   13.71  |
| cs06 - autre emp. non qual.     91    91.00 |  -1.5  -2.4  -3.8   6.3   6.5 | -0.15 -0.24 -0.38  0.63  0.65 |    9.99  |
| cs07 - personnel de service     70    70.00 |  -2.8  -1.9  -4.6   5.7   5.6 | -0.32 -0.22 -0.53  0.65  0.65 |   13.29  |
| cs08 - contremaître             14    14.00 |  -1.9   1.5  -0.9   0.2   0.7 | -0.49  0.39 -0.24  0.04  0.20 |   70.43  |
| cs09 - artisan                  18    18.00 |  -2.3  -0.4  -1.2   3.0   3.3 | -0.55 -0.09 -0.28  0.70  0.78 |   54.56  |
| cs10 - petit commerçant         35    35.00 |  -2.8   1.1  -2.6   3.5   3.7 | -0.47  0.19 -0.43  0.58  0.62 |   27.57  |
| cs11 - cadre moyen             135   135.00 |   9.1   6.6   7.9   2.3  -3.5 |  0.73  0.53  0.64  0.19 -0.28 |    6.41  |
| cs12 - patron indus.commer.     10    10.00 |   2.2   4.4   1.8  -1.9   0.7 |  0.70  1.38  0.57 -0.61  0.22 |   99.00  |
| cs13 - profession libérale      15    15.00 |   3.9   6.7   3.0  -0.5   0.5 |  0.99  1.72  0.76 -0.14  0.12 |   65.67  |
| cs14 - cadre supérieur          69    69.00 |   9.2  10.8   9.0  -1.8   0.8 |  1.07  1.25  1.05 -0.21  0.09 |   13.49  |
| cs15 - exploitant agricole      32    32.00 |  -7.2   5.5  -4.6   0.8  -2.5 | -1.25  0.95 -0.80  0.13 -0.44 |   30.25  |
| cs16 - salarié agricole          0     0.00 |   0.0   0.0   0.0   0.0   0.0 |  0.00  0.00  0.00  0.00  0.00 |    0.00  |
| cs17 - autre actif              13    13.00 |   1.2   1.2   1.2  -1.0  -1.9 |  0.34  0.32  0.34 -0.27 -0.51 |   75.92  |
| cs99 - inconnu                   3     3.00 |   0.2   0.7  -1.7  -1.4  -0.9 |  0.12  0.43 -0.97 -0.78 -0.50 |  332.33  |
| 24_  - missing category        125   125.00 |  11.4  -2.4 -16.6  -5.0 -10.9 |  0.96 -0.20 -1.39 -0.42 -0.91 |    7.00  |
| 25 . Does your job expose you to health risk ? |
| tra1 - Lots of risks           108   108.00 |  -4.7  -2.1   4.5  -0.9  -3.7 | -0.43 -0.19  0.40 -0.08 -0.33 |    8.26  |
| tra2 - Few risks               192   192.00 |  -2.1  -1.5   7.1   0.5  -1.5 | -0.14 -0.09  0.46  0.03 -0.10 |    4.21  |
| tra3 - No risk                 276   276.00 |   2.5  -2.5   5.1   6.8   3.9 |  0.13 -0.13  0.26  0.35  0.20 |    2.62  |
| 25_  - missing category        424   424.00 |   2.4   4.7 -13.1  -6.0   0.0 |  0.09  0.17 -0.48 -0.22  0.00 |    1.36  |
| 26 . Do you have work-personal life problems |
| con1 - yes                     229   229.00 |   2.8  -1.6   7.7   3.2   1.1 |  0.16 -0.09  0.45  0.18  0.06 |    3.37  |
| con2 - no                      338   338.00 |  -5.0  -3.4   6.7   3.3  -0.9 | -0.22 -0.15  0.29  0.15 -0.04 |    1.96  |
| 26_  - missing category        433   433.00 |   2.4   4.6 -12.9  -5.8  -0.1 |  0.09  0.17 -0.47 -0.21  0.00 |    1.31  |
| 27 . Have you recently been nervous         |
| ner1 - Yes                     273   273.00 |   2.6  -1.7   0.6   1.7   0.8 |  0.13 -0.09  0.03  0.09  0.04 |    2.66  |
| ner2 - No                      726   726.00 |  -2.6   1.7  -0.6  -1.8  -0.8 | -0.05  0.03 -0.01 -0.04 -0.02 |    0.38  |
| 27_  - missing category          1     1.00 |   0.4   0.3  -0.5   1.2   0.8 |  0.35  0.30 -0.53  1.23  0.77 |  999.00  |
| 28 . Have you recently been depressed       |
| ta1  - Yes                     122   122.00 |   2.5  -1.9   0.1   0.6   0.7 |  0.21 -0.16  0.01  0.05  0.06 |    7.20  |
| ta2  - No                      874   874.00 |  -2.1   1.8  -0.2  -0.4  -0.6 | -0.03  0.02  0.00 -0.01 -0.01 |    0.14  |
| 28_  - missing category          4     4.00 |  -1.7   0.4   0.5  -0.9  -0.1 | -0.87  0.18  0.27 -0.45 -0.05 |  249.00  |
| 30 . Do you own real estate properties ?    |
| vim1 - Yes                      81    81.00 |   2.5   8.6  -3.8  -0.3  -1.6 |  0.27  0.92 -0.41 -0.03 -0.18 |   11.35  |
| vim2 - No                      918   918.00 |  -2.6  -8.6   3.6   0.2   1.9 | -0.02 -0.08  0.03  0.00  0.02 |    0.09  |
| 30_  - missing category          1     1.00 |   0.7   0.7   1.6   0.8  -2.2 |  0.69  0.66  1.63  0.78 -2.18 |  999.00  |
| 31 . Do you regularly impose restrictions   |
| rst1 - Yes                     569   569.00 |   1.5  -6.3   1.3   0.8   1.6 |  0.04 -0.17  0.04  0.02  0.04 |    0.76  |
| rst2 - No                      414   414.00 |  -1.3   6.4  -1.4  -1.0  -1.6 | -0.05  0.24 -0.05 -0.04 -0.06 |    1.42  |
| 31_  - missing category         17    17.00 |  -0.6  -0.1   0.6   0.6   0.1 | -0.15 -0.01  0.15  0.13  0.02 |   57.82  |
| 32 . Your opinion on the evolution of French people life level |
| Fr1  - a lot better             78    78.00 |  -1.6   4.0  -1.6   0.4   0.3 | -0.17  0.43 -0.17  0.05  0.03 |   11.82  |
| Fr2  - a little better         321   321.00 |  -0.1   3.9  -2.4   1.7  -3.8 |  0.00  0.18 -0.11  0.08 -0.18 |    2.12  |
| Fr3  - it is the same          159   159.00 |  -1.6  -1.9   0.7   0.4   0.7 | -0.11 -0.14  0.05  0.03  0.05 |    5.29  |
| Fr4  - a little worse          276   276.00 |   0.1  -1.7   2.0  -1.3   1.3 |  0.00 -0.09  0.10 -0.07  0.07 |    2.62  |
| Fr5  - a lot worse             108   108.00 |   3.2  -3.1   3.4  -0.8   2.6 |  0.29 -0.29  0.31 -0.08  0.24 |    8.26  |
| Fr6  - I do not know            57    57.00 |  -0.1  -1.8  -2.7  -1.0   0.3 | -0.02 -0.23 -0.35 -0.12  0.04 |   16.54  |
| 32_  - missing category          1     1.00 |   0.9  -1.4  -0.7   0.5   0.3 |  0.94 -1.43 -0.75  0.48  0.25 |  999.00  |
| 33 . Do you invite some friends for dinner ? |
| bou1 - Often                   606   606.00 |   8.1  -0.6   2.3   4.5  -2.9 |  0.21 -0.02  0.06  0.11 -0.07 |    0.65  |
| bou2 - Rarely                  274   274.00 |  -5.0   0.2   0.5  -1.9   1.1 | -0.26  0.01  0.02 -0.10  0.06 |    2.65  |
| bou3 - Never                   120   120.00 |  -5.4   0.6  -4.0  -4.1   2.9 | -0.46  0.05 -0.35 -0.35  0.25 |    7.33  |
| 34 . Are you a member of a religious association ? |
| asc1 - yes                      69    69.00 |   1.8   6.2  -1.2   0.6  -1.4 |  0.21  0.72 -0.14  0.07 -0.16 |   13.49  |
| asc2 - no                      931   931.00 |  -1.8  -6.2   1.2  -0.6   1.4 | -0.02 -0.05  0.01 -0.01  0.01 |    0.07  |
| 35 . Do you watch TV                        |
| Télé1 - every day              419   419.00 | -12.1   2.6  -5.0  -1.7   1.3 | -0.45  0.10 -0.19 -0.06  0.05 |    1.39  |
| Télé2 - quite often            226   226.00 |   1.5   0.2   1.9   0.6   0.5 |  0.09  0.01  0.11  0.04  0.03 |    3.42  |
| Télé3 - not very often         231   231.00 |   7.4  -1.7   3.2   1.8  -3.2 |  0.43 -0.10  0.18  0.11 -0.19 |    3.33  |
| Télé4 - never                  124   124.00 |   6.9  -2.0   1.0  -0.6   1.5 |  0.58 -0.17  0.09 -0.05  0.13 |    7.06  |
| 36 . In order to change the society, do we need... |
| cha1 - Progressive reforms     490   490.00 |  -0.5  -0.4  -1.7   1.5  -2.4 | -0.01 -0.01 -0.06  0.05 -0.08 |    1.04  |
| cha2 - Radical changes         258   258.00 |   2.4  -4.2   5.5  -1.4   2.0 |  0.13 -0.23  0.30 -0.07  0.11 |    2.88  |
| cha3 - does not know            29    29.00 |  -0.4  -0.5  -1.8  -2.4   1.7 | -0.07 -0.09 -0.34 -0.44  0.31 |   33.48  |
| 36_  - missing category        223   223.00 |  -1.9   5.1  -3.0   0.6   0.1 | -0.11  0.30 -0.18  0.03  0.01 |    3.48  |
| 38 . Are you a member of at least one association ? |
| ass1 - yes                     536   536.00 |   3.3   4.6   1.2   1.9  -5.8 |  0.10  0.14  0.03  0.06 -0.17 |    0.87  |
| ass2 - no                      464   464.00 |  -3.3  -4.6  -1.2  -1.9   5.8 | -0.11 -0.16 -0.04 -0.06  0.20 |    1.16  |
| 40 . Age and gender                         |
| enq1 - male LE than 38 yo      101   101.00 |   4.1  -1.2   0.4   1.3  -1.4 |  0.39 -0.11  0.04  0.12 -0.13 |    8.90  |
| enq2 - male GT 38 yo            35    35.00 |  -3.1   0.0  -0.6   2.4  -3.3 | -0.52  0.00 -0.10  0.39 -0.55 |   27.57  |
| enq3 - female LE 38 yo         526   526.00 |   7.0   1.3   3.9  -2.6   6.3 |  0.21  0.04  0.12 -0.08  0.19 |    0.90  |
| enq4 - female GT 38 yo         338   338.00 |  -8.8  -0.6  -4.1   1.0  -4.5 | -0.39 -0.03 -0.18  0.04 -0.20 |    1.96  |
| enq* - unknown                   0     0.00 |   0.0   0.0   0.0   0.0   0.0 |  0.00  0.00  0.00  0.00  0.00 |    0.00  |
| 41 . Usual bedtime ?                        |
| dod1 - 21h. or before           73    73.00 |  -6.4   2.1  -2.9  -3.5   2.6 | -0.72  0.24 -0.32 -0.39  0.29 |   12.70  |
| dod2 - between 21h - 22h.      270   270.00 |  -7.5   0.6  -0.8   0.7   0.3 | -0.39  0.03 -0.04  0.04  0.02 |    2.70  |
| dod3 - between 22h - 23h.      443   443.00 |   1.2  -1.3  -0.5   3.2  -1.1 |  0.04 -0.05 -0.02  0.11 -0.04 |    1.26  |
| dod4 - between 23h - 24h.      134   134.00 |   8.1   0.7   3.0  -1.2  -0.5 |  0.65  0.06  0.24 -0.10 -0.04 |    6.46  |
| dod5 - after midnight           63    63.00 |   6.1  -1.8   0.7  -2.0  -0.7 |  0.74 -0.22  0.08 -0.24 -0.08 |   14.87  |
| dod6 - variable                 11    11.00 |   0.1  -1.3   2.4   0.3   0.3 |  0.02 -0.38  0.72  0.10  0.09 |   89.91  |
| 41_  - missing category          6     6.00 |   1.9   2.1  -0.8  -1.9   0.3 |  0.75  0.87 -0.31 -0.76  0.13 |  165.67  |
| 43 . Has the respondent been interested by the survey |
| Int1 - a lot                   332   332.00 |  -2.8   1.9  -3.3   3.3  -4.9 | -0.13  0.08 -0.15  0.15 -0.22 |    2.01  |
| Int2 - enough                  542   542.00 |   1.0  -1.7   1.1  -1.6   2.2 |  0.03 -0.05  0.03 -0.05  0.06 |    0.85  |
| Int3 - a little or not         124   124.00 |   2.3   0.0   3.1  -2.2   3.8 |  0.19  0.00  0.26 -0.18  0.32 |    7.06  |
| 43_  - missing category          2     2.00 |   2.1  -1.1  -0.4  -0.4  -1.4 |  1.46 -0.76 -0.29 -0.29 -0.96 |  499.00  |
| 44 . Ideal number of children ?             |
| enf0 - No one                   51    51.00 |   0.4  -0.4   1.5  -1.9   0.4 |  0.06 -0.06  0.20 -0.26  0.06 |   18.61  |
| enf1 - one                      39    39.00 |   0.5  -1.3   1.1  -0.1   2.5 |  0.07 -0.20  0.17 -0.02  0.39 |   24.64  |
| enf2 - two                     431   431.00 |  -2.4  -3.1   1.3   0.9   1.9 | -0.09 -0.11  0.05  0.03  0.07 |    1.32  |
| enf3 - three                   393   393.00 |   0.1   2.4  -2.0   1.3  -1.5 |  0.00  0.10 -0.08  0.05 -0.06 |    1.54  |
| enf4 - four and more            86    86.00 |   3.5   2.4  -0.8  -2.3  -2.7 |  0.36  0.25 -0.08 -0.24 -0.28 |   10.63  |
| 54 . Job in 7 categories                    |
| csp1 - ex. agr.-art-commer      95    95.00 |  -6.4   5.3  -4.3   3.4   2.5 | -0.62  0.52 -0.42  0.33  0.25 |    9.53  |
| csp2 - prof. lib.-cad. sup.     84    84.00 |  10.1  12.8   9.5  -1.9   0.9 |  1.06  1.34  1.00 -0.20  0.09 |   10.90  |
| csp3 - ouvriers                263   263.00 | -16.1  -9.7  10.7 -12.6  -3.5 | -0.86 -0.51  0.57 -0.67 -0.18 |    2.80  |
| csp4 - employés                198   198.00 |   0.1  -5.5  -5.3  11.7  10.2 |  0.01 -0.35 -0.34  0.74  0.65 |    4.05  |
| csp5 - contrema.-cad. moy.     149   149.00 |   8.2   6.8   7.3   2.3  -3.1 |  0.62  0.52  0.55  0.17 -0.24 |    5.71  |
| csp6 - personnel de service     70    70.00 |  -2.8  -1.9  -4.6   5.7   5.6 | -0.32 -0.22 -0.53  0.65  0.65 |   13.29  |
| csp7 - autres                   16    16.00 |   1.2   1.4   0.4  -1.5  -2.1 |  0.30  0.34  0.09 -0.37 -0.51 |   61.50  |
| 54_  - missing category        125   125.00 |  11.4  -2.4 -16.6  -5.0 -10.9 |  0.96 -0.20 -1.39 -0.42 -0.91 |    7.00  |
+---------------------------------------------+-------------------------------+-------------------------------+----------+

CORRELATIONS BETWEEN CONTINUOUS VARIABLES AND FACTORS
AXES 1 TO 5
+-----------------------------------+------------------------------------+------------------------------------+
|            VARIABLES              |         SUMMARY STATISTICS         |           CORRELATIONS             |
|-----------------------------------+------------------------------------+------------------------------------|
| NUM . (IDEN)  SHORT LABEL         | COUNT  ABS.WT     MEAN     ST.DEV. |     1     2     3     4     5      |
+-----------------------------------+------------------------------------+------------------------------------+
| 15 . (ring) Engineer annual sala  |  806   806.00  8478.73    3668.95 | -0.04  0.06  0.04 -0.01  0.05      |
| 19 . (rmed) Doctor annual salary  |  713   713.00 19383.85   12608.83 | -0.05  0.12  0.02  0.03 -0.06      |
| 37 . (âge ) Age                   | 1000  1000.00    42.68      17.50 | -0.40  0.55 -0.14 -0.21  0.28      |
| 42 . (nrep) Number of missing va  | 1000  1000.00     4.05       4.19 | -0.20  0.12 -0.20 -0.08  0.12      |
| 45 . (fin ) End of study age      |  997   997.00    17.29       3.88 |  0.69  0.13  0.24  0.05 -0.11      |
| 46 . (rsou) Salary wishes         |  915   915.00  7244.48    4756.78 |  0.26  0.21  0.15  0.03 -0.09      |
| 47 . (rmin) Resources estimation  |  897   897.00  5561.89    2423.40 |  0.19 -0.01  0.14 -0.08  0.14      |
| 48 . (vaca) Summer holidays in n  | 1000  1000.00    18.31      19.37 |  0.38  0.02  0.03 -0.06 -0.07      |
| 50 . (POND) Weight                | 1000  1000.00     1.00       0.09 | -0.47  0.27 -0.11  0.25 -0.22      |
+-----------------------------------+------------------------------------+------------------------------------+


DEFAC - Factors description


This procedure provides help for interpreting the factors extracted from a factorial
analysis.
A factor can be described quickly and clearly by its most significant elements. These may
be cases, categorical variables, continuous variables or frequencies, whether used as
active or illustrative elements in the preceding analysis.

THE DESCRIPTION COMMAND TAB


THE PARAMETERS TAB

THE DEFAC RESULTS


INTERPRETATION TOOLS FOR FACTORIAL AXES
PRINTOUT ON FACTOR 1 BY ACTIVE CATEGORIES
+------+---------+----------------------+-----------------------------------------------+--------+--------+
| IDEN.| T.VALUE | CATEGORY LABEL       | VARIABLE LABEL                                | WEIGHT | NUMBER |
|------+---------+----------------------+-----------------------------------------------+--------+--------|
| die2 |  -17.40 | CEP                  | Diploma in 5 categories                       | 321.00 |    1   |
| emp1 |  -16.14 | Worker               | Job category                                  | 263.00 |    2   |
| die1 |  -10.75 | No one               | Diploma in 5 categories                       | 189.00 |    3   |
| agg1 |  -10.11 | Lower than 2.000     | Urban area size (number of inhabitants)      |  83.00 |    4   |
| slo2 |   -8.88 | owner                | Occupation status of housing in 4 categories | 290.00 |    5   |
| masc |   -8.62 | male                 | Gender                                        | 469.00 |    6   |
| vmo2 |   -8.14 | No                   | Do you own some securities ?                  | 879.00 |    7   |
|------+---------+----------------------+-----------------------------------------------+--------+--------|
|                                     M I D D L E   A R E A                                              |
|------+---------+----------------------+-----------------------------------------------+--------+--------|
| slo3 |    8.98 | tenant               | Occupation status of housing in 4 categories | 523.00 |   22   |
| agc1 |   10.73 | Lower than 25 yo     | Age in 5 categories                           | 150.00 |   23   |
| 49_  |   11.44 | missing category     | Job category                                  | 125.00 |   24   |
| agg5 |   13.22 | Paris                | Urban area size (number of inhabitants)      | 326.00 |   25   |
| die4 |   13.86 | Bac - Brevet sup.    | Diploma in 5 categories                       | 182.00 |   26   |
| emp3 |   14.64 | Manager              | Job category                                  | 229.00 |   27   |
| die5 |   16.38 | University           | Diploma in 5 categories                       | 150.00 |   28   |
+------+---------+----------------------+-----------------------------------------------+--------+--------+
PRINTOUT ON FACTOR 2 BY ACTIVE CATEGORIES
+------+---------+----------------------+-----------------------------------------------+--------+--------+
| IDEN.| T.VALUE | CATEGORY LABEL       | VARIABLE LABEL                                | WEIGHT | NUMBER |
|------+---------+----------------------+-----------------------------------------------+--------+--------|
| vmo2 |  -17.08 | No                   | Do you own some securities ?                  | 879.00 |    1   |
| slo3 |  -16.86 | tenant               | Occupation status of housing in 4 categories | 523.00 |    2   |
| agc1 |  -13.04 | Lower than 25 yo     | Age in 5 categories                           | 150.00 |    3   |
| emp1 |   -9.65 | Worker               | Job category                                  | 263.00 |    4   |
| agc2 |   -8.90 | 25 to 34 yo          | Age in 5 categories                           | 284.00 |    5   |
| agg4 |   -8.83 | greater than 100.000 | Urban area size (number of inhabitants)      | 329.00 |    6   |
| die3 |   -8.54 | BEPC-BE-BEPS         | Diploma in 5 categories                       | 158.00 |    7   |
|------+---------+----------------------+-----------------------------------------------+--------+--------|
|                                     M I D D L E   A R E A                                              |
|------+---------+----------------------+-----------------------------------------------+--------+--------|
| agc3 |    5.86 | 35 to 49 yo          | Age in 5 categories                           | 209.00 |   22   |
| agg1 |    7.86 | Lower than 2.000     | Urban area size (number of inhabitants)      |  83.00 |   23   |
| die5 |   11.21 | University           | Diploma in 5 categories                       | 150.00 |   24   |
| agc5 |   11.97 | 65 yo and more       | Age in 5 categories                           | 169.00 |   25   |
| emp3 |   14.87 | Manager              | Job category                                  | 229.00 |   26   |
| vmo1 |   17.08 | Yes                  | Do you own some securities ?                  | 121.00 |   27   |
| slo2 |   20.24 | owner                | Occupation status of housing in 4 categories | 290.00 |   28   |
+------+---------+----------------------+-----------------------------------------------+--------+--------+
PRINTOUT ON FACTOR 3 BY ACTIVE CATEGORIES
+------+---------+----------------------+-----------------------------------------------+--------+--------+
| IDEN.| T.VALUE | CATEGORY LABEL       | VARIABLE LABEL                                | WEIGHT | NUMBER |
|------+---------+----------------------+-----------------------------------------------+--------+--------|
| 49_  |  -16.60 | missing category     | Job category                                  | 125.00 |    1   |
| femi |  -12.80 | female               | Gender                                        | 531.00 |    2   |
| agc1 |  -11.84 | Lower than 25 yo     | Age in 5 categories                           | 150.00 |    3   |
| slo2 |  -10.29 | owner                | Occupation status of housing in 4 categories | 290.00 |    4   |
| agg1 |  -10.05 | Lower than 2.000     | Urban area size (number of inhabitants)      |  83.00 |    5   |
| emp2 |   -8.50 | Employee             | Job category                                  | 335.00 |    6   |
| agc4 |   -6.40 | 50 to 64 yo          | Age in 5 categories                           | 188.00 |    7   |
|------+---------+----------------------+-----------------------------------------------+--------+--------|
|                                     M I D D L E   A R E A                                              |
|------+---------+----------------------+-----------------------------------------------+--------+--------|
| agc3 |    6.74 | 35 to 49 yo          | Age in 5 categories                           | 209.00 |   22   |
| die5 |    9.75 | University           | Diploma in 5 categories                       | 150.00 |   23   |
| slo1 |    9.87 | homeowner            | Occupation status of housing in 4 categories | 120.00 |   24   |
| emp1 |   10.73 | Worker               | Job category                                  | 263.00 |   25   |
| agc2 |   12.62 | 25 to 34 yo          | Age in 5 categories                           | 284.00 |   26   |
| masc |   12.80 | male                 | Gender                                        | 469.00 |   27   |
| emp3 |   13.18 | Manager              | Job category                                  | 229.00 |   28   |
+------+---------+----------------------+-----------------------------------------------+--------+--------+


CLUSTERING WITH SPAD

HAC / MIXED : clustering on factors scores

PARTI - DECLA : tree cut and cluster description

CLASS - MINER : clusters characterization

ESCAL : Storing the factorial axes and the partitions


The term cluster analysis encompasses a number of different algorithms and methods for
grouping objects (cases) of similar kind into respective categories. A general question
facing researchers in many areas of inquiry is how to organize observed data into
meaningful structures, that is, to develop taxonomies.
In other words, cluster analysis is an exploratory data analysis tool which aims at sorting
different objects into groups in such a way that the degree of association between two objects is
maximal if they belong to the same group and minimal otherwise.
Note that the above discussions refer to clustering algorithms and do not mention
anything about statistical significance testing. The point here is that, unlike many other
statistical procedures, cluster analysis methods are mostly used when we do not have any
a priori hypotheses, but are still in the exploratory phase of our research. In a sense, cluster
analysis finds the "most significant solution possible."


RECIP / SEMIS - CLUSTERING ON FACTORS SCORES

JUSTIFICATION FOR THE USE OF FACTORS SCORES


The RECIP/SEMIS procedure allows the SPAD user to perform a cluster analysis on
factor scores.
Performing a cluster analysis on a set of p variables is equivalent to performing it on the
p factor scores extracted from the factorial analysis.
Indeed, by transforming the original variables into factors without reducing the number
of dimensions, we do not lose any information, even though the factors are extracted in
decreasing order of explained variance. Mathematically, it is simply a rotation of the
original space.
However, it is interesting to consider a smaller factorial space with q dimensions (q
lower than p) and perform the cluster analysis on these q first factor scores. This way, we
focus on the most interesting part of the information (in the sense that the q factors
capture the main part of the overall variability) and we exclude the noisy remaining
information captured by the last factors.
In general, the data reduction obtained by selecting the first q factors provides better and
more robust clustering.
The factors to retain for the cluster analysis are the ones that span a smaller space in
which the point cloud is stable.
In general, we retain a little more than half of the factors (for MCA), even if a scree
appears after a few factors on the eigenvalues graph.
In the Parameters tab of this method, you can set the number of factors to retain for the
cluster analysis (10 by default).
Working with factor scores means that whatever the factorial analysis performed, the
cluster analysis is always processed on quantitative data.
A single distance, the Euclidean distance, is used to measure the resemblance between
cases, and a single aggregation criterion, Ward's criterion, is used to measure the
difference between two disjoint groups of cases, as sketched below.
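The following is a minimal sketch of this approach in Python, using scikit-learn and SciPy
as stand-ins for SPAD's own implementation; the data, the number of factors q and the
number of clusters are assumptions chosen for illustration.

```python
# Sketch: cluster analysis on the first q factor scores (not SPAD's code).
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(1000, 30))   # stand-in for the data table

q = 14                                                 # factors kept for clustering
scores = PCA(n_components=q).fit_transform(X)          # the q first factor scores

# Ward aggregation on Euclidean distances, as in RECIP
tree = linkage(scores, method="ward")
labels = fcluster(tree, t=7, criterion="maxclust")     # cut the tree into 7 clusters
```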


THE CLUSTERING ALGORITHMS


The clustering algorithms available in the SPAD software are Hierarchical
Agglomerative Clustering (HAC or AHC, RECIP in SPAD), which provides a hierarchy
of partitions, and the k-means method, which provides a single partition.
Combining these two methods (mixed clustering) enables the consolidation of the
partition and increases its stability (SEMIS).
These two methods present the following disadvantages:
HAC provides a large number of partitions among which one has to be chosen: it is not
always simple to select the most significant cut in the clustering tree. Moreover, the
clustering tree is not an optimal tree, because the partition produced at a certain
level depends on the one produced at the previous step.
With the k-means method, the number of clusters has to be set by the user before
performing the analysis, and the partition produced depends on the initial position of
the cluster centroids.
In order to compensate for these disadvantages and to try to approach the optimal partition, if
it exists, we can combine the use of HAC and the k-means method: this is the purpose
of the mixed clustering method, called SEMIS in SPAD.
A first way of combining these two methods is the following: we perform k-means with a
large number of centroids and then build a hierarchical tree by successively aggregating
the large number of clusters provided by the k-means step, as sketched below.
However, this method is relatively unstable on small samples.
It is advised to choose HAC for sample sizes lower than 10,000 cases. For larger samples,
the mixed clustering method reduces processing time considerably and produces stable partitions.
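A rough sketch of the mixed strategy, under the same assumptions as above (an
illustration, not the SEMIS code; SPAD additionally weights each pre-cluster by its size
when building the tree):

```python
# Sketch: k-means with many centroids, then a hierarchical tree on the centroids.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

scores = np.random.default_rng(1).normal(size=(50_000, 14))  # stand-in factor scores

km = KMeans(n_clusters=200, n_init=5, random_state=0).fit(scores)

# aggregate the 200 k-means centroids with Ward's criterion
tree = linkage(km.cluster_centers_, method="ward")
centroid_labels = fcluster(tree, t=7, criterion="maxclust")

final_labels = centroid_labels[km.labels_]   # map each case to one of the 7 clusters
```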


In the CORMU parameters, modify the number of retained coordinates: 14.

THE RECIP / SEMIS parameters


The HAC algorithm (RECIP)

Coordinates used for aggregation


With this parameter, the SPAD user defines the number of factors to retain for
the cluster analysis. This choice depends on the study of the eigenvalues in the
previous factorial analysis.
In our example, we use the first 14 factors.


The mixed clustering method (SEMIS)

Starting partition
Three procedures are available.
The first one consists of searching for stable clusters by crossing several partitions
built from randomly selected centroids.
The item Number defines the number of partitions (2 by default) and Size
determines the number of centroids for each partition.
The others produce a single partition based on N centroids chosen by the SPAD
user or randomly selected.


THE HIERARCHY EDITOR


To access the hierarchy editor, double-click on its icon.

THE CLUSTERING TREE


This tree is the graphical display of the hierarchy of partitions.
Its interest is to suggest graphically the number of clusters present in the dataset:
we cut the tree where the gap between successive aggregation levels is the largest.

[Clustering tree display: each node is labelled with the percentage of cases it contains (here between 7% and 48%).]

THE TOOL BAR OF THE HIERARCHY EDITOR


The toolbar buttons let you display or delete the labels, display or delete the node
numbers, display the aggregation criteria, display the cuts of the tree, and switch
between a vertical and a horizontal tree.

CURVE OF THE LEVEL INDEXES

Edit - Curve of the level indexes

The level index is the gain of inter-cluster inertia obtained by subdividing one node into
two nodes.
The largest bar corresponds to the cut of the tree into two clusters; a small sketch of this
rule is given below.
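As a sketch (assuming a SciPy linkage matrix `tree` like the one built earlier), the rule
"cut where the gap is largest" can be written as:

```python
# Sketch: cut the tree inside the widest gap between successive level indexes.
import numpy as np
from scipy.cluster.hierarchy import fcluster

heights = tree[:, 2]                  # level index of each aggregation, increasing
top = heights[-10:]                   # inspect the 10 highest-level aggregations
j = int(np.argmax(np.diff(top)))      # widest gap between two successive levels
labels = fcluster(tree, t=(top[j] + top[j + 1]) / 2, criterion="distance")
# cutting inside that gap yields 10 - j clusters here
```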


PARTI - DECLA - CUT OF THE TREE AND CLUSTERS DESCRIPTION

The PARTI procedure constructs partitions by pruning an aggregation tree. It creates
the partitions requested by the user, or found by an automatic search for the best
partitions, possibly improving them by iterations on moving centers (consolidation). The
partitions created in this way are then characterized automatically.
The DECLA procedure lets you describe the partitions determined by the PARTI
procedure.
We can describe either each cluster of a partition, or the partition itself globally. All the
available elements (active and illustrative) may participate in the characterization:
categories of categorical variables, the categorical variables themselves, continuous variables,
frequencies, and the factorial axes.

THE PARAMETERS OF THE PARTI-DECLA METHOD


THE CHOICE OF PARTITIONS TAB


THE PARTITIONING PARAMETERS TAB

THE PARTITIONS CHARACTERIZATION TAB


See the DEMOD method.


THE PARTI-DECLA RESULTS


BUILDING UP PARTITIONS
DETERMINING THE BEST PARTITIONS
RESEARCH OF IRREGULARITIES
+--------------+--------------+------------------------------------------------------+
| IRREGULARITY | IRREGULARITY |                                                      |
|   BETWEEN    |    VALUE     |                                                      |
+--------------+--------------+------------------------------------------------------+
| 1993 -- 1994 |    -39.99    | **************************************************** |
| 1990 -- 1991 |    -19.79    | **************************                           |
| 1995 -- 1996 |    -17.70    | ************************                             |
+--------------+--------------+------------------------------------------------------+

LIST OF THE 3 BEST PARTITIONS BETWEEN 3 AND 10 CLUSTERS
  1 - PARTITION INTO  7 CLUSTERS
  2 - PARTITION INTO 10 CLUSTERS
  3 - PARTITION INTO  5 CLUSTERS

CUT "b" OF THE TREE INTO 7 CLUSTERS
CLUSTERS FORMATION (ON ACTIVE CASES) - SUMMARY DESCRIPTION
+---------+-------+--------+----------+
| CLUSTER | COUNT | WEIGHT | CONTENT  |
+---------+-------+--------+----------+
|  bb1b   |  106  | 106.00 |  1 TO  5 |
|  bb2b   |  375  | 375.00 |  6 TO 23 |
|  bb3b   |   70  |  70.00 | 24 TO 27 |
|  bb4b   |   79  |  79.00 | 28 TO 32 |
|  bb5b   |   67  |  67.00 | 33 TO 36 |
|  bb6b   |  141  | 141.00 | 37 TO 42 |
|  bb7b   |  162  | 162.00 | 43 TO 50 |
+---------+-------+--------+----------+
LOADINGS AND TEST-VALUES BEFORE CONSOLIDATION - AXES 1 TO 5
CUT "b" OF THE TREE INTO 7 CLUSTERS
+----------------------+-------+---------+--------------------------------+------------------------------------+--------+
| IDEN - LABEL         | COUNT | ABS.WT. |          TEST-VALUES           |              LOADINGS              | DISTO. |
+----------------------+-------+---------+--------------------------------+------------------------------------+--------+
| bb1b - Cluster 1 / 7 |  106  | 106.00  |   3.7  -8.7  -0.3   2.3   6.9  |  0.18  -0.39  -0.01   0.09   0.27  |  0.83  |
| bb2b - Cluster 2 / 7 |  375  | 375.00  | -15.2  -7.2   2.3  -6.6   4.2  | -0.32  -0.14   0.04  -0.12   0.07  |  0.22  |
| bb3b - Cluster 3 / 7 |   70  |  70.00  | -10.6   7.0  -9.4   6.8   0.6  | -0.63   0.39  -0.49   0.35   0.03  |  1.75  |
| bb4b - Cluster 4 / 7 |   79  |  79.00  |  -6.3   2.4   3.6   8.2  -4.9  | -0.35   0.13   0.18   0.39  -0.23  |  1.57  |
| bb5b - Cluster 5 / 7 |   67  |  67.00  |   2.8  -2.1  -4.3  -2.8  -1.7  |  0.17  -0.12  -0.23  -0.15  -0.09  |  1.98  |
| bb6b - Cluster 6 / 7 |  141  | 141.00  |  12.2  -1.6  -1.9   2.9 -11.2  |  0.49  -0.06  -0.07   0.10  -0.38  |  0.75  |
| bb7b - Cluster 7 / 7 |  162  | 162.00  |  15.4  13.0   5.8  -4.8   3.5  |  0.58   0.46   0.19  -0.15   0.11  |  0.73  |
+----------------------+-------+---------+--------------------------------+------------------------------------+--------+

CLUSTERING CONSOLIDATION
AROUND THE CENTERS OF THE 7 CLUSTERS, ACHIEVED BY 10 ITERATIONS WITH MOVING CENTERS
BETWEEN-CLUSTERS INERTIA INCREASE
+-----------+---------------+----------------+---------+
| ITERATION | TOTAL INERTIA | INTER-CLUSTERS |  RATIO  |
|           |               |    INERTIA     |         |
+-----------+---------------+----------------+---------+
|     0     |    2.35008    |    0.77272     | 0.32881 |
|     1     |    2.35008    |    0.82435     | 0.35078 |
|     2     |    2.35008    |    0.82613     | 0.35153 |
|     3     |    2.35008    |    0.82630     | 0.35160 |
|     4     |    2.35008    |    0.82630     | 0.35160 |
+-----------+---------------+----------------+---------+
STOP AFTER ITERATION 4: THE RELATIVE INCREASE OF THE BETWEEN-CLUSTERS INERTIA
WITH RESPECT TO THE PREVIOUS ITERATION IS ONLY 0.000 %.



INERTIA DECOMPOSITION (COMPUTED ON 14 AXES)
+------------------+------------------+---------------+-------------------+------------------+
|                  |     INERTIAS     |    COUNTS     |      WEIGHTS      |    DISTANCES     |
| INERTIAS         |  BEFORE   AFTER  | BEFORE AFTER  |  BEFORE    AFTER  |  BEFORE   AFTER  |
+------------------+------------------+---------------+-------------------+------------------+
| BETWEEN CLUSTERS |  0.7727  0.8263  |               |                   |                  |
| WITHIN CLUSTER   |                  |               |                   |                  |
|   CLUSTER  1 / 7 |  0.1299  0.1731  |  106    128   |  106.00   128.00  |  0.8283  0.8028  |
|   CLUSTER  2 / 7 |  0.6116  0.5710  |  375    358   |  375.00   358.00  |  0.2191  0.2551  |
|   CLUSTER  3 / 7 |  0.0930  0.0945  |   70     72   |   70.00    72.00  |  1.7521  1.7687  |
|   CLUSTER  4 / 7 |  0.1233  0.1336  |   79     82   |   79.00    82.00  |  1.5661  1.5452  |
|   CLUSTER  5 / 7 |  0.1293  0.1293  |   67     67   |   67.00    67.00  |  1.9831  1.9831  |
|   CLUSTER  6 / 7 |  0.2054  0.2180  |  141    149   |  141.00   149.00  |  0.7483  0.7707  |
|   CLUSTER  7 / 7 |  0.2849  0.2043  |  162    144   |  162.00   144.00  |  0.7286  0.9060  |
| TOTAL INERTIA    |  2.3501  2.3501  |               |                   |                  |
+------------------+------------------+---------------+-------------------+------------------+
RATIO (INTER INERTIA / TOTAL INERTIA): BEFORE .. 0.3288   AFTER .. 0.3516
LOADINGS AND TEST-VALUES AFTER CONSOLIDATION - AXES 1 TO 5
CUT "b" OF THE TREE INTO 7 CLUSTERS
+----------------------+-------+---------+--------------------------------+------------------------------------+--------+
| IDEN - LABEL         | COUNT | ABS.WT. |          TEST-VALUES           |              LOADINGS              | DISTO. |
+----------------------+-------+---------+--------------------------------+------------------------------------+--------+
| bb1b - Cluster 1 / 7 |  128  | 128.00  |   3.8  -8.6  -1.8   4.1   8.0  |  0.16  -0.35  -0.07   0.15   0.28  |  0.80  |
| bb2b - Cluster 2 / 7 |  358  | 358.00  | -16.0  -6.4   1.9  -8.1   4.0  | -0.35  -0.13   0.04  -0.15   0.07  |  0.26  |
| bb3b - Cluster 3 / 7 |   72  |  72.00  | -10.7   8.3  -9.3   6.5  -0.1  | -0.63   0.46  -0.48   0.32   0.00  |  1.77  |
| bb4b - Cluster 4 / 7 |   82  |  82.00  |  -5.8   2.7   2.9   7.9  -5.6  | -0.32   0.14   0.14   0.37  -0.25  |  1.55  |
| bb5b - Cluster 5 / 7 |   67  |  67.00  |   2.8  -2.1  -4.3  -2.8  -1.7  |  0.17  -0.12  -0.23  -0.15  -0.09  |  1.98  |
| bb6b - Cluster 6 / 7 |  149  | 149.00  |  13.3  -1.2  -2.6   2.4 -11.6  |  0.52  -0.04  -0.09   0.08  -0.38  |  0.77  |
| bb7b - Cluster 7 / 7 |  144  | 144.00  |  15.1  11.5   9.4  -4.1   4.3  |  0.61   0.43   0.33  -0.14   0.14  |  0.91  |
+----------------------+-------+---------+--------------------------------+------------------------------------+--------+

CLUSTERS REPRESENTATIVES

CLUSTER 1/ 7   COUNT: 128
 RK  DISTANCE  IDENT.     RK  DISTANCE  IDENT.
  1  0.51034   0980        6  0.62658   0897
  2  0.56936   0091        7  0.63989   0704
  3  0.58376   0485        8  0.66465   0184
  4  0.58376   0619        9  0.66465   0232
  5  0.62658   0368       10  0.66465   0238

CLUSTER 2/ 7   COUNT: 358
 RK  DISTANCE  IDENT.     RK  DISTANCE  IDENT.
  1  0.66989   0459        6  0.86366   0780
  2  0.80053   0043        7  0.86366   0540
  3  0.80753   0322        8  0.86366   0460
  4  0.86366   0393        9  0.90535   0082
  5  0.86366   0450       10  0.91404   0593

CLUSTER 3/ 7   COUNT: 72
 RK  DISTANCE  IDENT.     RK  DISTANCE  IDENT.
  1  0.58799   0741        6  0.70722   0758
  2  0.60470   0940        7  0.78494   0766
  3  0.61735   0639        8  0.78494   0806
  4  0.61735   0788        9  0.82442   0742
  5  0.69764   0789       10  0.82442   0946

CLUSTER 4/ 7   COUNT: 82
 RK  DISTANCE  IDENT.     RK  DISTANCE  IDENT.
  1  0.74814   0156        6  1.12879   0148
  2  0.98976   0575        7  1.12879   0660
  3  1.01170   0730        8  1.12879   0715
  4  1.07622   0569        9  1.14287   0566
  5  1.12107   0721       10  1.14460   0360

CLUSTER 5/ 7   COUNT: 67
 RK  DISTANCE  IDENT.     RK  DISTANCE  IDENT.
  1  0.97554   0358        6  1.29654   0165
  2  1.10787   0130        7  1.30224   0828
  3  1.12353   0328        8  1.30330   0302
  4  1.27382   0288        9  1.30330   0326
  5  1.27888   0825       10  1.34956   0208

CLUSTER 6/ 7   COUNT: 149
 RK  DISTANCE  IDENT.     RK  DISTANCE  IDENT.
  1  0.52061   0062        6  0.70375   0286
  2  0.52061   0240        7  0.70767   0251
  3  0.55153   0419        8  0.75757   0497
  4  0.55153   0611        9  0.77031   0377
  5  0.66158   0991       10  0.78869   0242

CLUSTER 7/ 7   COUNT: 144
 RK  DISTANCE  IDENT.     RK  DISTANCE  IDENT.
  1  0.54714   0141        6  0.72304   0172
  2  0.58623   0007        7  0.72691   0004
  3  0.60549   0243        8  0.74024   0006
  4  0.63791   0200        9  0.74024   0352
  5  0.64338   0025       10  0.74024   0343
DISTANCE MATRIX BETWEEN CLUSTERS
     |      1      2      3      4      5      6      7
-----+------------------------------------------------------
   1 |  0.000
   2 |  1.134  0.000
   3 |  1.701  1.443  0.000
   4 |  1.628  1.402  1.856  0.000
   5 |  1.752  1.608  1.990  1.984  0.000
   6 |  1.327  1.183  1.746  1.637  1.703  0.000
   7 |  1.383  1.247  1.820  1.702  1.770  1.283  0.000



DESCRIPTION OF: CUT "b" OF THE TREE INTO 7 CLUSTERS

The characterizing elements are ranked in order of importance with the help of a
statistical criterion (the test-value) with which a probability is associated: the larger the
test-value, the lower the probability, and the more characteristic the element (a sketch of
this probability/test-value link is given below).
When describing the clusters by the categories of the categorical variables, an option
allows you to sort the characterizing categories by decreasing test-value or by percentage.
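A sketch of the probability-to-test-value conversion (assuming the usual one-sided
normal-quantile convention):

```python
# Sketch: a test-value is the standard normal quantile of 1 - probability.
from scipy.stats import norm

def test_value(prob):
    return norm.ppf(1 - prob)        # smaller probability -> larger test-value

print(round(test_value(0.001), 2))   # 3.09
print(round(test_value(0.05), 2))    # 1.64
```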
CLUSTERS CHARACTERISATION BY ACTIVE CATEGORIES
CHARACTERISATION BY CATEGORIES OF THE CLUSTERS OF CUT "b" OF THE TREE INTO 7 CLUSTERS

Cluster 1 / 7   (12.80% of the cases, weight 128)
T.VALUE  PROB.   GRP/CAT  CAT/GRP  GLOBAL   CHARACTERISTIC CATEGORY   OF VARIABLE                                    WEIGHT
 24.52   0.000    81.01   100.00   15.80    BEPC-BE-BEPS              Diploma in 5 categories                          158
  4.73   0.000    17.59    71.88   52.30    tenant                    Occupation status of housing in 4 categories     523
  3.10   0.001    18.31    40.63   28.40    25 to 34 yo               Age in 5 categories                              284
  3.08   0.001    17.61    46.09   33.50    Employee                  Job category                                     335
  2.85   0.002    20.67    24.22   15.00    Lower than 25 yo          Age in 5 categories                              150
 -2.04   0.021     8.73    15.63   22.90    Manager                   Job category                                     229
 -2.27   0.012     8.97    20.31   29.00    owner                     Occupation status of housing in 4 categories     290
 -2.33   0.010     2.08     0.78    4.80    Other                     Job category                                      48
 -2.72   0.003     3.61     2.34    8.30    Lower than 2.000          Urban area size (number of inhabitants)          83
 -3.01   0.001     5.92     7.81   16.90    65 yo and more            Age in 5 categories                              169
 -3.28   0.001     6.22    10.16   20.90    35 to 49 yo               Age in 5 categories                              209
 -3.81   0.000     0.00     0.00    6.70    free housing, other       Occupation status of housing in 4 categories      67
 -4.49   0.000     0.00     0.00    8.70    2.000 - 20.000            Urban area size (number of inhabitants)          87
 -6.27   0.000     0.00     0.00   15.00    University                Diploma in 5 categories                          150
 -7.06   0.000     0.00     0.00   18.20    Bac - Brevet sup.         Diploma in 5 categories                          182
 -7.22   0.000     0.00     0.00   18.90    No one                    Diploma in 5 categories                          189
-10.07   0.000     0.00     0.00   32.10    CEP                       Diploma in 5 categories                          321

Cluster 2 / 7   (35.80% of the cases, weight 358)
T.VALUE  PROB.   GRP/CAT  CAT/GRP  GLOBAL   CHARACTERISTIC CATEGORY   OF VARIABLE                                    WEIGHT
 14.73   0.000    68.54    61.45   32.10    CEP                       Diploma in 5 categories                          321
 12.34   0.000    67.68    49.72   26.30    Worker                    Job category                                     263
 11.58   0.000    73.02    38.55   18.90    No one                    Diploma in 5 categories                          189
  6.09   0.000    49.24    45.25   32.90    greater than 100.000      Urban area size (number of inhabitants)         329
  5.94   0.000    56.00    27.37   17.50    20.000 - 100.000          Urban area size (number of inhabitants)         175
  5.32   0.000    38.68    94.97   87.90    No                        Do you own some securities ?                    879
  4.33   0.000    50.89    24.02   16.90    65 yo and more            Age in 5 categories                             169
  4.14   0.000    41.87    61.17   52.30    tenant                    Occupation status of housing in 4 categories    523
  2.54   0.005    44.15    23.18   18.80    50 to 64 yo               Age in 5 categories                             188
 -2.58   0.005    30.06    27.37   32.60    Paris                     Urban area size (number of inhabitants)         326
 -3.64   0.000    22.67     9.50   15.00    Lower than 25 yo          Age in 5 categories                             150
 -3.88   0.000    26.41    20.95   28.40    25 to 34 yo               Age in 5 categories                             284
 -3.91   0.000    10.42     1.40    4.80    Other                     Job category                                     48
 -4.20   0.000    19.20     6.70   12.50    missing category          Job category                                    125
 -5.32   0.000    14.88     5.03   12.10    Yes                       Do you own some securities ?                    121
 -7.50   0.000     0.00     0.00    6.70    free housing, other       Occupation status of housing in 4 categories     67
 -8.47   0.000     0.00     0.00    8.30    Lower than 2.000          Urban area size (number of inhabitants)          83
 -8.69   0.000     0.00     0.00    8.70    2.000 - 20.000            Urban area size (number of inhabitants)          87
-11.07   0.000     7.42     4.75   22.90    Manager                   Job category                                    229
-11.86   0.000     0.00     0.00   15.00    University                Diploma in 5 categories                         150
-12.22   0.000     0.00     0.00   15.80    BEPC-BE-BEPS              Diploma in 5 categories                         158
-13.28   0.000     0.00     0.00   18.20    Bac - Brevet sup.         Diploma in 5 categories                         182

Cluster 3 / 7   (7.20% of the cases, weight 72)
T.VALUE  PROB.   GRP/CAT  CAT/GRP  GLOBAL   CHARACTERISTIC CATEGORY   OF VARIABLE                                    WEIGHT
 21.01   0.000    86.75   100.00    8.30    Lower than 2.000          Urban area size (number of inhabitants)          83
  9.61   0.000    20.34    81.94   29.00    owner                     Occupation status of housing in 4 categories    290
  8.27   0.000    50.00    33.33    4.80    Other                     Job category                                     48
  4.40   0.000    12.77    56.94   32.10    CEP                       Diploma in 5 categories                         321
  2.94   0.002    12.77    33.33   18.80    50 to 64 yo               Age in 5 categories                             188
  2.13   0.017     7.85    95.83   87.90    No                        Do you own some securities ?                    879
 -2.13   0.017     2.48     4.17   12.10    Yes                       Do you own some securities ?                    121
 -2.22   0.013     2.40     4.17   12.50    missing category          Job category                                    125
 -2.55   0.005     0.00     0.00    6.70    free housing, other       Occupation status of housing in 4 categories     67
 -2.81   0.003     3.06     9.72   22.90    Manager                   Job category                                    229
 -3.02   0.001     2.20     5.56   18.20    Bac - Brevet sup.         Diploma in 5 categories                         182
 -3.08   0.001     0.00     0.00    8.70    2.000 - 20.000            Urban area size (number of inhabitants)          87
 -3.10   0.001     3.04    11.11   26.30    Worker                    Job category                                    263
 -3.26   0.001     1.33     2.78   15.00    Lower than 25 yo          Age in 5 categories                             150
 -3.79   0.000     0.67     1.39   15.00    University                Diploma in 5 categories                         150
 -4.89   0.000     0.00     0.00   17.50    20.000 - 100.000          Urban area size (number of inhabitants)         175
 -7.33   0.000     0.00     0.00   32.60    Paris                     Urban area size (number of inhabitants)         326
 -7.38   0.000     0.00     0.00   32.90    greater than 100.000      Urban area size (number of inhabitants)         329
 -7.81   0.000     1.34     9.72   52.30    tenant                    Occupation status of housing in 4 categories    523

Cluster 4 / 7   (8.20% of the cases, weight 82)
T.VALUE  PROB.   GRP/CAT  CAT/GRP  GLOBAL   CHARACTERISTIC CATEGORY   OF VARIABLE                                    WEIGHT
 22.73   0.000    94.25   100.00    8.70    2.000 - 20.000            Urban area size (number of inhabitants)          87
  3.15   0.001    16.67    24.39   12.00    homeowner                 Occupation status of housing in 4 categories    120
  2.17   0.015    11.38    40.24   29.00    owner                     Occupation status of housing in 4 categories    290
  2.03   0.021    11.96    30.49   20.90    35 to 49 yo               Age in 5 categories                             209
 -1.98   0.024     4.00     7.32   15.00    Lower than 25 yo          Age in 5 categories                             150
 -2.80   0.003     0.00     0.00    6.70    free housing, other       Occupation status of housing in 4 categories     67
 -3.10   0.001     5.54    35.37   52.30    tenant                    Occupation status of housing in 4 categories    523
 -3.25   0.001     0.00     0.00    8.30    Lower than 2.000          Urban area size (number of inhabitants)          83
 -5.29   0.000     0.00     0.00   17.50    20.000 - 100.000          Urban area size (number of inhabitants)         175
 -7.89   0.000     0.00     0.00   32.60    Paris                     Urban area size (number of inhabitants)         326
 -7.94   0.000     0.00     0.00   32.90    greater than 100.000      Urban area size (number of inhabitants)         329

Cluster 5 / 7   (6.70% of the cases, weight 67)
T.VALUE  PROB.   GRP/CAT  CAT/GRP  GLOBAL   CHARACTERISTIC CATEGORY   OF VARIABLE                                    WEIGHT
 21.82   0.000   100.00   100.00    6.70    free housing, other       Occupation status of housing in 4 categories     67
 -3.65   0.000     0.00     0.00   12.00    homeowner                 Occupation status of housing in 4 categories    120
 -6.51   0.000     0.00     0.00   29.00    owner                     Occupation status of housing in 4 categories    290
 -9.91   0.000     0.00     0.00   52.30    tenant                    Occupation status of housing in 4 categories    523

Cluster 6 / 7   (14.90% of the cases, weight 149)
T.VALUE  PROB.   GRP/CAT  CAT/GRP  GLOBAL   CHARACTERISTIC CATEGORY   OF VARIABLE                                    WEIGHT
 25.06   0.000    80.77    98.66   18.20    Bac - Brevet sup.         Diploma in 5 categories                         182
  5.87   0.000    27.95    42.95   22.90    Manager                   Job category                                    229
  4.67   0.000    30.40    25.50   12.50    missing category          Job category                                    125
  4.23   0.000    27.33    27.52   15.00    Lower than 25 yo          Age in 5 categories                             150
  3.52   0.000    20.86    45.64   32.60    Paris                     Urban area size (number of inhabitants)         326
  2.27   0.012    22.50    18.12   12.00    homeowner                 Occupation status of housing in 4 categories    120
  2.17   0.015    19.01    36.24   28.40    25 to 34 yo               Age in 5 categories                             284
 -2.57   0.005    10.75    24.16   33.50    Employee                  Job category                                    335
 -3.00   0.001     7.98    10.07   18.80    50 to 64 yo               Age in 5 categories                             188
 -3.49   0.000     6.51     7.38   16.90    65 yo and more            Age in 5 categories                             169
 -4.20   0.000     1.20     0.67    8.30    Lower than 2.000          Urban area size (number of inhabitants)          83
 -4.21   0.000     0.00     0.00    6.70    free housing, other       Occupation status of housing in 4 categories     67
 -4.95   0.000     0.00     0.00    8.70    2.000 - 20.000            Urban area size (number of inhabitants)          87
 -6.87   0.000     0.00     0.00   15.00    University                Diploma in 5 categories                         150
 -7.09   0.000     0.00     0.00   15.80    BEPC-BE-BEPS              Diploma in 5 categories                         158
 -7.41   0.000     0.53     0.67   18.90    No one                    Diploma in 5 categories                         189
 -7.85   0.000     1.90     3.36   26.30    Worker                    Job category                                    263
-10.57   0.000     0.31     0.67   32.10    CEP                       Diploma in 5 categories                         321

Cluster 7 / 7   (14.40% of the cases, weight 144)
T.VALUE  PROB.   GRP/CAT  CAT/GRP  GLOBAL   CHARACTERISTIC CATEGORY   OF VARIABLE                                    WEIGHT
 24.37   0.000    88.67    92.36   15.00    University                Diploma in 5 categories                         150
 11.52   0.000    40.17    63.89   22.90    Manager                   Job category                                    229
  7.36   0.000    26.69    60.42   32.60    Paris                     Urban area size (number of inhabitants)         326
  5.76   0.000    33.88    28.47   12.10    Yes                       Do you own some securities ?                    121
  2.21   0.014    16.83    61.11   52.30    tenant                    Occupation status of housing in 4 categories    523
 -2.80   0.003     7.98    10.42   18.80    50 to 64 yo               Age in 5 categories                             188
 -4.12   0.000     0.00     0.00    6.70    free housing, other       Occupation status of housing in 4 categories     67
 -4.70   0.000     0.00     0.00    8.30    Lower than 2.000          Urban area size (number of inhabitants)          83
 -4.84   0.000     0.00     0.00    8.70    2.000 - 20.000            Urban area size (number of inhabitants)          87
 -5.40   0.000     6.27    14.58   33.50    Employee                  Job category                                    335
 -5.76   0.000    11.72    71.53   87.90    No                        Do you own some securities ?                    879
 -6.95   0.000     0.00     0.00   15.80    BEPC-BE-BEPS              Diploma in 5 categories                         158
 -7.25   0.000     0.53     0.69   18.90    No one                    Diploma in 5 categories                         189
 -7.57   0.000     0.00     0.00   18.20    Bac - Brevet sup.         Diploma in 5 categories                         182
 -7.65   0.000     3.12     6.94   32.10    CEP                       Diploma in 5 categories                         321
 -8.65   0.000     0.76     1.39   26.30    Worker                    Job category                                    263

THE GRAPH EDITOR



CLASS - MINER - CLUSTERS DESCRIPTION


This procedure lets you describe the partitions created by the PARTI procedure with the
variables that did not participate in the analysis.
We can thus select variables by themes and evaluate their characterizing power on the
constructed partitions (typologies). The parameter settings and the outputs are identical to
those of the DECLA procedure of the PARTI-DECLA icon.
Characteristic elements are ranked in order of importance with the help of a statistical
criterion (the test-value) with which a probability is associated: the higher the test-value
and the lower the probability, the more strongly the element characterizes the cluster.


ESCAL - STORING THE FACTORIAL AXES AND THE PARTITIONS


THE LINEAR MODEL AND ITS APPLICATIONS

REGRESSION AND ANALYSIS OF VARIANCE, GENERAL LINEAR MODEL

OBJECT
The general purpose of this procedure, called VAREG, is to learn more about the
relationship between several independent or predictor variables and a dependent
continuous variable.
VAREG allows you to fit least-squares adjustment models with a constant term. It
can be used for many different analyses including:

Simple regression
Multiple regression
Analysis of variance
Analysis of covariance

VAREG enables you to specify interactions (crossed effects) up to the 3rd order. Each
regression coefficient is associated with a nullity test, which is valid in the classical context
where the random term is assumed to follow a Laplace-Gauss (normal) distribution.
The REPEATED statement enables you to specify effects in the model that represent
repeated measurements on the same experimental unit for the same response.
The VAREG procedure automatically generates a rule file that allows you to create a new
data set (with the Deployment Archiving\Archiving\Predictions method) containing
the input dataset in addition to the predicted values and residuals.
The treatment of missing data is handled by the parameters. A rough analogue of such a
model is sketched below.
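As an illustration only (not SPAD itself), the same kind of model can be sketched with the
statsmodels formula API; the file name and column names below are assumptions:

```python
# Sketch: least-squares model with a constant, a continuous and a categorical
# exogenous variable, plus their interaction (a crossed effect).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey.csv")               # hypothetical input dataset

model = smf.ols("salary ~ age + C(diploma) + age:C(diploma)", data=df).fit()
print(model.summary())                       # coefficients, Student t, p-values, R2, F

fitted = df.assign(pred=model.fittedvalues,  # predicted values and residuals,
                   resid=model.resid)        # like the data set built by the rule file
```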

OUTPUTS
Summary statistics on the variables of the model are output: marginal distributions of the
categorical variables; mean, standard deviation, minimum and maximum of the
continuous variables. The method supplies the identification of the coefficients of the
model: coefficients of the continuous variables, of the categories of the factors,
and of any interactions. Subsequently it is possible to output the variance-covariance
matrix or the correlation matrix.


The procedure prints the coefficients, the estimates of their standard deviations, the
corresponding Student statistics, as well as the critical probability and the associated
test-value. Also shown are the sum of the squared deviations, the multiple correlation
coefficient, and the estimate of the residual variance. Finally, the test of simultaneous
nullity of all the coefficients (test of a constant endogenous "y") is provided.
In the case of an analysis of variance, you also get the sums of squared deviations
according to their source (residual, criteria or interaction) as well as Fisher statistics, the
critical probabilities and the associated test-values. In the case of repeated observations,
the repeatability variance is displayed, as well as the estimates obtained using this
variance.

DEFINE A MODEL
The interface allows you to specify the definition of one or more models. The functions of
the CTRL, SHIFT keys are standard.

1. In the Selection list choose the TYPE of the variable(s) you want to define

2. Then, in the Variables Available list, select one or more variables of that TYPE and
confirm your choice with the transfer button. A double-click on a variable also
confirms the choice.
To delete a variable or an interaction from the model under construction, select it in the
model list and confirm with the transfer button.
3. Save a model
Once you have specified at least one endogenous variable and one exogenous
variable, click on the "Validate" button to add the model to the Model list.
Delete a model
Select the model in the list and click on the "Delete" button.
Change a model
Select the model in the list and click on the "Modify" button.

PARAMETERS
The VAREG parameters allow you to handle missing data and to specify whether
measurements are repeated or not. With the printout parameters, you can specify the
desired outputs.


Missing data handling for continuous variables (LSUPR)
Possible values: Deleted case / Mean imputation
If LSUPR = Deleted case, the cases presenting missing data for one of the
variables of the model (endogenous or exogenous) are eliminated from the
analysis.
If LSUPR = Mean imputation, the missing exogenous data are replaced by the
mean of the corresponding variable.
Warning: if LSUPR = Mean imputation, the endogenous variable must not have any
missing data.

Missing data handling for categorical variables (LZERO)
Possible values: Re-coded / Deleted case
If LZERO = Re-coded, the missing values are treated as a normal category.
If LZERO = Deleted case, the cases with missing data are eliminated.

Treatment with repetitions (LREP)
Possible values: No (there are no repetitions) / Repetitions in sequence / Repetitions in disorder
This parameter concerns the treatment of experimental designs.
When there are repetitions, the variance of the observations may be estimated on the
repeated observations rather than on the whole set of observations. The number of
repetitions does not need to be the same everywhere.
Choose LREP = Repetitions in sequence if the repetitions are one under the other
in the lines of the data table.
Choose LREP = Repetitions in disorder if the repetitions are unordered.
Output Parameters
Summary statistics on the variables in the model (LSTAT)
Possible values: Yes / No
If LSTAT = Yes, one obtains the marginal distributions of the categorical variables of the
model, as well as various statistics concerning the continuous variables: mean,
standard deviation, minimum and maximum.

Printout of the covariances matrix (LMAT)
Possible values:
No (no output)
Variances, covariances (output the variance-covariance matrix)
Correlations (output the correlation matrix)

File for Excel application (LEXCE)
Possible values: Yes / No
If LEXCE = Yes, an ASCII delimited file is available on output, which can be directly
imported into the Excel application.

Variables labels (LABEL)
Possible values: short / long
If LABEL = short, 4 characters are used for the categorical variable labels and 20 for
the continuous variable labels.
If LABEL = long, 60 characters are used for both the categorical and the continuous
variable labels.


OPTIMAL REGRESSIONS RESEARCH


General principles
This procedure selects the N best adjustments for a regression. The selection criterion
can be the R², the adjusted R² or Mallows' Cp.
Let N be the number of best adjustments requested, and P the number of explicative
(exogenous) variables for the model. The procedure shows the N best adjustments for all
model sizes, from 1 to P-1 variables (the adjustment with all P variables is unique).
The procedure supplies the criterion value (R², adjusted R² or Cp), the Fisher F
associated with R², the critical probability associated with this F, and the corresponding
test-value.
The list of the variables of the model is then shown with the estimated coefficients, the
nullity tests, the critical probability and the associated test-value. Finally, a diagram
representing the evolution of the criterion as a function of the number of variables in the
models gives a quick summary of the selections.
For the R² criterion, all the printed selections are optimal. For the other two criteria, the
selections are not always optimal (the adjusted R² and Mallows' Cp vary in a
non-monotone way as a function of the number of variables). A non-optimal selection is
identified by the fact that the procedure does not show the coefficients of the variables (only the
names of the variables and the value of the criterion are shown). In this case the selected
adjustment, even if it is not optimal for the criterion, is nonetheless better than the
adjustments that were not computed. A brute-force version of this search is sketched below.
Reference:
The selection algorithm is a transcription of the "leaps and bounds" algorithm of Furnival
& Wilson (Technometrics, 1974, Vol. 16, pp. 499-511).

Data
This dataset corresponds to the perception that 100 companies have of their suppliers.
The criteria are the following:
Delivery time
Price index
Price flexibility
Perceived quality
Service quality
Commercial image
Product quality
Satisfaction
The main goal is to find the best model explaining Satisfaction by a subset of the other
items.

Id  Company Size      Delivery  Price  Price        Perceived  Service  Commercial  Product
                      delay     Index  Flexibility  Quality    Quality  Image       Quality
 1  < 50 employees      4.1      0.6      6.9          4.7       2.4       2.3         5.2
 2  >= 50 employees     1.8      3.0      6.3          6.6       2.5       4.0         8.4
 3  >= 50 employees     3.4      5.2      5.7          6.0       4.3       2.7         8.2
 4  >= 50 employees     2.7      1.0      7.1          5.9       1.8       2.3         7.8
 5  < 50 employees      6.0      0.9      9.6          7.8       3.4       4.6         4.5
 6  >= 50 employees     1.9      3.3      7.9          4.8       2.6       1.9         9.7
 7  < 50 employees      4.6      2.4      9.5          6.6       3.5       4.5         7.6
 8  >= 50 employees     1.3      4.2      6.2          5.1       2.8       2.2         6.9
 9  < 50 employees      5.5      1.6      9.4          4.7       3.5       3.0         7.6
10  >= 50 employees     4.0      3.5      6.5          6.0       3.7       3.2         8.7
11  < 50 employees      2.4      1.6      8.8          4.8       2.0       2.8         5.8
12  < 50 employees      3.9      2.2      9.1          4.6       3.0       2.5         8.3
13  >= 50 employees     2.8      1.4      8.1          3.8       2.1       1.4         6.6
14  < 50 employees      3.7      1.5      8.6          5.7       2.7       3.7         6.7
15  < 50 employees      4.7      1.3      9.9          6.7       3.0       2.6         6.8
16  < 50 employees      3.4      2.0      9.7          4.7       2.7       1.7         4.8
17  < 50 employees      3.2      4.1      5.7          5.1       3.6       2.9         6.2
18  < 50 employees      4.9      1.8      7.7          4.3       3.4       1.5         5.9
19  < 50 employees      5.3      1.4      9.7          6.1       3.3       3.9         6.8
20  < 50 employees      4.7      1.3      9.9          6.7       3.0       2.6         6.8
21  < 50 employees      3.3      0.9      8.6          4.0       2.1       1.8         6.3
22  < 50 employees      3.4      0.4      8.3          2.5       1.2       1.7         5.2
23  < 50 employees      3.0      4.0      9.1          7.1       3.5       3.4         8.4
24  >= 50 employees     2.4      1.5      6.7          4.8       1.9       2.5         7.2
25  < 50 employees      5.1      1.4      8.7          4.8       3.3       2.6         3.8
26  < 50 employees      4.6      2.1      7.9          5.8       3.4       2.8         4.7
27  >= 50 employees     2.4      1.5      6.6          4.8       1.9       2.5         7.2
28  < 50 employees      5.2      1.3      9.7          6.1       3.2       3.9         6.7
29  < 50 employees      3.5      2.8      9.9          3.5       3.1       1.7         5.4
30  >= 50 employees     4.1      3.7      5.9          5.5       3.9       3.0         8.4
31  >= 50 employees     3.0      3.2      6.0          5.3       3.1       3.0         8.0
32  < 50 employees      2.8      3.8      8.9          6.9       3.3       3.2         8.2
33  < 50 employees      5.2      2.0      9.3          5.9       3.7       2.4         4.6
34  >= 50 employees     3.4      3.7      6.4          5.7       3.5       3.4         8.4
35  >= 50 employees     2.4      1.0      7.7          3.4       1.7       1.1         6.2
36  >= 50 employees     1.8      3.3      7.5          4.5       2.5       2.4         7.6
37  >= 50 employees     3.6      4.0      5.8          5.8       3.7       2.5         9.3
38  < 50 employees      4.0      0.9      9.1          5.4       2.4       2.6         7.3
39  >= 50 employees     0.0      2.1      6.9          5.4       1.1       2.6         8.9
40  >= 50 employees     2.4      2.0      6.4          4.5       2.1       2.2         8.8
41  >= 50 employees     1.9      3.4      7.6          4.6       2.6       2.5         7.7
42  < 50 employees      5.9      0.9      9.6          7.8       3.4       4.6         4.5
43  < 50 employees      4.9      2.3      9.3          4.5       3.6       1.3         6.2
44  < 50 employees      5.0      1.3      8.6          4.7       3.1       2.5         3.7
45  >= 50 employees     2.0      2.6      6.5          3.7       2.4       1.7         8.5
46  < 50 employees      5.0      2.5      9.4          4.6       3.7       1.4         6.3
47  < 50 employees      3.1      1.9     10.0          4.5       2.6       3.2         3.8
48  >= 50 employees     3.4      3.9      5.6          5.6       3.6       2.3         9.1
49  < 50 employees      5.8      0.2      8.8          4.5       3.0       2.4         6.7
50  < 50 employees      5.4      2.1      8.0          3.0       3.8       1.4         5.2
51  < 50 employees      3.7      0.7      8.2          6.0       2.1       2.5         5.2
52  >= 50 employees     2.6      4.8      8.2          5.0       3.6       2.5         9.0
53  >= 50 employees     4.5      4.1      6.3          5.9       4.3       3.4         8.8
54  >= 50 employees     2.8      2.4      6.7          4.9       2.5       2.6         9.2
55  < 50 employees      3.8      0.8      8.7          2.9       1.6       2.1         5.6
56  < 50 employees      2.9      2.6      7.7          7.0       2.8       3.6         7.7
(cases 1 to 56 of the 100 cases)


Fuwil 3 - Excel sheet output

Missing data handling for exogenous variables: missing values are replaced by the general means.

Variable label        Mean     Number of missing values
Delivery Time          3.515   0
Price Index            2.364   0
Price Flexibility      7.894   0
Perceived Quality      5.248   0
Service Quality        2.916   0
Commercial Image       2.665   0
Product Quality        6.971   0
Usage Index           46.100   0

The R² criterion
Curve of R² according to the number of variables
The following graph displays the evolution of the R² criterion according to the number of
variables entered in the model. The higher this criterion, the better the model.
But since this criterion automatically increases as new variables enter the model, we
must evaluate the relative gain of adding each new variable. We will see below two criteria
that penalize the R² for each newly entered variable: the adjusted R² and the
Mallows C(p).
Looking at the graph below, we see that the R² increases significantly up to 3 variables.
The next variables are redundant and do not bring any additional information that could
significantly improve the model.
The R² can be interpreted as the part of the variance explained by the model. It takes its
values between 0 and 1.

[Figure: curve of R² according to the number of model variables (1 to 8); plotted values range from 0.45 to 0.80.]

1 var
This output presents the 3 best adjustments with one exogenous variable.

ADJUSTMENTS WITH 1 VARIABLE + CONSTANT   DDL(Student) = 98

Adjustment 1 (Full printout)
R**2 = 0.5051   Fisher = 100.0162   Probability = 0.0000   Test-Value = 8.283
Variable label     Coefficient   Student   Probability   Test-Value
Usage Index          0.0676       10.00       0.000         8.28

Adjustment 2 (Full printout)
R**2 = 0.4233   Fisher = 71.9390   Probability = 0.0000   Test-Value = 7.327
Variable label     Coefficient   Student   Probability   Test-Value
Delivery Time        0.4215        8.48       0.000         7.33

Adjustment 3 (Full printout)
R**2 = 0.3985   Fisher = 64.9139   Probability = 0.0000   Test-Value = 7.040
Variable label     Coefficient   Student   Probability   Test-Value
Service Quality      0.7189        8.06       0.000         7.04

The number of degrees of freedom is 98.
The first adjustment is the best one, with an R² of 0.5051, meaning that the variance
explained by the model represents 50.51% of the total variance.
The Fisher statistic corresponds to the global validation of the model. This statistic
follows a Fisher distribution with 1 and 98 degrees of freedom. Its value of 100.02
corresponds to a p-value lower than 1/10000 (0.0000). Thus, the model is acceptable. This
p-value is also expressed as a test-value: 8.283 here.
The Coefficient column presents the estimate of the coefficient of the variable Usage
Index; the model can be written: Satisfaction = constant + 0.0676 x Usage Index.
The Student column tests the nullity of the coefficient of the concerned variable: this
statistic follows a Student distribution with 98 degrees of freedom. Its value of 10
corresponds to a p-value lower than 1/10000 (0.0000). The coefficient is significantly
different from zero.
This probability is also expressed as a test-value. Since the model has only one explanatory
variable, the test-value of the coefficient is the same as that of the global model.


6 vars
ADJUSTMENTS WITH 6 VARIABLES + CONSTANT   DDL(Student) = 93

Adjustment 1 (Full printout)
R**2 = 0.8009   Fisher = 62.3410   Probability = 0.0000   Test-Value = 11.408
Variable label       Coefficient   Student   Probability   Test-Value
Delivery Time           0.3061       8.10       0.000         7.03
Price Index             0.2446       5.95       0.000         5.47
Price Flexibility       0.2912       7.99       0.000         6.95
Perceived Quality       0.4324       7.39       0.000         6.54
Commercial Image       -0.1978       2.35       0.021         2.31
Product Quality        -0.0470       1.49       0.139         1.48

Adjustment 2 (Full printout)
R**2 = 0.7993   Fisher = 61.7159   Probability = 0.0000   Test-Value = 11.376
Variable label       Coefficient   Student   Probability   Test-Value
Delivery Time           0.0777       1.49       0.140         1.47
Price Flexibility       0.2846       7.84       0.000         6.85
Perceived Quality       0.4210       7.13       0.000         6.35
Service Quality         0.4536       5.87       0.000         5.40
Commercial Image       -0.1926       2.28       0.025         2.24
Product Quality        -0.0417       1.33       0.188         1.32

Adjustment 3 (Full printout)
R**2 = 0.7973   Fisher = 60.9833   Probability = 0.0000   Test-Value = 11.338
Variable label       Coefficient   Student   Probability   Test-Value
Price Index            -0.0624       1.14       0.256         1.14
Price Flexibility       0.2891       7.83       0.000         6.84
Perceived Quality       0.4167       7.03       0.000         6.28
Service Quality         0.5884       7.93       0.000         6.91
Commercial Image       -0.1894       2.23       0.028         2.20
Product Quality        -0.0453       1.42       0.159         1.41

The three adjustments listed above have 6 exogenous variables.
For the first adjustment, we can see that the coefficient of the variable Product Quality is
not significantly different from zero at the usual 5% threshold.
Overall, the best adjustment is obtained with 6 exogenous variables, as confirmed by the
following graphs. But since one coefficient is not significantly different from zero, we may
prefer the model with 5 variables:

ADJUSTMENTS WITH 5 VARIABLES + CONSTANT   DDL(Student) = 94

Adjustment 1 (Full printout)
R**2 = 0.7961   Fisher = 73.4081   Probability = 0.0000   Test-Value = 11.506
Variable label       Coefficient   Student   Probability   Test-Value
Delivery Time           0.3247       9.05       0.000         7.65
Price Index             0.2291       5.73       0.000         5.29
Price Flexibility       0.2993       8.25       0.000         7.14
Perceived Quality       0.4303       7.31       0.000         6.49
Commercial Image       -0.2100       2.49       0.015         2.44

The adjusted R² criterion
Curve of the adjusted R² according to the number of explanatory variables
The adjusted R² criterion is based on the standard R², but it imposes a penalty for each
additional explanatory variable used to build the model. For this criterion to increase,
the contribution of a newly entered variable must be sufficient (if the variable is redundant
with the ones already included in the model, the criterion decreases).
The graph below shows that the best models are to be found among those with 5 or 6
explanatory variables.
[Figure: curve of the adjusted R² according to the number of model variables (1 to 8); plotted values range from 0.44 to 0.79.]

The Mallows Cp criterion
Curve of the Mallows Cp according to the number of explanatory variables
The lower this criterion, the better the adjustment. We get the same results as with the
previous criteria: the best models have 5 or 6 variables.

[Figure: curve of the Mallows Cp according to the number of model variables (1 to 8); plotted values range from 0.0 to 1.3.]

Formulas of the criteria R², adjusted R² and Mallows Cp

1. R²:
The coefficient of determination R² (which takes values in the range 0 to 1) is a measure
of the proportion of the total variation that is associated with the regression:

    R² = 1 - SSE / SST

SSE: Error Sum of Squares
SST: Total Sum of Squares

2. Adjusted R²:
The adjusted R² criterion is based on the standard R², but it imposes a penalty for each
additional explanatory variable that is used to build the model:

    adjusted R² = 1 - (n - 1)(1 - R²) / (n - p)

n: the number of observations,
p: the number of variables used in the model plus one.

3. Mallows Cp - C(p):
The Mallows C(p) is positively related to the error (SSE) and to the number of
explanatory variables in the model: a model with a lot of variables or with a high error
will be penalized by this criterion:

    C(p) = SSE / SST + 2p - n

References:
Furnival, G.M. and Wilson, R.W. (1974), Regression by Leaps and Bounds,
Technometrics, 16, 499-511.


LOGISTIC REGRESSION

Logistic regression is a model used to predict the probability of occurrence of an
event by fitting the data to a logistic curve. It makes use of several predictor variables that
may be either numerical or categorical.
Binomial (or binary) logistic regression is a form of regression which is used when the
dependent variable Y is a dichotomy and the independent variables X1, X2, ..., Xp are of any type.

LOGIT INTRODUCTION
Logistic regression aims to explain the probability of a binary event. This probability
cannot be explained by a traditional regression model using the least squares method.
Thus, we perform the so-called LOGIT transformation, which fits into the generalized
linear model framework and is estimated by the method of maximum likelihood.
If P is the probability that we are trying to explain, the ratio P/(1-P) is defined as the
ODDS, and the quantity that is finally explained is the logarithm of this ODDS.
We want to explain P(Y = 1 | X1, X2).
Thus: P(Y = 1 | X1, X2) + P(Y = 2 | X1, X2) = 1.
The logit of the probability P is the logarithm of the quotient P / (1 - P):

    Logit(P) = Log( P / (1 - P) )        (1)

A minimal numeric sketch of this transform is given below.
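```python
# Sketch: the logit transform and its inverse (plain Python, for illustration).
import numpy as np

def logit(p):
    return np.log(p / (1 - p))      # Logit(P) = Log(P / (1 - P))

def inv_logit(x):
    return 1 / (1 + np.exp(-x))     # maps any real value back into (0, 1)

print(logit(0.5))       # 0.0 : the logit is zero at P = 1/2
print(inv_logit(2.0))   # ~0.881
```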

[Figure: graphical representation of Logit(P) = Log(P / (1 - P)); the curve crosses zero at P = 1/2.]


LOGISTIC MODEL WITH BINARY EXPLANATORY VARIABLES

The model can be written:

Log( P / (1 - P) ) = β0 + β1 X1 + β2 X2     (2)

The logit of the probability is a linear function of the explanatory variables, but the
probability itself is a non-linear function. Indeed, according to (2):

P = exp(β0 + β1 X1 + β2 X2) / (1 + exp(β0 + β1 X1 + β2 X2))

The model (2) is an additive model for binary categorical exogenous variables (coded 0 or
1). Models with categorical exogenous variables with more than 2 categories and with
crossed effects are presented further on.
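As a minimal sketch of relation (2) and its inverse (the coefficient values b0, b1, b2 below are purely illustrative, not SPAD output):

```python
import numpy as np

def logit(p):
    """Logit transform: logarithm of the odds p / (1 - p)."""
    return np.log(p / (1 - p))

def inv_logit(score):
    """Probability recovered from the linear score b0 + b1*x1 + b2*x2."""
    return np.exp(score) / (1 + np.exp(score))

b0, b1, b2 = -1.0, 0.8, 1.5             # illustrative coefficients
x1, x2 = 1, 0                           # a binary profile
p = inv_logit(b0 + b1 * x1 + b2 * x2)   # P(Y = 1 | x1, x2)
```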

LOGISTIC MODEL WITH CATEGORICAL EXOGENOUS VARIABLES WITH MORE THAN 2 CATEGORIES

A categorical variable with no hierarchy in its categories needs to be recoded, before its
introduction in the model, into several binary variables (0/1), well known under the name
of design variables.
We introduce as many design variables as there are categories.
But the following problem appears: the k design variables are not independent, because
their sum equals 1 for every individual.
A simple solution is to eliminate one of the design variables. The category not introduced
in the model has a zero coefficient by convention. We can consider that it represents the
reference situation.
Mathematically, the choice of the reference category has no importance. We can for
example choose as reference the modal category (the category with the largest count).
Consider Y as the dependent variable with 2 categories, 1 and 2. Consider Z as a
categorical variable with 4 categories corresponding to the race of the individual.
Z = 1: White
Z = 2: Black
Z = 3: Hispanic
Z = 4: Others
If we choose the White category as reference, the D matrix is the following:
The three columns of D (D2, D3, D4) correspond to the coding of Z into design variables
that will be introduced in the model.


Table 1: D Matrix construction
RACE (categories)   D2   D3   D4
White (1)            0    0    0
Black (2)            1    0    0
Hispanic (3)         0    1    0
Others (4)           0    0    1

The logistic model is then written this way:

        [ P(Y=1 | Z=1) ]     [1]        [0 0 0] [β2]
  Logit [ P(Y=1 | Z=2) ]  =  [1] β0  +  [1 0 0] [β3]
        [ P(Y=1 | Z=3) ]     [1]        [0 1 0] [β4]
        [ P(Y=1 | Z=4) ]     [1]        [0 0 1]

Thus, the explanatory variable Z with k categories is transformed into (k - 1) design
variables, noted du. If the first category is the reference, the logit is written:

Logit P(Y = 1 | Z) = β0 + Σ (u = 2 to k) βu du

For example, we obtain:

Logit P(Y = 1 | Z) = β0 + β2 d2 + β3 d3 + β4 d4

with du = 1 if Z = u, du = 0 otherwise.
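In practice this recoding is mechanical. A minimal sketch with pandas (the sample values and the choice of White as the reference follow the example above; `get_dummies` produces one indicator per category and we drop the reference column):

```python
import pandas as pd

# Hypothetical sample of the Z (RACE) variable
z = pd.Series(["White", "Black", "Hispanic", "Others", "Black"])

# k categories -> k indicators, then drop the reference to keep k - 1
d = pd.get_dummies(z, prefix="D", dtype=int).drop(columns="D_White")
print(d)   # columns D_Black, D_Hispanic, D_Others
```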


LOGISTIC REGRESSION WITH SPAD

Iterations number:
Specifies the maximum number of iterations to perform. By default, Iterations
number = 25. If convergence is not attained in n iterations, the displayed output created
by the procedure contains results based on the last maximum likelihood iteration.
Alpha threshold for the tests (in %):
Sets the level of significance α for the (100 - α)% confidence intervals for regression
parameters or odds ratios. The value must be between 0 and 100. By default, α equals
5%.


Parameterization method for categorical variables


Consider a model with one categorical variable A with four categories, 1, 2, 5, and 7.
Comparison to mean
Three columns are created to indicate group membership of the nonreference
categories. For the reference category, all three design variables have a value of -1. For
instance, if the reference category is 7 (REF='7'), the design matrix columns for A are as
follows.

Comparison to mean Coding
      Design Matrix
A     A1   A2   A5
1      1    0    0
2      0    1    0
5      0    0    1
7     -1   -1   -1

Parameter estimates of a categorical variable's main effects using the Comparison to
mean coding scheme estimate the difference in the effect of each nonreference
category compared to the average effect over all 4 categories.

GLM
Four columns are created to indicate group membership. The design matrix columns
for A are as follows.

GLM Coding
      Design Matrix
A     A1   A2   A5   A7
1      1    0    0    0
2      0    1    0    0
5      0    0    1    0
7      0    0    0    1

As in ANOVA, the last category's coefficient is fixed to 0. Parameter estimates of a
categorical variable's main effects using the GLM coding scheme estimate the difference
in the effects of each category compared to the last category.
Comparison to a reference
Three columns are created to indicate group membership of the nonreference
categories. For the reference category, all three design variables have a value of 0. For
instance, if the reference level is 7 (REF='7'), the design matrix columns for A are as
follows.

Comparison to a Reference Coding
      Design Matrix
A     A1   A2   A5
1      1    0    0
2      0    1    0
5      0    0    1
7      0    0    0

Parameter estimates of a categorical variable's main effects using the Comparison to a
reference coding scheme estimate the difference in the effect of each nonreference
category compared to the effect of the reference category.
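The three schemes can be generated mechanically from plain indicators. A minimal pandas sketch (the construction of the three matrices is ours, shown for the four categories above):

```python
import pandas as pd

cats = [1, 2, 5, 7]
a = pd.Categorical([1, 2, 5, 7, 2], categories=cats)

# GLM coding: one indicator per category (last coefficient fixed to 0 at fit time)
glm = pd.get_dummies(a, prefix="A", dtype=int)       # columns A_1, A_2, A_5, A_7

# Comparison to a reference (REF='7'): drop the reference indicator
reference = glm.drop(columns="A_7")                  # category 7 -> 0 0 0

# Comparison to mean: the reference row becomes -1 on every design variable
mean = reference.sub(glm["A_7"], axis=0)             # category 7 -> -1 -1 -1
```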

Variable selections :
The selection options are available only if the model contains simple factors (no
interaction).
No selection
The model is estimated with all the input variables; this is the default option.
Forward
The procedure first estimates parameters for factors forced into the model. These
factors are the intercepts and the first n explanatory factors in the model statement,
where n is the number specified by the Number of variables in initial model (n is
zero by default). Next, the procedure computes the score chi-square statistic for each
factor not in the model and examines the largest of these statistics. If it is significant at
the Threshold (%) for the variables entry in model level, the corresponding factor is
added to the model. Once a factor is entered in the model, it is never removed from the
model. The process is repeated until none of the remaining effects meet the specified
level for entry or until the Number of variables in final model value is reached.
Backward
Parameters for the complete model as specified in the model statement are estimated
unless the Number of variables in initial model option is specified. In that case, only
the parameters for the intercepts and the first n explanatory factors in the model
statement are estimated, where n is the Number of variables in initial model. Results
of the Wald test for individual parameters are examined. The least significant factor
that does not meet the Threshold (%) for a variable to stay in the model is removed.
Once a factor is removed from the model, it remains excluded. The process is repeated
until no other factor in the model meets the specified level for removal or until the
Number of variables in final model value is reached. Backward selection is often less
successful than forward or stepwise selection, because the full model fit in the first step
is the model most likely to result in a complete or quasi-complete separation of
response values.
Stepwise
This option is similar to the FORWARD option except that factors already in the model
do not necessarily remain. Factors are entered into and removed from the model in
such a way that each forward selection step may be followed by one or more backward
elimination steps. The stepwise selection process terminates if no further factor can be
added to the model or if the factor just entered into the model is the only factor
removed in the subsequent backward elimination.
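As an illustration of the forward step described above (a sketch only: the helper `score_chi2`, which would return the score chi-square statistic and its p-value for a candidate factor given the current model, is assumed to exist):

```python
def forward_selection(candidates, entry_threshold, max_vars, score_chi2):
    """Sketch of the forward procedure: enter the most significant
    factor while its score test stays below the entry threshold."""
    model = []                       # factors forced into the model would start here
    while len(model) < max_vars:
        remaining = [f for f in candidates if f not in model]
        if not remaining:
            break
        best = min(remaining, key=lambda f: score_chi2(model, f)[1])
        if score_chi2(model, best)[1] > entry_threshold:
            break                    # no remaining factor meets the entry level
        model.append(best)           # once entered, a factor is never removed
    return model
```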

EXAMPLE BASED ON THE CREDIT.SBA DATASET

Response variable:
1 . Type of client             2 CATEGORIES

Categorical explanatory variables:
2 . Age of client              4 CATEGORIES
3 . Family Situation           4 CATEGORIES
4 . Seniority                  5 CATEGORIES
5 . Salary domiciliation       2 CATEGORIES
6 . Size of savings            4 CATEGORIES
7 . Profession                 3 CATEGORIES
8 . Average outstanding        3 CATEGORIES
9 . Average transactions       4 CATEGORIES
10 . Number of withdrawals     3 CATEGORIES
11 . Overdraft                 2 CATEGORIES
12 . Checkbook                 2 CATEGORIES


LOGISTIC REGRESSION
MODEL PRESENTATION

MODEL DEFINITION
================
RESPONSE VARIABLE ............... : Type of client
NUMBER OF RESPONSE LEVELS ....... : 2
NUMBER OF OBSERVATIONS .......... : 468
LINK FUNCTION ................... : BINARY LOGIT
OPTIMIZATION TECHNIQUE .......... : FISHER'S SCORING

RESPONSE PROFILE
================
VARIABLE RESPONSE : Type of client
==========================
ORDER   RESPONSE   FREQUENCY
--------------------------
  1     Good          237
  2     Bad           231
==========================
PROBABILITY MODELED IS: Type of client = Good

DESCRIPTIVE STATISTICS FOR EXPLANATORY VARIABLES
===============================================

FREQUENCY DISTRIBUTION OF CATEGORICAL VARIABLES
========================================================================
                                              Type of client
VARIABLE                VALUE                  Good   Bad   TOTAL
------------------------------------------------------------------------
Seniority               1 year or less           66   133     199
                        From 1 to 4 years        19    28      47
                        From 4 to 6 years        42    27      69
                        From 6 to 12 years       44    22      66
                        Over 12 years            66    21      87
------------------------------------------------------------------------
Salary domiciliation    Sal. domiciliated       204   112     316
                        Sal. not domicil.        33   119     152
------------------------------------------------------------------------
Size of savings         No saving               169   201     370
                        Less than 10 KF          34    24      58
                        From 10 to 100 KF        26     6      32
                        More than 100 KF          8     0       8
     THIS VARIABLE IS PARTIALLY NESTED IN THE RESPONSE VARIABLE!
------------------------------------------------------------------------
Profession              executive                51    26      77
                        employee                127   110     237
                        other                    59    95     154
------------------------------------------------------------------------
Age of client           Less than 23 years       31    57      88
                        From 23 to 40 years      71    79     150
                        From 40 to 50 years      68    54     122
                        Over 50 years            67    41     108
------------------------------------------------------------------------
Family Situation        Single                   80    90     170
                        Married                 129    92     221
                        Divorced                 24    37      61
                        Widow                     4    12      16
------------------------------------------------------------------------
Average outstanding     Less than 2 KF           19    79      98
                        From 2 to 5 KF          168   140     308
                        More than 5 KF           50    12      62
------------------------------------------------------------------------
Average transactions    Less than 10 KF          44   110     154
                        From 10 to 30 KF         32    39      71
                        From 30 to 50 KF         82    47     129
                        More than 50 KF          79    35     114
------------------------------------------------------------------------
Number of withdrawals   Less than 40            113    58     171
                        From 40 to 100           87    74     161
                        More than 100            37    99     136
------------------------------------------------------------------------
Overdraft               Authorized               83   119     202
                        Forbidden               154   112     266
------------------------------------------------------------------------
Checkbook               Authorized              231   184     415
                        Forbidden                 6    47      53
========================================================================
NB: TO ALLOW CALCULATIONS, ONE CASE WITH THE OPPOSITE RESPONSE WAS
ASSIGNED TO EACH LEVEL BECAUSE OF THE PARTIAL NESTING!



RESULTS ABOUT THE MODEL
FITTING OF MODEL
CONVERGENCE CRITERION (.1E-07) SATISFIED
================================================
                      INTERCEPT    INTERCEPT AND
                      ONLY         COVARIATES
================================================
AKAIKE CRITERION      650.752      460.104
SCHWARZ CRITERION     654.900      567.964
-2 LOG (L)            648.752      408.104
================================================

TESTING GLOBAL NULL HYPOTHESIS : BETA = 0
======================================================
                    CHI-SQUARE    DF    PROB > KHI2
------------------------------------------------------
LIKELIHOOD RATIO      240.6475    25     < 0.0001
WALD                  119.1086    25     < 0.0001
======================================================

TYPE III ANALYSIS OF EFFECTS
=======================================================
EFFECT                   DF   WALD CHI-SQU   PROB > CHISQ
-------------------------------------------------------
Seniority                 4      23.2572        0.0001
Salary domiciliation      1      25.9650      < 0.0001
Size of savings           3       0.6047        0.8953
Profession                2       2.3555        0.3080
Age of client             3       8.0984        0.0440
Family Situation          3      12.6296        0.0055
Average outstanding       2       6.4046        0.0407
Average transactions      3       8.0692        0.0446
Number of withdrawals     2      21.1787      < 0.0001
Overdraft                 1       0.2441        0.6213
Checkbook                 1      15.6171      < 0.0001
=======================================================

ANALYSIS OF MAXIMUM LIKELIHOOD ESTIMATES
==================================================================================================
PARAMETER                  DF   ESTIMATE   STAND. ERROR   WALD CHI-SQU.   PROB > CHI2   EXP(ESTIM.)
--------------------------------------------------------------------------------------------------
Intercept                   1   -1.3248       0.5152          6.6123        0.0101        0.2659
Seniority 1                 1   -1.0047       0.2304         19.0143      < 0.0001        0.3662
          2                 1   -0.1850       0.3369          0.3016        0.5829        0.8311
          3                 1    0.7539       0.3165          5.6730        0.0172        2.1252
          4                 1    0.0304       0.3123          0.0094        0.9226        1.0308
Salary domiciliation 1      1    0.7396       0.1451         25.9650      < 0.0001        2.0950
Size of savings 1           1    0.0430       0.5466          0.0062        0.9374        1.0439
                2           1    0.2895       0.4440          0.4250        0.5145        1.3357
                3           1    0.0220       0.5631          0.0015        0.9688        1.0223
Profession 1                1    0.3516       0.2681          1.7197        0.1897        1.4213
           2                1   -0.0442       0.1853          0.0570        0.8113        0.9567
Age of client 1             1   -0.7262       0.2822          6.6230        0.0101        0.4838
              2             1   -0.0130       0.2101          0.0039        0.9505        0.9870
              3             1    0.4832       0.2242          4.6423        0.0312        1.6212
Family Situation 1          1    0.9222       0.2983          9.5593        0.0020        2.5147
                 2          1    0.2492       0.2639          0.8918        0.3450        1.2830
                 3          1   -0.6348       0.3555          3.1889        0.0741        0.5300
Average outstanding 1       1   -0.8553       0.3446          6.1612        0.0131        0.4252
                    2       1    0.0486       0.2946          0.0272        0.8690        1.0498
Average transactions 1      1   -0.5518       0.2245          6.0422        0.0140        0.5759
                     2      1   -0.1342       0.2564          0.2741        0.6006        0.8744
                     3      1    0.1469       0.2183          0.4527        0.5010        1.1582
Number of withdrawals 1     1    0.9794       0.2213         19.5817      < 0.0001        2.6629
                      2     1    0.0606       0.1804          0.1127        0.7371        1.0624
Overdraft 1                 1   -0.0660       0.1336          0.2441        0.6213        0.9361
Checkbook 1                 1    1.0448       0.2644         15.6171      < 0.0001        2.8427
==================================================================================================

ODDS RATIO ESTIMATES
=========================================================================
EFFECT                           ESTIMATE     CONFIDENCE LIMITS *
-------------------------------------------------------------------------
Seniority              1 VS 5       0.244       0.109      0.548
                       2 VS 5       0.554       0.200      1.538
                       3 VS 5       1.417       0.535      3.755
                       4 VS 5       0.687       0.263      1.798
Salary domiciliation   1 VS 2       4.389       2.485      7.752
Size of savings        1 VS 4       1.488       0.101     22.004
                       2 VS 4       1.904       0.150     24.208
                       3 VS 4       1.457       0.126     16.898
Profession             1 VS 3       1.933       0.816      4.577
                       2 VS 3       1.301       0.745      2.271
Age of client          1 VS 4       0.374       0.146      0.962
                       2 VS 4       0.764       0.350      1.668
                       3 VS 4       1.255       0.585      2.690
Family Situation       1 VS 4       4.300       0.851     21.734
                       2 VS 4       2.194       0.455     10.579
                       3 VS 4       0.906       0.166      4.960
Average outstanding    1 VS 3       0.190       0.041      0.882
                       2 VS 3       0.469       0.114      1.922
Average transactions   1 VS 4       0.336       0.154      0.732
                       2 VS 4       0.510       0.219      1.188
                       3 VS 4       0.676       0.325      1.404
Number of withdrawals  1 VS 3       7.534       3.164     17.939
                       2 VS 3       3.006       1.419      6.366
Overdraft              1 VS 2       0.876       0.519      1.479
Checkbook              1 VS 2       8.081       2.867     22.779
=========================================================================
* 95% WALD CONFIDENCE LIMITS
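These estimates are consistent with the comparison-to-mean coding of the parameters above: the odds ratio of a contrast u VS v is exp(βu - βv), where the coefficient of the reference category is minus the sum of the displayed coefficients of the same effect. For instance, for Checkbook (2 categories), β2 = -β1 = -1.0448, so the odds ratio 1 VS 2 is exp(1.0448 - (-1.0448)) = exp(2.0896) ≈ 8.081, as printed in the table.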

CONFUSION MATRIX

FREQUENCIES
---------------------------------------------
              | ESTIM   Good      Bad |   TOTAL
--------------+------------------------------
OBSERV Good   |          191       45 |     236
       Bad    |           38      194 |     232
---------------------------------------------
TOTAL         |          229      239 |     468

ROW PERCENTAGES
---------------------------------------------
              | ESTIM   Good      Bad |   TOTAL
--------------+------------------------------
OBSERV Good   |       80.932   19.068 | 100.000
       Bad    |       16.379   83.621 | 100.000
---------------------------------------------
TOTAL         |       48.932   51.068 | 100.000

COLUMN PERCENTAGES
---------------------------------------------
              | ESTIM   Good      Bad |   TOTAL
--------------+------------------------------
OBSERV Good   |       83.406   18.828 |  50.427
       Bad    |       16.594   81.172 |  49.573
---------------------------------------------
TOTAL         |      100.000  100.000 | 100.000

CLASSIFICATION
---------------------------------------------
              | CLASS.  WELL      BAD |   TOTAL
--------------+------------------------------
OBSERV Good   |       80.932   19.068 | 100.000
       Bad    |       83.621   16.379 | 100.000
---------------------------------------------
TOTAL         |       82.265   17.735 | 100.000


THE DISCRIMINANT AND ITS METHODS

FUWILD - OPTIMAL DISCRIMINANT ANALYSIS


Purpose
This method is the branch and bound algorithm of Furnival and Wilson (1974).
The FUWILD procedure selects the N "best" adjustments for the linear discriminant
analysis. The selection criteria could be the R2, the adjusted R2 or the Cp of Mallows.
If N is the number of the best adjustments required and P is the number of explanatory
variables of the model, the procedure calculates the N best adjustments for all sizes of
models from 1 to P-1 variables (the adjustment with the P variables is unique).
The procedure supplies the value of the criterion (R2, R2 adjusted or Cp), Fisher's F
associated with R2, the critical probability associated with this F, and the corresponding
test value.
The list of the variables of the model is then presented with the estimated coefficients, the
null tests, the critical probability and the associated test value. Finally, a diagram
representing the evolution of the criterion as a function of the number of the variables in
the models supplies a quick summary of the selections.

Dataset
The dataset is extracted from a survey where 100 respondents judge their suppliers. The
criteria are :
Delivery time
Prices level
Prices flexibility
Image
Services
Commercial image
Product quality
For each supplier, we know the size of the company in two classes: fewer than 50
employees, or 50 or more employees.
The goal of the analysis is to study the differences between these two classes.


ID   Delivery   Prices   Prices        Image   Services   Commercial   Product   Supplier's
     Time       Level    Flexibility                      Image        Quality   Company Size
1      4,1        0,6       6,9         4,7      2,4         2,3         5,2     < 50 employees
2      1,8        3         6,3         6,6      2,5         4           8,4     >= 50 employees
3      3,4        5,2       5,7         6        4,3         2,7         8,2     >= 50 employees
4      2,7        1         7,1         5,9      1,8         2,3         7,8     >= 50 employees
5      6          0,9       9,6         7,8      3,4         4,6         4,5     < 50 employees
6      1,9        3,3       7,9         4,8      2,6         1,9         9,7     >= 50 employees
7      4,6        2,4       9,5         6,6      3,5         4,5         7,6     < 50 employees
8      1,3        4,2       6,2         5,1      2,8         2,2         6,9     >= 50 employees
9      5,5        1,6       9,4         4,7      3,5         3           7,6     < 50 employees
10     4          3,5       6,5         6        3,7         3,2         8,7     >= 50 employees
11     2,4        1,6       8,8         4,8      2           2,8         5,8     < 50 employees
12     3,9        2,2       9,1         4,6      3           2,5         8,3     < 50 employees
13     2,8        1,4       8,1         3,8      2,1         1,4         6,6     >= 50 employees
14     3,7        1,5       8,6         5,7      2,7         3,7         6,7     < 50 employees
15     4,7        1,3       9,9         6,7      3           2,6         6,8     < 50 employees
16     3,4        2         9,7         4,7      2,7         1,7         4,8     < 50 employees
17     3,2        4,1       5,7         5,1      3,6         2,9         6,2     < 50 employees
18     4,9        1,8       7,7         4,3      3,4         1,5         5,9     < 50 employees
19     5,3        1,4       9,7         6,1      3,3         3,9         6,8     < 50 employees
20     4,7        1,3       9,9         6,7      3           2,6         6,8     < 50 employees
21     3,3        0,9       8,6         4        2,1         1,8         6,3     < 50 employees
22     3,4        0,4       8,3         2,5      1,2         1,7         5,2     < 50 employees
23     3          4         9,1         7,1      3,5         3,4         8,4     < 50 employees
24     2,4        1,5       6,7         4,8      1,9         2,5         7,2     >= 50 employees
25     5,1        1,4       8,7         4,8      3,3         2,6         3,8     < 50 employees
26     4,6        2,1       7,9         5,8      3,4         2,8         4,7     < 50 employees
27     2,4        1,5       6,6         4,8      1,9         2,5         7,2     >= 50 employees
28     5,2        1,3       9,7         6,1      3,2         3,9         6,7     < 50 employees
29     3,5        2,8       9,9         3,5      3,1         1,7         5,4     < 50 employees
30     4,1        3,7       5,9         5,5      3,9         3           8,4     >= 50 employees
31     3          3,2       6           5,3      3,1         3           8       >= 50 employees
32     2,8        3,8       8,9         6,9      3,3         3,2         8,2     < 50 employees
33     5,2        2         9,3         5,9      3,7         2,4         4,6     < 50 employees
34     3,4        3,7       6,4         5,7      3,5         3,4         8,4     >= 50 employees
35     2,4        1         7,7         3,4      1,7         1,1         6,2     >= 50 employees
36     1,8        3,3       7,5         4,5      2,5         2,4         7,6     >= 50 employees
37     3,6        4         5,8         5,8      3,7         2,5         9,3     >= 50 employees
38     4          0,9       9,1         5,4      2,4         2,6         7,3     < 50 employees
39     0          2,1       6,9         5,4      1,1         2,6         8,9     >= 50 employees
40     2,4        2         6,4         4,5      2,1         2,2         8,8     >= 50 employees
41     1,9        3,4       7,6         4,6      2,6         2,5         7,7     >= 50 employees
42     5,9        0,9       9,6         7,8      3,4         4,6         4,5     < 50 employees
43     4,9        2,3       9,3         4,5      3,6         1,3         6,2     < 50 employees
44     5          1,3       8,6         4,7      3,1         2,5         3,7     < 50 employees
45     2          2,6       6,5         3,7      2,4         1,7         8,5     >= 50 employees
46     5          2,5       9,4         4,6      3,7         1,4         6,3     < 50 employees
47     3,1        1,9      10           4,5      2,6         3,2         3,8     < 50 employees
48     3,4        3,9       5,6         5,6      3,6         2,3         9,1     >= 50 employees
49     5,8        0,2       8,8         4,5      3           2,4         6,7     < 50 employees
50     5,4        2,1       8           3        3,8         1,4         5,2     < 50 employees


Fuwil 4
The Fuwil 4 Excel sheet gives the main statistics of each class with regard to the
explanatory variables.
The Within-group mean column displays the means of each explanatory variable for
groups 1 and 2 respectively. By default, group 1 is the first category (in the list) of the
endogenous variable. In this example, group 1 concerns the small suppliers (< 50
employees) and group 2 the bigger suppliers (50 or more employees).
The General mean column displays the mean of each variable observed on the whole set.
Missing data handling for exogenous variables: missing values are replaced by within-group means.

Group   Variable label        Within-group mean   General mean   Number of missing values
1       Delivery Time              4,192              3,515                0
1       Prices Level               1,948              2,364                0
1       Prices Flexibility         8,622              7,894                0
1       Image                      5,213              5,248                0
1       Services                   3,050              2,916                0
1       Commercial Image           2,692              2,665                0
1       Product Quality            6,090              6,971                0
2       Delivery Time              2,500              3,515                0
2       Prices Level               2,988              2,364                0
2       Prices Flexibility         6,803              7,894                0
2       Image                      5,300              5,248                0
2       Services                   2,715              2,916                0
2       Commercial Image           2,625              2,665                0
2       Product Quality            8,293              6,971                0

This table is useful to detect the variables with the largest average differences between
each class and the overall sample.
For example, class number 2 (suppliers with 50 or more employees) obtains an average
quality score of 8.293, while class number 1 obtains a score of 6.090.
The Image variable does not differentiate small suppliers from bigger ones.
With the DEMOD procedure (Descriptive statistics), we would get these results:
[DEMOD characterization output]

The R² Criterion
Curve of R² according to the number of explanatory variables
This graph displays the evolution of the R² criterion according to the number of
explanatory variables included in the model. The higher the R², the better the adjustment.
The R² increases automatically with the number of explanatory variables.
Therefore, it is recommended to find a compromise between the best R² and the smallest
model in terms of explanatory variables.
Other criteria are available in the parameters tab, such as the adjusted R² and the Mallows
Cp.
The graph below shows that the R² increases until the entry of the 4th explanatory variable;
adding further variables does not increase the R² or the adjustment quality:
these variables are redundant.
The R² can be interpreted as the part of the variance explained by the linear discriminant
function. It goes from 0 to 1.

[Figure: curve of R² according to the number of the model's variables; R² rises from about 0.43 to about 0.67 and levels off after the 4th variable]
The Excel sheets 1 var to 7 vars display the 3 best adjustments with regard to the R² for
models with 1 to 7 explanatory variables.


1 var
This table lists the 3 best adjustments (R²) with one single explanatory variable.

Adjustments with 1 variable + constant, DF(Student) = 98

Adjustment 1 (Full printout)
R**2 = 0.4680   Fisher = 86.2000   Probability = 0.0000   Test-Value = 7.845
Variable label        Coefficient   Student   Probability   Test-Value
Product Quality         -0,4337       9,28       0,000         7,85

Adjustment 2 (Full printout)
R**2 = 0.4173   Fisher = 70.1912   Probability = 0.0000   Test-Value = 7.258
Variable label        Coefficient   Student   Probability   Test-Value
Prices Flexibility       0,4683       8,38       0,000         7,26

Adjustment 3 (Full printout)
R**2 = 0.3977   Fisher = 64.7156   Probability = 0.0000   Test-Value = 7.032
Variable label        Coefficient   Student   Probability   Test-Value
Delivery Time            0,4799       8,04       0,000         7,03

The number of degrees of freedom is 98.

The first adjustment is the best one, with an R² of 0.468; this means that the between-group
variance (between the two classes) represents 46.8% of the overall variance. A model that
is unable to differentiate the two classes would have an R² of 0.
The Fisher statistic corresponds to the global model validation.
The higher the between-group variance, the higher the Fisher statistic. This criterion follows a
Fisher distribution with 1 and 98 degrees of freedom.
The Fisher statistic of 86.2 corresponds to a probability lower than 1/10000 (0.0000).
The model is acceptable. This probability is converted into a test-value, here 7.85.
The Coefficient column contains the estimate of the Product Quality coefficient: the
discriminant function D is written D = constant - 0.4337 x Product Quality.
The Student column tests the nullity of the Product Quality coefficient: this statistic
follows a Student distribution with 98 degrees of freedom; the value of 9.28 corresponds to a
probability lower than 1/10000 (0.0000). The coefficient is significantly different from 0.
The probability is also converted into a test-value, here 7.85. As the model
contains one single explanatory variable, the test-values of the coefficient and of the overall
adjustment quality are equal.


6 vars
The three following adjustments each contain 6 explanatory variables.

Adjustments with 6 variables + constant, DF(Student) = 93

Adjustment 1 (Full printout)
R**2 = 0.6718   Fisher = 31.7290   Probability = 0.0000   Test-Value = 9.210
Variable label        Coefficient   Student   Probability   Test-Value
Delivery Time            0,3005       1,12       0,264         1,12
Prices Level             0,1242       0,45       0,656         0,44
Prices Flexibility       0,2418       4,40       0,000         4,18
Services                -0,2308       0,45       0,657         0,44
Commercial Image         0,1516       1,85       0,067         1,83
Product Quality         -0,2812       5,90       0,000         5,42

Adjustment 2 (Full printout)
R**2 = 0.6716   Fisher = 31.6987   Probability = 0.0000   Test-Value = 9.207
Variable label        Coefficient   Student   Probability   Test-Value
Delivery Time            0,1863       3,27       0,002         3,17
Prices Level             0,0070       0,11       0,910         0,11
Prices Flexibility       0,2383       4,33       0,000         4,12
Image                   -0,0328       0,37       0,711         0,37
Commercial Image         0,1833       1,44       0,152         1,43
Product Quality         -0,2790       5,87       0,000         5,40

Adjustment 3 (Full printout)
R**2 = 0.6716   Fisher = 31.6925   Probability = 0.0000   Test-Value = 9.206
Variable label        Coefficient   Student   Probability   Test-Value
Delivery Time            0,1844       2,35       0,021         2,31
Prices Flexibility       0,2368       4,34       0,000         4,13
Image                   -0,0317       0,36       0,722         0,36
Services                 0,0029       0,02       0,980         0,02
Commercial Image         0,1831       1,44       0,153         1,43
Product Quality         -0,2779       5,88       0,000         5,41

For the first adjustment, the variables Prices Flexibility and Product Quality are the
only ones significant at 5% (the probability that the related coefficient is null is lower than 5%).


3 vars
Finally, we should search for the best adjustments among the models with 3 or 4 explanatory
variables, where all the coefficients are significant and the models' test-values are the
highest.

Adjustments with 3 variables + constant, DF(Student) = 96

Adjustment 1 (Full printout)
R**2 = 0.6591   Fisher = 61.8789   Probability = 0.0000   Test-Value = 9.660
Variable label        Coefficient   Student   Probability   Test-Value
Delivery Time            0,2031       3,64       0,000         3,51
Prices Flexibility       0,2370       4,55       0,000         4,32
Product Quality         -0,2592       5,79       0,000         5,35

Adjustment 2 (Full printout)
R**2 = 0.6392   Fisher = 56.6932   Probability = 0.0000   Test-Value = 9.378
Variable label        Coefficient   Student   Probability   Test-Value
Prices Flexibility       0,3016       6,06       0,000         5,56
Services                 0,2206       2,68       0,009         2,63
Product Quality         -0,3097       7,12       0,000         6,36

Adjustment 3 (Full printout)
R**2 = 0.6338   Fisher = 55.3919   Probability = 0.0000   Test-Value = 9.303
Variable label        Coefficient   Student   Probability   Test-Value
Prices Flexibility       0,3018       6,02       0,000         5,53
Commercial Image         0,1953       2,38       0,019         2,34
Product Quality         -0,3323       7,46       0,000         6,61


The adjusted R² criterion

Curve of adjusted R² according to the number of explanatory variables
The adjusted R² criterion is based on the standard R², but it imposes a penalty for each
additional explanatory variable that is used to build the model. For this criterion to increase,
the contribution of a newly entered variable must be sufficient (if the variable is redundant
with the ones already included in the model, the criterion decreases).
The graph below shows that the best models are to be found among those with 3 or 4
explanatory variables.

[Figure: curve of the adjusted R² according to the number of the model's variables (1 to 7); the adjusted R² values shown range from 0.42 to 0.66]

4 vars
The first adjustment with 4 explanatory variables is the following:

Adjustments with 4 variables + constant, DF(Student) = 95

Adjustment 1 (Full printout)
R2ADJ = 0.6574   Fisher = 48.4911   Probability = 0.0000   Test-Value = 9.612
Variable label        Coefficient   Student   Probability   Test-Value
Delivery Time            0,1840       3,28       0,001         3,18
Prices Flexibility       0,2390       4,64       0,000         4,40
Commercial Image         0,1476       1,86       0,066         1,84
Product Quality         -0,2788       6,13       0,000         5,61

The adjusted R² is about 0.6574, very close to the standard R² of 0.6711. The explanatory
variables are meaningful, so the penalty related to the adjusted R² is very small.


The Mallows Cp criterion

Curve of Mallows Cp according to the number of explanatory variables
The lower this criterion, the better the adjustment. We get the same results as with the
previous criteria: the best models have 3 or 4 variables.

[Figure: curve of Mallows Cp according to the number of the model's variables (1 to 7); the Cp values shown range from 0.00 to 0.54]
4 vars
Adjustments with 4 variables + constant, DF(Student) = 95

Adjustment 1 (Full printout)
C(P) = 2.2916   Fisher = 48.4607   Probability = 0.0000   Test-Value = 9.610
Variable label        Coefficient   Student   Probability   Test-Value
Delivery Time            0,1840       3,28       0,001         3,18
Prices Flexibility       0,2390       4,64       0,000         4,40
Commercial Image         0,1476       1,86       0,066         1,84
Product Quality         -0,2788       6,13       0,000         5,61


Formulas of the criteria R², adjusted R² and Mallows Cp

1. R² :
The coefficient of determination R² (which takes values in the range 0 to 1) is a measure
of the proportion of the total variation that is associated with the regression process:

R² = 1 - SSE / SST

SSE : Error Sum of Squares
SST : Total Sum of Squares.

2. Adjusted R² :
The adjusted R² criterion is based on the standard R², but it imposes a penalty for each
additional explanatory variable that is used to build the model.

Adjusted R² = 1 - (n - 1)(1 - R²) / (n - p)

n : the number of observations,
p : the number of variables used for the model plus one.

3. Mallows Cp - C(p) :
The Mallows C(p) is positively related to the error (SSE) and to the number of
explanatory variables in the model: a model with many variables or with a high error
will be penalized by this criterion.

C(p) = SSE / SST + 2p - n

References:

Furnival, G.M. and Wilson, R.W. (1974), Regression by Leaps and Bounds, Technometrics, 16, 499-511.


DIS2GD - LINEAR DISCRIMINANT ANALYSIS BASED ON CONTINUOUS VARIABLES
This procedure executes a linear discriminant analysis with two groups on continuous
variables, using Fisher's classical method.
The procedure provides bootstrap estimates of the bias and the precision of the principal
results of the discrimination : coefficients, case classification probabilities, and global
percentage classifications. It allows the modification of the costs and a priori probabilities
of classification in the groups. It manages base, test and anonymous cases.
The procedure outputs in advance the descriptive statistics on the variables of the model
in each of the two groups. The discriminant analysis results follow: classification tables,
discriminant function, results of the equivalent regression, and output of assignment to
cases.
If a bootstrap validation is required, the results of the discrimination are output again with
the bootstrap estimates. In particular, the bias and the precision of the global classifications
are shown facing the direct classifications. For anonymous cases, the procedure calculates
the bootstrap probability of their assignment.
If an evaluation of the case tests is required, the procedure will output the results of the
discrimination for these cases. If the assignment of anonymous cases is requested, only the
display of the assignments is shown.
The procedure can archive the rules for the discriminant function so that they can be
applied later on another file with the same structure.


Dis2g 3
The following table describes the differences observed between the two classes with
regard to the input explanatory variables.
Linear discriminant analysis on the BASE sample
Description of the samples
G1: < 50 employees [60]        G2: >= 50 employees [40]

Variable label                          G1: < 50   G2: >= 50   Student's T   Probability
Delivery Time        Mean                 4.192       2.500       8.045         0.000
                     Std. deviation       1.029       1.006
                     Minimum              2.100       0.000
                     Maximum              6.100       4.900
Prices Flexibility   Mean                 8.622       6.803       8.378         0.000
                     Std. deviation       1.154       0.879
                     Minimum              5.100       5.000
                     Maximum             10.000       8.500
Product Quality      Mean                 6.090       8.293       9.284         0.000
                     Std. deviation       1.282       0.918
                     Minimum              3.700       6.200
                     Maximum              8.500      10.000

The first group G1 corresponds to the suppliers with fewer than 50 employees. There are 60
of them in the sample.
The second group G2 corresponds to the suppliers with 50 or more employees; there are 40
of them.
SPAD displays the means, standard deviations, minima and maxima of each explanatory
variable by group.
The Student's T column corresponds to the test that the means of the two groups are
equal, for each explanatory variable. We reject this hypothesis for the three variables,
because the associated probabilities are lower than 1/10000.
The product quality is perceived as significantly higher for the suppliers with 50 or more
employees (average score of 8.29 against 6.09).
Conversely, delivery times and prices flexibility are better for the smaller suppliers.


Dis2g 4
This table displays all the correlation matrices associated with the discriminant analysis.
Correlation matrix on group 1: < 50 employees (Count = 60)
                       Delivery Time   Prices Flexibility   Product Quality
Delivery Time              1,00
Prices Flexibility         0,32              1,00
Product Quality           -0,17              0,04                1,00

Correlation matrix on group 2: >= 50 employees (Count = 40)
                       Delivery Time   Prices Flexibility   Product Quality
Delivery Time              1,00
Prices Flexibility        -0,12              1,00
Product Quality            0,07             -0,16                1,00

Within-group common correlation
                       Delivery Time   Prices Flexibility   Product Quality
Delivery Time              1,00
Prices Flexibility         0,17              1,00
Product Quality           -0,09             -0,01                1,00

Total correlation
                       Delivery Time   Prices Flexibility   Product Quality
Delivery Time              1,00
Prices Flexibility         0,51              1,00
Product Quality           -0,48             -0,45                1,00

The first two correlation matrices display the correlations between the explanatory variables
inside each group. For example, the correlation between delivery time and prices
flexibility is 0.32 for group 1 and -0.12 for group 2.
These two matrices allow us to detect redundancies between explanatory variables:
there are none in this example.


Dis2g 6
Classification table of the discriminant analysis
Result of the FISHER linear discriminant analysis on sample: TRAIN

Table of group counts
                                     Assignment group:   Assignment group:
                                     < 50 employees      >= 50 employees     Total
Original group: < 50 employees             50                  10              60
Original group: >= 50 employees             4                  36              40

Classification table (counts and percentages)
                                     Well classified    Misclassified      Total
Original group: < 50 employees         50  (83,33)       10  (16,67)     60  (100,00)
Original group: >= 50 employees        36  (90,00)        4  (10,00)     40  (100,00)
Total                                  86  (86,00)       14  (14,00)    100  (100,00)

The adjustment presents a good classification rate on the current set: 50 of the 60 small
suppliers and 36 of 40 big suppliers, respectively 83% and 90%.
Globally, the good classification rate is 86% = (50+36)/100.


Dis2g 9
This table displays the characteristics of the linear discriminant function:

Linear discriminant function
R2 = 0.65913   Fisher = 61.87877   Probability = 0.0000
D2 (Mahalanobis) = 7.89599   T2 (Hotelling) = 189.50369   Probability = 0.0000

Variable label       Correlations with D.L.F.   D.L.F.         Regression     Standard dev.   Student's T    Probability
                     (Threshold = 0.201)        coefficients   coefficients   (regression)    (regression)
Delivery Time              0,632                  1,191760       0,203073        0,0558          3,6373         0,0004
Prices Flexibility         0,648                  1,390700       0,236972        0,0521          4,5482         0,0000
Product Quality           -0,686                 -1,521000      -0,259174        0,0448          5,7880         0,0000
CONSTANT                                         -3,774790      -0,777758        0,5981          1,3005         0,1966

The R² is 0.659; it means that the between-group variance (which expresses the differences
between the two groups) represents 65.9% of the total variance.
The Fisher statistic corresponds to the global model validation.
The higher the between-group variance, the higher the Fisher statistic. This criterion follows a
Fisher distribution with 3 and 96 degrees of freedom.
The Fisher statistic of 61.87 corresponds to a probability lower than 1/10000 (0.0000).
The model is acceptable.
D² is the squared Mahalanobis distance between the two groups. This distance takes into account
the relationships between the explanatory variables (the common correlation matrix).
Hotelling's T² is a generalization of the Student test to the case of more than one
explanatory variable. It tests the hypothesis that all the means are equal.
In this example, Hotelling's T² is 189.503; the associated probability is lower than 1/10000:
the differences between means are significant.
For each explanatory variable, SPAD displays its correlation with the D.L.F. (the linear
discriminant function). The threshold of 0.201 corresponds to the limit beyond which we consider
a correlation as significant (the threshold is given in absolute value).
The correlations between each explanatory variable and the linear discriminant function
are significant and quite close: the linear discriminant function is a good, well-balanced
compromise between these three variables.
The D.L.F. coefficients give the model equation: the best linear combination of
the 3 explanatory variables to separate the two groups is the following:

S1(x) = 1.191 x Delivery Time + 1.39 x Prices Flexibility - 1.52 x Product Quality - 3.77.

This equation gives high scores to suppliers that provide good delivery times and prices flexibility
(group 1, < 50 employees), and low scores to suppliers that have good product quality (group 2,
>= 50 employees).
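Applied to a row of the dataset, this is just a weighted sum. A minimal sketch (the function name is ours; the coefficients are the D.L.F. coefficients printed above):

```python
def fisher_score(delivery_time, prices_flexibility, product_quality):
    """Discriminant score S1(x) with the D.L.F. coefficients above."""
    return (1.191760 * delivery_time
            + 1.390700 * prices_flexibility
            - 1.521000 * product_quality
            - 3.774790)

# Supplier 1 of the dataset: Delivery 4.1, Flexibility 6.9, Quality 5.2
s = fisher_score(4.1, 6.9, 5.2)   # about +2.80: positive, assigned to group 1 (< 50)
```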


Of course, the following equation is equivalent to the previous one but inverts the sign of
the scores:

S2(x) = - 1.191 x Delivery Time - 1.39 x Prices Flexibility + 1.52 x Product Quality + 3.77.

The hierarchy of the suppliers is not modified.
The regression coefficients column is redundant with the discriminant function
coefficients column: they are proportional.
Linear discriminant analysis based on two groups is a particular case of multiple regression.

This equation:

S3(x) = 0.203 x Delivery Time + 0.237 x Prices Flexibility - 0.259 x Product Quality - 0.778

is still equivalent to the two previous ones.
The Student's T statistics and the associated probabilities are calculated from the regression
coefficients, but are valid for the discriminant function coefficients because of the
proportionality.
The Student's T is the ratio between a regression coefficient and its standard
deviation: for example, 3.63 = 0.203 / 0.0558.
Thus, we can see that our three coefficients are significant at 1%, but not the constant term.


BOOTSTRAP Estimations: Dis2g - 12 and Dis2g - 13

SPAD provides a bootstrap validation for all its discriminant functions: the purpose is to
simulate, by resampling, several samples and to compute an adjustment on each one. In this
example, we have chosen 250 samples.
At the end, we obtain 250 estimates of the classification table and of the coefficients of
the linear discriminant function.
The well-classified and misclassified rates are calculated as an average of the 250
estimates. The same holds for the coefficients.
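The resampling itself is simple. A minimal sketch (the helper names are ours, and `fit` stands for whatever routine estimates the discriminant coefficients on a sample):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_coefficients(X, y, fit, n_samples=250):
    """Refit the discriminant function on resampled datasets and
    return the bootstrap mean and standard deviation of the coefficients.
    X: (n, p) array of explanatory variables, y: group labels."""
    n = len(y)
    coefs = []
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=n)    # draw n cases with replacement
        coefs.append(fit(X[idx], y[idx]))
    coefs = np.array(coefs)
    return coefs.mean(axis=0), coefs.std(axis=0)
```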

Dis2g - 12
Discriminant analysis by bootstrap estimations: 250 random samples
Classification table (counts and percentages)

                                   Training sample    Training sample    Bootstrap          Bootstrap         Total
                                   Well classified    Misclassified      Well classified    Misclassified
Original group: < 50 employees      50,00 (83,33)      10,00 (16,67)      49,53 (82,55)      10,47 (17,45)     60,00 (100,00)
Original group: >= 50 employees     36,00 (90,00)       4,00 (10,00)      35,78 (89,45)       4,22 (10,55)     40,00 (100,00)
Total                               86,00 (86,00)      14,00 (14,00)      85,31 (85,31)      14,69 (14,69)    100,00 (100,00)

Dis2g 13
Bootstrap estimations for the linear discriminant function

Variable label       Correlations with   Standard    D.L.F. coefficients   Standard    Mean / Standard
                     D.L.F. (Mean)       deviation   (Mean)                deviation   deviation
Delivery Time            0,637             0,051         1,296               0,379         3,418
Prices Flexibility       0,648             0,064         1,500               0,513         2,924
Product Quality         -0,691             0,038        -1,633               0,327         4,996
CONSTANT                                                -4,163               4,680         0,889


Dis2g 11
In this Excel sheet, SPAD displays for each case its observed group, its assigned
group, the probability of being assigned to this group by the model, and its discriminant
score.
The Original group column gives the observed group of each case; it has to be compared
with the Assignment column. If the model is right, SPAD prints '=='.
The Fisher function, or score, is calculated by the model with the following equation:

S(x) = 1.191 x Delivery Time + 1.39 x Prices Flexibility - 1.52 x Product Quality - 3.77.

For example, for case n°79 (Delivery Time 1.00, Prices Flexibility 7.1, Product Quality 9.9),
the score of -7.767 is calculated this way:

-7.767 = 1.192 x 1.00 + 1.391 x 7.1 - 1.521 x 9.9 - 3.775.

Cases are listed in increasing score order. Thus case n°79 gets the lowest score and
therefore the highest probability of assignment to group 2 (50 and more employees).
Conversely, cases with high scores have a higher probability of assignment to group 1
(fewer than 50 employees).
For each case, SPAD calculates the probabilities of being assigned to each group and assigns
the case to the group with the highest probability. The indifference point (equal
probabilities for the two groups) corresponds here to a zero Fisher score; it does not
appear in this example.
The assignment probability is obtained from the Fisher score S(x):

P(G1 | x) = exp(S(x)) / (1 + exp(S(x))), and then P(G2 | x) = 1 - P(G1 | x).
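Numerically (a sketch; the helper name is ours):

```python
import math

def assignment_probabilities(score):
    """P(G1 | x) and P(G2 | x) from the Fisher score, as in the formula above."""
    p1 = math.exp(score) / (1 + math.exp(score))
    return p1, 1 - p1

p1, p2 = assignment_probabilities(-7.767)   # case 79: p1 ~ 0.0004, p2 ~ 0.9996
```

This matches the assignment probability of 1,000 (after rounding) to group 2 printed for case n°79 below.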

Sample: TRAINING
List of group assignments and related probabilities

Case identifier    Original group        Assignment   Assignment probability   Fisher function
Individu n°79          >=50         ==      >=50              1,000                -7,767
Individu n°39          >=50         ==      >=50              1,000                -7,716
Individu n°65          >=50         ==      >=50              1,000                -7,661
...
Individu n°93          <50                  >=50              0,877                -1,962
Individu n°88          <50                  >=50              0,873                -1,932
Individu n°84          <50                  >=50              0,848                -1,720
...
Individu n°87          >=50                 <50               0,640                 0,577
Individu n°13          >=50                 <50               0,687                 0,788
Individu n°85          >=50                 <50               0,690                 0,802
...
Individu n°25          <50          ==      <50               1,000                 8,623
Individu n°42          <50          ==      <50               1,000                 9,763
Individu n°5           <50          ==      <50               1,000                 9,882


DIS2GFP - LINEAR DISCRIMINANT ANALYSIS BASED ON PRINCIPAL FACTORS

General principles
This procedure outputs a linear discriminant analysis with two groups on the factorial
coordinates from a NOT NORMED principal components analysis using the classical
Fisher method.
It provides bootstrap estimates of the bias and the precision of the principal results of the
discrimination: coefficients, case classification probabilities, global classification
percentages. It also allows the modification of the a priori costs and probabilities of the
classification in the groups. It provides the management of the base cases, of the test cases
and of the anonymous cases.
The procedure offers a print preview of the descriptive statistics of the model variables in
each of the two groups. Next the results of the discriminant analysis are shown:
classification tables, discriminant function, and output of the assignment of cases.
The decision rule is finally expressed as a function of the original variables. The results of
the regression equivalent are only indicative, since the classical hypotheses of normality
are meaningless in this context.
If a bootstrap validation is requested, the results of the discrimination are repeated with
the bootstrap estimates. In particular, the bias and the precision of the global classifications
are shown with the direct classifications. For anonymous cases, the procedure calculates
their bootstrap probability assignment.
If an evaluation of the test cases is required, the procedure outputs the results of the
discrimination relative to these cases. If the assignment of anonymous cases is required,
only the assignments are output.
The procedure can archive the rules for the discriminant function so they can be applied
later to another file of the same structure.


Dis2g 1
This first Excel sheet displays the studied model: the variable to explain is the same as in
the previous methods (Supplier's Company Size); the explanatory variables are the
principal factors obtained from the principal component analysis based on all the continuous
variables available in the dataset, except Satisfaction index.
By default, SPAD labels each factor with the prefix F and the factor's number:
F 1, F 2, etc. We instructed SPAD to run this analysis on the 7 factors, that is to say
99.99% of the total inertia.
Model: V8 = F1 + F2 + F3 + F4 + F5 + F6 + F7

Variable number   Variable label
8                 Supplier's Company Size
1                 F 1
2                 F 2
3                 F 3
4                 F 4
5                 F 5
6                 F 6
7                 F 7

EIGENVALUES
COMPUTATIONS PRECISION SUMMARY : TRACE BEFORE DIAGONALISATION.. 89.9375
                                 SUM OF EIGENVALUES............ 89.9375
HISTOGRAM OF THE FIRST 8 EIGENVALUES
+--------+------------+------------+------------+------------------------------------------------+
| NUMBER | EIGENVALUE | PERCENTAGE | CUMULATED  |                                                |
|        |            |            | PERCENTAGE |                                                |
+--------+------------+------------+------------+------------------------------------------------+
|    1   |  81.8822   |   91.04    |    91.04   | ********************************************** |
|    2   |   4.0759   |    4.53    |    95.58   | ****                                           |
|    3   |   1.4053   |    1.56    |    97.14   | **                                             |
|    4   |   1.2298   |    1.37    |    98.51   | **                                             |
|    5   |   0.7842   |    0.87    |    99.38   | *                                              |
|    6   |   0.3903   |    0.43    |    99.81   | *                                              |
|    7   |   0.1617   |    0.18    |    99.99   | *                                              |
|    8   |   0.0081   |    0.01    |   100.00   | *                                              |
+--------+------------+------------+------------+------------------------------------------------+


Dis2g 6: Classification Table

Result of the FISHER linear discriminant analysis on sample: TRAIN

Table of group counts
                                     Assignment group:   Assignment group:
                                     < 50 employees      >= 50 employees     Total
Original group: < 50 employees             54                   6              60
Original group: >= 50 employees             0                  40              40

Classification table (counts and percentages)
                                     Well classified    Misclassified      Total
Original group: < 50 employees         54  (90,00)        6  (10,00)     60  (100,00)
Original group: >= 50 employees        40 (100,00)        0   (0,00)     40  (100,00)
Total                                  94  (94,00)        6   (6,00)    100  (100,00)

The adjustment presents a good classification rate on this sample: it correctly assigns 54 of
the 60 small suppliers and all the big ones, respectively 90% and 100%.
The global good classification rate is 94% = (54+40)/100.

Comparison with the model of the previous chapter
We can notice that this model obtains better results than the previous one, which only used
three predictors (Delivery Time, Prices Flexibility and Product Quality).
Classification table of the previous model:
Result of the FISHER linear discriminant analysis on sample: TRAIN

Table of group counts
                                     Assignment group:   Assignment group:
                                     < 50 employees      >= 50 employees     Total
Original group: < 50 employees             50                  10              60
Original group: >= 50 employees             4                  36              40

Classification table (counts and percentages)
                                     Well classified    Misclassified      Total
Original group: < 50 employees         50  (83,33)       10  (16,67)     60  (100,00)
Original group: >= 50 employees        36  (90,00)        4  (10,00)     40  (100,00)
Total                                  86  (86,00)       14  (14,00)    100  (100,00)

In our current model we have kept all the available information (almost all the
explanatory variables and all the factors), so it is natural to get better results.


Dis2g 9: results of the model based on principal factors

Linear discriminant function
R2 = 0.71210   Fisher = 32.50721   Probability = 0.0000
D2 (Mahalanobis) = 10.09961   T2 (Hotelling) = 242.39072   Probability = 0.0000

Axis label   Correlations with D.L.F.   D.L.F.         Regression     Standard dev.   Student's T    Probability
             (Threshold = 0.201)        coefficients   coefficients   (regression)    (regression)
F 1              -0,380                  -0,291099      -0,041896        0,0062          6,7769         0,0000
F 2              -0,651                  -2,234360      -0,321575        0,0277         11,6055         0,0000
F 3               0,240                   1,403290       0,201965        0,0472          4,2798         0,0000
F 4               0,036                   0,226999       0,032670        0,0504          0,6477         0,5188
F 5               0,028                   0,221504       0,031879        0,0632          0,5046         0,6150
F 6               0,278                   3,090510       0,444793        0,0895          4,9676         0,0000
F 7              -0,101                  -1,747540      -0,251510        0,1391          1,8078         0,0739
CONSTANT                                  1,009960       0,000000        0,0559          0,0000         1,0000

The R² is 0.7120; it means that the between-group variance represents 71.20% of the total
variance.
The Fisher statistic is 32.50, corresponding to a probability lower than 1/10000 (0.0000).
Thus, the model is accepted.
All the statistics displayed in the above table are described in the previous section.
We can see that factors 4 and 5 present coefficients that are not significantly different from
zero (probabilities 0.5188 and 0.6150). Factor 7 also presents a probability greater than
0.05.
The coefficients of the linear discriminant function give the following equation:

S1(x) = - 0.291 x F1 - 2.23 x F2 + 1.40 x F3 + 0.227 x F4 + 0.222 x F5 + 3.09 x F6 - 1.75 x F7 + 1.0099.

Dis2g 10: Fisher linear discriminant function rebuilt from the original variables

This Excel sheet is the most interesting one for the user, because it displays the model
equation based on the original variables and no longer on the principal factors.
Thus, we find the variables Delivery Time and Prices Flexibility with strong positive
coefficients.
To understand the coefficients, we have to remember that the equation opposes the two
groups by giving high scores to the small suppliers and low scores to the bigger ones. By
default, SPAD always gives high scores to the first category (in the list) of the endogenous
variable.


Remark: calculation of the coefficients on the original variables


SPAD displays in the table below the linear discriminant function based on original
variables; it has been calculated from the linear discriminant function based on the
principal factors. We know that each principal factor is a linear combination of the original
variables.
The coefficients of these combinations are available in the PCA outputs in the column
called Normed eigenvectors.
Normed eigenvectors
Variable label        Axis 1   Axis 2   Axis 3   Axis 4   Axis 5   Axis 6   Axis 7
Delivery Time          -0,10    -0,28     0,26    -0,09    -0,74     0,36    -0,04
Prices Level           -0,01     0,48     0,03    -0,47     0,35     0,49    -0,07
Prices Flexibility     -0,09    -0,40    -0,25     0,49     0,33     0,65    -0,01
Image                  -0,03     0,26     0,69     0,41     0,11     0,07     0,52
Services               -0,06     0,11     0,16    -0,29    -0,18     0,42    -0,03
Commercial Image       -0,02     0,15     0,40     0,31     0,07    -0,04    -0,85
Product Quality         0,04     0,65    -0,45     0,43    -0,42     0,12     0,02
Frequency of use       -0,99     0,07    -0,06    -0,02     0,03    -0,12     0,01

The factor 1 can be calculated this way:

F1 = -0.10 x "Delivery Time" - 0.01 x "Prices Level" - 0.09 x "Prices Flexibility"
- 0.03 x "Image" - 0.06 x "Services" - 0.02 x "Commercial Image"
+ 0.04 x "Product Quality" - 0.99 x "Frequency of use"

and so on for all the factors. Starting from these equations, SPAD can assign a
coefficient to each original variable.
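Concretely, if V is the matrix of normed eigenvectors (variables x axes) and b the vector of D.L.F. coefficients on the factors, the coefficient of each original variable is the corresponding entry of the product V·b. A sketch (the constant term, which also absorbs the variable means, is left aside):

```python
import numpy as np

# Normed eigenvectors (8 variables x 7 axes), from the table above
V = np.array([
    [-0.10, -0.28,  0.26, -0.09, -0.74,  0.36, -0.04],  # Delivery Time
    [-0.01,  0.48,  0.03, -0.47,  0.35,  0.49, -0.07],  # Prices Level
    [-0.09, -0.40, -0.25,  0.49,  0.33,  0.65, -0.01],  # Prices Flexibility
    [-0.03,  0.26,  0.69,  0.41,  0.11,  0.07,  0.52],  # Image
    [-0.06,  0.11,  0.16, -0.29, -0.18,  0.42, -0.03],  # Services
    [-0.02,  0.15,  0.40,  0.31,  0.07, -0.04, -0.85],  # Commercial Image
    [ 0.04,  0.65, -0.45,  0.43, -0.42,  0.12,  0.02],  # Product Quality
    [-0.99,  0.07, -0.06, -0.02,  0.03, -0.12,  0.01],  # Frequency of use
])

# D.L.F. coefficients of factors F1..F7 (Dis2g 9 above)
b = np.array([-0.291099, -2.234360, 1.403290, 0.226999,
              0.221504, 3.090510, -1.747540])

coeffs = V @ b   # first entry ~ 2.02 for Delivery Time, matching the table below
```

The small discrepancies with the printed values come only from the rounding of the displayed eigenvectors.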
FISHER linear function rebuilt starting from the original variables

Variable label       D.L.F.         Regression     Standard dev.   Student's T    Probability
                     coefficients   coefficients   (regression)    (regression)
Delivery Time           2,018560       0,290515        0,0588          4,9366         0,0000
Prices Level            0,590870       0,085039        0,0573          1,4851         0,1410
Prices Flexibility      2,771660       0,398903        0,0684          5,8360         0,0000
Image                  -0,198512      -0,028570        0,0831          0,3437         0,7319
Services                1,257900       0,181039        0,0430          4,2083         0,0001
Commercial Image        1,674080       0,240937        0,1206          1,9979         0,0487
Product Quality        -1,764960      -0,254017        0,0453          5,6085         0,0000
Frequency of use       -0,327111      -0,047079        0,0130          3,6132         0,0005
CONSTANT               -9,065840      -1,450130

The linear discriminant function equation is the following:

D1 = 2.02 x Delivery Time + 0.59 x Prices Level + 2.77 x Prices Flexibility
- 0.20 x Image + 1.26 x Services + 1.67 x Commercial Image - 1.76 x Product Quality
- 0.33 x Frequency of use - 9.07.

The variables Image and Prices Level are not significant (respective probabilities of
0.7319 and 0.141). The small contribution of the Image variable is not surprising: we get
the same result as the one obtained with the automatic characterization (see table below).
Concerning the Prices Level variable, it is surprising to find it not significant in the model
while it appears significant in the automatic characterization.
This is due to the correlations existing between the explanatory variables: the prices level
is related to the variables Delivery Time, Prices Flexibility, etc. These variables tend to
reduce the specific effect due to the prices level.

Characterisation by continuous variables of the categories of Supplier's Company Size

< 50 employees (Weight = 60.00, Count = 60)
Characteristic variables   Category mean   Overall mean   Category Std. dev.   Overall Std. dev.   Test-value   Probability
Prices Flexibility             8,622           7,894            1,154                1,380             6,43         0,000
Delivery Time                  4,192           3,515            1,029                1,314             6,27         0,000
Frequency of use              48,767          46,100            8,724                8,944             3,63         0,000
Services                       3,050           2,916            0,584                0,747             2,18         0,014
Commercial Image               2,692           2,665            0,859                0,767             0,42         0,336
Image                          5,213           5,248            1,281                1,126            -0,38         0,354
Prices Level                   1,948           2,364            1,018                1,190            -4,26         0,000
Product Quality                6,090           6,971            1,282                1,577            -6,81         0,000

>= 50 employees (Weight = 40.00, Count = 40)
Characteristic variables   Category mean   Overall mean   Category Std. dev.   Overall Std. dev.   Test-value   Probability
Product Quality                8,293           6,971            0,918                1,577             6,81         0,000
Prices Level                   2,988           2,364            1,156                1,190             4,26         0,000
Image                          5,300           5,248            0,838                1,126             0,38         0,354
Commercial Image               2,625           2,665            0,601                0,767            -0,42         0,336
Services                       2,715           2,916            0,905                0,747            -2,18         0,014
Frequency of use              42,100          46,100            7,690                8,944            -3,63         0,000
Delivery Time                  2,500           3,515            1,006                1,314            -6,27         0,000
Prices Flexibility             6,803           7,894            0,879                1,380            -6,43         0,000


Simplified Model: Dis2g - 9 and Dis2g - 10

We modify our previous model by keeping only the significant principal factors: 1, 2, 3
and 6.
The results are listed below:
Linear discriminant function
R2 = 0.69976   Fisher = 55.35311   Probability = 0.0000
D2 (Mahalanobis) = 9.51685   T2 (Hotelling) = 228.40439   Probability = 0.0000

Axis label   Correlations with D.L.F.   D.L.F.         Regression     Standard dev.   Student's T    Probability
             (Threshold = 0.201)        coefficients   coefficients   (regression)    (regression)
F 1              -0,380                  -0,279138      -0,041896        0,0062          6,7436         0,0000
F 2              -0,651                  -2,142560      -0,321575        0,0278         11,5484         0,0000
F 3               0,240                   1,345630       0,201965        0,0474          4,2588         0,0000
F 6               0,279                   2,963530       0,444793        0,0900          4,9432         0,0000
CONSTANT                                  0,951688       0,000000        0,0562          0,0000         1,0000

Since the factors are orthogonal, the Student statistics do not change except for rounding
errors: we keep the same hierarchy and the same relative importance of the factors.
The new linear discriminant function is now written:

S1(x) = - 0.27 x F1 - 2.14 x F2 + 1.34 x F3 + 2.96 x F6 + 0.95.

FISHER linear function rebuilt starting from the original variables

Variable label       D.L.F.         Regression     Standard dev.   Student's T    Probability
                     coefficients   coefficients   (regression)    (regression)
Delivery Time           2,043930       0,306772        0,0353          8,6795         0,0000
Prices Level            0,472917       0,070980        0,0461          1,5395         0,1272
Prices Flexibility      2,460700       0,369324        0,0605          6,1044         0,0000
Image                   0,572879       0,085983        0,0340          2,5320         0,0131
Services                1,261200       0,189292        0,0390          4,8564         0,0000
Commercial Image        0,103654       0,015557        0,0196          0,7920         0,4304
Product Quality        -1,669530      -0,250579        0,0300          8,3423         0,0000
Frequency of use       -0,297403      -0,044637        0,0128          3,4890         0,0007
CONSTANT               -8,387220      -1,401670

We find the same opposition between the characteristic variables of the small suppliers
(Delivery Time and Prices Flexibility) and those of the bigger ones (Product Quality).
The Commercial Image variable is still not significant, but the Image variable becomes
significant. Moreover, its positive coefficient indicates a characteristic of the small
companies. However, it is recommended to interpret this result with care, because the
automatic characterization shows that the small suppliers have a lower image score than the
big ones (average of 5.21 compared to 5.3). This is due to the correlations existing between
the variables; working on a restricted number of factors was not sufficient to erase them.
Finally, by eliminating non-significant variables or principal factors, and variables whose
coefficient sign is not coherent, we get back to the model of the previous chapter with the
variables Delivery Time, Prices Flexibility and Product Quality.
Even if it discriminates less well than the other models studied in this chapter, we may
keep this one because of its coherence with regard to the relative contributions and the
signs of the effects.


DISCO - DISCRIMINANT ANALYSIS BASED ON QUALITATIVE VARIABLES

SCORE - SCORING FUNCTION

With SPAD, building a scoring function requires the following steps:


1. Firstly, we determine the most discriminant variables with respect to the endogenous
variable (the DEMOD and MSMOD procedures).
2. Then, we perform a Multiple Correspondence Analysis (MCA) on the selected
qualitative variables.
3. We perform a linear discriminant analysis based on the factorial coordinates extracted
from the Multiple Correspondence Analysis.
4. Then, we rebuild the discriminant function starting from the original qualitative
variables.
5. We normalize the coefficients of each explanatory category to get only zero or positive
scores. The maximum score is defined by the user (100, 1000...).
6. Then, each case is assigned a score according to its profile.

NB:

Steps 2 and 3 are implemented in the DISCO procedure of the scoring chain.

The SPAD scoring method performs a multiple correspondence analysis for the following
reasons:
- Linear discriminant analysis requires continuous input variables.
- The MCA transforms the qualitative variables into continuous factorial coordinates that
can be used for the discriminant analysis.
- The factorial coordinates are orthogonal, so we are freed from multicollinearity
problems.
- Finally, the selection of the factorial coordinates optimizes the results.


To illustrate the scoring methodology of SPAD, we use the CREDIT.SBA dataset.


The main goal is to discriminate the bad customers from the good ones (target variable
GOOD_BAD) with respect to their banking and sociodemographic profiles.
The final goal is, of course, to build decision rules to be applied to new customers.
Extract from the dataset CREDIT.SBA:
GOOD_BAD  AGE                MARITAL   SENIORITY         SALARY              SAVINGS        JOB
GOOD      GE 50 years        Single    GT 12 years       SALARY AT THE BANK  No saving      Employee
GOOD      LT 23 years        Single    LE 1 year         SALARY AT THE BANK  No saving      Employee
BAD       GE 23 LT 40 years  Widowed   GT 6 LT 12 years  SALARY AT THE BANK  No saving      Employee
GOOD      GE 23 LT 40 years  Divorced  GT 1 LE 4 years   SALARY AT THE BANK  LT 10 KF Sav.  Employee
GOOD      LT 23 years        Single    GT 6 LT 12 years  NO SALARY           No saving      Employee
GOOD      GE 23 LT 40 years  Single    LE 1 year         SALARY AT THE BANK  No saving      Employee
GOOD      GE 50 years        Married   GT 6 LT 12 years  SALARY AT THE BANK  No saving      Executive
GOOD      GE 50 years        Married   GT 12 years       SALARY AT THE BANK  No saving      Executive
GOOD      GE 40 LT 50 years  Single    GT 1 LE 4 years   SALARY AT THE BANK  No saving      Employee
GOOD      GE 50 years        Single    GT 4 LE 6 years   SALARY AT THE BANK  No saving      Employee
GOOD      GE 50 years        Married   GT 12 years       SALARY AT THE BANK  No saving      Employee
GOOD      GE 40 LT 50 years  Married   LE 1 year         NO SALARY           LT 10 KF Sav.  Executive
GOOD      GE 23 LT 40 years  Single    GT 4 LE 6 years   NO SALARY           No saving      Other
GOOD      GE 23 LT 40 years  Married   GT 6 LT 12 years  SALARY AT THE BANK  No saving      Employee
GOOD      GE 40 LT 50 years  Divorced  GT 4 LE 6 years   NO SALARY           LT 10 KF Sav.  Executive
BAD       GE 40 LT 50 years  Divorced  GT 6 LT 12 years  SALARY AT THE BANK  No saving      Employee
GOOD      GE 50 years        Single    GT 12 years       SALARY AT THE BANK  No saving      Other
BAD       GE 50 years        Widowed   GT 12 years       SALARY AT THE BANK  No saving      Other

The dataset CREDIT.SBA has 468 cases and 12 qualitative variables.


QUALITATIVE VARIABLES
1. GOOD_BAD (2 categories)
2. AGE (4 categories)
3. MARITAL STATUS (4 categories)
4. SENIORITY (5 categories)
5. SALARY (2 categories)
6. SAVINGS (4 categories)
7. JOB (3 categories)
8. CHECKING ACCOUNT (3 categories)
9. AVERAGE TRANSACTIONS (4 categories)
10. WITHDRAWALS (3 categories)
11. NEGATIVE ACCOUNT BALANCE (2 categories)
12. CHEQUE AUTHORIZATION (2 categories)

Before building the scoring function, it is recommended to start with descriptive statistics,
using procedures such as STATS, DEMOD and MSMOD.


THE SCORING FAVOURITE


We create a new chain using the Predefined Chain command from the general Chain menu.

In the Favourites tab, select the Scoring rubric and double-click on
Discriminant analysis on categorical variables and scoring.

SPAD displays the following methods in the diagram.

Import the dataset credit.sda by using the SPAD Data Archive File import method.
The method icons are grey because you have to configure them.

The SCORING parameters will be defined by default.


DISCO PARAMETERS
The configuration of the DISCO procedure starts by defining the model to build: the
endogenous variable and the qualitative exogenous variables.

The model is the following:
V1 = V2 + ... + V12

In this Model tab, we can specify the real model, i.e. the one built on the factorial
coordinates extracted from the MCA. To proceed, click on the Calculation Options button.

We decide to build the complete model, with all the factorial coordinates.
Click on OK to go back to the Model tab, and again on OK to finish the DISCO
configuration.
Run the methods.


Right-click on the discriminant method icon to access the results.


Starting with the complete model allows us to keep, in a second step, only the significant
factors.

We visualize and select the factorial axes that really discriminate the target variable. To
do this, we use the ratio Coefficient/StDev, which can be interpreted as a Student's T. We
can keep all the axes with an absolute ratio greater than 1.96.
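As an illustration of this selection rule (a Python sketch, not a SPAD feature), applied to a few of the ratios taken from the results table that follows:

    # Ratios Coefficient/StDev for a subset of the 25 axes (from the table below)
    ratios = {"F 1": -13.0262, "F 2": 7.9474, "F 3": 2.8551, "F 4": 4.6611,
              "F 9": 3.8017, "F 14": -1.9170, "F 18": -2.8691, "F 21": -2.0303}
    kept = [axis for axis, r in ratios.items() if abs(r) > 1.96]
    print(kept)   # F 14 (ratio -1.92) is excluded; the other listed axes are kept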


DISCO RESULTS

Linear Discriminant Function

Model:
V1 = F1 + F2 + F3 + F4 + F5 + F6 + F7 + F8 + F9 + F10 + F11 + F12 + F13 + F14 + F15 + F16
+ F17 + F18 + F19 + F20 + F21 + F22 + F23 + F24 + F25

Linear discriminant function
R2 = 0.41398   Fisher = 12.48967   Probability = 0.0000
D2 (Mahalanobis) = 2.81410   T2 (Hotelling) = 329.19614   Probability = 0.0000

Axis label   Correlations with D.L.F.   D.L.F.         Regression     Standard deviation   Ratio Coefficient /
             (Threshold = 0.093)        coefficients   coefficients   (Regression)         St. Deviation
F 1 *        -0.475                     -3.228700      -0.950022      0.0729               -13.0262
F 2 *        0.290                      2.342510       0.689267       0.0867               7.9474
F 3 *        0.104                      0.897833       0.264181       0.0925               2.8551
F 4 *        0.170                      1.532160       0.450828       0.0967               4.6611
F 5          -0.007                     -0.072457      -0.021320      0.1057               -0.2018
F 6          -0.057                     -0.571836      -0.168259      0.1077               -1.5617
F 7          -0.022                     -0.227015      -0.066797      0.1099               -0.6076
F 8          0.061                      0.641800       0.188845       0.1130               1.6705
F 9 *        0.139                      1.515070       0.445797       0.1173               3.8017
F 10         -0.045                     -0.502921      -0.147981      0.1192               -1.2411
F 11         0.004                      0.051269       0.015086       0.1224               0.1233
F 12         -0.028                     -0.319744      -0.094082      0.1237               -0.7605
F 13         -0.030                     -0.356309      -0.104841      0.1279               -0.8197
F 14         -0.070                     -0.847106      -0.249255      0.1300               -1.9170
F 15         0.045                      0.567041       0.166848       0.1350               1.2364
F 16         0.002                      0.023938       0.007043       0.1359               0.0518
F 17         -0.017                     -0.219652      -0.064631      0.1405               -0.4599
F 18 *       -0.105                     -1.389350      -0.408807      0.1425               -2.8691
F 19         0.049                      0.676453       0.199041       0.1487               1.3381
F 20         -0.008                     -0.119744      -0.035234      0.1546               -0.2279
F 21 *       -0.074                     -1.071810      -0.315374      0.1553               -2.0303
F 22         0.024                      0.367523       0.108141       0.1624               0.6659
F 23         0.068                      1.151150       0.338719       0.1819               1.8622
F 24         -0.061                     -1.190570      -0.350316      0.2089               -1.6768
F 25         0.019                      0.608556       0.179063       0.3351               0.5343
CONSTANT                                0.018039       0.000000       0.0364               0.0000

The factors whose ratio has an absolute value greater than 1.96 (marked with an asterisk
in the table above; displayed in bold in SPAD) are to be included in the optimal model.
To build this model, we need to return to the DISCO configuration and click on the
Calculation Options button.


NEW CONFIGURATION OF THE DISCO METHOD

We have specified the optimal model to use for building the discriminant function and, in
a second step, the scoring function.
The optimal model is built with the following factors: F1 to F4, F9, F18 and F21.

We have to re-run the chain.

Now that the optimal model is available, we want to partition the dataset into two subsets:
one to perform the analysis, the other to confirm and validate it. This part is called
validation. We talk about a learning set and a testing set (or test cases) in the following tab.

In this example, we choose to select randomly 25% of the cases to test the model based on
the 75% remaining cases.
Validation is very useful for testing that the model does not overfit the data and has a
good predictive power.
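Outside SPAD, such a random partition can be sketched as follows (assuming the 468 cases are simply indexed; the seed plays the role of the random sampling initialization):

    import random

    random.seed(1)                                            # sampling initialization (assumed)
    cases = list(range(468))
    test_set = set(random.sample(cases, len(cases) // 4))     # ~25% testing set
    learning_set = [c for c in cases if c not in test_set]    # ~75% learning set
    print(len(learning_set), len(test_set))                   # 351 117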


THE DISCO RESULTS


To measure the prediction performance of the model, we read the following classification
tables:
Result of the FISHER linear discriminant analysis on sample: TRAINING

Table of group counts
                       Assignment group: GOOD   Assignment group: BAD   Total
Original group: GOOD   150                      28                      178
Original group: BAD    35                       138                     173

Classification table (counts and percentages)
                       Well classified   Misclassified   Total
Original group: GOOD   150 (84.27)       28 (15.73)      178 (100.00)
Original group: BAD    138 (79.77)       35 (20.23)      173 (100.00)
Total                  288 (82.05)       63 (17.95)      351 (100.00)

Result of the FISHER linear discriminant analysis on sample: TEST

Table of group counts
                       Assignment group: GOOD   Assignment group: BAD   Total
Original group: GOOD   50                       9                       59
Original group: BAD    21                       37                      58

Classification table (counts and percentages)
                       Well classified   Misclassified   Total
Original group: GOOD   50 (84.75)        9 (15.25)       59 (100.00)
Original group: BAD    37 (63.79)        21 (36.21)      58 (100.00)
Total                  87 (74.36)        30 (25.64)      117 (100.00)

On the TRAINING set, 82.05% of the cases are well classified.
On the TEST set, 74.36% of the cases are well classified.
The built model presents a good predictive power on both sets: it does not overfit the
training set and looks reproducible.
Another way to validate the model would be to use bootstrapping.
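As a quick check (a sketch, not SPAD output), the well-classified rates can be recomputed from the two classification tables above:

    # (well classified, misclassified) counts per original group
    train = {"GOOD": (150, 28), "BAD": (138, 35)}
    test = {"GOOD": (50, 9), "BAD": (37, 21)}
    for name, table in (("TRAINING", train), ("TEST", test)):
        well = sum(w for w, m in table.values())
        total = sum(w + m for w, m in table.values())
        print(name, round(100 * well / total, 2))   # 82.05, then 74.36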


Linear discriminant function

R2 = 0.38387   Fisher = 40.94150   Probability = 0.0000
D2 (Mahalanobis) = 2.48185   T2 (Hotelling) = 290.32867   Probability = 0.0000

Axis label   Correlations with D.L.F.   D.L.F.         Regression     Standard deviation   Ratio Coefficient /
             (Threshold = 0.093)        coefficients   coefficients   (Regression)         St. Deviation
F 1          -0.475                     -3.070890      -0.950022      0.0733               -12.9600
F 2          0.290                      2.228010       0.689267       0.0872               7.9070
F 3          0.104                      0.853949       0.264181       0.0930               2.8406
F 4          0.170                      1.457270       0.450828       0.0972               4.6374
F 9          0.139                      1.441010       0.445797       0.1179               3.7824
F 18         -0.105                     -1.321450      -0.408807      0.1432               -2.8545
F 21         -0.074                     -1.019430      -0.315374      0.1561               -2.0200
CONSTANT                                0.015909       0.000000       0.0366               0.0000

FISHER linear function rebuilt from the original variables


Variable label          Category label        D.L.F.         Regression     Standard deviation   Ratio Coefficient /
                                              coefficients   coefficients   (Regression)         St. Deviation
Age of client           Less than 23 years    -4.413170      -1.365270      0.4690               -2.9112
                        From 23 to 40 years   1.944030       0.601412       0.2750               2.1873
                        From 40 to 50 years   1.169940       0.361938       0.2642               1.3698
                        Over 50 years         -0.425731      -0.131706      0.3968               -0.3319
Family Situation        Single                0.009629       0.002979       0.3449               0.0086
                        Married               1.427100       0.441492       0.2180               2.0249
                        Divorced              -3.502810      -1.083640      0.1992               -5.4407
                        Widow                 -6.459620      -1.998370      0.9437               -2.1177
Seniority               1 year or less        -6.138460      -1.899020      0.3039               -6.2495
                        From 1 to 4 years     -8.631250      -2.670200      0.3826               -6.9787
                        From 4 to 6 years     9.017780       2.789770       0.5665               4.9243
                        From 6 to 12 years    2.082220       0.644162       0.4131               1.5592
                        Over 12 years         9.972050       3.084990       0.6592               4.6798
Salary domiciliation    Sal. domiciliated     4.923760       1.523230       0.1359               11.2049
                        Sal. not domicil.     -10.236200     -3.166720      0.2826               -11.2049
Size of savings         No saving             -1.401220      -0.433488      0.0864               -5.0190
                        Less than 10 KF       3.659600       1.132150       0.2840               3.9865
                        From 10 to 100 KF     6.438820       1.991940       0.6526               3.0524
                        More than 100 KF      12.519100      3.872970       1.1221               3.4515
Profession              executive             3.238490       1.001870       0.3962               2.5287
                        employee              2.657760       0.822213       0.1786               4.6033
                        other                 -5.709430      -1.766290      0.2101               -8.4074
Average outstanding     Less than 2 KF        -12.409300     -3.839000      0.4235               -9.0644
                        From 2 to 5 KF        2.567540       0.794304       0.1589               4.9987
                        More than 5 KF        6.859870       2.122200       0.4772               4.4468
Average transactions    Less than 10 KF       -3.598390      -1.113210      0.3180               -3.5007
                        From 10 to 30 KF      -0.489471      -0.151425      0.2069               -0.7318
                        From 30 to 50 KF      1.643420       0.508415       0.4045               1.2571
                        More than 50 KF       3.306170       1.022810       0.2583               3.9605
Number of withdrawals   Less than 40          6.128640       1.895980       0.2382               7.9587
                        From 40 to 100        -0.076000      -0.023512      0.2219               -0.1060
                        More than 100         -7.615890      -2.356080      0.3085               -7.6361
Overdraft               Authorized            -1.481820      -0.458423      0.3886               -1.1795
                        Forbidden             1.125290       0.348125       0.2951               1.1795
Checkbook               Authorized            1.654080       0.511713       0.0658               7.7814
                        Forbidden             -12.951800     -4.006810      0.5149               -7.7814
CONSTANT                                      0.015909       0.000000


THE SCORING FUNCTION


The SCORE procedure transforms the FLD coefficients by using the two following rules:
Minimum coefficient for each variable: for each categorical variable, the smallest
coefficient is set to zero. The minimum possible score for a case is therefore zero; it is
obtained by a case that presents, for each variable, the category set to zero.
Maximum possible score: the value of the maximum possible score is chosen by the user
(for example 1000). This maximum corresponds to the sum of the largest transformed
coefficients of each variable.
The score attributed to a case is obtained by adding the transformed coefficients associated
with the categories of the case. The transformed score function classifies the cases in the
same way as the initial discriminant function.
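A minimal Python sketch of these two rules (not SPAD's internal code), using the Salary domiciliation and Checkbook coefficients from the table above and assuming, for the example only, that these two variables alone enter the model:

    # D.L.F. coefficients per variable and category (subset, for illustration)
    dlf = {"Salary domiciliation": {"Sal. domiciliated": 4.92376,
                                    "Sal. not domicil.": -10.23620},
           "Checkbook": {"Authorized": 1.65408, "Forbidden": -12.95180}}
    max_score = 1000.0   # maximum possible score chosen by the user

    # Rule 1: within each variable, shift so the smallest coefficient becomes zero.
    shifted = {var: {cat: c - min(cats.values()) for cat, c in cats.items()}
               for var, cats in dlf.items()}
    # Rule 2: rescale so the best possible profile sums to max_score.
    top = sum(max(cats.values()) for cats in shifted.values())
    score = {var: {cat: c * max_score / top for cat, c in cats.items()}
             for var, cats in shifted.items()}

    # The score of a case is the sum of its categories' transformed coefficients.
    case = {"Salary domiciliation": "Sal. domiciliated", "Checkbook": "Authorized"}
    print(sum(score[var][case[var]] for var in case))   # 1000.0 (best possible profile)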


THE SCORE CONFIGURATION


Parameter to modify, if necessary, for assigning the target category to the low scores.

Tick to create a file containing the decision rules to be applied to new datasets.

Click OK and run the method.


THE SCORING RESULTS

Coefficients of the Discriminant and Score functions

Age of client
Category label          Linear discriminant     Score function
                        function coefficient    coefficient
Less than 23 years      -4.413                  0.00
From 23 to 40 years     1.944                   49.66
From 40 to 50 years     1.170                   43.62
Over 50 years           -0.426                  31.15

Family Situation
Category label          Linear discriminant     Score function
                        function coefficient    coefficient
Single                  0.010                   50.54
Married                 1.427                   61.61
Divorced                -3.503                  23.10
Widow                   -6.460                  0.00

Seniority
Category label          Linear discriminant     Score function
                        function coefficient    coefficient
1 year or less          -6.138                  19.47
From 1 to 4 years       -8.631                  0.00
From 4 to 6 years       9.018                   137.88
From 6 to 12 years      2.082                   83.69
Over 12 years           9.972                   145.33

Salary domiciliation
Category label          Linear discriminant     Score function
                        function coefficient    coefficient
Sal. domiciliated       4.924                   118.43
Sal. not domicil.       -10.236                 0.00

Size of savings
Category label          Linear discriminant     Score function
                        function coefficient    coefficient
No saving               -1.401                  0.00
Less than 10 KF         3.660                   39.54
From 10 to 100 KF       6.439                   61.25
More than 100 KF        12.519                  108.75

Profession
Category label          Linear discriminant     Score function
                        function coefficient    coefficient
executive               3.238                   69.90
employee                2.658                   65.37
other                   -5.709                  0.00

Average outstanding
Category label          Linear discriminant     Score function
                        function coefficient    coefficient
Less than 2 KF          -12.409                 0.00
From 2 to 5 KF          2.568                   117.00
More than 5 KF          6.860                   150.53

Average transactions
Category label          Linear discriminant     Score function
                        function coefficient    coefficient
Less than 10 KF         -3.598                  0.00
From 10 to 30 KF        -0.489                  24.29
From 30 to 50 KF        1.643                   40.95
More than 50 KF         3.306                   53.94

Number of withdrawals


OPTIMAL SCORING PILOT


Double-clicking on the following icon opens the Optimal Scoring Pilot interface.
Click on New in the File menu to display the graph below.

The user can define a rate called the Classification Error Tolerance (abbreviated as CET)
in the Parameters tab of the Score method. In this example, we chose the default value of
10%.
This rate supports the calculation of regions on the score function scale: the low boundary
528 has been chosen so as to assign 10.0% of the real good customers to the weak scores
group (misclassified), and the high boundary 655 so as to assign 9.7% of the real bad
customers to the high scores group (misclassified).
These boundaries can be moved if the user wants to modify the misclassification rates.
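A sketch of how such boundaries can be derived from the scores (the two score samples below are simulated, not the CREDIT data):

    import numpy as np

    cet = 0.10   # classification error tolerance
    scores_good = np.random.default_rng(0).normal(650, 80, 300)  # simulated GOOD scores
    scores_bad = np.random.default_rng(1).normal(520, 80, 300)   # simulated BAD scores
    low = np.quantile(scores_good, cet)         # ~10% of real GOOD fall below `low`
    high = np.quantile(scores_bad, 1.0 - cet)   # ~10% of real BAD fall above `high`
    # below `low`: red region; above `high`: green region; in between: undecided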

Three regions are displayed on the graph:

A "green" region, which corresponds to the high scores (here the category GOOD), where
one expects to find the majority of the GOOD customers. In this region, a misclassified
case is a BAD customer assigned to GOOD because of its high score. The boundary is
calculated so that the rate of misclassified cases does not exceed the CET.
In this example: 10.0% of the real BAD customers are assigned to this region and 62.4% of
the real GOOD are correctly assigned.


A "red" region of low scores containing most of the cases in BAD category - and therefore
correctly classified - and a percentage not exceeding the CET of cases of GOOD and
therefore miss-classified.
In this example : 64.5% of the real BAD are well assigned and 9.7% of the real GOOD are
misclassified.
An intermediary Orange region between the boundaries of the red and green regions,
where group assignment is left undecided. This region of indecision shrinks when the user
increases the CET.
In this example : 25.5% of the real BAD and 27.8% of the real GOOD are assigned to the
Orange region.
Sometimes, it is not necessary to keep this intermediary region, for direct marketing
campaigns. Then, by clicking on the Single score checkbox, we keep only two regions (Red
and Green) and one single boundary.
Modifying the boundaries by using the scores table
This part of the user interface allows us to modify manually the CET and
therefore the boundaries.

The Data view


This view is not available when the number of cases is greater than 10 000.


The fields of the data view are described below:

Identifier: the case identifier, truncated to 40 characters.
Weight: the weight defined in the Weighting tab of the DISCO method; 1 by default.
Sample: the set assignment of each case: learning or test set.
Score: the score calculated for each case.
Group: the original group (G1 or G2) of the case.
Assign.: the group assignment determined by the model (G1, NC or G2). NC means that
the case is not assigned, i.e. assigned to the orange region.
Err. G1 - Error group 1: if the case belongs to group 1 (G1) and is assigned to:
- group 1, no error
- the orange zone, error coded (x)
- group 2, error coded (xx)
Err. G2 - Error group 2: if the case belongs to group 2 (G2) and is assigned to:
- group 2, no error
- the orange zone, error coded (x)
- group 1, error coded (xx)
Sort the data by a field: clicking on any field name allows us to:
- sort the data in increasing order,
- sort the data in decreasing order,
- return to the initial order.

The case profile:

By clicking on any case in the Data view, it is possible to:
- locate the case on the previous graph: the case is displayed in red with its identifier;
press Escape to return to the Data view,
- display its profile in a condensed view,
- display its questionnaire and the associated scores: the case's categories and associated
scores are shown in blue, and we know its original group and its assignment.


Interactive simulations, after clicking on Questionnaire and score:

It is possible to simulate a new score by clicking on the chosen categories. They become
red.


Density curves

This graph draws the density curves of the real BAD and the real GOOD customers
respectively.


Lift or Gain Curve

Horizontal axis: % of the scored cases selected, sorted by decreasing scores.
Vertical axis: % of the target category captured by the selection.
The optimal curve is the grey one, where the selection captures the entire target category
and only the target category.
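A Python sketch of the computation behind this curve (the scores and target flags below are hypothetical):

    import numpy as np

    scores = np.array([720, 680, 640, 600, 560, 520, 480, 440])   # hypothetical scores
    target = np.array([1, 1, 0, 1, 0, 1, 0, 0])                   # 1 = target category
    order = np.argsort(-scores)                                   # decreasing scores
    selected = np.arange(1, len(scores) + 1) / len(scores)        # horizontal axis
    captured = np.cumsum(target[order]) / target.sum()            # vertical axis
    # plotting `selected` against `captured` yields the gain curve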


ROC Curve (Receiver Operating Characteristic)

Sensitivity: percentage of the target category captured (GOOD classified as GOOD).
Specificity: percentage of the other category well classified.
1 - Specificity: percentage of the other category misclassified into the target category
(BAD misclassified as GOOD).

The closer the curve is to the upper left part of the graph, the better the separation
between the two categories of the target variable.
When the densities are equal, the ROC curve coincides with the diagonal of the square.
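A sketch of how the points of the curve are obtained (same hypothetical scores as in the gain curve sketch; each threshold yields one point):

    import numpy as np

    scores = np.array([720, 680, 640, 600, 560, 520, 480, 440])   # hypothetical scores
    is_good = np.array([1, 1, 0, 1, 0, 1, 0, 0])                  # 1 = GOOD (target)
    for threshold in np.unique(scores):
        assigned = scores >= threshold                            # assigned to GOOD
        sensitivity = (assigned & (is_good == 1)).sum() / (is_good == 1).sum()
        one_minus_spec = (assigned & (is_good == 0)).sum() / (is_good == 0).sum()
        # (one_minus_spec, sensitivity) is one point of the ROC curve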


APPLY THE SCORING FUNCTION TO A NEW DATASET


Firstly, you need to archive the model rules using the Predictive model rules file method
from the Deployment Archiving\Archiving rubric. Connect this new method to the
scoring method as follows:

- Give a name to the rule file and specify its location.
- Import the new dataset on which you want to apply the archived model.
- Connect the Predictive model deployment method from the Deployment
Archiving\Deployment rubric and configure it.
- Run this new method and check the data view.


IDT 1 - INTERACTIVE DECISION TREE 1

IDT 2 - INTERACTIVE DECISION TREE 2

The IDT procedure produces decision trees from a data set. It is a discriminant procedure
for predicting the values of a categorical variable (the variable to explain, with K groups)
from a set of explanatory variables that may be categorical, ordinal or continuous.
The IDT procedure gives the user a choice of three well-established Data Mining methods:
CHAID, C4.5 and C&RT. The model produced by the method is a decision tree, which can
be evaluated with a test sample or by crossed validation. The procedure includes
additional options that let you refine the results: adjustment with the a priori group
inclusion probabilities, and the introduction of a cost matrix for incorrect assignments.
The IDT procedure lets the user interactively manipulate the decision tree produced by the
method: pruning from the root, interactive segmentation of a node, and description of the
properties of a segmentation. The procedure also offers a fully interactive mode, in which
the construction of the tree is entirely driven by the user's ideas. Several supporting tools
(a list of the best segmentations, descriptive statistics, etc.) let you choose the tree which
best corresponds to the problem to be solved.
At all stages of the design conceived by the user, it is possible to output reports in HTML
format: on the complete decision tree, or locally on each node, including a subset of the
analyzed database.

To illustrate this method, we use the same dataset as for the scoring function: the Credit
English.sba dataset.


MARGINAL DISTRIBUTIONS OF THE CATEGORICAL VARIABLES

GOOD_BAD
Categories label        Counts   %
GOOD                    237      50.64
BAD                     231      49.36
Overall                 468      100.00

AGE
Categories label        Counts   %
LT 23 years             88       18.80
GE 23 LT 40 years       150      32.05
GE 40 LT 50 years       122      26.07
GE 50 years             108      23.08
Overall                 468      100.00

MARITAL
Categories label        Counts   %
Single                  170      36.32
Married                 221      47.22
Divorced                61       13.03
Widowed                 16       3.42
Overall                 468      100.00

SENIORITY
Categories label        Counts   %
LE 1 year               199      42.52
GT 1 LE 4 years         47       10.04
GT 4 LE 6 years         69       14.74
GT 6 LT 12 years        66       14.10
GT 12 years             87       18.59
Overall                 468      100.00

SALARY
Categories label        Counts   %
SALARY AT THE BANK      316      67.52
NO SALARY               152      32.48
Overall                 468      100.00

SAVINGS
Categories label        Counts   %
No saving               370      79.06
LT 10 KF Sav.           58       12.39
GE 10 LT 100 KF Sav.    32       6.84
GE 100 KF Sav.          8        1.71
Overall                 468      100.00

JOB
Categories label        Counts   %
Executive               77       16.45
Employee                237      50.64
Other                   154      32.91
Overall                 468      100.00

CHECKIN ACCOUNT
Categories label        Counts   %
LT 2KF Account          98       20.94
GE 2 LT 5KF Account     308      65.81
GE 5KF Account          62       13.25
Overall                 468      100.00

AVERAGE TRANSACTIONS
Categories label        Counts   %
LT 10 KF Trans.         154      32.91
GE 10 LT 30 KF Trans    71       15.17
GE 30 LT 50 KF Trans    129      27.56
GE 50 KF Trans.         114      24.36
Overall                 468      100.00

WITHDRAWALS
Categories label        Counts   %
LT 40 With.             171      36.54
GE 40 LT 100 With.      161      34.40
GE 100 With.            136      29.06
Overall                 468      100.00

NEGATIVE ACCOUNT BALANCE
Categories label        Counts   %
Allowed                 202      43.16
Not allowed             266      56.84
Overall                 468      100.00

CHEQUE AUTHORIZATION
Categories label        Counts   %
CHEQUE OK               415      88.68
NO CHEQUE               53       11.32
Overall                 468      100.00


IDT 1
The IDT1 procedure prepares the data for the construction of the tree (procedure IDT2). In
particular, it handles the missing data of the selected variables. The procedure outputs a
report on the treatment of the missing data.
By default, you also have an automatic characterization of the variable to discriminate by
the set of selected explanatory variables.
This characterization allows a better selection of the explanatory variables, for example by
removing all those that have no connection with the variable to discriminate.


IDT 2
The IDT2 procedure constructs an initial segmentation tree as a function of the chosen
method (CHAID, C&RT, C4.5) and the associated parameters.
After the execution of the procedure, a graphical icon to handle the initial tree is available
to the right of the method.

The CHAID algorithm


The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of
the oldest tree classification methods, originally proposed by Kass (1980); according to
Ripley (1996), the CHAID algorithm is a descendant of THAID, developed by Morgan and
Messenger (1973). CHAID builds non-binary trees (i.e., trees where more than two
branches can attach to a single root or node), based on a relatively simple algorithm that is
particularly well suited for the analysis of larger datasets.
The method relies on a basic algorithm that constructs (non-binary) trees and, for
classification problems (when the dependent variable is categorical in nature), uses the
Chi-square test to determine the best next split at each step. Specifically, the algorithm
proceeds as follows:
Preparing predictors. The first step is to create categorical predictors out of any
continuous predictors by dividing the respective continuous distributions into two
categories. For categorical predictors, the categories (classes) are "naturally" defined.
Merging categories. The next step is to cycle through the predictors to determine, for each
predictor, the pair of (predictor) categories that is least significantly different with respect
to the dependent variable; for classification problems (where the dependent variable is
categorical as well), a Chi-square test (Pearson Chi-square) is computed. If the respective
test for a given pair of predictor categories is not statistically significant, as defined by an
alpha-to-merge value, the respective predictor categories are merged and this step is
repeated (i.e., the next pair of categories is searched, which now may include previously
merged categories). If the statistical significance for the respective pair of predictor
categories is significant (less than the respective alpha-to-merge value), then (optionally) a
Bonferroni adjusted p-value is computed for the set of categories of the respective
predictor.
Selecting the split variable. The next step is to choose, for the split, the predictor variable
with the smallest adjusted p-value, i.e., the predictor variable that will yield the most
significant split; if the smallest (Bonferroni) adjusted p-value for any predictor is greater
than some alpha-to-split value, no further splits are performed and the respective node is
a terminal node.
Continue this process until no further splits can be performed (given the alpha-to-merge
and alpha-to-split values).
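The merging step can be sketched as follows (a Python illustration with a hypothetical contingency table, not SPAD code):

    from scipy.stats import chi2_contingency

    # Two predictor categories (rows) crossed with GOOD/BAD (columns); counts assumed
    pair = [[120, 110],
            [117, 121]]
    chi2, p, dof, expected = chi2_contingency(pair)
    alpha_to_merge = 0.05
    if p > alpha_to_merge:
        print("not significantly different: merge the two categories")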


The CHAID algorithm is particularly well suited for the analysis of larger datasets and for
a first exploration of the data.


The CART algorithm (written C&RT in SPAD because of the copyright)


C&RT builds classification and regression trees for predicting continuous dependent
variables (regression) and categorical predictor variables (classification). The classic C&RT
algorithm was popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984;
see also Ripley, 1996).
For classification, the CART algorithm uses the Gini impurity criterion. It is based on
squared probabilities of membership for each target category in the node. It reaches its
minimum (zero) when all cases in the node fall into a single target category.
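As an illustration (a sketch, not SPAD's internal code), the Gini impurity of a node can be computed from its per-category counts; the counts below reuse the training sample sizes seen earlier:

    def gini_impurity(counts):
        """Gini impurity of a node from its per-category case counts."""
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    print(gini_impurity([178, 0]))    # a pure node: impurity 0.0
    print(gini_impurity([178, 173]))  # a nearly balanced node: ~0.5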
CART is based on a decade of research, assuring stable performance and reliable results.
CART's proven methodology is characterized by:
Reliable pruning strategy - the CART algorithm considers that no stopping rule can be
relied on to discover the optimal tree, so CART integrates the notion of over-growing trees
and then pruning back; this idea, fundamental to CART, ensures that important structure
is not overlooked by stopping too soon.
Powerful binary-split search approach - CART's binary decision trees are more sparing
with data and detect more structure before too little data are left for learning. Other
decision-tree approaches use multi-way splits that fragment the data rapidly, making it
difficult to detect rules that require broad ranges of data to discover.
Automatic self-validation procedures - in the search for patterns in databases, it is
essential to avoid the trap of "overfitting", i.e. finding patterns that apply only to the
training data. CART's embedded test disciplines ensure that the patterns found will hold
up when applied to new data. Further, the testing and selection of the optimal tree are an
integral part of the CART algorithm.
In addition, CART accommodates many different types of modeling problems by
providing a unique combination of automated solutions:
- surrogate splitters intelligently handle missing values;
- adjustable misclassification penalties help avoid the most costly errors.
The classification and regression tree (C&RT) algorithms are generally aimed at achieving
the best possible predictive accuracy. Operationally, the most accurate prediction is
defined as the prediction with the minimum costs. The notion of costs was developed as a
way to generalize, to a broader range of prediction situations, the idea that the best
prediction has the lowest misclassification rate.
In most applications, the cost is measured in terms of proportion of misclassified cases, or
variance. In this context, it follows that a prediction would be considered best if it has the
lowest misclassification rate or the smallest variance. The need for minimizing costs,
rather than just the proportion of misclassified cases, arises when some prediction errors
are more catastrophic than others, or when some prediction errors occur more frequently
than others.


The C4.5 algorithm


C4.5 belongs to a succession of decision tree learners that trace their origins back to the
work of Hunt and others in the late 1950s and early 1960s (Hunt, 1962). Its immediate
predecessor was ID3 (Quinlan, 1979).
C4.5 uses the gain ratio to select the attribute whose value will be tested to decide how the
original training set will be partitioned. The gain ratio is a refinement of the information
gain used in C4.5's precursor ID3.
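The gain ratio can be sketched as follows (the split counts below are hypothetical):

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    def gain_ratio(parent, children):
        """Information gain of a split, divided by the entropy of the split itself."""
        total = sum(parent)
        gain = entropy(parent) - sum(sum(ch) / total * entropy(ch) for ch in children)
        split_info = entropy([sum(ch) for ch in children])
        return gain / split_info if split_info else 0.0

    # Hypothetical split of a 237 GOOD / 231 BAD root into three child nodes
    print(gain_ratio([237, 231], [[150, 40], [60, 120], [27, 71]]))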
This method presents two particularities:
- As with the CART algorithm, C4.5 integrates the notion of over-growing trees and then
pruning back. The right size of the tree is determined by pruning: in a first step, the tree
is completely developed with the information gain criterion, and it is then pruned in
order to minimize the misclassification rate.
- If the splitter of a parent node is a categorical variable, each category of the splitter
becomes a child node, even if some of them are empty.
C4.5 builds large trees with lots of leaves, and the calculation time is a bit longer.


Partitioning

By random sampling:
If you did not select a Samples definition variable in the IDT1 method, the samples are
chosen by random sampling. You have to define the percentages for the learning set, the
testing set and the pruning set (for CART only).
The Random sampling initialization parameter allows you to define different samples of
the same size.
By category values:
If you have chosen a Samples definition variable in the IDT1 method, the categories are
listed in the window to be assigned to a specific sample: learning, testing, and pruning
(for CART only).


IDT2 parameters
Type of analysis:

By default: Automatic.
Automatic:
The tree is grown automatically with respect to the stopping criteria defined by the user.
Automatic and crossed validation:
The tree is grown automatically and the procedure evaluates the error by crossed
validation. In this case, you have to define the number of divisions or subsets to be used
for the crossed validation.
Interactive:
The tree is not grown at all. In the graphic interface, the user develops it manually.


Thresholds:

Minimum count for cutting (splitting) a segment (node): (by default: 5)
This parameter defines the minimum count for splitting a node. Below this threshold, no
further split can be performed.
By increasing this parameter, one reduces the size of the tree.
Admissible count: (by default: 1)
This parameter defines the minimum count for a leaf after a split.
By increasing this parameter, one reduces the size of the tree.
Number of tree levels: (by default: 10)
This parameter defines the depth of the tree.
By decreasing this parameter, one reduces the size of the tree.
Specialization threshold: (by default: 0.9)
This parameter defines the threshold from which a node is considered to belong to a
single target category; no further split is then performed.
When the target categories present really unbalanced weights, it is recommended to
choose 1.
By decreasing this parameter, one reduces the size of the tree.
Configure the method and run it. Right-click on the IDT2 method, choose the Results
command and click on Interactive decision tree Editor.


INTERACTIVE DECISION TREE EDITOR


Results windows
The tool for viewing the decision tree produced by the IDT procedure comprises several
windows grouped together in a tabbed page. They correspond to the different levels of
information relative to the constructed model.


View the data


In a grid, this window shows all the data under analysis. Used together with the
information window on the nodes, it lets you follow the path of a case in the decision tree.
You can copy the contents of the grid to the clipboard and then paste it into a spreadsheet
application.
The data grid has the format "Cases x Variables":
- the leftmost column, in grey, corresponds to the identifier of the individuals;
- the next column contains the variable to explain;
- the following columns represent the explanatory variables;
- the column furthest to the right indicates the weight associated with each case.


View the Decision Tree


This window offers a graphical representation of the decision tree. You can adjust the
display scale with the Zoom in/Zoom out commands, or by clicking on the corresponding
icons in the toolbar.
The tree is presented horizontally, starting from the root on the left and moving towards
the terminal leaf nodes on the right. Each node shows the distribution of the estimated
conditional probability of the predicted variable, in absolute terms (real counts) and in
relative terms (percentages). In the upper right of the window, a caption is available
associating the categories with the color codes used. Attention! If an adjustment is
requested, the tool shows the adjusted estimated probabilities. In the upper part of the
node is shown the decision rule (variable - operator - value) related to the creation of the
node.
By clicking on a terminal node in the tree, it is possible to obtain additional information,
supplied on the right side of the window: the full path from the root to the active terminal
leaf node, and the relevance of the candidate variables for the segmentation. The latter
may be sorted according to the name of the variable, or according to the value of the
quality of the segmentation (click on the list's header).
You can also explore further the subset of individuals circumscribed by the terminal node,
or control your analysis interactively.


Information on the nodes

When you click on a specific node, you can carry out an in-depth analysis by shifting to
the Local exploration window via the Local exploration menu, or by clicking on the
corresponding tab.
Path information and the relevance of the variables are repeated.
It is also possible to view the individual cases present in the selected node, together with
their values for each variable in the analysis. Note that to each node there corresponds a
conclusion assigned by the method: the individuals that do not correspond to this
conclusion are shown in red.
Finally, it is possible to go deeper into the analysis of the node by requesting, in the lower
part of the window, descriptive statistics for each variable on the whole set of individuals
(the root of the tree).


Information on the Decision Tree


This window lets you judge the quality of the decision tree. The window is divided into
several areas:

Characteristics of the Tree: shows the properties of the decision tree produced by the
method, such as the number of nodes in the tree, the number of terminal leaf nodes and
its maximum depth. Also shown is the size of the sample used for the training, for the
test and, if required, for the pruning.

Impact of the attributes: shows the role of each attribute in the elaboration of the tree.
The value indicated represents the weighted mean of the impact of each attribute over
all the candidate segmentations. Less importance is given to the impacts measured on
the lower parts of the tree.

Confusion Matrix: shows the confrontation between the predictions of the tree and the
observed values of the dependent variable to predict. The matrix may be measured on
the training sample, on the test sample, or in crossed validation. These last two options
are active if they were requested during the parameter setting of the procedure.

Profile: presents the current confusion matrix in the form of a row profile (e.g. to
measure sensitivities) or in the form of a column profile (to measure specificities).


Explore and modify the Decision Tree


The originality of the IDT procedure rests largely on the fact that the user can explore and
edit the tree: either by changing the tree produced by the induction procedure, or by
constructing it from zero using their expert knowledge.
Several tools are made available to the user, allowing them to set the properties of the
node segmentations, while letting them prune the parts of the tree that are of little
interest.
The available operators can be applied either to the whole set of nodes (more generally,
the leaves) or to a previously selected node.
Operations on a node in the Tree
By right-clicking on a node, you have access to the context menu. According to the status
of the node (leaf or internal node), different options are available. These options let the
user specify precisely the tree which will be best suited for the current analysis.
Two main operators are available: Prune for a node within the tree, and Segment for a leaf
node.
Prune a sub-Tree
Pruning a sub-tree consists in deleting the nodes and leaves located beneath a previously
selected node. This operation is necessary when we consider that the corresponding
sub-tree does not add anything to the active analysis, or when we want to manually
induce another segmentation starting from the selected node.
Warning! This operation is only possible on the internal nodes of the tree.
Procedure
1. Select the node from which you wish to start the pruning process
2. Right-click: the Prune menu is available if the node is not a leaf
3. Click on the Prune menu


Segment a Tree node


For each tree node, we have the list of candidate variables for segmentation, with their
respective impacts. At your convenience, you can sort these variables by name or by
relevance, so as to find the variables of interest.
A first originality of the IDT procedure is the ability given to the user to introduce the
segmentation that seems the most pertinent to them, either by following the suggestions
of the IDT method, or by choosing the segmentation variable themselves.
A second very useful innovation is the possibility given to the user to change the
properties of a segmentation themselves, by letting them introduce the discretization limit
for a continuous variable. For example, the method proposes, for a segmentation
according to age, to set the limit at 17.5 years. On the basis of their personal knowledge of
the problem, the user may decide to change this value and manually set a limit of 18
years, corresponding to adulthood.
Segmentation is impossible in three specific cases:
- the node is not a leaf; in other words, it has already been segmented and already has
child nodes;
- the node is empty and there are no cases on the node;
- the node is pure, which means that a single category of the variable to predict is
attached to the node; in this case the decision rule is unambiguous, so it is pointless to
take the analysis any further.
Attention! In this setting, the rules for halting the expansion of the tree are deactivated
(e.g., minimum count on the node, specialization threshold, etc.).


Change the properties of a segmentation


IDT lets the user select the most relevant variable for a segmentation. It also lets the user
change the properties of the segmentation they have selected.
According to the type of the variable involved in the segmentation, the procedure
changes:
- the discretization threshold, for the continuous variables;
- the re-grouping of the categories, for the categorical variables (nominal or ordinal).
Procedure
The procedure for changing the properties of the segmentation has a part in common with
the procedure for manual segmentation.
1. Select the leaf node you want to segment
2. Right-click on the leaf node - for the Segment with... menu to be active, the node must
be a leaf and the segmentation must be possible.
3. In the dialogue box which appears, we see the list of candidate explanatory variables
and the segmentations they propose. The sort order of the variables respects the sort
order requested in the Decision Tree window.
4. To change the properties of the selected variable, click on the Change button
5. Depending on the type of the variable, one of two dialogue boxes appears:

For the continuous variables:
6. the dialogue box indicates the variable on which we are working, and shows the
discretization limit used up until now
7. the user must then enter a new threshold. Attention! The edit area only accepts
numerical values, and the decimal point character is the full stop.
8. validate the new threshold by clicking on the OK button

For the categorical variables (nominal or ordinal):
6. the dialogue box shows, in the list on the left, the sub-trees (leaves coming from the
segmentation procedure), and in the list on the right, the categories available for
developing the sub-trees
7. to change the content of a sub-tree, all its contents (the categories of the explanatory
variable) must be passed to the list on the right with the help of the ">>" button, then
transferred, by default, to another sub-tree with the help of the "<<" button
8. you can add or delete a sub-tree with the help of the "+" and "-" buttons
9. when the changes have been completed, validate the new segmentation with the help
of the OK button


Edit a Tree by levels


The user can examine various options on each node. They also have the ability to
interactively edit the decision tree while moving through the hierarchical structure of the
nodes.
In this context, the procedure carries out the requested operation on all the leaves situated
at the lowest level of the tree.
Two types of operation are available:
- Go up one level: the procedure prunes all the nodes situated on the penultimate level
of the tree.
- Go down one level: for each leaf situated on the last level of the tree, the procedure
looks for the most efficient segmentation. Warning! The rules for stopping the
expansion of the tree are deactivated at this level.
Procedure
According to the operation requested, click on the Operations -- Go up one level menu or
the Operations -- Go down one level menu.

Continue with an Automatic Analysis


At all stages during the exploration and editing of the tree, the user has the option to ask
the procedure to continue the construction of the model automatically, using the options
specified when the method's parameters were set. Users can, for example, identify the
segmentation which they find the most interesting on the tree root, then ask the
application to automatically continue the search for the best tree following this first cut.
All the selected options are active in this context, in particular the rules for halting the
expansion of the tree.
Procedure
Make sure the decision tree is selected.
Click on the menu: Operations -- Automatic analysis.


Save and Backup procedures


On the first execution of the procedure, the decision tree is saved with the title Default
Analysis. This tree is shown by default when the IDT procedure starts up.
You can freely edit and change the tree supplied by default. The results can then be saved
in two different ways:

- you can either save the tree by overwriting the previous version,
- or save a new version of the tree for the same problem, with a suitable title.

At any time you can reload into IDT a decision tree that you have saved. The different
versions are identified by the titles you have given them.
Warning! Changing the analysis parameters automatically deletes all the saves carried
out for the analyzed problem. If you want to keep your work permanently, you are
advised to use reports or exports.
Save the current tree
On the execution of the IDT procedure, a tree with the title Default Analysis is
automatically created. This is the tree shown when the IDT procedure starts up. The user
can personalize this tree and save the results of their changes permanently.
In general, it is possible to save any tree on which the user is working.
Procedure
1. Click on the File -- Save menu
2. IDT deletes the old version of the tree and replaces it with the new one.
Save a new version of the Tree
When working on an analysis, the user may wish to work in parallel on several different
scenarios corresponding to multiple trains of thought: you therefore have the option to
save individual versions of the tree under different names.
Procedure
The user wants to save a new version of the tree, from which they have pruned a branch.
1. Proceed with pruning a part of the tree
2. Then click on the File -- Save as... menu
3. A dialogue box appears, asking the user to give a new title to this version of the tree.
This operation is obligatory, since the different versions are distinguished by their
titles.
4. Click on the OK button
Warning! By clicking on File -- Save, the user overwrites the version in memory.


Load a Decision Tree


At any time in IDT, the user can load into memory a previously saved version of the tree.
For each version of the tree there is a title assigned by the user.
Procedure
1. Click on the menu File -- Open
2. A box lists the different versions associated with the current problem
3. Select the analysis you want by clicking on its title
4. And confirm by clicking on OK
Export Rules
Any decision tree may be transformed into a rule base without loss of information. A rule
is a path leading from the root to a given terminal leaf node. The conclusion associated
with the rule corresponds to the conclusion associated with the leaf node.
The rule is therefore of the form: If condition Then conclusion.
IDT produces the list of rules associated with a tree in HTML format. Additional
information is provided: the support of the rule, which corresponds to the number of
individuals concerned by the rule and is an indicator of its reliability, and the confidence
of the rule, which indicates the percentage of individuals correctly classified by the rule
and is an indicator of its precision.
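A sketch of both indicators on a toy rule (hypothetical cases and helper function, not the IDT export format):

    cases = [{"seniority": "GT 12 years", "class": "GOOD"},
             {"seniority": "GT 12 years", "class": "BAD"},
             {"seniority": "LE 1 year", "class": "BAD"}]

    def rule_stats(cases, condition, conclusion):
        covered = [c for c in cases if condition(c)]      # cases matching the condition
        support = len(covered)                            # reliability indicator
        hits = sum(1 for c in covered if c["class"] == conclusion)
        confidence = hits / support if support else 0.0   # precision indicator
        return support, confidence

    print(rule_stats(cases, lambda c: c["seniority"] == "GT 12 years", "GOOD"))  # (2, 0.5)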
IDT also produces SQL rules and SPAD rules to be applied to new datasets.
Procedure
Click on the menu File -- Export Rules in HTML, SPAD or SQL format.
Whatever your previous choice, a dialog box appears in which you can enter the name of
the generated file and its location, or an SQL editor opens to generate Select or Update
rules.
Then click on the Save button.
