
MODIFIED SEQUENTIAL FENCES FOR IDENTIFYING UNIVARIATE OUTLIERS

Name: WONG HUI SHEIN
Matric number: GS41099
Supervisor: DR. ANWAR FITRIANTO
Co-Supervisor: PROF. HABSHAH MIDI

INTRODUCTION
Outliers
An outlier is an observation that appears to deviate markedly
from the other observations.
Outside rate: the probability that an observation from an
uncontaminated normal population lies beyond a
specified limit or boundary.
Outliers may arise from mistakes in recording, from the malfunction of a
measuring instrument, or from natural variability that originates
outside the sampled population.
They have great influence on parametric data analyses and
can lead to misleading results.

Swamping And Masking

The swamping effect occurs when a second
observation is labeled as an outlier only in the
presence of a first outlier. After the first
outlying observation is discarded, the second observation is
detected as a clean observation.
The masking effect occurs when a second observation
is classified as an outlier only in the absence of
the first outlier. After the first outlier is eliminated,
the second observation appears as an
outlier.

Outside rate
The outside rate is the probability that an observation from an
uncontaminated normal population lies
beyond a specified limit or boundary.
The outside rate per sample is the
probability that at least one sample
observation is incorrectly classified as an
outlier.

Tukey's Boxplot
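In its standard form, Tukey's rule places fences at Q1 - 1.5 IQR and Q3 + 1.5 IQR and flags anything outside them. The sketch below is only an illustration of that rule: the function names are hypothetical, and the quartiles use the standard library's "inclusive" definition, which is an assumption and may differ slightly from Tukey's hinges.

```python
import statistics


def tukey_fences(data, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Quartile convention is an assumption; other conventions shift the fences slightly.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(data, k=1.5):
    """Return the observations lying outside the Tukey fences."""
    lower, upper = tukey_fences(data, k)
    return [x for x in data if x < lower or x > upper]


sample = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 9.5]
print(tukey_fences(sample))
print(flag_outliers(sample))
```

The multiplier k is fixed here, whereas the sequential fences described next let the multiplier depend on the sample size and on how many outliers have already been considered.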

Sequential fences
Sequential fences by Schwertman and de Silva (2007)

$$\text{SDSF} = q_2 \pm \frac{t_{df,\,\alpha_{nm}}}{k_n}\,(\text{IQR})$$

OBJECTIVES
The study focuses on the modification of sequential fences. The
general objectives of this study are
i) to propose a method for outlier identification in symmetric
distributions with higher accuracy and lower misclassification of
non-contaminated observations as outliers
ii) to increase the accuracy in detecting the real outliers in
asymmetric distributions with minimum swamping and masking
effects
iii) to provide efficient sequential fences for identifying outliers
in a wider range of distributions, with a new algorithm and parameter
estimation.

LITERATURE REVIEW

Many outlier labeling methods have been introduced and developed by
researchers in the past literature: Barnett (1978), Barnett and Lewis (1994), Cao
et al. (2010), Cerioli (2010), Dang and Serfling (2010), Dovoedo and
Chakraborti (2013, 2015), Hawkins (2006), Louni (2008), Schwertman et al.
(2004), Schwertman and de Silva (2007), Tukey (1977).
Generally, the methods can be categorized into two kinds, namely formal
and informal techniques.
Formal tests: test statistics used within hypothesis testing. The null hypothesis
is usually based on an assumed distribution, and the test checks, at a given level
of significance, whether the target extreme value belongs to that distribution or
is an outlier (Iftikhar, 2011; Seo, 2006; Barnett and Lewis, 1994;
Iglewicz and Hoaglin, 1993).
Informal tests: an interval or criterion for outlier detection is created
instead of a hypothesis test. Any data point that falls outside the interval or criterion
is flagged as an outlier (Barnett and Lewis, 1994; Iglewicz and Hoaglin,
1993; Davies and Gather, 1993; Bendre and Kale, 1987).

First Contribution:
OUTLIERS DETECTION USING
SEQUENTIAL FENCES WITH DIFFERENT
ROBUST SCALES
Problem Statement
The popular Tukey's boxplot is too liberal and causes
many unusual observations to be overlooked.
The sequential fences method (SDSF) uses the
interquartile range to measure the dispersion of the
data.
A natural question is whether the sequential fences method can
be developed using an alternative robust scale
to measure the dispersion of the data, and so improve its
performance in detecting outliers.
When the total number of outliers is unknown, swamping
and masking effects might occur.

OBJECTIVES
Based on the technique proposed by Schwertman and
de Silva (2007), we propose modified methods that
substitute the interquartile range (IQR) with appropriate
robust scales
i) to increase the effectiveness in correctly identifying the
extreme values under certain outside rates
ii) to minimise the misclassification of uncontaminated
observations as outliers
iii) to compare the modified sequential fences methods
with ESD and SDSF

LITERATURE REVIEW

1. Tukey (1977)
   Box plot.
2. Rosner (1983)
   Generalized extreme studentized deviate (ESD) test.
3. Hoaglin et al. (1986)
   Defined the outside rate per observation.
4. Hoaglin and Iglewicz (1987)
   Modified the traditional box plot method and made some adjustments for the sample size.
5. Kimber (1990)
   Fences procedure using a constant multiple of the lower semi-interquartile range
   and the upper semi-interquartile range.
6. Barnett and Lewis (1994)
   Listed 47 procedures.
   Methodology for detection of outliers was adapted from the normal to
   other popular distributions.
7. Iglewicz and Banerjee (2001)
   Adjustment formulations and graphs.
   Handle smaller samples in both symmetric and skewed distributions.
8. Rosner (1983) and Schwertman and de Silva (2007)
   A low outside rate is necessary to reduce the swamping effect.
9. Carling (2000)
   Modified Tukey's method by using the median.
   Size-dependent formula to adjust the boxplot for sample size.
10. Schwertman et al. (2004)
   Simple and more flexible in setting the outside rate.
11. Schwertman and de Silva (2007)
   Sequential fences.
12. Hubert and Vandervieren (2008)
   Whiskers of the boxplot were modified.
   Medcouple (MC) to assess the skewness.
13. Dovoedo and Chakraborti (2015)
   Modified the traditional boxplot.
   Multiples of the lower and upper semi-interquartile range (SIQR).
   Outside rate is derived for the location-scale family of distributions.

METHODOLOGY
Gini's mean difference (GMD)

$$\text{GMD} = \frac{1}{\binom{n}{2}} \sum_{i<j} |x_i - x_j|$$

Initiated by Gini (1912)
Gerstenberger and Vogel (2015): a superior scale estimator, even
under normality, that performs well over a wide range of distributions,
including distributions with heavier tails than the normal distribution
Combines the benefits of the mean deviation and the standard deviation
Does not depend on any measure of central location
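As a small illustration of the definition above, Gini's mean difference is the average absolute difference over all distinct pairs of observations. The function name below is hypothetical, and the direct pairwise loop is a sketch suited to small samples rather than an optimised routine.

```python
from itertools import combinations


def gini_mean_difference(data):
    """Average of |x_i - x_j| over all pairs i < j."""
    n = len(data)
    if n < 2:
        raise ValueError("at least two observations are required")
    total = sum(abs(a - b) for a, b in combinations(data, 2))
    return total / (n * (n - 1) / 2)  # divide by the number of pairs, C(n, 2)


# Pairs of [1, 2, 4] give differences 1, 3 and 2, so the mean difference is 2.0
print(gini_mean_difference([1, 2, 4]))
```

No location estimate (mean or median) appears in the computation, which is the sense in which the scale is independent of any central measure.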

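$$\text{GSF} = q_2 \pm \frac{t_{df,\,\alpha_{nm}}}{k_n}\,(\text{GMD})$$

$$P(X \ge m) = 1 - e^{-n\alpha_{nm}} \left[ 1 + n\alpha_{nm} + \frac{(n\alpha_{nm})^2}{2!} + \cdots + \frac{(n\alpha_{nm})^{m-1}}{(m-1)!} \right]$$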
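The Poisson expression for P(X >= m) can be inverted numerically to obtain the tail probability for the m-th fence. The sketch below assumes, and this is an assumption not spelled out on the slide, that the tail probability alpha_nm is chosen so that P(X >= m) equals the nominal outside rate (called gamma here). The function names and the bisection approach are illustrative only.

```python
import math


def prob_at_least_m(lam, m):
    """P(X >= m) for X ~ Poisson(lam): one minus the sum of the first m terms."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for k in range(m):
        cdf += term
        term *= lam / (k + 1)  # P(X = k + 1) from P(X = k)
    return 1.0 - cdf


def solve_alpha(n, m, gamma, tol=1e-12):
    """Find alpha in (0, 1) with P(X >= m) = gamma, X ~ Poisson(n * alpha).

    Assumes gamma is attainable on that interval; uses bisection because
    P(X >= m) increases with alpha.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_at_least_m(n * mid, m) < gamma:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for m in range(1, 4):
    print(m, solve_alpha(n=20, m=m, gamma=0.05))
```

For m = 1 the equation has the closed form alpha = -ln(1 - gamma) / n, which gives a quick check on the numerical solution. Larger m yields a larger alpha, so the later fences sit closer to the median.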

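Other proposed methods with different robust scales

Semi-interquartile range (SIQR), based on the sample quartiles:
$$\text{SIQRSF} = q_2 \pm \frac{t_{df,\,\alpha_{nm}}}{k_n}\,(\text{SIQR})$$

Sn:
$$S_n = 1.1926\,\operatorname{med}_i \left\{ \operatorname{med}_j |x_i - x_j| \right\}$$
$$\text{SnSF} = q_2 \pm \frac{t_{df,\,\alpha_{nm}}}{k_n}\,(S_n)$$

Qn, a scaled order statistic of the pairwise distances:
$$Q_n = 2.2219\,\left\{ |x_i - x_j| ;\ i < j \right\}_{(k)}$$
$$\text{QnSF} = q_2 \pm \frac{t_{df,\,\alpha_{nm}}}{k_n}\,(Q_n)$$

Median absolute deviation (MAD):
$$\text{MAD} = \operatorname{med}_i \left| x_i - \operatorname{med}_j (x_j) \right|$$
$$\text{MADSF} = q_2 \pm \frac{t_{df,\,\alpha_{nm}}}{k_n}\,(\text{MAD})$$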
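The robust scales above can be computed directly from their definitions for small samples. The sketch below is a naive illustration with hypothetical function names: it uses ordinary medians in place of the low and high medians of the original Sn definition, omits finite-sample correction factors, and takes k for Qn as C(h, 2) with h = floor(n/2) + 1, which is the usual convention and is assumed here.

```python
import statistics
from itertools import combinations


def mad(data):
    """Median absolute deviation from the median (no consistency constant)."""
    med = statistics.median(data)
    return statistics.median(abs(x - med) for x in data)


def sn(data):
    """Naive Sn: 1.1926 * med_i( med_j |x_i - x_j| ), using ordinary medians."""
    inner = [statistics.median(abs(xi - xj) for xj in data) for xi in data]
    return 1.1926 * statistics.median(inner)


def qn(data):
    """Naive Qn: 2.2219 times the k-th smallest pairwise distance."""
    n = len(data)
    h = n // 2 + 1
    k = h * (h - 1) // 2  # assumed convention: k = C(h, 2)
    diffs = sorted(abs(a - b) for a, b in combinations(data, 2))
    return 2.2219 * diffs[k - 1]


sample = [1, 2, 3, 4, 100]
print(mad(sample), sn(sample), qn(sample))
```

All three stay small for this sample even though one observation is far from the rest, which is the property that motivates substituting them for the IQR in the fences.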

Generalised Extreme Studentised Deviate Test (ESD)

$$\lambda_i = \frac{(n-i)\,t_{p,v}}{\sqrt{\left(n-i-1+t_{p,v}^2\right)(n-i+1)}}, \qquad i = 1, 2, 3, \ldots, r$$
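The critical values are compared with test statistics computed sequentially: at each step the observation farthest from the current mean is recorded, its studentised deviation R_i = max|x - mean| / s is computed, and the observation is removed before the next step. The sketch below illustrates only this sequential part; the function name is hypothetical, and the critical values, which need quantiles of the t distribution, are not computed here.

```python
import statistics


def esd_statistics(data, r):
    """Return [(observation number, value, R_i)] for i = 1..r.

    At each step the most extreme remaining observation is removed,
    so later statistics are computed on progressively smaller samples.
    """
    remaining = list(enumerate(data, start=1))  # keep 1-based observation numbers
    results = []
    for _ in range(r):
        values = [v for _, v in remaining]
        mean = statistics.mean(values)
        sd = statistics.stdev(values)
        idx, val = max(remaining, key=lambda pair: abs(pair[1] - mean))
        results.append((idx, val, abs(val - mean) / sd))
        remaining.remove((idx, val))
    return results


for obs, value, stat in esd_statistics([10, 11, 9, 10, 30], r=2):
    print(obs, value, round(stat, 4))
```

In Rosner's procedure the number of outliers is taken as the largest i for which R_i exceeds its critical value, so all r statistics are computed before any decision is made.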

NUMERICAL RESULTS & DISCUSSION

Example 1
The wood specific gravity data set from Draper and
Smith (1966), as contaminated by Rousseeuw and Leroy
(1987), is considered.

The data consist of 20 observations.

The contaminated observations are observations 4, 6, 8 and 19,
which lie 1.078, 1.480, 1.649 and 2.115 standard deviations
from the mean of the data.

[Figure: Sequential fences (m = 1 to 6) for the wood specific gravity data, plotted as observation value (x) against observation number (id). Panels show GSF (using Gini), SDSF (using IQR), SIQRSF, QnSF, SnSF and MADSF, each with outside rate 0.25 and with outside rate 0.20.]

Table 2: Result of ESD method for wood specific gravity data

                                                      Critical value λ_i   Critical value λ_i
      Obs.               Standard      Test           at significance      at significance
 i    number   Mean      deviation s   statistic R_i  level 0.25           level 0.20
 1    19       0.501     0.047293      2.1145         2.1213               2.1901
 2    8        0.50626   0.0421437     1.9756         2.0985               2.1672
 3    6        0.51089   0.0380802     2.0979         2.0743               2.1428
 4    4        0.5156    0.0334422     1.9616         2.0484               2.1167
 5    3        0.5197    0.0298032     1.6882         -                    2.0884
 6    20       0.5163    0.0275465     1.8756         -                    2.0589
 7    11       0.5126    0.0244371     1.6924         -                    2.0264
 8    5        0.5095    0.0222134     1.7349         -                    1.9912
 9    9        0.5063    0.0197996     1.5783         -                    1.9527
 10   2        0.5091    0.0180192     1.5589         -                    1.9102
 11   10       0.5119    0.016258      1.5931         -                    1.8630
 12   13       0.5148    0.0142897     1.5940         -                    1.8098
 13   2        0.5176    0.0122467     1.4187         -                    1.7491
 14   1        0.5151    0.0108386     1.7398         -                    1.6785
 15   17       0.5120    0.0076158     1.3131         -                    1.5943

* Existing method

Example 2
Newcomb's third series of measurements on the
passage time of light, made from July 24, 1882 to Sept. 5, 1882
and made available in Stigler (1977), is considered.
Three randomly chosen observations have been contaminated by
adding a fixed shift of 2.5 standard deviations.
The contaminated observations are observations 6, 9
and 10: two outliers situated at the lower tail and
one outlier situated at the upper tail.

[Figure: Sequential fences for Newcomb's data, plotted as observation value (x) against observation number (id). Panels show (a) GSF (using Gini) with outside rate 0.25, (b) SDSF (using IQR) with outside rate 0.25 and (e) SnSF (using Sn) with outside rate 0.25.]

Table 4: Result of ESD test for Newcomb's data

                                                          Critical value λ_i   Critical value λ_i
 Obs.                  Standard      Test                 at significance      at significance
 number   Mean         deviation s   statistic R_i        level 0.25           level 0.20
 -        27.409091    5.8123933     3.0264485            2.6075519            2.6757258
 -        27.138462    5.4223983     3.1606792            2.6018441            2.6700555
 10       27.40625     5.013375      2.8735632            2.5960354            2.6642843
 55       27.634921    4.7051845     2.4154376            2.5901223            2.6584088
 31       27.451613    4.5111062     2.1166398            2.5841012            2.6524254

* Existing method

SIMULATION STUDY

Table 5: Proportion of uncontaminated observations misclassified as outliers when there is no outlier

No outlier (n=20)
                     Lower Tail                    Upper Tail
Nominal
outside rate    0.05     0.025    0.005       0.05     0.025    0.005
ESD             0.0344   0.0208   0.0031      0.0259   0.0206   0.0035
SDSF            0.0606   0.0366   0.0035      0.0598   0.0376   0.0102
GSF             0.0503   0.0204   0.0023      0.0499   0.0202   0.0019
SIQRSF          0.4531   0.3380   0.2250      0.4564   0.3482   0.2279
SnSF            0.1935   0.1289   0.0510      0.2003   0.1306   0.0500
QnSF            0.0666   0.0396   0.0086      0.0667   0.0365   0.0095
MADSF           0.4759   0.3715   0.2621      0.4777   0.3783   0.2699

No outlier (n=100)
                     Lower Tail                    Upper Tail
Nominal
outside rate    0.05     0.025    0.005       0.05     0.025    0.005
ESD             0.0351   0.0326   0.0044      0.0463   0.0269   0.0021
SDSF            0.0405   0.0243   0.0135      0.0428   0.0220   0.0038
GSF             0.0237   0.0207   0.0036      0.0354   0.0212   0.0020
SIQRSF          0.9612   0.9174   0.7490      0.9589   0.9181   0.7434
SnSF            0.3519   0.2322   0.0775      0.3515   0.2354   0.0829
QnSF            0.2680   0.1633   0.0474      0.2685   0.1674   0.0487
MADSF           0.9649   0.9221   0.7546      0.9608   0.9187   0.7507

[Figure: Plots of the proportion of observations misclassified as outliers versus nominal outside rate at the lower tail (a) and upper tail (b) (without outliers) for sample size 20. Methods shown: ESD, SDSF, GSF, SIQRSF, SnSF, QnSF, MADSF.]

[Figure: Plots of the proportion of observations misclassified as outliers versus nominal outside rate at the lower tail (a) and upper tail (b) (without outliers) for sample size 100. Methods shown: ESD, SDSF, GSF, SIQRSF, SnSF, QnSF, MADSF.]

Table 6: Proportion classified as outliers when there is one outlier at the upper tail

One outlier (n=20)
                     Lower Tail                    Upper Tail (First)
Nominal
outside rate    0.05     0.025    0.005       0.05     0.025    0.005
ESD             0.0636   0.0254   0.0222      0.8436   0.5425   0.3615
SDSF            0.0619   0.0368   0.0125      0.7539   0.5519   0.4173
GSF             0.0450   0.0251   0.0003      0.8206   0.7763   0.5475
SIQRSF          0.6531   0.5380   0.3250      0.8256   0.7635   0.6813
SnSF            0.1806   0.1000   0.0466      0.8163   0.7640   0.4499
QnSF            0.0576   0.0331   0.0077      0.8346   0.7265   0.4236
MADSF           0.6759   0.5714   0.3619      0.8336   0.7543   0.6711

One outlier (n=100)
                     Lower Tail                    Upper Tail (First)
Nominal
outside rate    0.05     0.025    0.005       0.05     0.025    0.005
ESD             0.2292   0.1168   0.0043      0.8818   0.6242   0.4582
SDSF            0.1458   0.1054   0.0095      0.8610   0.775    0.7157
GSF             0.0854   0.0526   0.0035      0.8571   0.7844   0.7416
SIQRSF          0.9612   0.9174   0.7490      0.9685   0.8976   0.8085
SnSF            0.3519   0.2321   0.0774      0.6129   0.4910   0.1496
QnSF            0.2654   0.1611   0.0471      0.7180   0.5909   0.2028
MADSF           0.9649   0.9221   0.7546      0.8638   0.8149   0.7667

[Figure: Plots of the proportion of observations misclassified as outliers at the lower tail (a) and the proportion of correctly classified outliers at the upper tail (b) versus nominal outside rate (with a single outlier) for sample size 100. Methods shown: ESD, SDSF, GSF, SIQRSF, SnSF, QnSF, MADSF.]

Table 7: Proportion classified as outliers when there are two outliers at the upper tail

Two outliers (n=20)
                     Lower Tail                    Upper Tail (First)            Upper Tail (Second)
Nominal
outside rate    0.05     0.025    0.005       0.05     0.025    0.005       0.05     0.025    0.005
ESD             0.0363   0.0227   0.0032      0.7239   0.6631   0.5284      0.7092   0.6551   0.5235
SDSF            0.0549   0.0364   0.0124      0.9363   0.8543   0.5870      0.8943   0.7939   0.5196
GSF             0.0051   0.0138   0.0021      0.9452   0.8590   0.5931      0.8949   0.8473   0.5245
SIQRSF          0.6531   0.5380   0.3250      0.9474   0.8614   0.7788      0.8496   0.8175   0.7546
SnSF            0.1586   0.1048   0.0380      0.8566   0.7943   0.5903      0.6871   0.5794   0.5068
QnSF            0.0492   0.0273   0.0065      0.9370   0.8727   0.7633      0.8428   0.8328   0.6794
MADSF           0.6752   0.5710   0.3612      0.9494   0.8834   0.7495      0.8968   0.8497   0.7464

Two outliers (n=100)
                     Lower Tail                    Upper Tail (First)            Upper Tail (Second)
Nominal
outside rate    0.05     0.025    0.005       0.05     0.025    0.005       0.05     0.025    0.005
ESD             0.0405   0.0207   0.0035      0.9066   0.8306   0.7052      0.8275   0.7575   0.7032
SDSF            0.0894   0.0368   0.0064      0.9865   0.967    0.8593      0.9639   0.9111   0.6843
GSF             0.0241   0.0128   0.002       0.9974   0.9923   0.9844      0.9875   0.964    0.9443
SIQRSF          0.9612   0.9174   0.749       1.0000   1.0000   1.0000      1.0000   1.0000   1.0000
SnSF            0.3515   0.2315   0.0773      1.0000   1.0000   0.9999      1.0000   1.0000   0.9997
QnSF            0.2622   0.1575   0.0466      1.0000   1.0000   0.9999      1.0000   1.0000   0.9998
MADSF           0.9649   0.9221   0.7546      1.0000   1.0000   1.0000      1.0000   1.0000   1.0000

Table 8: Proportion classified as outliers when there are three outliers at the upper tail

Three outliers (n=20)
                     Lower Tail                    Upper Tail (First)            Upper Tail (Second)           Upper Tail (Third)
Nominal
outside rate    0.05     0.025    0.005       0.05     0.025    0.005       0.05     0.025    0.005       0.05     0.025    0.005
ESD             0.0534   0.0232   0.0028      0.7059   0.5004   0.3052      0.7034   0.5002   0.3001      0.6021   0.4967   0.2999
SDSF            0.0689   0.0448   0.0186      0.9746   0.9525   0.8293      0.9583   0.9122   0.829       0.9262   0.8354   0.8283
GSF             0.0043   0.0121   0.0014      0.9874   0.9587   0.9043      0.9691   0.9337   0.9041      0.9573   0.9033   0.8692
SIQRSF          0.6531   0.5380   0.3250      0.9981   0.9972   0.9965      0.9934   0.9896   0.939       0.9916   0.9742   0.8896
SnSF            0.0377   0.0208   0.0042      0.9726   0.9569   0.8022      0.9536   0.9407   0.8022      0.9375   0.8759   0.8020
QnSF            0.1287   0.0807   0.0302      0.8892   0.8356   0.5707      0.8867   0.8346   0.5707      0.8755   0.8296   0.5706
MADSF           0.6712   0.5655   0.3543      0.9126   0.9131   0.8530      0.9126   0.8731   0.8530      0.8124   0.7313   0.7453

Three outliers (n=100)
                     Lower Tail                    Upper Tail (First)            Upper Tail (Second)           Upper Tail (Third)
Nominal
outside rate    0.05     0.025    0.005       0.05     0.025    0.005       0.05     0.025    0.005       0.05     0.025    0.005
ESD             0.3685   0.1296   0.0186      0.7331   0.6835   0.5907      0.7278   0.5892   0.5848      0.6154   0.4982   0.3765
SDSF            0.1489   0.0978   0.0158      0.9939   0.9881   0.6843      0.9897   0.9865   0.8593      0.9826   0.9359   0.7016
GSF             0.2826   0.1228   0.0149      0.9925   0.9836   0.9927      1.0000   1.0000   0.9918      1.0000   1.0000   0.7804
SIQRSF          0.9612   0.9174   0.749       1.0000   1.0000   1.0000      1.0000   1.0000   0.9984      1.0000   1.0000   0.9476
SnSF            0.3497   0.2302   0.077       1.0000   1.0000   0.9999      1.0000   1.0000   0.9848      1.0000   1.0000   0.8997
QnSF            0.2972   0.1557   0.045       1.0000   1.0000   0.9999      1.0000   1.0000   0.9963      1.0000   1.0000   0.9764
MADSF           0.9649   0.9221   0.7546      1.0000   1.0000   1.0000      1.0000   1.0000   0.9999      1.0000   1.0000   0.9998

Table 9: Probability of misclassifying additional observations as outliers for various sample sizes (two tails) when sampling is from the standard normal distribution

Outside                 Sample size
rate      Methods       10       20       30       50       80       100
0.05      SDSF          0.0760   0.0847   0.0956   0.1008   0.1044   0.1176
          GSF           0.0506   0.0551   0.0662   0.0713   0.0818   0.1275
          SIQRSF        0.7092   0.8604   0.9180   0.9714   0.9925   0.9968
          SnSF          0.2946   0.3464   0.3834   0.4420   0.5377   0.5673
          QnSF          0.0654   0.1233   0.1826   0.1715   0.2218   0.2375
          MADSF         0.7628   0.8690   0.9284   0.9745   0.9946   0.9964
0.025     SDSF          0.0373   0.0441   0.0522   0.0606   0.0713   0.0938
          GSF           0.0136   0.0381   0.0438   0.0521   0.0653   0.1235
          SIQRSF        0.5925   0.7616   0.8434   0.9340   0.9797   0.9889
          SnSF          0.2127   0.2367   0.2610   0.3026   0.3764   0.3894
          QnSF          0.0400   0.0697   0.1007   0.1797   0.2618   0.2986
          MADSF         0.6543   0.7818   0.8603   0.9380   0.9783   0.9899
0.005     SDSF          0.0087   0.0091   0.0119   0.0174   0.0192   0.0430
          GSF           0.0005   0.0044   0.0054   0.0057   0.0086   0.0102
          SIQRSF        0.3779   0.5022   0.6143   0.7678   0.8779   0.9122
          SnSF          0.0998   0.0902   0.0943   0.1132   0.1410   0.1491
          QnSF          0.0156   0.0187   0.0295   0.0520   0.0815   0.0929
          MADSF         0.4402   0.5492   0.6497   0.7833   0.8860   0.9179

SUMMARY

Main focus: modify the sequential fences using different robust scales to identify
outliers accurately and to reduce the misclassification of non-contaminated
observations as outliers.
ESD method: reduces the swamping effect when there is no outlier, but is
computationally intensive; suitable for initial screening.
SIQRSF, SnSF, QnSF, MADSF: less likely to be subject to the masking effect, but
vulnerable to swamping.
GSF method: detects most of the outliers and is less likely to incorrectly declare
an uncontaminated observation an outlier, even as the number of outliers
increases.
SDSF: still effective in lowering the rate of misidentifying uncontaminated
observations as outliers in large data sets.
In the simulations, the proportion of uncontaminated observations misclassified
by the GSF method showed a decreasing trend as the nominal rate
decreased. The only exception was for sample size 100 with three outliers, where
the SDSF method was better at lowering the rate of misclassifying
uncontaminated observations at the lower tail.
The research should be continued by increasing the accuracy in detecting all the
outliers and reducing the proportion of misclassification for
larger sample sizes or skewed distributions.

Second Contribution:
ADJUSTED SEQUENTIAL FENCES FOR
DETECTING UNIVARIATE OUTLIERS IN
SKEWED DISTRIBUTION
Problem Statement

A major problem arises in the construction of fences
in skewed distributions.
The sequential fences method proposed by Schwertman and de
Silva (2007) is limited to normal or approximately normal data.
In heavy-tailed symmetric distributions, many uncontaminated data
points will fall beyond the fences, whereas regular data from light-tailed
distributions will scarcely exceed the outlier cut-off values.
Similar situations occur in skewed distributions, where the thicknesses
of the tails on the two sides of the distribution are different.
Consequently, this outlier identification approach will not function
properly when data are skewed.

OBJECTIVES
A technique for incorporating skewness into the construction of
sequential fences is proposed.
We concentrate on univariate data from right-skewed underlying
distributions.
The objectives are:
to identify multiple outliers in symmetric and skewed
distributions.
to improve the outlier detection method in skewed distributions
using sequential fences.
to improve the coverage of the sequential fences in outlier
detection.

REVIEWS OF OUTLIERS DETECTION TECHNIQUES FOR SKEWED DATA

Skewness can be used to evaluate the shape and
asymmetry of a distribution.
A measure of skewness reveals whether the curve of
the distribution is symmetric or distorted, with
positive or negative skewness.
Hence, an adjustment is made to the sequential fences
method by considering the skewness of the distribution
to deal with the outlier identification problem in skewed
distributions.

METHODOLOGY

Incorporating Skewness into Sequential Fences

NUMERICAL RESULTS & DISCUSSION

Example 1: TIMES BETWEEN FAILURES DATA

A real data set, taken from Montgomery (2009), on the time between failures
of a valve that a chemical engineer wishes to control.
The data set contains 20 times between failures.
Initially, the twenty observations do not fail the Anderson-Darling goodness
of fit test.
According to the Easyfit software, the suitable distribution for the data is the
generalized extreme value distribution, with a skewness value of 1.9079.
Although the data are positively skewed, they do not fail the normality test.

Summary of the result of the detection of outliers in
times between failures data using SDSF and ASF

Example 2: Oil Yield for the Belle Ayr Liquefaction data

The Belle Ayr liquefaction data in Montgomery et al. (2001,
Table B.11, Appendix B) concern the thermal liquefaction of
coal and contain 27 observations.
Based on the value of the Anderson-Darling statistic, the test
fails to reject the null hypothesis that the data come from a
normal distribution.
Based on the Easyfit software, the distribution that fits the
data is the Dagum distribution, and the skewness of the
underlying distribution is 1.0266.
This shows that the underlying distribution of the data is slightly
right skewed but does not fail the normality test.

Summary of the result of the detection of
outliers in Oil yield for the Belle Ayr liquefaction
data using SDSF and ASF

SIMULATION STUDY

The efficacy of the proposed method was examined by presenting:
Error rates for various sample sizes at different nominal rates for
symmetric and skewed distributions
The proportion of outlying observations correctly detected by each
method
The proportion of observations mistakenly identified as outliers by each
method, at both tails when there are no outliers and at the
lower tail when the outliers are present at the upper tail
The proportion of cases in which the next regular observation is incorrectly
classified as an outlier after the real outliers have been detected

[Figure: Plots of the average proportion of uncontaminated observations misclassified as outliers at the lower and upper tails for various nominal outside rates when there is no outlier. Vertical axis: outside rate. Methods shown: SDSF and ASF.]

[Figure: SDSF versus ASF. Panels: (a) proportion of misclassification at the lower tail, (b) proportion of correctly identified outliers at the upper tail, (c) proportion of misclassification at the upper tail.]

For simulations with contamination by multiple outliers, in normal and skewed
distributions, the same general pattern as in the previous observations appeared.
The ASF method is substantially better than SDSF: it consistently
identified most of the real outliers, with lower rates of misidentification at
the lower tail, and was less likely to incorrectly classify the next observation as an outlier.

Summary

In this chapter, we focused on the identification of univariate outliers
in symmetric and moderately skewed distributions.

The results on real data and the simulation study indicate that the performance
of ASF in correctly identifying outliers is stable, with a lower rate of
uncontaminated observations misidentified as outliers, as the
skewness and sample size of the data increase.

The SDSF method is better at declaring outliers, but many
uncontaminated observations are misclassified as outliers at the
same time.

In future, the idea of the adjusted sequential fences method can be
extended to detect multivariate outliers and applied to more highly
skewed distributions.

Third Contribution:
SPLIT SAMPLE SEQUENTIAL FENCES BASED
ON BOOTSTRAP CUT OFF POINTS FOR
IDENTIFYING OUTLIERS AND PARAMETER
ESTIMATIONS
Problem Statement

Initial Screening of data

Initial screening of data is important before starting a data
analysis (Tabachnick & Fidell, 2001).
Monte Carlo simulation is a computerized method that
involves random sampling from probability distributions in
order to account for risk in quantitative analysis and decision
making.
However, researchers are often unaware of the accuracy of the
generated data and simply proceed with the data analysis,
interpreting and drawing conclusions from the results (Fitrianto
& Midi, 2011).
This can lead to erroneous conclusions.
The data generating process does not guarantee that the
data are free from outliers. It is therefore necessary to screen the data
before conducting a data analysis.

Bootstrap Resampling
Bootstrapping is a statistical technique for
estimating the sampling
distribution of an estimator by
sampling with replacement from the
actual sample.
The technique assigns measures of
precision, such as bias and root mean
square error (RMSE), to sample estimates.
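The resampling idea can be sketched in a few lines. The example below draws B resamples with replacement, recomputes an estimator on each, and summarises the replicates by their bias and RMSE relative to the estimate from the original sample. The function name, the seed argument and the example data are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_bias_rmse(data, estimator, B=2000, seed=0):
    """Bootstrap bias and RMSE of `estimator`, measured against the sample estimate."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    n = len(data)
    theta_hat = estimator(data)
    # Each replicate is the estimator applied to a resample drawn with replacement
    reps = [estimator(rng.choices(data, k=n)) for _ in range(B)]
    bias = statistics.fmean(reps) - theta_hat
    rmse = statistics.fmean((t - theta_hat) ** 2 for t in reps) ** 0.5
    return bias, rmse


values = [24, 26, 27, 27, 28, 29, 30, 31, 33, 36]
print(bootstrap_bias_rmse(values, statistics.median))
```

Any statistic that accepts a list of numbers can be passed as the estimator, so the same helper applies to trimmed means or trimmed standard deviations.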

OBJECTIVES
An adjustment to the sequential fences (SDSF) involving
bootstrap resampling is proposed. The objectives are
i) to increase the coverage of the data that lie on either
side of the tails so that the method can be applied to different types
of distributions and various sizes of data
ii) to modify the algorithm of the sequential fences
technique by incorporating the screening of data
iii) to compare the accuracy of the proposed
method with SDSF and Tukey's boxplot in parameter
estimation after outlier detection is done.

LITERATURE REVIEWS
DATA SCREENING
Tabachnick and Fidell (2001) recommended a
procedure for screening data in an appropriate
sequence.
According to the past literature, such as Beckman and
Cook (1983), Ahmad et al. (2011), Fitrianto and Midi
(2011), and Tabachnick and Fidell (2011), outlier
detection is a part of the data screening procedures
that should be conducted prior to any statistical analysis.
Thus, screening of data is a vital initial step before beginning
a statistical analysis.

BOOTSTRAPPING IN OUTLIER DETECTION AND PARAMETER ESTIMATION

Efron (1979) introduced the bootstrap, a useful statistical
technique that falls under the broader heading of
resampling.
Singh and Xie (2003) proposed a bootstrap-based
outlier detection plot, known as the
bootlier plot, which is a non-parametric
graphical tool for identifying outliers.

ESTIMATION OF ROBUST ESTIMATORS

METHODOLOGY
Step 1: Screening of Data to Generate Clean Data
Flow chart of the Generating Clean Data (GCD) algorithm

1. Generate data.
2. Screen the data with the outlier detection method, GSF.
3. Is there any outlier?
   No: proceed to further data analysis.
   Yes: remove the data.

Step 2: Detecting Outliers Using Split Sample Sequential Fences with Determination of Cut Off
Points based on Bootstrap Resampling
Flow chart of the proposed split sample method and determination of cut off
points based on bootstrap resampling method (SSFB) algorithm

1. Start with the contaminated sample.
2. Bootstrap resampling with B=2000.
3. Compute the lower and upper fences for each bootstrap resample.
4. Arrange the computed fences in ascending order.
5. Obtain the median of the fences.
6. Check for outliers.
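The bootstrap part of these steps can be sketched as follows. This is only an outline under stated assumptions: the fence function is a stand-in (Tukey's fences with standard-library quartiles) because the sequential fence constants are not given here, the split-sample step is not modelled, and all function names are hypothetical.

```python
import random
import statistics


def tukey_fences(data, k=1.5):
    """Stand-in fence rule (an assumption): Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def bootstrap_fences(data, fence_fn, B=2000, seed=0):
    """Median of the lower and of the upper fences over B bootstrap resamples."""
    rng = random.Random(seed)
    n = len(data)
    lowers, uppers = [], []
    for _ in range(B):
        resample = rng.choices(data, k=n)  # sample with replacement
        lo, hi = fence_fn(resample)
        lowers.append(lo)
        uppers.append(hi)
    # statistics.median sorts internally, which covers the "ascending order" step
    return statistics.median(lowers), statistics.median(uppers)


def check_outliers(data, lower, upper):
    """Observations outside the bootstrap cut-off points."""
    return [x for x in data if x < lower or x > upper]


sample = [float(v) for v in range(10, 30)] + [200.0]
lower, upper = bootstrap_fences(sample, tukey_fences)
print(lower, upper, check_outliers(sample, lower, upper))
```

Passing the fence rule as a function keeps the resampling logic separate from the choice of fences, so a sequential fence implementation could be substituted without changing the bootstrap loop.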

To check the efficacy of the proposed approach, robust estimators are computed:
the trimmed mean and trimmed standard deviation, one-sided and two-sided.
The number of observations to be trimmed is based on the number of
outliers detected by a particular outlier detection method.
Flow chart of the trimmed estimators based on bootstrap resampling (TEB)
algorithm

1. Start with the contaminated sample.
2. Compute the robust estimators for the sample, $\hat{\theta}$.
3. Bootstrap resampling with B=2000.
4. Compute the robust estimators for each bootstrap resample, $\hat{\theta}_1^*, \hat{\theta}_2^*, \hat{\theta}_3^*, \ldots, \hat{\theta}_B^*$, based on the outliers detected.
5. Compute the mean of the robust parameters, $E = \frac{1}{B} \sum_{i=1}^{B} \hat{\theta}_i^*$.
6. Interpretation.

Trimmed mean for one sided and two sided can be obtained as follows:

Similar trimming procedures are carried out for estimating trimmed standard deviation which are
simulated as follows:
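One way to picture the trimming is sketched below. It assumes that the counts of observations to drop from each tail come from the number of outliers detected there, that one-sided trimming drops observations from a single tail, and that two-sided trimming drops them from both. These conventions and the function names are assumptions for illustration, not the exact expressions used in the study.

```python
import statistics


def trimmed_sample(data, n_lower=0, n_upper=0):
    """Drop the n_lower smallest and n_upper largest observations."""
    ordered = sorted(data)
    return ordered[n_lower:len(ordered) - n_upper]


def trimmed_mean(data, n_lower=0, n_upper=0):
    """Mean of the observations that remain after trimming."""
    return statistics.mean(trimmed_sample(data, n_lower, n_upper))


def trimmed_sd(data, n_lower=0, n_upper=0):
    """Sample standard deviation of the observations that remain after trimming."""
    return statistics.stdev(trimmed_sample(data, n_lower, n_upper))


# One-sided: a single outlier detected at the upper tail
print(trimmed_mean([1, 2, 3, 4, 100], n_upper=1))
# Two-sided: one outlier detected at each tail
print(trimmed_mean([-50, 1, 2, 3, 100], n_lower=1, n_upper=1))
```

Because the trimming counts are inputs, the same helpers can be applied to each bootstrap resample with the counts supplied by whichever detection method is being evaluated.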

Performance of Bootstrap Based Split Sample Sequential Fences in Comparison with Existing
Sequential Fences and Tukey's Boxplot

Methods

SDSF: SDSF without Bootstrap
SDSFB: SDSF with Bootstrap
TB: Tukey's boxplot without Bootstrap
TBB: Tukey's boxplot with Bootstrap
SSFB: Proposed Split sample sequential fences with Bootstrap

RESULT OF OUTLIERS
DETECTION

The results reveal that, in the absence of outliers, most of the
methods gave similar results, with no observation misidentified
as an outlier.
The proposed SSFB procedure also did not flag any uncontaminated
observation as an outlier, even in medium and large samples with
multiple outliers.
SSFB was substantially better at not misclassifying uncontaminated
data as outliers, whereas SDSF, SDSFB, TB and TBB were good at
correctly identifying all the outliers but more likely to mislabel
uncontaminated observations as outlying.
When there were multiple outliers, SSFB could still identify the
outliers correctly in symmetric and positively skewed distributions.
The simulation results also show that determining the fences by
bootstrap can reduce the misclassification of uncontaminated data.

Bias and Root Mean Square Error (RMSE) of one-sided trimmed mean

Bias and Root Mean Square Error (RMSE) of two-sided trimmed mean

Bias and RMSE of trimmed standard deviation
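Bias and RMSE summarise how far a set of estimates lies from a target value. A minimal sketch, assuming the target is the corresponding sample estimator and the estimates come from the bootstrap resamples (the function name and the numbers are illustrative, not results from the study):

```python
import numpy as np

def bias_and_rmse(estimates, target):
    """Bias = mean error; RMSE = square root of the mean squared error."""
    errors = np.asarray(estimates, dtype=float) - target
    return errors.mean(), np.sqrt(np.mean(errors ** 2))

# Illustrative values only.
bias, rmse = bias_and_rmse([9.8, 10.1, 10.3, 9.9], target=10.0)
```

A bias near zero means the estimates are centred on the target, and a small RMSE means they are also tightly spread around it.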

SUMMARY
Overall, SSFB performed substantially better than
SDSFB and TBB, giving estimates closer to the
sample estimators.
In summary, the bootstrap resampling results
indicate that the SSFB technique performs well in
identifying outliers and gives better estimates of
robust estimators in most of the scenarios.
In future, research on the SSFB approach can be
extended from the univariate case to identifying
outliers in the bivariate or multivariate case.

CONCLUSION AND FUTURE WORKS
The presence of outliers can distort the data analysis
and lead to misleading conclusions.
Hence, outlier detection is vital and has received
considerable attention from researchers.
Although the existing outlier labeling methods are
usually simple to use, some regular observations
may be falsely detected as outliers and some real
outliers may be overlooked.
In this research, sequential fences are studied and
modified into a better method for identifying
outliers, one that can be applied to a wider range of
data, such as symmetric and skewed distributions.

2) Skewness of the underlying distribution is considered and applied
to evaluate the shape and asymmetry of a distribution.

A measure of skewness shows whether the distribution is symmetric or
distorted by positive or negative skewness.
The sequential fences are adjusted using skewness (ASF).
ASF and SDSF are applied to detect outliers in symmetric and moderately
skewed distributions with different parameters.
In the simulations, SDSF was substantially superior in correctly identifying
the outlier, doing so in nearly 100% of cases.
However, SDSF flagged a large number of uncontaminated data values as
outliers when no outlier was present.
Overall, the ASF approach identified the outliers precisely more often
than the SDSF procedure.
ASF was also less likely than SDSF to misidentify non-contaminated
observations as outliers.
Hence, the approach can be extended in future to identify multivariate
outliers and to handle more highly skewed distributions.

3) GSF is used for the initial screening of simulated data, and the fences are modified
to increase the coverage of the data lying in either tail, so that the proposed
fences can be applied to wider types of distributions and a variety of sample sizes.
Trimmed means and standard deviations are computed to validate the method.
A new construction method for sequential fences was proposed, called Split Sample
Sequential Fences based on Bootstrap Resampling (SSFB).
Bootstrap resampling is a statistical technique that estimates the sampling
distribution by drawing a large number of random resamples, with replacement,
from the observed sample.
Based on the number of outliers detected, population parameters such as the
trimmed mean and trimmed standard deviation, with one-sided and two-sided
trimming, are estimated to demonstrate the superiority of the proposed approach.
For outlier detection, SSFB performed well in identifying the outliers without mistakenly
labeling regular observations as outliers, whereas SDSFB and TBB were better at
detecting most of the outliers but inaccurately flagged many uncontaminated
observations as outliers.
Besides that, with bootstrap resampling implemented in SDSF and Tukey's boxplot (TB),
SDSFB and TBB were found to be more effective in outlier detection, as the
misclassification rate is reduced.
Overall, SSFB performed better than SDSFB and TBB.
The proposed approach gives estimates closer to the sample estimators.
Meanwhile, the advantage of SSFB over SDSFB and TBB is clearer in data sets with
a larger number of outlying observations and in skewed distributions.
In summary, the SSFB technique is effective in identifying outliers and in estimating
robust estimators in most of the scenarios.

In conclusion, our research's objectives have been
achieved.
Some recommendations can be studied further in
future. For instance, GSF can be improved to reduce
the swamping effect in large samples.
This research should be continued by constructing
fences that can detect all outliers and reduce the
proportion of misclassification for larger sample
sizes and skewed distributions.
In future, the adjusted sequential fences method
can be extended to detect multivariate outliers
and applied to more highly skewed distributions.
Besides, research on the SSFB approach can be
extended from the univariate case to identifying
outliers in the bivariate or multivariate case.
