FOR IDENTIFYING
UNIVARIATE OUTLIERS
Name
Matric number: GS41099
Supervisor
Co-Supervisor
INTRODUCTION
Outliers
An outlier is an observation that appears to deviate markedly from the other observations.
Outside rate: the probability that an observation from an uncontaminated normal population lies beyond a specified limit or boundary.
Outliers arise from mistakes in recording, from malfunction of the measuring instrument, or from natural variability that comes from outside the sample.
They have a great influence on parametric data analyses and can lead to misleading results.
Outside rate
The outside rate is the probability that an observation from an uncontaminated normal population lies beyond a specified limit or boundary.
The outside rate per sample is the probability that at least one sample observation is incorrectly classified as an outlier.
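As a rough numerical illustration of these two rates (a sketch, not part of the original study), the snippet below computes the per-observation outside rate of Tukey's 1.5 x IQR fences for a standard normal population, using the population quartiles, and converts it to a per-sample rate. The conversion assumes that the n observations fall outside the fences independently, which is only an approximation when the fences are estimated from the same sample. The function names are illustrative.

```python
from statistics import NormalDist


def outside_rate_per_observation(k: float = 1.5) -> float:
    """P(observation lies beyond q1 - k*IQR or q3 + k*IQR), standard normal population."""
    z = NormalDist()
    q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
    iqr = q3 - q1
    upper = q3 + k * iqr
    # The normal distribution is symmetric, so both tails contribute equally.
    return 2.0 * (1.0 - z.cdf(upper))


def outside_rate_per_sample(p: float, n: int) -> float:
    """P(at least one of n observations is outside), assuming independence (an assumption)."""
    return 1.0 - (1.0 - p) ** n


p = outside_rate_per_observation()
print(round(p, 4))                                # roughly 0.007
print(round(outside_rate_per_sample(p, 20), 3))
```

Even a small per-observation rate grows quickly with the sample size, which is why the per-sample rate is the quantity that the fence construction tries to control.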
Tukey's Boxplot
Sequential fences
Sequential fences by Schwertman and de Silva (2007):
SDSF: $q_2 \pm \dfrac{t_{df,\,\alpha_{nm}}}{k_n}\,(\mathrm{IQR})$
OBJECTIVES
The study focuses on modifying the sequential fences. The general objectives are
i) to propose a method for outlier identification in symmetric distributions with higher accuracy and lower misclassification of non-contaminated observations as outliers
ii) to increase the accuracy in detecting the real outliers in asymmetric distributions with minimum swamping and masking effects
iii) to provide efficient sequential fences for identifying outliers in wider types of distributions, with a new algorithm and parameter estimation.
LITERATURE REVIEW
First Contribution:
OUTLIERS DETECTION USING
SEQUENTIAL FENCES WITH DIFFERENT
ROBUST SCALES
Problem Statement
The popular Tukey's boxplot is too liberal, so many unusual observations are overlooked.
The sequential fences method (SDSF) uses the interquartile range to measure the dispersion of the data.
A natural question is whether the sequential fences method can be developed with an alternative robust scale as the measure of dispersion, to improve its performance in detecting outliers.
When the total number of outliers is unknown, swamping and masking effects might occur.
OBJECTIVES
Based on the technique proposed by Schwertman and
de Silva (2007), we have proposed modified methods by
substituting the interquartile range (IQR) with appropriate
robust scales
i) to increase the effectiveness in correctly identifying the
extreme values under certain outside rates
ii) to minimise the misclassification of uncontaminated
observations as outliers
iii) to compare the modified sequential fences methods
with ESD and SDSF
LITERATURE REVIEW
1. Tukey (1977): Box plot.
2. Rosner (1983): Generalized extreme Studentized deviate (ESD) test.
3. Kimber (1990): Fences procedure using a constant multiple of the lower semi-interquartile range and the upper semi-interquartile range.
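METHODOLOGY
Gini's mean difference (GMD):
$\mathrm{GMD} = \dfrac{1}{\binom{n}{2}} \sum_{i<j} |x_i - x_j|$
GSF: $q_2 \pm \dfrac{t_{df,\,\alpha_{nm}}}{k_n}\,(\mathrm{GMD})$
$P(X \ge m) = 1 - e^{-n\alpha_{nm}} \left[ 1 + n\alpha_{nm} + \dfrac{(n\alpha_{nm})^2}{2!} + \cdots + \dfrac{(n\alpha_{nm})^{m-1}}{(m-1)!} \right]$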
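One way to read the Poisson expression is that, for a sample of size n and the m-th fence, the value of alpha_nm is chosen so that P(X >= m) equals the desired outside rate, where X is Poisson with mean n * alpha_nm. Under that assumption, alpha_nm can be found numerically. The sketch below uses bisection; the symbol `gamma` for the target outside rate and the function names are our own and do not come from the slides.

```python
import math


def poisson_tail(lam: float, m: int) -> float:
    """P(X >= m) for X ~ Poisson(lam), i.e. 1 - exp(-lam) * sum_{j<m} lam**j / j!."""
    term, total = 1.0, 0.0
    for j in range(m):
        if j > 0:
            term *= lam / j          # builds lam**j / j! incrementally
        total += term
    return 1.0 - math.exp(-lam) * total


def solve_alpha_nm(n: int, m: int, gamma: float, tol: float = 1e-12) -> float:
    """Bisection for alpha in (0, 1) such that P(X >= m) = gamma with lam = n * alpha.

    Assumes poisson_tail(n, m) >= gamma, so that a root exists in the bracket.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if poisson_tail(n * mid, m) < gamma:
            lo = mid                 # tail probability too small: increase alpha
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Example: tail areas for the first six fences with n = 25 and outside rate 0.25.
for m in range(1, 7):
    print(m, solve_alpha_nm(25, m, 0.25))
```

For m = 1 the equation has the closed form alpha = -ln(1 - gamma) / n, which gives a quick check of the numerical solution.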
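Semi-interquartile range (SIQR):
SIQRSF: $q_2 \pm \dfrac{t_{df,\,\alpha_{nm}}}{k_n}\,(\mathrm{SIQR})$
Qn: $Q_n = 2.2219\,\{|x_i - x_j|;\ i < j\}_{(k)}$ (the k-th order statistic of the pairwise distances)
QnSF: $q_2 \pm \dfrac{t_{df,\,\alpha_{nm}}}{k_n}\,(Q_n)$
SnSF: $q_2 \pm \dfrac{t_{df,\,\alpha_{nm}}}{k_n}\,(S_n)$
MAD: $\mathrm{MAD} = \mathrm{Med}_i\,|x_i - \mathrm{Med}_j\,x_j|$
MADSF: $q_2 \pm \dfrac{t_{df,\,\alpha_{nm}}}{k_n}\,(\mathrm{MAD})$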
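The robust scales that replace the IQR in these fences are straightforward to compute. The following is a minimal sketch using only the Python standard library. It returns the raw scales; the adjustment constant k_n and the t quantile of the fences are not included because their values are not given here. For Qn, the choice k = C(h, 2) with h = floor(n/2) + 1 follows the usual Rousseeuw and Croux definition and is an assumption about what the slides intend. Function names are illustrative.

```python
from itertools import combinations
from math import comb
from statistics import median, quantiles


def gini_mean_difference(x):
    """Average absolute difference over all pairs i < j."""
    diffs = [abs(a - b) for a, b in combinations(x, 2)]
    return sum(diffs) / len(diffs)


def semi_iqr(x):
    """Half the interquartile range, using statistics.quantiles (default method)."""
    q1, _, q3 = quantiles(x, n=4)
    return (q3 - q1) / 2.0


def mad(x):
    """Median absolute deviation about the median (no consistency constant)."""
    med = median(x)
    return median([abs(v - med) for v in x])


def qn(x):
    """2.2219 times the k-th smallest pairwise distance, k = C(h, 2), h = n // 2 + 1."""
    h = len(x) // 2 + 1
    k = comb(h, 2)
    dists = sorted(abs(a - b) for a, b in combinations(x, 2))
    return 2.2219 * dists[k - 1]


data = [1, 2, 3, 4, 5, 6, 7]
print(gini_mean_difference(data), semi_iqr(data), mad(data), qn(data))
```

Each function can be substituted for the scale term in the fence formula once the matching k_n is available.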
$\lambda_i = \dfrac{(n-i)\,t_{p,v}}{\sqrt{\left(n-i-1+t_{p,v}^{2}\right)(n-i+1)}}, \quad i = 1, 2, 3, \ldots, r$
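The critical values above are compared with the ESD test statistics R_i, which are obtained by repeatedly removing the observation farthest from the current mean. The sketch below computes the sequence of means, standard deviations and R_i, and counts outliers as the largest i for which R_i exceeds its critical value. The critical values are passed in by the caller rather than computed, since a t quantile function is outside the standard library. Function names are illustrative.

```python
from statistics import mean, stdev


def esd_statistics(x, r):
    """Return a list of (observation number, mean, s, R_i) for i = 1..r.

    Observation numbers are 1-based. Requires r <= len(x) - 2 so that the
    standard deviation is defined at every step.
    """
    remaining = list(enumerate(x, start=1))
    out = []
    for _ in range(r):
        values = [v for _, v in remaining]
        m, s = mean(values), stdev(values)
        # Locate the observation farthest from the current mean.
        pos = max(range(len(remaining)), key=lambda k: abs(remaining[k][1] - m))
        obs_no, value = remaining[pos]
        out.append((obs_no, m, s, abs(value - m) / s))
        del remaining[pos]
    return out


def esd_count_outliers(stats, critical_values):
    """Number of outliers: the largest i with R_i greater than its critical value."""
    count = 0
    for i, (row, lam) in enumerate(zip(stats, critical_values), start=1):
        if row[3] > lam:
            count = i
    return count


sample = [10, 11, 9, 10, 12, 10, 11, 9, 30]
for row in esd_statistics(sample, r=2):
    print(row)
```

Taking the largest qualifying i, rather than stopping at the first non-significant R_i, is what lets the procedure guard against masking.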
[Figure: sequential fences for m = 1 to 6 drawn over the observations (horizontal axis: id, 0 to 25; vertical axis: x, 0.40 to 0.65). Panels: GSF, SDSF, SIQRSF, QnSF, SnSF and MADSF, each with outside rate 0.25 and with outside rate 0.20.]
i   Obs. number  Mean     Standard deviation s  Test statistic R_i  Critical value λ_i at level 0.25  Critical value λ_i at level 0.20
1   19           0.501    0.047293              2.1145              2.1213                            2.1901
2   8            0.50626  0.0421437             1.9756              2.0985                            2.1672
3   6            0.51089  0.0380802             2.0979              2.0743                            2.1428
4   4            0.5156   0.0334422             1.9616              2.0484                            2.1167
5   3            0.5197   0.0298032             1.6882                                                2.0884
6   20           0.5163   0.0275465             1.8756                                                2.0589
7   11           0.5126   0.0244371             1.6924                                                2.0264
8   5            0.5095   0.0222134             1.7349                                                1.9912
9   9            0.5063   0.0197996             1.5783                                                1.9527
10  2            0.5091   0.0180192             1.5589                                                1.9102
11  10           0.5119   0.016258              1.5931                                                1.8630
12  13           0.5148   0.0142897             1.5940                                                1.8098
13  2            0.5176   0.0122467             1.4187                                                1.7491
14  1            0.5151   0.0108386             1.7398                                                1.6785
15  17           0.5120   0.0076158             1.3131                                                1.5943
* Existing method
Example 2
Newcomb's third series of measurements on the passage time of light, made from July 24, 1882 to Sept. 5, 1882, with the data made available in Stigler (1977), is considered.
Three randomly chosen observations were contaminated by adding a fixed shift of 2.5 standard deviations.
The contaminated observations are observations 6, 9 and 10: two outliers at the lower tail and one outlier at the upper tail.
[Figure: sequential fences drawn over the observations (horizontal axis: id, 0 to 70; vertical axis: x, 0 to 45). Panels: (a) GSF with outside rate 0.25; (b) SDSF with outside rate 0.25; a third panel showing fences for m = 1 to 6.]
Obs. number  Mean       Standard deviation s  Test statistic R_i  Critical value λ_i at level 0.25  Critical value λ_i at level 0.20
             27.409091  5.8123933             3.0264485           2.6075519                         2.6757258
             27.138462  5.4223983             3.1606792           2.6018441                         2.6700555
10           27.40625   5.013375              2.8735632           2.5960354                         2.6642843
55           27.634921  4.7051845             2.4154376           2.5901223                         2.6584088
31           27.451613  4.5111062             2.1166398           2.5841012                         2.6524254
* Existing method
SIMULATION STUDY

Nominal        Lower Tail                Upper Tail
outside rate   0.05    0.025   0.005     0.05    0.025   0.005
ESD            0.0344  0.0208  0.0031    0.0259  0.0206  0.0035
SDSF           0.0606  0.0366  0.0035    0.0598  0.0376  0.0102
GSF            0.0503  0.0204  0.0023    0.0499  0.0202  0.0019
SIQRSF         0.4531  0.3380  0.2250    0.4564  0.3482  0.2279
SnSF           0.1935  0.1289  0.0510    0.2003  0.1306  0.0500
QnSF           0.0666  0.0396  0.0086    0.0667  0.0365  0.0095
MADSF          0.4759  0.3715  0.2621    0.4777  0.3783  0.2699

No outlier (n=100)
Nominal        Lower Tail                Upper Tail
outside rate   0.05    0.025   0.005     0.05    0.025   0.005
ESD            0.0351  0.0326  0.0044    0.0463  0.0269  0.0021
SDSF           0.0405  0.0243  0.0135    0.0428  0.0220  0.0038
GSF            0.0237  0.0207  0.0036    0.0354  0.0212  0.0020
SIQRSF         0.9612  0.9174  0.7490    0.9589  0.9181  0.7434
SnSF           0.3519  0.2322  0.0775    0.3515  0.2354  0.0829
QnSF           0.2680  0.1633  0.0474    0.2685  0.1674  0.0487
MADSF          0.9649  0.9221  0.7546    0.9608  0.9187  0.7507
[Figure: four panels comparing ESD, SDSF, GSF, SIQRSF, SnSF, QnSF and MADSF; horizontal axes run from 0.00 to 0.07 and vertical axes from 0.0 to 0.5 or from 0.0 to 1.0.]
Nominal
outside rate   0.05    0.025   0.005     0.05    0.025   0.005
ESD            0.0636  0.0254  0.0222    0.8436  0.5425  0.3615
SDSF           0.0619  0.0368  0.0125    0.7539  0.5519  0.4173
GSF            0.0450  0.0251  0.0003    0.8206  0.7763  0.5475
SIQRSF         0.6531  0.5380  0.3250    0.8256  0.7635  0.6813
SnSF           0.1806  0.1000  0.0466    0.8163  0.7640  0.4499
QnSF           0.0576  0.0331  0.0077    0.8346  0.7265  0.4236
MADSF          0.6759  0.5714  0.3619    0.8336  0.7543  0.6711

Nominal
outside rate   0.05    0.025   0.005     0.05    0.025   0.005
ESD            0.2292  0.1168  0.0043    0.8818  0.6242  0.4582
SDSF           0.1458  0.1054  0.0095    0.8610  0.775   0.7157
GSF            0.0854  0.0526  0.0035    0.8571  0.7844  0.7416
SIQRSF         0.9612  0.9174  0.7490    0.9685  0.8976  0.8085
SnSF           0.3519  0.2321  0.0774    0.6129  0.4910  0.1496
QnSF           0.2654  0.1611  0.0471    0.7180  0.5909  0.2028
MADSF          0.9649  0.9221  0.7546    0.8638  0.8149  0.7667
[Figure: Plots for the proportion of misclassified outliers at lower tail (a) and proportion of correctly classified outliers at upper tail (b) versus nominal outside rates (with single outlier) for sample size 100. Methods shown: ESD, SDSF, GSF, SIQRSF, SnSF, QnSF and MADSF.]
Nominal        Lower Tail
outside rate   0.05    0.025   0.005     0.05    0.025   0.005     0.05    0.025   0.005
ESD            0.0363  0.0227  0.0032    0.7239  0.6631  0.5284    0.7092  0.6551  0.5235
SDSF           0.0549  0.0364  0.0124    0.9363  0.8543  0.5870    0.8943  0.7939  0.5196
GSF            0.0051  0.0138  0.0021    0.9452  0.8590  0.5931    0.8949  0.8473  0.5245
SIQRSF         0.6531  0.5380  0.3250    0.9474  0.8614  0.7788    0.8496  0.8175  0.7546
SnSF           0.1586  0.1048  0.0380    0.8566  0.7943  0.5903    0.6871  0.5794  0.5068
QnSF           0.0492  0.0273  0.0065    0.9370  0.8727  0.7633    0.8428  0.8328  0.6794
MADSF          0.6752  0.5710  0.3612    0.9494  0.8834  0.7495    0.8968  0.8497  0.7464
Nominal        Lower Tail
outside rate   0.05    0.025   0.005     0.05    0.025   0.005     0.05    0.025   0.005
ESD            0.0405  0.0207  0.0035    0.9066  0.8306  0.7052    0.8275  0.7575  0.7032
SDSF           0.0894  0.0368  0.0064    0.9865  0.967   0.8593    0.9639  0.9111  0.6843
GSF            0.0241  0.0128  0.002     0.9974  0.9923  0.9844    0.9875  0.964   0.9443
SIQRSF         0.9612  0.9174  0.749     1.0000  1.0000  1.0000    1.0000  1.0000  1.0000
SnSF           0.3515  0.2315  0.0773    1.0000  1.0000  0.9999    1.0000  1.0000  0.9997
QnSF           0.2622  0.1575  0.0466    1.0000  1.0000  0.9999    1.0000  1.0000  0.9998
MADSF          0.9649  0.9221  0.7546    1.0000  1.0000  1.0000    1.0000  1.0000  1.0000
Nominal        Lower Tail
outside rate   0.05    0.025   0.005     0.05    0.025   0.005     0.05    0.025   0.005     0.05    0.025   0.005
ESD            0.0534  0.0232  0.0028    0.7059  0.5004  0.3052    0.7034  0.5002  0.3001    0.6021  0.4967  0.2999
SDSF           0.0689  0.0448  0.0186    0.9746  0.9525  0.8293    0.9583  0.9122  0.829     0.9262  0.8354  0.8283
GSF            0.0043  0.0121  0.0014    0.9874  0.9587  0.9043    0.9691  0.9337  0.9041    0.9573  0.9033  0.8692
SIQRSF         0.6531  0.5380  0.3250    0.9981  0.9972  0.9965    0.9934  0.9896  0.939     0.9916  0.9742  0.8896
SnSF           0.0377  0.0208  0.0042    0.9726  0.9569  0.8022    0.9536  0.9407  0.8022    0.9375  0.8759  0.8020
QnSF           0.1287  0.0807  0.0302    0.8892  0.8356  0.5707    0.8867  0.8346  0.5707    0.8755  0.8296  0.5706
MADSF          0.6712  0.5655  0.3543    0.9126  0.9131  0.8530    0.9126  0.8731  0.8530    0.8124  0.7313  0.7453
Nominal        Lower Tail
outside rate   0.05    0.025   0.005     0.05    0.025   0.005     0.05    0.025   0.005     0.05    0.025   0.005
ESD            0.3685  0.1296  0.0186    0.7331  0.6835  0.5907    0.7278  0.5892  0.5848    0.6154  0.4982  0.3765
SDSF           0.1489  0.0978  0.0158    0.9939  0.9881  0.6843    0.9897  0.9865  0.8593    0.9826  0.9359  0.7016
GSF            0.2826  0.1228  0.0149    0.9925  0.9836  0.9927    1.0000  1.0000  0.9918    1.0000  1.0000  0.7804
SIQRSF         0.9612  0.9174  0.749     1.0000  1.0000  1.0000    1.0000  1.0000  0.9984    1.0000  1.0000  0.9476
SnSF           0.3497  0.2302  0.077     1.0000  1.0000  0.9999    1.0000  1.0000  0.9848    1.0000  1.0000  0.8997
QnSF           0.2972  0.1557  0.045     1.0000  1.0000  0.9999    1.0000  1.0000  0.9963    1.0000  1.0000  0.9764
MADSF          0.9649  0.9221  0.7546    1.0000  1.0000  1.0000    1.0000  1.0000  0.9999    1.0000  1.0000  0.9998
Nominal                  Sample size
outside rate   Methods   10      20      30      50      80      100
0.05           SDSF      0.0760  0.0847  0.0956  0.1008  0.1044  0.1176
               GSF       0.0506  0.0551  0.0662  0.0713  0.0818  0.1275
               SIQRSF    0.7092  0.8604  0.9180  0.9714  0.9925  0.9968
               SnSF      0.2946  0.3464  0.3834  0.4420  0.5377  0.5673
               QnSF      0.0654  0.1233  0.1826  0.1715  0.2218  0.2375
               MADSF     0.7628  0.8690  0.9284  0.9745  0.9946  0.9964
0.025          SDSF      0.0373  0.0441  0.0522  0.0606  0.0713  0.0938
               GSF       0.0136  0.0381  0.0438  0.0521  0.0653  0.1235
               SIQRSF    0.5925  0.7616  0.8434  0.9340  0.9797  0.9889
               SnSF      0.2127  0.2367  0.2610  0.3026  0.3764  0.3894
               QnSF      0.0400  0.0697  0.1007  0.1797  0.2618  0.2986
               MADSF     0.6543  0.7818  0.8603  0.9380  0.9783  0.9899
0.005          SDSF      0.0087  0.0091  0.0119  0.0174  0.0192  0.0430
               GSF       0.0005  0.0044  0.0054  0.0057  0.0086  0.0102
               SIQRSF    0.3779  0.5022  0.6143  0.7678  0.8779  0.9122
               SnSF      0.0998  0.0902  0.0943  0.1132  0.1410  0.1491
               QnSF      0.0156  0.0187  0.0295  0.0520  0.0815  0.0929
               MADSF     0.4402  0.5492  0.6497  0.7833  0.8860  0.9179
SUMMARY
Main focus: modify the sequential fences using different robust scales to identify outliers accurately and to reduce the misclassification of non-contaminated observations as outliers.
ESD method: reduces the swamping effect when there is no outlier, but is computer intensive; suitable for initial screening.
SIQRSF, SnSF, QnSF, MADSF: less likely to be subject to the masking effect, but vulnerable to swamping.
GSF method: detects most of the outliers and is less likely to incorrectly declare an uncontaminated observation an outlier, even as the number of outliers increases.
SDSF: still effective in lowering the rate of misidentifying uncontaminated observations as outliers in large data sets.
From the simulations, the proportion of uncontaminated observations misclassified by the GSF method decreases as the nominal rate decreases. The one exception is sample size 100 with three outliers, where the SDSF method was better at lowering the rate of misclassifying uncontaminated observations at the lower tail.
The research should be continued by increasing the accuracy in detecting all the outliers and reducing the proportion of misclassification for larger sample sizes or skewed distributions.
Second Contribution:
ADJUSTED SEQUENTIAL FENCES FOR DETECTING UNIVARIATE OUTLIERS IN SKEWED DISTRIBUTION
Problem Statement
OBJECTIVES
A technique that incorporates skewness in the construction of sequential fences is proposed.
We concentrate on univariate data from a right-skewed underlying distribution.
The objectives are:
to identify multiple outliers in symmetric and skewed distributions.
to improve outlier detection in skewed distributions using sequential fences.
to improve the coverage of the sequential fences in outlier detection.
METHODOLOGY
A real data set on the average time between failures of a valve that a chemical engineer wishes to control is taken from Montgomery (2009).
The data set contains 20 times between failures.
Initially, the twenty observations do not fail the Anderson-Darling goodness-of-fit test.
According to the EasyFit software, the suitable distribution for the data is the generalized extreme value distribution, with a skewness value of 1.9079.
Although the data are positively skewed, they do not fail the normality test.
SIMULATION STUDY
[Figure: SDSF versus ASF. Panels: (a) proportion of misclassification at lower tail; (b) proportion of correctly identified outliers at upper tail; (c) proportion of misclassification at upper tail. Axis labels appearing in the plots: outside rate; proportion of misidentification.]
Summary
The results on real data and the simulation study indicate that the ASF's performance in correctly identifying outliers is stable, with a lower error of misidentifying uncontaminated observations as outliers, as the skewness and sample size of the data increase.
Third Contribution:
SPLIT SAMPLE SEQUENTIAL FENCES BASED ON BOOTSTRAP CUT OFF POINTS FOR IDENTIFYING OUTLIERS AND PARAMETER ESTIMATIONS
Problem Statement
Bootstrap Resampling
Bootstrapping is a statistical technique for estimating the sampling distribution of an estimator by sampling with replacement from the actual sample.
The technique assigns measures of precision, such as bias and root mean square error, to sample estimates.
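The idea can be sketched in a few lines of Python. The example below resamples with replacement, re-evaluates an estimator on each resample, and reports the bootstrap bias and root mean square error relative to the estimate from the original sample. The function name, the seed handling and the choice of the median as the estimator are illustrative assumptions, not details taken from the study.

```python
import math
import random
from statistics import mean, median


def bootstrap_bias_rmse(sample, estimator, B=2000, seed=0):
    """Bootstrap estimate of the bias and RMSE of `estimator` on `sample`."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    n = len(sample)
    theta_hat = estimator(sample)
    # Each replicate is the estimator evaluated on a resample drawn with replacement.
    reps = [estimator(rng.choices(sample, k=n)) for _ in range(B)]
    bias = mean(reps) - theta_hat
    rmse = math.sqrt(mean([(t - theta_hat) ** 2 for t in reps]))
    return theta_hat, bias, rmse


data = [9.6, 9.7, 9.8, 9.9, 10.0, 10.0, 10.1, 10.2, 10.3, 15.0]
print(bootstrap_bias_rmse(data, median, B=2000))
```

Because the squared RMSE equals the squared bias plus the variance of the replicates, the RMSE can never be smaller than the absolute bias.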
OBJECTIVES
An adjustment to the sequential fences SDSF involving bootstrap resampling is proposed. The objectives are
i) to increase the coverage of the data that lie on either side of the tails so that the method can be applied to different types of distributions and various sizes of data
ii) to modify the algorithm of the sequential fences technique by incorporating the screening of data
iii) to compare the accuracy of the proposed method with SDSF and Tukey's boxplot in parameter estimation after outlier detection is done.
LITERATURE REVIEWS
DATA SCREENING
Tabachnick and Fidell (2001) recommended a procedure for screening data in an appropriate sequence.
According to past literature such as Beckman and Cook (1983), Ahmad et al. (2011), Fitrianto and Midi (2011), and Tabachnick and Fidell (2011), outlier detection is a part of the data screening procedures that should be conducted prior to any statistical analysis.
Thus, screening of data is a vital initial step before beginning a statistical analysis.
ESTIMATION OF ROBUST
ESTIMATORS
METHODOLOGY
Step 1: Screening of Data to Generate Clean Data
Flow chart of the Generating Clean Data (GCD) algorithm:
Generate data -> Screening data with outlier detection method, GSF -> Is there any outlier? -> No
Contaminated sample -> Bootstrap resampling with B = 2000 -> Compute lower and upper fences for each bootstrap resample -> Arrange the computed fences in ascending order -> Obtain the median of the fences
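The bootstrap part of this flow (resample, compute fences, order them, take the median) can be sketched as follows. Because the constants of the proposed sequential fences are not given here, the sketch uses Tukey's 1.5 x IQR fences as a stand-in; that substitution, the function names and the small data set are assumptions made purely for illustration.

```python
import random
from statistics import median, quantiles


def tukey_fences(x, k=1.5):
    """Stand-in fence rule: (q1 - k*IQR, q3 + k*IQR)."""
    q1, _, q3 = quantiles(x, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def bootstrap_median_fences(sample, fence_fn=tukey_fences, B=2000, seed=0):
    """Median of the lower and upper fences over B bootstrap resamples."""
    rng = random.Random(seed)
    n = len(sample)
    lowers, uppers = [], []
    for _ in range(B):
        lo, hi = fence_fn(rng.choices(sample, k=n))   # resample with replacement
        lowers.append(lo)
        uppers.append(hi)
    # median() sorts internally, covering the "arrange in ascending order" step.
    return median(lowers), median(uppers)


data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 9.6, 10.0,
        10.1, 9.9, 10.2, 9.8, 10.0, 10.3, 9.7, 10.1, 9.9, 15.0]
lower, upper = bootstrap_median_fences(data)
print([v for v in data if v < lower or v > upper])
```

Any other fence rule with the same return shape can be passed as `fence_fn`, so the same loop would apply to sequential fences once their constants are available.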
To check the efficacy of the proposed approach, robust estimators are computed:
trimmed mean and trimmed standard deviation, with one-sided and two-sided trimming.
The number of observations to be trimmed is determined by the number of outliers detected by a particular outlier detection method.
Flow chart of the trimmed estimators based on bootstrap resampling (TEB) algorithm:
Contaminated sample -> Compute robust estimators for the sample, $\hat{\theta}$ -> Bootstrap resampling with B = 2000 -> Compute robust estimators for each bootstrap resample, $\hat{\theta}_1^*, \hat{\theta}_2^*, \hat{\theta}_3^*, \ldots, \hat{\theta}_B^*$, based on the outliers detected -> Mean of the robust parameters, $E(\hat{\theta}^*) = \dfrac{1}{B} \sum_{i=1}^{B} \hat{\theta}_i^*$
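A minimal sketch of the TEB steps is given below. It assumes that the numbers of observations trimmed from the lower and upper ends are fixed in advance from the outliers detected in the original sample and are then applied unchanged to every bootstrap resample; the slides do not state this detail, so it is an assumption. Function and parameter names are illustrative.

```python
import random
from statistics import mean, stdev


def trimmed_mean_sd(sample, n_lower=0, n_upper=0):
    """Mean and standard deviation after dropping the n_lower smallest and
    n_upper largest values (one-sided trimming when one of them is 0)."""
    s = sorted(sample)
    kept = s[n_lower: len(s) - n_upper]
    return mean(kept), stdev(kept)       # needs at least two retained values


def teb(sample, n_lower=0, n_upper=0, B=2000, seed=0):
    """Average of the trimmed estimators over B bootstrap resamples."""
    rng = random.Random(seed)
    n = len(sample)
    reps = [trimmed_mean_sd(rng.choices(sample, k=n), n_lower, n_upper)
            for _ in range(B)]
    avg_mean = mean([m for m, _ in reps])
    avg_sd = mean([s for _, s in reps])
    return avg_mean, avg_sd


data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 9.6, 15.0]
print(trimmed_mean_sd(data, n_upper=1))   # estimators on the sample itself
print(teb(data, n_upper=1))               # bootstrap averages
```

Comparing the bootstrap averages with the estimators computed on the sample itself mirrors the comparison used to judge each outlier detection method.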
Interpretation
Methods
RESULT OF OUTLIERS
DETECTION
SUMMARY
Overall, SSFB performed substantially better than SDSFB and TBB, giving estimates closer to the sample estimators.
In summary, the results of bootstrap resampling indicate that the SSFB technique performs well in identifying outliers and gives better estimates of the robust estimators in most of the scenarios.
In future, research on the SSFB approach can be extended from the univariate case to identifying outliers in the bivariate or multivariate case.
3) GSF is used for the initial screening of simulated data, and the fences are modified to increase the coverage of the data that lie on either side of the tails, so that the proposed fences can be applied to wider types of distributions and a variety of sample sizes.
Trimmed means and standard deviations are computed to validate the method.
A new construction method for sequential fences is proposed, called Sequential Fences based on Bootstrap Resampling (SSFB).
Bootstrap resampling is used to estimate the sampling distribution by drawing a large number of random resamples.
Based on the number of outliers detected, population parameters such as the trimmed mean and trimmed standard deviation, with one-sided and two-sided trimming, are estimated to demonstrate the superiority of the proposed approach.
For outlier detection, SSFB performed well in identifying the outliers without mistakenly labelling regular observations as outliers, whereas SDSFB and TBB detect most of the outliers but inaccurately flag many uncontaminated observations as outliers.
In addition, when bootstrap resampling is implemented in SDSF and Tukey's boxplot (TB), the resulting SDSFB and TBB are more effective in outlier detection, as the rate of misclassification of the outliers is reduced.
Overall, SSFB performs better than SDSFB and TBB.
The proposed approach provides estimates closer to the sample estimators.
The advantage of SSFB over SDSFB and TBB is clearer in data sets with a larger number of outlying observations and in skewed data.
In summary, the SSFB technique is effective in outlier identification and in the estimation of robust estimators in most of the scenarios.