3 Centro de Investigação e Desenvolvimento em Matemática e Aplicações (CIDMA), Universidade de Aveiro, Aveiro, Portugal
ERCIM 2010
M. Eduarda Silva and Isabel Silva Detection of outliers in integer-valued AR models
Content
Motivation
Outliers in INAR(1) models
Wavelet analysis
Algorithm for outlier time point detection
Simulation study
Application
Remarks and future work
Motivation
Problem in time series modeling: to detect outliers
INAR(1) processes [McKenzie, 1985, 1988; Al-Osh and Alzaid, 1987; Du and Li, 1991]
$$X_t = \alpha \circ X_{t-1} + e_t = \sum_{j=1}^{X_{t-1}} Y_{t,j} + e_t,$$
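The recursion can be made concrete with a short simulation. This is a minimal sketch, assuming Poisson innovations with mean `lam` and a thinning probability `alpha`; the helper names (`poisson`, `thin`, `simulate_inar1`), the burn-in length and the seed are illustrative choices, not taken from the slides.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def thin(x, alpha, rng):
    """Binomial thinning: sum of x independent Bernoulli(alpha) variables."""
    return sum(1 for _ in range(x) if rng.random() < alpha)

def simulate_inar1(n, alpha, lam, rng, burn_in=200):
    """Simulate n observations of X_t = alpha o X_{t-1} + e_t, e_t ~ Poisson(lam)."""
    # Start from the stationary Poisson(lam / (1 - alpha)) marginal (assumed Poisson case).
    x = poisson(lam / (1.0 - alpha), rng)
    path = []
    for t in range(burn_in + n):
        x = thin(x, alpha, rng) + poisson(lam, rng)
        if t >= burn_in:  # discard the burn-in part of the path
            path.append(x)
    return path

rng = random.Random(2010)  # arbitrary seed, for reproducibility only
x = simulate_inar1(257, alpha=0.3, lam=1.0, rng=rng)
print(len(x), min(x), sum(x) / len(x))
```

Only the standard library is used, so the thinning step is written as an explicit sum of Bernoulli draws, which mirrors the sum over $Y_{t,j}$ in the definition.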
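[Figure: simulated series "INAR(1)" and "INAR(1) with AO"; 257 observations, $\alpha = 0.3$, $\lambda = 1$, S = 100, outlier size $5\sigma = 6$.]

INAR(1) with finitely many AO:
$$Y_t = X_t + \sum_{i=1}^{I} \delta_{t,s_i}\,\theta_i,$$
where $I \in \mathbb{N}$, $s_i \in \mathbb{N}$, $s_i \neq s_j$ for $i \neq j$, $\theta_i \in \mathbb{N}$, and
$$\delta_{k,m} = \begin{cases} 1, & \text{if } k = m, \\ 0, & \text{if } k \neq m. \end{cases}$$

INAR(1) with finitely many IO: the outliers enter through the innovations,
$$\varepsilon_t = e_t + \sum_{i=1}^{I} \delta_{t,s_i}\,\theta_i,$$
where $(e_t)_{t \in \mathbb{Z}}$ are i.i.d. non-negative integer-valued r.v., $I \in \mathbb{N}$, $s_i \in \mathbb{N}$, $s_i \neq s_j$ for $i \neq j$, $\theta_i \in \mathbb{N}$, and $\delta_{k,m}$ is as above.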
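The difference between the two contamination schemes is easiest to see in code: the same shift is added either to the finished series (AO) or to the innovations before the recursion is run (IO). The sketch below assumes Poisson innovations, 0-based time indices, a start value of 0 without burn-in, and hypothetical helper names (`contaminate`, `inar1_path`).

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def thin(x, alpha, rng):
    """Binomial thinning: sum of x independent Bernoulli(alpha) variables."""
    return sum(1 for _ in range(x) if rng.random() < alpha)

def contaminate(series, outliers):
    """Add each outlier size to the series at its time point; outliers = {time: size}."""
    out = list(series)
    for s, size in outliers.items():
        out[s] += size
    return out

def inar1_path(innovations, alpha, rng, x0=0):
    """Run X_t = alpha o X_{t-1} + innovation_t over a given innovation sequence."""
    path, x = [], x0
    for eps in innovations:
        x = thin(x, alpha, rng) + eps
        path.append(x)
    return path

rng = random.Random(1)                         # arbitrary seed
e = [poisson(1.0, rng) for _ in range(257)]    # clean innovations
outliers = {100: 6}                            # one outlier of size 6 at time 100

# AO: contaminate the observations; the shift affects a single time point.
y_ao = contaminate(inar1_path(e, 0.3, rng), outliers)
# IO: contaminate the innovations; the shift propagates through the thinning.
x_io = inar1_path(contaminate(e, outliers), 0.3, rng)
print(y_ao[98:104], x_io[98:104])
```

An additive outlier leaves the underlying process untouched, whereas an innovational outlier is carried forward by the thinning and decays over the following time points.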
Our objective
We propose a wavelet-based method to identify the time point at which an outlier
(AO or IO) occurs in Poisson INAR(1) processes.
Wavelet analysis
Wavelets are a family of basis functions that are used to express and approximate other
functions.
Wavelet analysis consists in decomposing a signal into shifted and scaled versions of the
mother wavelet.
Wavelet analysis is able to look at the data at different scales or resolutions, revealing
aspects of the data such as changes in variance, level changes and other discontinuities.
Thus, wavelets are useful for outlier detection [Bilen and Huzurbazar, 2002; Grané and Veiga, 2010].
Wavelet analysis
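Consider a real-valued function $\psi(\cdot)$ defined on $\mathbb{R}$ satisfying
$$\int_{-\infty}^{\infty} \psi(u)\,du = 0, \qquad \int_{-\infty}^{\infty} \psi^2(u)\,du = 1, \qquad 0 < \int_{0}^{\infty} \frac{|\Psi(f)|^2}{f}\,df < \infty$$
($\Psi(f)$ is the Fourier transform of $\psi(\cdot)$).
Wavelets consist of dilations and translations of $\psi(\cdot)$, called the mother wavelet:
$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right),$$
$b \in \mathbb{R}$ (translation parameter), $a \in \mathbb{R}^{+}$ (dilation parameter).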
Wavelet analysis
Discrete Wavelet Transform (DWT): Multiresolution Analysis
DWT: parameters take only discrete values.
If $a = 2^m$ and $b = n2^m$, $m, n \in \mathbb{Z}$, then there exist $\psi$ such that
$$\psi_{m,n}(t) = \frac{1}{\sqrt{2^m}}\,\psi\!\left(\frac{t}{2^m} - n\right)$$
constitute an orthonormal basis for $L^2(\mathbb{R})$.
In practice
Pyramid algorithm or two-channel subband coder [Mallat, 1989]: filters of different
cutoff frequencies are used to analyze the signal at different scales:
The signal is passed through a series of high-pass filters to analyze the high frequencies:
detail coefficients (low-scale, high-frequency components).
The signal is passed through a series of low-pass filters to analyze the low frequencies:
approximation coefficients (high-scale, low-frequency components).
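The pyramid scheme can be sketched for the simplest filter pair. The example assumes the Haar filters with the sign convention detail = (second sample minus first) / sqrt(2); other references flip the sign, and the function names are illustrative.

```python
import math

def haar_step(signal):
    """One pyramid stage: return (approximation, detail), each half as long."""
    s = 1.0 / math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))  # a trailing unpaired sample is dropped
    approx = [s * (a + b) for a, b in pairs]       # low-pass: scaled moving average
    detail = [s * (b - a) for a, b in pairs]       # high-pass: scaled moving difference
    return approx, detail

def haar_pyramid(signal, levels):
    """Apply haar_step repeatedly to the approximation coefficients."""
    details, approx = [], list(signal)
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)                          # details[0] is the first (finest) level
    return approx, details

signal = [4, 6, 10, 12, 8, 6, 5, 5]
approx, details = haar_pyramid(signal, levels=2)
print(approx)
print(details)
```

Each stage halves the length of the approximation, so the coefficients at level m describe the signal at scale 2^m. Because the transform is orthonormal, the sum of squares of all coefficients equals that of the signal.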
Wavelet analysis
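Haar wavelet:
$$\psi(t) = \begin{cases} -1/\sqrt{2}, & -1 < t \le 0, \\ 1/\sqrt{2}, & 0 < t \le 1, \\ 0, & \text{otherwise.} \end{cases}$$
[Figure: plot of the Haar wavelet for $-1 \le t \le 1$.]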
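As a quick check, the Haar wavelet, taken here as $-1/\sqrt{2}$ on $(-1, 0]$ and $1/\sqrt{2}$ on $(0, 1]$, satisfies the first two conditions required of a mother wavelet. The computation is a worked illustration, not part of the original slides.

```latex
\begin{align*}
\int_{-\infty}^{\infty} \psi(u)\,du
  &= \int_{-1}^{0} \Bigl(-\tfrac{1}{\sqrt{2}}\Bigr)\,du + \int_{0}^{1} \tfrac{1}{\sqrt{2}}\,du
   = -\tfrac{1}{\sqrt{2}} + \tfrac{1}{\sqrt{2}} = 0, \\
\int_{-\infty}^{\infty} \psi^2(u)\,du
  &= \int_{-1}^{0} \tfrac{1}{2}\,du + \int_{0}^{1} \tfrac{1}{2}\,du
   = \tfrac{1}{2} + \tfrac{1}{2} = 1.
\end{align*}
```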
The high- and low-pass filters for Haar correspond to moving differences and
moving averages.
Expect large wavelet coefficients in the presence of outliers.
In practice, when a single outlier is present, the first-level decomposition is enough.
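A short calculation shows why one outlier produces one large first-level coefficient. It assumes the pairing $d_k = (z_{2k} - z_{2k-1})/\sqrt{2}$ for the Haar detail coefficients of a series $z_1, \dots, z_n$ (sign and indexing conventions vary between references) and a shift of size $\theta$ at a single time point $s$. This is a simplification, since an outlier in the observations can also affect the residual at the following time point.

```latex
\begin{align*}
d_k &= \frac{z_{2k} - z_{2k-1}}{\sqrt{2}}, \qquad k = 1, \dots, \lfloor n/2 \rfloor, \\
\tilde z_t &= z_t + \theta\,\delta_{t,s}
\;\Longrightarrow\;
\tilde d_k =
\begin{cases}
d_k + \theta/\sqrt{2}, & s = 2k, \\
d_k - \theta/\sqrt{2}, & s = 2k - 1, \\
d_k, & \text{otherwise.}
\end{cases}
\end{align*}
```

Only the coefficient whose pair contains $s$ changes, and it moves by $\theta/\sqrt{2}$ in absolute value, so its position localizes the outlier to within one pair of time points.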
What we did:
Simulated 20000 replications of the Poisson INAR(1) process for various $\alpha$, $\lambda$ and N.
For each replication, fitted the model and computed the standardized residuals, $z_i$.
Obtained the detail coefficients for $z_i$ and chose the maximum in absolute value.
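The per-series part of these steps can be sketched as follows. The slides do not say how the model was fitted, so this sketch assumes moment estimates (lag-1 autocorrelation for alpha, mean times (1 - alpha) for lambda), residuals standardized with the Poisson INAR(1) conditional mean and variance, and a placeholder `threshold`; all function names are hypothetical.

```python
import math

def fit_inar1(x):
    """Moment estimates: alpha = lag-1 autocorrelation, lam = mean * (1 - alpha)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    c1 = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n))
    alpha = c1 / c0 if c0 > 0 else 0.0
    alpha = min(max(alpha, 0.0), 0.99)   # keep the estimate inside [0, 1)
    return alpha, mean * (1.0 - alpha)

def standardized_residuals(x, alpha, lam):
    """z for t = 1..n-1, using E[X_t | x_{t-1}] and Var[X_t | x_{t-1}] (needs lam > 0)."""
    z = []
    for t in range(1, len(x)):
        mean_t = alpha * x[t - 1] + lam
        var_t = alpha * (1.0 - alpha) * x[t - 1] + lam
        z.append((x[t] - mean_t) / math.sqrt(var_t))
    return z

def haar_details(z):
    """First-level Haar detail coefficients of z (non-overlapping pairs)."""
    s = 1.0 / math.sqrt(2.0)
    return [s * (z[i + 1] - z[i]) for i in range(0, len(z) - 1, 2)]

def detect_outlier_time(x, threshold):
    """Return the 0-based time of the suspected outlier, or None if below threshold."""
    alpha, lam = fit_inar1(x)
    z = standardized_residuals(x, alpha, lam)
    d = haar_details(z)
    j = max(range(len(d)), key=lambda k: abs(d[k]))   # largest |detail coefficient|
    if abs(d[j]) <= threshold:
        return None
    # Within the flagged pair, pick the residual that is larger in absolute value.
    i = 2 * j if abs(z[2 * j]) >= abs(z[2 * j + 1]) else 2 * j + 1
    return i + 1                                      # z[i] is the residual of x[i + 1]

x = [1, 2, 1, 0, 2, 1, 3, 1, 0, 1, 2, 1, 9, 1, 2, 0, 1, 2, 1, 1]  # toy data, spike at index 12
print(detect_outlier_time(x, threshold=3.5))   # threshold is an arbitrary illustrative value
```

In the procedure described above, the threshold would come from the simulated distribution of the maximum absolute detail coefficient rather than being fixed by hand.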
[Figure: standardized residuals and absolute first-level Haar detail coefficients, for INAR(1) with AO and for INAR(1) with IO.]
Simulation study
Illustration of the simulation results for one outlier

                                      AO                          IO
(α, λ)    N     Outlier size   % Correct   Average False   % Correct   Average False
(.3, 1)   128   3σ = 4         13.9        0.081           16.5        0.091
          128   5σ = 6         86.6        0.044           53.9        0.049
          128   10σ = 12       99.6        0.002           99.5        0.012
          512   3σ = 4         25.7        0.187           11.4        0.206
          512   5σ = 6         84.6        0.133           46.0        0.179
          512   10σ = 12       99.8        0.070           99.9        0.077
(.3, 5)   128   3σ = 9         33.8        0.045           15.4        0.041
          128   5σ = 14        58.7        0.016           62.4        0.023
          128   10σ = 27       100         0               99.8        0
          512   3σ = 9         28.2        0.065           11.3        0.071
          512   5σ = 14        88.2        0.050           48.2        0.056
          512   10σ = 27       99.9        0.024           99.9        0.019
Simulation study
Illustration of the simulation results for one outlier

                                      AO                          IO
(α, λ)    N     Outlier size   % Correct   Average False   % Correct   Average False
(.7, 1)   128   3σ = 6         36.6        0.068           38.0        0.051
          128   5σ = 10        92.6        0.067           94.1        0.021
          128   10σ = 19       100         0               100         0.002
          512   3σ = 6         28.7        0.183           28.4        0.137
          512   5σ = 10        98.2        0.098           88.8        0.106
          512   10σ = 19       100         0.909           100         0.035
(.7, 5)   128   3σ = 13        85.0        0.120           32.9        0.025
          128   5σ = 21        98.1        0.019           84.8        0.004
          128   10σ = 41       100         0               100         0
          512   3σ = 13        80.2        0.192           27.3        0.057
          512   5σ = 21        85.8        0.281           92.2        0.039
          512   10σ = 41       100         0               100         0.006
Simulation study
Illustration of the simulation results for three outliers

                                      AO                          IO
(α, λ)    N     Outlier size   % Correct   Average False   % Correct   Average False
(.3, 1)   128   3σ = 4         21.2        0.034           10.6        0.047
          128   5σ = 6         37.8        0.006           31.1        0.007
          128   10σ = 12       93.8        0               87.4        0.000
          512   3σ = 4         22.3        0.116           10.6        0.131
          512   5σ = 6         51.4        0.077           37.5        0.108
          512   10σ = 12       99.4        0.004           98.9        0.013
(.3, 5)   128   3σ = 9         16.6        0.018           9.6         0.013
          128   5σ = 14        58.6        0               31.3        0.001
          128   10σ = 27       94.1        0               89.1        0
          512   3σ = 9         18.2        0.061           9.1         0.042
          512   5σ = 14        81.7        0.021           43.0        0.027
          512   10σ = 27       99.7        0.002           99.3        0.003
Simulation study
Illustration of the simulation results for three outliers

                                      AO                          IO
(α, λ)    N     Outlier size   % Correct   Average False   % Correct   Average False
(.7, 1)   128   3σ = 6         54.5        0.151           20.0        0.013
          128   5σ = 10        67.7        0.002           62.9        0.004
          128   10σ = 19       99.2        0               96.2        0
          512   3σ = 6         58.9        0.344           23.2        0.112
          512   5σ = 10        99.2        0.035           79.1        0.032
          512   10σ = 19       100         0.047           100         0.002
(.7, 5)   128   3σ = 13        31.7        0.090           17.8        0.006
          128   5σ = 21        80.3        0.015           53.9        0
          128   10σ = 41       99.1        0               97.3        0
          512   3σ = 13        37.6        0.178           20.3        0.033
          512   5σ = 21        82.6        0.180           80.3        0.014
          512   10σ = 41       100         0.084           100         0.001
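The outlier sizes listed in the simulation tables (for example 4, 6 and 12 for the first parameter pair) are consistent with multiples of the standard deviation of the stationary Poisson marginal, rounded up to the next integer. The rounding rule is inferred from the tabulated values and is not stated in the slides; `outlier_size` is a hypothetical helper.

```python
import math

def outlier_size(alpha, lam, k):
    """k standard deviations of the Poisson(lam / (1 - alpha)) marginal, rounded up."""
    sigma = math.sqrt(lam / (1.0 - alpha))  # Poisson marginal: variance equals the mean
    return math.ceil(k * sigma)

for alpha, lam in [(0.3, 1), (0.3, 5), (0.7, 1), (0.7, 5)]:
    sizes = [outlier_size(alpha, lam, k) for k in (3, 5, 10)]
    print((alpha, lam), sizes)
```

For the pair (0.3, 1) the marginal mean is about 1.43, so the standard deviation is about 1.20 and the three sizes are 4, 6 and 12.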
Application
Dataset: number of different IP addresses (in periods of 2 min length) at the server
of the Department of Statistics of the University of Würzburg on November 29th,
2005, between 10 a.m. and 6 p.m. [Weiß, 2007]
N = 241
$\bar{x} = 1.32$, $\hat{\sigma}^2 = 1.39$
$k_{1-0.05} = 3.52$
Result from the application of the algorithm: t = 224
Barczy et al. (2010): estimated outlier size 6.79, true $x_{224} \approx 1$
[Figure: number of different IP addresses against time (periods of 2 minutes), November 29th, 2005, 10 a.m. to 6 p.m.]
Parametric procedure that detects the time points of the outliers for Poisson
INAR(1) models.
The choice of the threshold:
why does the threshold depend on ?
what if the arrivals are not Poisson?
to use corrected thresholds for integer-valued data based on the universal threshold
[Kolaczyk, 1999]
References