
Journal of the American Statistical Association



A Multiple-Index Model and Dimension Reduction


Yingcun Xia

Yingcun Xia is Associate Professor, Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546. The author thanks an associate editor and two referees for their valuable comments. This work was supported by NUS FRG R-155-000-063-112, the National University of Singapore, and the National Natural Science Foundation of China (grant 10471061).

Published online: 01 Jan 2012.

To cite this article: Yingcun Xia (2008) A Multiple-Index Model and Dimension Reduction, Journal of the American
Statistical Association, 103:484, 1631-1640, DOI: 10.1198/016214508000000805

To link to this article: http://dx.doi.org/10.1198/016214508000000805

A Multiple-Index Model and Dimension Reduction
Yingcun Xia

Dimension reduction can be used as an initial step in statistical modeling. Further specification of model structure is necessary and important
when the reduced dimension is still greater than 1. In this article we investigate one method of specification that involves separating the linear
component from the nonlinear components, leading to further dimension reduction in the unknown link function and, thus, better estimation
and easier interpretation of the model. The specified model includes the popular econometric multiple-index model and the partially linear
single-index model as its special cases. A criterion is developed to validate the model specification. An algorithm is proposed to estimate the
model directly. Asymptotic distributions for the estimators of the parameters and the nonparametric link function are derived. Air pollution
data in Chicago are used to illustrate the modeling procedure and to demonstrate its advantages over the existing dimension reduction
approaches.
KEY WORDS: Asymptotic distribution; Convergence of algorithm; Dimension reduction; Local linear smoother; Semiparametric model.

1. INTRODUCTION

Suppose that Y is a response and X is a p-dimensional covariate vector. An essential association between Y and X is the conditional mean function M(x) = E(Y|X = x), leading to a general model Y = M(X) + ε, where E(ε|X) = 0 almost surely. Due to the "curse of dimensionality," the general model with p > 1 is difficult to estimate well and, thus, hard to use in practice. To lessen the effect of dimensionality, a popular approach is dimension reduction of the conditional mean (Samarov 1993; Hristache, Juditski, Polzehl, and Spokoiny 2001; Cook and Li 2002; Xia, Tong, Li, and Zhu 2002; Yin and Cook 2002; among others), which searches for dA linear combinations of X: β̄1'X, ..., β̄dA'X with dA < p, such that M(X) can be well approximated by a lower dimensional function m(β̄1'X, ..., β̄dA'X) or, more ideally, M(X) = m(β̄1'X, ..., β̄dA'X), leading to the dimension reduction model in the regression mean function:

  Model (A): Y = m(β̄1'X, ..., β̄dA'X) + ε.

This model has received much attention and has been investigated intensively; for example, all the references cited previously concern this model. Dimension reduction is an initial step for model construction.

If dA > 1, model (A) needs further specification in order to be estimated better; see, for example, Samarov (1993), Carroll, Fan, Gijbels, and Wand (1997), and Samarov, Spokoiny, and Vial (2005). It is very likely that one of the dimension reduction components β̄1'X, ..., β̄dA'X, or a linear combination of them, c1β̄1'X + ··· + cdAβ̄dA'X ≡ γ0'X, affects the response linearly, and the other orthogonal dA − 1 combinations affect the response nonlinearly, leading to the model

  Model (B): Y = G(β̃1'X, ..., β̃dA−1'X) + γ0'X + ε,

where E(ε|X) = 0 almost surely and G is an unknown link function. A well-known model in econometrics proposed by Ichimura and Lee (1991) and Horowitz (1998) has a similar form. Following them, we call model (B) the multiple-index model. Compared with model (A), the dimension of the nonparametric function in model (B) is reduced by 1. Therefore, G(·) can be estimated much more accurately than m(·) in model (A). For ease of exposition, let

  BA = (β̄1, ..., β̄dA), BB = (β̃1, ..., β̃dA−1), dB = dA − 1.

Model (B) is very general, including the partially linear model (Speckman 1988; Wang 2003) and the single-index model (Ichimura 1993) as its special cases. An important special case of model (B) is when dB = 1, that is, Y = γ0'X + g(β1'X) + ε, where g(·) is a univariate unknown function, which again includes the partially linear single-index model (Carroll et al. 1997; Yu and Ruppert 2002) as its special case. Note that the partially linear single-index model (Carroll et al. 1997) takes a slightly different form: Y = a'X1 + g(b'X2) + ε, where X = (X1', X2')', in which the linear variables X1 and the nonlinear variables X2 need to be specified before modeling. In contrast, model (B) allows the data to choose the linear component and the nonlinear components and is, thus, more data driven.

It is easy to see that model (B) is not unique. For example, we can rewrite G(B0'X) := G(B0'X) − c'B0'X and γ0 := γ0 + B0c for any q × 1 vector c. There are several methods to make the model identifiable that do not impose any restriction on the model flexibility. The next proposition presents one of those methods.

Proposition 1.1 (Identification). Suppose X has a continuous density function with nonsingular covariance matrix Σ0, and suppose the gradient ∇M(x) = ∂M(x)/∂x and Σ = E[{∇M(X) − E∇M(X)}{∇M(X) − E∇M(X)}'] exist. If Σ has dB nonzero eigenvalues, then we have:

(I) Model (B) can always be rewritten in such a way that (i1) B0'Σ0B0 = I_dB and the first nonzero element in each column of B0 is positive; (i2) γ0'Σ0B0 = 0; and (i3) the matrix Λ0 := E[{∇G(B0'X) − E∇G(B0'X)}{∇G(B0'X) − E∇G(B0'X)}'] is a full-rank diagonal matrix, where ∇G(u) = ∂G(u)/∂u.

(II) If model (B) satisfying (i1)–(i3) also holds with γ0, B0, and G replaced by γ̄0, B̄0, and Ḡ, respectively, then there exists a rotation matrix Q : dB × dB such that

  γ̄0 = γ0, B̄0 = B0Q', Ḡ(u) = G(Qu).

Furthermore, if the nonzero eigenvalues of Σ differ from one another, then Q = I_dB.

Yingcun Xia is Associate Professor, Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546 (E-mail: staxyc@nus.edu.sg).

© 2008 American Statistical Association, Journal of the American Statistical Association, December 2008, Vol. 103, No. 484, Theory and Methods. DOI 10.1198/016214508000000805.
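The non-uniqueness of model (B) noted above is easy to verify numerically. The sketch below is illustrative code (not from the article; the link function, dimensions, and seed are arbitrary choices): absorbing a linear term c'B0'x into γ0 while subtracting it from G leaves the regression function γ0'x + G(B0'x) unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 5, 2
B0 = np.linalg.qr(rng.standard_normal((p, q)))[0]   # p x q index matrix
gamma0 = rng.standard_normal(p)
G = lambda u: np.sin(u[..., 0]) + u[..., 1] ** 2    # toy link function

def M(x, gamma, link):
    # regression function of model (B): gamma'x + G(B0'x)
    return x @ gamma + link(x @ B0)

c = rng.standard_normal(q)
# reparameterize: G*(u) = G(u) - c'u and gamma* = gamma0 + B0 c
G_star = lambda u: G(u) - u @ c
gamma_star = gamma0 + B0 @ c

x = rng.standard_normal((100, p))
# the two parameterizations give the identical mean function
assert np.allclose(M(x, gamma0, G), M(x, gamma_star, G_star))
```

This is exactly why the normalization (i1)–(i3) is needed before the parameters can be estimated and interpreted separately.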
Throughout this article we assume that model (B) is rewritten according to (i1)–(i3), which, as pointed out in Proposition 1.1, does not impose any restriction.

Model (B) identifies the linear component and the nonlinear components in a data-driven way. This identification is itself important in statistics and in many other disciplines of science in order to understand the different mechanisms of the linear and nonlinear factors. See, for example, Grenfell, Finkenstädt, Wilson, Coulson, and Crawley (2000), Gao and Tong (2004), Samarov et al. (2005), and Maestri et al. (2005). The well-known partially linear model Y = β1'X1 + G(X2) + ε might be the first semiparametric model that tried to separate the linear and nonlinear components; see Speckman (1988) and Wang (2003). The partially linear single-index model (Carroll et al. 1997) is one of the recent models with a similar motivation. For these models, however, the selection of linear and nonlinear components is rather arbitrary. Tong (1990) used plots of the response against each covariate to identify linear components and nonlinear components in a time series setting. More recently, Samarov et al. (2005) tried to identify the linear and nonlinear variables using nonparametric kernel smoothing.

Another advantage of separating the linear components and nonlinear components as in model (B) is model interpretation. It is known that model (A), or the general dimension reduction model (Li 1991), is sometimes not so informative and is hard to visualize when dA > 1. In contrast, for model (B) the plot of the response against the linear component has a clear meaning. On the other hand, because the dimension of the nonlinear part is further reduced by 1, its interpretation is relatively easier. One real dataset will be employed to demonstrate this point later.

In terms of estimation, because the structure of the dimension reduction directions is specified in model (B), it is expected that the estimation efficiency can be significantly improved. Note that a naive estimation, which first applies a linear regression to the response and then dimension reduction to the fitted residuals of the linear regression, cannot guarantee consistency. As an example, consider X = (X1, X2, X3)', where Xk = ηk + η0 with η0, η1, η2, η3 ~ iid Uniform(−.5, .5), and model Y = (X1 + X2)³ + ε. The linear regressor is L(X) = β0 + β'X, where

  (β0, β')' = arg min_{β0,β} E{Y − β0 − β'X}² = (0, 1.25, 1.25, −.15)'.

The residual E(Y|X) − L(X) = (X1 + X2)³ − 1.25(X1 + X2) + .15X3 has a two-dimensional dimension reduction space, whereas the true nonlinear part of the model is one dimensional. Thus, the naive method cannot estimate the model consistently. Other relevant estimation methods, such as those of Härdle and Stoker (1989), Li (1991), Samarov (1993), Newey and Stoker (1993), Hristache et al. (2001), Yin and Cook (2002), Xia et al. (2002), Donkers and Schafgans (2003), and Banerjee (2007), cannot be applied directly to model (B). In this article we propose a method to estimate model (B) directly. We also develop a criterion to determine the dimension dB and to choose between model (A) and model (B). Asymptotic theories for the estimation and model selection are obtained. An algorithm for the model estimation is also proved to be convergent.

2. INITIAL ESTIMATORS

Because the dimension dB is unknown, we can try a working dimension q: 1 ≤ q ≤ p; the choice of q will be discussed later. Recall that M(x) = E(Y|X = x). By model (B), we have M(x) = γ0'x + G(B0'x) and, thus,

  ∇M(x) := ∂M(x)/∂x = γ0 + B0∇G(B0'x).  (1)

It follows that ∇M(X) − E∇M(X) = B0{∇G(B0'X) − E∇G(B0'X)}. The average of its outer product is

  Σ := E[{∇M(X) − E∇M(X)}{∇M(X) − E∇M(X)}'] = B0Λ0B0',  (2)

where Λ0 is defined in Proposition 1.1. Therefore, B0 can be estimated by the eigenvectors of Σ. By (1), we have E{∇M(X)} = γ0 + B0E{∇G(B0'X)}. Because the model is written according to the identification format, I − Σ0^{1/2}B0B0'Σ0^{1/2} is a projection matrix, and (I − Σ0^{1/2}B0B0'Σ0^{1/2})Σ0^{1/2}E{∇M(X)} = Σ0^{1/2}γ0. Thus, γ0 can be obtained easily by

  γ0 = Σ0^{−1/2}(I − Σ0^{1/2}B0B0'Σ0^{1/2})Σ0^{1/2}E{∇M(X)}.  (3)

To implement the previous idea, one can use multivariate local polynomial smoothing to estimate the gradients (Fan and Gijbels 1996). Here we consider local linear smoothing only, as a higher order polynomial may not be preferred due to its instability. The details are as follows. Consider a multiplicative kernel density function K(u1, ..., up) and bandwidths (h1, ..., hp). For simplicity of bandwidth selection and ease of exposition, after the following standardization a common bandwidth is used for all variables, that is, h1 = ··· = hp = h. Let Kh(u) = h^{−p}K(u/h), where u = (v1, ..., vp)'. Suppose that {(Xi, Yi), i = 1, 2, ..., n} is a random sample from (X, Y). Let S_X = n^{−1}Σ_{i=1}^n (Xi − X̄)(Xi − X̄)', with X̄ = n^{−1}Σ_{i=1}^n Xi. Standardize Xi = (Xi1, ..., Xip)' by setting

  X̃i := S_X^{−1/2}(Xi − X̄).  (4)

For any x, the principle of the local linear smoother suggests minimizing

  n^{−1}Σ_{i=1}^n {Yi − a − b'X̃ix}²Kh(X̃ix),  (5)

with respect to a and b to estimate M(x) and ∂M(x)/∂x, respectively, where X̃ix = X̃i − x. Denote the solution of (a, b) in (5) at x = X̃j by (âj, b̂j).

Let b̄ = Σ_{j=1}^n ρ(f̂(X̃j))b̂j / Σ_{j=1}^n ρ(f̂(X̃j)), where f̂(x) = n^{−1}Σ_{i=1}^n Kh(X̃ix) and ρ(·) is a trimming function introduced for technical purposes to handle the notorious boundary points. The trimming function ρ(·) is any bounded function with bounded second-order derivatives on R such that ρ(v) > 0 if v > ω0 > 0 and ρ(v) = 0 if v ≤ ω0. One example of such a trimming function is ρ(v) = 1 if v ≥ 2ω0; ρ(v) = exp{−(v − 2ω0)^{−1} + (ω0 − v)^{−1}}/[1 + exp{−(v − 2ω0)^{−1} + (ω0 − v)^{−1}}] if ω0 < v < 2ω0; and ρ(v) = 0 if v ≤ ω0.
In practice, ω0 can be very small so that, in effect, no observations are trimmed off. Analogous to Σ in (2), we consider an average of outer products of the b̂j's,

  Σ̂ = n^{−1}Σ_{j=1}^n ρ(f̂(X̃j)){b̂j − b̄}{b̂j − b̄}'.

Calculate the first q eigenvectors of Σ̂. In accordance with the requirements in Proposition 1.1, each eigenvector is written such that its first nonzero element is positive. Denote them by B̃(1). The estimators of B0 and γ0 are then, respectively,

  B̂(1) = S_X^{−1/2}B̃(1),  γ̂(1) = S_X^{−1/2}(I − B̃(1)B̃(1)')b̄.

Note that multiple kernel smoothing is used above, which is known to be inefficient. To improve the efficiency, one can employ the idea of an adaptive kernel as in Hristache et al. (2001) and Xia et al. (2002). However, the estimation combining the two ideas is not as efficient as the method introduced in Xia et al. (2002); see the discussion in Xia (2006).

3. REFINED ESTIMATORS

If model (B) holds, then the gradients ∂G(B0'x)/∂x for all x lie in a common dB-dimensional subspace, {B0u : u ∈ R^{dB}}, as shown in (1). To use this information, we can replace b in (5), which is an estimate of the gradient, by Bd(x) in order to be in line with the local linear approximation M(Xi) ≈ γ0'Xi + G(B0'x) + {B0∇G(B0'x)}'Xix for Xi close to x, where Xix = Xi − x. Because G(B0'x) and ∇G(B0'x) are unknown, they can be estimated by minimizing the local linear approximation error

  n^{−1}Σ_{i=1}^n {Yi − γ'Xi − a − d'B'Xix}²Kh(B'Xix)

with respect to a and d. Because (γ, B) is common to all x, it should be estimated by minimizing the approximation errors for all x = Xj, j = 1, ..., n. As a consequence, we propose to estimate (γ0, B0) by minimizing

  n^{−2}Σ_{j=1}^n Σ_{i=1}^n Inj ρj {Yi − γ'Xi − aj − dj'B'Xij}²wij  (6)

with respect to γ; aj, dj = (dj1, ..., djq)', j = 1, ..., n; and B with B'Σ0B = Iq, where Xij = Xi − Xj. The weight function wij should be adaptive to the structure, that is, wij = Kh(B'Xij). Here we abuse the notation and let K(·) denote the same function defined previously but with p replaced by q. The trimming function is ρj = ρ(f̂B(Xj))/f̂B(Xj), with f̂B(x) = n^{−1}Σ_{i=1}^n Kh(B'Xix). Note that we introduce another trimming function, Inj = 1(|Xj| ≤ n). Hereafter, for any matrix A, |A| denotes its largest singular value, which is the Euclidean norm if A is a vector. It is easy to see that, under some mild conditions, the observations trimmed off by Inj are negligible. These two trimming functions are used here for technical purposes. The preceding estimation procedure is similar to the minimum average (conditional) variance estimation (MAVE) method (Xia et al. 2002).

The minimization problem in (6) can be solved by alternately fixing (aj, dj), j = 1, ..., n, and fixing (γ, B). As a consequence, the minimization can be decomposed into two quadratic programming problems, both of which have simple analytic solutions. For any matrix B = (β1, ..., βq), define operators ℓ(·) and M(·), respectively, by

  ℓ(B) = (β1', ..., βq')' and M(ℓ(B)) = B.

The following algorithm implements the estimation.

Step 0 (Initializing). Calculate S_X and standardize Xi to X̃i as in (4). Let B̃(1) be the matrix calculated in Section 2 and let γ̃(1) = (I − B̃(1)B̃(1)')b̄. Set t = 1, B(1,0) = B̃(1), and γ(1,0) = γ̃(1). Let h0 be the bandwidth used in the initial estimation.

Step I (Outer loop). Set τ = 0 and ht = max{cn h_{t−1}, ℏt}, where 1 > cn > n^{−1/(p+4)} and ℏt is another bandwidth discussed later.

Step I.1 (Inner loop). Fix B = B(t,τ) and γ = γ(t,τ), and calculate the solutions of (aj, dj), j = 1, ..., n, to the minimization problem in (6):

  (aj^{(t,τ)}, ht dj^{(t,τ)'})' = {Σ_{i=1}^n K_{ht}(B(t,τ)'X̃ij)(1, X̃ij'B(t,τ)/ht)'(1, X̃ij'B(t,τ)/ht)}^{−1} × Σ_{i=1}^n K_{ht}(B(t,τ)'X̃ij)(1, X̃ij'B(t,τ)/ht)'(Yi − X̃i'γ(t,τ)).

Step I.2 (Inner loop). Let

  f̂_{B(t,τ)}(x) = n^{−1}Σ_{i=1}^n K_{ht}(B(t,τ)'X̃ix) and ρj^{(t,τ)} = ρ(f̂_{B(t,τ)}(X̃j))/f̂_{B(t,τ)}(X̃j).

Fix aj = aj^{(t,τ)} and dj = dj^{(t,τ)}, and calculate the solution of (γ, B), or (γ', ℓ(B)')', in (6):

  (γ^{(t,τ+1)'}, b^{(t,τ+1)'})' = {Σ_{j,i=1}^n Inj ρj^{(t,τ)} K_{ht}(B(t,τ)'X̃ij)(X̃i', X̃ij^{(t,τ)'})'(X̃i', X̃ij^{(t,τ)'})}^{−1} × Σ_{j,i=1}^n Inj ρj^{(t,τ)} K_{ht}(B(t,τ)'X̃ij)(X̃i', X̃ij^{(t,τ)'})'(Yi − aj^{(t,τ)}),

where X̃ij^{(t,τ)} = dj^{(t,τ)} ⊗ X̃ij.

Step I.3 (Inner loop). Calculate

  Λ^{(t,τ+1)} = M(b^{(t,τ+1)})'M(b^{(t,τ+1)}),
  B(t,τ+1) = M(b^{(t,τ+1)}){Λ^{(t,τ+1)}}^{−1/2},
  γ(t,τ+1) = (I − B(t,τ+1)B(t,τ+1)')γ^{(t,τ+1)}.
If a convergence criterion is satisfied, stop; otherwise, set τ := τ + 1 and go to step I.1.

Step II (Outer loop). Repeat steps I.1–I.3 until convergence. Let (γ(t+1,0), B(t+1,0)) be the final value of (γ(t,τ), B(t,τ)). If a convergence criterion is met, stop; otherwise, set t := t + 1 and go to step I.

Let (γ̃, B̃) be the final value of (γ(t,0), B(t,0)). The final estimators are then γ̂ = S_X^{−1/2}γ̃ and B̂ = S_X^{−1/2}B̃, respectively. In the calculation, as usual, convergence is regarded as achieved whenever the changes of (γ(t,τ), B(t,τ)) are very small, for example, |γ(t,τ+1) − γ(t,τ)| + |B(t,τ+1)B(t,τ+1)' − B(t,τ)B(t,τ)'| < 10^{−6}, in the last few iterations. The estimated link function G(u) and its gradient are

  (Ĝ(u), h∇̂G(u)')' = {Σ_{i=1}^n Kh(B̂'Xi − u)(1, (B̂'Xi − u)'/h)'(1, (B̂'Xi − u)'/h)}^{−1} × Σ_{i=1}^n Kh(B̂'Xi − u)(1, (B̂'Xi − u)'/h)'(Yi − γ̂'Xi),

where h is the bandwidth used in the last iteration of the algorithm.

Bandwidths h0 and ℏt, t = 1, 2, ..., need to be selected in the algorithm. After fixing γ(t,0) and B(t,0), the bandwidths are actually selected for the estimation of E(Y|X = x) and E(Y − γ(t,0)'X | B(t,0)'X = u), t = 1, 2, ..., respectively. Most existing bandwidth selection methods can be used; details can be found in Fan and Gijbels (1996) and Yang and Tschernig (1999). In the following calculations cn = 1/1.5, and the simple rule of thumb of Silverman (1986) is used to select the bandwidth; that is, after standardizing Xi, we take h0 = n^{−1/(p+4)} and ℏt = n^{−1/(q+4)} for all t.

4. MODEL SELECTION

One purpose of this article is to further specify a dimension reduction model such that it can be estimated better. Therefore, it is essential to validate the specification and select one model between (A) and (B). First, we need to find the dimension of model (A). There are a number of methods for this purpose; see, for example, Yin and Cook (2002, 2004) and Xia et al. (2002). Here the method in Xia et al. (2002) is used. The method is based on a semiparametric cross-validation (CV) criterion. Suppose that, for a working dimension q, the estimated directions for model (A) are B̂A = (β̂1, ..., β̂q); the estimation details can be found in Xia et al. (2002). For simplicity, we use the delete-one-observation Nadaraya–Watson estimator

  m̂Aj(B̂A'x) = Σ_{i=1,i≠j}^n K_{hA}(B̂A'Xix)Yi / Σ_{i=1,i≠j}^n K_{hA}(B̂A'Xix).

The CV value for model (A) is defined as

  CVA(q) = n^{−1}Σ_{j=1}^n ρ(f̂(Xj)){Yj − m̂Aj(B̂A'Xj)}².

Then the dimension for model (A) is selected as

  d̂A = arg min_{q≥0} CVA(q).

Xia et al. (2002) proved that this selection is consistent; that is, d̂A → dA in probability.

Suppose that the estimator of B0 in model (B) is B̂B with working dimension q. Following the idea of Speckman (1988), calculate the leave-one-out estimators of γ0 and G(·), respectively, by

  γ̂j = {Σ_{i=1,i≠j}^n (Xi − m̂jB(Xi))(Xi − m̂jB(Xi))'}^{−1} Σ_{i=1,i≠j}^n (Xi − m̂jB(Xi)){Yi − m̃jB(Xi)}

and Ĝj(x) = m̃jB(x) − γ̂j'm̂jB(x), where m̂jB(x) = {nf̂_{B̂B,j}(x)}^{−1}Σ_{i=1,i≠j}^n K_{hB}(B̂B'Xix)Xi, m̃jB(x) = {nf̂_{B̂B,j}(x)}^{−1}Σ_{i=1,i≠j}^n K_{hB}(B̂B'Xix)Yi, and f̂_{B̂B,j}(x) = n^{−1}Σ_{i=1,i≠j}^n K_{hB}(B̂B'Xix). The CV value for model (B) is defined as

  CVB(q) = n^{−1}Σ_{j=1}^n ρ(f̂(Xj)){Yj − γ̂j'Xj − Ĝj(B̂B'Xj)}².

Note that the same trimming function ρ(f̂(Xj)) is used for the two models in order to make their CV values comparable.

Finally, our selection criterion is as follows. If CVB(d̂A − 1) < CVA(d̂A), then we select model (B); otherwise, model (A). The dimension of model (B) is estimated by

  d̂B = d̂A − 1 if CVB(d̂A − 1) ≤ CVA(d̂A); d̂B = d̂A otherwise.

Equivalently, the linear component in model (B) exists if CVB(d̂A − 1) ≤ CVA(d̂A), and the nonlinear components exist if d̂B ≥ 1.

As noticed by an anonymous referee, there are other alternatives for the selection of dimension and models. For example, one can select d̂B = arg min_{q≥0} CVB(q) and select model (B) if CVB(d̂B) < CVA(d̂B + 1). One can also select d̂A = arg min_{q≥0} CVA(q) and d̂B = arg min_{q≥0} CVB(q), and select model (B) if CVB(d̂B) < CVA(d̂A). Simulations suggest that these methods work with about the same efficiency as the preceding method.
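The CV criterion CVA(q) can be illustrated with a small numerical sketch. This is not the article's implementation: it uses a Gaussian kernel, takes the candidate directions as known rather than estimated, and sets the trimming weight to 1. It shows why the criterion discriminates: the delete-one-observation prediction error is small for the true index and close to the raw variance of Y for an irrelevant one.

```python
import numpy as np

def cv_A(X, Y, B_hat, h):
    """Delete-one-observation Nadaraya-Watson cross-validation score
    for given index directions B_hat (p x q)."""
    U = X @ B_hat                                     # reduced covariates B'X_i
    D = U[:, None, :] - U[None, :, :]                 # pairwise index differences
    K = np.exp(-0.5 * np.sum((D / h) ** 2, axis=-1))  # Gaussian kernel weights
    np.fill_diagonal(K, 0.0)                          # delete one observation
    m_loo = K @ Y / K.sum(axis=1)                     # leave-one-out fit at each X_j
    return float(np.mean((Y - m_loo) ** 2))           # trimming weight taken as 1

# toy check: Y depends on X1 only
rng = np.random.default_rng(3)
n, p = 300, 3
X = rng.standard_normal((n, p))
Y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)
h = n ** (-1.0 / 5.0)                  # rule-of-thumb rate for q = 1
cv_true = cv_A(X, Y, np.eye(p)[:, :1], h)    # index along X1 (the truth)
cv_wrong = cv_A(X, Y, np.eye(p)[:, 1:2], h)  # index along the irrelevant X2
```

In the article the score is minimized over the working dimension q (and the directions are estimated, not plugged in); the sketch only displays the gap in the score that drives that arg-min selection.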
5. ASYMPTOTICS

Some asymptotic properties are presented in this section. Their proofs can be obtained from the author on request or downloaded from the supplemental materials website at http://www.amstat.org/publications/jasa/supplemental_materials. For any B : p × q such that B'Σ0B = Iq, let fB(u) be the density function of B'X, μB(u) = E(X|B'X = u), and wB(u) = E(XX'|B'X = u). For ease of exposition, denote μB(B'x) and fB(B'x) by μB(x) and fB(x), respectively, and let νB(x) = x − μB(x). We also assume that μ0 = ∫K(v1, ..., vq) dv1 ··· dvq = 1 and μ2,k = ∫K(v1, ..., vq)vk² dv1 ··· dvq = 1 for k = 1, ..., q; otherwise, redefine K(v1, ..., vq) := μ0^{−1}K(v1/√μ2,1, ..., vq/√μ2,q)/√(μ2,1 ··· μ2,q). For any square matrix A, denote its Moore–Penrose inverse by A^+.

Proposition 5.1 (Convergence of the algorithm). Suppose that assumptions (C1)–(C6) in the Appendix hold, and let q = dA. Denote the estimators at outer loop t and inner loop τ by (B(t,τ), γ(t,τ)). Then, for any t > 1 and τ ≥ 1, there is a rotation matrix Q such that

  ((γ(t,τ+1) − γ0)', ℓ(B(t,τ+1) − B0Q)')' = {D̃0^+D0 + o(1)}((γ(t,τ) − γ0)', ℓ(B(t,τ) − B0Q)')' + D̃0^+ O(ht^4 + {log n/(n ht^q)}^{1/2}) + o(n^{−1/2})

almost surely, where D̃0 and D0 are two semipositive definite matrices (details can be found in the proof). Moreover, the eigenvalues of D̃0^+D0 are all smaller than 1; that is, the algorithm is contractive.

Remark 5.1. It is known that the basis of a linear space is not unique. The estimator B̂ only converges to one of the bases. If the nonzero eigenvalues of M0 in the Appendix differ from one another and the model is written according to (i1)–(i3), then Q = I_dB. Proposition 5.1 only indicates the convergence of the inner loop. For the outer loop, because ht → h as t → ∞, its convergence is also implied by the proposition.

Theorem 5.1 (Consistency of the estimators). Suppose that assumptions (C1)–(C6) in the Appendix hold. If q = dB and the final bandwidth is hB, then there is a rotation matrix Q such that

  |γ̂ − γ0| + |B̂ − B0Q| = O(hB^4 + log n (n hB^{dB})^{−1} + n^{−1/2})

in probability.

Remark 5.2. Note that, for model (A), the consistency rate for the directions is Op{hA^4 + log n (n hA^{dA})^{−1} + n^{−1/2}}. Because dA = dB + 1, the consistency rate for the estimation of model (B) can be faster than that for model (A). In other words, identifying model (B) correctly can indeed improve the estimation efficiency.

Corollary 5.1 (Asymptotic distributions). Suppose that assumptions (C1)–(C6) in the Appendix hold, q = dB, and the nonzero eigenvalues of M0 differ from one another. Suppose the model is written according to (i1)–(i3). If the density function of B0'X is positive at u = (v1, ..., vdB)', then we have

  (n hB^{dB})^{1/2}{Ĝ(u) − G(u) − (hB²/2)Σ_{ι=1}^{dB} ∇²_{ι,ι}G(u)} →D N(0, σ²(u){∫K²(v1, ..., vdB) dv1 ··· dvdB}/fB0(u)),

where σ²(u) = E(ε²|B0'X = u). Furthermore, if dB ≤ 3, then we have

  √n((γ̂ − γ0)', ℓ(B̂ − B0)')' →D N(0, [Ip, −B0(I_dB ⊗ γ0'); 0, I_{pdB}] W0^+W2W0^+ [Ip, −B0(I_dB ⊗ γ0'); 0, I_{pdB}]'),

where

  W0 = E[ρ(fB0(X)){νB0(X)', (∇G(B0'X) ⊗ νB0(X))'}'{νB0(X)', (∇G(B0'X) ⊗ νB0(X))'}]

and

  W2 = E[ρ²(fB0(X)){νB0(X)', (∇G(B0'X) ⊗ νB0(X))'}'{νB0(X)', (∇G(B0'X) ⊗ νB0(X))'}ε²].

Remark 5.3. Corollary 5.1 indicates that, if dB ≤ 3, then root-n consistency for the estimators of the parameters can be achieved. If higher order local polynomial smoothing is used, root-n consistency can also be achieved for dB > 3. However, a model with dB > 3 is not attractive in practice because of the "curse of dimensionality." Instead, dB = 1 is more appealing. In this case, the asymptotic distribution is similar to that in Carroll et al. (1997), where the variance–covariance matrix is W0^+W2W0^+. The difference in the variance–covariance matrices results from the identification requirement in our model.

Remark 5.4. To utilize the asymptotic distribution for statistical inference about the parameters, we need to estimate the variance–covariance matrix. Replace the values G(B0'Xj), ∇G(B0'Xj), and νB0(Xj), respectively, by Gj, ∇j, and νj, with

  (Gj, hB∇j')' = {Σ_{i=1}^n K_{hB}(B̂'Xij)(1, (B̂'Xij)'/hB)'(1, (B̂'Xij)'/hB)}^{−1} Σ_{i=1}^n K_{hB}(B̂'Xij)(1, (B̂'Xij)'/hB)'{Yi − Xi'γ̂}

and μj = {nfj}^{−1}Σ_{i=1}^n K_{hB}(B̂'Xij)Xi, where fj = n^{−1}Σ_{i=1}^n K_{hB}(B̂'Xij) and νj = Xj − μj. Then W0 and W2 can be estimated, respectively, by

  Ŵ0 = n^{−1}Σ_{j=1}^n Inj ρ(fj){νj', (∇j ⊗ νj)'}'{νj', (∇j ⊗ νj)'}
and

  Ŵ2 = n^{−1}Σ_{j=1}^n Inj ρ(fj){νj', (∇j ⊗ νj)'}'{νj', (∇j ⊗ νj)'}(Yj − Gj)².

Finally, the variance–covariance matrix can be estimated by

  Δ̂ = [Ip, −B̂(I_dB ⊗ γ̂'); 0, I_{pdB}] Ŵ0^+Ŵ2Ŵ0^+ [Ip, −B̂(I_dB ⊗ γ̂'); 0, I_{pdB}]'.

It is easy to see that Δ̂ is a consistent estimator of the variance–covariance matrix in Corollary 5.1. The standard errors (S.E.) of the elements of (γ̂', ℓ(B̂)')' are asymptotically the square roots of the diagonal elements of Δ̂/n. These standard errors can be used for statistical inference about the parameters separately; see Examples 6.2 and 6.3.

Theorem 5.2 (Consistency of model selection). Suppose that assumptions (C1)–(C6) in the Appendix hold. For every working dimension q, we use bandwidth h ∝ n^{−1/(q+4)}. If model (B) is true, then P{CVB(d̂A − 1) < CVA(d̂A)} → 1 as n → ∞; otherwise, P{CVB(d̂A − 1) > CVA(d̂A)} → 1. Moreover, the selected dimension is consistent; that is, P(d̂B = dB) → 1 as n → ∞.

Theorem 5.2 indicates that our selection criterion for models (A) and (B) is consistent and that the selected dimension for model (B) is also consistent. Theorem 5.2 also implies that the estimation method searches for the dimension reduction space of the conditional mean exhaustively (Xia 2007).

6. NUMERICAL STUDIES

In the following calculations, we use the quadratic kernel K(u) = μ0^{−1}(1 − |u|²/μ2)²I(|u|²/μ2 < 1)/μ2^{q/2}, where μ0 = ∫(1 − |u|²)²I(|u| < 1) du and μ2 = μ0^{−1}∫(1 − |u|²)²I(|u| < 1)v1² du. We use the trimming function in Section 2 with ω0 = .01; thus, essentially all observations have equal weight. The bandwidths by this rule of thumb are used for the selection of h0 and ℏt, with cn = 1/1.5; see Silverman (1986) and Scott (1992). A computer code is available at www.stat.nus.edu/~ycxia/mim.m. Besides comparing with model (A) and the estimation method in Xia et al. (2002), we also consider the general dimension reduction model

  Model (A*): Y = m*(β1'X, ..., β_{dA*}'X, ε),

where ε is independent of X; see Li (1991). There are a number of methods to estimate model (A*) and to select the dimension dA*. In the following, we mainly consider the sliced inverse regression (SIR) and the principal Hessian directions (pHd) methods; see Li (1991, 1992). For pHd applied to the residuals of a linear regression, the preceding model has a roughly similar form to that of model (B); therefore, pHd should have better performance than the other inverse regression methods. A code in R (http://www.r-project.org/) for the inverse regression methods is used in the following calculations.

Example 6.1 (Simulated data). Let us first check the estimation consistency with the model

  Y = cγ0'X + β1'X/{.5 + (1.5 + β2'X)²} + σε,

where γ0 = (1, −1, 0, 0, 0, 0, 0, 0, −1, 1)', β1 = (0, 0, .5, .5, .5, .5, 0, 0, 0, 0)', and β2 = (0, 0, .5, −.5, .5, −.5, 0, 0, 0, 0)'. The predictor is X = (X1, ..., X10)' = Σ0(ξ1, ..., ξ10)', with ξi, i = 1, ..., 10 ~ iid N(0, 1), and Σ0 = (.5^{|i−j|})_{1≤i,j≤10}. The constant c in the model is 1 or 0, depending on whether the linear part exists or not. If c = 0, the model was used by Li (1991). Note that the model is also a special case of models (A) and (A*) with dA = dA* = 3 when c = 1 and dA = dA* = 2 when c = 0.

With c = 1 and different sample sizes n, we estimate the parameters γ0 and B0 = (β1, β2). The estimation errors are defined as |γ0 − γ̂| and |(γ̂, B̂)(γ̂, B̂)' − (γ0, B0)(γ0, B0)'| for model (B), and |B̂AB̂A' − (γ0, B0)(γ0, B0)'| for models (A) and (A*). With 200 replications for each sample size n, the calculation results are shown in Figure 1, (a) and (c). As the sample size increases, the estimation errors of model (B) drop rapidly. Multiplying the errors by a root-n factor, √n/10, the values still tend to decrease when n is small and remain roughly constant as n increases. This pattern supports a root-n consistency rate for the estimation error. Figure 1, (a) and (c), also indicates clearly that model (B) can indeed give much more efficient estimators than model (A), and both are much better than the pHd method (Li 1992).

With c = 0 or c = 1, the proposed selection method can select the model and the dimension quite accurately, as shown in Figure 1, (b) and (d). The frequencies of correct selection tend to 100% as the sample size increases, lending support to the consistency of the selection method. When the model has dimension 2 (i.e., c = 0), the χ² test (Li 1991) has the same efficiency in selecting the dimension as our method. However, when the dimension is 3 (i.e., c = 1), the χ² test is much worse than our method.

Example 6.2 (Simulated data). In this example we check the asymptotic distribution with the model

  Y = X1 − X2 − 2 exp{−(X3 + X4 + X5)²} + σε,

where Xk = ηk + η0, k = 1, 2, 3, 4, 5, with η0, ..., η5 independent and uniformly distributed on [−.5, .5]. Thus, the correlation coefficient between any two covariates is .5. In this example, γ0 = (γ01, ..., γ05)' = (1, −1, 0, 0, 0)' and B0 = β1 = (β01, ..., β05)' = (0, 0, 1, 1, 1)'/√3.

With different sample sizes n and noise magnitudes σ, 500 random samples are generated and used to estimate the model. The calculation results are listed in Table 1. The estimates of γ0 and B0 are quite accurate and stable, judging by the mean and standard deviation of the estimates. When σ = 1 and n = 100, the frequency with which the model and its dimension are correctly selected is .995; for the other three cases the frequencies are all 100%. Next, we check the asymptotic distribution by calculating the frequencies of accepting γ0k = 0 and β0k = 0, that is, the frequencies of |γ̂0k| < 1.96{Δ̂_{k,k}/n}^{1/2} and |β̂0k| < 1.96{Δ̂_{p+k,p+k}/n}^{1/2}, respectively. Based on the
Xia: Multiple-Index Model and Dimension Reduction 1637
Figure 1. Calculation results for Example 6.1. The left panels are the results for σ = .2; the right panels for σ = .5. In (a) and (c), the dashed lines with circles, squares, diamonds, and stars represent the estimation errors, multiplied by √n/10, of (γ0, B0) using model (A∗) with pHd, model (A) with MAVE, and model (B) with the new method, and of γ0 using model (B) with the new method, respectively. In (b) and (d), the solid lines with stars and crosses are the frequencies of correct selection of model (B) and its dimension for c = 1 and c = 0, respectively. The solid lines with circles are the frequencies of correct selection of the dimension using the χ2 test (with a 5% significance level) of Li (1991) with c = 0 (upper panel) and c = 1 (lower panel), respectively.
asymptotic distribution in Corollary 5.1 and Remark 5.4, the frequencies should be around 95% when n is large enough. The frequencies reported in the table lend support to the asymptotic distributions.

Example 6.3 (Real data). A variety of atmospheric pollutants are known or suspected to have serious effects on public health and the environment; see the report of the World Health Organization (WHO) (2003). The main pollutants include nitrogen dioxide (NO2), carbon monoxide (CO), sulfur dioxide (SO2), particulate matter (PMx; particles with diameter smaller than x micrometers), and ozone (O3), among others. Pollutants can be classified as either primary or secondary. Primary pollutants are substances directly produced by a process, such as ash from a volcanic eruption or carbon monoxide gas from motor vehicle exhaust; they can be controlled by reducing the emission of harmful gases. Secondary pollutants are generated in the air when primary pollutants react or interact with weather conditions; see also the WHO report (2003). An important example is ozone, one of the many secondary pollutants that make up photochemical smog. Therefore, it is of great interest to model the dependence of the ozone concentration on the primary pollutants and weather conditions, whereby we can control the ozone concentration by controlling the primary pollutants. For this purpose, we use the data collected in Chicago from 1987 to 1997 (available at http://www.ihapss.jhsph.edu/data/data.htm). The weather con-
Table 1. Mean, standard deviation (in parentheses) of estimates, and frequencies of accepting γ0k = 0 and β0k = 0 at the 5% level (in square brackets) for Example 6.2

σ     n     γ01     γ02     γ03     γ04     γ05     β01     β02     β03     β04     β05
.5 100 .993 −.99 −.008 .001 −.003 .002 .004 .572 .565 .561
(.186) (.183) (.175) (.169) (.167) (.092) (.093) (.081) (.082) (.087)
[.004] [0] [.946] [.946] [.952] [.984] [.976] [0] [0] [0]
400 1.002 −1.002 .002 .001 .001 −.001 .001 .574 .576 .576
(.080) (.079) (.076) (.083) (.082) (.039) (.039) (.036) (.036) (.034)
[0] [0] [.954] [.932] [.930] [.976] [.974] [0] [0] [0]
1.0 100 .973 −1.007 .029 .000 −.015 .017 .001 .517 .540 .540
(.352) (.347) (.333) (.342) (.336) (.180) (.184) (.174) (.168) (.151)
[.202] [.188] [.942] [.918] [.944] [.952] [.944] [.182] [.134] [.142]
400 .995 −1.003 .003 −.001 −.001 .002 .003 .572 .563 .570
(.164) (.159) (.157) (.153) (.158) (.083) (.084) (.069) (.073) (.071)
[0] [0] [.938] [.952] [.936] [.968] [.966] [0] [0] [0]
1638 Journal of the American Statistical Association, December 2008
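The data-generating design of Example 6.2, whose results are summarized in Table 1, is easy to reproduce. The sketch below (NumPy; the seed and sample size are arbitrary choices for illustration) draws from the stated model and confirms the implied pairwise covariate correlation of .5.

```python
import numpy as np

def generate_example_62(n, sigma, rng):
    """Draw one sample from the model of Example 6.2:
    Y = X1 - X2 - 2 exp{-(X3 + X4 + X5)^2} + sigma * eps,
    where X_k = eta_k + eta_0 and eta_0, ..., eta_5 ~ U[-.5, .5] independently."""
    eta = rng.uniform(-0.5, 0.5, size=(n, 6))      # columns: eta_0, eta_1, ..., eta_5
    X = eta[:, 1:] + eta[:, [0]]                   # the shared eta_0 induces correlation .5
    eps = rng.standard_normal(n)
    Y = X[:, 0] - X[:, 1] - 2.0 * np.exp(-(X[:, 2] + X[:, 3] + X[:, 4]) ** 2) + sigma * eps
    return X, Y

rng = np.random.default_rng(0)
X, Y = generate_example_62(200_000, sigma=0.5, rng=rng)

# Var(X_k) = 1/12 + 1/12 = 1/6 and Cov(X_j, X_k) = Var(eta_0) = 1/12, so the
# pairwise correlation is (1/12)/(1/6) = .5; check this empirically.
corr = np.corrcoef(X, rowvar=False)
off_diagonal = corr[~np.eye(5, dtype=bool)]
print(off_diagonal.min(), off_diagonal.max())
```

With this many draws, every off-diagonal entry of the empirical correlation matrix lands within a couple of hundredths of .5.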
Table 2. Selection procedure for dimension and models

Working dimension q   (A): CVA(q)   (B): CVB(q − 1)   (A∗): SIR p value   (A∗): pHd p value
1   .5943   .5744∗   .0000   .0000
2   .4484   .4450    .0000   .0000
3   .4431   .4392    .3299   .0000
4   .4565   .4410    .7472   .0000
5   .4699   .4474    .9207   .0000
6   .4819   .4583    .9515   .0002
7   .4981   .4611    .9862   .1583
8   .5065   .4725    —       —

∗ CV value of the linear regression model.
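The selections d̂A = 3 and dB = 2 follow mechanically from Table 2. A small sketch of that comparison (the CV values are copied from the table; the rule "prefer model (B) when its minimum CV beats every CVA(q)" paraphrases the criterion used in the text):

```python
# CV values copied from Table 2, indexed by working dimension q = 1, ..., 8
cva = [0.5943, 0.4484, 0.4431, 0.4565, 0.4699, 0.4819, 0.4981, 0.5065]
# The CVB(q - 1) column; CVB(0) is the CV value of the linear regression model
cvb = [0.5744, 0.4450, 0.4392, 0.4410, 0.4474, 0.4583, 0.4611, 0.4725]

d_A = min(range(1, 9), key=lambda q: cva[q - 1])   # argmin of CVA(q)
d_B = min(range(8), key=lambda d: cvb[d])          # argmin of CVB(d)
prefer_B = min(cvb) < min(cva)                     # CVB(2) = .4392 < CVA(3) = .4431

print(d_A, d_B, prefer_B)   # -> 3 2 True
```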
ditions include daily temperature (T), humidity (H), and the differences between their highest and lowest values, denoted by VT and VH, respectively.

Let Y be the daily average ozone level and X = (PM10, SO2, NO2, CO, T, H, VT, VH)′. All the variables (including Y) are standardized separately such that each has mean 0 and variance 1, in order to make it easy to compare the estimated coefficients in the dimension reduction directions. First, we apply the model selection method to models (A) and (B). Table 2 lists the calculation results. Based on the calculation, we choose d̂A = 3 for model (A). However, model (B) with dB = 2 is preferred because CVB(2) = .4392 < CVA(3) = .4431 ≤ min{CVA(1), . . . , CVA(8)}. If we use SIR and the χ2 test, the selected dimension is also 3 for model (A∗); see Table 2. If we use pHd, the selected number of dimensions is 7, which does not seem reasonable.

Next, we fit the data to models (A) and (A∗) with dA = dA∗ = 3 and model (B) with dB = 2. The estimation results are shown in Table 3 and Figure 2.

Based on the estimation results and the plots in Figure 2, we have the following observations. First, the plots of model (B) show a much clearer relationship between the level of O3 and the covariates than the plots of models (A) and (A∗), indicating that specifying the model structure as in model (B) is very helpful in recovering the relationships between the response and its covariates. Second, primary pollutants combined with weather conditions have a linear effect on the level of O3, as shown in panel 7 of Figure 2. Third, weather conditions play important roles (and seem to predominate) in the nonlinear part. There are thresholds for the weather conditions (at about 0 in panels 8 and 9 of Fig. 2), suggesting that certain weather conditions are necessary for the chemical reaction between the primary pollutants to create ozone. These findings support the chemistry-based claim that ozone is generated in the air when primary pollutants react or interact under certain weather conditions.

7. CONCLUSION

Dimension reduction is a useful tool for statistical inference. However, it is still far from the goal of statistical modeling. By separating the linear component from the nonlinear components, this article considered one method of specification, leading to model (B). Based on the investigation (theory, numerical performance, and real data analysis) in this article, the specification achieves three goals. First is better estimation of the model; see Remark 5.2 for a detailed discussion and Example 6.1 for a numerical comparison. Second is easier interpretation of the estimated model; see Example 6.3, where the specified model gives a much clearer hint of the possible relationship between the response and the dimension reduction directions than dimension reduction model (A) or model (A∗). Third is identification of the linear and nonlinear components, which is of interest in other disciplines.
Table 3. Estimated coefficients (and corresponding S.E.) of models (A), (B), and (A∗ )
Variables
Model PM10 SO2 CO NO2 T H VT VH
Model (A) β̄1 .009 .089 .581 −.410 −.260 −.583 −.229 −.162
β̄2 −.379 −.467 −.266 −.239 .313 −.633 .029 .101
β̄3 .194 .188 −.020 −.026 .869 .159 .118 .362
Model (B) γ0 .113 .097 .039 .147 −.079 .226 .073 .029
(.019) (.015) (.021) (.018) (.019) (.023) (.041) (.047)
β̃1 −.247 −.112 −.222 .167 −.129 .622 .603 .286
(.121) (.069) (.104) (.096) (.084) (.168) (.251) (.341)
β̃2 −.010 .043 −.137 .111 .545 .488 .489 .438
(.089) (.087) (.111) (.109) (.075) (.203) (.274) (.293)
Model (A∗ ) β1 .208 −.096 .072 −.424 .757 −.322 .091 .277
β2 .445 .264 −.242 .299 .048 .500 .461 .345
β3 −.329 −.228 −.143 .458 .434 .409 .429 .262
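Because every variable (including Y) is standardized, the coefficients in Table 3 apply directly to observations measured in standard-deviation units. A sketch of forming the fitted model (B) indices (the coefficients are copied from the table; the observation x below is hypothetical):

```python
import numpy as np

# Model (B) coefficients from Table 3, ordered as (PM10, SO2, CO, NO2, T, H, VT, VH)
gamma0 = np.array([0.113, 0.097, 0.039, 0.147, -0.079, 0.226, 0.073, 0.029])
beta1 = np.array([-0.247, -0.112, -0.222, 0.167, -0.129, 0.622, 0.603, 0.286])
beta2 = np.array([-0.010, 0.043, -0.137, 0.111, 0.545, 0.488, 0.489, 0.438])

# A hypothetical standardized observation (entries in standard deviations from the mean)
x = np.array([1.0, 0.5, -0.2, 0.8, 1.2, -0.3, 0.4, 0.0])

linear_index = gamma0 @ x                           # the linear part, gamma0' x
nonlinear_args = np.array([beta1 @ x, beta2 @ x])   # the arguments of the nonlinear part
print(linear_index, nonlinear_args)
```

The three scalars are the horizontal coordinates at which this observation would appear in the three lower panels of Figure 2.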
Figure 2. Calculation results for Example 6.3. The three upper panels are plots of Y against the three directions of the dimension reduction
space in model (A). The three panels in the middle are plots of Y against the three directions of the dimension reduction space in model (A∗ ).
The three lower panels are plots of the linear part and two nonlinear parts in model (B).
This article only considered the specification of dimension reduction in the conditional mean function (Samarov 1993; Hristache et al. 2001; Cook and Li 2002; Xia et al. 2002; Yin and Cook 2002; among others). A parallel specification for general dimension reduction (Li 1991; Cook 1998; among others) would be more relevant and needs to be investigated. Other methods of specification are also possible. These specifications will push dimension reduction one step closer to statistical modeling.

APPENDIX: ASSUMPTIONS AND PROOF OF PROPOSITION 1.1

We need the following conditions to prove our theoretical results.

(C1) (Design of X) The density function f(x) of X has bounded second-order derivatives on Rp and E|X|4 < ∞; the functions μB(u) and wB(u) have bounded derivatives with respect to u and B in a small neighborhood of B0.
(C2) (Moments of Y) The response Y has moment E|Y|5 < ∞.
(C3) (Conditional function) The function E(Y − γ′X|B′X = u) has bounded fourth-order derivatives with respect to u, γ, and B for (γ, B) in a small neighborhood of (γ0, B0).
(C4) (Nonlinear dimension) The matrix M0 = E[ρ(f(X)){∇G(B0′X) − ∇̄}{∇G(B0′X) − ∇̄}′] has full rank dB, where ∇̄ = E{ρ(f(X))∇G(B0′X)}/Eρ(f(X)).
(C5) (Kernel function) K(u) is a multivariate density function with bounded second-order derivatives and compact support.
(C6) (Bandwidths) The bandwidths satisfy h0 ∝ n−δ0 and t ∝ n−δ1, where δ0 = 1/(p + 4) and δ1 = 1/(dB + 4).

In (C1), the requirement E|X|4 < ∞ is used to handle the boundary points in the proof with the trimming function Inj. If we adopt the fixed trimming scheme of Härdle, Hall, and Ichimura (1993), this requirement can be removed. Condition (C2) is required to meet the conditions for the uniform consistency results; Härdle et al. (1993) even required the existence of all moments of Y. A lower order of smoothness than (C3) is sufficient if we are only interested in estimation consistency. Condition (C4) indicates that the dimension dB cannot be further reduced. The popular kernel functions, such as the Epanechnikov kernel and the quadratic kernel, are included in (C5); the Gaussian kernel can be used with some modifications to the proofs. Bandwidths satisfying (C6) can be found easily; see, for example, Yang and Tschernig (1999). Actually, a wider range of bandwidths is allowed; see the proofs.

Proof of Proposition 1.1

(I) Without loss of generality, assume that Σ0 = Ip; otherwise, we can standardize the covariates. Let γ̄0 = γ0 − B0(B0′B0)−1B0′γ0 and B̄0 = B0(B0′B0)−1/2QQ0, where Q is the eigenvector matrix of E[{∇G(B0′X) + (B0′B0)−1B0′γ0}{∇G(B0′X) + (B0′B0)−1B0′γ0}′] and Q0 = diag(e1, . . . , eq) with ek = ±1 having the same sign as the first nonzero element in the kth column of B0(B0′B0)−1/2Q. Let Ḡ(u) = G((B0′B0)1/2QQ0u) + γ0′B0(B0′B0)−1/2QQ0u. We have

Y = γ̄0′X + Ḡ(B̄0′X) + ε,

and γ̄0 and B̄0 satisfy (i1)–(i3).
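Step (I) is, in essence, "project the linear coefficient off span(B0), then orthonormalize B0." The sketch below (NumPy; it takes Q = Q0 = I, so the sign and rotation bookkeeping is ignored, and G is an arbitrary placeholder link, not the paper's estimated function) verifies numerically that the reparameterized model reproduces the original regression function:

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 5, 2
B0 = rng.standard_normal((p, d))        # arbitrary (not orthonormal) directions
gamma0 = rng.standard_normal(p)

def G(u):
    return np.exp(-np.sum(u) ** 2)      # placeholder nonlinear link, for illustration

# S^(1/2) and S^(-1/2) via the eigendecomposition of S = B0'B0 (symmetric p.d.)
S = B0.T @ B0
w, V = np.linalg.eigh(S)
S_half = V @ np.diag(np.sqrt(w)) @ V.T
S_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

gamma_bar = gamma0 - B0 @ np.linalg.solve(S, B0.T @ gamma0)   # project off span(B0)
B_bar = B0 @ S_inv_half                                       # orthonormal columns

def G_bar(u):
    # absorbs the rescaling of the index and the part of the linear term in span(B0)
    return G(S_half @ u) + gamma0 @ B0 @ S_inv_half @ u

x = rng.standard_normal(p)
lhs = gamma0 @ x + G(B0.T @ x)             # original parameterization
rhs = gamma_bar @ x + G_bar(B_bar.T @ x)   # identified parameterization
print(np.isclose(lhs, rhs), np.allclose(B_bar.T @ B_bar, np.eye(d)))
```

The Q and Q0 factors in the proof only fix a unique rotation and sign convention within span(B0); they do not change the spanned space, which is why the sketch can take them to be identity matrices.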
(II) Note that Σ = B0QQ0Λ(B0QQ0)′, where QΛQ′ is the eigenvalue–eigenvector decomposition of Σ0. Therefore, B0 must be a basis of the space spanned by the first dB eigenvectors of Σ. Any two such bases differ by a rotation matrix. Thus, the second equation of (II) holds. By (1), we have γ0 = E∇M(X) − B0E{∇G(B0′X)}. Thus, by (i2),

γ0 = (I − B0B0′)γ0 = (I − B0B0′)E{∇M(X)},

which is unique because B0B0′ is unique by the second equation of (II). Therefore, the first equation of (II) holds. The last equation of (II) follows from the first two.

[Received November 2007. Revised July 2008.]

REFERENCES

Banerjee, A. N. (2007), "A Method of Estimating the Average Derivative," Journal of Econometrics, 136, 65–88.
Carroll, R. J., Fan, J., Gijbels, I., and Wand, M. P. (1997), "Generalized Partially Linear Single-Index Models," Journal of the American Statistical Association, 92, 477–489.
Cook, R. D. (1998), Regression Graphics, New York: Wiley.
Cook, R. D., and Li, B. (2002), "Dimension Reduction for Conditional Mean in Regression," The Annals of Statistics, 30, 455–474.
Delecroix, M., Hristache, M., and Patilea, V. (2005), "On Semiparametric M-Estimation in Single-Index Regression," Journal of Statistical Planning and Inference, 136, 730–769.
Donkers, B., and Schafgans, M. M. A. (2003), "A Derivative Based Estimator for Semiparametric Index Models," CentER Discussion Paper 22, Tilburg University.
Fan, J., and Gijbels, I. (1996), Local Polynomial Modeling and Its Applications, London: Chapman & Hall.
Fan, J., and Huang, T. (2005), "Profile Likelihood Inferences on Semiparametric Varying-Coefficient Partially Linear Models," Bernoulli, 11, 1031–1057.
Gao, J., and Tong, H. (2004), "Semiparametric Nonlinear Time Series Model Selection," Journal of the Royal Statistical Society, Ser. B, 66, 321–336.
Grenfell, B. T., Finkenstädt, B. F., Wilson, K., Coulson, T. N., and Crawley, M. J. (2000), "Ecology—Nonlinearity and the Moran Effect," Nature, 406, 847.
Härdle, W., and Stoker, T. M. (1989), "Investigating Smooth Multiple Regression by Method of Average Derivatives," Journal of the American Statistical Association, 84, 986–995.
Härdle, W., Hall, P., and Ichimura, H. (1993), "Optimal Smoothing in Single-Index Models," The Annals of Statistics, 21, 157–178.
Horowitz, J. (1998), Semiparametric Methods in Econometrics, New York: Springer.
Hristache, M., Juditski, A., Polzehl, J., and Spokoiny, V. (2001), "Structure Adaptive Approach for Dimension Reduction," The Annals of Statistics, 29, 1537–1566.
Ichimura, H. (1993), "Semiparametric Least Squares (SLS) and Weighted SLS Estimation of Single-Index Models," Journal of Econometrics, 58, 71–120.
Ichimura, H., and Lee, L. F. (1991), "Semiparametric Least Squares Estimation of Multiple Index Models: Single Equation Estimation," in Nonparametric and Semiparametric Methods in Econometrics and Statistics: Proceedings of the Fifth International Symposium in Economic Theory and Econometrics, eds. W. A. Barnett, J. Powell, and G. Tauchen, Cambridge, U.K.: Cambridge University Press.
Li, K. C. (1991), "Sliced Inverse Regression for Dimension Reduction," Journal of the American Statistical Association, 86, 316–342.
Li, K. C. (1992), "On Principal Hessian Directions for Data Visualization and Dimension Reduction: Another Application of Stein's Lemma," Journal of the American Statistical Association, 87, 1025–1039.
Maestri, R., et al. (2005), "Linear and Non-Linear Indices of Heart Rate Variability in Chronic Heart Failure: Mutual Interrelationships and Prognostic Value," Computers in Cardiology, 3, 981–984.
Newey, W. K., and Stoker, T. M. (1993), "Efficiency of Weighted Average Derivative Estimation and Index Models," Econometrica, 61, 1199–1223.
Rao, C. R., and Mitra, S. K. (1971), Generalized Inverse of Matrices and Its Applications, New York: Wiley.
Samarov, A. M. (1993), "Exploring Regression Structure Using Nonparametric Functional Estimation," Journal of the American Statistical Association, 88, 836–847.
Samarov, A., Spokoiny, V., and Vial, C. (2005), "Component Identification and Estimation in Nonlinear High-Dimensional Regression Models by Structure Adaptation," Journal of the American Statistical Association, 100, 429–445.
Scott, D. W. (1992), Multivariate Density Estimation: Theory, Practice and Visualization, New York: Wiley.
Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis, London: Chapman and Hall.
Speckman, P. (1988), "Kernel Smoothing in Partially Linear Models," Journal of the Royal Statistical Society, Ser. B, 50, 413–436.
Tong, H. (1990), Non-Linear Time Series: A Dynamical System Approach, New York: Oxford University Press.
Wang, Q. (2003), "Dimension Reduction in Partly Linear Error-in-Response Models With Validation Data," Journal of Multivariate Analysis, 85, 234–252.
World Health Organization (2003), Report of a WHO/HEI Working Group, Bonn, Germany. Available at www.euro.who.int/document/e78992.pdf.
Xia, Y. (2006), "Asymptotic Distributions of Two Estimators of the Single-Index Model," Econometric Theory, 22, 1112–1137.
Xia, Y. (2007), "A Constructive Approach to the Estimation of Dimension Reduction Directions," The Annals of Statistics, 35, 2654–2690.
Xia, Y., Tong, H., Li, W. K., and Zhu, L. (2002), "An Adaptive Estimation of Dimension Reduction Space," Journal of the Royal Statistical Society, Ser. B, 64, 363–410.
Yang, L., and Tschernig, R. (1999), "Multivariate Bandwidth Selection for Local Linear Regression," Journal of the Royal Statistical Society, Ser. B, 61, 793–815.
Yin, X., and Cook, R. D. (2002), "Dimension Reduction for the Conditional k-th Moment in Regression," Journal of the Royal Statistical Society, Ser. B, 64, 159–175.
Yin, X., and Cook, R. D. (2004), "Asymptotic Distribution of Test Statistic for the Covariance Dimension Reduction Methods in Regression," Statistics and Probability Letters, 68, 421–427.
Yu, Y., and Ruppert, D. (2002), "Penalized Spline Estimation for Partially Linear Single-Index Models," Journal of the American Statistical Association, 97, 1042–1054.