
A ROBUST MULTIPLE IMPUTATION MODEL

UNDER NON-RANDOMLY MISSING DATA WITH


OUTLIERS

DIBAL, NICHOLAS PINDAR

DECEMBER 2014
A ROBUST MULTIPLE IMPUTATION MODEL UNDER
NON-RANDOMLY MISSING DATA WITH OUTLIERS

By

DIBAL, NICHOLAS PINDAR


(899008084)

B.Sc. (Hons) Statistics, Ahmadu Bello University, Zaria, Nigeria, 1987.

M.Sc. Statistics, University of Lagos, Lagos, Nigeria, 1990.

THESIS SUBMITTED TO THE SCHOOL OF POSTGRADUATE


STUDIES, UNIVERSITY OF LAGOS IN PARTIAL FULFILMENT
OF THE REQUIREMENT FOR THE AWARD OF THE DEGREE
OF DOCTOR OF PHILOSOPHY (Ph.D.) IN STATISTICS

Department of Mathematics

University of Lagos

December 2014
CERTIFICATION
This is to certify that the thesis:

A ROBUST MULTIPLE IMPUTATION MODEL UNDER


NON-RANDOMLY MISSING DATA WITH OUTLIERS

Submitted to the

School of Postgraduate Studies, University of Lagos


For the award of the degree of
DOCTOR OF PHILOSOPHY (Ph.D.)
Is a record of original research carried out

By

DIBAL, NICHOLAS PINDAR

(899008084)

In The Department of Mathematics

___________________________________ _____________ _____________

Author's Name Signature Date

___________________________________ _____________ _____________

1st Supervisor's Name Signature Date

___________________________________ _____________ _____________

2nd Supervisor's Name Signature Date


Dedication
This thesis is dedicated to the loving memory of my late parents, Mr Pindar Y. Dibal and
Mary Pindar Dibal, who were part of this dream but never lived to see this day.
Acknowledgements

To God Almighty I give all the glory and thanks for making it possible for me to successfully

complete this program. I will forever remain grateful for everything, Lord.

I would like to express my profound gratitude to my supervisor, Professor Ray Okafor for his

guidance, thoroughness, constructive criticism, patience and support that made this work

possible and plausible. I sincerely appreciate the opportunity he gave me to work under him.

My profound gratitude also goes to my second supervisor, Dr Hamadu Dallah who took

interest and devoted much of his time to see to the success of this research work. I appreciate

the in-depth suggestions and comments that helped greatly in bringing the quality of this

work to this standard. I must sincerely say that it has really been a rewarding experience

working under their guidance. I will always remain grateful.

I acknowledge with thanks the immense contributions made by all staff of the Department of

Mathematics, University of Lagos, for creating a friendly and conducive environment that made

my stay at the University of Lagos worthwhile. I cherish and will always keep the bond of

friendship we have built.

I wish to offer my sincere thanks and gratitude to my wife, Maryam Nicholas, and children,

Daniel, Musa and Ibrahim, for helping to carve out the time this work required and for their understanding and

sacrifice throughout the period of this program. I appreciate you all for everything.

To all my outstanding friends, Dr Johnson Olumuyiwa Agunsoye, Joshua Ali, Maria
Lorena Aguilera, Baba Salihu, Bukar Shuwa, Kama Mohammed, Fatoki Olayode,
B. A. Nkemnole, Akarawak and Ehige: it has really been more than friendship. I will forever
remain grateful for meeting such wonderful people as you in life.

I also owe a significant debt of gratitude to all those who helped to review the work. Your

comments, suggestions and constructive criticism greatly helped to improve the quality of the

work to this standard.


Table of Contents
List of figures

List of Tables

1 CHAPTER ONE

1.0 Introduction and Background of Study 1

1.2 Statement of the Problem 5

1.3 Aims and Objectives of Study 6

1.4 Scope and Delimitation of Study 6

1.5 Significance of Study 6

1.6 Operational Definitions of Terms 7

1.7 Definition of Terms 7

2 CHAPTER TWO

LITERATURE REVIEW

2.1 Missing Data 9

2.2 Methods of Handling Missing Data 10


2.2.1 Ad hoc Methods of Handling Missing Data 11

2.2.2 Listwise Deletion 11

2.2.3 Pairwise Deletion 12

2.2.4 Mean Substitution 13

2.2.5 Simple Hot-Deck (HD) 13

2.2.6 Regression Based Imputation 14

2.3 Overview of Statistically Principled Methods 14

2.3.1 Direct Maximum Likelihood Estimation (DML) 14

2.3.2 Bayesian Modeling with MCMC methods 15

2.3.3 Expectation-Maximization (EM) Algorithm 16

2.3.4 Multiple Imputation 16

2.3.5 Combining the m Parameter Estimates 18

2.4 Missingness Mechanism 20

2.4.1 Missing Completely at Random 21

2.4.2 Missing at Random 22

2.4.3 Missing not at Random 23

2.5 Outliers 24

2.6 Contamination Model 26

2.7 Handling Outliers and Missing Values Simultaneously 26

3 CHAPTER THREE

METHODOLOGY

3.0 Introduction 28

3.1 The Imputation Model 28

3.2 Multiple Imputation Procedure 29

3.3 Parameter Estimation 31

3.3.1 Combining the m Parameters 31


3.3.2 Properties of the Point Estimator 32

3.3.3 Finite Sample Properties 33

3.3.4 Asymptotic Properties of Multiple Imputation 39

3.4 Notations and Settings 44

3.5 The S-estimator 46

3.6 The generalized S-estimator 47

3.7 Multivariate Location and covariance 48

3.8 The Breakdown point 50

3.8.1 Consistency of Estimators 52

3.8.2 Asymptotic Normality 53

3.8.3 Asymptotic Normality of 53

3.8.4 Consistency of 58

3.9 Missingness Mechanism and Distribution of Outliers 59

3.10 Proposed Imputation Method 61

3.10.1 Parameter Estimates 63

3.10.2 The Computation Algorithm 64

3.11 The Simulation Study and Design 66

3.12 Performance Measures 67

3.13 Software 69

3.14 R-code for Simulation and Computation 69

4 CHAPTER FOUR

RESULTS AND DISCUSSION

4.1 Introduction 70

4.2 Simulated Sample Data for the Study 71

4.3 Diagnostic Plots 77

4.3.1 Graphical Presentation of the Confidence Intervals 78


4.3.2 The Profile Plot 81

4.3.3 The Scatterplot Matrix 83

4.3.4 The Robust versus Classical distances 84

4.3.5 The Chi-square Quantiles 87

4.3.6 The Tolerance Ellipse 89

4.4 Sample Printout for Location and Scale 89

4.5.1 Covariance Test 92

4.5.2 LRT Distances and Convergence Rate 93

4.5.3 Sensitivity Analysis 95

4.6 Application to Practical Problem 96

5 CHAPTER FIVE

SUMMARY OF FINDINGS, CONCLUSION AND CONTRIBUTION TO


KNOWLEDGE

5.0 Summary of Findings 101

5.1 Conclusion 103

5.2 Contributions to Knowledge 104

5.4 Suggestions for Further Research 104

6 References 105

7 Appendix 1: Settings and Construction of 95% and 99% Confidence

Intervals for the Population Mean 113

Appendix 2: Code for Bar Plots 119

Appendix 3: Graphical Display of Confidence Intervals 122

Appendix 4: Sensitivity Analysis 123

Appendix 5: Test Statistics for equality of variances 127

Appendix 6: Profile Plot 129

Appendix 7: Statistics on Poverty and Inequality 130


List of figures
Page

Figure 1: Multiple Imputation Steps 18

Figure 2: Graphical Representation of Confidence Intervals 79

Figure 3: Multiple Bar Charts 80

Figure 4: Profile plot of the Simulated Variables 81

Figure 5: Scatter matrix of the Distribution of the Variables I 82

Figure 6: Scatter matrix of the Distribution of the Variables II 83

Figure 7: Robust Distances versus Mahalanobis Distances 84

Figure 8: Robust and Classical Distances Side by Side 85

Figure 9: Chi-square Q-Q plot 87

Figure 10: Robust and Classical Tolerance Ellipse 88

Figure 11: Chi-square QQ plot for Poverty Data 99

Figure 12: Robust Distance for Poverty Data 99

Figure 13: Robust versus Mahalanobis Distances 100


List of Tables

Page

Table 1: Combination of Missing mechanisms and Distribution of Outliers 61

Table 2: Coverage probabilities, Average widths and Standard Errors

for population means 72

Table 3: Covariance Test Comparing the Sample and Assumed Population 91

Table 4: LRT and Convergence 93

Table 5: Sensitivity Analysis between MNAR and MAR 95

Table 6: Summary of Findings 102


ABSTRACT

In all fields of knowledge, decision making and problem solving are essential, and these critical
functions need accurate information obtained from reliable data. Missing values
and outliers are of great concern, as either ignoring them or handling them inappropriately may
cause forecasting models to describe neither the bulk of the data nor the outliers, leading
to incorrect decisions. The properties of multiple imputation in contaminated multivariate data
under a non-ignorable missingness mechanism were investigated. The generalised S-estimator
(GSE) algorithm was used for the robust estimation of location and scale parameters.
The behaviour of the proposed method was studied under different design conditions, and
theoretically the proposed method was shown to be unbiased and consistent under elliptical
models. Monte Carlo simulation studies with four factors (response rate,
contamination level, missingness mechanism and sample size), yielding 135 different design
conditions, were evaluated, and the findings support the theoretical results empirically. A high
rate of missing values, sample size and contamination level were found to have a significant
influence on the performance of the estimators. Shorter confidence intervals and robust
standard errors were achieved, with the process converging rapidly. The simulation results
show that under the missing not at random (MNAR) mechanism, the proposed method performed
better than when values are missing at random (MAR) or missing completely at random
(MCAR) at all sample sizes. Finally, the method was applied to two sets of incomplete,
contaminated multivariate real-life data, one on poverty and inequality and one on Masakwa
seedling survival rates, and the results obtained support the simulation findings.
UNIVERSITY OF LAGOS
SCHOOL OF POSTGRADUATE STUDIES
DEPARTMENT OF MATHEMATICS

APPLICATION FOR APPROVAL OF TITLE OF DOCTORAL THESIS AND


SUPERVISORS

SECTION A: Particulars of Candidate

Name: DIBAL, Nicholas Pindar

Matriculation Number: 899008084

Department: Mathematics

Qualifications: B.Sc. (Hons) Statistics (Second Class Lower), A.B.U., Zaria, 1987
M.Sc. (Statistics), University of Lagos, 1990

Degree in View: Ph.D. (Statistics)

Date of first Registration: January 15th, 2010

Field of Study: Sample Survey


Proposed Title of Thesis: A Robust Multiple Imputation Model under Non-randomly
Missing Data with Outliers

SECTION B: Ph. D. Course Work Results

COURSE CODE   COURSE TITLE                               UNITS   GRADE   GRADE POINT

MAT 930       Advanced Theory of Statistics I            3       A       5
MAT 935       Selected Topics in Applied Statistics      3       A       5
MAT 972       Advanced Topics in Statistical Theory II   3       A       5
MAT 973       Advanced Topics in Applied Statistics I    3       B       4
Total Units: 12    CGPA: 4.75

Date of Approval of Course Work: 16th March, 2011


Proposed Supervisors:

1st SUPERVISOR

Name: Professor Ray Okafor


Designation: Professor
Department: Mathematics

2nd SUPERVISOR
Name: Dr. Hamadu Dallah
Designation: Associate Professor
Department: Actuarial Science

Proposed Title of Thesis: Robust Multiple Imputation under Non-randomly Missing Data
with Outliers

Recommendations

The Departmental Postgraduate Committee at its meeting held on August 28, 2014
considered the above application and recommends same to the Board of the School of
Postgraduate Studies for approval.

______________________________ _______________________

Dr. O. J. Fenuga Prof. S. O. Ajala


Chairman, Departmental PG Committee Head of Department
SECTION C

Originality of the work:

The candidate's work is original. The candidate worked on a new Robust Multiple Imputation
method that is consistent and efficient in handling contaminated multivariate data under
non-ignorable missingness.

Evidence of Competence in the field of study

The candidate has demonstrated a good understanding and high level of competence in
handling the joint problem of outliers and missing values under non-ignorable missingness.
He has also demonstrated good knowledge and understanding of the development of R code
for simulation and application to practical problems.

Interim assessment of candidate's critical judgement:

The candidate has developed a Robust Multiple Imputation model that performs very well on
contaminated data under non-ignorable missingness. The proposed method has been shown to be
robust, consistent and efficient, and can be generalized to other more complex multivariate
models such as mixed models and generalized linear models.

Potential worth of the content of the research material for purpose of publication:

The content of the work done is publishable in both local and international journals. At least
three papers have been prepared and are under the process of publication.

Potentiality for contribution to knowledge:

The research work has three major contributions to knowledge, which include the following: the
proposed Robust Multivariate Multiple Imputation is useful in enlarging the theory and application of
multiple imputation in handling missing data and outliers; and the developed simulation R-code
can easily be generalized and applied to handle more complex multivariate models such as
generalised linear models and mixed linear models.

SECTION D

An assessment of progress in the research during the period, including any serious
delay or very rapid progress in the student's work:

During the period of the candidate's study, he has shown a high level of competence and
exhibited great commitment and hard work in his research. At the same tempo, he
should be able to finish the programme within a few months from now.
SECTION E: Particulars of Supervisors

1st SUPERVISOR

i Name: Prof. Ray Okafor


ii Designation: Professor
iii Department: Mathematics
iv Signature: ______________________
v Date: ______________________

2nd SUPERVISOR

i Name: Dr. Hamadu Dallah


ii Designation: Associate Professor
iii Department: Actuarial Science
iv Signature: _______________________
v Date: _______________________
UNIVERSITY OF LAGOS
SCHOOL OF POSTGRADUATE STUDIES
DEPARTMENT OF MATHEMATICS

APPLICATION FOR APPROVAL OF TITLE OF THESIS AND SUPERVISORS

ATTESTATION

It is hereby confirmed that on August 28, 2014, DIBAL, Nicholas Pindar, with matriculation
number 899008084, successfully defended before the Department's Postgraduate Committee
the Ph.D. proposal titled "Robust Multiple Imputation under Non-randomly Missing Data
with Outliers".

The Departmental Postgraduate Committee at its meeting of August 28, 2014 considered the
above application and recommends same to the Board of the School of Postgraduate Studies
for approval.

_____________________________ __________________________
Professor S. O. Ajala Dr. O. J. Fenuga
Head of Department Departmental PG Coordinator

_____________________________ ___________________________
Professor R. Okafor Professor S. A. Okunuga

____________________________ ___________________________
Professor J. O. Olaleru Professor S. S. Okoya

____________________________ ___________________________
Dr. I. O. Abiala Mr A. Adeniyan

_____________________________ ___________________________
Dr M. O. Adamu Dr H. Akewe
LIST OF CORRECTIONS MADE
                                                     Page
1    Recast Abstract
2    Do editorial work                               The entire work has been edited
3    Delete research questions                       It has been deleted from the work
4    Label figures appropriately                     30
5    Follow SPGS format for summary of findings      34-35
6    Recast contribution to knowledge                36
7    Rework references                               37-40


Acknowledgement
Professor Ray Okafor and Dr Hamadu Dallah
Finally, many great people helped to make this dream a reality. I wish to offer my
sincere thanks to my wife and children for helping to carve out the time this work
required, and for their understanding and sacrifice. I appreciate your
patience throughout this process. Outstanding friends and students
provided input and plenty of constructive criticism along the way that contributed
to my thinking. I also owe a significant debt of gratitude to all those who helped
review the work to improve its quality.

Yakubu Bukar (blessed memory), Danjuma Bako Mshelia, Simon Malgwi, Prof.
Bello Mshelia, Mr Isah Jonah, Bitrus Salihu Hena, Timothy A. Mbaya, Charity
Haruna, Dolapo Aramide, Ifeoma Avemaria Obasi, Diana,

Dr Johnson Olumuyiwa Agunsoye, Maria Lorena Aguilera, Joshua Ali, Baba
Salihu, Bukar Shuwa, Kama Mohammed, Fatoki Olayode
UNIVERSITY OF LAGOS
SCHOOL OF POSTGRADUATE STUDIES
DEPARTMENT OF MATHEMATICS
RECOMMENDATION FOR APPROVAL OF PANEL OF EXPERTS
PARTICULARS OF CANDIDATE
Name: DIBAL Nicholas Pindar
MATRIC NO: 899008084
QUALIFICATIONS: B.Sc. (Hons) Statistics (A.B.U., Zaria), Second Class Lower, 1987
M. Sc. Statistics (UNILAG), 1990
DATE OF FIRST REGISTRATION: 15th January, 2010
DEGREE IN VIEW: Ph.D. (Statistics)
STATUS: Full Time
APPROVED TITLE OF THESIS: A Robust Multiple Imputation Model Under
Non-Randomly Missing Data with Outliers
DATE OF APPROVAL: 17th December, 2014
FIELD OF STUDY: Sample Survey
SUPERVISORS: 1) Prof. R. Okafor
Department of Mathematics
University of Lagos, Akoka
2) Dr H. Dallah (Associate Professor)
Department of Actuarial Science
University of Lagos, Akoka
PROPOSED PANEL OF EXPERTS:
EXTERNAL: 1) Professor Adebayo T. Bamiduro
Department of Actuarial and Math. Sciences
Redeemer's University, Ede, Osun State
e-mail: adebayoobamiduro@yahoo.com
bamiduroa@run.edu.ng
Tel: 08074011169
2) Professor Peter A. Osanaiye
Department of Statistics
University of Ilorin, Ilorin, Kwara State
e-mail: peterosanaiye@yahoo.com
Tel: 08035619093
3) Professor O. E. Asiribo
Department of Statistics
Federal University of Agriculture
Abeokuta, Ogun State
e-mail: asiribo@yahoo.com
Tel: 08033526707, 08024489565
INTERNAL: Dr Ismail A. Adeleke (Senior Lecturer)
Department of Actuarial Science
University of Lagos, Akoka
Tel: 08035648121
RECOMMENDATION: The Departmental Postgraduate Committee at its meeting held in December, 2014
considered the above application and recommends same to the Board of the School of Postgraduate Studies for approval.

_______________________ __________________________
Prof. S. O. Ajala Dr O. J. Fenuga
Head, Department of Mathematics Chairman, Departmental Postgraduate committee
UNIVERSITY OF LAGOS
SCHOOL OF POSTGRADUATE STUDIES
DEPARTMENT OF MATHEMATICS
RECOMMENDATION FOR APPROVAL OF PANEL OF EXAMINERS
Section A: PARTICULARS OF CANDIDATE
Name: DIBAL Nicholas Pindar
MATRIC NO: 899008084
QUALIFICATIONS: B.Sc. (Hons) Statistics (A.B.U., Zaria), Second Class Lower, 1987
M.Sc. Statistics (UNILAG), 1990
DEGREE IN VIEW: Ph.D. Statistics
DATE OF FIRST REGISTRATION: 15th January, 2010
STATUS: Full Time
APPROVED TITLE OF THESIS: A Robust Multiple Imputation Model Under
Non-Randomly Missing Data with Outliers
FIELD OF STUDY: Sample Survey
DATE OF APPROVAL: 24th December, 2014
SUPERVISORS: 1) Prof. R. Okafor
2) Dr H. Dallah (Associate Professor)
Section B: PROPOSED PANEL OF EXAMINERS:
1 EXTERNAL EXAMINER: Prof.
Department of Mathematics
University of
Tel: 080
e-mail:
2 INTERNAL EXAMINER 1: Prof.
Department of Mathematics
University of
Tel:
e-mail:
3 INTERNAL EXAMINER 2: Prof.
Department of Mathematics
University of
Tel:
e-mail:
POSTGRADUATE REPRESENTATIVE: Prof
Department of
University of Lagos, Akoka
Tel: 080
e-mail:
RECOMMENDATION
The Departmental Postgraduate Committee at its meeting held on December, 2014 considered the above
application and recommends same to the Board of Postgraduate Studies for approval.

_________________________ _____________________________________
Prof S. O. Ajala Dr O. J. Fenuga
Head, Department of Mathematics Chairman, Departmental Postgraduate Committee
UNIVERSITY OF LAGOS
SCHOOL OF POSTGRADUATE STUDIES
DEPARTMENT OF MATHEMATICS

RECOMMENDATION FOR THE APPROVAL OF EXAMINERS' REPORT

NAME: DIBAL Nicholas Pindar


MATRIC NO: 899008084
QUALIFICATIONS: B.Sc. (Hons) Statistics (A.B.U., Zaria), Second Class Lower, 1987
M.Sc. Statistics (UNILAG). 1990
DEGREE IN VIEW: Ph.D. (Statistics)
DATE OF FIRST REGISTRATION: 15th January, 2010
STATUS: Full Time
APPROVED TITLE OF THESIS: A Robust Multiple Imputation Model Under
Non-Randomly Missing Data with Outliers
DATE OF APPROVAL: 24th December, 2014
FIELD OF STUDY: Sample Survey
SUPERVISORS: 1) Prof. R. Okafor
Department of Mathematics
University of Lagos, Akoka
2) Dr H. Dallah
Department of Actuarial Science
University of Lagos, Akoka
PANEL OF EXAMINERS:
EXTERNAL Prof.
Department of
University of

INTERNAL : 1) Prof.
Department of
University of Lagos, Akoka

2) Prof
Department of
University of Lagos, Akoka

SPGS REPRESENTATIVE: Prof


Department of
University of Lagos, Akoka

DATE OF ORAL EXAMINATION: January


EXAMINERS' REPORT
i Standard of Presentation: The standard of presentation was very good.
ii Methodology: The methodology of the research was appropriate to the field of study.
iii Knowledge of the Field Demonstrated by the Candidate: The candidate demonstrated an in-depth
knowledge of the field of study.
iv Contributions To Knowledge Made Therein:
1 the thesis developed
2 the thesis
3 the thesis
v Performance at Oral Examination: The candidate's performance at the oral examination was very
good and impressive. He answered all the questions put to him confidently.
vi Any other comments: The candidate should effect minor corrections pointed out to him by the
examiners to the satisfaction of the Internal Examiners.

RECOMMENDATION
We recommend that the Thesis be accepted and the degree of Doctor of Philosophy (Ph.D.) in Statistics be
awarded to the candidate DIBAL Nicholas Pindar subject to item (vi) above.

SIGNATURES
Prof. Dr Adeleke Ismail
INTERNAL EXAMINER INTERNAL EXAMINER

Prof.
EXTERNAL EXAMINER

01/01/2015 Dr Dr
Date P.G. Representative Chairman of Panel

Certification
We certify that the thesis has been corrected in accordance with the comments of the examiners to our
satisfaction. We therefore recommend that the degree of Doctor of Philosophy (Ph.D.) in Statistics be awarded
to the candidate DIBAL Nicholas Pindar

Signatures

Prof. Dr.
INTERNAL EXAMINER INTERNAL EXAMINER

07/01/2015
DEPARTMENTAL RECOMMENDATION:
The Departmental Postgraduate Committee at its meeting of 10/01/2015 considered and recommends the above
Ph.D. Examiners' Report to the Board of the School of Postgraduate Studies for approval.

Dr. O. J. Fenuga Prof. S. O. Ajala


Chairman, Departmental PG Committee Head of Department
