Language Testing and Evaluation (LTE) 17

Ute Knoch · Diagnostic Writing Assessment:
The Development and Validation of a Rating Scale

The diagnostic assessment of writing is an important aspect of language testing
which has often been neglected in the literature. However, it is an area which poses
special challenges to practitioners both in the classroom and in large-scale testing
situations. This book presents a study which set out to develop and validate a rating
scale specifically designed for the diagnostic assessment of writing in an academic
English setting. The scale was developed by analysing a large number of writing
performances produced by both native speakers of English and learners of English
as an additional language. The rating scale was then validated using both
quantitative and qualitative methods. The study showed that a detailed data-based
rating scale is more valid and more useful for diagnostic purposes than the more
commonly used impressionistic rating scale.

Ute Knoch is a research fellow at the Language Testing Research Centre at the
University of Melbourne. Her research interests are in the areas of language
assessment, second language acquisition, and language pedagogy.

www.peterlang.de
PETER LANG · Internationaler Verlag der Wissenschaften


Diagnostic Writing Assessment
Language Testing and Evaluation
Series editors: Rüdiger Grotjahn and Günther Sigott

Volume 17

PETER LANG
Frankfurt am Main · Berlin · Bern · Bruxelles · New York · Oxford · Wien
Ute Knoch

Diagnostic Writing
Assessment
The Development and Validation
of a Rating Scale

PETER LANG
Internationaler Verlag der Wissenschaften
Bibliographic information of the Deutsche Nationalbibliothek:
The Deutsche Nationalbibliothek lists this publication in the Deutsche
Nationalbibliografie; detailed bibliographic data are available on the
Internet at http://dnb.d-nb.de.

Printed on ageing-resistant, acid-free paper.

ISSN 1612-815X
ISBN 978-3-631-58981-6
© Peter Lang GmbH
Internationaler Verlag der Wissenschaften
Frankfurt am Main 2009
All rights reserved.
This work, including all of its parts, is protected by copyright. Any use
outside the narrow limits of copyright law without the consent of the
publisher is prohibited and liable to prosecution. This applies in
particular to reproduction, translation, microfilming, and storage and
processing in electronic systems.

www.peterlang.de
ABSTRACT

Alderson (2005) suggests that diagnostic tests should identify strengths and weak-
nesses in learners' use of language, focus on specific elements rather than global
abilities and provide detailed feedback to stakeholders. However, rating scales
used in performance assessment have been repeatedly criticized for being impre-
cise, for using impressionistic terminology (Fulcher, 2003; Upshur & Turner,
1999; Mickan, 2003) and for often resulting in holistic assessments (Weigle,
2002).

The aim of this study was to develop a theoretically-based and empirically-developed
rating scale and to evaluate whether such a scale functions more reliably and validly
in a diagnostic writing context than a pre-existing scale with less
specific descriptors of the kind usually used in proficiency tests. The existing
scale is used in the Diagnostic English Language Needs Assessment (DELNA)
administered to first-year students at the University of Auckland. The study was
undertaken in two phases. During Phase 1, 601 writing scripts were subjected to a
detailed analysis using discourse analytic measures. The results of this analysis
were used as the basis for the development of the new rating scale. Phase 2 in-
volved the validation of this empirically-developed scale. For this, ten trained rat-
ers applied both sets of descriptors to the rating of 100 DELNA writing scripts. A
quantitative comparison of rater behavior was undertaken using FACETS (a
multi-faceted Rasch measurement program). Questionnaires and interviews were
also administered to elicit the raters' perceptions of the efficacy of the two scales.

The results indicate that rater reliability and candidate discrimination were gener-
ally higher and that raters were able to better distinguish between different aspects
of writing ability when the more detailed, empirically-developed descriptors were
used. The interviews and questionnaires showed that most raters preferred using
the empirically-developed descriptors because they provided more guidance in the
rating process. The findings are discussed in terms of their implications for rater
training and rating scale development, as well as score reporting in the context of
diagnostic assessment.

ACKNOWLEDGEMENTS

This book would not have been possible without the help and support of many
individuals. I would like to thank the following people:
• Professor Rod Ellis for his patient support and expert guidance throughout
  the preparation of this research. Our discussion of all aspects of the research
  was enormously helpful. I am especially grateful for the long hours he spent
  reading and checking my drafts.
• Janet von Randow for her incredible enthusiasm and helpfulness at all stages
  of this study, for providing access to the DELNA materials and for her
  wonderful morning teas.
• A special thanks needs to be reserved for Associate Professor Catherine Elder
  for sparking my interest in language assessment.
• Carol Myford and Mike Linacre, who answered my copious questions about
  FACETS. I appreciate their comments with regard to several of the statistics
  used in this study.
• The raters who agreed to take part in my study for patiently undertaking the
  task of marking and remarking the one hundred writing scripts, showing both
  good humour and a real sense of responsibility and dedication throughout.

This publication is supported by a grant from the Research and Research Training
Committee, Faculty of Arts, The University of Melbourne and by a Grant-in-Aid
from the School of Languages and Linguistics, Faculty of Arts, The University of
Melbourne.

TABLE OF CONTENTS

Chapter 1: Introduction

Chapter 2: Performance Assessment of Writing

Chapter 3: Rating scales

Chapter 4: Measuring Constructs and Constructing

Chapter 5: Methodology – Analysis of writing scripts

Chapter 6: Results – Analysis of writing scripts

Chapter 7: Discussion – Analysis of writing scripts

Chapter 8: Methodology – Validation of rating scale

Chapter 9: Results – Validation of rating scale

Chapter 10: Discussion – Validation of rating scale

Chapter 11: Conclusion

APPENDICES

REFERENCES

Chapter 1: Introduction

1.1 Background

In the late 1990s the University of Auckland experienced an influx of students in
both undergraduate and (to a lesser extent) postgraduate study with insufficient
language skills to cope with university expectations. Because of this, a large sum
of money was made available for the development of a diagnostic assessment
which was to be administered post admission. The aim of this assessment was to
assess all students (both native and non-native speakers of English) entering un-
dergraduate degree courses so that students at risk could be identified and then
guided to the appropriate academic English help available on campus.

The development of DELNA (Diagnostic English Language Needs Assessment)
began in 2000 and 2001. Because of time and financial constraints, some tasks
were not developed in-house. One major contributor at the time was the Language
Testing Research Centre in Melbourne, with its comparable DELA (Diagnostic
English Language Assessment). Financial constraints also made it clear that it
would not be possible to conduct detailed diagnostic assessments on all students.
Therefore, a screening procedure was developed so that more proficient students
could be filtered out and students considered at risk could be asked to complete a
more detailed diagnostic assessment.

The diagnostic section of the assessment, which is administered after the screen-
ing, comprises listening and reading tasks (which are developed and validated at
the University of Melbourne) and an expository writing task (which is developed
in-house). Both the reading and listening tasks each produce a single score. The
writing task, which is the focus of this study, is scored using an analytic rating
scale. The DELNA rating scale has nine traits, arranged into three groups (flu-
ency, content and form). Each trait is divided into six level descriptors ranging
from four to nine. The rating scale was adapted from a pre-existing scale used at
the University of Melbourne.

No information is available on how that scale was developed. Since its introduc-
tion to DELNA, the rating scale has been modified a number of times, mainly
through consultation with raters.

A closer inspection of the DELNA rating scale reveals that it is typical of rating
scales commonly used in performance assessment systems such as, for example,
IELTS (International English Language Testing System) and TOEFL (Test of
English as a Foreign Language). The traits (organisation, cohesion, style, content,
grammatical accuracy, sentence structure and vocabulary and spelling) are
representative of traits usually encountered in rating scales of writing. The level
descriptors make use of a common practice in writing performance assessment:
adjectives (e.g. satisfactory, adequate, limited, inadequate) are used to differentiate
between the different level descriptors.

DELNA writing scores are reported to two stakeholder groups. Students receive
one score averaged from the nine traits on the rating scale. In addition, students
are also given a brief statement about their performance on each of the three cate-
gories of fluency, content and form. Departments are presented with one overall
writing score for each student.
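To make this reporting procedure concrete, the following short sketch (in Python)
shows how an averaged overall score and the three category profiles described above
might be assembled from nine trait ratings on the 4–9 band range. It is purely
illustrative: the trait names, their grouping into fluency, content and form, and the
example scores are assumptions for demonstration, not the actual DELNA
specification.

# Illustrative sketch only: trait names and their grouping into the three
# DELNA categories are assumed for demonstration, not taken from the test.
TRAIT_GROUPS = {
    "fluency": ["organisation", "cohesion", "style"],
    "content": ["development of ideas", "relevance", "task fulfilment"],
    "form": ["grammatical accuracy", "sentence structure", "vocabulary and spelling"],
}

def report(trait_scores):
    """Return the overall (averaged) score and a mean per category."""
    all_scores = [trait_scores[t] for traits in TRAIT_GROUPS.values() for t in traits]
    overall = sum(all_scores) / len(all_scores)   # single averaged writing score
    category_means = {                            # basis for the three category statements
        group: sum(trait_scores[t] for t in traits) / len(traits)
        for group, traits in TRAIT_GROUPS.items()
    }
    return overall, category_means

# Hypothetical candidate: each trait rated somewhere on the 4-9 band range.
scores = {"organisation": 7, "cohesion": 6, "style": 6,
          "development of ideas": 7, "relevance": 8, "task fulfilment": 7,
          "grammatical accuracy": 5, "sentence structure": 6,
          "vocabulary and spelling": 6}
print(report(scores))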

1.2 My experience of being a DELNA rater

I was first confronted with rating scales for writing assessment in 2001. In that
year, I first joined the team of DELNA raters at the University of Auckland and a
little later became an IELTS accredited rater. Because I was relatively inexperi-
enced at rating writing at that time, I often found that the descriptors provided me
with very little guidance. On what basis was I meant to, for example, decide that a
student uses cohesive devices ‘appropriately’ rather than ‘adequately’ or that the
style of a writing script ‘is not appropriate to the task’ rather than displaying ‘no
apparent understanding of style’? And what exactly should I look for when as-
sessing the style of a writing script? This lack of guidance by the rating scale of-
ten forced me to return to a more holistic form of marking where the choice of the
different analytic categories was mostly informed by my first impression of a
writing script.

Although I thought that my inexperience with rating writing was the cause of my
difficulties, I also realised during rater training sessions that I was not the only
one experiencing problems. We would often spend endless time discussing why a
certain script should be awarded a seven instead of a six, only to be told that the
benchmark raters had given it a seven, and even though the rater trainer did not
seem to entirely agree with this mark, that was what we would have to accept. At
other times the rater trainers told us to rely on our ‘gut feeling’ of the level of a
script. If we felt it was, for example, a six overall, we should rely on that and rate
accordingly. I often felt that this was not a legitimate way to rate and that impor-
tant information might be lost in this process.

I also felt uncomfortable with rating scales mixing different aspects of writing into
one level descriptor. For example, vocabulary and spelling might be described in
one descriptor, or grammatical range and accuracy might be grouped together.

But what happens if a writer is at different developmental levels in the two traits?
Should the rater prioritize one trait or average the scores on the two?

During those early years as an IELTS and DELNA rater I was not aware of the
differences between diagnostic assessment and proficiency assessment. A number
of raters would, like me, rate both these types of assessment, often in the same
week. Although DELNA and IELTS use slightly different rating scales, both
scales are very similar in terms of the types of features they display on the de-
scriptor level. The rater training is also conducted in very similar fashion. Only in
very recent times have I become aware of the fact that diagnostic assessment is
quite different from other types of assessment. One important feature of diagnos-
tic assessment is the detailed feedback that is provided to candidates. Therefore,
relying on one’s ‘gut feeling’ when rating might cause potentially important in-
formation to be lost.

1.3 Statement of the problem

Diagnostic assessment is an under-researched area of language assessment. It is
therefore not clear if the diagnostic assessment of writing requires a different type
of rating scale to those used in performance or proficiency testing. It is further-
more not clear if the rater training for diagnostic writing assessment should be
conducted differently1.

In 2005, Alderson published a book devoted to diagnostic assessment. In this
book, he argues that diagnostic tests are often confused with placement or profi-
ciency tests. In the introductory chapter, he lists several specific features which,
according to various authors, distinguish diagnostic tests from placement or profi-
ciency tests. Among these, he writes that diagnostic tests should be designed to
identify strengths and weaknesses in the learner’s knowledge and use of language,
that diagnostic tests usually focus on specific rather than global abilities, and that
diagnostic tests should be designed to provide feedback which students can act
upon.

Later in the book, Alderson (2005) describes the use of indirect tests (in this case
the DIALANG2 test) of writing rather than the use of performance tests (such as
the writing test in DELNA). However, indirect tests of writing are used less and
less in this era of performance testing and therefore an argument can easily be
made that diagnostic tests of writing should be direct rather than indirect.

The question, however, is how direct diagnostic tests of writing should differ from
proficiency or placement tests. One central aspect in the performance assessment
of writing is the rating scale. McNamara (2002) and Turner (2000), for example,

have argued that the rating scale (and the way raters interpret the rating scale)
represents the de-facto test construct. It should therefore not be assumed that rat-
ing scales used in proficiency or placement testing function validly and reliably in
a diagnostic context.

Existing rating scales of writing used in proficiency or placement tests have also
been subject to some criticism. It has, for example, been claimed that they are
often developed intuitively which means that they are either adapted from already
existing scales or they are based purely on what developers think might be com-
mon features of writing at various proficiency levels (Brindley, 1991; Fulcher,
1996a, 2003; North, 1995). Furthermore, Brindley (1998) and other authors have
pointed out that the criteria often use impressionistic terminology which is open to
subjective interpretations (Mickan, 2003; Upshur & Turner, 1995; Watson Todd,
Thienpermpool, & Keyuravong, 2004). The band levels have furthermore been
criticized for often using relativistic wording as well as adjectives and intensifiers
to differentiate between levels (Mickan, 2003).
There is also a growing body of research that indicates that raters often experience
problems when using these rating scales. Claire (2002, cited in Mickan, 2003), for
example, reported that raters regularly debate the criteria in moderation sessions
and describe problems with applying descriptors which make use of adjectives
like ‘appropriate’ or ‘sufficient’. Similarly, Smith (2000), who conducted think-
aloud protocols of raters marking writing scripts noted that raters had ‘difficulty
interpreting and applying some of the relativistic terminology used to describe
performances’ (p. 186).

The problems with existing rating scales described above might affect the raters’
ability to make fine-grained distinctions between different traits on a rating scale.
This might result in important diagnostic information being lost. Similarly, if rat-
ers resort to letting an overall, global impression guide their ratings, even when
using an analytic rating scale, the resulting scoring profile would be less useful to
candidates. It is therefore doubtful whether existing rating scales are suitable for a
diagnostic context.

1.4 Purpose of this study

The purpose of this study was therefore to establish whether an empirically-developed
rating scale for writing assessment with more detailed band descriptors
would result in more valid and reliable ratings for a diagnostic context than the
pre-existing more traditional rating scale described earlier in this chapter.

The study was conducted in two main phases. During the first phase, the analysis
phase, over six hundred DELNA writing scripts at different proficiency levels

were analysed using a range of discourse analytic measures. These discourse ana-
lytic measures were selected because they were able to distinguish between writ-
ing scripts at different proficiency levels and because they represented a range of
aspects of writing. Based on the findings in Phase 1, a new rating scale was de-
veloped.
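As a purely illustrative aside (not the procedure used in the study, whose measures
are detailed in Chapters 4 and 5), the following minimal Python sketch shows the
general kind of discourse analytic measure that can be computed directly from a
script: a rough fluency index (words per sentence) and a rough lexical variety index
(type-token ratio). Both measures here are hypothetical examples, not the ones
selected for Phase 1.

import re

def discourse_measures(script: str) -> dict:
    """Toy examples of discourse analytic measures: average sentence length
    and type-token ratio. Illustrative only; the study's actual measures are
    described in Chapters 4 and 5."""
    sentences = [s for s in re.split(r"[.!?]+", script) if s.strip()]
    words = re.findall(r"[A-Za-z']+", script.lower())
    return {
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
    }

sample = "Universities assess incoming students. The assessment identifies students at risk."
print(discourse_measures(sample))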

During the second phase of this study, the validation phase, ten raters rated one
hundred pre-selected writing scripts using first the existing descriptors and then
the new rating scale. After these two rating rounds, the raters completed a ques-
tionnaire designed to elicit their opinions about the efficacy of the new scale. De-
tailed interviews were conducted with seven of the ten raters. The purpose of this
phase was not only to establish the reliability and validity of the two scales based
on the rating data, but also to elicit the raters’ opinions of the efficacy of the two
scales.

1.5 Research Questions

The study has one overarching research question:

To what extent is a theoretically-based and empirically-developed rating scale of
academic writing more valid for diagnostic writing assessment than an existing,
intuitively developed rating scale?

Because the overarching research question is broad, three more specific questions
were formulated to guide the data collection and analysis:
Research question 1 (Phase 1): Which discourse analytic measures are successful
in distinguishing between writing samples at different DELNA writing levels?

Research question 2a (Phase 2): Do the ratings produced using the two rating
scales differ in terms of (a) the discrimination between candidates, (b) rater spread
and agreement, (c) variability in the ratings, (d) rating scale properties and (e)
what the different traits measure?

Research question 2b (Phase 2): What are raters’ perceptions of the two different
rating scales for writing?

1.6 Outline of the book

This book is organised into eleven chapters. Chapter 1, this chapter, provides an
overview of the research and its purpose. Chapters 2 to 4 provide a review of the
relevant literature. Chapter 2 gives a general introduction to performance assessment
of writing, in particular diagnostic assessment. The chapter goes on to dis-
cuss models of performance assessment of writing and how these could be rele-
vant to diagnostic assessment of writing. Specifically, the influence of the rater,
the task and the test taker on the outcome of an assessment is described. Chapter 3
reviews the literature on rating scales, which is the main focus of this study. As
part of this chapter, possible design features of rating scales for diagnostic writing
assessment are considered. The final chapter of the literature review, Chapter 4,
first considers what constructs should be assessed in a diagnostic assessment of
writing and then reviews discourse analytic measures for each of these constructs.
Chapters 5 to 7 contain the methodology, results and discussion chapters of Phase
1 of the study, the development of the rating scale. Chapter 5, the method chapter,
provides a detailed description of the context of the study and an outline of the
methodology used. This chapter also contains an account of the pilot study. Chap-
ter 6 presents the results of Phase 1, the analysis of the writing scripts. The results
are discussed in the following chapter, Chapter 7. Here, the development of the
pilot scale is described and the different trait scales are presented. The following
three chapters present the methodology (Chapter 8), results (Chapter 9) and dis-
cussion (Chapter 10) of Phase 2 of this study, the validation of the rating scale.
Chapter 9 is divided into two sections, one providing the results of the quantitative
analysis of the rating scores and the other presenting the results from the ques-
tionnaires and interviews. Chapter 10 then draws these results together and dis-
cusses the overarching research question. Chapter 11, the concluding chapter,
summarises the study as a whole and discusses the implications of the study both
at a practical and theoretical level. Suggestions for further research are offered
and limitations of the study are identified.

----
Notes:
1. Although not the focus of this study, the writing tasks used in diagnostic assessment might also
be different to those in proficiency tests of writing.
2. DIALANG is a diagnostic language test for 14 European languages based on the ‘Common
European Framework of Reference’.

Chapter 2: Performance Assessment of Writing

2.1 Introduction

The aim of this chapter is to describe performance assessment of writing, in
particular issues surrounding the diagnostic assessment of writing, which provide the
context of this study. First, performance assessment is situated in the historical
development of writing assessment, and some of the current trends in writing per-
formance assessment are discussed. Following this, diagnostic assessment, a type
of assessment which has received very little attention in the performance assess-
ment literature, is described. Because a number of aspects can influence the score
awarded to a writing performance (e.g. the rater, the task, the test taker and the
rating scale), models of performance assessment are reviewed and then research
on each of these aspects is described. The potential relevance of each of these fac-
tors to diagnostic assessment is considered. Finally, grading using the computer is
described and its relevance to diagnostic assessment is evaluated.

2.2 Historical development of writing assessment

Writing assessment, according to Hamp-Lyons (2001), dates back as far as the
Chou period in China (1111-771 B.C.). Even then, multiple raters were used to
ensure the reliability of a method of selecting officials. Direct writing assessment,
which tests writing by sampling actual examples of writing, was also practiced in
Europe at a time when the colonial powers needed an increasing number of liter-
ate administrators in countries around the world. In the United States, Harvard
University replaced the oral entrance exam with a written one in the late 1800s.
Both in Europe and the United States, there was a call for an increased level of
standardisation after these changes, which initiated an interest in statistics and
measurement theory as well as an interest in finding ways to measure true ability
(see for example Edgeworth, 1888).

Until the 1950s, writing assessment was mainly undertaken by individual teachers
in the context of their classes. However, with an increase in the number of univer-
sity enrolments came a greater demand for reliability. In response to this demand,
psychometricians developed indirect writing assessments (Grabe & Kaplan,
1996), which evaluate students’ knowledge of writing by using discrete test items
that assess knowledge of particular linguistic features, such as grammatical
choices or errors or even more specific writing behaviours such as spelling or
punctuation (Cumming, 1997). In these discrete-point tests, reliability issues were
seen as more important than questions of validity.

A very influential test that used multiple-choice components to measure writing
was the Test of Standard Written English (TSWE) developed by the Educational
Testing Service (ETS) for English first language writers. This test was part of a
common pre-university assessment measure in the United States (Grabe & Kap-
lan, 1996).

During the late 70s and early 80s, direct assessment of writing (or performance
assessment of writing) became standard practice in English L1 (English as a first
language) contexts and was also widely adopted by L2 (English as a second lan-
guage) teachers who favoured testing students on meaningful, communicative
tasks (e.g. letter writing). With this shift back to the direct assessment of writing,
the problems regarding content and construct validity were addressed. However a
whole range of concerns regarding the methods of collecting and evaluating writ-
ing samples as true indicators of writing ability were raised (Grabe & Kaplan,
1996). Therefore, research since that time has focussed on a number of validity
issues, especially on improved procedures for obtaining valid writing samples
(taking into account the reader, task type, rater background, rater training and the
type of rating scale used).

2.3 Current practices

In the 1980s, the skills and components model of the 1970s came under criticism
and a broadened view of language proficiency based on communicative compe-
tence was proposed by Canale and Swain (1980)1. Since then the testing of writ-
ing has commonly taken the following form: students write a brief (30-45 minute)
essay (Cumming, 1997, p.53) which is then rated either holistically or analytically
(for a description of these terms refer to Chapter 3) by trained raters using a rating
scale.

At this point it is important to investigate some major standardized writing
assessments around the world to gain more insight into current practices. One com-
monly administered writing test is the writing component of the International
English Language Testing System (IELTS), which was developed jointly by the
British Council and the University of Cambridge Local Examinations Syndicate
(UCLES) and is now administered around the world in conjunction with IDP (In-
ternational Development Program) Australia. The IELTS (academic) writing
component includes two writing tasks, one requiring the test taker to describe in-
formation given in a graph or table and one slightly longer argumentative essay.
Both essays are written in 60 minutes. Although the IELTS test has the advantage
of testing the writer on two writing tasks, it is marked by only one trained rater,
which might lower its reliability.

One of the largest direct tests of writing is administered by the Educational Test-
ing Service (ETS) as part of the TOEFL iBT (Test of English as a Foreign Lan-
guage internet based test) test battery. Students produce two pieces of writing, one
independent writing task and one integrated task (which requires test takers to
write texts based on listening or reading input). The integrated task has a time
limit of 20 minutes, whilst the independent task has a time limit of 30 minutes.
Both tasks are evaluated by two trained raters (and a third rater in case of discrep-
ancies). The TOEFL iBT has undergone extensive validity and reliability checks
which have often directly contributed to changes in rater training, topic compari-
son, essay scoring and prompt development. Both the TOEFL iBT and IELTS are
currently administered around the world and are often used as gate-keeping ex-
aminations for university entrance and immigration.

Whilst the two writing tests described above are considered to be proficiency tests
as they are designed to assess general writing ability, writing assessments for
other purposes are also administered around the world. Students are, for example,
often required to write an essay which is then used for placement purposes. Their
result might determine which course or class at a certain institution would be the
most appropriate for the students concerned. Achievement tests are often adminis-
tered at the end of a writing course to determine the progress that students have
made whilst taking the course. Finally, diagnostic writing tests might be adminis-
tered to identify the strengths and weaknesses in candidates’ writing ability. Be-
cause diagnostic assessment is the focus of this study, the following section fo-
cuses entirely on this type of test.

2.3.1 Diagnostic Assessment

Diagnostic tests are frequently distinguished from proficiency, placement and
achievement tests in the language testing literature. In the Dictionary of Language
Testing (Davies et al., 1999), the following definition for diagnostic tests can be
found:

Used to identify test takers’ strengths and weaknesses, by testing what
they know or do not know in a language, or what skills they have or do not
have. Information obtained from such tests is useful at the beginning of a
language course, for example, for placement purposes (assigning students
to appropriate classes), for selection (deciding which students to admit to a
particular course), for planning of courses of instruction or for identifying
areas where remedial instruction is necessary. It is common for educa-
tional institutions (e.g. universities) to administer diagnostic language tests
to incoming students, in order to establish whether or not they need or
would benefit from support in the language of instruction used. Relatively

few tests are designed specifically for diagnostic purposes. A frequent al-
ternative is to use achievement or proficiency tests (which typically pro-
vide only very general information), because it is difficult and time-
consuming to construct a test which provides detailed diagnostic informa-
tion. (p. 43)

Despite repeated calls by Spolsky in the 1980s and 1990s (e.g. Spolsky, 1981;
1992), Alderson (2005) argues that very little research has looked at diagnostic
assessment. He points out, in the most detailed discussion of diagnostic assess-
ment to date, that diagnostic tests are frequently confused with placement tests.
He also disapproves of the fact that a number of definitions of diagnostic tests
claim that achievement and proficiency tests can be used for diagnostic purposes.
He also criticizes Bachman’s (1990) considerations of what the content of a diag-
nostic test should look like:

When we speak of a diagnostic test... we are generally referring to a test
that has been designed and developed specifically to provide detailed in-
formation about the specific content domains that are covered in a given
program or that are part of a general theory of language proficiency. Thus,
diagnostic tests may be either theory- or syllabus-based. (p.60)

Alderson (2005) argues that the former test type in Bachman’s description is gen-
erally regarded as an achievement test and the latter as a proficiency test. There-
fore, he argues that there are no specifications in the literature of what the content
of diagnostic tests should look like.

Moussavi (2002), in his definition of diagnostic tests, argues that it is not the pur-
pose of the test so much that makes an assessment diagnostic, but rather the way
in which scores are analysed and used. Alderson (2005), however, argues that the
content of a diagnostic test needs to be more specific and focussed than that of
proficiency tests. Moreover, the profiles of performance that are produced as a
result of the test should contain very detailed information on the performance
across the different language aspects in question. He therefore believes that the
construct definition of a diagnostic test needs to be different from that of other
tests.
Summarizing the existing literature, he stresses:

(...) the language testing literature offers very little guidance on how diag-
nosis might appropriately be conducted, what content diagnostic tests
might have, what theoretical basis they might rest on, and how their use
might be validated. (p. 10)

After a detailed review of the existing, scarce, literature on diagnostic assessment
in second and foreign language assessment, he provides a series of features that
could distinguish diagnostic tests from other types of tests. These can be found
below.

1. Diagnostic tests are designed to identify strengths and weaknesses in a learner’s
knowledge and use of language.
2. Diagnostic tests are more likely to focus on weaknesses than on strengths.
3. Diagnostic tests should lead to remediation in further instruction.
4. Diagnostic tests should enable a detailed analysis and report of responses to
items or tasks.
5. Diagnostic tests thus give detailed feedback which can be acted upon.
6. Diagnostic tests provide immediate results, or results as little delayed as possi-
ble after test-taking.
7. Diagnostic tests are typically low-stakes or no-stakes.
8. Because diagnostic tests are not high-stakes, they can be expected to involve
little anxiety or other affective barriers to optimum performance.
9. Diagnostic tests are based on content which has been covered in instruction, or
which will be covered shortly, or diagnostic tests are based on some theory of lan-
guage development, preferably a detailed theory rather than a global theory.
10. Thus diagnostic tests need to be informed by SLA research, or more broadly
by applied linguistic theory as well as research.
11. Diagnostic tests are likely to be less ‘authentic’ than proficiency or other tests.
12. Diagnostic tests are more likely to be discrete-point than integrative, or more
focussed on specific elements than on global abilities.
13. Diagnostic tests are more likely to focus on language than on language skills.
14. Diagnostic tests are more likely to focus on ‘low-level’ language skills (like
phoneme discrimination in listening tests) than higher-order skills which are more
integrated.
15. Diagnostic tests of vocabulary knowledge and use are less likely to be useful
than diagnostic tests of grammatical knowledge and the ability to use that knowl-
edge in context.
16. Tests of detailed grammatical knowledge and use are difficult to construct be-
cause of the need to cover a range of contexts and to meet the demands of reliabil-
ity.
17. Diagnostic tests of language use skills like speaking, listening, reading and
writing are (said to be) easier to construct than tests of language knowledge and
use. Therefore the results of such tests may be interpretable for remediation or in-
struction.
18. Diagnostic testing is likely to be enhanced by being computer-based.

Alderson stresses, however, that this is a list of hypothetical features which need
to be reviewed and which he produced mainly to guide further thinking about this
much under-described area of assessment.

Alderson (2005) further points out that, whilst all definitions of diagnostic testing
emphasize feedback, there is no discussion of how scores should be reported. He
argues that feedback is probably one of the most crucial components of diagnostic
assessment. Merely reporting a test score without any detailed explanation is not
appropriate in the context of diagnostic assessment. He writes, ‘the essence of a
diagnostic test must be to provide meaningful information to users which they can
understand and upon which they or their teachers can act’ (p. 208). Also, he ar-
gues, this feedback should be as immediate as possible and not, as is often the
case for proficiency tests, two or more weeks after the test administration.

In his discussion of diagnostic testing of writing, however, Alderson focuses only
on indirect tests of writing, as he argues that these have been shown to be highly
correlated with direct tests of writing. Although he acknowledges that this justifi-
cation is becoming more dubious in this era of performance testing, he contends
that diagnostic tests seek to identify relevant components of writing ability and
therefore the justification for using indirect tests might be stronger. His book fo-
cuses mainly on the DIALANG, a computer-based diagnostic test of 14 European
languages financed by the Council of Europe.

Overall, Alderson’s (2005) review of the existing literature on diagnostic
assessment shows that very little work has been undertaken in this area. He concludes:

However, until such time as much more research is undertaken to enhance
our understanding of foreign language learning, we will probably be faced
with something of a bootstrapping operation. Only through the trial and
error of developing diagnostic instruments, based on both theory and ex-
perience of foreign language learning, are we likely to make progress in
understanding how to diagnose, and what to diagnose. (p. 25)

Although Alderson suggests the use of indirect tests of writing for diagnostic test-
ing, these tests, as mentioned earlier, lack face validity and have generally fallen
out of favour. However, if a direct test of writing (or performance test) is used for
diagnostic purposes, a number of possible sources of variation are introduced into
the test context. The following section reviews models of performance assessment
of writing to identify these potential sources of variation. Research on each source
is then reported and the findings are evaluated in terms of their relevance to diag-
nostic assessment.

2.4 Models of performance assessment of writing

Because performance assessment is generally acknowledged to be more subjective
than discrete-point testing, there is more room for unwanted variance in the
test score. This is well captured by Bachman et al. (1995) who noted that per-
formance testing brings with it ‘potential variability in tasks and rater judgements,
as sources of measurement error’ (p. 239). This has been recognized and the ef-
fects have been widely studied2. For example, research on test taker characteristics
has shown that learners from different language backgrounds are affected differently
by the use of different rating scales (e.g. Elder, 1995), and Sunderland (1995) has
shown ways in which gender bias might manifest itself. Other studies have inves-
tigated task characteristics, like task difficulty (e.g. Wigglesworth, 2000) and rater
characteristics, like rater background, severity, bias and decision-making proc-
esses (e.g. Cumming, Kantor, & Powers, 2002; McNamara, 1996; Song & Caruso,
1996).

Taking all the above-mentioned factors into account, McNamara (1996) devel-
oped a model which organises language testing research and accounts for factors
that contribute to the systematic variance of a performance test score. McNa-
mara’s model, which is based on an earlier model by Kenyon (1992), was devel-
oped in the context of oral assessment. It is however just as valid for written test
performance. For the purpose of this literature review it has been slightly adapted
to exclude any aspects relevant only to oral performance.

Figure 1: Factors influencing the score of a writing performance (based on McNamara, 1996)

The model (Figure 1 above) places performance in a central position. The arrows
indicate that it is influenced by several factors, including the tasks, which drive
the performance and the raters who judge the performance using rating scales and

criteria. The final score can therefore only be partly seen as a direct index of per-
formance. The performance is also influenced by other contextual factors like, for
example, the test taking conditions. The model also accounts for the candidate and
the way his or her underlying competence will influence the performance. It is
assumed that the candidate draws on these underlying competences in a straight-
forward manner.

Skehan (1998a) refined the Kenyon-McNamara model in two ways. Firstly, he
argued that tasks need to be analysed further to account for task characteristics
and task implementation conditions. Secondly, McNamara’s model does not ac-
count for what Skehan calls the dual-coding capacities of the learner. He argues
that ‘second language learners’ abilities require not simply an assessment of com-
petences, but also an assessment of ability for use’ (p. 171) because it is possible
that test takers leave certain competences unmobilized in an assessment context.
As with McNamara’s model above, Skehan’s model was developed for oral test
performance. However, it is also valid for written test performance. Figure 2 be-
low has been adapted to exclude aspects only relevant to oral performance.

Figure 2: Model of writing performance (based on Skehan, 1998)

Skehan (1998a) points out that it is not only important to understand the individ-
ual components that influence test performance, but that it is necessary to recog-
nize an interaction between these components. He argues that for example the rat-
ing scale, which is often seen as a neutral ruler, actually has a great deal of influ-
ence on variation in test scores. There is competition between processing goals
within a performance. As shown by Skehan and Foster (1997), fluency, accuracy
and complexity compete with each other for processing capacity. If the rating
scale emphasizes each of these areas, then the final writing score might be influ-

enced by the processing goals the test taker emphasized at the time. This might
further be influenced by a rater prioritizing certain areas of performance. Simi-
larly, certain task qualities and conditions might lead to an emphasis on one or
two of the above-mentioned processing goals.

Fulcher (2003) further revised the model to include more detailed descriptions of
various factors that influence the score of a written performance (Figure 3 below).
In the case of the raters, he acknowledges that rater training and rater characteris-
tics (or rater background as it is called by other authors) play a role in the score
awarded to a writer. Fulcher’s model shows the importance of the scoring phi-
losophy and the construct definition of the rating scale for the outcome of the rat-
ing process. He also indicates that there is an interaction between the rating scale
and a student’s performance which results in the score and any inferences that are
made about the test taker. Fulcher further acknowledges the importance of context
in test performance by including local performance conditions. Like Skehan, Ful-
cher includes aspects that influence the task. Among these are the task orientation,
goals, and topics, as well as any context-specific task characteristics or conditions.
Finally, Fulcher’s model shows a number of variables that influence the test taker.
These include any individual differences between candidates (like personality),
their actual ability in the constructs tested, their ability for real-time processing
and any task-specific knowledge or skills they might possess. Fulcher (2003) sees
this model as provisional and requiring further research.

Figure 3: Fulcher's (2003) expanded model of speaking test performance

The models discussed above were conceived in the context of proficiency testing.
Because this book addresses diagnostic assessment, it is important to review the

research on the different sources of score variation presented by the models and
evaluate how they might affect the scoring of a direct diagnostic test of writing.
Each of the four influences on performance in bold-face in Fulcher’s model above
will be discussed in turn in the remainder of this chapter. Research on tasks, test
takers and raters will be discussed in this chapter, whilst issues surrounding the
rating scale (being the main focus of this study) will be described in the following
chapter (Chapter 3). Because the task and the test taker are not as central as the
rater to the purpose of this study, these two issues will be described more briefly
in this chapter.

2.4.1 Tasks

Hamp-Lyons (1990) writes that the variables of the task component of a writing
test are those elements that can be manipulated and controlled to give test takers
the opportunity to produce their best performance. Amongst these she names the
length of time available to students to write, the mode of writing (if students write
by hand or use a word processor), the topic and the prompt. She argues that of the
variables named above, the topic variable is the most controversial. Some studies
have found no differences in student performance across tasks (e.g. Carlson,
Bridgeman, Camp, & Waanders, 1985) whilst others have found differences in
content quality and quantity due to topic variation (Freedman & Calfee, 1983;
Pollitt, Hutchinson, Entwhistle, & DeLuca, 1985). Hamp-Lyons argues, however,
that if there are no differences in performance found between different tasks, then
this can also be due to the fact that the scoring procedure and the raters influence
the score and any differences are lessened as a result of these factors.

A large number of studies have been undertaken to investigate the impact of task
variability in oral language. Based on Skehan’s (1998a) model (see Figure 2),
Wigglesworth (2000), for example, divided the sources of error surrounding the
task into two groups. Firstly, there are the task characteristics, which include fea-
tures internal to the task such as structure, cognitive load or familiarity of content.
Secondly, there are the task conditions like planning time, or native speaker/non-
native speaker interlocutor in the case of a speaking test. In her study, Wig-
glesworth manipulated two task characteristics and two task conditions to see how
these affected task difficulty. She found that generally more structure made the
task more difficult. Her results for familiarity were mixed and therefore inconclu-
sive. The task conditions influenced the results in the following manner: a native
speaker interlocutor made a task easier and planning time did not improve the
results. Yuan and Ellis (2003), however, found that pre-task planning resulted in
improved lexical and grammatical complexity and an increase in fluency, and that
online planning improved accuracy and grammatical complexity. It is important to
note that all these studies were carried out in the context of speaking and it is not

clear if the results can be transferred to writing. In a similar study, again in the
context of speaking but not in a testing context, Skehan (2001) investigated the
effect of a number of task characteristics on complexity, accuracy and fluency. A
summary table of his results can be seen in Table 1 below.

Table 1: Summary of the effects of task characteristics on complexity, accuracy and fluency

Task characteristic          Accuracy    Complexity                     Fluency
Familiarity of information   No effect   No effect                      Slightly greater
Dialogic vs. monologic       Greater     Slightly greater               Lower
Degree of structure          No effect   No effect                      Greater
Complexity of outcome        No effect   Greater                        No effect
Transformations              No effect   Planned condition generates    No effect
                                         greater complexity

Relatively little research on task effects has been undertaken in the context of
writing assessment. Studies investigating whether different task prompts elicit
language which is different in quantity and quality have resulted in mixed find-
ings. For example, whilst Quellmalz, Capell and Chou (1982) found that the type
of task did not significantly influence writing quality, Brown, Hilgers and
Marsella (1991) were able to show that both prompts and the type of topic re-
sulted in a significant difference between ratings based on a holistic scale.
O’Loughlin and Wigglesworth (2003) pointed out, however, that most studies in-
vestigating task variables have used ratings as the basis of their investigations and
have not looked at the actual discourse produced. One exception is a study by
Wigglesworth (1999, cited in O’Loughlin and Wigglesworth, 2003) in which she
investigated the effects of different tasks on both the ratings and the discourse
produced. She was able to show that the candidates produced more complex, less
accurate language when writing on the report task, and less complex but more ac-
curate language when responding to the recount tasks.
A more recent study that examined task characteristics was undertaken by
O’Loughlin and Wigglesworth (2003) in the context of the IELTS writing task.
The authors examined how quantity and manner of presentation of information in
the Academic Writing Task 1 affected the candidates’ writing. They found that
students wrote more complex texts if the task included less information, except in

the case of students with very high proficiency, who wrote more complex texts if
more information was given to them.
Ellis and Yuan (2004) investigated the influence of the task characteristic plan-
ning on written output. They found that pre-task planning impacted positively on
fluency and complexity, whilst online planning increased accuracy (i.e. the results
were very similar to those reported by Yuan and Ellis (2003) for oral perform-
ance).

So what is the significance of these findings for diagnostic assessment? Although
Hamp-Lyons (1990) urges us to manipulate task components so that test takers are
provided with an opportunity to produce their best performance, it could be argued
that in the context of diagnostic assessment it might be more relevant to manipulate
task variables in such a way that allows us to (a) collect as much diagnostic
information as possible and (b) collect diagnostic information that is representative
of the type of performance learners will be required to produce in the real-life
context. Therefore, the learners’ ‘best’ performance might not always be the most
appropriate as it might not be representative of what learners can achieve, for
example, in an examination at university. Test developers should further be aware
that if certain tasks or prompt types elicit more accurate or more complex
performances from learners, the same type of performance might not be achieved
on a different type of task in the TLU (target language use) situation and therefore
any diagnostic feedback provided to stakeholders might not reflect what the
candidate is able to achieve in the TLU domain. It could therefore be argued that
for diagnostic purposes, it might be more useful to administer more than one
writing task or different tasks for candidates from different disciplines (e.g. at a
university).

Some research findings suggest that the differences in performance resulting from
different task characteristics are too fine to be measured by ratings (see
O’Loughlin and Wigglesworth, 2003 above). However, Alderson (2005) argues
that diagnostic assessment should focus on specific rather than global abilities and
therefore these differences might be more salient in a diagnostic context.

2.4.2 Test takers

Test takers vary not only in their linguistic skills but also in their cultural back-
ground, writing proficiency, knowledge, ideas, emotions, opinions (Kroll, 1998),
language background, socio-economic status, cultural integration (Hamp-Lyons,
1990), personality and learning style (Hamp-Lyons, 2003). Because of this, writ-
ers’ performance varies from occasion to occasion. Hamp-Lyons (1990) points
out that for this reason researchers (see for example A. Wilkinson, 1983) have
criticized models of writing development for failing to account for affective fac-

tors, and for focussing only on descriptions of linguistic skills and cognitive abili-
ties. This view is supported by Porter (1991) who found a number of affective
variables to influence the test score awarded to a student in the context of an oral
assessment. Hamp-Lyons (2003) also found that test takers bring certain expecta-
tions to the test which are usually based on their life experiences up to that point.
It is therefore important to make sure that test takers receive as much background
information about the test as possible.

The findings reported are significant for diagnostic assessment. Firstly, Alderson
(2005) noted that diagnostic tests are usually low-stakes or no stakes and that
therefore little anxiety or few affective barriers arise on the part of the test taker.
However, it is important to investigate the TLU situation for which the diagnosis
is undertaken. For example, if a diagnostic test is designed to provide test takers
with detailed instruction to help them with an essay that will be written in a very
high-stakes testing context, then the data elicited in the context of a low-stakes
diagnostic assessment might not be representative of what the learner would be
able to achieve in the more pressured, high-stakes context. Secondly, it is possible
that although students’ extrinsic motivation might be lower in the context of a
low-stakes diagnostic test, their intrinsic motivation might be increased because
they are aware that they will receive valuable feedback on their writing ability.

2.4.3 Raters

This section on the rater first addresses the different ways in which raters can
vary. This is then followed by a discussion of research which has investigated the
reasons why raters differ in their ratings. The third part describes the most com-
mon way of dealing with rater variability: rater training. Finally, the research is
discussed in terms of its implications for diagnostic assessment.

2.4.3.1 Rater variation

A number of different studies have identified a variety of ways in which raters can
vary (McNamara, 1996; Myford & Wolfe, 2003, 2004).

The first possible rater effect is the severity effect. In this case, raters are found to
consistently rate either too harshly or too leniently if compared to other raters or
established benchmark ratings. The second rater effect is the halo effect. The halo
effect occurs when raters fail to discriminate between a number of conceptually
distinct traits, but rather rate a candidate’s performance on the basis of a general,
overall impression. The third rater effect described in the literature is the central
tendency effect. Landy and Farr (1983) described this effect as ‘the avoidance of

extreme (favourable or unfavourable) ratings or a preponderance of ratings at or
near the scale midpoint’ (p.63). The fourth rater effect is inconsistency or what
Myford and Wolfe (2003) term randomness. Inconsistency is defined as a ten-
dency of a rater to apply one or more rating scale categories in a way that is in-
consistent with the way in which other raters apply the same scale. This rater will
display more random variation than can be expected. The fifth rater effect is the
bias effect. When exhibiting this effect, raters tend to rate unusually harshly or
leniently on one aspect of the rating situation. For example, they might favour a
certain group of test takers or they might always rate too harshly or leniently on
one category of the rating scale in use. All these rater effects can be displayed ei-
ther by individual raters or a whole group of raters.
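
To make these effects concrete, the following minimal sketch (illustrative only, not drawn from any study cited here; the data and indices are hypothetical) shows how simple descriptive indicators of severity, halo and central tendency could be computed from a raters-by-candidates-by-traits score array. Operational analyses would normally rely on multi-faceted Rasch measurement rather than raw descriptive statistics.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scores: 5 raters x 40 candidates x 4 traits on a 1-6 scale.
ratings = rng.integers(1, 7, size=(5, 40, 4)).astype(float)

# Candidate-by-trait consensus across raters.
consensus = ratings.mean(axis=0)

# Severity: mean signed deviation from the consensus; negative values
# indicate a consistently harsh rater, positive values a lenient one.
severity = (ratings - consensus).mean(axis=(1, 2))

# Halo: mean correlation between the trait scores a rater awards;
# values close to 1 suggest the traits are not being distinguished.
def halo_index(rater_scores):            # shape: (candidates, traits)
    corr = np.corrcoef(rater_scores.T)   # trait-by-trait correlations
    return corr[~np.eye(corr.shape[0], dtype=bool)].mean()

halo = np.array([halo_index(r) for r in ratings])

# Central tendency: a spread ratio well below 1 means the rater's scores
# cluster more tightly (typically around the scale midpoint) than the group's.
spread_ratio = ratings.std(axis=(1, 2)) / ratings.std()

for i, (s, h, c) in enumerate(zip(severity, halo, spread_ratio)):
    print(f"Rater {i}: severity={s:+.2f}  halo={h:.2f}  spread ratio={c:.2f}")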

2.4.3.2 Rater background

How the raters of a writing product interpret their role, the task, and the scoring
procedures constitutes one source of variance in writing assessment. Several re-
searchers (e.g. Hamp-Lyons, 2003) have shown that apart from free variance
(variance which cannot be systematically explained), raters differ in their deci-
sion-making because of their personal background, professional training, work
experience and rating background, and that this influences their performance. Differ-
ences have, for example, been found in the way ESL trained teachers (teachers
specifically trained to teach ESL students) and English Faculty staff (who have no
specific ESL training) rate essays (O'Loughlin, 1993; Song & Caruso, 1996;
Sweedler-Brown, 1993). Song and Caruso (1996), for example, found that English
Faculty staff seemed to give greater weight to overall content and quality of rhe-
torical features than they did to language.

Cumming (1990) compared the decision-making processes of expert and novice
raters and found that expert raters paid more attention to higher order aspects of
writing whilst novice raters focussed more on lower order aspects and used on-
line corrections of texts to help them arrive at a final decision. Other studies com-
pared the rating behavior of native speaker and non-native speaker raters. Among
the findings were that native speakers are stricter than non-native speakers
(Barnwell, 1989; Hill, 1997). Native speakers were found to adhere more strictly
to the rating scale, whilst non-native speakers are more influenced by their intui-
tions (A. Brown, 1995).

Rater occupation also seems to influence rating. Brown (1995), in the context of
oral performance, found that ESL teachers rate grammar, expression, vocabulary
and fluency more harshly than tour guides. Elder (1993), also in an oral context,
compared ESL teachers with mathematics and science teachers. She found that
ESL raters focus more on language components. There was little agreement be-
tween the two groups on accuracy and comprehension and most agreement on in-
teraction and communicative effectiveness. Finally, raters seem to be as much in-
fluenced by their own cultural background as they are by the students’ (Connor-
Linton, 1995; Kobayashi & Rinnert, 1996) and by more superficial aspects of the
writing script like handwriting (A. Brown, 2003; Milanovic, Saville, & Shen,
1996; S. D. Shaw, 2003; Vaughan, 1991).

2.4.3.3 Rater training

Rater training has become common practice in large-scale writing assessment.
Weigle (2002) and Alderson, Clapham and Wall (1995) show that a common set
of procedures is used for rater training which might be adapted according to cir-
cumstances. Firstly, a team of experienced raters identifies sample (benchmark)
scripts which represent the different points on the scale or typical problem areas.
During the rater training session, the first few scripts are usually given to the
group of raters in combination with the marks. Then the raters often set out to rate
in groups and the final step is individual rating. After each script is rated, the rele-
vant features of the criteria are discussed. It is usually made clear to the raters that
some variation is acceptable, but raters who consistently rate too high or too low
should adjust their standard. Weigle (2002) and other authors (e.g. Congdon &
McQueen, 2000; Lumley & McNamara, 1995) also suggest regular re-
standardisation sessions.

Rater training has been shown to be effective. For example, Weigle (1994a;
1994b) was able to show that rater training is able to increase self-consistency of
individual raters by reducing random error, to reduce extreme differences between
raters in terms of leniency and harshness, to clarify understanding of the rating
criteria and to modify rater expectations in terms of both the characteristics of the
writers and the demands of the writing tasks.

To specifically address the problem of raters showing a particular pattern of
harshness or leniency in regard to a sub-section of the rating scale (e.g. fluency or
grammar), Wigglesworth (1993) trialled the approach of giving raters feedback on
their rating patterns through performance reports based on bias analysis. Bias
analysis is part of the output provided by the computer program FACETS
(Linacre, 2006; Linacre & Wright, 1993) which computes statistical analyses
based on multi-faceted Rasch measurement (Linacre, 1989, 1994, 2006; Linacre
& Wright, 1993). Bias analysis provides the opportunity to investigate how a cer-
tain aspect of the rating situation might elicit a recurrent biased pattern on the part
of the rater. Stahl and Lunz (1992) first used this technique in a judge-mediated
examination of histotechnology. Wigglesworth (1993), in the context of an oral
test (both direct and semi-direct), found that raters displayed different behaviours
when rating the tape and the live version of the interview and that in general, they
seemed to be able to incorporate the feedback into their subsequent rating ses-
sions, so that ‘in many cases, bias previously evident in various aspects of their
ratings is reduced’ (p. 318). A follow-up study by Lunt, Morton and Wig-
glesworth (1994) however failed to confirm any significant changes in the pattern
of rating after giving feedback in this way. A more recent study by Elder, Knoch,
Barkhuizen and von Randow (2005), conducted in the context of writing assess-
ment, found that although for the whole group of raters the feedback resulted in
improved rating behaviour, some raters were more receptive to this type of train-
ing than others.
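
For orientation, the measurement model underlying FACETS can be written as follows; this is a standard formulation of the many-facet Rasch model (after Linacre), not a formula reproduced from the studies discussed above:

    \log \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k

where B_n is the ability of candidate n, D_i the difficulty of criterion (or task) i, C_j the severity of rater j, and F_k the difficulty of scale step k relative to step k-1. Bias analysis adds an interaction term for a chosen pairing of facets (for example rater by criterion) and flags those pairings whose estimates depart significantly from zero.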

Most studies of rater training have shown that differences in judge-severity persist
and in some cases can account for as much as 35% of variance in students’ written
performance (Cason & Cason, 1984). Raw scores, therefore, cannot be consid-
ered a reliable guide to candidate ability (McNamara, 1996) and double or multi-
ple rating is often recommended. In addition, in large-scale testing contexts, it
may also be necessary to use statistical programs which adjust for differences be-
tween individual raters on the basis of their known patterns of behaviour.

So how can these research findings contribute to diagnostic assessment? First,
rater variation needs not only to be minimized for the overall writing score, it also
needs to be minimized across all traits on the rating scale. This is because scores
in the context of diagnostic assessment should not be averaged, but rather reported
back to stakeholders individually. This would ensure that the diagnostic informa-
tion is as accurate and informative as possible. Secondly, as the background of
raters has an influence on the ratings, it is important that raters are trained and
monitored so that their background does not lead them to emphasize certain traits
in the rating scale over others; such an emphasis would result in a distorted feedback profile.
The feedback needs to be as unbiased as possible across all traits in the scale. It
was also reported by Brown (1995) that native speaker raters seem to adhere more
closely to the rating scale than non-native speaker raters who have been found to
be more influenced by their intuitions. This might suggest that native speaker rat-
ers are more desirable in the context of diagnostic assessment, as rating based on
intuitions might result in a halo effect (i.e. very similar ratings across different
traits) which leads to a loss of diagnostic information. Thirdly, it is important that,
as part of rater training, regular bias analyses are conducted, so that raters display-
ing a bias towards a certain trait in the rating scale are identified and retrained.

2.5 Alternatives to raters: Grading using the computer

In the following section, research about automated ratings using the computer is
reviewed and the relevance of these programs to diagnostic assessment is dis-
cussed.

The difficulty of obtaining consistently high reliability in the ratings of human
judges has resulted in research in the field of automated essay scoring. This re-
search began as early as the 1960s (Shermis & Burstein, 2003). Several computer
programs have been designed to help with the automated scoring of essays.

The first program, called ‘Exrater’ (Corbel, 1995), is a knowledge-based system
which was designed with the sole purpose of assisting raters in the rating process.
Exrater does not attempt to identify a candidate’s level by computer-mediated
questions and answers, but rather presents the categories of the rating scale, so
that the rater can choose the most appropriate. It also does not present the full de-
scription, but rather only shows the most important statements and keywords
which are underlined. The aim is to avoid distracting raters by having them fo-
cus on only one category at a time rather than on the whole rating scale. Corbel identi-
fies a number of potential problems with the program. Firstly, he predicts a halo
effect because raters might select most descriptors at the same level without
checking the more detailed descriptions which are also accessible at the click of a
button. Secondly, he argues that there might be a lack of uptake due to the un-
availability of computers when rating. Overall it can be argued that Exrater is a
helpful tool to assist raters, but it still requires the rater to perform the entire rating
process and make all decisions. Because of the risk of a halo effect, Exrater is
probably not suitable for diagnostic assessment purposes.

In the past few years, a number of computer programs have become available
which completely replace human raters. This advance has been made possible by
developments in Natural Language Processing (NLP). NLP uses tools such as
syntactic parsers which analyse discourse structure and organisation, and lexical
similarity measures which analyze the word use of a text. There are some general
advantages to automated assessment. It is generally understood to be cost effec-
tive, highly consistent, objective and impartial. However, sceptics of NLP argue
that these computer techniques are not able to evaluate communicative writing
ability. Shaw (2004) reviews four automated essay assessment programs: Project
Essay Grader, the E-rater model, the Latent semantic analysis model and the text
categorisation model.

Project Essay Grader (Page, 1994) examines the linguistic features of an essay. It
makes use of multiple linear regression to ascertain an optimal combination of
weighted features that most accurately predict human markers’ ratings. This pro-
gram started its development in the 1960s. It was only a partial success as it ad-
dressed only indirect measures of writing and could not capture rhetorical, organ-
isational and stylistic features of writing.
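
The general approach can be illustrated with a short sketch: surface features are extracted from each essay and regressed onto human scores, and the fitted weights are then used to predict a score for a new essay. The features below are hypothetical stand-ins, not Page's actual proxy measures.

import numpy as np

def surface_features(text):
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return np.array([
        len(words),                                               # essay length in words
        np.mean([len(w) for w in words]) if words else 0.0,       # mean word length
        len(words) / max(len(sentences), 1),                      # mean sentence length
        len(set(w.lower() for w in words)) / max(len(words), 1),  # type-token ratio
    ])

def fit_weights(essays, human_scores):
    # Ordinary least squares: find weights that best predict the human scores.
    X = np.array([surface_features(e) for e in essays])
    X = np.column_stack([np.ones(len(X)), X])                     # intercept column
    coef, *_ = np.linalg.lstsq(X, np.array(human_scores, dtype=float), rcond=None)
    return coef

def predict(coef, essay):
    return float(coef @ np.concatenate([[1.0], surface_features(essay)]))

In practice such a model would be fitted on hundreds of human-scored essays and cross-validated, but the sketch shows why this approach captures only indirect, surface measures of writing.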

The second program evaluated by Shaw (2004) is Latent Semantic Analysis
(LSA). LSA is based on word co-occurrence statistics represented as a matrix,
which is “decomposed and then subjected to a dimensionality technique” (p.14).
This system looks beneath surface lexical content to quantify deeper content by
mapping words onto a matrix and then rates the essay on the basis of this matrix
and the relations in it. The LSA model is the basis of the Intelligent Essay Asses-
sor (Foltz, Laham, & Landauer, 2003). LSA has been found to be almost as reli-
able as human assessors, but as it does not account for syntactic information, it
can be tricked. It also cannot cope with certain features that are difficult for NLP
(e.g. negation).
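
The core idea can be sketched as follows (a simplified illustration of latent semantic analysis in general, not of the Intelligent Essay Assessor itself, using a hypothetical toy corpus): a term-by-document matrix is reduced with singular value decomposition and a new essay is scored by its similarity to previously scored essays in the reduced space.

import numpy as np

scored_essays = {  # hypothetical mini corpus: essay text -> human score
    "the experiment measured reaction time across three groups": 5,
    "reaction time was measured for each group in the experiment": 5,
    "i like my holiday it was fun and sunny": 2,
}
new_essay = "each group s reaction time was recorded in the experiment"

def term_doc_matrix(docs):
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            M[index[w], j] += 1
    return M

docs = list(scored_essays) + [new_essay]
M = term_doc_matrix(docs)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                                       # keep a low-rank "semantic" space
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T      # one row per document

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

new_vec = doc_vecs[-1]
sims = [cosine(new_vec, doc_vecs[i]) for i in range(len(scored_essays))]
best = int(np.argmax(sims))
print("Predicted score:", list(scored_essays.values())[best])

Because only word co-occurrence is modelled, word order and syntax play no role, which is why such a system can be tricked and struggles with features like negation.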

The third program, the Text Categorisation Technique Model (Larkey, 1998), uses
a combination of key words and linguistic features. In this model a text document
is grouped into one or more pre-existing categories based on its content. This
model has been shown to match the ratings of human examiners about 65% of the
time. Almost all ratings were within one grade point of the human ratings.
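
Figures of this kind are typically exact and adjacent agreement rates, which can be computed as in the following illustration (the score pairs are hypothetical):

# Illustrative only: exact and adjacent agreement between machine and human scores.
human =   [4, 3, 5, 2, 4, 3, 5, 4]
machine = [4, 3, 4, 2, 5, 3, 5, 2]

exact    = sum(h == m for h, m in zip(human, machine)) / len(human)
adjacent = sum(abs(h - m) <= 1 for h, m in zip(human, machine)) / len(human)

print(f"Exact agreement:    {exact:.0%}")    # identical scores
print(f"Adjacent agreement: {adjacent:.0%}") # within one grade point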

Finally, e-rater was developed by the Educational Testing Service (ETS) (Burstein
et al., 1998). It uses a combination of statistical and NLP techniques to extract
linguistic features. The programme compares essays at different levels in its data
base with features (e.g. sentence structure, organisation and vocabulary) found in
the current essay. Essays earning high scores are those with characteristics most
similar to the high-scoring essays in the data base and vice versa. Over one hun-
dred automatically extractable essay features and computerized algorithms are
used to extract values for every feature from each essay. Then, stepwise linear re-
gression is used to group features in order to optimize rating models. The content
of an essay is checked by vectors of weighted content words. An essay that re-
mains focussed is coherent as evidenced by use of discourse structures, good lexi-
cal resource and varied syntactic structure. E-rater has been evaluated by Burstein
et al. (1998) and has been found to have levels of agreement with human raters of
87 to 94 percent. E-rater is used operationally in GMAT (Graduate Management
Admission Test) as one of two raters and research is underway to establish the
feasibility of using e-rater operationally as second rater for the TOEFL iBT inde-
pendent writing samples (Jamieson, 2005; Weigle, Lu, & Baker, 2007).
Based on the E-rater technology, ETS has developed a programme called Crite-
rion. This programme is able to provide students with immediate feedback on
their writing ability, in the form of a holistic score, trait level scores and detailed
feedback.
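
One component of this approach, the comparison of weighted content-word vectors, can be sketched as follows; the data, stopword list and weighting are purely illustrative and do not reflect ETS's actual implementation.

from collections import Counter
import math

benchmarks = {   # score level -> benchmark essay text (hypothetical toy data)
    6: "the study shows a clear causal link because the evidence is consistent",
    3: "i think it is good because it is good and people like it",
}
new_essay = "the evidence shows a consistent causal link in the study"

STOPWORDS = {"the", "a", "is", "it", "i", "and", "in", "because"}

def content_vector(text):
    # Weight each content word by its relative frequency in the text.
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(u, v):
    common = set(u) & set(v)
    num = sum(u[w] * v[w] for w in common)
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

new_vec = content_vector(new_essay)
scores = {level: cosine(new_vec, content_vector(text)) for level, text in benchmarks.items()}
print(max(scores, key=scores.get))   # level with the most similar content profile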

There are several reasons why computerized rating of performance essays might
be useful for diagnostic assessment. The main advantage of computer grading
might be the quick, immediate feedback that this scoring method can provide
(Weigle et al., 2007). Alderson (2005) stressed that for diagnostic tests to be ef-
fective, the feedback should be immediate, a feature which his indirect test of
writing in the context of DIALANG is able to achieve. Performance assessment of
writing rated by human raters will inevitably mean a delay in score reporting. The
second advantage might be the internal consistency of such computer programs
(see for example the feedback provided by the Criterion programme developed by
ETS). However, research comparing human raters and the e-rater technology has shown
(1) that e-rater was not as sensitive to some aspects of writing as human raters
were when length was removed as a variable (Chodorow & Burstein, 2004), (2) that
human/human correlations were generally higher than human/e-rater correlations
(Weigle et al., 2007), and (3) that human raters fared better than automated scor-
ing systems when correlations were investigated of writing scores with grades,
instructor assessment of writing ability, independent rater assessment on disci-
pline-specific writing tasks and student self-assessment of writing (Powers,
Burstein, Chodorow, Fowles, & Kukich, 2000; Weigle et al., 2007).

There are also a number of concerns about using computerized essay rating.
Firstly, these ratings might not be practical in contexts where computers are not
readily available. Furthermore, it could be argued that writing is essentially a so-
cial act and that writing to a computer violates the social nature of writing. Simi-
larly, what counts as an error might vary across different sociolinguistic contexts
and therefore human raters might be more suitable to evaluate writing (Cheville,
2004). In addition, as diagnostic tests should provide feedback on a wide variety
of features of a learner’s performance, current rating programs are unable to
measure the same number of features as human raters. This means that automated
scoring programs might under-represent the writing construct. For example, the
programs reviewed above were not able to evaluate communicative writing ability
or more advanced features of syntactic complexity. Taking all the above into ac-
count, it can be argued that human raters should be able to provide more useful
information for diagnostic assessment.

2.6 Conclusion

This chapter has attempted to situate diagnostic assessment within the literature
on performance assessment of writing, and has reported research regarding the influence of a
number of variables on performance assessment. Because the focus
of this study is the rating scale, research relating to rating scales and rating scale
development, as well as considerations regarding the design of a rating scale for
diagnostic assessment, are discussed in the following chapter.

---
Notes:
1 For a more detailed discussion of this and later models refer to Chapter 3.
2 Most research cited in this section is based on studies conducted in the context of oral assessment. This research is equally relevant to writing assessment.

Chapter 3: RATING SCALES

3.1 Introduction
The aim of this study is to develop a rating scale that is valid for diagnostic as-
sessment. This chapter therefore begins with a definition of rating scales. To es-
tablish what options are available to a rating scale developer interested in develop-
ing a scale specific to diagnostic assessment, the chapter illustrates the different
options available during the development process. This is followed by a section
on criticisms of current rating scales. Then the chapter turns to an examination of
rating scales for diagnostic assessment. Here reasons are suggested why current
rating scales are unsuitable for the diagnostic context. Drawing on the literature
on diagnostic assessment as well as considerations in rating scale development,
five suggestions are made as to what a rating scale for diagnostic assessment
should look like.

3.2 Definition of a rating scale

A rating scale (sometimes referred to as scoring rubric or proficiency scale) has
been defined by Davies et al. (1999, p. 153) as:

A scale for the description of language proficiency consisting of a series
of constructed levels against which a language learner’s performance is
judged. Like a test, a proficiency (rating) scale provides an operational
definition of a linguistic construct such as proficiency. Typically such
scales range from zero mastery through to an end-point representing the
well-educated native speaker. The levels or bands are commonly charac-
terised in terms of what subjects can do with the language (tasks and func-
tions which can be performed) and their mastery of linguistic features
(such as vocabulary, syntax, fluency and cohesion)… Scales are de-
scriptions of groups of typically occurring behaviours; they are not in
themselves test instruments and need to be used in conjunction with tests
appropriate to the population and test purpose. Raters or judges are nor-
mally trained in the use of proficiency scales so as to ensure the measure’s
reliability.

3.3 The rating scale design process

Weigle (2002) describes a number of very practical steps that should be taken into
account in the process of scale development. Because these different steps illus-
trate the different options rating scale designers have in the design process, each is
described in detail below. For a rating scale to be valid, each of the different de-
sign options has to be weighed carefully.
1. What type of rating scale is desired? The scale developer should decide if
a holistic, analytic, primary trait or multi-trait rating scale is preferable
(each of these options will be described in detail below).
2. Who is going to use the rating scale? The scale developer needs to decide
between three functions of rating scales identified by Alderson (1991).
3. What aspects of writing are most important and how will they be divided
up? The scale developer needs to decide on what criteria to use as the basis
for the ratings.
4. What will the descriptors look like and how many scoring levels will be
used? There are limits to the number of distinctions raters can make. Many
large-scale examinations use between six and nine scale steps. This is de-
termined by the range of performances that can be expected and what the
test result will be used for. Developers also have to make decisions regard-
ing the way that band levels can be distinguished from each other and the
types of descriptor.
5. How will scores be reported? Scores from an analytic rating scale can ei-
ther be reported separately or combined into a total score. This decision
needs to be based on the use of the test score. The scale developer also has
to decide if certain categories on the rating scale are going to be weighted.

Although not mentioned by Weigle (2002), a sixth consideration important to rat-
ing scale development has been added to this list:

6. How will the rating scale be validated? The rating scale developer needs to
consider how the rating scale will be developed and what aspects of valid-
ity are paramount for the type of rating scale designed.

Each of these steps will now be considered in detail.

3.3.1 What type of rating scale is desired?

Traditionally, student performances were judged in comparison to the perform-
ance of others, but nowadays this norm-referenced method has largely given way
to criterion-referenced tests where the writing ability of each student is rated ac-
cording to specific external criteria like vocabulary, grammar or coherence. Four
forms of criterion-referenced assessment can be identified, namely holistic, ana-
lytic, primary trait and multiple-trait scoring (Hyland, 2003). Very comprehensive
summaries of the features as well as advantages and disadvantages of the different
scale types can be found in Cohen (1994), Weigle (2002), Bachman and Palmer
(1996), Grabe and Kaplan (1996), Hyland (2003), Fulcher (2003) and Kroll
(1998). These are summarized below. Weigle (2002) provides a useful overview
of the four different types of rating scales (Table 2):

Table 2: Types of rating scales for the assessment of writing (based on Weigle, 2002)

                  Specific to a particular writing task   Generalizable to a class of writing tasks
Single score      Primary Trait                           Holistic
Multiple score    Multiple Trait                          Analytic

Holistic scoring is based on a single, integrated score of writing behavior and re-
quires the rater to respond to writing as a whole. Raters are encouraged to read
each writing script quickly and base their score on a ‘general impression’. This
global approach to the text reflects the idea that writing is a single entity, which is
best captured by a single score that integrates the inherent qualities of the writing.
A well-known example of a holistic scoring rubric in ESL is the scale used for the
Test of Written English (TWE), which was administered as an optional extra with
the TOEFL test and has now been largely replaced by the TOEFL iBT1.

One of the advantages of this scoring procedure is that test takers are unlikely to
be penalized for poor performance on one aspect (e.g. grammatical accuracy).
Generally, it can be said that the approach emphasizes what is well done and not
the deficiencies (White, 1985). Holistic rating is generally seen as very efficient,
both in terms of time and cost. It has however been criticized and has nowadays
generally fallen out of favor for the following reasons. Firstly, it has been argued
that one score is not able to provide sufficient diagnostic information to be of
much value to the stakeholders. Uneven abilities, as often displayed by L2 writing
candidates (Kroll, 1998), are lumped together in one score. Another problem with
holistic scoring is that raters might overlook one or two aspects of writing per-
formance. Furthermore, it can be argued that, if raters are allowed to assign
weightings for different categories to different students, this might produce unfair
results and a loss of reliability and ultimately of validity. A further problem spe-
cific to L2 writing is that the rating scale might lump both writing ability and lan-
guage proficiency into one composite score. This might potentially result in the
same writing score for ESL learners who struggle with their linguistic skills and a
native speaker who lacks essay writing skills. The fact that writers are not neces-
sarily penalized for weaknesses but rather rated on their strengths can also be seen
as a disadvantage as areas of weakness might be important for decision-making
regarding promotion (Bacha, 2001; Charney, 1984; Cumming, 1990; Hamp-
Lyons, 1990). Finally, it is likely that test takers who attempt more difficult forms
and fail to produce these accurately might be penalized more heavily than test
takers using very basic forms accurately. Research has shown that holistic scores
correlate with quite superficial characteristics like handwriting (e.g. Sloan &
McGinnis, 1982).

A common alternative to holistic scoring is analytic scoring. Analytic scoring
makes use of separate scales, each assessing a different aspect of writing, for ex-
ample vocabulary, content, grammar and organisation. Sometimes scores are av-
eraged so that the final score is more usable. A commonly used example of an
analytic rating scale is the Jacobs’ scale (Jacobs, Zinkgraf, Wormuth, Hartfiel, &
Hughey, 1981). This scale has the added value that it is a weighted scale, so that
each component is weighted in proportion to its relative importance to the overall
product, as determined by the testing program.
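
The effect of such weighting can be illustrated with a brief sketch; the weights and scores shown are hypothetical and are not the weights of the Jacobs et al. (1981) scale itself.

# Illustrative only: combining analytic subscores into a weighted composite.
weights = {"content": 0.30, "organisation": 0.20, "vocabulary": 0.20,
           "language use": 0.25, "mechanics": 0.05}   # hypothetical weights

def composite(subscores, weights):
    assert abs(sum(weights.values()) - 1.0) < 1e-9    # weights must sum to 1
    return sum(subscores[c] * w for c, w in weights.items())

student = {"content": 4, "organisation": 3, "vocabulary": 4,
           "language use": 2, "mechanics": 5}          # e.g. scores on a 1-5 scale
print(composite(student, weights))                     # weighted composite score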

A clear advantage of analytic scoring is that it protects raters from collapsing
categories together as they have to assign separate scores for each category. Ana-
lytic scales help in the training of raters and in their standardization (Weir, 1990)
and are also more useful for ESL learners, as they often show a marked or uneven
profile which a holistic rating scale cannot capture accurately. Finally, as Weigle
(2002) points out, analytic rating is more reliable. Just as a discrete-point test be-
comes more reliable when more items are added, a rating scale with multiple
categories improves the reliability. However, there are also some disadvantages to
analytic rating. For example, there is no guarantee that raters will actually use the
separate subscales of an analytic scale separately. It is quite possible that rating on
one aspect might influence another. This is commonly referred to as the halo ef-
fect. Other authors criticize dissecting the written product into different aspects
and basing the rating on these subsections, as writing is arguably more than the
sum of its parts (White, 1985). Finally, rating using an analytic scale is more time
consuming and therefore more expensive.
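
The reliability argument is commonly formalised with the Spearman-Brown prophecy formula, given here as general background rather than as part of Weigle's discussion:

    \rho' = \frac{n\rho}{1 + (n - 1)\rho}

where \rho is the reliability associated with a single rating category (or a single rater) and n is the factor by which the number of categories (or ratings) is increased. If, for example, a single category yields \rho = 0.60, averaging over four analytic categories that behave like parallel measurements would give \rho' = (4 \times 0.60)/(1 + 3 \times 0.60) \approx 0.86.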

Table 3 below from Weigle (2002, p.121) summarizes the advantages and disad-
vantages of holistic and analytic scales.

Table 3: A comparison between holistic and analytic rating scales (based on Weigle, 2002)

Quality             Holistic Scale                                   Analytic Scale
Reliability         Lower than analytic but still acceptable         Higher than holistic
Construct Validity  Holistic scale assumes that all relevant         Analytic scales more appropriate for
                    aspects of writing develop at the same rate      L2 writers as different aspects of
                    and can thus be captured in a single score;      writing ability develop at different
                    holistic scores correlate with superficial       rates
                    aspects such as length and handwriting
Practicality        Relatively fast and easy                         Time-consuming; expensive
Impact              Single score may mask an uneven writing          More scales provide useful diagnostic
                    profile and may be misleading for placement      information for placement and/or
                                                                     instruction; more useful for rater
                                                                     training
Authenticity        White (1985) argues that reading holistically    Raters may read holistically and
                    is a more natural process than reading           adjust analytic scores to match
                    analytically                                     holistic impression

A third scale type is primary trait scoring which was developed in the mid 1970s
by Lloyd-Jones (1977) for the National Assessment of Educational Progress
(NAEP) in an effort to obtain more information than a single holistic score. The
goal is to predetermine criteria for writing on a particular topic. It therefore repre-
sents ‘a sharpening and narrowing of criteria to make the rating scale fit the spe-
cific task at hand’ (Cohen, 1994, p. 32) and is therefore context-dependent
(Fulcher, 2003). The approach allows for attention to only one aspect of writing.
Because these scales only focus on one aspect of writing, they may not be integra-
tive enough. Also, it might not be fair to argue that the aspect singled out for as-
sessment is primary enough to base a writing score on it. Another reason why
primary trait scoring has not been readily adopted is that it takes about 60 to 80
hours per task to develop.

The fourth and final type of rating scale is multi-trait scoring. Essays are scored
for more than one aspect, but the criteria are developed so that they are consistent
with the prompt. Validity is improved as the test is based on expectations in a par-
ticular setting. As the ratings are more task-specific, they can provide more diag-
nostic information than can a generalized rating scale. However, the scales are
again very time consuming to develop and it might be difficult to identify and
empirically validate aspects of writing that are especially suitable for the given
context. There is also no assurance that raters will not fall back on their traditional
way of rating.

Neither primary trait nor multi-trait scoring has been commonly used in ESL as-
sessment, probably because they are very time consuming to design and cannot be
reused for other tasks. Holistic and analytic rating scales have most commonly
been used in writing assessment.

3.3.2 Who is going to use the rating scale?

It is important that the format of the rating scale, the theoretical orientation of the
description and the formulation of the definitions are appropriate for the context
and purpose in mind. In drawing attention to this, Alderson (1991) identified three
different rating scale subcategories depending on the purpose the score will be
used for. It is important to note that, for each of these subcategories, descriptors
might be formulated in different ways. Firstly, user-oriented scales are used to
report information about typical behaviors of a test taker at a given level. This in-
formation can be useful for potential employers and others outside the education
system to clarify the circumstances in which a test taker will be able to operate
adequately (Pollitt & Murray, 1996). Descriptors are usually formulated as ‘can
do’ statements. The second type of scale that Alderson considers is the assessor-
oriented scale, which is designed to guide the rating process, focussing on the
quality of the performance typically observed in a student at a certain level.
Thirdly, there are constructor-oriented scales which are produced to help the test
developer select tasks for a test by describing what sort of tasks a student can do
at a certain level. The scales describe potential test items that might make up a
discrete point test for each level. Fulcher (2003) points out that the information in
each of these scales might be different and it is therefore essential for establishing
validity that scales are used only for the purpose for which they were designed.
North (2003) argues that scales used to rate second language performance should
be assessor-oriented, which means that they should focus on aspects of ability
shown by the performance. Although this might seem obvious, he then shows that
rating scales that follow the Foreign Service Institute (FSI) family of rating scales
(described later in this chapter in more detail) often mix these different purposes
in the one scale.

3.3.3 What are the criteria based on?

The ways in which rating scales and rating criteria are constructed and interpreted
by raters act as the de facto test construct (McNamara, 2002). North (2003) how-
ever cautions that viewing the rating scale as a representation of the construct is
simplistic, as the construct is produced by a complex interplay of tasks, perform-
ance conditions, raters and rating scale. However, it is fair to say that the rating
scale represents the developers’ view of the construct. Therefore, rating scales for
writing are usually based on what scale developers think represents the construct
of writing proficiency and the act of defining criteria involves operationalizing the
construct of proficiency.

Turner (2000) suggests that although rating scales play such an important part in
the rating process and ultimately represent the construct on which the perform-
ance evaluation is based, there is surprisingly little information on how commonly
used rating scales are constructed. The same point has also been made by McNa-
mara (1996), Brindley (1998) and Upshur and Turner (1995). It is however vital
to have some knowledge of how scales are commonly constructed in order to un-
derstand some of the main issues associated with rating scales. Fulcher (2003)
points out that many rating scales are developed based on intuition (see also
Brindley, 1991). He describes three sub-types of intuitive methods, which are out-
lined below.

Several researchers have described models of rating scale development that are
not based on intuition. These design methods can be divided into two main
groups. Firstly, there are rating scales that are based on a theory. This could be a
theory of communicative competence, a theory of writing or a model of the deci-
sion-making of expert raters. Secondly, scales can be based on empirical methods. The
following sections describe intuition-based, theory-based and empirically-based
methods in more detail.

3.3.3.1 Intuition-based scale development

Intuitive methods include expert judgements, committee and experiential meth-
ods. A scale is developed through expert judgement when an experienced teacher
or language tester develops a rating scale based on already existing rating scales, a
teaching syllabus or a needs analysis. Data might be collected from raters as feed-
back on the usefulness of the rating scale. The committee method is similar to ex-
pert judgement, but here a small group of experts develop the criteria and descrip-
tors together. Experiential scale design usually starts with expert judgement or
committee design. The rating scale then evolves and is refined over a period of
time by those who use it. This is by far the most common method of scale devel-
opment.

The Foreign Service Institute (FSI) family of rating scales is based on intuitive
design methods (Fulcher, 2003). These scales became the basis for many scales,
like the ILR (Interagency Language Roundtable) and ACTFL (American Council
on the Teaching of Foreign Languages) rating scales still commonly used today.
The FSI scale was developed very much in-house in a United States government
testing context to test foreign service personnel, and all the scales focus on several
very basic principles, all of which have been criticized. Firstly, the scale descrip-
tors are defined in relation to levels within the scale and are not based on external
criteria. The only reference point is the ‘educated native-speaker’. As early as the
late 1960s, Perren (1968) criticized this and argued that the scale should be based
on a proficient second language speaker. The ILR scale ranges from ‘no practical
ability’ to ‘well-educated native speaker’. The Australian Second Language Profi-
ciency Ratings (ASLPR) also use these criteria. The concept of the ‘educated na-
tive speaker’ has come increasingly under attack (see for example Bachman &
Savignon, 1986; Lantolf & Frawley, 1985) because native speakers vary consid-
erably in their ability.

Secondly, it has been contended that the scale descriptors of the FSI family rating
scales are based on very little empirical evidence. Similarly, Alderson (1991) was
able to show that some of the IELTS band descriptors described performances that
were not observed in any of the actual samples. This is a clear threat to the valid-
ity of the test.

Thirdly, the descriptors in the FSI family of rating scales range from zero profi-
ciency through to native-like performance. Each descriptor exists in relation to the
others. The criticism that has been made in relation to this point (see for example
Pienemann, Johnston, & Brindley, 1988; Young, 1995) is that the progression of
the descriptors is not based on language development as shown by researchers in-
vestigating second language acquisition. It can therefore be argued that the theo-
ries underlying the development of the rating scales have not been validated and
are probably based on the intuitions and personal theories of the scale developers.

Fourthly, rating scales in the FSI family are often marked by a certain amount of
vagueness in the descriptors. Raters are asked to base their judgements on key
terms like ‘good’, ‘fluent’, ‘better than’, ‘always’, ‘usually’, ‘sometimes’ or ‘many
mistakes’. Despite all these criticisms levelled at intuitively developed rating
scales, it is important to note that they are still the most commonly used scales in
high-stakes assessments around the world.

3.3.3.2 Theory-based rating scale design

North (2003) argues that, inevitably, the descriptors in proficiency scales are a
simplification of a very complex phenomenon. In relation to language learning, it
would be ideal if one could base the progression in a language proficiency scale
on what is known of the psycholinguistic development process. However, the in-
sights from this area of investigation are still quite limited and therefore hard to
apply (see for example Ingram, 1995). It could then be argued that if the stages of
proficiency cannot be described satisfactorily, one should not use proficiency
scales. But, as North points out, raters need some sort of reference point to follow.
Another possible response to the problem, and one taken by for example Mislevy
(1995), is that proficiency scales should be based on some sort of simplified stu-
dent model which is a basic description of selected aspects that characterize real
students. It is however clear that, unless the underlying framework of a rating
scale takes some account of linguistic theory and research in the definition of pro-
ficiency, the validity of the scale will be limited (Lantolf & Frawley, 1985).

Below, four types of theories (or models) are described which could be used as a
basis for a rating scale of writing: the four skills model, models of communicative
competence, theories/models of writing and theories of rater decision-making.

3.3.3.2.1 The Four Skills Model

A widely used conceptual framework seems to be the Four Skills Model proposed
by Lado (1961) and Carroll (1968). North (2003) summarizes the common fea-
tures of the model with respect to language in the following table (Table 4):

Table 4: The Four Skills Model (from North, 2003)

                          Spoken Language                          Written Language
                          Receptive skill:   Productive skill:     Receptive skill:   Productive skill:
                          Listening          Speaking              Reading            Writing
Phonology/Orthography
Lexicon
Grammar

It can be seen from Table 4 that each skill is underpinned by the three elements,
phonology/orthography, lexicon and grammar. In order to write or speak, the
learner puts lexis into appropriate grammatical structures and uses phonology or
orthography to realize the sentence or utterance. North (2003) points out that al-
though the model is not theoretically based, it is generic and is therefore poten-
tially applicable to any context.

An example of a rating scale based on the Four Skills Model is a scale proposed
by Madsen (1983). This scale shows that in the Four Skills Model communication
quality and content are not assessed (see Table 5 below).

Table 5: Example of rating scale representing Four Skills Model (Madsen, 1983)
Mechanics 20%
Vocabulary choice 20%
Grammar and usage 30%
Organisation 30%
Total 100%

The scale in Table 5 above additionally gives ‘grammar and usage’ and ‘organisation’
more weight than the other two items. It is not clear whether this weighting was
based on any theoretical or empirical evidence.

The main advantage of adopting the categories of the Four Skills Model for rating
scales lies in their simplicity. The rating categories are simple and familiar to eve-
ryone. North (2003) points out that the main disadvantage of the model is that it
does not differentiate between range and accuracy of both vocabulary and gram-
mar. Grammar may be interpreted purely in terms of counting mistakes. There is
also no measurement of communicative ability in this type of rating scale.

3.3.3.2.2 Models of communicative competence

One way of dealing with the lack of communicative meaning in the Four Skills
Model is to base the assessment criteria on a model of language ability (see also
Luoma, 2004 on this topic in the context of speaking assessment). A number of
test designers (e.g. Clarkson & Jensen, 1995; Connor & Mbaye, 2002; Council of
Europe, 2001; Grierson, 1995; Hawkey, 2001; Hawkey & Barker, 2004; McKay,
1995; Milanovic, Saville, Pollitt, & Cook, 1995) have chosen to base their rating
scales on Canale and Swain’s (1983; 1980), Bachman’s (1990) or Bachman and
Palmer’s (1996) models of communicative competence which will be described in
more detail below.

One of the first theories of communicative competence was developed by Hymes
(1967; 1972). He suggested four distinct levels of analysis of language use that are
relevant for understanding regularities in people’s use of language. The first level
is what is possible in terms of language code, the grammatical level. At another
level, what a language user can produce or comprehend in terms of time and proc-
essing constraints should be examined. Another level should be concerned with
what is appropriate in different language-use situations. Finally, language use is
shaped by the conventions and habits of a community of users. Hymes also made a distinction between lan-
guage performance as in a testing situation, and more abstract models of underly-
ing knowledge and capacities which might not be tapped in most performance
situations. Hymes’ model of communicative competence was developed for the
L1 context but it seems equally relevant for the L2 context.

Canale and Swain (1980) were the first authors to adapt Hymes’ model for the L2
context. The most influential feature of this model was that it treated different
domains of language as separate, which was ground-breaking after a decade of
research based on Oller’s hypothesis that language ability is a unitary construct
(see for example Oller, 1983; Oller & Hinofotis, 1980; Scholz, Hendricks, Spurl-
ing, Johnson, & Vandenburg, 1980). Canale and Swain (1980) proposed the fol-
lowing domains of language knowledge: grammatical competence, sociolinguistic
competence and strategic competence. Canale (1983) later extended this to in-
clude discourse competence. Sociolinguistic competence stresses the appropri-
ateness of language use, the language user’s understanding of social relations and
how language use relates to them. Discourse competence is concerned with the
ability of the language user to handle language beyond the sentence level. This
includes the knowledge of how texts are organised and how underlying meaning
can be extracted based on these principles. As Skehan (1998a) points out, it is im-
portant to note here that while native speakers distinguish themselves mainly in
the area of linguistic competence, some might have problems in the areas of so-
ciolinguistic and discourse competence. Strategic competence, according to Ca-
nale and Swain (1980), only comes into play if the other competences are unable
to cope.

According to Skehan (1998a), the model proposed by Canale and Swain (1980)
and later extended by Canale (1983), is lacking in a number of ways. It fails in
that it does not relate the underlying abilities to performance, nor does it account
for different contexts. It fails to account for the fact that some of the competencies
might be more important in some situations than in others. He also criticizes the
position given to strategic competence, which in this model only comes into play
when there is a communication breakdown and is therefore only used to compen-
sate for problems with other competences.

The model proposed by Canale and Swain was subsequently further developed by
Bachman (1990). His model distinguishes three components of language ability:
language competence, strategic competence and psycho-physiological mecha-
nisms / skills. Language competence in turn consists of two components, organ-
isational and pragmatic competences. Organisational competence includes the
knowledge involved in creating or recognizing grammatically correct utterances
and comprehending their propositional content (grammatical competence) and in
organising them into text (textual competence). Pragmatic competence includes
illocutionary competence and sociolinguistic competence. Bachman (1990) rede-
fines strategic competence as a “general ability which enables an individual to
make the most effective use of available abilities in carrying out a given task” (p.
106).

He recognizes some sort of interaction between the different components of the
model. This is handled by strategic competence. Bachman’s (1990) model needs
to be validated by demonstrating that the different components are in fact separate
and make up the structure of language ability (Shohamy, 1998).

In 1996, Bachman and Palmer revised their model to include the role played by
affective factors in influencing language use. A further change in this model is
that strategic competence is now seen as consisting of a set of metacognitive
strategies. ‘Knowledge structures’ (knowledge of the world) from the 1990 model
has been relabelled ‘topical knowledge’. In this model, strategic knowledge can
be thought of as a higher order executive process (Bachman & Palmer, 1996)
which includes goal-setting (deciding what to do), assessment (deciding what is
needed and how well one has done) and planning strategies (deciding how to use
what one has). The role and subcomponents of the language knowledge compo-
nent remain essentially unchanged from the Bachman (1990) model.

Skehan (1998a) sees the Bachman and Palmer model as an improvement on pre-
vious models in that it is more detailed in its specifications of the language com-
ponent, defines the relationships between the different components more ade-
quately, is more grounded in linguistic theory and is also more empirically based.
He finds, however, that there are problems with the operationalization of the con-
cepts, which are generally structured in the form of a list. It is difficult to find any
explanation in the model for why some tasks are more difficult than others and
how this influences accuracy, fluency and complexity. Luoma (2004) suggests
that the quite detailed specification of the language component distracts from
other components and knowledge types, which may as a result receive less em-
phasis. She therefore suggests that test developers might want to use Bachman
and Palmer’s (1996) model in conjunction with other frameworks. Like Bach-
man’s (1990) model, the Bachman and Palmer (1996) model has not been vali-
dated empirically.

The advantage of basing a rating scale on a model of communicative competence
is that these models are generic and therefore not context-dependent. This makes
results more generalisable and therefore transferable across task types. Although
models of communicative competence have been adopted as the basis of a number
of rating scales (e.g. Clarkson & Jensen, 1995; Connor & Mbaye, 2002; Council
of Europe, 2001; Grierson, 1995; Hawkey, 2001; Hawkey & Barker, 2004;
McKay, 1995; Milanovic et al., 1995), it is clear that they do not offer a sufficient
foundation for the testing of writing. It is for example not certain that the aspects
distinguished in the theoretical models can be isolated, operationalised or
weighted. Also, models of communicative competence, as the name suggests,
have been intended to be models of underlying competence and not of perform-
ance. The models therefore have difficulty coping when underlying competence is
put into use (North, 2003). This, for example, becomes apparent in the fact that
none of the models have a place for fluency. So although Connor and Mbaye
(2002) feel that the communicative competence model offers a convenient frame-
work for categorizing components of written (and spoken) discourse and that all
four competencies (grammatical, discourse, sociolinguistic and strategic) should
be reflected in the scoring criteria, Carson (2001) takes the opposite view and ar-
gues that we currently lack adequate theories to base writing assessment on.

3.3.3.2.3 Theories of writing

After examining the Four Skills model and the various models of communicative
competence as possible theories for rating scale development, we now turn to
theories in the area of writing research to establish if any of these models would
be useful as a theoretical basis for rating scale development.

Numerous attempts have been made to develop a theory of writing. However, at
the present time no overarching model of L2 writing is available (Cumming,
1998; Cumming & Riazi, 2000). Several models have been suggested based on
research on writing as a product or on writing as a process. However, for the pur-
pose of rating scale development, only models based on writing as a product are
useful, as raters have no access to the writing processes used by students. Two
product-oriented models have been proposed: Grabe and Kaplan’s (1996) model
of text construction, and parts of a taxonomy of academic writing skills, knowl-
edge bases, and process by the same authors. Each will be discussed below.

Grabe and Kaplan (1996) propose a model of text construction. They argue that
from the research they have reviewed it becomes clear that any model of text con-
struction needs at least seven basic components: syntactic structures, semantic
senses and mapping, cohesion signalling, genre and organisational structuring to
support coherence interpretations, lexical forms and relations, stylistic and register
dimensions of text structure, and non-linguistic knowledge bases, including world
knowledge. These seven components (syntax, semantics, lexicon, cohesion, co-
herence, functional dimensions and non-linguistic resources) form the centre of
the text construction model. On the sentential level, two components are speci-
fied: syntax and semantics. On a textual, or intersentential level are cohesion and
coherence. The lexicon is connected to all four of the other components, in both
surface form and underlying organisation and is therefore placed in a central posi-
tion. On an interpersonal level, the style level, are the components of posture and
stance.

The syntactic component involves types of phrases and clauses and the ordering
of phrases and words within a sentence. The authors suggest that a researcher
might, for example, want to investigate the number of types of passive structures.
Overall, syntactic analysis at this stage will involve the counting of various con-
structions and categories. Grabe and Kaplan (1996) acknowledge that the seman-
tic component is open to alternative frameworks as there is no complete theory of
semantics currently available. Cohesion and coherence, on the text level, can be
seen as equivalent to syntax and semantics at the level of the sentence (or the
clause). The authors point out that there is no consensus on an overall theory of
cohesion, nor is there a satisfactory overall definition. It is also not completely
clear what the relationship is between cohesion and coherence.

The lexicon, which influences all components described above, is placed in a cen-
tral position. Vocabulary used in text construction provides the meaning and sig-
nals that are needed for syntax, semantics and pragmatic interpretations.
The third, interpersonal level of writing shows the writer’s attitudes to the reader,
the topic and the situation. Style ultimately reflects the personality of the writer.
Several parameters are available to express this personality, such as formality or
distance.

Because this model was not created with test development in mind, Grabe and
Kaplan (1996) offer no explanation on how the building blocks of the model can
be evaluated in a student’s piece of writing, nor is there any consideration of
whether some features contribute more than others to a successful piece of writ-
ing.

Another way to arrive at a theory of writing is to gather all the information that
can be collected through an ethnography of writing and categorize it into a taxon-
omy of writing skills and contexts. This is, according to Grabe and Kaplan, a use-
ful way to identify any gaps that can be further investigated. However, what be-
comes clear from this taxonomy is just how many different aspects and variables
are encompassed in writing that need to be considered when conducting research.
However, this taxonomy offers no information that could be used in the writing of
descriptors of a rating scale, nor does it attempt to structure the information hier-
archically.

3.3.3.2.4 Models of decision-making by expert judges

A number of studies have utilized protocol analysis to examine raters’ decision-
making behavior more closely. As a group, these studies show that ex-
perienced raters bring certain strategies to the rating process which centre not only
on the rating scale criteria but also reflect how they approach the rating process.
One such study was conducted by Freedman and Calfee (1983) who tried to inte-
grate their findings into a model of information processing by raters. The most
important aspect of this model is that the evaluation of a text is not based on the
actual text but on an image created by the rater, which is stored in the working
memory. One implication of this model is that the analysis of observable features
might be insufficient unless a significant relationship is established between ob-
servable textual features and the text image created in the raters’ working mem-
ory. Freedman and Calfee argue that for trained raters, the similarities between the
actual text and that represented in their working memories should be greater than
the differences.

Cumming (1990) showed that raters use a wide range of knowledge and strategies
and that their decision-making processes involve complex, interactive mental
processes. He identified 28 interpretation and judgment strategies used by the rat-
ers in his study and he was able to show that both expert and novice raters were
able to distinguish between language proficiency and writing ability. Based on
Cumming’s study, Milanovic, Saville and Shen (1996) devised a model of the de-
cision-making involved in holistic scoring. This model can be seen in Figure 4
below. It shows that raters first scan the script and form an overall idea of the
length, format, handwriting and organisation, followed by a quick read which es-
tablishes an indication of the overall level of the writing script. Only then do rat-
ers proceed to rating.

Figure 4: Model of decision-making in composition marking (Milanovic, Saville and Shen, 1996)

Milanovic and his co-researchers also created a list of items that raters focus on.
These include length, legibility, grammar, structure, communicative effectiveness,
tone, vocabulary, spelling, content, task realization and punctuation. Their find-
ings also give some indication of how the raters weighted these essay features.
They noticed, for example, that spelling and punctuation were not seen to be im-
portant as other features. It seems, however, that the findings with regard to
weighting are quite inconclusive and vary greatly among individual raters.

Both Vaughan (1991) and Lumley (2002; 2005) showed that raters generally fol-
low the rating criteria specified in the rating scale but if the essay does not fit the
pre-defined categories, they are forced to make decisions that are not based on the
rating scale or on any rater training they received. Consequently they are unreli-
able and might lack validity. Vaughan (1991) showed that in such cases the raters
based their rating on first impression or used one or two categories like grammar
and/or content to arrive at their rating. Similarly, Lumley (2002; 2005) found that
if some aspect of the script was not covered by the rating scale descriptors, the
raters used their own knowledge or intuitions to resolve uncertainties or they re-
sorted to other strategies like heavily weighting one aspect or comparing the script
with previously rated compositions. He acknowledges that scale development and
rater training might help, but found that these could not prevent this problem from
occurring. He argues therefore that it is possible that the rater, and not the rating
scale, is at the centre of the rating process.

Sakyi (2000) who also used verbal protocols, found four distinct styles among the
raters in his study: focus on errors in the text, focus on essay topic and presenta-
tion of ideas, focus on the rater’s personal reaction to the text and focus on the
scoring guide. He also noticed that certain criteria were more associated with high
and low marks (this was also observed by A. Brown, 2002, and Pollitt & Murray,
1996). On the basis of his findings, Sakyi proposed a model of the holistic scoring
process as seen in Figure 5 below.

Figure 5: Model of holistic rating process (Sakyi, 2000)

Cumming, Kantor and Powers (2001; 2002) undertook a series of studies to de-
velop and verify a descriptive framework of the decision-making processes of rat-
ers as part of the development process of TOEFL 2000. They also investigated if
there were any differences between the decision-making processes of English-
mother-tongue (EMT) raters and ESL/EFL trained raters. In the first study, a pre-
liminary descriptive framework was developed based on the think-aloud protocols
of ten experienced raters rating essays without scoring criteria. In the second
study, this framework was applied to verbal data from another seven experienced
raters. In the third study, this framework was revised by analyzing think-aloud
protocols from the same raters. The results of their studies showed that raters put
more weight on rhetoric and ideas (compared to language) when scoring higher
level compositions. They also found that ESL trained raters attended more exten-
sively to language than rhetoric and ideas whilst the EMT raters divided their at-
tention more evenly. Overall, however, the research showed that the two groups
of raters rate compositions very similarly, which verified the framework. Most
participants in the study noted that their background, teaching experiences and
previous rating experiences had influenced the process of rating as well as the cri-
teria they applied. The authors argue that a descriptive framework of the rating
processes of experienced raters is necessary to formulate, field-test, and validate
rating scales as well as to guide rater training. The descriptive framework of deci-
sion-making behaviors of the raters in Cumming et al.’s (2001; 2002) study can be
found in Table 6 below.

Table 6: Descriptive framework of decision-making behaviors while rating TOEFL Writing Tasks (Cumming et al., 2001, 2002)

Interpretation Strategies
Self-monitoring focus: Read or interpret prompt or task input or both; Read or reread composition; Envision personal situation of the writer
Rhetorical and ideational focus: Discern rhetorical structure; Summarize ideas or propositions; Scan whole composition or observe layout
Language focus: Classify errors into types; Interpret or edit ambiguous or unclear phrases

Judgment Strategies
Self-monitoring focus: Decide on macro-strategy for reading and rating (compare with other compositions, or summarize, distinguish or tally judgments collectively); Consider own personal response or biases; Define or revise own criteria; Articulate general impression; Articulate or revise scoring decision
Rhetorical and ideational focus: Assess reasoning, logic or topic development; Assess task completion or relevance; Assess coherence and identify redundancies; Assess interest, originality or creativity; Assess text organisation, style, register, discourse functions or genre; Consider use and understanding of source material; Rate ideas or rhetoric
Language focus: Assess quantity of total written production; Assess comprehensibility and fluency; Consider frequency and gravity of errors; Consider lexis; Consider syntax or morphology; Consider spelling or punctuation; Rate language overall

The 27 behaviors identified in Table 6 are those that one might expect from ex-
perienced raters when rating ESL/EFL compositions. On the basis of this, the au-
thors argue that analytic rating scales should reflect how experienced raters score.
For example, they divide their attention equally between content and language. It
might also make sense to weight criteria more heavily towards language at the
lower end of the scale and more towards rhetoric and ideas at the higher end. This
might suggest that language learners need to manifest a certain threshold of lan-
guage before raters are able to attend to their ideas. The study also showed that it
is necessary for each task type to have a rating scale that is uniquely designed for
it. However, this might not be practical.

3.3.3.3 Empirically-based scale development

In empirically-based scale development, descriptors are created through an em-
pirically verifiable procedure and are based on observable learner behaviour. In
this way, according to Fulcher (2003), there is a close relationship between the
linguistic behaviour, the task and the rating scale. One of the first studies focus-
sing on data-driven development was undertaken by Fulcher (1987; 1996a) in the
context of the English Language Testing Service (ELTS) speaking test2. Fulcher
questioned the descriptors used for fluency. He carried out a discourse analysis
and found that the assumptions underlying the scale were not really evident in the
performance of native or non-native speakers of English. For example, the scale
proposed that hesitations are a feature of less fluent speech. Fulcher, however,
found that hesitations are used by the native speaker as a turn-taking device and
that they probably indicate online processing of propositional information. On the
basis of his findings he proposed a new rating scale.

Another data-based approach to scale development has been proposed more re-
cently by researchers working on the Cambridge ESOL examinations (Hawkey,
2001; Hawkey & Barker, 2004). The aim of the study was to develop an overarch-
ing rating scale to cover Cambridge ESOL writing examinations at different lev-
els. They used a corpus-based approach to distinguish key features at four pre-
assessed proficiency levels. Writing scripts were classed into subcorpora at differ-
ent levels on the basis of previous ratings. The subcorpora were then analysed to
identify the salient features underlying each level. The scripts were reread by the
main researcher who then decided which features should be included in the rating
scale. Therefore, the criteria included in the design of this study emerged partly
from the intuitions of the main researcher as well as from features identified by
the corpus analyst. On the basis of this, a draft scale was designed. It is not clear,
however, whether any validation of this common scale of writing was undertaken.

3.3.3.3.1 Empirically derived, binary-choice, boundary definition scales
(EBBs)

Empirically derived, binary-choice, boundary definition (EBB) rating scales were
developed by Upshur and Turner (1995; 1999). EBB scales are derived by asking
experienced raters or teachers to sort writing scripts into separate piles of better
and poorer performances. Then, the separate piles are analyzed and key features at
each level identified. On the basis of this, critical questions are devised which dis-
tinguish the different levels. Raters make a number of yes/no choices at each level
of a flow chart to arrive at a final score. EBB scales can therefore be considered as
based on both intuition and empirical evidence.
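The flow-chart logic of an EBB scale can be sketched in code. The yes/no questions and band values below are invented for illustration and are not Upshur and Turner's actual criteria; the point is only the structure, in which a rater answers a series of binary boundary questions and the path taken determines the score.

```python
def ebb_score(answers: dict) -> int:
    """Walk a hypothetical EBB flow chart and return a band score.

    `answers` maps (invented) boundary questions to the rater's yes/no
    decisions about a single writing script.
    """
    # The first boundary question splits scripts into an upper and a lower group.
    if answers["task_fulfilled"]:
        # Upper group: one further boundary question per level.
        if answers["ideas_well_supported"]:
            return 6 if answers["errors_rarely_intrusive"] else 5
        return 4
    # Lower group of the scale.
    if answers["main_idea_recognisable"]:
        return 3
    return 2 if answers["some_relevant_content"] else 1

# A rater's answers for one script lead to a single band:
print(ebb_score({
    "task_fulfilled": True,
    "ideas_well_supported": False,
    "main_idea_recognisable": True,
    "some_relevant_content": True,
}))  # 4
```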

Upshur and Turner (1995) claim that the main difference between EBB scales and
traditional rating scales is that, instead of having descriptors that define the mid-
point of a band, there are a number of questions which describe the boundaries
between categories. Ratings are therefore based on differences rather than simi-
larities. They also contend that the strength of their scale lies in its simplicity
as no more than one feature competes at a particular level. Fulcher (2003), how-
ever, argues that the EBB rating scales do not take into account a theoretical, lin-
ear process of second language acquisition. They rely entirely on the decisions of
expert raters. Another weakness of the EBB scales is that they can only be applied
to a specific task and cannot be generalized to others or to the real world. Also,
they again rely heavily on the judgment of expert raters working within a particu-
lar context. Finally, the scale weights some criteria more heavily than others: criteria
at a higher level of the decision hierarchy, which require a decision to be made first,
carry more weight. Upshur and Turner found increased inter-rater reliability, but no
post-hoc validation studies were carried out.

3.3.3.3.2 Scaling descriptors

Another method of empirical scale development was proposed by North (1995;
North & Schneider, 1998) who introduced the method of scaling descriptors when
developing a common European framework for reporting language competency.
North followed four distinct phases. In Phase 1, he collected the scale descriptors
of thirty rating scales and pulled them apart. This resulted in a ‘pool’ of two thou-
sand band descriptors. The ‘pool’ was then grouped into different kinds of com-
municative activities and different aspects of strategic and communicative compe-
tence. In Phase 2, teachers were given an envelope of band descriptors and asked
to group them into four or five given categories. They were further asked to mark
down what they found particularly clear or useful about a descriptor. Teachers
then circled the band descriptors which they found relevant to their teaching. In
another workshop, teachers were given descriptors of one category and asked to
rank them according to ‘low’, ‘middle’ and ‘high’ and then divide each group into
a further two levels to arrive at six levels. The descriptors that were ranked most
consistently were then put into questionnaires linked by common anchor items,
which were the same in all questionnaires. The third phase involved the quantita-
tive analysis of the data. Raters were asked to rate a small number of students
from their own classes using the descriptors in the questionnaires. Multi-faceted
Rasch measurement3 was then used to construct a single scale from the descrip-
tors, identifying any misfitting descriptors in the process. Next, cut-off points
were established using difficulty estimates, natural gaps and groupings. The
whole process was repeated in the fourth and final phase and in this case other
languages (French and German) were added as well as other skills (listening,
reading and speaking). North and Schneider (1998) acknowledge that their
method is essentially a-theoretical in nature as it is not based on either empirically
validated descriptions of language pro-ficiency or on a model of language learn-
ing.

This section has reviewed ways in which rating scale descriptors can be devel-
oped. First was a description of intuition-based scale development, and then the-
ory-based scale development was explored. Possible theories discussed were the
four skills model, models of com-municative competence, theories/models of
writing and models of rater decision-making. The final scale development method
described was empirical scale de-velopment. After reviewing each of these possi-
ble ap-proaches to scale development, it is clear that each approach provides dif-
ferent types of information and therefore none seems sufficient on its own. Impli-
cations for the development of a rating scale for diagnostic assessment can be
found later in this chapter.

3.3.4 What will the descriptors look like and how many scoring levels will be
used?

The rating scale developer also has to make a number of decisions at the descrip-
tor level. Firstly, how many bands the rating scale should have needs to be de-
cided. Secondly, the developer has to decide how the descriptors will differentiate
between the levels. Finally, the descriptor formulation style needs to be deter-
mined.

3.3.4.1 The number of bands in a scale

Research has shown that raters can only differentiate between seven (plus or mi-
nus two) levels (Miller, 1956)4. North (2003) points out that there is a certain ten-
sion when deciding the number of levels. Firstly, one needs enough levels to show
progress and discriminate between different learners, but the number of bands
should not exceed a certain number so that raters can still make reasonable dis-
tinctions. He argues that there is a direct relationship between reliability and deci-
sion power. Myford (2002) investigated the reliability and candidate separation of
a number of different scales and concluded that the reliability was highest for
scales ranging from five to nine scale points, lending credibility to Miller’s sug-
gestion.

Another issue is how many bands are appropriate for specific categories. Some
categories might not lend themselves to as fine distinctions as others. This might
manifest itself in the inability of a scale developer to formulate descriptors at all
levels or the failure of the raters to distinguish between the levels even if they are
defined (North, 2003). According to North, there are different ways of reacting to
this problem: test developers can admit to the problem, circumvent it, or investigate
it. To circumvent the problem, finer categories can be combined into broader
categories. In the investigation approach, the researcher examines each band by
making use of Rasch scalar analysis (as suggested by Davidson, 1993). This re-
quires an iterative process where mis-functioning scale bands are revised or re-
formulated and then modelled again by statistical analysis until the problem is
solved.
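As a purely hypothetical sketch of the circumvention option, the snippet below collapses a fine-grained set of bands into broader reporting categories; the bands and groupings are invented for the example.

```python
# Hypothetical example: collapse six bands (4-9) into three broader categories
# for a trait on which raters cannot reliably distinguish adjacent levels.
COLLAPSE_MAP = {4: "low", 5: "low", 6: "mid", 7: "mid", 8: "high", 9: "high"}

def collapse_band(raw_band: int) -> str:
    """Map a fine-grained band onto a broader reporting category."""
    return COLLAPSE_MAP[raw_band]

print([collapse_band(b) for b in (4, 6, 9)])  # ['low', 'mid', 'high']
```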

3.3.4.2 Distinguishing between levels

Another issue that the rating scale developer has to tackle is how to distinguish
between the different levels. Several approaches are possible. For example, not all
scales provide descriptors for the levels. Some rating scales might start with 100
points and ask the rater to subtract points for each mistake. They are therefore
based on a deficiency approach rather than a competence approach, which would give credit
for ability. Such a scheme was presented by Reid (1993); an extract is shown
in Figure 6 below:

Begin with 100 points and subtract points for each deficiency:
- appropriate register (formality or informality) - 10 points
- language conventions - 10 points
- accuracy and range of vocabulary - 5 points
Figure 6: Extract from deficit marking scheme (Reid, 1993)
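Read literally, such a deficit scheme computes a score by subtraction. The sketch below encodes the three deductions shown in the extract; the assumption that each deficiency is deducted once, and the category names used as keys, go beyond what Figure 6 itself specifies.

```python
# Point deductions taken from the extract in Figure 6 (Reid, 1993).
DEDUCTIONS = {
    "appropriate_register": 10,
    "language_conventions": 10,
    "vocabulary_accuracy_and_range": 5,
}

def deficit_score(observed_deficiencies: list[str]) -> int:
    """Begin with 100 points and subtract points for each observed deficiency."""
    return 100 - sum(DEDUCTIONS[d] for d in observed_deficiencies)

# A script with register problems and weak vocabulary range:
print(deficit_score(["appropriate_register", "vocabulary_accuracy_and_range"]))  # 85
```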

Other rating schemes might require the rater to score each aspect under investiga-
tion out of three. There is no scale that guides the rating and it is therefore very
hard to know how raters agree on a particular score for a certain feature.
An alternative to the kinds of rating schemes shown above is to assign marks on a
scale. There are three different types of scales which roughly follow the historical
development of rating scales (North, 2003).

a) Graphic and numerical rating scales: These scales present a continuous line be-
tween two points representing the top and the bottom ends of the scale. Graphic
scales require the rater to choose a point on the scale, whilst numerical scales di-
vide the continuum into intervals represented by numbers. An example of each of
these can be found in Figure 7 below. The graphic scale is at the top and the nu-
merical scale is at the bottom.

Quality: High ____________________________________ Low

Quality: High ____________________________________ Low
                   5         4         3         2         1
Figure 7: Graphic and numerical rating scales (North, 2003)

A drawback of these types of scales is that they say nothing about the behavior
associated with each of the levels of the continuum. It is therefore not clear why
two raters might agree on a particular level.

b) Labeled scales: Later, rating scale developers set out to add cues to the various
points along the scale. Cues were usually quite vague, with stages on the contin-
uum ranging from, for example, ‘too many errors’ to ‘almost never makes mis-
takes’ or they might range from ‘poor’ to ‘excellent’. The obvious disadvantage of
these types of scales lies in their vagueness. It is, for example, a quite subjective
judgment whether a learner’s writing is ‘above average’ or ‘excellent’.

c) Defined scales: Another step in rating scale development was taken when the
horizontal scales described above were changed to vertical scales, so that there
was suddenly ample space for longer descriptions. An example of such a scale is
Shohamy et al.’s (1992) ESL Writing scale. Shohamy’s team was able to show
that these more detailed descriptors led to a higher level of inter-rater reliability.
An extract from the scale can be found in Figure 8 below.

Accuracy
5 Near native accuracy
4 Few sporadic mistakes; more sophisticated; complex sentence
structures; idiomatic expression
3 Consistent errors; accurate use of varied/richer vocabulary; longer
sentence structure.
2 Frequent consistent errors yet comprehensible; basic structures
and simple vocabulary
1 Poor grammar and vocabulary strongly interfering with compre-
hensibility; elementary errors.
0 entirely inaccurate
Figure 8: ESL Writing: Linguistic (Shohamy et al., 1992)

Myford (2002) compared the reliability of a number of different scale types. She
was interested to see whether the number of scale points or the presence or ab-
sence of a defined midpoint made a difference. She found no significant differ-
ences in the resultant reliability and therefore concluded that the training of raters
is more important than the type of descriptors used.

3.3.4.3 Descriptor formulation styles

The rating scale designer also has to decide how to formulate the descriptors.
North (2003) distinguishes three different approaches to formulating descriptors.

a) Abstract formulation: Scales which define the degree or presence of a certain
feature at each band by qualifiers and quantifiers like, for example, ‘a lot’, ‘some’,
‘a few’ etc. One downside to such scales is that they are not empirically based and
therefore it is not clear if each of these levels actually occurs and if there is a sig-
nificant difference between each of these quantifiers.

b) Concrete formulation: Scales which focus on salient characteristics which can
be defined in concrete terms at the different bands. As the focus is on explicitness,
there is no attempt made to create a semantic continuum where each descriptor
shares phrases with the descriptors above and below. The advantage of this ap-
proach is that the descriptors actually provide information. That means the de-
scriptors of each band can be converted into a checklist of ‘yes’ or ‘no’ questions.
Scales that have such descriptors usually result in greater inter-rater reliability.
However, it is often not clear how these descriptions were assigned to particular
bands. For this reason, Fulcher (1987; 1996a) and North (2003) suggested an em-
pirical approach to scale development.

c) ‘Objective’ formulation: These are scales which seek objectivity by pegging
bands to identifiable, countable features like mistakes or specific pieces of re-
quired information. These scales have been influenced by traditional marking
schemes, by behavioral objectives and by research tools which analyze speaking
or writing in terms of objective features like the number of words per utterance or
number of words in error-free utterances. This third formulation style aims for
objectivity in a very simplistic manner. One example of such a scale, in the con-
text of speaking, can be found in the Oral Situation Test scoring rubric by Raf-
faldini (1988). An extract is presented in Figure 9 below. The scale attempts to
have raters count structures where possible (e.g. for cohesion, structures and vo-
cabulary). However, although Raffaldini attempts to reduce subjectivity, the rater
still has to make some very subjective decisions. It is, for example, not clear what
is classed as a ‘major’ or a ‘minor’ error. Furthermore, using a quantitative ap-
proach for operational purposes is extremely time-consuming.

Linguistics: Evaluates grammatical competence of learners
A. Structure: assesses morphological and structural accuracy
9. no errors
8. one or two minor errors (may be due to mispronunciation)
7. one major error
6. two major errors or major error plus some minor ones
5. three major errors
4. four major errors
3. many major or minor errors but the response is interpretable
2. severe structural problems make response difficult to interpret
1. response is almost completely structurally inaccurate and unin-
terpretable
Figure 9: Extract from Oral Situation Test scoring rubric (Raffaldini, 1988)
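The structure sub-scale in Figure 9 is, in effect, a lookup from error counts to a band. The sketch below encodes the extract directly; classifying each error as ‘major’ or ‘minor’, and judging whether a heavily flawed response remains interpretable, are still subjective decisions left to the rater.

```python
def structure_band(major: int, minor: int, interpretable: bool = True) -> int:
    """Band on Raffaldini's (1988) structure sub-scale, following Figure 9.

    The rater still supplies the subjective judgments: the major/minor
    classification of each error and, at the bottom of the scale, whether
    the response remains interpretable.
    """
    if not interpretable:
        # Bands 2 and 1 both describe barely interpretable responses; the
        # extract separates them by degree, which remains a holistic call.
        return 2
    if major == 0:
        if minor == 0:
            return 9
        if minor <= 2:
            return 8       # "one or two minor errors"
        return 3           # many minor errors, but still interpretable
    if major == 1:
        return 7 if minor == 0 else 6   # "major error plus some minor ones"
    if major == 2:
        return 6
    if major == 3:
        return 5
    if major == 4:
        return 4
    return 3               # "many major or minor errors but the response is interpretable"

print(structure_band(major=1, minor=2))  # 6
```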

3.3.5 How will scores be reported?

Finally, the scale developer has to decide how to report the scores. Here the rating
scale developer should return to Alderson’s (1991) rating scale categories and de-
cide what the initial purpose of the scale was, as well as what the purpose of the
writing test was. Scores should, for example, not be combined if the stakeholders
could profit from knowing sub-scores. However, where it is not important to
know sub-scores, a combined score should be reported. Similarly, the rating scale
designer needs to decide if any of the categories on the rating scale should be
weighted.

3.3.6 How will the rating scale be validated?

As McNamara (1996) and Weigle (2002) point out, the scale that is used in as-
sessing writing performance implicitly or explicitly represents the theoretical ba-
sis of a writing test. That means it embodies the test developer’s notion of what
underlying abilities are being measured by the test. Therefore, the rating scale is
of great importance to the validity of a test.

Before reviewing the relevant literature on how rating scales can be validated, it is
important to briefly explore how validity is conceptualized and then discuss how
it can be applied to rating scales.

Overall, validity is achieved, according to Alderson et al. (1995), if a test tests
what it is supposed to test. They argue that if a test is not valid for the purpose for
which it was designed, then the scores do not mean what they are intended to
mean.

The view of validation has changed historically (Chapelle, 1999). Whilst in the
1960s it was seen as one of two important aspects of language tests (the other be-
ing reliability), subsequent work has focussed on identifying a number of different
features of tests which contribute to validity. Prior to Messick’s (1989) seminal
paper, different types of validity were established as separate aspects, each of
which will be briefly described below.

• construct validity
• content validity
• criterion-related validity, consisting of concurrent and predictive validity
• face validity

The construct validity of a language test was defined by Davies et al. (1999) as an
indication of how representative it is of an underlying theory of language use. In
the case of writing assessment, construct validity determines how far the task
measures writing ability (Hyland, 2003). Hamp-Lyons (2003) argues that con-
structs cannot be seen and are therefore difficult to measure. They have to be
measured by tapping some examples of behaviour that represent the construct. In
the case of writing assessment, this ability is operationalized by the rating scale
descriptors.

Content validity evaluates whether the tasks in a test are similar to what writers
are required to write about in the target language situation (Hamp-Lyons, 1990;
Hyland, 2003). This is usually established through a needs analysis. Hamp-Lyons
argues that whilst there has been a call for, say, history majors to be required to
write on a certain history topic, this does not guarantee that they have actually
studied this particular topic. She therefore argues that it is more useful to cover
this issue under construct validity and sample what it is that writers do when writ-
ing on a history topic. While content validity is more central to the task that learn-
ers are required to perform, the rating scale should also display content validity in
the sense that it should reflect as much as possible how writing is perceived by
readers in the target language use domain.

Criterion-related validity refers to the way a test score compares to other similar
measures. There are two types of criterion-related validity (Hughes, 2003). Firstly,
concurrent validity measures how the test scores compare with other comparable
test scores. The result of the comparison is usually expressed as a correlation coef-
ficient, ranging in value from -1.0 to +1.0 (Alderson et al., 1995). Most concurrent
validity coefficients range from +0.5 to +0.7. Higher coefficients are possible for
closely related and reliable tests. Secondly, predictive validity differs from con-
current validity in that, instead of collecting the external measures at the same
time as the administration of the experimental test, the external measures are
gathered some time after the test has been given (Alderson et al., 1995; Hamp-
Lyons, 1990). Predictive validity measures how well a test predicts performance
on an external criterion.
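As a minimal illustration of how a concurrent validity coefficient is obtained, the sketch below correlates scores from an experimental test with scores from an established comparison test for the same candidates; the scores are invented, and real validation studies would use far larger samples.

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

# Invented scores for the same eight candidates on two tests.
experimental = [52, 61, 48, 70, 65, 55, 74, 59]
established = [50, 64, 45, 72, 60, 58, 78, 61]

r = correlation(experimental, established)
print(f"Concurrent validity coefficient: r = {r:.2f}")
```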

Face validity refers to the test’s surface credibility or public acceptability
(Alderson et al., 1995). Weir (1990) argues that this is not a validity in the techni-
cal sense, but if a test does not have face validity, it may not be acceptable either
to the students taking it or the teachers and receiving institutions who may make
use of it. Hamp-Lyons (1990) shows that direct tests of writing have always had
good face validity, even in the times of indirect writing assessment.

Messick (1989; 1994; 1996) proposed a more integrated view of validity. He saw
assessment as a process of reasoning and evidence gathering which is carried out
so that inferences can be made about test takers. He argued that establishing the
meaningfulness of those inferences should be seen as the main task of test devel-
opers. He therefore redefined validity as ‘an integrated evaluative judgement of
the degree to which empirical evidence and theoretical rationales support the ade-
quacy and appropriateness of inferences and actions based on test scores’ (1989,
p.13). Messick argued that construct validity is the unifying factor to which all
other validities contribute and he also extended the notion of validity beyond test
score meaning to include relevance, utility, value implications and social conse-
quences.

Table 7 below shows the different facets of validity identified by Messick. He iso-
lated two sources of justification for test validity: the evidential basis and the con-
sequential basis. The evidential basis focuses on establishing validity through em-
pirical investigation. The consequential basis focuses on justification based on the
effects of a test after its administration. Both the evidential basis and the conse-
quential basis need to be evaluated in terms of the two functions which Messick
labelled across the top of the table: test interpretation, which focuses on how ade-
quate test interpretations are, and test use, which focuses on the adequacy of ac-
tions based on the test.

Table 7: Messick's (1989) facets of validity
                      Test interpretation    Test use
Evidential Basis      Construct Validity     Construct Validity + Relevance/Utility
Consequential Basis   Value Implications     Social Consequences

To establish the evidential basis of test interpretation, construct validity has to be
determined empirically and through reasoning. To arrive at the evidential basis for
test use, construct validity as well as relevance and utility have to be appraised in
the same way. To establish the consequential basis of test interpretation, value
implications have to be investigated. Messick (1989) defines the value implica-
tions as ‘the more political and situational sources of social values bearing on test-
ing’ (p.42). We therefore have to investigate what social and cultural values and
assumptions underlie test constructs (McNamara & Roever, 2006). Traditionally,
these value implications have been viewed as the responsibility of the test users as
it was argued that only users are familiar with the circumstances of a particular
context in which a test is administered. Finally, the consequential basis for test use
is established by evaluating the social consequences. Again, these social conse-
quences were traditionally viewed as the responsibility of the test users because
they will know when a test is misused. However, some authors now argue that this
responsibility has shifted back to the test designer who should be able to predict
potential sources of misuse.

Chapelle (1999) produced a summary table which outlines the contrasts between
past and current conceptions of validation (Table 8).

Bachman (2005) and Bachman and Palmer (forthcoming), based on previous
work by Kane (1992; 1999) and Mislevy (1996; 2003), have proposed a formal
process for developing a validity argument. They termed this assessment use ar-
gument (AUA) and developed clear guidelines on how to undertake this by means
of a chain of warrants, claims, backings and rebuttals. They divided the AUA into
two parts - an assessment validity argument and an assessment utilization argu-
ment.

Bachman and Palmer’s (1996; forthcoming) facets of test usefulness consist of a
list of six qualities which together define the usefulness of a given test and which
can form a useful basis for establishing the validity of a test. These are construct
validity, reliability, authenticity, interactiveness, impact, and practicality. Con-
struct validity refers to the ‘meaningfulness and appropriateness of the interpreta-
tions that we make on the basis of test scores’ (1996, p. 21).

Table 8: Summary of contrasts between past and current conceptions of validation (from Chapelle, 1999)

Past: Validity was considered a characteristic of a test: the extent to which a test measures what it is supposed to measure.
Current: Validity is considered an argument concerning test interpretation and use: the extent to which test interpretations and uses can be justified.

Past: Reliability was seen as distinct from and a necessary condition for validity.
Current: Reliability can be seen as one type of validity evidence.

Past: Validity was often established through correlations of a test with other tests.
Current: Validity is argued on the basis of a number of types of rationales and evidence, including the consequences of testing.

Past: Construct validity was seen as one of three types of validity (the three validities were content, criterion-related, and construct).
Current: Validity is a unitary concept with construct validity as central (content and criterion-related evidence can be used as evidence about construct validity).

Past: Establishing validity was considered within the purview of testing researchers responsible for developing large-scale, high-stakes tests.
Current: Justifying the validity of test use is the responsibility of all test users.

Reliability can be defined as the consistency of measurement across different fac-
ets of the rating situation. Authenticity is defined as ‘the degree of correspondence
of the characteristics of a given language test task to the features of a target lan-
guage use (TLU) task’ (1996, p. 23). Interactiveness refers to ‘the extent and type
of involvement of the test taker’s individual characteristics in accomplishing a test
task’ (1996, p. 25). Impact can be defined as the effect that the test has on indi-
viduals. Finally, practicality is defined as the relationship between the resources
required and the resources available.

Bachman and Palmer’s (1996, forthcoming) facets of test usefulness were devel-
oped to establish the validity of entire tests and not to validate aspects of tests, like
for example the rating scale. However, most of the aspects can be adapted to be
used as a framework for rating scale validation, in combination with warrants
which represent an ideal situation. Table 9 below presents the aspects of test use-
fulness with the relevant warrants which will be used for the validation of the rat-
ing scale later in this book (Chapter 10). Not all aspects of test usefulness can,
however, be usefully applied to rating scale validation. In particular, interactiveness
cannot be established for a rating scale. Therefore, this concept was excluded.

Table 9: Facets of rating scale validity (based on Bachman and Palmer, 1996)
Construct validity
The scale provides the intended assessment outcome appropriate to purpose and context and
the raters perceive the scale as representing the construct adequately

The trait scales successfully discriminate between test takers and the raters report that the
scale is functioning adequately

The rating scale descriptors reflect current applied linguistics theory as well as research

Reliability
Raters rate reliably and interchangeably when using the scale

Authenticity (content validity)
The scale reflects as much as possible how writing is perceived by readers in the TLU domain

Impact (test consequences)
The feedback test takers receive is relevant, complete and meaningful

The test scores and feedback are perceived as relevant, complete and meaningful by other
stakeholders
The impact on raters is positive
Practicality
The scale use is practical
The scale development is practical

3.4 Problems with currently available rating scales

Several criticisms have been leveled at existing rating scales. Firstly, as has been
mentioned earlier in this chapter, the a priori nature of rating scale development
has been criticized (Brindley, 1991; Fulcher, 1996a; North, 1995). Rating scales
are often not based on an accepted theory or model of language development
(Fulcher, 1996b; North & Schneider, 1998) nor are they based on an empirical
investigation of language performance (Young, 1995). This results in scales that
include features that do not actually occur in the writing performances of learners
(Fulcher, 1996a; Upshur & Turner, 1995). Rating scales based on pre-existing
scales might also result in rating scale criteria which are irrelevant to the task in
question or the context (Turner & Upshur, 2002). Other researchers have con-
tended that rating scales are often not consistent with findings from second lan-
guage acquisition (Brindley, 1998; North, 1995; Turner & Upshur, 2002; Upshur
& Turner, 1995). Rating scales also generally assume a linear development of
language ability, although studies such as those undertaken by Meisel, Clahsen
and Pienemann (1981) show that this might not be justified (Young, 1995).

Another group of criticisms is leveled at the descriptors. Brindley (1998) and
many other authors have pointed out that the criteria often use terminology which
is subjective and imprecise (Mickan, 2003; Upshur & Turner, 1995; Watson Todd
et al., 2004) which makes it hard for raters to be consistent. The band levels have
also been criticized for often being interdependent (and therefore not criterion-
referenced) (Turner & Upshur, 2002; Upshur & Turner, 1995) and for using rela-
tive wording as well as adjectives and intensifiers to differentiate between levels
(Mickan, 2003). Other researchers have shown that features are often grouped to-
gether at the descriptor level although they might not actually co-occur (Turner &
Upshur, 2002; Upshur & Turner, 1995) or develop in step (Perkins & Gass, 1996;
Young, 1995). Finally, Upshur and Turner (1995) have argued that rating scales
are often simply too broad to be discriminating for the population they are used
on.

Fewer studies have focussed on the problems raters experience when using rating
scales. There is, however, a growing body of research that indicates that raters of-
ten find it very difficult to assign levels and that they employ a number of strate-
gies to cope with these problems. Shaw (2002), for example, noted that about a
third of the raters he interviewed reported problems when using the criteria. How-
ever, he does not mention what problems they referred to. Claire (2002, cited in
Mickan, 2003) reported that raters regularly debate the criteria in moderation ses-
sions and describe problems in applying descriptors with terms like ‘appropri-
ately’. Similarly, Smith (2000), who conducted think-aloud protocols of raters
marking writing scripts, noted that raters had ‘difficulty interpreting and applying
some of the relativistic terminology used to describe performances’ (p. 186).

However, Lumley (2002; 2005), who also conducted think-aloud protocols with
raters, noted raters experiencing problems only in unusual situations, when raters
for example encountered problem scripts or features that were not covered by
the rating scale. He observed how, when raters were not able to apply the criteria, they
fell back on their personal experiences. Otherwise, he found that raters encoun-
tered very few problems in applying the criteria.

3.5 Rating scales for diagnostic assessment

Evidence from Alderson’s (2005) discussion of diagnostic assessment (see previ-
ous chapter) suggests that diagnostic tests of writing should be differentiated from
other tests of writing (e.g. placement or proficiency tests). It is therefore conceiv-
able that rating scales used in other assessment contexts are not appropriate for
diagnostic purposes.

Alderson (2005) proposes a number of key features that differentiate diagnostic
tests from other types of assessment. Some of these features are directly relevant
to rating scales. For example, he suggests that one of the features of diagnostic
assessment is to identify strengths and weaknesses in learners’ writing ability.
However, some authors (e.g. Weigle, 2002) have shown that when raters use ana-
lytic rating scales, they often display a halo effect, because their overall impres-
sion of a writing script (or the impression of one aspect of the writing script)
guides their rating of each of the traits. This suggests that currently available rat-
ing scales might not be effective when used for diagnosis, as they might not be
able to identify strengths and weaknesses, and thus would result in feedback
which is not informative for students.
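One informal way of looking for such a halo effect, not a procedure described in the studies cited here, is to correlate the scores a rater awards on the different analytic traits: if the trait scores for the same set of scripts are almost perfectly intercorrelated, the separate categories are adding little diagnostic information.

```python
from itertools import combinations
from statistics import correlation  # Python 3.10+

# Invented analytic band scores one rater awarded to six scripts.
scores = {
    "grammar": [6, 7, 5, 8, 6, 7],
    "vocabulary": [6, 7, 5, 8, 6, 7],
    "organisation": [6, 7, 5, 8, 7, 7],
}

for trait_a, trait_b in combinations(scores, 2):
    r = correlation(scores[trait_a], scores[trait_b])
    print(f"{trait_a} vs {trait_b}: r = {r:.2f}")
# Near-perfect correlations across all trait pairs would be consistent with
# a halo effect rather than with genuinely separate judgements.
```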

Alderson (2005) further suggests that diagnostic assessment usually focuses more
on specific features rather than global abilities. Some of the literature reviewed
above, however, suggests that current rating scales make use of vague and impres-
sionistic terminology and that raters often seem to struggle when employing these
types of scales. Impressionistic and vague terminology on the descriptor level
might not be conducive to identifying specific features in a writing script.

Alderson (2005) also argues that a diagnostic test should be either theoretically-
based or based on a syllabus. Because the rating scale represents the de facto test
construct, the rating scale used for a diagnostic assessment of writing should be
based on a theory (or syllabus). Alderson further suggests that this theory should
be as detailed as possible, rather than global. Diagnostic tests should also be based
on current SLA theory and research.

Overall, it seems doubtful that rating scales which are designed for proficiency or
placement procedures would also be appropriate for diagnostic assessment. But
what features would a rating scale for diagnostic assessment have to display?
Weigle’s (2002) five steps in rating scale development suggest the following:

(1) To be able to identify strengths and weaknesses in a learner’s writing and to
provide useful feedback to students, an analytic scale is needed. A holistic scale
would only result in a single score, which is not helpful in a diagnostic context. It
would also be important that separate aspects of writing ability are not mixed into
the same descriptor, so that these aspects can be assessed separately. The scale
should furthermore be developed in a manner which discourages raters from dis-
playing a halo effect.

(2) The rating scale should be assessor-oriented, so that raters are assisted in iden-
tifying specific details in learners’ writing. Rating scales should therefore provide
as much information as necessary for raters to assign bands reliably. Similarly, it
could also be argued that the scale should be user-oriented, as feedback is central
in diagnostic assessment.

(3) The rating scale should be based on a theory or model of language develop-
ment5 (as suggested by Alderson, 2005). In this way, the criteria chosen will re-
flect as closely as possible our current understanding of writing (and language)
development. The theory should be as detailed as possible, to provide a useful ba-
sis for the descriptors. The descriptors should ideally be empirically-developed. In
this way, they will be based on actual student performance rather than being con-
ceived in a vacuum. If the descriptors are based on empirical investigation, they
can be based on our current understanding of SLA theory.

(4) An objective formulation style, as described earlier in this chapter, is probably
the most suitable for diagnostic purposes because raters are focussed on specific
rather than global abilities. Level descriptors should furthermore not be differenti-
ated by vague terminology which could make it more difficult for raters to assign
levels.

(5) The way the scores are reported to stakeholders is central to diagnostic as-
sessment. Scores should be provided in such a way as to offer as much feedback
as possible to students.

3.6 Conclusion

This chapter has investigated a number of options available to rating scale devel-
opers and has then discussed features of scales which might be most suitable to
the diagnostic context. One suggestion is that a rating scale for diagnostic assess-
ment should be theory-based. However, a closer look at the different models and
theories that could be or have been used for rating scale development reveals that
none of them provide an outright solution. Our current understanding of writing is
not sufficiently developed to base a rating scale just on one theory. The following
chapter therefore attempts to follow a similar path to Grabe and Kaplan’s (1996)
taxonomy of writing in order to establish a taxonomy of aspects of writing rele-
vant to rating scale development, which will then serve as a theoretical basis for
the design of the rating scale.

---

Notes:
1 The TWE is still administered in areas where the TOEFL iBT has not been introduced (e.g. where access to computers is difficult).
2 In this case an example from the context of speaking is chosen because it is the most well-known study exemplifying this type of scale development. The principles of this study are just as applicable to the context of writing.
3 This method is described in more detail in Chapter 8.
4 Miller was not referring to raters in his article, but was instead referring to human processing capacity in general.
5 A diagnostic test based on a syllabus is also possible, but not the focus of this study.

Chapter 4: Measuring Constructs and Constructing Measures

4.1 Introduction

In the previous chapter, it was proposed that a rating scale for diagnostic assess-
ment should be (1) based on a theory of writing and/or language development and
(2) based on empirical investigation at the descriptor level. This chapter, there-
fore, sets out to achieve two purposes. Firstly, it attempts to arrive at a taxonomy
of the different theories and models available to rating scale developers. It will be
argued that, because currently no satisfactory theory or model of writing devel-
opment is available, a taxonomy based on a number of theories and models can
provide the most comprehensive description of our current knowledge about writ-
ing development. The first part of the chapter describes such a taxonomy. Based
on this taxonomy, a number of aspects of writing will be chosen which will serve
as the trait categories in the rating scale. To conclude the first part of the chapter,
the rating scale the Diagnostic English Language Needs Assessment (DELNA)1
currently uses to rate writing scripts is reviewed in terms of these constructs.

The second aim of the chapter is to arrive at discourse analytic measures which
can be used as a basis for the empirical investigation of a new rating scale. These
discourse analytic measures should represent each of the different trait categories
chosen from the taxonomy. The relevant literature on each of these different as-
pects of writing (or traits) is reviewed to establish which discourse analytic meas-
ures should be used to operationalize each of these traits. At the end of this chap-
ter, a list of discourse analytic measures is presented, which will then be used dur-
ing the pilot study.
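To make concrete what is meant by a discourse analytic measure, the sketch below computes two deliberately simple, purely illustrative candidates from a writing script: text length in words, often treated as a rough index of fluency, and mean sentence length, a crude indicator of syntactic complexity. The measures actually adopted for the pilot study are those reviewed later in this chapter, not these.

```python
import re

def simple_measures(script: str) -> dict:
    """Two illustrative discourse analytic measures for a writing script."""
    words = re.findall(r"[A-Za-z']+", script)
    sentences = [s for s in re.split(r"[.!?]+", script) if s.strip()]
    return {
        "text_length_in_words": len(words),                        # rough fluency index
        "mean_sentence_length": len(words) / max(len(sentences), 1),
    }

print(simple_measures("The graph shows a clear rise. Sales then fell sharply."))
# {'text_length_in_words': 10, 'mean_sentence_length': 5.0}
```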

4.2 Theory-based rating scale design

A number of authors have argued that a theoretical basis for rating scale design is
necessary. For example, McNamara (1996, p.49) writes ‘an atheoretical approach
to rating scale design in fact provides an inadequate basis for practice’ and that
‘the completeness, adequacy and coherence of such models is crucial’, and North
(2003, following Lantolf and Frawley, 1985) argues that ‘unless the conceptual
framework behind the scale takes some account of linguistic theory and research
in its definition of proficiency, its validity will be limited and it can be accused of
constructing a closed reality of little general interest’. North further argues that
one cannot actually avoid theory. He claims that it is more than sensible to have a
valid conceptual framework and to try to incorporate relevant insights from the-
ory when the scale is being developed. Therefore, for him, models of language use
are a logical starting point. Alderson (2005) also suggests that diagnostic tests
should be based on a theory.

However, although there is general agreement that rating scales should be based
on a theoretical framework, there are a number of problems with the models cur-
rently available (as reviewed in the previous chapter). These problems are further
discussed below.

Models of communicative competence have been used as the theoretical basis for
many rating scales. They have the advantage of being general models, which can
be transferred across contexts, assuring generalisability of results. Therefore re-
sults should be expected to show less variation across different tasks and general-
ise to other contexts (as shown by Fulcher, 1995). However, there are a number of
problems with using these models as a conceptual framework for a rating scale.
The first problem is that they are models of communicative competence and not
models of performance; they therefore struggle to account for what happens when compe-
tence is put to use. North (2003), for example, argues that these models have no
place for fluency, which is a component of performance. This is one of the most
obvious elements necessary to turn a model of competence into a model of per-
formance, which, as North (2003) and McNamara (1996) point out, is really
needed. The second problem relates to the operationalisation of the models. The
fact that certain aspects are components of a theoretical model does not mean
these parameters can be isolated as observable aspects and operationalised into
rating scale descriptors and hence tested separately.

So models of communicative competence throw up a number of problems when
considered as a theoretical basis for rating scales. An alternative possibility is bas-
ing the rating scale on a theory of writing. However, as discussed in the previous
chapter, currently there is no adequate theory of writing available (Cumming,
1998; Cumming & Riazi, 2000).

Most writing is not undertaken for the writer himself/herself but for a certain au-
dience. It could therefore be argued that raters’ decision-making models could be
used as a basis for rating scale design, since raters are the readers of the writing
scripts produced in the context of assessment. The assessment of writing should
take into account how readers of L2 writing think and respond to writing. These
decision-making processes have been modelled by Cumming et al. (2001; 2002)
in a reader-writer model (shown in Table 6 in the previous chapter). Brindley
(1991), however, has concerns about using raters’ decision-making processes as a
basis for the assessment process and the rating scale, for a number of reasons.
Firstly, he finds it hard to define what makes an ‘expert’ judge. He also argues
that these judges might be unreliable and base their judgements on different crite-
ria, as background and context might play an important role (as was seen in re-
search reported in Chapter 2). Even the method used in rater decision-making
studies, the concurrent think-aloud protocol, has been questioned (e.g. Stratman &
Hamp-Lyons, 1994) and recent research by Barkaoui (2007a; 2007b) reinforces
the doubts about the validity of this method.

It therefore seems that there is no theory currently available that can serve by it-
self as a basis for the design of a rating scale for writing for diagnostic assess-
ment. North (2003) argues that describing the stages of learning surpasses our
knowledge of the learning process and Lantolf and Frawley (1985, cited in
McNamara, 1996) add:

A review of the recent literature on proficiency and communicative com-
petence demonstrates quite clearly that there is nothing even approaching
a reasonable and unified theory of proficiency (p. 186).

There are therefore those who argue that one should not attempt to describe the
stages of attainment in a rating scale (e.g. Mislevy, 1993). However, in practical
terms, teachers and raters need some reference point to base their decisions on.

If currently no adequate model or theory is available, it is necessary to investigate
what such a model or theory should ideally look like. According to McNamara
(1996), such a model needs to be rich enough to conceptualize any issue which
might potentially be relevant to cope with performance. He argues that, in princi-
ple, there should be no limit to the dimensions of the model, as long as it is as rich
and complete as possible, but still possesses clarity. Secondly, he reasons that as a
next step a careful research agenda is necessary to investigate the significance of
the different measurement variables that the model proposes. Finally, it is also
important to ascertain which of these variables are appropriate and practical to
assess in a given test situation.

Following McNamara’s (1996) suggestions, I will propose a taxonomy based on
the available theories and models as a possible solution until a more adequate
model is found. A taxonomy is seen here as a list which is divided into ordered
groups or categories. A taxonomy based on the different models reviewed in the
previous chapter, would satisfy McNamara’s requirement that a model needs to be
rich enough to conceptualize any relevant issue. A taxonomy would group all
similar features of the different models together, and it would not exclude any fac-
tors. With a carefully grouped taxonomy, the researcher can embark on testing the
importance of the different variables. Finally, the researcher can use the taxonomy
as a basis to decide which aspects are testable and which are not. Such a taxon-
omy would also be in accord with Luoma’s (2004) and Alderson and Clapham’s
(1992) suggestion that an assessment is more valid if more than one model is used
in combination. It would also conform with North’s (2003) argument that a rating
scale based on a general model is more valid.

4.3 The Taxonomy

The taxonomy proposed here is an amalgamation of the following models dis-
cussed in the previous chapter: Bachman and Palmer’s (1996) model of commu-
nicative competence, Grabe and Kaplan’s (1996) model of text construction and
their writing taxonomy, the models of rater decision-making by Milanovic et al.
(1996), Sakyi (2000) and Cumming et al. (2001; 2002), and Lado’s (1961) Four
Skills Model. All features the models propose, including those shared by more
than one model, were mapped onto a common table. From that table, it became
clear that some models are more extensive than others. The models of language
ability and the models of writing, for example, do not incorporate content. It can,
of course, be argued that content is not part of language ability, but rather a cogni-
tive aspect which is unrelated to language. However, it seems that raters see con-
tent as an important aspect of writing assessment. Without exception, all models
specify surface level textual features like grammar, vocabulary and syntax as
components; however, they differ in their description of the features. Some, for
example, might include grammar, others errors, or the frequency of errors. Most
models include the aspects of coherence, cohesion and mechanics.

Fluency is not an aspect that is part of the models of language competence. How-
ever, raters seem to consider this construct as part of their decision-making proc-
esses. A group of features connected with the reader includes stance or audience
awareness. These are termed socio-linguistic knowledge in the models of commu-
nicative competence and stance, posture and audience awareness in Grabe and
Kaplan’s two models. However, very little mention of this aspect of writing can
be found in the models of rater decision-making, which might be because raters
are often not specifically trained to rate these. These aspects will from now on be
grouped together and referred to as features of reader/writer interaction.

Only aspects of writing that can be assessed on the basis of the writing product are
included in the taxonomy. For example, whilst there is no doubt that the affect of
a writer plays an important role in the outcome of the writing product, it is unreal-
istic for raters to assess this. It is therefore not included in the list of criteria. Simi-
larly, the world knowledge of a writer cannot be assessed on the basis of a prod-
uct, only on the quality of the content of a piece of writing. It is moreover doubt-
ful that interest, creativity and originality of content can be assessed objectively
and these aspects are therefore also not included in the list.

The features in the taxonomy were grouped into the following eight categories,
which will form the basis of the constructs further pursued in the remainder of this
study (see Table 10 below):

Table 10: Categorisation of taxonomy features into constructs (category: features from the models and theories)

Accuracy: vocabulary; syntax; grammar; error types, frequency and gravity of errors; morphology; functional knowledge
Fluency: text length; fluency; editing
Complexity: vocabulary; syntax; morphology; functional knowledge
Mechanics: spelling, punctuation, capitalisation, paragraphing; layout
Cohesion: cohesion
Coherence: coherence
Reader/writer interaction: functional knowledge, sociolinguistic knowledge; style, stance and posture; audience awareness
Content: topic development; relevance; support; logic; quantity of content; task completion; use of source material
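Because the taxonomy is carried forward into the empirical phase of the study, it can be convenient to hold it as a simple data structure; the mapping below merely restates Table 10 and adds no categories or features of its own.

```python
# The eight constructs from Table 10 and the model features grouped under them.
TAXONOMY = {
    "accuracy": ["vocabulary", "syntax", "grammar",
                 "error types, frequency and gravity of errors",
                 "morphology", "functional knowledge"],
    "fluency": ["text length", "fluency", "editing"],
    "complexity": ["vocabulary", "syntax", "morphology", "functional knowledge"],
    "mechanics": ["spelling, punctuation, capitalisation, paragraphing", "layout"],
    "cohesion": ["cohesion"],
    "coherence": ["coherence"],
    "reader/writer interaction": ["functional and sociolinguistic knowledge",
                                  "style, stance and posture", "audience awareness"],
    "content": ["topic development", "relevance", "support", "logic",
                "quantity of content", "task completion", "use of source material"],
}
```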

4.4 Evaluation of the usefulness of the DELNA rating scale for diagnostic
assessment

In this section, I will analyze the existing rating scale in terms of the constructs
that have been identified in the preceding section as well as some of the criticisms
of rating scales discussed in Chapter 3. The DELNA2 rating scale (Table 11 be-
low), as it is currently in use, has evolved over several years. It was developed
from other existing rating scales and on the basis of expert intuition. Over the
years several changes have been carried out, mainly on the basis of suggestions by
raters.

The DELNA rating scale has nine categories, grouped together into the three
groups - form, fluency and content. Each category is divided into six level de-
scriptors ranging from four to nine. A first glance at the scale reveals that a num-
ber of the constructs identified in the previous section are represented. A closer
look at the scale however, also reveals some of the problems common to rating
scales identified in Chapter 3. These problems can mostly be found in the group-
ings of the categories and the wording of the level descriptors.

The group of categories under the heading form consists of sentence structure,
grammatical accuracy and vocabulary & spelling. A closer look at the category of
sentence structure shows that the descriptors mix aspects of accuracy and com-
plexity. At level 6, for example, one reads ‘adequate range – errors in complex
sentences may be frequent’. There is, however, no indication of what adequate
range means and how that is different from level 7 ‘satisfactory variety – reduced
accuracy in complex sentences’.

The category of grammatical accuracy focusses purely on the accuracy of syntax.
For example, level 8 reads ‘no significant errors in syntax’. How this accuracy is
different from the accuracy under the category ‘sentence structure’ is not clear and
could be confusing for raters. The third category under the heading form, vocabu-
lary and spelling, conflates the possibly separate constructs of complexity of vo-
cabulary, accuracy of vocabulary and mechanics (spelling). At some levels, the
descriptor refers only to the complexity of the vocabulary (not the accuracy), for
example levels seven to nine, and at other levels to both complexity and accuracy,
for example levels 5 and 6. At level 4, the descriptor refers only to the accuracy of
the vocabulary.

Overall, the traits under form represent the aspects of accuracy and complexity as
identified in the taxonomy as well as one aspect of mechanics, namely spelling.

Under the heading of content, there are three categories: description of data, in-
terpretation of data and development of ideas. These three categories are gener-
ally intended to follow the three sub-sections of the task3. The first section of the
task requires the writers to describe the data provided in a graph or a table. The
level descriptors in this category represent a cline from 'clearly and accurately'
through 'accurately' and 'generally accurately' to 'adequately' and 'inadequately'.
These levels might be hard for raters to distinguish. The second category under
the heading of content refers to the 'interpretation of data'. Here a number of the
categories identified in Cumming et al.'s (2001; 2002) studies of rater decision-making
are mixed in the level descriptors. For example, at some levels the raters
are asked to rate the relevance of ideas, at others the quantity of ideas and the clar-
ity or the length of the essay. A similar problem can be identified in the next cate-
gory entitled development of ideas. Again, some level descriptors include rele-
vance, supporting evidence, length of essay or clarity, which are all separate con-
cepts according to Cumming et al.’s findings.

In general, the content category can be equated with the construct of content iden-
tified in the taxonomy.

The third heading is entitled fluency. However, none of its three categories (or-
ganisation, cohesion and style) actually measures fluency. Organi-
sation looks at the paragraphing of the writing as well as logical organisation.
These might possibly be separate constructs, with the formatting conventions be-
ing an aspect of mechanics and organisation being an aspect of coherence. The
category of cohesion refers to cohesive devices, but does not explain what exactly
raters should look for. The category of style might refer to the category of
reader/writer interaction.

However, the raters are given very little guidance in terms of the features of style
to rate. The heading of fluency equates to the constructs of cohesion and coher-
ence, reader/writer interaction, features of academic writing and possibly some
aspects of mechanics.

Overall, it can be said that the DELNA rating scale is a comprehensive scale that
covers almost all constructs of writing identified in the taxonomy in the previous
section. However, the groupings are at times arbitrary; some level descriptors mix
separate constructs and some rating scale descriptors could be criticized for being
vague and using impressionistic terminology. The construct of fluency, which has
been identified as being important when measuring performance, is not part of the
DELNA rating scale.
When compared to the list of features that a diagnostic rating scale should display,
the following observations can be made. (1) The DELNA scale is an analytic rat-
ing scale, but at times separate aspects of writing ability are mixed into one de-
scriptor. (2) The rating scale is assessor-oriented, although at times the descriptors
include vague terminology and might therefore not provide sufficient guidance to
raters. (3) It is not clear whether the scale is based on any theory or model of writ-
ing development. It was developed in what Fulcher (2003) would term an ‘intui-
tive’ manner. (4) The scale descriptors do not have an objective formulation style,
and many descriptors make use of adjectives or adverbs to differentiate between
levels. (5) Scores are currently not reported to stakeholders separately. Students
receive a single averaged score and comments based on fluency, content and
form. Taking all these features of the DELNA rating scale into account, it is
doubtful that the scale provides an adequate basis for diagnostic assessment.

Fluency | Organisation | Cohesion | Style
9 | Essay organised effectively – fluent – introduction and concluding comment | Skilful use of cohesive devices – message able to be followed effortlessly | Academic – appropriate to task
8 | Essay fluent – well organised – logical paragraphing | Appropriate use of cohesive devices – message able to be followed throughout | Generally academic – may be slight awkwardness
7 | Essay organised – paragraphing adequate | Adequate use of cohesive devices – slight strain for reader | Adequate understanding of academic style
6 | Evidence of organisation – paragraphing may not be entirely logical | Lack / inappropriate use of cohesive devices causes some strain for reader | Some understanding of academic style
5 | Little organisation – possibly no paragraphing | Cohesive devices absent / inadequate / inappropriate – considerable strain for reader | Style not appropriate to task
4 | Lacks organisation | Cohesive devices absent – severe strain for reader | No apparent understanding of style
Table 11: DELNA rating scale – fluency

Content | Description of data | Interpretation of data | Development / extension of ideas
9 | Data clearly and accurately described | Interpretation logical and appropriate | Ideas relevant and well supported – appropriate conclusion reached
8 | Data described accurately | Interpretation sufficient / appropriate | Ideas sufficient and supported. Some may lack obvious relevance
7 | Data generally described accurately | Interpretation generally adequate | Ideas adequate – some supporting evidence may be lacking
6 | Data described adequately / may be overemphasis on figures | Interpretation may be brief / inappropriate | Ideas may not be expressed clearly or supported appropriately – essay may be short
5 | Data (partially) described / may be inaccuracies / very brief / inappropriate | Interpretation often inaccurate / very brief / inappropriate | Few appropriate ideas expressed – inadequate supporting evidence – essay may be short
4 | Data not / inadequately described | Interpretation lacking / unclear | Ideas largely incomprehensible
Table 11 (cont.): DELNA rating scale – content

Form | Sentence structure | Grammatical accuracy | Vocabulary & spelling
9 | Sophisticated control of sentence structure | Error free | Extensive vocab / may be one or two minor spelling errors
8 | Controlled and varied sentence structure | No significant errors in syntax | Vocab appropriate / may be few minor spelling errors
7 | Satisfactory variety – reduced accuracy in complex sentences | Errors minor / not intrusive | Vocab adequate / occasionally inappropriate / some minor spelling errors
6 | Adequate range – errors in complex sentences may be frequent | Errors intrusive / may cause problems with expression of ideas | Limited, possibly inaccurate / inappropriate vocab / spelling errors
5 | Limited control of sentence structure | Frequent errors in syntax cause significant strain | Range and use of vocab inadequate. Errors in word formation & spelling cause strain
4 | Inadequate control of sentence structure | Frequent basic syntactical errors impede comprehension | Basic errors in word formation / spelling. Errors disproportionate to length and complexity of script
Table 11 (cont.): DELNA rating scale – form

Whilst the taxonomy described above has provided us with eight constructs which
can be used as the basis for the trait scales of the new rating scales, the descriptors
will be derived empirically. For this purpose, operational definitions of the dif-
ferent constructs need to be developed. That is the intention of the second part of
this chapter.

4.5 Theoretical background to constructs and operationalisations

The eight constructs constituting the taxonomy of writing are described in more
detail in the following sections. In these sections the theoretical basis of the con-
structs is discussed, followed by examples of research that has identified discourse
analytic measures to operationalize the different constructs. The main aim of
this section is to identify measures which have successfully distinguished different
proficiency levels of writing. Based on the findings of the review of the literature,
a summary of suitable measures for the empirical investigation will be presented.

4.5.1 Accuracy, Fluency and Complexity

The following section discusses the theoretical basis underlying the analytic
measures of accuracy, fluency and complexity. First, a theoretical framework
based on an information-processing model is presented and then each measure is
described in detail. In these sections, the varying measures previous studies have
employed to operationalize the different concepts are investigated.

Measures of accuracy, fluency and complexity are often used in second language
acquisition research because they provide a balanced picture of learner language
(Ellis & Barkhuizen, 2005). Accuracy refers to ‘freedom of error’ (Foster & Ske-
han, 1996, p. 305), fluency refers to ‘the processing of language in real time’
(Schmidt, 1992, p.358) where there is ‘primacy of meaning’ (Foster & Skehan,
1996, p. 304) and complexity is 'the extent to which learners produce elaborated
language’ (Ellis & Barkhuizen, 2005).

In 1981, Meisel, Clahsen and Pienemann developed the Multidimensional Model
of L2 acquisition, which proposed that learners differ in their orientation to learn-
ing and that this influences their progress in different areas of L2 knowledge.
Learners with a ‘segregative orientation’, for example, are likely to achieve func-
tional ability at the expense of complexity and possibly accuracy. In contrast,
learners with an ‘integrative orientation’ may prioritize accuracy and complexity
at the expense of fluency. Underlying Meisel et al.’s model is the assumption that
L2 learners might experience difficulty in respect to focussing on content and
form simultaneously and therefore need to choose what to pay attention to.

A possible explanation of this phenomenon lies in theories of second language
acquisition that propose a limited processing capacity (e.g. Skehan, 1998b). In the
case of output, or more specifically writing, which is the focus of this study,
learners need to access both world knowledge and L2 knowledge from their long-
term memories and hold these in their short-term memories in order to construct
messages that represent the meaning they intend and which are at the same time
linguistically appropriate.

Focussing more closely on L2 knowledge, Skehan (1998b) proposes that it is
stored in two forms, one being exemplar-based knowledge and the other one rule-
based knowledge. The former consists of chunks or formulaic expressions which
can be accessed relatively effortlessly and therefore are able to conserve valuable
processing resources. This particular component of L2 knowledge contributes to
increased production and fluency. The other component of L2 knowledge, the
rule-based system, stores complex linguistic rules which allow the speaker to form
an infinite number of well-formed sentences in innovative ways. But this is more
costly in terms of processing capacity and this knowledge is harder to access if
limited planning time is available.

Skehan (1998b) uses the model proposed above to suggest three aspects underly-
ing L2 performance (see Figure 10 below). Learner production is to be analysed
with an initial partition between meaning and form. Form can further be subdi-
vided into control and restructuring. Meaning is reflected in fluency, while form is
either displayed in accuracy (if the learner prioritizes control) or in complexity (if
opportunities for restructuring arise because the learner is taking risks).

Figure 10: Skehan’s three aspects of L2 performance (from Ellis & Barkhuizen, 2005)

Skehan (1996, p. 50) considers the possible results of learners allocating their at-
tentional resources in a certain way. He argues that a focus on accuracy makes it
less likely that interlanguage change will occur (production will be slow and
probably consume a large part of the attentional resources). A focus on complex-
ity and the process of restructuring increases the chance that new forms can be
incorporated in the interlanguage system. A focus on fluency will lead to language
being produced more quickly and with lower attention to producing accurate lan-
guage and incorporating new forms. He proposes that as learners do not have
enough processing capacity available to attend to all three aspects equally, it is
important to understand the consequences of allocating resources in one direction
or another. A focus on performance is likely to prioritize fluency, with restructur-
ing and accuracy assigned lesser importance. A focus on development might shift
the concentration to restructuring, with accuracy and fluency becoming less im-
portant.

Discourse analytic measures of accuracy, fluency and complexity are based on an
information-processing framework of L2 acquisition and are therefore appropri-
ate for investigating L2 production. They have been used in a variety of studies
investigating task difficulty (e.g. Iwashita, McNamara, & Elder, 2001; Skehan,
1996), and effectiveness of planning time (e.g. Crookes, 1989; Ellis, 1987; Ellis
& Yuan, 2004; Mehnert, 1998; Ortega, 1999; Wigglesworth, 1997) as well as the
effects of different teaching techniques (e.g. Ishikawa, 1995).

In the context of language testing, Iwashita et al. (2001) have criticized the meas-
ures of accuracy, fluency and complexity used in research as being too complex
and time consuming to be used under operational testing conditions. They call for
more practical and efficient measures of ability that are not as sensitive to varia-
tions in task structure and processing conditions. In their study, they propose a
rating scale based on aspects of accuracy, fluency and complexity.

Tavakoli and Skehan (2005) demonstrated the potential usefulness of discourse
analytic measures of accuracy, fluency and complexity for language testing when
they performed a principal components analysis on the oral dataset of their study.
The aim was to show that the dependent variables of their study were in fact dis-
tinct factors. The factor analysis produced a three factor solution. The results can
be seen in Table 12 below4.

As Table 12 shows, Factor 1 is made up of six measures (length of run, speech
rate, total amount of silence, total time spent speaking, number of pauses and
length of pauses).

Table 12: Factor analysis for measures of accuracy, fluency and complexity (Tavakoli and
Skehan, 2005)

Measure | Factor 1 | Factor 2 | Factor 3 | Communality
Reformulations | – | .88 | – | .880
False starts | – | .94 | – | .892
Replacements | – | .41 | – | .276
Repetitions | – | .62 | – | .490
Accuracy | – | – | .65 | .662
Complexity | – | – | .87 | .716
Length of run | -.66 | -.44 | .43 | .767
Speech rate | -.84 | – | – | .793
Total silence | .95 | – | – | .912
Time spent speaking | -.94 | – | – | .902
No. of pauses | .80 | – | – | .736
Mean length of pause | .87 | – | – | .844

These measures represent what the authors refer to as the temporal aspects of flu-
ency. The second factor is based on the measures of reformulations, false starts,
replacements and repetitions. These measures are associated with another aspect
of fluency, namely repair fluency (e.g. Skehan, 2001). The third factor has load-
ings of measures of accuracy and complexity as well as length of run. This indi-
cates that more accurate language was also more complex. These loadings also
suggest that the measures represent the same underlying constructs, which con-
firms Skehan’s (1998b) model of task performance according to which accuracy
and complexity are both aspects of form, while fluency is meaning-oriented. The
results of this factor analysis are potentially useful for the field of language test-
ing, especially rating scale design, as it can be shown which measures are in fact
distinct entities and can therefore be represented separately on a rating scale. It is
worth noting however, that the research investigated oral language use and that
the results may not be applicable to written production.
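
For readers who want to see what such an analysis involves in practice, the following is a minimal sketch (not Tavakoli and Skehan's own computations) of a three-factor solution using scikit-learn; the matrix of measures is randomly generated purely as a placeholder, and the varimax rotation assumes scikit-learn version 0.24 or later.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis
    from sklearn.preprocessing import StandardScaler

    # Placeholder data: one row per performance, one column per discourse
    # measure (reformulations, false starts, accuracy, complexity, etc.).
    rng = np.random.default_rng(0)
    measures = rng.normal(size=(60, 12))

    # Standardise the measures so that loadings are comparable across scales.
    scaled = StandardScaler().fit_transform(measures)

    # Extract a three-factor solution with a varimax rotation, analogous in
    # form to the solution reported in Table 12.
    fa = FactorAnalysis(n_components=3, rotation="varimax").fit(scaled)

    # fa.components_ holds one row of loadings per factor (shape 3 x 12).
    for i, loadings in enumerate(fa.components_, start=1):
        print(f"Factor {i}:", np.round(loadings, 2))

With real data, measures loading strongly on the same factor would, as in Tavakoli and Skehan's study, be candidates for a single rating scale category.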

In the three sections below discourse analytic measures of accuracy, fluency and
complexity are examined in more detail. Definitions are given and commonly
used measures are reviewed.

4.5.1.1 Accuracy

Polio (1997) reviewed several studies that employed measures of accuracy. Some
studies used holistic measures in the form of a rating scale (looking at the accu-
racy of syntax, morphology, vocabulary and punctuation), whilst others used more
objective measures like error-free t-units5. Others counted the number of errors
with or without classifying them.

The accuracy of writing texts has been analyzed through a number of discourse
analytic measures. Usually, errors in the text are counted in some fashion. Two
approaches have been developed. The first one involves focusing on whether a
structural unit (e.g. clause, t-unit) is error free. Typical measures found in the lit-
erature include the number of error-free t-units per total number of t-units or the
number of error-free clauses per total number of clauses. For this measure, a deci-
sion has to be made as to what constitutes an error. According to Wolfe-Quintero
(1998), this decision might be quite subjective as it might depend on the re-
searcher’s preferences or views on what constitutes an error for a certain popula-
tion of students. Error-free measures of accuracy have been criticized by Bar-
dovi-Harlig and Bofman (1989) for not being sufficiently discriminating because
a unit with only one error is treated in the same way as a unit with more than one
error. Furthermore, error-free measures do not disclose the types of errors that are
involved as some might impede communication more than others. In light of these
criticisms, a second approach to measuring accuracy was developed based on the
number of errors in relation to a certain production unit (e.g. the number of errors
per t-unit). One problem of this method is that all errors are still given the same
weight. Some researchers (e.g. Homburg, 1984) have developed a system of cod-
ing errors according to gravity, but Wolfe-Quintero et al. (1998) argue that these
systems are usually based on the intuitions of the researchers rather than being
empirically based.
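
To make the two approaches concrete, the following minimal sketch (with invented counts) computes both an error-free ratio and an errors-per-unit measure from a hand-coded script; deciding what counts as an error, and segmenting the text into t-units, is still left to the analyst.

    # Hand-coded analysis of one script: the number of errors identified in
    # each t-unit (what counts as an error is the researcher's decision).
    errors_per_unit = [0, 2, 0, 1, 0, 0, 3, 0]   # hypothetical counts

    total_t_units = len(errors_per_unit)
    error_free_t_units = sum(1 for e in errors_per_unit if e == 0)
    total_errors = sum(errors_per_unit)

    # First approach: error-free t-unit ratio (error-free t-units / t-units).
    error_free_t_unit_ratio = error_free_t_units / total_t_units

    # Second approach: errors per t-unit (total errors / t-units), which weights
    # a unit with several errors more heavily than one with a single error.
    errors_per_t_unit = total_errors / total_t_units

    print(round(error_free_t_unit_ratio, 2), round(errors_per_t_unit, 2))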

Several studies have found a relationship between the number of error-free t-
units and proficiency as measured by program level (Hirano, 1991; Sharma, 1980;
Tedick, 1990), standardized test scores (Hirano, 1991), holistic ratings (Homburg,
1984; Perkins, 1980), grades (Tomita, 1990) or comparison with native speakers
(Perkins & Leahy, 1980). Two studies found no relationship between error-free t-
units and grades (Kawata, 1992; Perkins & Leahy, 1980). Wolfe-Quintero et al.
argue that for the number of error-free t-units to be effective, a time-limit for
completing the writing task needs to be set (as was done by most studies they in-
vestigated). Another measure that seems promising according to Wolfe-Quintero
et al. is the number of error-free clauses. This measure has only been employed
by Ishikawa (1995) to differentiate between proficiency levels. Ishikawa devel-
oped this measure with the idea that her beginning students were less likely to
have errors in all clauses than in t-units, because the string is likely to be shorter.
She found a significant improvement after three months of instruction.

The error-free t-unit ratio (error-free t-units per total number of t-units) or the
percentage of error-free t-units has been employed by several studies to examine
the relationship between this measure and proficiency. According to Wolfe-
Quintero et al., twelve studies have found a significant relationship but eleven
have not. Of the twelve significant studies, some investigated the relationship be-
tween error-free t-units ratio and program level (Hirano, 1991; Larsen-Freeman,
1978; Larsen-Freeman & Strom, 1977), test scores (Arnaud, 1992; Hirano, 1991;
Vann, 1979) or grades (Kawata, 1992; Tomita, 1990). However, three studies re-
lating to program level were not significant (Henry, 1996; Larsen-Freeman, 1983;
Tapia, 1993). Some longitudinal studies were also not able to capture a significant
increase in accuracy, indicating that the percentage of error-free t-units cannot
capture short-term increases over time. Another accuracy measure, error-free
clause ratio (total number of error-free clauses divided by the total number of
clauses) was used by only two researchers with mixed results. Ishikawa (1995)
chose this measure as a smaller unit of analysis for her beginner-level learners.
She found a significant increase for one of her groups over a three month period.
Her other group and Tapia’s (1993) students all increased in this measure without
showing a statistically significant difference. Another measure in this group is
errors per t-unit (total number of errors divided by the total number of t-units).
This measure has been shown to be related to holistic ratings (Flahive & Gerlach
Snow, 1980; Perkins, 1980; Perkins & Leahy, 1980) but has been less successful
in discriminating between program level and proficiency level (Flahive & Gerlach
Snow, 1980; Homburg, 1984). Wolfe-Quintero et al. therefore argue that this
might indicate that this measure does not discriminate between program level and
proficiency level, but rather gives an indication of what teachers look for when
making comparative judgements between learners. However, they argue that this
issue needs to be examined in more detail. The last measure in this group is the
errors per clause ratio (total number of errors divided by total number of
clauses). The findings were the same as those of the errors per t-unit measure,
showing that these two measures are more related to holistic ratings than to pro-
gram level.

4.5.1.2 Fluency

Fluency has been defined in a variety of ways. It might refer to the smoothness of
writing or speech in terms of temporal aspects; it might represent the level of
automatisation of psychological processes; or it might be defined in contrast to
accuracy (Koponen & Riggenbach, 2000). Reflecting the multi-faceted nature of
fluency, researchers have developed a number of measures to assess fluency. Ske-
han (2003) has identified four groups of measures: breakdown fluency, repair flu-
ency, speech/writing rate and automatisation. All these categories were developed
in the context of speech rather than writing. They are however, just as applicable
to the context of writing. Breakdown fluency in the context of speech is measured
by silence. In the context of writing this could be measured by a break in the writ-
ing process, which cannot be examined on the basis of the product alone. Repair
fluency has been operationalised in the context of speech as reformulations, re-
placements, false starts and repetition. For writing, this could be measured by the
number of revisions (self-corrections) a writer undertakes during the composing
process (Chenoweth & Hayes, 2001). Kellogg (1996) has shown that this editing
process can take place at any stage during or after the writing process. Another
sub-category of fluency is speech/writing rate, a temporal aspect of fluency, op-
erationalised by the number of words per minute. The final sub-group is automa-
tisation, measured by length of run (Skehan, 2003). Only repair fluency and tem-
poral aspects of writing (writing rate) can be measured on the basis of a writing
product. Furthermore, writing rate can only be established if the product was pro-
duced under a time limit or if the time spent writing was recorded. That repair flu-
ency and temporal aspects of fluency are separate entities has been shown by Ta-
vakoli and Skehan’s (2005) factor analysis (Table 12).

In the context of writing, Chenoweth and Hayes (2001) found that even within a
period of only two semesters their students displayed a significant increase in
writing fluency. This included an increase in burst length (automatisation), a de-
crease in the frequency of revision (repair fluency), and an increase in the number
of words accepted and written down (writing rate).
One measure that can be used to investigate temporal aspects of fluency is the
number of words which, according to Wolfe-Quintero et al. (1998), has produced
rather mixed results. According to their analysis, eleven studies found a signifi-
cant relationship between the number of words and writing development, while
seven studies did not. However, this measure might be more reliable if it is ap-
plied to writing that has been produced under time pressure. Kennedy and Thorp
(2002), who investigated the differences in writing performance at three different
IELTS levels, found a difference between essays at levels 4, 6 and 8, with writers
at level 4 struggling to meet the word limit. However, they also report a large
amount of overlap between the levels. Cumming et al. (2005), in a more recent
study focussing on the next generation TOEFL, found statistically significant dif-
ferences only between essays at levels 3 and 4 (and levels 3 and 5), but no differ-
ences between levels 4 and 5. The descriptive statistics indicate a slight increase
in the number of words between levels 4 and 5. Another interesting measure to
pursue might be the number of verbs. This measure has only been used once
(Harley & King, 1989) in a study which compared native and non-native speakers
and which produced significant results. However, it has never been used to differ-
entiate between different proficiency levels.

No studies of the writing product have investigated repair fluency. The number of
self-corrections, a measure mirroring the number of reformulations and false
starts in speech, might be a worthwhile measure to pursue in this study.
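
Since only product-based fluency measures are available here, they reduce to simple counts; the sketch below tallies the number of words, a writing rate and the visible self-corrections for a timed script. The 25-minute limit and the hand-counted corrections are assumptions made purely for the example.

    import re

    script = "The graph show that that unemployment rose sharply after 1990 ..."
    self_corrections = 3     # crossings-out counted by hand in the script
    minutes_allowed = 25     # time limit under which the script was written

    number_of_words = len(re.findall(r"[A-Za-z']+", script))
    writing_rate = number_of_words / minutes_allowed   # words per minute

    print(number_of_words, round(writing_rate, 2), self_corrections)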

4.5.1.3 Complexity

The importance of grammatical and lexical complexity in academic writing has
been pointed out by Hinkel (2003), who argues that investigations into L2 texts
have shown that in large-scale testing and university-level assessments, shortcom-
ings in syntactic and lexical complexity in students’ writing are often considered a
severe handicap. According to her, research has shown that raters often criticize
simple constructions and an unsophisticated lexicon, a consideration that might
reduce the score awarded (Reid, 1993; Vaughan, 1991). Furthermore, L2 writers’
range and sophistication have been shown to be reliable predictors of overall Test
of Written English scores (Frase, Faletti, Ginther, & Grant, 1999).

Ellis and Barkhuizen (2005) suggest that complexity can be analysed according to
the language aspects they relate to. These could include interactional, proposi-
tional, functional, grammatical or lexical aspects. As propositional and functional
complexity are hard to operationalize and interactional complexity is a feature of
speech, only grammatical and lexical complexity will be considered here
(following Wolfe-Quintero et al., 1998).

4.5.1.3.1 Grammatical complexity

Grammatical complexity is concerned with grammatical variation and sophistica-
tion. It is therefore not important how many production units (like clauses or t-
units) are present in a piece of writing, but rather how complex these are.

The measures that have been shown to most significantly distinguish between pro-
ficiency levels, according to Wolfe-Quintero et al. (1998), seem to be the t-unit
complexity ratio, the dependent clause per clause ratio and the dependent clause
per t-unit ratio (with the last two producing rather mixed results in previous stud-
ies).

The t-unit complexity ratio (number of clauses per t-units) was first used by Hunt
(1965). A t-unit contains one independent clause plus any number of other clauses
(including adverbial, adjectival and nominal clauses). Therefore, a t-unit complex-
ity ratio of two would mean that on average each t-unit consists of one independ-
ent clause plus one other clause. Wolfe-Quintero et al. (1998) point out that in L2
writing not all sentences are marked for tense or have subjects. They argue that it
is therefore important to include all finite and non-finite verb phrases in the t-unit
(as was done by Bardovi-Harlig & Bofman, 1989). This would change the meas-
ure to a verb phrases per t-unit measure. They argue that it would be useful to
compare which of these measures is more revealing. The t-unit complexity ratio
was designed to measure grammatical complexity, assuming that in more complex
writing there are more clauses per t-unit. However, in second language research,
there have been mixed results. Hirano (1991) found a significant relationship be-
tween the t-unit complexity ratio and program level, as did Cooper (1976) and
Monroe (1975) between this measure and school level, and Flahive and Snow
(1980) found a relationship between this measure and a number of their program
levels. However, other studies (Bardovi-Harlig & Bofman, 1989; Ishikawa, 1995;
Perkins, 1980; Sharma, 1980) obtained no significant results. For example, Cum-
ming et al.'s (2005) detailed analysis of TOEFL essays revealed a similar number
of clauses across proficiency levels. The means ranged from 1.5 to 1.8 for the
different levels. Similarly, Banerjee and Franceschina (2006) found no differences
between proficiency levels when conducting a similar analysis on IELTS writing
scripts. According to Wolfe-Quintero et al. (1998) this measure is most related to
program or school level and holistic ratings. They also point to the fact that even
in studies that found no significant results, scores on this measure increased.

The second useful measure identified by Wolfe-Quintero et al. is the dependent
clause ratio (number of dependent clauses per clauses). This measure examines
the degree of embedding in a text. Hirano (1991) found a significant relationship
between this measure and three different program levels.

The final measure deemed promising by Wolfe-Quintero et al. is the dependent
clauses per t-unit measure (number of dependent clauses per t-units). Two au-
thors have used this measure, both investigating the relationship between it and
holistic ratings. Homburg (1984) found a significant relationship, whilst Vann
(1979) did not. Vann also did not find the measure to be a predictor in a multiple
regression step analysis of TOEFL scores.
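
All three ratios follow directly from hand-coded clause counts; a minimal sketch with invented figures is given below.

    # Hypothetical hand-coded counts for one script.
    t_units = 18
    clauses = 31              # independent and dependent clauses combined
    dependent_clauses = 13

    t_unit_complexity_ratio = clauses / t_units              # clauses per t-unit
    dependent_clause_ratio = dependent_clauses / clauses     # dependent clauses per clause
    dependent_clauses_per_t_unit = dependent_clauses / t_units

    print(round(t_unit_complexity_ratio, 2),
          round(dependent_clause_ratio, 2),
          round(dependent_clauses_per_t_unit, 2))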

4.5.1.3.2 Lexical complexity

The second group of complexity measures is related to lexical complexity, which
refers to the richness of a writer's lexicon. Lexical complexity is evident in the
lexical range (variation) and the size (sophistication) of a writer’s vocabulary.
Writers with a wider vocabulary are able to use a larger number of basic and so-
phisticated words, whilst writers with less complex vocabulary might be able to
use only a limited number of basic words. It is therefore important to investigate
how varied and sophisticated the words are rather than just count the number of
words.

The most commonly known ratio measure of lexical complexity is the type/token
ratio (total number of different word types divided by the total number of words).
Type/token ratios, however, have been criticized, as they are sensitive to the
length of the writing sample. It is therefore important that if the type/token ratio is
used, the length of the sample has to be limited to a certain number of words6.
This might be one possible reason for Cumming and Mellow (1996) not finding a
significant difference between their learners of English in different program lev-
els. They did however find that, although not significant, the data showed the ex-
pected trend.

The second measure which was identified as promising by Wolfe-Quintero et al.
(1998) is lexical sophistication (total number of sophisticated lexical words di-
vided by total number of lexical words). This measure is calculated by identifying
the lexical words in a written sample which are not on a list of basic words, or are
on a special ‘sophisticated’ word list, like the Academic Word List (Coxhead,
2000).

Another measure of lexical sophistication is calculated by dividing the total num-
ber of sophisticated word types by the total number of word types. Laufer (1994)
used this measure to analyze university level compositions in a longitudinal study.
She defined sophistication as the number of words not on the 2000-word fre-
quency list. She found a significant difference over time.

In a more recent study conducted by Cumming et al. (2005) in the context of the
next generation TOEFL, the authors used average word length as an indicator of
lexical complexity. This measure had been used successfully in other studies (e.g.
Engber, 1995; Frase et al., 1999; Grant & Ginther, 2000), but failed to differenti-
ate between candidates at different proficiency levels in Cumming et al.’s study.
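
The lexical measures discussed above are equally simple to compute once a word list is chosen; in the sketch below the tiny 'basic' list is only a placeholder for a resource such as a 2000-word frequency list or the Academic Word List, and the type/token ratio would in practice be calculated on a length-limited sample.

    import re

    text = "The data indicate a significant fluctuation in unemployment figures."
    basic_words = {"the", "a", "in", "data", "figures"}   # placeholder list only

    tokens = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    types = set(tokens)

    type_token_ratio = len(types) / len(tokens)
    average_word_length = sum(len(w) for w in tokens) / len(tokens)

    # Lexical sophistication here follows the type-based variant: sophisticated
    # word types (i.e. types not on the basic list) divided by all word types.
    sophisticated_types = {w for w in types if w not in basic_words}
    lexical_sophistication = len(sophisticated_types) / len(types)

    print(round(type_token_ratio, 2),
          round(average_word_length, 2),
          round(lexical_sophistication, 2))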

4.5.1.4 Summary of accuracy, fluency and complexity

As the constructs of accuracy, fluency and complexity are based on a current view
of second language acquisition, they are more promising for the investigation of
writing performance than more traditional constructs and measures like grammar,
vocabulary or error counts. Measures of accuracy, fluency and complexity have
been shown to distinguish successfully between different levels of writing devel-
opment and to constitute separate constructs, as Tavakoli and Skehan's (2005) fac-
tor analysis demonstrated. A number of measures from the literature review were selected to be
further pursued in the pilot study. These can be seen in Table 13 below.

Table 13: Measures of accuracy, fluency and complexity worthy of further investigation

Accuracy: Number of error-free t-units; Number of error-free clauses; Error-free t-unit ratio; Error-free clause ratio; Errors per t-unit
Fluency: Number of words; Number of self-corrections
Grammatical complexity: Clauses per t-unit; Dependent clauses per t-unit; Dependent clauses per clause
Lexical complexity: Average word length; Lexical sophistication

Measures were selected based on two principles. Firstly, they needed to have been
shown by previous research to be successful in distinguishing between different
proficiency levels of writing and secondly, they needed to be sufficiently easy for
raters to apply during the rating process.

4.5.2 Mechanics

Very few studies have attempted to quantify aspects of mechanics, which include
spelling, punctuation, capitalization, and indentation (Polio, 2001). Most studies
that have investigated this construct to date (e.g. Pennington & So, 1993; Tsang,
1996), have made use of the Jacobs scale (Jacobs et al., 1981). However, none of
the studies had mechanics as a focus. It is therefore not clear if the scale is able to
reliably distinguish between different levels of mechanical quality. A second issue
raised by Polio (2001) is that it is not entirely clear if mechanics is a construct at
all. It is for example not clear if the different sub-components are related. Polio
further points out that in studies looking at accuracy, spelling is in fact often dis-
regarded. Bereiter (1980) argues, however, that writing is significantly different
from speech in that it requires certain conventions like spelling and punctuation
and it might therefore be necessary to measure these.

Two studies were identified that measured aspects of mechanics without the use
of a rating scale. Firstly, Mugharbil (1999) set out to discover the order in which
second language learners acquire punctuation marks. He concluded that the period
(or full stop) was the first punctuation mark acquired and the semi-colon the last.
For beginning learners, he was able to show that the comma was the least often
correctly placed. The second study that included a measure for mechanics was
conducted by Kennedy and Thorp (2002) in the context of an analysis of textual
features produced by candidates of the IELTS test. The authors looked at para-
graphing and counted the number of paragraphs produced by writers at three dif-
ferent levels of writing, levels 4, 6 and 8.

Table 14: Paragraph percentages (from Kennedy and Thorp, 2002)

Number of paragraphs per essay | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+
Percentage of essays – Level 8 | 0 | 6.6 | 13.3 | 16 | 26.6 | 13.3 | 16.6 | 3.3 | 0 | 3.3
Percentage of essays – Level 6 | 2 | 0 | 6 | 48 | 26 | 14 | 2 | 0 | 2 | 0
Percentage of essays – Level 4 | 10 | 8 | 18 | 24 | 22 | 14 | 40 | 0 | 0 | 0

They were able to show that ten percent of the writers at level 4 produced only
one paragraph, whilst writers at level 6 generally produced four or more para-
graphs. However, the results (shown in Table 14) are anything but conclusive.

So overall, the area of mechanics seems to have been very little explored in stud-
ies of second language writing. Several areas seem to be of interest and will there-
fore be further pursued in the pilot study. These are: punctuation, spelling, capi-
talization and paragraphing.
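
Some of these features lend themselves to straightforward counting; the sketch below, for example, counts paragraphs and punctuation marks in a typed transcript. It assumes that paragraphs are separated by blank lines in the transcription, which will not hold for every data set.

    import re
    from collections import Counter

    script = (
        "The graph shows unemployment figures for the last decade.\n\n"
        "First, the rate rose sharply; then it fell again.\n\n"
        "In conclusion, the overall trend is downward."
    )

    # Paragraphs: blocks of text separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", script) if p.strip()]

    # Frequency of selected punctuation marks.
    punctuation = Counter(ch for ch in script if ch in ".,;:!?")

    print(len(paragraphs), dict(punctuation))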

4.5.3 Cohesion and coherence

In this section, measures of cohesion and coherence will be discussed as well as
any relevant research that has been undertaken in this area. First, ‘cohesion’ and
‘coherence’ will be defined. Then, a closer look will be taken at research into each
of these two areas. Next, measures that have been used to operationalise these con-
cepts will be presented. An abundance of measures have been developed to meas-
ure the two concepts; however only a handful are useful in the context of the pro-
posed study. Therefore, only measures that can be operationalised for a rating
scale will be reviewed.

4.5.3.1 Coherence

According to Grabe and Kaplan (1996), there is little consensus on an overall
definition of coherence as it is considered an abstract and fuzzy concept (Connor,
1990; Roberts & Kreuz, 1993). Lee (2002b), for example, defined it as ‘the rela-
tionships that link the ideas in a text to create meaning’ and Chiang (1999) wrote
that coherence ‘pertains to discourse-level features or macro-propositions that are
responsible for giving prose its overall organisation’. A number of definitions also
link coherence to the reader, arguing that coherence might not entirely reside in
the text. Yule (1996, cited in Watson-Todd et al., 2004), for example, writes that
coherence represents the ‘less tangible ways of connecting discourse which are
not overt and which reside in how people interpret texts rather than in the texts
themselves’.

Overall, it can be said that coherence resides at textual level (not sentence level),
where it creates links between ideas to create meaning, show organisation and
make a text summarizable. It is further concerned with how people interpret the
text. Coherence is created not only by the writer’s purpose but also the readers’
(possibly even whole discourse communities’) expectation. Lautamatti (1990) dis-
tinguishes two types of coherence: interactional and propositional. The former is
created when succeeding speech acts in discourse are linked. This is the case in
spoken informal language. The latter occurs through links created by the idea-
tional content of the discourse and is evident in more formal settings and written
language. This chapter will discuss only propositional coherence that can be found
in writing.

Taking the above definitions of coherence into account, it is not surprising that the
concept of coherence has been one of the most criticized in existing rating scales.
Descriptors are usually vague, as has been shown by Watson-Todd et al. (2004),
who provide examples of typical descriptors. For example, good writing should be
‘well organised’, and ‘cohesive’, should have a ‘clear progression of well-linked
ideas’. Poor quality writing, on the other hand, is often described as so ‘fragmen-
tary that comprehension of the intended communication is virtually impossible’.
These descriptors often require subjective interpretation and might lead to confu-
sion among the raters. Hoey (1991) argues that, because coherence resides outside
the text, judgments will inevitably have to be subjective and vary from reader to
reader. Chiang (1999; 2003), however, was able to show that raters, contrary to
what has been shown by many studies, put more emphasis on coherence and co-
hesion in writing than on grammar, if they have clear descriptors to focus on. The
following section on coherence therefore aims to illustrate the debate in the litera-
ture on coherence and describe the measures which have been proposed to meas-
ure coherence objectively.

Research investigating coherence dates back as far as the 19th century. Then, how-
ever, coherence was predominantly defined in terms of sentence connections and
paragraph structure (Lee, 2002a). Only since the emergence of discourse analysis
in the 1960s has more emphasis been placed on constituents larger than the sen-
tence. Researchers began investigating what principles tie a text together and in
what contexts texts occur. Coherence, according to Grabe and Kaplan (1996),
should derive its meaning from what a text is and how a text is constructed. This
can be considered either as internal to the text or internal to the reader. If defined
as internal to the text, coherence can be explained as the formal properties of a
text. In this context Halliday and Hasan (1976) developed their theory of cohe-
sion, which will be discussed in more detail in the section on cohesion below.
Other researchers investigated information distribution in texts, introducing the
concepts of given and new information (Vande Kopple, 1983, 1986), also referred
to as topic and comment (Connor & Farmer, 1990) or theme and rheme (Halliday,
1985, 1994). From these, Lautamatti (1987) and later Schneider and Connor (1990)
developed topical structure analysis as a tool for analyzing coherence. They were
able to identify different structural patterns in texts and were able to teach this
method to ESL students to successfully investigate the coherence of their texts
(Connor & Farmer, 1990). This method will be described in more detail later in
this chapter.

Kintsch and van Dijk (1978) described coherence in terms of propositions and their
ordering in text. Thus coherence has been described in terms of cohesion and the
ordering of information structure to form the macrostructure of texts. Hoey (1991)
looked at lexical patterns in a text, whilst other linguists have looked at metadis-
coursal features of a text, for example, logical connectors, sequencers and hedges,
and how they contribute to the overall coherence of texts (Cheng & Steffensen,
1996; Crismore, Markkanen, & Steffensen, 1993). There is therefore, from a lin-
guistic perspective, plenty of evidence that coherence can be found, at least partly,
within texts.

Other research, however, has defined coherence as internal to the reader. This
view has its basis in modern reading theories, which have shown that text process-
ing is an interaction between the reader and the text and that readers use their
world knowledge and knowledge of text structures to make sense of a text
(Carrell, 1988). Readers can anticipate upcoming textual information, which helps
to organise the text into understandable information (Bamberg, 1983). The reader
can therefore be regarded as an important contributor to coherence.

Although it is quite clear from these two strands of research that coherence resides
both in the text and is created through an interaction between the reader and the
text, for the purpose of this research, only coherence internal to the text is consid-
ered. Although probably not a complete picture of coherence, coherence internal
to the text can be more easily operationalised for the purpose of rating scale de-
scriptors and can be defined in more detail. Aspects of writing that are created by
an interaction between the reader and the text are investigated in a later section
called ‘reader/writer interaction.

4.5.3.1.1 Measuring coherence

Several different ways of measuring coherence have been proposed in the litera-
ture. This section will describe three measures: metadiscourse markers, topical
structure analysis and topic-based analysis.

Crismore, Markkanen and Steffensen (1993) and Intaraprawat and Steffensen
(1995) proposed the use of metadiscourse markers to analyze coherence (based
on previous work by Vande Kopple, 1985 and Lautamatti, 1978). Metadiscourse
is defined as ‘the writers’ discourse about discourse, their directions of how read-
ers should read, react to, and evaluate what they have written about the subject
matter’ (Crismore et al., 1993). These authors argue that both professional and
non-professional writers project themselves into texts to guide and direct readers
so that readers can better understand the content and the writer’s attitude toward
the content and the reader. This metadiscourse does not add any propositional
content, but is intended to help the reader organise, interpret and evaluate the in-
formation supplied. Crismore et al. revised a classification scheme of metadis-
course initially proposed by Vande Kopple (1985), keeping the latter’s two over-
arching categories of ‘textual metadiscourse’ and ‘interpersonal discourse’. The
categories of textual metadiscourse proposed by Crismore et al. (1993) are as fol-
lows7:

Textual Metadiscourse (used for logical and ethical appeals)


1. Textual markers
a. Logical connectives
b. Sequencers
c. Reminders
d. Topicalizers
2. Interpretive Markers
a. Code glosses
b. Illocution markers

Logical connectives include coordinating conjunctions (e.g. and, but) and con-
junctive adverbs (e.g. therefore, in addition). Sequencers include numbers as well
as counting and numbering words like ‘first’, ‘second’ and so on. Reminders are
expressions that refer to earlier text, like, for instance, ‘as I noted earlier’. Topi-
calizers are words or phrases that indicate a topic shift. These can include ‘well’,
‘now’, ‘in regard to’ or ‘speaking of’.

Interpretive markers include code glosses and illocution markers. Code glosses
are explanations of text introduced by expressions such as ‘namely’, ‘for example’
or ‘what I mean is’. These expressions provide more information for words or
propositions which the writer anticipates will be difficult for the reader. Illocution
markers name the act that the writer is performing. These might include expres-
sions like ‘I state again that…’, ‘to sum up’, ‘to conclude’, ‘to give an example’
or ‘I plead with you’.

Intaraprawat and Steffensen (1995) used the categories described above to inves-
tigate the difference between good and poor ESL essays. They found that good
essays displayed twice as many metadiscoursal features as poor essays. They also
found a higher density of metadiscourse features in the good essays (calculated as
features per average number of t-units). Good writers used more than twice the
number of code glosses and three times as many illocutionary markers. They
found very little difference in connectives between the two groups and explained
this by suggesting that these are explicitly taught in many writing courses. The
good essays had a higher percentage of interpersonal features while the poor had a
higher percentage of textual features.

Topical structure analysis (TSA) was first developed by Lautamatti (1987) in the
context of text readability to analyse topic development in reading material. She
defined the topic of a sentence as ‘what the sentence is about’ and the comment of
a sentence as ‘what is said about the topic’. Lautamatti described three types of
progression which advance the discourse topic by developing a sequence of sen-
tence topics. Through this sequence of sentence topics, local coherence is created.
The three types of progression can be summarized as follows (Hoenisch, 1996):

Parallel progression, in which the topics of successive sentences are the same,
producing a repetition of topic that reinforces the idea for the reader (<a, b>, <a,
c>, <a, d>);

Sequential progression, in which the topics of successive sentences are always
different, as the comment of one sentence becomes, or is used to create, the topic
of the next (<a, b>, <b, c>, <c, d>); and

Extended parallel progression, in which the first and the last topics of a piece of
text are the same but are interrupted with some sequential progression (<a, b>, <b,
c>, <a, d>).
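
The three progression types can also be illustrated with a small piece of code; the sketch below labels the transition between successive hand-identified (topic, comment) pairs, using simple exact matching in place of the semantic judgements a human coder would make, and it does not attempt to detect extended parallel progression, which requires looking further back in the text.

    def classify_progressions(pairs):
        """Label each transition between successive (topic, comment) pairs as
        parallel (same topic as the previous sentence), sequential (the previous
        comment has become the topic) or other."""
        labels = []
        for (prev_topic, prev_comment), (topic, _comment) in zip(pairs, pairs[1:]):
            if topic == prev_topic:
                labels.append("parallel")
            elif topic == prev_comment:
                labels.append("sequential")
            else:
                labels.append("other")
        return labels

    # <a, b>, <a, c>, <c, d>: one parallel step followed by one sequential step.
    print(classify_progressions([("a", "b"), ("a", "c"), ("c", "d")]))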

Witte (1983a; 1983b) made use of TSA in writing research. He compared two
groups of persuasive writing scripts, one rated high and one rated low, in terms of
the use of the three types of progression described above. He found that the higher
level writers used less sequential progression and more extended and parallel pro-
gression. There are however several shortcomings in Witte’s study. Firstly, the
raters were not professional raters, but were rather recruited from a variety of pro-
fessions. Secondly, Witte did not use a standardized scoring scheme. He also con-
ducted the study in a controlled revision situation. The students revised a text
written by another person. Furthermore, Witte did not report any intercoder reli-
ability analysis.

In 1990, Schneider and Connor set out to compare the use of topical structure by
45 writers taking the TWE (Test of Written English). They grouped the 45 argu-
mentative essays into three different levels (high, medium, low). As with Witte’s
study, Schneider and Connor did not report any intercoder reliability statistics.
The findings were contradictory to Witte’s findings. The higher level writers used
more sequential progression while the low and middle group used more parallel
progression. There was no difference between the levels in the use of extended
parallel progression. Connor and Schneider drew up clear guidelines on how to
code TSA and also suggested a re-interpretation of sequential progression in their
discussion section. They suggested dividing sequential progression into the fol-
lowing subcategories:
Direct sequential progression, in which the comment of the previous sentence be-
comes the topic of the following sentence. The topic and comment are either word
derivations (e.g. science, scientist) or they form a part-whole relation (these
groups, housewives, children) (<a,b>, <b,c>, <c,d>).

Indirect sequential progression, in which the comment of the previous sentence
becomes the topic of the following sentence but topic and comment are related
only by semantic sets (e.g. scientists, their inventions and discoveries, the inven-
tion of the radio, telephone and television).

Unrelated sequential progression, in which topics are not clearly related to either
the previous sentence topic or discourse topic (<a,b>, <c,d>, <e,f>).

Wu (1997), in his doctoral dissertation, applied Schneider and Connor's revised
categories to analyse two groups of scripts rated using the Jacobs et al. (1981)
scale. He found in his analysis no statistically significant difference in terms of
the use of parallel progression between high and low level writers. Higher level
writers used slightly more extended parallel progression and more related sequen-
tial progression.

A more recent study using TSA to compare groups of writing based on holistic
ratings was undertaken by Burneikaité and Zabiliúté (2003). Using the original
criteria of topical structure developed by Lautamatti and Witte, they investigated
the use of topical structure in argumentative essays by three groups of students
rated as high, middle and low, based on a rating scale adapted from Tribble
(1996). They found that the lower level writers overused parallel progression
whilst the higher level writers used a balance between parallel and extended paral-
lel progression. The differences in terms of sequential progression were small,
although they did show that lower level writers used this type of progression
slightly less regularly. Burneikaité and Zabiliúté failed to report any inter-rater
reliability statistics.

All studies conducted since Witte’s study in 1983 have produced very similar
findings, but with some differences. Two out of three studies found that lower
level writers used more parallel progression than higher level writers. However,
Wu (1997) found no significant difference. All three studies found that higher
level writers used more extended parallel progression. In terms of sequential pro-
gression the differences in findings can be explained by the different ways this
category was applied. Schneider and Connor (1990) and Burneikaité and Zabu-
liúté (2003) used the definition of sequential progression with no subcategories.
Both studies found that higher level writers use more sequential progression. Wu
found no differences between different levels of writing using this same category.
However, he was able to show that higher level writers used more related sequen-
tial progression. It is also not entirely clear how much task type or topic familiar-
ity influences the use of topical structure and if findings can be transferred from
one writing situation to another.

As an extension of topical structure analysis, Watson-Todd (1998; Watson-Todd
et al., 2004) developed topic-based analysis. He wanted to develop an objective
method of analyzing coherence. In topic-based analysis, key concepts are identi-
fied through frequency. Logical relationships between these concepts are identi-
fied, and from these line diagrams representing the schemata of the discourse are
drawn up and coherence is measured. Although this method has been shown to be
promising for differentiating between raters’ judgments of coherence, it will not
be further pursued, as it is too complicated for a rater to undertake during the rat-
ing process.

In conclusion, it can be said that coherence remains a fuzzy concept, and that it
will be hard to define the concept operationally. For the purpose of this study,
topical structure analysis and metadiscoursal markers seem the most promising.

4.5.3.2 Cohesion

Cohesion has been defined by Fitzgerald and Spiegel (1986) as ‘the linguistic fea-
tures which help to make a sequence of sentences a text' (i.e. give it texture).
Reid (1992) defined it as ‘explicit linguistic devices used to convey information,
specifically the discrete lexical cues used to signal relations between parts of dis-
course’. To her, cohesion devices are therefore words and phrases that act as sig-
nals to the reader; these words relate what is being stated to what has been stated
and to what will soon be stated. She goes on to argue that cohesion is a subcate-
gory or sub-element of coherence.

Analysis of cohesion has received much attention among applied linguists and
writing researchers. The term ‘cohesion’ was popularized by Halliday and Hasan
(1976) who developed a model for analysing texts. They showed that cohesive
ties involve a relation between two items within a text. One item cannot be effec-
tively decoded without reference to the other. Cohesive ties are ties that operate
intersententially (between sentences). For the purpose of this study, cohesive ties
were operationalized as operating between t-units. They can also, however, as was
pointed out by Halliday and Hasan (1976) operate between clauses.

Halliday and Hasan show that cohesion is not always necessary in achieving
communication, but helps guide the reader’s or listener’s understanding of text
units. Their model has been criticized by various authors, but nevertheless has
been a major influence in language teaching.

Halliday and Hasan (1976) identify the following two broad types of cohesion in
English:

- grammatical cohesion: the surface marking of semantic links between
clauses and sentences in written discourse and between utterances and
turns in speech
- lexical cohesion: related vocabulary items which occur across clause and
sentence boundaries in written texts and are major characteristics of coher-
ent discourse.

The first item of grammatical cohesion described by Halliday and Hasan (1976) is
the term reference. Reference refers to items of language that, instead of being in-
terpreted semantically in their own right, make reference to other items for which
the context is clear to both sender and receiver. The retrieval of these items can be
either exophoric or endophoric (within the text). Exophoric reference looks out-
side the text to the immediate situation or refers to cultural or general knowledge
(homophoric). Endophoric reference can be either anaphoric (referring to a word
or phrase used earlier in a text) or cataphoric (referring to a word or phrase used
later in the text). There are three types of reference: personal, demonstrative and
comparative. These words indicate to the listener/reader that information is to be
retrieved from elsewhere.

A second major type of grammatical cohesive tie is that of substitution. It is another formal link between sentences where items like do or so replace a word or
group of words which appeared in an earlier sentence. Substitution, as Halliday
and Hasan (1976) point out, is a relation on the lexico-grammatical level. There-
fore the substitute item has the same structural function as that for which it substi-
tutes. Substitution can be either nominal, verbal or clausal. Substitution is more
frequent in spoken texts.

A third kind of grammatical cohesive device, ellipsis, is the omission of an element normally required by the grammar which the reader/listener can recover from the linguistic context and which therefore need not be repeated. Halliday and Hasan (1976) called ellipsis ‘substitution by zero’. Like substitution, ellipsis sets up a relationship that is lexico-grammatical. Ellipsis can also be divided into three categories: nominal, verbal or clausal. Like substitution, ellipsis is more frequent in spoken texts and is normally an anaphoric relation.

The final grammatical cohesive relation is that of conjunction. Conjunctions are words (or phrases) that join different sections of texts in ways that express their
logical-semantic relationship. Conjunctions contribute to cohesion but unlike ref-
erence, substitution and ellipsis, are not a search instruction but signal a relation-
ship between segments of discourse. There are many words and phrases which
can be put into this category and many different ways in which they can be classi-
fied. Halliday and Hasan (1976) identified four broad types of conjunctions:

- additive
- adversative
- causal
- temporal

The second major group of cohesive relations is lexical cohesion. The cohesive
effect is achieved by the selection of certain vocabulary items that occur in the
context of related lexical items. Halliday and Hasan (1976) identify two principal
kinds and their subcategories:

- reiteration:
  - repetition
  - synonym, near-synonym
  - antonym
  - superordinate relations: hyponym, meronym
  - general nouns
- collocations

Several authors have debated whether collocations properly belong to the notion
of lexical cohesion, since collocation only refers to the probability that lexical
items will co-occur and there is no semantic relation between them.

Halliday and Hasan (1976) acknowledged some of the problems with their model
when they suggested that the boundaries between lexical and grammatical cohe-
sion are not always clear. They further observed that the closer the ties the greater
the cohesive strength, and that a higher density of cohesive ties increases the co-
hesive strength.

4.5.3.2.1 Measuring cohesion

Halliday and Hasan’s (1976) categories of cohesion have been applied in a num-
ber of research projects with varying results. Witte and Faigley (1981) in the con-
text of L1 English, for example, compared the cohesion of high and low level es-
says. They found a higher density of cohesive ties in high level essays. Almost a
third of all words in the better essays contributed to cohesion and the cohesive ties
spanned shorter distances than in lower-level essays. They also found that the ma-
jority of lexical ties in low-level essays involved repetition, whilst high-level es-
says relied more on lexical collocation. In contrast, Neuner (1987) found that none
of the ties were used more in good essays than in poor freshman essays. He did,
however, find differences in cohesive chains (three or more cohesive ties that refer to each other), in cohesive distance, and in the variety of word types and the maturity of word choice. For example, in good essays, cohesive chains are
sustained over greater distances and involve greater proportions of the whole text.
Good writers also used more different words in their cohesive chains as well as
less frequent words than the poor writers. A very similar result was found by
Crowhurst (1987), who compared cohesion at different grade levels in two differ-
ent genres (arguments and narratives). He also found that the overall frequency of
cohesive ties did not increase with grade level, but that synonyms and collocations
(a sign of more mature vocabulary) did.

Jafarpur (1991) applied Halliday and Hasan’s categories to ESL writing. He found
that in the essays the number of cohesive ties and the number of different types of
cohesion successfully discriminated between different proficiency levels. Reid
(1992), investigating ESL and NS writing, focussed on the percentages of coordi-
nate conjunctions, subordinate conjunctions, prepositions and pronouns and found
that ESL writers used more pronouns and coordinating conjunctions than NS, but
fewer prepositions and subordinating conjunctions. Two other studies also com-
pared native and non-native speaking writers in terms of their use of connectors.
Field and Yip (1992) were able to show that Cantonese writers significantly over-
use such devices. However, Granger and Tyson (1996) in a large scale investiga-
tion of the International Corpus of Learner English were not able to confirm these
findings. They emphasised that a qualitative analysis of the connectors is impor-
tant. They documented the underuse of some connectors and overuse of others.

Two recent studies compared the performances of test takers over different profi-
ciency levels. Firstly, Kennedy and Thorp (2002), in the context of IELTS, were
able to show that writers at levels 4 and 6 used markers like ‘however, firstly,
secondly’ and subordinators more than writers at level 8. They concluded that
writers at level 8 seemed to have other means at their disposal to mark these con-
nections, whilst lower level writers needed to rely on these overt lexico-
grammatical markers to structure their argument. Even more recently and also in
the context of IELTS, Banerjee and Franceschina (2006) looked at the use of de-
monstrative reference over five different IELTS levels. They found that the use of
‘this’ and ‘these’ increased with proficiency level whilst the use of ‘that’ and
‘those’ stayed relatively level or decreased.

Several authors have specifically investigated lexical cohesion (Hoey, 1991; Liu,
2000; Reynolds, 1995), arguing that this is the most common and important type
of cohesion. Hoey, for example, investigated the types of lexical repetition and
classified them into simple and complex lexical repetition and paraphrase. He
showed how lexical repetition can be mapped into a matrix, revealing the links
throughout the whole text. This method of analysis, although very promising, will
not be further pursued, as it is too complex to be performed by raters during a rat-
ing session. Both Hasan’s (1984) and Hoey’s (1991) models were developed for
the first language writing context and rely on the concept that quantity is signifi-
cant. However, Reynolds (1995) questions whether quantity makes a text more
cohesive. It is also not clear if this can be transferred to the L2 writing context.

Based on these findings, the number of anaphoric pronominals, the number of linking devices and the number of lexical chains will be further pursued in the pi-
lot study.

4.5.4 Reader/writer interaction

Reader/writer interaction expands the focus of study beyond the ideational dimen-
sions of texts to the ways in which texts function at the interpersonal level.
Hyland (2000b) argues that writers do more than produce texts in which they pre-
sent an external reality; they also negotiate the status of their claims, present their
work so that readers are most likely to find it persuasive, and balance fact with
evaluation and certainty with caution. Writers have to take a position with respect
to their statements and to their audiences, and a variety of features have been ex-
amined to see how they contribute to this negotiation of a successful reader-writer
relationship.

In this section Crismore et al.’s (1993) interpersonal metadiscoursal markers are described in detail8. These are divided into the following categories:

Interpersonal Metadiscourse (used for emotional and ethical appeals)


1. Hedges (epistemic certainty markers)
2. Certainty markers (epistemic emphatics or boosters)
3. Attributors
4. Attitude markers
5. Commentaries

Hedges have been defined as ‘ways in which authors tone down uncertain or po-
tentially risky claims’ (Hyland, 2000a), as ‘conventions of inexplicitness’ and as
‘a guarded stance’ (P. Shaw & Liu, 1998), as structures that ‘signal a tentative as-
sessment of referential information and convey collegial respect for the views of
colleagues’ (Hyland, 2000a) or as ‘the absence of categorical commitment, the
expression of uncertainty, typically realized by lexical devices such as might’
(Hyland, 2000b). Examples of hedges are epistemic modals like might, may,
could, and other structures such as I think, I feel, I suppose, perhaps, maybe, it is
possible. Hyland (1996a; 1996b; 1998) differentiates between two functions of
hedging: content-oriented and reader-oriented. Content-oriented hedges mitigate
between the propositional content of a piece of writing and the discourse commu-
nity’s conception of what the truth is like. Content-oriented hedges can in turn be
divided into accuracy-oriented hedges and writer-oriented hedges. The writer
needs to express propositions as accurately as possible. This is made possible by
accuracy-oriented hedges which allow the writer to express claims with greater
precision, acknowledge uncertainty and signal that a statement is based on the
writer’s plausible reasoning rather than assured knowledge. The writer, however,
also needs to acknowledge contrary views from readers. Writer-oriented hedges
permit the writer to speculate. The second major category of hedges are the
reader-oriented hedges. Through these, the writer develops a writer-reader rela-

tionship. These structures help to tone down statements in order to gain the
reader’s ratification of claims. Hyland (2000b) suggests that hedges are highly frequent in academic writing, occurring more often than once in every 50 words.

A number of researchers have looked at hedging in L2 learners’ writing. Bloor and Bloor (1991), for example, found that direct and unqualified writing, rather than the use of hedging devices, was more typical of EFL writers. Similarly, Hu,
Brown and Brown (1982) found that Chinese L2 writers are more direct and au-
thoritative in tone and make more use of stronger modals than native speakers.
Hyland and Milton (1997) investigated how both L1 and L2 students express
doubt and certainty in writing. They found that the two groups of writers used a
similar number of modifiers - one device in every 55 words - but native speakers
used two-thirds of the devices to weaken claims whilst non-native speakers used
over half of the modifiers in their writing to strengthen claims. In a more recent
study, Kennedy and Thorp (2002) were able to show that writers at levels 4 and 6
in the IELTS writing section used fewer hedging devices than writers at level 8.

Boosters (or certainty markers) have been defined as expressions ‘that allow
writers to express conviction and to mark involvement and solidarity with an au-
dience’ (Hyland, 1998) or as ‘the ways in which writers modify the assertions
they make, emphasizing what they believe to be correct’ (Hyland, 2000a). Boost-
ers include expressions like clearly show, definite, certain, it is a fact that or obvi-
ously. As has already been described above in the context of hedges, a number of
studies found that L2 writers overuse boosters in their writing and are therefore
found to make unjustifiably strong assertions (Allison, 1995; Bloor & Bloor,
1991; Hyland & Milton, 1997; Kennedy & Thorp, 2002).
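
Frequencies of this kind (hedges or boosters per fixed number of words) are simple to compute once a marker inventory has been agreed on. The sketch below, in Python, is a minimal illustration only and not a procedure used in the studies cited above; the two marker lists are deliberately small, invented stand-ins for the much fuller inventories used by Hyland and others.

    import re

    # Deliberately small, invented marker lists; published studies use far fuller inventories.
    HEDGES = {"might", "may", "could", "perhaps", "maybe", "possibly", "i think", "it is possible"}
    BOOSTERS = {"clearly", "definitely", "certainly", "obviously", "of course", "it is a fact that"}

    def marker_rate(text, markers, per=50):
        """Occurrences of the given markers per `per` words of running text."""
        lowered = text.lower()
        tokens = re.findall(r"[a-z']+", lowered)
        if not tokens:
            return 0.0
        count = 0
        for marker in markers:
            if " " in marker:            # multi-word markers: count substrings
                count += lowered.count(marker)
            else:                        # single-word markers: count whole tokens
                count += tokens.count(marker)
        return count * per / len(tokens)

    essay = "This may be true. Clearly, prices could perhaps fall, but obviously not soon."
    print(round(marker_rate(essay, HEDGES), 2))    # hedges per 50 words
    print(round(marker_rate(essay, BOOSTERS), 2))  # boosters per 50 words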

The third structure on Crismore’s list of interpersonal discourse markers, attributors, increases the force of an argument and can take the form of a narrator, as in ‘John claims that the earth is flat’, or of an attributor, as in ‘Einstein claimed that our universe is expanding’. In Vande Kopple’s (1985) categorization, these were
separate categories, but Crismore et al. (1993) found in their analysis that these
two features performed a very similar function and therefore grouped them to-
gether.

Attitude markers express the writer’s affective values and emphasize the proposi-
tional content, but do not show commitment to it. These include words and
phrases like ‘unfortunately’ or ‘most importantly’. They can perform the functions
of expressing surprise, concession, agreement, disagreement and so on.

Finally, the category of commentaries establishes a reader-writer relationship by bringing the reader into the discourse through expressions like ‘you may not agree
that’, ‘my friend’, ‘think about it’.

Intaraprawat and Steffensen (1995) used all the categories described above to in-
vestigate differences between good and poor ESL essays. They found that good
students used twice as many hedges, attitude markers and attributors, more than
double the number of emphatics (boosters) and three times as many commentar-
ies.

Apart from hedges, boosters, attributors, attitude markers and commentaries, writ-
ers can also express reader-writer interaction by showing writer identity in their
writing. As Hyland (2002a) suggests, academic writing is not just about convey-
ing an ideational ‘content’, it is also about the representation of self. Ivanic (1998;
Ivanic & Weldon, 1999) identifies three aspects of identity interacting in writing.
Firstly, there is the autobiographical self, which is influenced by the writer’s life
history. Then there is the discoursal self, which represents the image or ‘voice’ the
writer projects in a text. Finally, there is the authorial self, which is the extent to
which a writer intrudes into a text and claims responsibility for its content. This is
achieved through ‘stance’. For the purpose of this study only the third type of
identity will be discussed here. Academic writing is a site in which social posi-
tioning is constructed. The academy’s emphasis on analysis and interpretation
means that students must position themselves in relation to the material they dis-
cuss, finding a way to express their own arguments (Hyland, 2002a). Writers are
therefore required to establish a stance towards their propositions and to get be-
hind their words. The problem with identity, however, is that expectations around it are uncertain. On the one hand, an impersonal style is seen as a key feature of academic writing, as it symbolizes the idea that academic research is objective and empirical. On the other hand, textbooks encourage writers to make their own voice clear through the first person. This constitutes a problem for L2 writers. Hyland (2002b) argues that L2
writers are often told not to use ‘I’ or ‘in my opinion’ in their academic writing. In
his investigation on the use of the first person in L1 expert and L2 writing, he
found that professional writers are four times more likely to use the first person
than L2 student writers (Hyland, 2002a).

Hyland (2002b) argues that this underuse of first person pronouns in L2 writing
inevitably results in a loss of voice. Contrary to Hyland’s (2002a; 2002b) findings,
Shaw and Liu (1998) showed that as L2 students’ writing develops, they move
away from using personal pronouns in their writing and make more use of passive
verbs. They therefore argue that more developed writing has less authorial refer-
ence.

If writers choose not to display writer identity, but rather want to keep a piece of
writing more impersonal, they could do this by increased use of the passive voice.
This was investigated by Banerjee and Franceschina (2006), who found that the

higher the IELTS score awarded to a writing script, the more passives the writer
had used.

Summing up, there are various devices available to writers to establish a success-
ful writer-reader relationship. Among these are hedges, boosters, attributors and
attitude markers, as well as markers of identity and the use of the passive voice,
all of which will be further pursued in the pilot study.

4.5.5 Content

Few researchers have investigated objective measures of content. Usually, either holistic or multi-trait rating scales have been employed for this purpose. Among
those that have tried to find objective measures is Kepner (1991), who counted the
number of higher level propositions that included ‘propositions or propositional
clusters within the student text which exemplified the cognitive processes of
analysis, comparison/contrast, inference/interpretation and/or evaluation’ (p.308).
However, Kepner failed to make his counts a function of the number of words, so
that his measure may simply reflect the length of writing. Also, this measure does
not discriminate between relevant and irrelevant propositions and between propo-
sitions of varying importance to the writing. Similarly, Friedlander (1990) counted
the number of details. He did not operationalize this feature, nor did he, like Kep-
ner, make it a function of the number of words. Polio (2001) suggests counting
idea units (based on the work of Kintsch & Keenan, 1973) to quantify the density
of content.
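
The normalization problem noted above can be illustrated with a small worked example (the figures are invented, not data from any of the studies cited): once counts are expressed per 100 words, a shorter but denser script is no longer at a disadvantage.

    def propositions_per_100_words(n_propositions, n_words):
        """Length-normalized content density: higher-level propositions per 100 words."""
        return 100 * n_propositions / n_words if n_words else 0.0

    # Invented figures: the longer script has more propositions in absolute terms,
    # but the shorter script is denser once length is taken into account.
    print(propositions_per_100_words(12, 400))  # 3.0
    print(propositions_per_100_words(9, 250))   # 3.6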

Kennedy and Thorp (2002) recorded the main topics for IELTS essays produced
at three proficiency levels. However, their analysis was inconsistent in that they
did not follow the same procedures for essays at levels 4, 6 and 8. Therefore, the
results are difficult to compare. No other research was located that compared can-
didates’ performance on content over different proficiency levels without using a
rating scale. Because of the lack of discourse analytic measures of content in the
literature, a measure specific to the current study will be designed.

4.6 Conclusion

Overall, this chapter has shown that, although no adequate model or theory of
writing or writing proficiency is currently available, a taxonomy based on current
models of language development can guide the rating scale design process and
provide an underlying theoretical basis.

Table 15: List of measures to be trialed during pilot study
Construct                   Measures
Accuracy                    Number of error-free t-units
                            Number of error-free clauses
                            Error-free t-unit ratio
                            Error-free clause ratio
                            Errors per t-unit
Fluency                     Number of words
                            Number of self-corrections
Complexity                  Clauses per t-unit
                            Dependent clauses per t-unit
                            Dependent clauses per clause
                            Average word length
                            Lexical sophistication
Mechanics                   Number of punctuation errors
                            Number of spelling errors
                            Number of capitalization errors
                            Paragraphing
Cohesion                    Number of anaphoric pronominals
                            Number of linking devices
                            Number of lexical chains
Coherence                   Categories of topical structure analysis
                            Metadiscoursal markers
Reader/writer interaction   Number of hedges
                            Number of boosters
                            Number of attributors
                            Number of attitude markers
                            Number of markers of writer identity
                            Number of instances of passive voice
Content                     Measure specific to this research

I have shown that the constructs identified as important aspects of academic writ-
ing have been operationalized to varying degrees and with varying success. Table
15 above shows the eight constructs from the taxonomy in the left hand column,
whilst the column on the right presents the different discourse analytic measures
that were chosen as operationalisations of these constructs. Each discourse ana-
lytic measure will be trialed during the pilot study phase, which is described in the
following chapter.

---
Notes:
1. For a detailed description of DELNA (Diagnostic English Language Needs Assessment), refer to the methodology section.
2. For a detailed description of DELNA (Diagnostic English Language Needs Assessment), refer to the methodology section.
3. For a detailed description of the three DELNA writing tasks, refer to the methodology section.
4. The data set used was based on oral performance. It is not clear if the same results would be obtained for written performance.
5. A t-unit contains one independent clause plus any number of other clauses (including adverbial, adjectival, and nominal). The t-unit was first developed by Hunt (1965).
6. Recent developments in type/token ratio take length into account (Jarvis, 2002). These complex formulae are, however, not suitable for the context of this study. Simpler measures must therefore be calculated on the basis of equal-length word segments.
7. Interpersonal metadiscourse is described in the section on reader/writer interaction.
8. Textual metadiscourse markers were discussed in the section on coherence.

Chapter 5: METHODOLOGY – ANALYSIS OF WRITING
SCRIPTS

5.1 Design
The study reported here was implemented in two phases. At the beginning of
Phase 1, a pilot study was undertaken to select the most suitable discourse analytic
measures from those identified in the literature review. The main aim of the pilot
study was to identify discourse analytic measures which are successful in differ-
entiating between different levels of writing performance. Then, during the main
analysis, a large number of writing scripts were analysed using those discourse
analytic measures. Those measures successful in discriminating between scripts at
different proficiency levels during the main analysis were then used as the basis
for the descriptors during the development of the rating scale. The final part of
this first phase was the design of a new rating scale based on the findings of the
main analysis. The hypothesis is that this newly developed rating scale would be
more suitable for diagnostic purposes because it is theoretically-based (i.e. based
on the taxonomy described in Chapter 4), empirically-developed and therefore has
level descriptors which are more specific (rather than global) and avoid vague,
impressionistic terminology.

The second phase of the study involved the validation of the new rating scale for
diagnostic writing assessment. For this purpose, ten raters rated one hundred writ-
ing samples, first using the existing DELNA (Diagnostic English Language Needs
Assessment) rating scale and then the same ten raters rated the same one hundred
scripts using the new rating scale. The rating results from these two scales were
then compared. To elicit the raters’ opinions about the efficacy of the two scales, a
questionnaire was administered and a subset of the raters was interviewed.

The two phases of this research study were characterized by two different types of
research design. The first phase, the analysis of the writing scripts, followed what
Seliger and Shohamy (1989) termed ‘descriptive research’ because it is used to
establish phenomena by explicitly describing them. Descriptive research provides
measures of frequency for different features of interest. It is important to empha-
size that descriptive research does not manipulate contexts (by for example estab-
lishing groups of participants, as is often found in experimental studies). The
groups used in the analysis were pre-existing. In this study, the groups were de-
termined according to a proficiency score based on the performance of each can-
didate. The data analysis was quantitative.

The second phase employed two rather different research design features. The
first part of Phase 2, the ratings based on the two rating scales, can also be de-
scribed as a descriptive study because the ratings of the ten raters were compared

under two conditions. It is best viewed as a descriptive study comparing the scores
obtained for two groups (Seliger & Shohamy, 1989). It should be noted that the
candidates were not randomly selected and the two types of treatment (the two
rating scales) were not administered in a counterbalanced design. If the study had
displayed these two features, it could have been considered a quasi-experimental
study (Mackey & Gass, 2005; Nunan, 1992). The data analysis was quantitative
and employed statistical procedures. The second part of Phase 2, the administra-
tion of questionnaires and interviews, involved qualitative data analyzed qualita-
tively. Therefore, it can be argued that the study overall followed a mixture of
qualitative, quantitative and descriptive designs.

Figure 11 below is a visual representation of the outline of the study.

Figure 11: Outline of study

For reasons of readability, the method, results and discussion sections of the two
phases are kept separate. The method, results and discussion of Phase 2 can be
found later in this book.

The current chapter presents the research questions for both phases and a general
introduction to the context in which the whole study was conducted.

5.2 Research Questions

The overarching research question for the whole project was the following:

To what extent is a theoretically-based and empirically developed rating scale of academic writing more valid for diagnostic writing assessment than an existing, intuitively developed rating scale?

For reasons of practicality, the main research question was further divided into
three subsidiary questions, one guiding the analysis of Phase 1, and the other two
relevant to Phase 2.

Phase 1:
1. Which discourse analytic measures are successful in distinguishing between
writing samples at different DELNA writing levels?

Phase 2:
2a. Do the ratings produced using the two rating scales differ in terms of (a) the discrimination between candidates, (b) rater spread and agreement, (c) variability in the ratings, (d) rating scale properties, and (e) what the different traits measure?
2b. What are raters’ perceptions of the two different rating scales for writing?

5.3 Context of the study

5.3.1 The assessment instrument

The Diagnostic English Language Needs Assessment (DELNA) was established in 2001 to identify the academic English language needs of both ESB (English
speaking background) and EAL (English as an additional language) students fol-
lowing admission to the university, so that those found to be at risk could be of-
fered suitable English language support. The results of the test give an indication
of the students’ English language skills and those found at risk are directed to-
wards various language support options (Elder, 2003; Elder & Erlam, 2001; Elder
& von Randow, 2002).

Although it was optional for most years since 2001, since 2007 the assessment has been a requirement for all first-year undergraduate students.

DELNA consists of two parts: screening and diagnosis. The screening section in-
cludes two components: vocabulary and speed reading. It is conducted online and
takes 30 minutes. The diagnosis section, which takes two hours and is conducted

by the pen and paper method, comprises sub-tests of reading and listening (devel-
oped and validated at the University of Melbourne) and an expository writing
task, which requires students to describe a graph or information in a table and then
interpret the data. The writing component, which is the focus of this study, was
developed in-house and is scored analytically on nine 6-point scales ranging from
4 (“at high risk of failure due to limited academic English”) to 9 (“highly compe-
tent academic writer”), with accompanying level descriptors describing the nature
of the writer’s performance against each of the analytic criteria. The complete
DELNA rating scale was presented in Table 11 in the previous chapter.

Based on their DELNA results, students are advised to attend suitable courses.
EAL students might be advised to seek help in the English Language Self-access
Centre (ELSAC) and ESB students might be advised to seek writing help in the
Student Learning Centre (SLC) which provides similar assistance to writing labs
found at other universities.

While DELNA can be considered a low-stakes test in the sense that it is used for
diagnosis rather than selection purposes, its utility is dependent on the accuracy of
test scores in diagnosing students’ language needs. The writing task is therefore
assessed twice by separate raters and concerted training efforts have been made to
enhance the reliability of scoring (Elder, Barkhuizen, Knoch, & von Randow,
2007; Elder et al., 2005; Knoch, Read, & von Randow, 2007).

5.3.2 The raters

The DELNA raters are all experienced teachers of English and/or English as a
second language. All raters have high levels of English language proficiency al-
though not all are native-speakers (NS) of English. Some raters are certified
IELTS (International English Language Testing System) examiners whereas oth-
ers have gained experience of writing assessment in other contexts. All raters take
part in regular training sessions which are conducted throughout the year both
online and face-to-face (Elder et al., 2007; Elder et al., 2005; Knoch et al., 2007).

5.3.3 The tasks

At the time of the study, five DELNA writing prompts were in use, all of which
follow a similar three-part structure. Students are required to first describe a graph
or table of information presented to them. This graph or table consists of some
simple statistics requiring no specialist knowledge. Students are then asked to in-
terpret this information, suggesting reasons for any trends observed. In the final
part, students are required to either compare this information with the situation in

their own country, or suggest ideas on how this situation could be changed or dis-
cuss how it will impact on the country.

The writing task has a set time limit of 30 minutes. Students can, however, hand
in their writing earlier if they have finished.

A multi-faceted Rasch analysis (Rasch, 1980) using the computer program FACETS (Linacre, 1988, 2006) was conducted to establish the difficulty of the
prompts. One prompt was found to be marginally more difficult than the others
and was therefore excluded from any further analysis.

5.3.4 The students

The students taking DELNA are generally undergraduate students, although some
are postgraduates. More detailed background information on the students whose
writing samples were investigated in this study will be provided in the methodol-
ogy section of the main analysis of Phase 1 later in this chapter.

5.4 Phase 1: Analysis of writing scripts

5.4.1 Introduction:

The method section below describes the analysis of the writing scripts in more
detail. Because a number of suitable measures were identified in the review of the
literature, a pilot study was first undertaken to finalise the discourse analytic
measures to be used in the main analysis of Phase 1. Two criteria were stipulated
to ensure measures were suitable. Firstly, each measure had to differentiate be-
tween writing at the different band levels of DELNA. Secondly a measure had to
be sufficiently simple to be transferable into a rating scale.

5.4.2 Procedures (Pilot study):

5.4.2.1 Data collection and selection:

After gaining ethics approval to use the DELNA scripts for research purposes, the
scripts for the pilot study were selected. This involved cataloguing the scripts
available from the 2004 DELNA administration into a data base. All scripts were
given a running ID number, which was recorded on the script and also entered
into the data base. Other information recorded in the data base included scores
awarded for each script by the two different raters and a number of background
variables which will be described in more detail in the methodology section of the
main study of Phase 1 later in this chapter. For the purpose of the pilot study, the

six levels used in the current rating scale were collapsed into three levels. The ra-
tionale for this was that the pilot study was conducted only on a small number of
scripts and by collapsing the levels it was hoped that the analysis would yield
clearer results. Fifteen scripts, five at each of the three resulting proficiency lev-
els, were randomly selected from the 2004 administration of the DELNA assess-
ment. The only selection criterion used for these scripts was that the two raters
had agreed on the level of the essay. The three groups of scripts will henceforth be
referred to as ‘low’ for scripts of levels 4 and 5, ‘middle’ for scripts of levels 6
and 7, and ‘high’ for scripts of levels 8 and 9.

5.4.2.2 Data analysis:

Analysis of the pilot study was undertaken manually by the researcher. The sec-
tion below outlining the method and results of the pilot study explains the process
taken during the pilot analysis and why certain measures were further pursued or
adjusted according to the data in hand. Because of the extremely small sample
size in the pilot study, no inferential statistics were calculated and the data was not
double coded. Coding for inter-rater reliability was, however, undertaken in the
main study.

Because the methodology of the pilot study is described in detail below (including definitions for each measure), these definitions are not repeated in the description of the methodology of the main analysis.

5.4.3 Results from the pilot study

5.4.3.1 Accuracy:

Because several measures of accuracy were deemed potentially suitable by Wolfe-Quintero et al. (1998) and other authors, the decision was made not to be too se-
lective before the pilot study, but rather to see which of the measures were best
suited to the data.

For the purpose of this study, error was defined, following Lennon (1991), as ‘a
linguistic form or combination of forms which, in the same context and under
similar conditions of production, would, in all likelihood, not be produced by the
speakers’ native speaker counterparts’ (p. 182). A t-unit was defined following
Hunt (1965) as containing ‘one independent clause plus any number of other
clauses (including adverbial, adjectival, and nominal)’. A clause was defined as ‘a
group of words containing a subject and a verb which form part of a sentence’. An
independent clause was defined as ‘a clause that can stand alone as a sentence’.

T-unit boundaries were located at each full-stop (following Schneider and Con-
nor, 1990), as well as at boundaries between two independent clauses as a t-unit is
defined as an independent clause with all its dependent clauses. Therefore, typi-
cally, t-unit boundaries occur before co-ordinating conjunctions like and, but, or,
yet, for, nor, so. As some of the data that forms part of this study was written by
learners of English, occasionally there were problems deciding on t-unit bounda-
ries because at times either the main verb or the subject (or both) were omitted. It
was therefore decided that, to qualify as a t-unit, the independent clause needed to have both a subject and a main verb. Only the placement of a full stop by a student could override this rule. For example, the sentence ‘the rise in opportunity for students.’ was coded as a t-unit, although no verb is present, because the writer had placed a full stop at the end.

Below, a sample extract from the pilot study is reproduced (Figure 12). Errors are
marked in bold (with omissions indicated in square brackets), t-unit boundaries
are indicated with // whilst clause boundaries are indicated with a /.

<84>
The graph indicates [missing: the] average minutes per week spent on hobbies and games
by age group and sex. //

The males age between 12-24 years old spent the most time on hobbies and games.// It is
indicated approximately 255 minutes per week.// As comparison, female in the same age
group spent around 90 minutes on hobbies and games.// Males spent [missing: the] least
time on hobbies and games at 45-54 years old// but females spent [missing: the] least time
on hobbies and game at 25-34 years old.// As we can see, both sexes increase their time on
hobbies and games after 45-54 years old. //
Figure 12: Sample text for accuracy

Results of the pilot study can be seen in Table 16 below. The results are arranged by the three different proficiency levels (low, middle, high) as described above. Table 16 displays the means and standard deviations for each measure at each level. It becomes clear from this analysis that all the measures were successful in distinguishing between the different levels, although some were more successful than others. Among these measures were error-free t-units, error-free clauses and errors/clause. The percentage of error-free t-units was selected for the main analysis as this measure might be the easiest for the raters to apply and is unaffected by the length of the script.
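
As a minimal sketch of how such counts translate into the accuracy measures reported in Table 16 (assuming t-units, clauses and errors have already been segmented and coded manually), the ratios could be computed as follows; all counts below are invented for illustration.

    def accuracy_measures(t_units, error_free_t_units, clauses, error_free_clauses, errors):
        """Derive the pilot-study accuracy measures from manually coded counts for one script."""
        return {
            "error-free t-units": error_free_t_units,
            "error-free t-unit ratio": error_free_t_units / t_units,
            "error-free clause ratio": error_free_clauses / clauses,
            "errors per t-unit": errors / t_units,
            "errors per clause": errors / clauses,
        }

    # Invented counts for a single script.
    print(accuracy_measures(t_units=18, error_free_t_units=6,
                            clauses=30, error_free_clauses=13, errors=25))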

Table 16: Descriptive statistics – accuracy
                              Low             Middle          High
Accuracy                      Mean    SD      Mean    SD      Mean    SD
Error-free t-units            1.4     1.14    6.4     2.30    15.6    1.82
Error-free clauses            5.67    1.75    13.33   4.32    30.67   4.84
Error-free t-units/t-units    0.08    0.04    0.32    0.10    0.84    0.11
Error-free clauses/clauses    0.23    0.05    0.41    0.10    0.95    0.03
Errors/t-units                2.21    0.18    1.43    0.26    0.07    0.16
Errors/clause                 1.36    0.20    0.75    0.14    0.03    0.01

5.4.3.2 Fluency:

Fluency was divided into two separate aspects: writing rate and repair fluency ac-
cording to the findings of the literature review. Writing rate (temporal fluency)
was operationalised as the number of words. This measure was possible because
the essays were written under a time limit and these conditions were the same for
all students taking the assessment. It is however possible that some students did
not utilize the whole time available; therefore this measure needs to be interpreted
with some caution.

Repair fluency was operationalised as the number of self-corrections, which was defined as ‘any instances of insertions or deletions a student has made to his/her
text’. In more detail, self-corrections were defined as in Table 17 below.

Table 17: Definition of self-correction used in this study

Self-correction: any instance of self-correction by itself. This can be just crossed out letters or
words or longer uninterrupted stretches of writing, which can even be as long as a paragraph. In-
sertions also count as one no matter how long the insertion is. If there are an insertion and a dele-
tion in the same place, then this counts as two.

Number of words in self-corrections: These are all the words (or individual free-standing at-
tempts at words) that have been deleted plus the number of words inserted.
If there is a deletion as part of an insertion or an insertion as part of a deletion, then it is counted as
part of the larger part in the number of words, but not counted separately.
If a letter is written over by another letter, it is not counted as two self-corrections, but just as one.
Deletions that range over two sentences or two paragraphs are counted as one.
Scripts where it is apparent that a correction has been rubbed out, are marked as ‘pencil’ and ex-
cluded from any further analysis as the exact number of insertions or deletions cannot be estab-
lished.

It was furthermore of interest whether, apart from the number of self-corrections,
there was any difference in the average length of each self-correction produced by
the writers at different levels.

Table 18: Descriptive statistics – fluency
                                     Low             Middle          High
Fluency                              Mean    SD      Mean    SD      Mean    SD
Words                                212.4   82.25   245.4   56.55   341.4   48.77
No. of self-corrections              29.4    11.14   15.8    7.82    9.8     7.82
Average length of self-correction    21.4    20.06   21      17.79   19      11

The results for the analysis of fluency can be found in Table 18 above. It is clear
from the table that the number of words and the number of self-corrections were
successful measures, whilst the average length of self-correction was not. There-
fore, only the number of words and the number of self-corrections were used in
the main analysis.
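
The two retained fluency measures, together with the discarded average length of self-correction, can be expressed as in the sketch below; this is a hypothetical illustration only, since the self-correction counts come from the manual coding described in Table 17.

    import re

    def fluency_measures(text, n_self_corrections, words_in_self_corrections):
        """Writing rate and repair fluency for one script; self-correction counts are coded manually."""
        n_words = len(re.findall(r"[A-Za-z']+", text))
        average_length = (words_in_self_corrections / n_self_corrections
                          if n_self_corrections else 0.0)
        return {
            "number of words": n_words,
            "number of self-corrections": n_self_corrections,
            "average length of self-correction": average_length,
        }

    # Invented example: a short script with three coded self-corrections totalling seven words.
    print(fluency_measures("The graph shows a steady rise in arrivals.", 3, 7))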

5.4.3.3 Grammatical complexity:

As for accuracy, the most promising measures of grammatical complexity identified in the literature review were applied to the data. Results are presented in Ta-
ble 19 below.

The same definitions of clauses and independent clauses were used as in the sec-
tion on accuracy. A dependent clause was defined as ‘a clause that cannot stand
on its own in the sense that it depends on another clause for its meaning’.

Table 19: Descriptive statistics – grammatical complexity
                                Low             Middle          High
Grammatical complexity          Mean    SD      Mean    SD      Mean    SD
Clauses per t-unit              1.63    0.15    1.8     0.09    1.99    0.33
Dependent clauses per t-unit    0.65    0.05    0.85    0.07    1.13    0.24
Dependent clauses per clause    0.55    0.06    0.42    0.07    0.35    0.06

Table 19 above shows that all three measures distinguished between the three
groups of writing scripts, although there was considerable overlap. Because both

clauses per t-unit and dependent clauses per t-unit ultimately measure the same
construct, only the measure ‘clauses per t-unit’ was used in the main analysis.
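
As with accuracy, these complexity ratios are simple functions of per-script counts; a minimal sketch with invented counts is shown below.

    def complexity_measures(t_units, clauses, dependent_clauses):
        """Grammatical complexity ratios from manually coded counts for one script."""
        return {
            "clauses per t-unit": clauses / t_units,
            "dependent clauses per t-unit": dependent_clauses / t_units,
            "dependent clauses per clause": dependent_clauses / clauses,
        }

    # Invented counts for a single script.
    print(complexity_measures(t_units=15, clauses=27, dependent_clauses=12))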

5.4.3.4 Lexical complexity:

With regard to lexical complexity, the measures identified as successful in the literature review were trialled on the data. For this analysis, sophisticated lexical words were defined as words that are part of the Academic Word List (Coxhead, 2000) plus offlist words. Offlist words are words that are not included in any of the word lists (1000-, 2000-wordlist or Academic Word List). These are usually less frequent words (e.g. lifestyle, landscape). Lexical words were defined as content words, i.e. nouns, verbs, adjectives and most adverbs (as opposed to function or grammatical words). Word types were defined as the different words occurring in a text, as distinct from word tokens. The results can be seen in Table 20 below.

Table 20: Descriptive statistics – lexical complexity
                                                   Low             Middle          High
Lexical complexity                                 Mean    SD      Mean    SD      Mean    SD
Sophisticated lexical words/total lexical words    0.13    0.03    0.16    0.05    0.18    0.06
Sophisticated word types/total word types          0.09    0.04    0.12    0.02    0.16    0.02
Word types/total words                             0.46    0.01    0.44    0.11    0.48    0.04
Percentage words from Academic Word List           5.73    1.63    8.44    0.67    9.48    1.27
Average word length                                4.49    0.27    5.02    0.17    5.19    0.19

In this case not all measures were equally successful in differentiating between the
different levels of data. All measures except ‘word types per total words’ were
able to differentiate between the levels. The variables used for the main analysis
were ‘the average word length’, ‘the number of sophisticated lexical words over
the total number of lexical words’ and ‘the percentage of words from the Academic Word List’.
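
A minimal sketch of how these lexical measures could be computed is given below. The two word lists are tiny, invented stand-ins for the 1000-/2000-word lists and the Academic Word List, and for simplicity the sketch operates on all tokens rather than on the lexical (content) words used in the study.

    import re

    # Tiny invented stand-ins for the real frequency bands.
    FIRST_2000 = {"the", "graph", "shows", "a", "people", "spend", "more", "time", "on", "games"}
    ACADEMIC_WORD_LIST = {"data", "trend", "percentage", "significant", "period"}

    def lexical_measures(text):
        """Simple lexical complexity indices for one script (all-token approximation)."""
        tokens = re.findall(r"[a-z]+", text.lower())
        awl = [t for t in tokens if t in ACADEMIC_WORD_LIST]
        # 'Sophisticated' words: AWL words plus off-list words (not in any band).
        sophisticated = [t for t in tokens if t in ACADEMIC_WORD_LIST or t not in FIRST_2000]
        return {
            "average word length": sum(len(t) for t in tokens) / len(tokens),
            "word types/total words": len(set(tokens)) / len(tokens),
            "% words from AWL": 100 * len(awl) / len(tokens),
            "sophisticated words/total words": len(sophisticated) / len(tokens),
        }

    print(lexical_measures("The graph shows a significant trend: people spend more time on games."))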

5.4.3.5 Mechanics:
To measure mechanics, accuracy of punctuation, spelling, capitalisation and para-
graphing was assessed. Punctuation errors were defined as ‘errors in the placing
of full stops’. Commas were not included as accurate comma use is hard to opera-
tionalise. Other punctuation marks were not included as they were used only
rarely. Full stop mistakes are indicated by a / (slash) in the example (Figure 13)
below.

There are many factors that may have impacted on these trends,/firstly there was a change
of laws as the Australian government decided to discontinue New Zealand citizens from ob-
taining Australian benefits,/ this prevented many low-socio-economic families from migrat-
ing to Australia.
Figure 13: Sample text with punctuation mistakes

Spelling errors were defined as ‘any errors in spelling’. The example below (Fig-
ure 14) has the spelling mistake highlighted.

And the reason for a drop in 15-64 is the job oppotunities in New Zealand has a significant
decrease
Figure 14: Sample text with spelling error

Capitalisation errors were defined as (a) failure to use a capital letter for a noun
where it is required in English or (b) an inappropriate use of a capital letter. The
following example sentence (Figure 15) has all errors in capitalisation marked in
bold.

The trend of weekly time spent on hobbies and games by males and females of Third world
countries might be different to that of New Zealand, Australia and European Countries.
Figure 15: Sample text with capitalisation errors

For paragraphing, it was decided not to adopt Kennedy and Thorp’s (2002) sys-
tem of simply counting paragraphs produced, as this did not return very meaning-
ful results. Instead, a new measure was developed. It was assumed that because of
the nature of the task, a five-paragraph model could be expected. Because each
task is divided into three main sections, the writers should ideally produce a para-
graph on each of these sections as well as an introduction and conclusion. This
means that paragraphing was measured very mechanically. The maximum number
of points a writer could score in this section was five, one point for each para-
graph. If students further divided any of these paragraphs, that was still only
counted as one (i.e. if a writer produced three paragraphs as part of the interpreta-
tion section, that was scored only as one point, not as three). If a writer connected

for example the introduction and the data description into one, this was scored as
one, not as two, because only one paragraph was produced. If it was logical, writ-
ers could also have body paragraphs that described a part of the data and then
gave the reasons for that piece of data, and then a separate paragraph for the next
data and reasons etc, but not more than two were counted. Also, if one part of the
question clearly was not answered, then the writer would not be able to score full
points. Below (see Table 21) are some examples of how students divided their
texts and how they were scored (/ indicates a paragraph break). It should be noted
that this was a very mechanical way of scoring and that no regard was taken of
organisation within paragraphs, which was partly covered by coherence.

Table 21: Examples of paragraphing


a) data description + interpretation / conclusion = 2
b) introduction + data description + interpretation 1 / interpretation 2 / interpretation 3 + implica-
tion=3
c) introduction / data 1 + interpretation/ data 2 + interpretation / data 3 + interpretation = 4
d) introduction / data description / interpretation / implication / implication + conclusion = 5

The results of the analysis of mechanics can be found in Table 22 below. The figures for punctuation, spelling and capitalisation indicate the average number of errors per essay, whilst the scores under paragraphing denote the analysis of paragraphing as described above.

Table 22: Descriptive statistics – mechanics
                  Low             Middle          High
Mechanics         Mean    SD      Mean    SD      Mean    SD
Punctuation       2.3     2.07    2       1.4     0       0
Spelling          8.17    5.7     3.8     2.56    0.33    0.52
Capitalisation    1       1       2.2     1.92    0       0
Paragraphing      2       1       3.2     4.5     4.2     0.84

Table 22 shows that whilst punctuation and spelling mistakes decreased as the writing level increased, the same was not the case for capitalisation. In the case of paragraphing, students of higher writing ability used more paragraphs than lower level writers. However, there was much overlap. Punctuation, spelling and paragraphing were analysed in the main study.

5.4.3.6 Coherence:

Analysis of the textual metadiscoursal markers introduced by Crismore et al. (1993) showed that the writers used very few sequencers, code glosses, reminders, illocutionary markers and topicalizers. Writers of lower level essays used less than one of these per essay, whilst the writers of the high-level essays used on average two. Because these devices were found infrequently, they were excluded from any further analysis. Logical connectives were analysed as part of cohesion.

A topical structure analysis based on the categories proposed by Schneider and Connor (1990) was undertaken. These authors used the categories of parallel progression, sequential progression and extended parallel progression to successfully differentiate between writing at three levels of the Test of Written English (as was
described in the literature review). In parallel progression, the topic of a t-unit is
identical to the topic of the preceding t-unit. In sequential progression, the topic of
a t-unit relates back to the comment of the previous t-unit. In extended parallel
progression, the topic of a t-unit is identical to a topic of a t-unit before the immediately preceding t-unit. As part of their discussion, Schneider and Connor suggested three subcategories of sequential progression. The first they termed ‘directly related sequential progression’. This includes (a) the comment of the previ-
ous t-unit becoming the new topic, (b) word derivations (e.g. science, scientist)
and (c) part-whole relations (e.g. these groups, housewives, children, and old peo-
ple). The second subcategory was termed ‘indirectly related sequential topics’
which include related semantic sets (e.g. scientists and the invention of the radio).
The final subcategory was ‘unrelated sequential topics’ where the topic does not
relate back to the previous t-unit. An initial analysis using these categories (i.e.
parallel progression, the three subcategories of sequential progression and ex-
tended parallel progression) showed that for the current data, this differentiation
only partially works (see Table 23 below). The table below expresses in percent-
ages the extent to which each type of progression was used in each writing script.

Table 23: Coherence based on Schneider and Connor (1990)
Coherence               Low     Middle   High
Parallel                33%     27%      17%
Direct sequential       13%     16%      29%
Indirect sequential     15%     8%       21%
Unrelated sequential    38%     49%      33%
Extended parallel       1%      1%       0%

Table 23 above shows that as the level of the essays increased, students made use
of less parallel progression and more direct sequential progression (as was found
by Schneider and Connor). However, indirect and unrelated sequential progres-
sion did not follow a clear pattern. Very few instances of extended parallel pro-
gression were found.

A further, more detailed analysis of the category of unrelated sequential progres-
sion made it clear, however, that more categories were necessary. For example, it
was found that, especially in the higher level essays, a large percentage of the t-units found unrelated in the above analysis were in fact perfectly coherent because the writer introduced the topic at the beginning of a paragraph or used a linking device to create coherence. According to Schneider and Connor, cases like these were not recognised as being coherent as they did not conform to the above categories. An analysis revealed, however, that more skilful writers use linking devices or paragraph introductions quite commonly. For the final analysis, both these categories were analysed together in one category called superstructure. Superstructure therefore creates coherence by a linking device instead of topical progression.

Another category created after the more detailed analysis was the category of co-
herence breaks. In this case, the writer attempts coherence but fails. This might be
caused by either an incorrect linking device or an erroneously used pronominal
reference.

Apart from the two new categories created for this analysis, two other categories
of topical structure analysis were adapted from the literature. Firstly, indirect se-
quential progression was extended to indirect progression, to include cases in
which the topic of a t-unit indirectly links back to the previous topic. Similarly,
extended parallel progression was changed to extended progression to include an
extended link back to an earlier comment. Table 24 below shows all categories of
topical structure used in the pilot study. Definitions and examples are also sup-
plied.

Table 24: Definitions and examples of topical structure analysis categories

1. Parallel progression
Topics of successive sentences are the same (or synonyms)
<a,b> <a,c>
Maori and PI males are just as active as the rest of NZ. They also have other interests.
2. Direct sequential progression
The comment of the previous sentence becomes the topic of the following sentence
<a,b> <b,c>
The graph showing the average minutes per week spent on hobbies and games by age group
and sex, shows many differences in the time spent by females and males in NZ on hobbies and
games. These differences include on age factor.
3. Indirect progression
The topic or comment of the previous sentence becomes the topic of the following sentence.
The topic or comment is only indirectly related (by inference, e.g. related semantic sets)
<a,b> <indirect a, c> or <a,b> <indirect b, c>

The main reasons for the increase in the number of immigrates is the development of some
third-world countries. e.g. China. People in those countries has got that amount of money to
support themselves living in a foreign country.
4. Superstructure
Coherence is created by a linking device instead of topic progression
<a,b> <linking device, c,d>
Reasons may be the advance in transportation and the promotion of New Zealand's natural
environment and "green image". For example, the filming of "The Lord of the rings"
brought more tourist to explore the beautiful nature of NZ.
5. Extended progression
The topic or comment before the previous sentence becomes the topic of the new sentence
<a,b> ... <a,c> or <a,b> ... <b,c>
The first line graph shows New Zealanders arriving in and departing from New Zealand be-
tween 2000 and 2002. The horizontal axis shows the times and the vertical axis shows the
number of passengers which are New Zealanders. The number of New Zealanders leaving
and arriving have increased slowly from 2000 to 2002.
6. Coherence break
Attempt at coherence fails because of an error
<a,b> <failed attempts at a or b or linker, c>
The reasons for the change on the graph. It’s all depends on their personal attitude.
7. Unrelated progression
The topic of a sentence is not related to the topic or comment in the previous sentence
<a,b> <c,d>
The increase in tourist arrivers has a direct affect to New Zealand economy in recent years.
The government reveals that unemployment rate is down to 4% which is a great news to all
New Zealander’s.

Table 25 below presents the results of the pilot study. The mean scores in the table
are the mean percentages of each category found in essays at that level.

The table shows that as students progressed, they used less parallel progression,
more direct sequential progression, more indirect progression and more super-
structure. Higher level students produced fewer coherence breaks and less unre-
lated progression. Extended progression showed no clear trend over the different
levels of writing. All categories were, however, included in the main analysis of
the data as, to calculate percentage of usage, all types of progression were re-
quired.
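
The percentages reported in Table 25 below are simply the proportions of manually coded t-unit transitions falling into each category; a minimal sketch of this calculation, using an invented coding of one script rather than data from this study, is shown here.

    from collections import Counter

    CATEGORIES = ["parallel", "direct sequential", "indirect", "superstructure",
                  "extended", "coherence break", "unrelated"]

    def progression_percentages(coded_transitions):
        """Percentage of a script's t-unit transitions in each topical-structure category."""
        counts = Counter(coded_transitions)
        total = len(coded_transitions)
        return {category: 100 * counts[category] / total for category in CATEGORIES}

    # Invented coding of one script's t-unit transitions.
    script = ["parallel", "direct sequential", "indirect", "indirect",
              "superstructure", "unrelated", "coherence break", "direct sequential"]
    print(progression_percentages(script))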

Table 25: Descriptive statistics – coherence
                                Low             Middle          High
Coherence                       Mean    SD      Mean    SD      Mean    SD
Parallel progression            14      8.6     13.4    14.5    11.2    4.02
Direct sequential progression   6.6     11.26   9.2     6.38    13.33   6.77
Indirect progression            14.66   9.18    24      5.74    25      13.51
Superstructure                  3.83    4.83    10.4    6.58    18.2    18.19
Extended progression            8.66    14.08   4.4     7.37    9.6     12.82
Coherence break                 15.33   12.13   8       6.96    2.8     3.90
Unrelated progression           24.66   10.41   22      11.87   13.6    6.58

5.4.3.7 Cohesion:

As discussed in the literature review, several categories of cohesion were described by Halliday and Hasan (1976). The first category is reference. Reference was operationalised as the number of anaphoric pronominals. An example appears in Figure 16 below:

But the old people are emmigrating to the green countries like Australia or New Zealand.
Because they need a better environment to live in for the rest of their life.
Figure 16: Anaphoric pronominal

Very few instances of ellipsis and substitution were found in the data, and these measures were therefore excluded from the main analysis.

Examples of linking devices, or conjunctions, can be found in the examples below (Figure 17). An initial analysis showed that differentiating between correctly and
incorrectly used linking devices was not worthwhile, because very few were used
clearly incorrectly. Rather they were used very mechanically. It was therefore de-
cided that a count of the number of linking devices per text would be more fruit-
ful.

Public can also gain better nutrious products. Therefore the life span increases over time.

As time goes by, more and more elders would stay at home and could not devote themselves
to the society. Less young people could actually work for NZ society and might make NZ’s
economy be worse and non-competitive.

Furthermore, the population trends in NZ are more likely as European countries which pro-
vide sufficient medical facilities, many nutrious products and better education. However,
many countries, such as Africa or india are quite different from NZ with many young chil-
dren in one family.
Figure 17: Linking devices

Lexical chains were defined as ‘a group of lexical items which are related to each
other by synonymy, antonymy, meronymy, hyponymy or repetition’. In the exam-
ple below (Figure 18), a complete text is reproduced. Lexical chains that weave
through the text are indicated in superscript and bold writing. The lexical chain
indicated with number one relates to the different age groups mentioned in the
data. The lexical chain indicated by a two in superscript is made up of lexical
items that describe an increase. The third lexical chain (indicated with a three) re-
lates to health and medicine, whilst the last lexical chain (indicated by a zero) re-
lates to work and the economy.

The table 1 shows that the age group 15-64 years old¹ occupies the greatest portion among
the three groups¹ from year 1996 to 2051. The age group above 65 years old¹ has the
smalles portion compared with the other two groups¹. However, the percentage of age
group above 65 years old¹ keeps increasing² while the percentage of the other two age
groups¹ increase². Furthermore, The population¹ is growing² an the average age is also in-
creasing² from year 1996 to 2051. There are two possible reasons for the increasing² in

Population¹ over time. One is the modern medical technology³. People¹ could access to the
medical facilities³ which can provide better medical facilities³ which can provide better
medical services³ and improve public’s¹ health³. Public¹ can also gain better nutrious
products³.

Therefore the life span increases² over time. The other reason is that better education makes
people¹ know how to keep a healthy³ life for themselves.

In addition, as time goes by, more and more elders¹ would stay at homeº and could not de-
vote themselves to the society. Less young people¹ could actually workº for NZ society and
might make NZ’s economyº be worse and non-competitiveº.

Furthermore, the population¹ trends in NZ are more likely as European countries which
provide sufficient medical facilities³, many nutrious products³ and better education. How-
ever, many countries, such as Africa or india are quite different from NZ with many young
children¹ in one family.
Figure 18: Example of lexical chains

Table 26 lists the findings of the cohesion analysis. It shows that higher level
writers used more anaphoric pronominals and fewer linking devices. Higher level
writers also used more lexical chains1.

Table 26: Descriptive statistics – cohesion


Low Middle High
Cohesion Mean SD Mean SD Mean SD

No. of anaphoric pronominals 2.8 1.48 5.6 2.61 8.2 2.59


No. of linking devices 9.4 2.19 5.6 2.3 4.4 2.41
No. of lexical chains 3.0 .071 4.2 .84 5.8 .84

For the main study, it was decided to use the number of anaphoric pronominals and the number of linking devices. Measuring
the number of lexical chains was found to be very time-consuming. Also because
it is a high inference measure and rater reliability would be hard to achieve, it was
deemed unsuitable for both the main analysis and the rating scale.

5.4.3.8 Reader-Writer interaction:

The following aspects of reader/writer interaction were investigated: hedges, boosters, markers of writer identity, attitude markers, commentaries and the use of
the passive voice. The analysis of the trial data demonstrated that the writers of
these essays used very few instances of attitude markers, commentaries and mark-
ers of writer identity. These were therefore excluded from the pilot study. Interest-
ing results were found, however, for the use of hedges, boosters and passive
voice. For this study, hedges were defined as ‘ways in which authors tone down
their claims’ (Hyland, 2000a). Examples highlighted in bold appear in the follow-
ing extract (Figure 19):

The leap by 12% in this range for 2051 will likely impact a) the workforce: costs to pay for
the elderly may be higher; more + more of the population approaching 65+ + after may
choose to stay in the workforce longer.
Figure 19: Hedges

Boosters were defined as ‘ways in which writers emphasise their assertions’. In-
stances of boosters can be found in the example below (Figure 20).

In New Zealand, the population trends represented unsignificantly from the past to present
time. But there is a clearly change for the population trends in future.
Figure 20: Boosters

Finally, an instance of the passive voice can be seen in the following example
(Figure 21).

This big progress could have been achieved by investing more in promoting
accurate driving habit, such as driving at safe speed, fasterning seat belt and
so on.
Figure 21: Passive voice

The results for the analysis of reader/writer interaction can be found in Table 27
below. The table shows that as students’ writing ability increased, they used more
hedges and fewer boosters, and writers at the highest level made use of the pas-
sive voice more than the lower two levels. Although very few instances of writer
identity were found in the sample used for the pilot study, it was decided that this
measure would be pursued in the main analysis in order to see, first, if a relation-
ship could be found between the use of the passive voice and markers of writer
identity and also because it is very easy to analyse with the help of a concor-
dancing program. Hedges, boosters, markers of writer identity and passive voice
were included in the main study.

Table 27: Descriptive statistics - reader/writer interaction


Low Middle High
Reader Writer interaction Mean SD Mean SD Mean SD
Hedges 3.6 1.34 7.4 2.30 8.8 3.63
Boosters 7.2 2.39 5.4 2.07 2.8 .84
Passive voice .60 .89 3.8 2.39 4.8 .84

5.4.3.9 Content:

As with paragraphing, no empirical measure of content was identified in the literature review. Therefore, a measure specific to the DELNA task was developed.
Twelve current DELNA raters were asked to produce sample answers to the four
prompts that were used as part of this study. They were instructed to take no
longer than the 30 minutes allocated to students and were given the same task
sheets that students use when taking the assessment.

Table 28: Descriptive statistics: content


Low Middle High
Content Mean SD Mean SD Mean SD
Data description 2.8 1.1 6.2 1.64 7.2 0.84
Data interpretation 1.8 1.3 3.6 0.55 5.0 1.22
Part three 1.2 1.3 4.2 1.30 5.8 1.3

The scripts written by the DELNA raters were deemed to be model answers.
These model answers were then analysed in three stages in terms of their content.
Firstly, the content of the data description section was analysed. Here, the types
of information produced by most raters in their task answers were recorded. In-
formation from the prompt which was usually summarised, or not mentioned at all
in the answers, was also recorded in this analysis. The same was done for the
other two sections in the writing task, the interpretation of data and Part three in
which writers are asked to either discuss any future possible developments or de-
scribe the situation in their own country. In these two parts each proposition made
by the model answers was noted down.

After the model answers were examined, a scoring system was developed as fol-
lows: For section one (data description), each trend described correctly was given
one mark, and each trend described by the appropriate figures was given another
point. For sections two (interpretation) and three, each proposition was given one
point. For sections two and three, writers were also given additional points for
supporting ideas.
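As an illustration of this scoring scheme, the sketch below applies the rules just described to a single hypothetical script (all counts and variable names are invented; the actual scoring was carried out by hand against the model answers).

```python
# Hypothetical counts for one script, scored with the scheme described above.
trends_described_correctly = 4   # section 1: one mark per correctly described trend
trends_supported_by_figures = 3  # section 1: one extra mark per trend given with figures
propositions_part_2 = 3          # section 2: one mark per proposition
supporting_ideas_part_2 = 2      # section 2: additional marks for supporting ideas
propositions_part_3 = 4          # section 3: one mark per proposition
supporting_ideas_part_3 = 1      # section 3: additional marks for supporting ideas

data_description = trends_described_correctly + trends_supported_by_figures
data_interpretation = propositions_part_2 + supporting_ideas_part_2
part_three = propositions_part_3 + supporting_ideas_part_3
print(data_description, data_interpretation, part_three)
```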

The table above shows the findings for the pilot study (Table 28). From the table
it can be seen that higher level writers described more of the data provided and
that they also provided more ideas and supporting arguments in the second and
third part of the essay.

Table 29: Measures to be used in the main analysis of Phase 1


Constructs                    Measures
Accuracy                      Percentage error-free t-units
Fluency - repair              No. of self-corrections
        - temporal            No. of words
Complexity - grammatical      Clauses per t-unit
           - lexical          Sophisticated lexical words per total lexical words
                              Average word length
                              No. of AWL words
Mechanics                     No. of spelling mistakes
                              No. of punctuation mistakes
                              Paragraphing
Coherence                     % Parallel progression
                              % Direct sequential progression
                              % Indirect progression
                              % Unrelated progression
                              % Superstructure
                              % Coherence breaks
                              % Extended progression
Cohesion                      No. of anaphoric pronominal references
                              No. of connectors
Reader/Writer Interaction     No. of hedges
                              No. of boosters
                              No. of markers of writer identity
                              No. of passive voice verbs
Content                       % of data described correctly
                              No. of propositions in part 2 of task
                              No. of propositions in part 3 of task

All these measures were seen as useful for the main analysis. Because the
amounts of data provided by the different tasks varied slightly, it was decided to
convert the score for the data description into a percentage score which represents
the amount of data described out of the total data that could be described.

Based on the pilot study reported above, the measures in Table 29 were chosen for
the main study.

5.5 Main study Phase 1

The following section will briefly describe the writing scripts collected as part of
the 2004 administration of the DELNA assessment. Of the just over two thousand
scripts, 601 were randomly chosen for the main analysis.

5.5.1 Instruments:

5.5.1.1 Writing scripts:

Five prompts were used in the administration of DELNA in 2004. Table 30 below
illustrates the distribution of these prompts across the scripts in the sample. As mentioned previously,
scripts on prompt five were excluded based on a FACETS analysis (in which
prompt was specified as a facet), which showed that it was marginally more diffi-
cult than the others.

Table 30: Percentages of different prompts used in sample


Task Frequency Percentage
1 176 29.3%
2 93 15.5%
3 171 28.4%
4 161 26.7%
TOTAL 601 100%

The length of the scripts ranged from 47 to 628 words, with a mean of 270 words.
Deletions were not part of the word count. All scripts were originally written by
hand and then typed for the analysis.

Table 31 below shows the distribution of final scores awarded to the writing
scripts. This is based on the averaged final score from both raters. It can be seen
that no scripts were awarded a nine overall by both raters.

Table 31: Final marks awarded to scripts in sample


Final Mark Frequency Percentage
4 12 2%
5 115 19%
6 276 46%
7 172 29%
8 26 4%
9 0 0%
TOTAL 601 100%

5.5.2 Participants
5.5.2.1 The writers:

Several background variables were available for the participants, because DELNA
students routinely fill in a background information sheet when booking their as-
sessment. Here, gender, age group and L1 of the students in the sample are re-
ported.
Table 32 below shows that there were somewhat more females in the sample
overall.

Table 32: Gender distribution in sample


Gender Frequency Percentage
Female 329 55%
Male 247 41%
Not specified 25 4%

Table 33 below shows that most students fell into the under-20 category. Very few writing scripts in the sample were produced by writers aged 41 or above.

The L1 of the students was also noted as part of the self-report questionnaire. Ta-
ble 34 below shows that the two largest L1 groups were students speaking an East
Asian language as L1 (41%), closely followed by students with English as their
first language (36%). Other L1s included in the sample were European languages
other than English (9%), Pacific Island languages (4%), languages from Paki-
stan/India and Sri Lanka (4%) and others (3%). A further 3% of students did not
specify their L1.

Table 33: Age distribution in sample


Age group Frequency Percentage
Under 20 340 57%
20 – 40 225 37%
41 or above 14 2%
Not specified 22 4%
TOTAL 601 100%

Table 34: L1 of students in sample


L1 Frequency Percentage
English 217 36%
East Asian language 248 41%
European language 52 9%
Pacific Island language 26 4%
Language from Pakistan/India/Sri Lanka 21 4%
Other 19 3%
Not specified 18 3%
Total 601 100%

Drawing on the information above, the distribution of the final average writing
mark in relation to the test takers’ L1 was calculated. Table 35 shows that almost
all students scoring an eight overall were native speakers of English, while the
largest number scoring lower marks (fours or fives) were from Asian back-
grounds. Test takers that did not specify their language background were not in-
cluded in this table.

Table 35: Marks awarded to different L1 groups in sample


L1 \ Final Writing Mark 4 5 6 7 8 Total
English - 13 86 95 23 217
East Asian language 11 79 124 32 2 248
European language - 8 21 23 - 52
Pacific Island language - 6 15 5 - 26
Language from India/Sri Lanka/Pakistan - 3 9 8 1 21
Other - 4 10 5 - 19

5.5.2.2 The raters:

Very little specific information was available about the raters of the 601 scripts
during the 2004 administration. However, as mentioned earlier, all DELNA raters
are experienced teachers of either ESOL or English, a large number have rating
experience outside the context of DELNA (for example in the context of IELTS)
and all have postgraduate qualifications. More background details on the partici-
pating raters in Phase 2 of the study will be reported in Chapter 8.

5.5.3 Procedures

5.5.3.1 Data collection:

The 601 writing scripts randomly selected for the purpose of this study were col-
lected as part of the normal administration of the DELNA writing component over
the course of the academic year 2004. All scripts were rated by two raters and, in
case of discrepancies of more than two band scores, a third rater was consulted.
As part of the DELNA administration, a background information sheet is rou-
tinely collected from each student. Several categories on this background informa-
tion sheet were entered into a database (see section on data entry).

5.5.3.2 Data entry:

Data were entered into a Microsoft Access Database which included a random ID
number for each script, the students’ ID number to identify the script, the task
(prompt) number, the score awarded to the scripts by the two raters on the three
different categories in the analytic scale (fluency, content, form) as well as any
relevant background information about the students. The variables entered from
the background information sheet were as follows: country of birth, gender, age
group, L1, home language, time in NZ, time in other English speaking country,
marks on other relevant English exams and enrolment at the University of Auck-
land at time of sitting the assessment (i.e. first, second or third year). The scores
awarded on each category of the analytic scale (i.e. fluency, content, form), by the
two (or three) raters were then averaged (in the case of uneven scores arising, the
score was rounded down) to arrive at a final score for each script in each category.
An overall writing score was also calculated for each script. This was based on the
average of the mean scores for each of the three categories of fluency, content and
form. The overall score was rounded down if .333 and up if .667.
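A minimal sketch of these averaging and rounding rules is given below (in Python; the rater scores are hypothetical and the function names are invented for the example).

```python
import math

def category_score(rater_scores):
    """Average two (or three) raters' scores on one category; uneven averages
    are rounded down, as described above."""
    return math.floor(sum(rater_scores) / len(rater_scores))

def overall_score(fluency, content, form):
    """Average the three category scores; a decimal of .333 is rounded down
    and .667 is rounded up."""
    mean = (fluency + content + form) / 3
    return math.floor(mean) if mean - math.floor(mean) < 0.5 else math.ceil(mean)

# Hypothetical example: raters gave 6/7 for fluency, 6/6 for content, 5/6 for form.
fluency, content, form = category_score([6, 7]), category_score([6, 6]), category_score([5, 6])
print(overall_score(fluency, content, form))  # (6 + 6 + 5) / 3 = 5.667 -> 6
```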

5.5.3.3 Data analysis:

5.5.3.3.1 Accuracy:

As mentioned in the pilot study, the measure chosen for accuracy was the per-
centage of error-free t-units. This therefore involved identifying both t-unit
boundaries and errors. As these variables cannot be coded with the aid of com-
puter programs (Sylvianne Granger, personal communication), both had to be
coded manually. To save time, t-units were coded in combination with clause
boundaries (see grammatical complexity) and errors were coded in combination
with spelling mistakes and punctuation mistakes (see mechanics).

After coding t-unit boundaries and errors, all error-free t-units were recorded into
an SPSS (Statistical Package for the Social Sciences) spreadsheet. To make the
variable more meaningful, the percentage of error-free t-units was calculated by
dividing the error-free t-units by the total number of t-units. A second coder was
then involved to ensure inter-rater reliability by double-coding a subset of the
whole sample (50 scripts). A Pearson correlation co-efficient was calculated using
SPSS.
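Both computations described here are straightforward; the sketch below illustrates them with invented figures (scipy stands in for SPSS, which was used in the actual analysis).

```python
from scipy.stats import pearsonr

# Percentage of error-free t-units for one script (invented counts).
error_free_t_units = 7
total_t_units = 15
proportion_error_free = error_free_t_units / total_t_units  # 0.47, i.e. 47%

# Inter-rater reliability on the double-coded subset (invented paired counts).
coder_one = [7, 10, 3, 12, 6]
coder_two = [8, 10, 4, 11, 6]
r, p = pearsonr(coder_one, coder_two)
print(proportion_error_free, r, p)
```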

5.5.3.3.2 Temporal Fluency:

Temporal fluency was operationalised by the number of words written. This was
established using a Perl Program specifically produced for this task. The output of
the Perl program is composed of the script number from 0 to 601 in one column
and the number of words in the script in the adjacent column. The output is in
TextPad (a free downloadable software for Windows) and this can then easily be
transferred into Excel or SPSS spreadsheets. The reason a Perl programme was
chosen for this task is that, instead of having to go through the laborious task of
checking the number of words in each individual script through the help of the
Microsoft Word Tools menu, Perl performs the analysis within seconds. Because
this variable was analysed by a computer program, double rating was unneces-
sary. However, it should be mentioned that as part of the design process of the
Perl program, a number of spot checks were carried out to ensure that the program
was working in the way required.
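The original routine was written in Perl; the sketch below reproduces the same idea in Python (the directory and file-naming convention scripts/script_*.txt is an assumption made purely for the example).

```python
import csv
import glob

# Count the words in each typed script and write a two-column output file
# (script identifier, word count), mirroring the Perl routine described above.
rows = []
for path in sorted(glob.glob("scripts/script_*.txt")):  # assumed file layout
    with open(path, encoding="utf-8") as script:
        rows.append((path, len(script.read().split())))

with open("word_counts.csv", "w", newline="") as out:
    csv.writer(out).writerows(rows)
```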

5.5.3.3.3 Repair Fluency:

The variable chosen to analyse repair fluency was the number of self-corrections.
The self-corrections were operationalised as described in the pilot study. To en-
sure inter-rater reliability, this variable was double rated in 50 scripts and a Pear-
son correlation coefficient was calculated using SPSS.

5.5.3.3.4 Grammatical complexity:

Grammatical complexity, as mentioned in the previous section, was operationalised as the number of clauses per total number of t-units. Both clauses and t-
units were coded manually. A clause boundary can occur between an independent
clause (a clause that can stand by itself) and a dependent clause (a clause which
cannot stand by itself), or between two dependent clauses. However, as with the
coding of t-units described above, sometimes the clause boundaries were hard to
define as some of the writers had not achieved a high level of accuracy in their
writing. As with the t-units, the decision was made that a clause needed to contain
a subject and a main verb to count as a clause. Therefore, the sentence ‘the graph
shows that the amount of departures after 2001 big’ was counted as just one t-unit
with no clause attached because the verb was missing in the second part. Again, a
second coder was used to code a subset of the whole sample (50 scripts) to ensure
inter-coder reliability. Then, a Pearson correlation coefficient was calculated us-
ing SPSS.

5.5.3.3.5 Lexical complexity:

Lexical complexity was coded into three variables: firstly, sophisticated lexical
words per total lexical words, secondly the average length of words and finally the
number of AWL words. The variable sophisticated lexical words per total lexical
words was analysed with the help of the computer program Web VocabProfile
(Cobb, 2002) which is an adaptation of Heatly and Nation’s Range (1994).

Before the data was entered into VocabProfile, all spelling mistakes were cor-
rected. This was done because the program would not be able to recognise mis-
spelled words and would therefore move them into the offlist wordlist. The ra-
tionale behind including these words in the analysis was that the writer had at-
tempted the items, but was just not able to spell them correctly. Items of vocabu-
lary that were too unclear to be corrected were excluded from the analysis.
The sophisticated lexical words were taken from the tokens of the AWL (aca-
demic word list) and the Off-List Word tokens. However, as the Off-List words
also included abbreviations and words like ‘Zealander’ from New Zealander, this list was first scanned and then only the ‘real’ Off-List words were included in the analysis. The Off-List words could be investigated easily because each token of the Off-List words was listed lower down the screen. The number of sophisti-
cated lexical words was then divided by the total number of content words. As the
number of content words was not stated in the output of VocabProfile, the value
for lexical density had to be used. Lexical density is defined as the number of con-
tent words divided by the total number of words. Therefore, it was quite straight-
forward to arrive at the number of content words (i.e. by multiplying the value of
lexical density by the total number of words). Because the variable sophisticated
lexical words over total lexical words was analysed with the aid of the computer
program VocabProfile, no inter-rater reliability check was deemed necessary.
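The arithmetic for recovering the number of content words from the VocabProfile output, and for computing the sophisticated-word ratio, is summarised in the sketch below (all figures are invented; in practice the values were read off the VocabProfile output).

```python
# Recovering the sophisticated-lexical-word ratio from VocabProfile-style output
# (invented values for one script).
total_words = 280
lexical_density = 0.52            # content words / total words, as reported by the program
awl_tokens = 14                   # tokens from the Academic Word List
real_off_list_tokens = 6          # Off-List tokens after abbreviations etc. were removed

content_words = lexical_density * total_words       # number of lexical (content) words
sophisticated = awl_tokens + real_off_list_tokens   # sophisticated lexical words
print(round(sophisticated / content_words, 3))      # sophisticated per total lexical words
```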

The second variable that was investigated for lexical complexity was the average
length of words. This was done completely automatically, again using a Perl
script specifically designed for the task. The Perl program was written so that it
identified the number of characters in each script, as well as the number of spaces
between characters. Before this count, the Perl script disregarded all punctuation
marks (so that they were not added into the final count where they might inflate
the length of words). To arrive at the final average word length for each script, the
number of characters was divided by the number of spaces between words. As this
was done completely automatically, no inter-rater reliability check was deemed
necessary. The Perl program was however thoroughly checked for any mistakes
before it was used.
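Again, the original count was implemented in Perl; a Python sketch of the same idea is shown below (it divides the character count by the number of words rather than by the number of spaces, which serves the same purpose for the illustration).

```python
import string

def average_word_length(text: str) -> float:
    """Strip punctuation, then divide the number of characters in the remaining
    words by the number of words (cf. the Perl routine described above)."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    return sum(len(word) for word in words) / len(words) if words else 0.0

print(average_word_length("The number of tourists arriving in New Zealand increased steadily."))
```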

Finally, the number of words from the Academic Word List was recorded in the
spreadsheet. This was also taken from the output of VocabProfile.

5.5.3.3.6 Mechanics:

The first group of variables examined for mechanics was the number of errors in
each script for spelling and punctuation. They were coded at the same time as the
rest of the errors (i.e. the types of errors analysed for accuracy). Each of these was
defined as described in the methodology section of the pilot study. A second rater
rated a subset of the data (50 scripts) and Pearson correlation coefficients were
calculated for each of the variables using SPSS.

Paragraphing was coded as described in the pilot study. Double-coding of a subset of 50 scripts was undertaken and a Pearson correlation coefficient was calculated
to ensure inter-rater reliability.

5.5.3.3.7 Coherence:

Using the categories established in the pilot study, the scripts were coded manu-
ally. The same t-unit breaks as for accuracy were used. Inter-rater reliability was established by having a second coder rate a subset of 50 scripts and calculating a Pearson correlation coefficient in SPSS.

5.5.3.3.8 Cohesion:

The variable chosen to investigate cohesion was the number of anaphoric pro-
nominals (e.g. this, that, these) used by the writer. The pronominals used in the
main analysis are listed in Appendix 1. The decision was made that instead of
hand-coding these in the 601 writing scripts, with the risk of missing some due to
human error, a concordancing program would be used to search for each of these
pronominals individually. The concordancer chosen for this task was MonoConc
Pro Concordance Software Version 2.2 (Barlow, 2002).

Monoconc not only displays the concordancing lines, but also displays as much
context as is requested. This proved invaluable, because many of the words identi-
fied were not anaphoric pronominals and thus were not acting as cohesive devices
as described by Halliday and Hasan (1976). Although this method of data analysis
has the advantage that it saves time compared to the manual method, it still
proved time-consuming in the sense that all instances of the words in the concor-
dance needed to be checked in the top window, to eliminate all occasions where
the word was not used as a cohesive device. For example, when counting the use
of those, all instances of those as in those of us, needed to be discarded as well as
the those used in the sense of those people that I am familiar with. After pronomi-
nals that were not used as cohesive devices were discarded, the next step was to
assess if the referent referred to by the pronominal was in fact over the clause
boundary in accordance with the definition adopted for cohesive devices. This ex-
cluded a number of possessive pronominals occurring in the same clause as the
referent as for example the use of its in ... the motor vehicle crashes declined to
half its number....

Following this procedure, each pronominal was recorded and entered into an
SPSS spreadsheet next to the relevant script number. The next step was to ex-
clude all pronouns that occurred fewer than 50 times in all scripts. This was done
because it was not deemed useful to include very rare items in a rating scale.
Therefore the following words were excluded from any further analysis: here, its,
those, his, her, she and he. Then the results for each pronoun were correlated with
the final score awarded by the DELNA raters. Finally, an inter-rater reliability
check was undertaken by double-rating 50 scripts and calculating a Pearson corre-
lation coefficient.
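A sketch of the counting and screening steps is given below; a simple regular-expression search stands in for the MonoConc concordance queries, and the crucial manual check of each hit in context (to discard non-cohesive uses) is not reproduced, so the counts shown are purely illustrative.

```python
import re
from collections import Counter

PRONOMINALS = {"this", "these", "that", "those", "it", "they", "them", "their"}

def count_candidate_pronominals(text: str) -> Counter:
    """Count candidate anaphoric pronominals in one script; in the actual analysis
    every concordance hit was inspected and non-cohesive uses were discarded."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token in PRONOMINALS)

# Screening step: drop items occurring fewer than 50 times across all scripts
# (the corpus totals below are invented for the example).
corpus_totals = Counter({"this": 900, "these": 340, "they": 700, "its": 32, "those": 41})
retained = {word: n for word, n in corpus_totals.items() if n >= 50}
print(retained)
```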

5.5.3.3.9 Reader-Writer Interaction:

Reader-Writer interaction was investigated by using MonoConc (Barlow, 2002), which was described in the previous section on cohesion. The structures investi-
gated in this category were allocated to four groups: hedges, boosters, markers of
writer identity and the passive voice. The complete list of items investigated was
established based on previous research reported in the literature and can be found in Ap-
pendix 1. Each lexical item was investigated individually using MonoConc. Here
special care needed to be taken, so that lexical items that did not function as
hedges or boosters were excluded from the analysis. For example, in the case of
the booster certain, all uses of certain + noun needed to be excluded as this struc-
ture does not act as a boosting device. In the case of the lexical item major, all
uses of the word in conjunction with cities or axial routes, for example, needed to
be excluded because these were also not used as boosters. So for each lexical item
in Appendix 1, the whole concordancing list produced in MonoConc needed to be
thoroughly examined before each instance of that item could be entered into a
spreadsheet. Finally, all items were added together, so that a final frequency count
for each script was found for hedges, boosters and markers of writer identity. The
passive voice was initially also investigated using MonoConc. However, because
it was impossible to search for erroneous instances of the passive (i.e. unsuccess-
ful attempts), this analysis was later refined by a manual search.
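The exclusion logic described here cannot be fully automated, but the crude sketch below illustrates the idea for the booster certain: a hit is only counted when it does not premodify a noun (the noun list and example sentence are invented, and in the actual analysis each concordance line was judged by eye).

```python
import re

# Crude illustration of screening concordance hits for the booster 'certain':
# exclude 'certain' + noun (e.g. 'a certain number'), which is not a boosting device.
text = ("It is certain that the population will age. "
        "A certain number of young people will leave NZ.")
lowered = text.lower()
booster_uses = 0
for match in re.finditer(r"\bcertain\b", lowered):
    following = lowered[match.end():].lstrip()
    if not re.match(r"(number|amount|people|groups?)\b", following):  # invented noun list
        booster_uses += 1
print(booster_uses)  # 1: only the predicative use is counted as a booster
```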

Finally, all four variables investigated in the category of reader/writer interaction underwent an inter-rater reliability check. Fifty scripts were coded by a second rater and a Pear-
son correlation coefficient calculated.

5.5.3.3.10 Content:

Using the scoring scheme described in the pilot study, the scripts were manually
coded. A second rater was used to ensure inter-rater reliability by scoring a subset
of 50 scripts. A Pearson correlation coefficient was calculated using SPSS.

5.5.3.4 Data analysis: Inferential statistics

To ascertain that any differences found between different DELNA writing levels
did not occur purely due to sampling variation, each measure in the analysis was
subjected to an Analysis of Variance (ANOVA). A number of assumptions under-
lie an ANOVA (A. Field, 2000; Wild & Seber, 2000). The first assumption relates
to independence of samples. This assumption is satisfied in this situation, as no
writing script is repeated in any of the groups (DELNA band levels) compared.
The second assumption stipulates that the sample should be normally distributed.
However, according to Wild & Seber (2000, p. 452), ANOVA is robust enough to
cope with departures from this assumption. Furthermore, because most groups in
this analysis were very large, we can rely on the central limit theorem, which stipulates that for large samples the sampling distribution of the mean will be approximately normal.
The third assumption stipulates that the groups compared should have equal vari-
ances. This is the most important assumption relating to ANOVA. Wild & Seber
(2000) suggest that this can be tested by ensuring that the largest standard devia-
tion is no more than twice as large as the smallest standard deviation2. If the vari-
ances were found to be unequal following this analysis, a Welch test (Welch’s
variance-weighted ANOVA) was used. This test is robust enough to cope with
departures from the assumption of equality of variances and performs well in
situations where group sizes are unequal. The post hoc test used for all analyses
was the Games-Howell procedure. This test is appropriate when variances are
unequal or when variances and group sizes are unequal (A. Field, 2000, p.276).
This was found to be the most appropriate test of pair-wise comparisons because
in all cases the groups were unequal (with DELNA band levels 4 and 8 having
fewer cases than band levels 5, 6 and 7).

Whilst pair-wise post hoc comparisons were performed for each measure, it was
not deemed important for each measure to achieve statistical significance be-
tween each adjacent level. Pair-wise comparisons between adjacent levels are
however briefly mentioned in the results chapter.
After the ANOVAs and pair-wise post hoc comparisons had been computed, it
came to my attention that a MANOVA analysis would be more suitable for this
type of data as it would avoid Type 1 errors. Because the data violated some un-
derlying assumptions of inferential statistics, especially the assumptions of equal
variances, a non-parametric MANOVA was chosen. The computer program
PERMANOVA (Anderson, 2001, 2005; McArdle & Anderson, 2001) was used
for this as SPSS is unable to compute non-parametric MANOVAs. However, the
resulting significance values for each structure showed very little difference from
those computed by the ANOVAs described above, and it was therefore decided to
keep the ANOVAs in the results section as these results are more easily presented
and interpreted.
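The decision rule described above can be summarised as follows: check whether the largest group standard deviation is more than twice the smallest; if not, run a standard one-way ANOVA, otherwise substitute a Welch-type test. The sketch below illustrates this with scipy and invented group data (the original analyses were run in SPSS and PERMANOVA, and the Games-Howell post hoc comparisons are not reproduced here).

```python
import numpy as np
from scipy.stats import f_oneway

# Invented frequency counts for one measure, grouped by DELNA band level.
groups = {
    4: np.array([3.0, 5.0, 2.0, 6.0]),
    5: np.array([4.0, 6.0, 5.0, 7.0, 3.0]),
    6: np.array([6.0, 8.0, 7.0, 9.0, 6.0, 5.0]),
    7: np.array([8.0, 9.0, 10.0, 7.0]),
    8: np.array([10.0, 11.0, 9.0]),
}

# Rule of thumb: variances are treated as equal if the largest SD is no more
# than twice the smallest SD.
sds = [values.std(ddof=1) for values in groups.values()]
if max(sds) <= 2 * min(sds):
    F, p = f_oneway(*groups.values())  # standard one-way ANOVA
    print("ANOVA:", F, p)
else:
    # A Welch variance-weighted ANOVA would be substituted here
    # (e.g. pingouin.welch_anova), followed by Games-Howell comparisons.
    print("Unequal variances: use a Welch-type test")
```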

---
Notes:
1 In retrospect it might have been better to have looked at the number of items in a lexical chain.
2 This test was chosen over Levene's test for equality of variances, as Levene's test almost always returns significant results in the case of large samples.

Chapter 6: Results – Analysis Of Writing Scripts

6.1 Introduction

The following chapter presents the results of Phase 1, which address the following
research question:

Research Question 1: Which discourse analytic measures are successful in distinguishing between writing samples at different DELNA (Diag-
nostic English Language Needs Assessment) writing levels?

For each variable under investigation, two pieces of information are presented.
Firstly, side-by-side box plots showing the distribution over the different DELNA
writing proficiency levels are provided. The box of each plot portrays the middle
50% of students, while the thick black line inside the box denotes the median. The
whiskers indicate the points above and below which the highest and lowest 10%
of cases occur. Cases lying outside this area are outliers, or extreme scores. The y-
axis on which these plots are charted represents the frequency (or proportion of
usage) of the variable in question, while the x-axis represents the average DELNA
mark, ranging from 4 to 8.
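For readers who wish to reproduce this style of display, a minimal matplotlib sketch of side-by-side box plots over the DELNA bands is given below (randomly generated data stand in for the real measures; the whiskers are set to the 10th and 90th percentiles, as in the figures that follow).

```python
import numpy as np
import matplotlib.pyplot as plt

# Side-by-side box plots of one simulated measure over DELNA bands 4 to 8.
rng = np.random.default_rng(0)
bands = [4, 5, 6, 7, 8]
data = [rng.normal(loc=0.1 * band, scale=0.05, size=40) for band in bands]

fig, ax = plt.subplots()
ax.boxplot(data, labels=[str(band) for band in bands], whis=(10, 90))
ax.set_xlabel("Average DELNA writing mark")
ax.set_ylabel("Frequency / proportion of usage")
plt.show()
```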

The second piece of information is a table presenting the descriptive statistics for
each variable at each DELNA level. As in the pilot study, the minimum and
maximum were chosen over the range to illustrate any overlap between levels.

6.2 Accuracy
Accuracy was measured as the percentage of error-free t-units3.

First, an inter-rater reliability analysis was undertaken. A correlation between the scores of the two raters revealed that the scores were strongly related, r = .871, n =
50, p = .000, two-tailed. For all variables coded for inter-rater reliability, it was
decided that a correlation coefficient of .80 or higher would be seen as satisfac-
tory.

Figure 22: Distribution of proportion of error-free t-units over overall sample and DELNA
sublevels

The side-by-side box plots in Figure 22 depict the distribution of the proportion of
error-free t-units over the different DELNA bands. The variable successfully dis-
tinguished the different levels, with some overlap.

Table 36: Descriptive statistics - Proportion of error-free t-units


DELNA level M SD Minimum Maximum
4 .15 .18 0.00 0.58
5 .25 .18 0.00 0.87
6 .42 .24 0.00 1.00
7 .61 .22 0.09 1.00
8 .71 .21 0.23 1.00

Table 36 above shows the descriptive statistics for each of the five proficiency
levels. Because the equality of variance assumption was not violated in this case,
an analysis of variance (ANOVA) test was performed. The analysis of variance
revealed significant differences between the different band levels, F (4, 576) =
60.28, p = .000. The Games-Howell post hoc procedure showed statistically sig-
nificant differences between two adjacent pairs of levels, levels 5 and 6 and levels
6 and 7.

6.3 Fluency

6.3.1 Temporal Fluency

The variable chosen for temporal fluency was the average number of words per
script.

The box plots in Figure 23 and the descriptive statistics in Table 37 both indicate
that although the average number of words generally increased as the writing level
rose, there was much overlap. There also seemed to be a ceiling effect to the vari-
able, indicating that writers at levels 6, 7 and 8 seemed to produce a very similar
number of words on average. So while there was a clear difference between the
number of words produced on average between levels 4 to 6 (although with much
overlap in the range), for levels 6 and above the variable did not successfully dis-
criminate between the writers.

Figure 23: Distribution of number of words per essay over overall sample and DELNA sub-
levels

Table 37: Descriptive statistics – Average number of words per script


DELNA level M SD Minimum Maximum
4 226.67 42.86 151 320
5 244.98 58.63 47 424
6 273.66 79.18 68 628
7 281.00 67.69 121 454
8 273.32 54.33 155 390

Because the assumption of equal variances was not violated, an ANOVA was per-
formed. This analysis revealed a statistically significant difference between the
five band levels, F (4, 577) = 5.82, p = .000. The Games-Howell procedure re-
vealed that the only adjacent levels that were significantly different were levels 5
and 6.

6.3.2 Repair Fluency

The variable chosen for repair fluency was the number of self-corrections.

To ensure inter-rater reliability, a Pearson correlation coefficient was calculated on the frequency counts produced by two raters coding a subset of fifty scripts.
The analysis showed a strong correlation, r = .918, n = 50, p = .000, two-tailed.

While the mean for all scripts was 14.13 self-corrections, the scripts ranged
widely. Over 50 writers made no self-corrections, while some scripts had as many
as 64.

Figure 24: Distribution of number of self-corrections over overall sample and DELNA sub-
levels
This variable also produced a large number of outliers, as can be seen when the number of self-corrections was plotted over the DELNA bands. Although there
was considerable overlap, the measure discriminated between the different
DELNA bands (see Figure 24 and Table 38 below), showing that the lower the
level of the writer, the more self-corrections were made4.

Table 38: Descriptive statistics - Number of self-corrections
DELNA level M SD Minimum Maximum
4 21.33 5.19 0 32
5 17.21 11.41 0 52
6 15.00 9.58 0 64
7 12.38 9.57 0 57
8 6.96 5.84 0 37

Because the assumption of equality of variances did not hold in this case, a Welch
test was performed which revealed statistically significant differences between the
different groups, F (4, 60.7) = 4.14, p = .005. However, the Games-Howell proce-
dure revealed that no immediately adjacent levels were significantly different.

6.4 Complexity

6.4.1 Grammatical complexity

The variable chosen to analyse grammatical complexity was clauses per t-units.

An inter-rater reliability check was undertaken for the coding of both clauses and
t-units. Both showed a strong positive relationship, with the correlation coefficient
for t-units, r = .981, N = 50, p = .000, being slightly higher than that for clauses, r = .934, N = 50, p = .000.

Figure 25: Distribution of clauses per t-units over overall sample and DELNA sublevels

The box plots in Figure 25 and the descriptive statistics in Table 39 below show
that the variable failed to differentiate between scripts at different ability levels.
This means that, in contrast to what was expected, higher level writers did not use
more complex sentences (more subordination).

Table 39: Descriptive statistics - Clauses per t-unit


DELNA level Mean SD Minimum Maximum
4 1.45 .39 1.03 2.40
5 1.39 .27 .54 2.67
6 1.50 .29 .96 2.53
7 1.48 .32 .00 2.44
8 1.42 .30 1.09 2.30

Overall, very little subordination was used in the scripts as is indicated by the
mean of 1.46 for all scripts included. That is, fewer than every second t-unit in-
cluded subordination.

Because the assumption of equality of variances held in this case, an ANOVA was performed which returned a statistically significant result, F (4, 575) = 3.08, p = .016. The Games-Howell procedure showed that the only adjacent band level pair that was significantly different was levels 5 and 6.

6.4.2 Lexical complexity

Two separate variables were chosen for lexical complexity in the pilot study, the
average word length and sophisticated lexical words per total lexical words. As
part of the main analysis, the number of AWL words was also recorded, because,
forming part of the output of VocabProfile, the coding required no extra time.

Firstly, the average word length was investigated. The average word length for all
words in the whole sample was 4.78.

The box plots (Figure 26) and the table displaying the descriptive statistics (Table
40) show that the variable successfully discriminated between different levels of
writing, in that the higher the level of writing, the longer the average word.

Figure 26: Distribution of average word length over overall sample and DELNA sublevels

An analysis of variance revealed a significant difference between the different DELNA band levels, F (4, 577) = 14.54, p = .000. The Games-Howell procedure
showed that two adjacent pairs of band levels were statistically significantly dif-
ferent, namely levels 6 and 7 and levels 7 and 8.

Table 40: Descriptive statistics - Average word length


DELNA level M SD Minimum Maximum
4 4.52 .30 3.95 4.91
5 4.69 .28 4.11 5.41
6 4.76 .28 4.09 5.61
7 4.85 .25 4.09 5.50
8 5.04 .27 4.51 5.76

The second variable investigated for lexical complexity was the number of so-
phisticated lexical words per total number of lexical words.

Figure 27: Distribution of sophisticated lexical words per total lexical words over overall
sample and DELNA sublevels

Figure 27 and Table 41 show that the higher the level of writing, the more sophis-
ticated lexical words per total lexical words were used by the writers5.

Table 41: Descriptive statistics - Sophisticated lexical words per total lexical words
DELNA level M SD Minimum Maximum
4 .13 .05 .03 .21
5 .15 .06 .00 .30
6 .17 .07 .00 .39
7 .18 .07 .00 .37
8 .21 .07 .00 .34

An ANOVA revealed statistically significant differences between the five different band levels, F (4, 596) = 7.32, p = .000. The Games-Howell procedure showed
that no adjacent band levels were statistically significantly distinct.

Although not initially planned to be part of the analysis, the number of words in the Academic Word List (AWL) was also recorded as part of the VocabProfile analysis. As Figure 28 and Table 42 indicate, this variable differentiates well
between the different levels of writing6.

Figure 28: Distribution of number of AWL words over overall sample and DELNA sublevels

Table 42: Descriptive statistics - Number of words in AWL


DELNA level M SD Minimum Maximum
4 6.91 3.09 2 13
5 10.25 5.97 0 31
6 13.99 7.69 0 43
7 17.11 7.71 1 38
8 21.24 6.09 8 32

Because the assumption of equal variances was not satisfied in this case, a Welch
procedure was performed, which revealed statistically significant differences be-
tween the groups, F (4, 66.22) = 39.99, p = .000. The Games-Howell procedure
showed that all adjacent pairs of band levels were statistically significantly different.

6.5 Mechanics

Three variables were investigated as part of mechanics: spelling, punctuation and paragraphing. The first variable was the number of spelling errors.

Inter-rater reliability for the variable was investigated by having a second coder
double rate a subset of 50 scripts. A Pearson correlation coefficient showed a
strong relationship between the two counts of errors, r = .959, N = 50, p = .000.

Many scripts displayed no or very few mistakes, suggesting that this variable
might not be suitable as a measure. Over a third of all scripts displayed no spelling
errors, while the overall mean for all scripts was 3.5 spelling errors per script.

Figure 29: Distribution of number of spelling errors over overall sample and DELNA sub-
levels

The box plots present the number of spelling mistakes for each DELNA band
level. It can be seen that this variable differentiated between levels.

Table 43: Descriptive statistics - Number of spelling errors


DELNA level M SD Minimum Maximum
4 8.27 14.87 0 51
5 3.96 4.92 0 33
6 3.67 3.91 0 21
7 3.06 2.94 0 12
8 2.00 1.47 0 6

However, the majority of scripts, with the exception of some outliers, did not dis-
play a large number of spelling mistakes and the differences between levels 5 to 7
were very small. In contrast, there was a large difference in the mean for scripts
scored at level 4 and 5. The mean for level 4 scripts was 8.27 while the mean for
level 5 scripts was just below 4 per script. The descriptive statistics for each level
are displayed in Table 43 above. Because the assumption of equal variances did
not hold in this case, a Welch procedure was used instead of an analysis of vari-
ance. The Welch test revealed statistically significant differences, F (4, 58.46) =
6.01, p = .000. The Games-Howell procedure showed that only levels 7 and 8
were statistically significantly different from each other.

The second variable investigated was the number of punctuation errors.

First, inter-rater reliability was established for this variable. A correlation showed
a strong relationship between the ratings of the two coders, r = .864, n = 50, p =
.000.

As with the number of spelling mistakes, this variable showed a positively skewed
distribution. For the overall sample of scripts, the average was 3.04 punctuation
errors.

Figure 30: Distribution of number of punctuation errors over overall sample and DELNA
sublevels

As with spelling, this variable also failed to differentiate between the five differ-
ent levels of writing and in this case there was very little differentiation in terms
of the mean scores of the five writing levels (Figure 30 and Table 44).

An analysis of variance revealed no statistically significant differences between the groups, F (4, 575) = .396, p = .812.

The third and final variable investigated in the category of mechanics was para-
graphing, which was measured as the number of paragraphs (of the five para-
graph model) produced.

Table 44: Descriptive statistics - Number of punctuation errors


DELNA level M SD Minimum Maximum
4 2.55 2.66 0 8
5 2.92 3.04 0 14
6 3.10 2.70 0 14
7 3.14 2.87 0 15
8 2.56 2.31 0 8

An inter-rater reliability check was undertaken on a set of 50 scripts from the sample. A significant relationship was found between the ratings of the two coders, r = .802, N = 50, p = .000.

Figure 31: Distribution of paragraphing over overall sample and DELNA sublevels

When the box plots (Figure 31) and the descriptive statistics (Table 45) for the
different DELNA proficiency levels were compared, it could be seen that writers
at level 4 produced only two of the expected paragraphs on average, whilst writers
at level 8 produced just under four. Students at levels 5, 6 and 7 had a very similar
mean (around three paragraphs) on this variable; however the box plots show a
clear differentiation between levels 5 and 6.

Table 45: Descriptive statistics - Paragraphing
DELNA level M SD Minimum Maximum
4 2.27 1.10 1 4
5 2.88 .85 1 5
6 3.09 .91 1 5
7 3.17 .91 1 5
8 3.68 .56 3 5

An analysis of variance revealed statistically significant differences between the groups, F (4, 578) = 7.03, p = .000. The Games-Howell procedure showed that the
only adjacent levels that were statistically significantly different were levels 7 and
8.

6.6 Coherence

Before the analysis of coherence, an inter-rater reliability analysis was necessary. The results for each structure appear in the table below (Table 46):

Table 46: Inter-rater reliability for topical structure analysis categories


Topical structure category Correlation coefficients
Parallel progression r = .835, N=50, p = .000
Direct sequential progression r = .921, N=50, p = .000
Indirect progression r = .796, N=50, p = .000
Superstructure r = .960, N=50, p = .000
Extended progression r = .821, N=50, p = .000
Coherence break r = .916, N=50, p = .000
Unrelated progression r = .828, N=50, p = .000

The inter-rater correlation for indirect progression was below .80, which was cho-
sen as the cut-off for this study. However, because it is a high-inference variable,
it was decided that this level would be acceptable.

Next, the following hypotheses were made: parallel progression, direct sequential progression and superstructure would all contribute towards coherence. There-
fore, there was an expectation that these might be produced more commonly by
more proficient writers. On the other hand, unrelated progression and coherence
breaks were thought to be reasons for coherence to break down and might there-
fore be produced by less proficient writers. No clear hypothesis could be stated for
indirect progression and extended progression.

However, it was decided, instead of having pre-conceived hypotheses about what
the writers might produce at different levels, to let the data speak for itself. There-
fore, a correlation analysis was undertaken, in which the proportion of usage of
each of these categories was correlated with the final score the writers received
from the two raters. The results from the correlation confirmed some of the hy-
potheses, whilst others were refuted. The correlations (Table 47 below) showed
that higher level writers used more direct sequential progression, superstructure
and indirect progression (resulting in significant positive correlations). Categories
used more by lower level writers were parallel progression, unrelated progression
and coherence breaks (resulting in significant negative correlations). Extended
progression was used equally by lower and higher level writers and therefore re-
sulted in a correlation close to zero.

Table 47: Topical structure categories correlated with final DELNA writing score
Topical structure category Final writing score
Parallel progression -.215**
Direct sequential progression .292**
Superstructure .258**
Indirect progression .220**
Extended progression -.07
Unrelated progression -.202**
Coherence break -.246**
n = 601; **p < .01

To identify differences over the different proficiency levels visually, side-by-side boxplots were created for each of the categories of topical structure. These are
presented in Figures 32 to 38 below. Each of these suggests that whilst there was
considerable overlap between the different levels, there was usually a clear pro-
gression.

Figure 32: Distribution of proportion of parallel progression

Figure 33: Distribution of direct sequential progression

Interestingly, the use of parallel progression resulted in an inverted U-shape, with writers at levels 4, 7 and 8 using it less commonly than the middle levels of writing proficiency (see Figure 32 above). An analysis of variance revealed statis-
tically significant differences between the different groups, F (4, 576) = 7.29, p =
.000. The Games-Howell procedure showed that levels 6 and 7 were statistically
significantly different.

Direct sequential progression was used more frequently by higher level writers. In particular, writers at level 8 used this type of progression for nearly a third of their sentence topics. An analysis of variance revealed statistically significant differ-
ences between the groups, F (4, 575) = 2.86, p = .023. However, the Games-
Howell procedure showed no statistically significant differences between adjacent
groups.

Figure 34: Distribution of indirect progression over DELNA sublevels

Figure 35: Distribution of proportion of superstructure over DELNA sublevels

Figure 35 above demonstrates that the use of superstructure clearly increased as proficiency level increased. A Welch test revealed statistically significant differ-
ences between the different levels of writing, F (4, 56.63) = 5.50, p = .001. The
Games-Howell procedure failed to show any significant differences between adja-
cent levels.

The result for indirect progression was less clear, but showed an increase accord-
ing to level. An analysis of variance revealed statistically significant differences
between the different levels of writing, F (4, 574) = 5.85, p = .000. However
again, the Games-Howell procedure resulted in no statistically significant differ-
ences between any adjacent band levels.

Figure 36: Distribution of proportion of extended progression over DELNA sublevels

Figure 36 above shows that extended progression was used more frequently by
lower level writers. The distribution over levels 6, 7 and 8 was very similar. An
analysis of variance, however, revealed no statistically significant differences be-
tween the groups, F (4, 577) = 1.62, p = .168.

Figure 37: Distribution of proportion of unrelated progression over DELNA sublevels

Unrelated progression (Figure 37 above), whilst being used in about a quarter of
all topic progressions at level 4, was only very rarely found in writing at level 8. A
Welch test revealed statistically significant differences between the groups, F (4,
576) = 6.40, p = .000. The Games-Howell procedure showed that the only border-
ing band levels that were statistically distinct from each other were levels 5 and
6.

Figure 38: Distribution of proportion of coherence breaks over DELNA sublevels

Coherence breaks occurred relatively frequently at level 4 (see Figure 38 above). However, at the higher levels this reduced substantially, with virtually no coher-
ence breaks evident at level 8.

A Welch test revealed statistically significant differences between the groups, F (4, 56.98) = 9.24, p = .000. The Games-Howell post hoc procedure showed that
no neighboring levels were statistically distinct from each other.

6.7 Cohesion

Two variables were investigated as part of cohesion, firstly, anaphoric pronominal references, and secondly, the number of linking devices.

First, an inter-rater reliability analysis was undertaken for the anaphoric pronominals. This involved a second researcher double-coding 50 scripts. The correlation coefficient indicates a high level of inter-rater reliability, r = .969, n = 50, p = .000.

Each anaphoric pronominal investigated as part of cohesion (after pronominals
used less than 50 times overall were deleted) was then correlated with the final
average score, using a Pearson correlation coefficient to establish if some were
used more commonly by either low or high level writers. The results of this corre-
lational analysis can be seen in Table 48 below.

Table 48: Correlations of anaphoric pronominals with DELNA writing score


Pronominal    r          p
they          -.174**    .000
them          -.112**    .006
it            -.112**    .006
their         -.102**    .012
that          -.057      .166
these          .119**    .004
this           .264**    .000
N = 601, ** = p < .01

Based on the correlations, the items were divided into two groups: firstly, items
that showed a negative correlation with the overall average score (and were there-
fore used more commonly by lower level writers), and secondly those that corre-
lated positively with the average score (and were therefore more commonly used
by higher level writers). The correlation for that was negative but not significant,
and was therefore excluded from the analysis. Table 49 below shows the two
groups.

Table 49: Anaphoric pronominals that correlate positively and negatively with the DELNA
writing score
Positive correlation Negative correlation
- these - they
- this - them
- it
- their

Then, these two groups of pronominals were plotted against the overall score (see
Figures 39 and 40 below) to provide a graphic representation of the distribution.
Table 50 below provides the relevant descriptive statistics.

Figure 39: Distribution of this, these over overall sample and DELNA sublevels

Table 50: Descriptive statistics – these, this


DELNA Level Mean SD Minimum Maximum
4 1.17 1.03 0 3
5 2.03 1.97 0 8
6 2.90 2.35 0 10
7 3.45 2.41 0 9
8 4.69 1.93 0 9

The overall distribution of this and these was slightly positively skewed as well as
peaked. It can be seen that the overall mean score was nearly three per script.
About one hundred writers in the sample did not use either of these items.

The box plots (Figure 39) for this and these show that the use of these two pro-
nominals increased as the proficiency level increased. The same can also be seen
in the table depicting the descriptive statistics (Table 50). A Welch test showed
statistically significant differences between the groups, F (4, 66.00) = 20.74, p =
.000. The Games-Howell procedure of pair-wise comparisons showed that adjoin-
ing levels 5 and 6, as well as 7 and 8, were statistically different from each other.

Next, the distribution of the group of pronominals that correlated negatively with
the overall DELNA score were investigated over the different DELNA levels. The
box plots in Figure 40 below show that with increasing proficiency level, fewer of
these pronominals were used. This is also evidenced in Table 51 below.

Figure 40: Distribution of they, them, it, their over overall sample and DELNA sublevels

A Welch test revealed statistically significant differences between the groups, F (4, 61.37) = 7.52, p = .000. However, the Games-Howell procedure showed that
no adjacent band levels differed statistically from each other.

Table 51: Descriptive statistics – they, them, it, their


DELNA level Mean SD Minimum Maximum
4 6.17 4.72 2 9
5 4.86 4.26 0 20
6 3.81 3.85 0 24
7 3.00 3.46 0 27
8 2.12 2.21 0 7

The second variable for cohesion was the number of linking devices in the data.
An analysis of the number of linking devices over the different proficiency levels
(Figure 41) showed that when this variable was controlled for essay length, lower level writers used more linking devices than did higher level writers. This
variable was, however, slightly inconclusive when not controlled for length. Table
52 below shows the descriptive statistics.

Figure 41: Distribution of number of linking devices per total words over overall sample and
DELNA sublevels

The analysis of variance showed statistically significant differences between the different band levels, F (4, 576) = 5.95, p = .000. The Games-Howell procedure
showed that the only adjacent pair of band levels that was statistically signifi-
cantly different was levels 6 and 7.

Table 52: Descriptive statistics – Number of linking devices per total words
DELNA level Mean SD Minimum Maximum
4 .03 .01 .02 .06
5 .02 .01 .00 .06
6 .02 .01 .00 .06
7 .02 .01 .00 .05
8 .02 .01 .00 .04

A qualitative analysis of the linking devices used at different proficiency levels showed that there were three different groups of writers. At very low levels (levels
4 and 5) writers used mainly very simple linking devices, like and, but and be-
cause. At levels 6 and 7, writers used connectives very mechanically (e.g. by list-
ing ideas using firstly, secondly etc.). At high levels, connectors were more varied
and skilfully used.

6.8 Reader-Writer Interaction

Markers of reader-writer interaction were grouped into four categories: markers of writer identity, hedges, boosters and attempted passive voice.

Over half of the scripts made no use of writer identity. However, although so
many scripts did not make use of this category, the mean was just under 2.5 mark-
ers per script. This shows that a number of writers used a large number of these
markers.

Figure 42: Distribution of features of writer identity over overall sample and DELNA sub-
levels

The box plots in Figure 42 above and the table of descriptive statistics (Table 53 below) show the distribution over the different DELNA levels. It is clear that this variable did not differentiate distinctly between the different proficiency levels.

Table 53: Descriptive statistics - Writer identity


DELNA level Mean SD Minimum Maximum
4 2.83 1.95 0 6
5 2.32 2.72 0 19
6 2.64 3.71 0 29
7 2.49 3.01 0 17
8 2.48 3.25 0 7

The analysis of variance showed no statistically significant difference between the
different levels of writing, F (4, 577) = 1.07, p = .368.

The second variable under investigation was the number of hedging devices. On
average, writers used just under six of these structures per script.

Figure 43: Distribution of hedging devices over overall sample and DELNA sublevels

When broken up into the different DELNA band levels, the use of hedging de-
vices can be seen to have quite clearly distinguished between different levels of
writing. This is revealed in the box plot (Figure 43 above) as well as in the table
summarising the descriptive statistics (Table 54 below). The table shows that
whilst writers at lower levels used on average about five hedging devices in their
writing, higher level writers used more than eight of these devices.

Table 54: Descriptive statistics – Hedges


DELNA level Mean SD Minimum Maximum
4 5.00 3.10 1 12
5 4.70 2.85 0 15
6 5.84 3.88 0 20
7 6.38 3.68 0 17
8 8.42 2.91 4 14

The analysis of variance revealed a statistically significant difference between the
groups, F (4, 596) = 7.39, p = .000. The Games-Howell procedure showed that
levels 5 and 6 as well as levels 7 and 8 were statistically distinct from each other.
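Where group variances are comparable, a standard one-way analysis of variance of this kind is available directly in scipy. The sketch below is a generic illustration with invented hedge counts, not the study data.

# Generic one-way ANOVA over hedge counts grouped by band level (toy data).
from scipy import stats

level_5 = [4, 5, 3, 6, 4]
level_6 = [6, 5, 7, 4, 6]
level_7 = [7, 6, 8, 5, 7]
level_8 = [9, 8, 10, 7, 8]

f_stat, p_value = stats.f_oneway(level_5, level_6, level_7, level_8)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")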

The hedging variable was also investigated when script length was controlled.
This showed an even stronger difference between the different proficiency levels.

The final variable investigated in this category was boosters. The distribution per
band level (box plots in Figure 44) and the table indicating descriptive statistics
(Table 55) show that this category failed to distinguish between different levels of
writing because writers of all levels used on average about 2.5 boosters in their
writing.

Figure 44: Distribution of boosters over overall sample and DELNA sublevels

An analysis of variance revealed no statistically significant differences between the different levels of writing, F (4, 596) = .157, p = .960.

Table 55: Descriptive statistics - Boosters


DELNA level Mean SD Minimum Maximum
4 2.50 1.45 0 9
5 2.50 2.14 0 16
6 2.40 2.05 0 12
7 2.44 2.06 0 11
8 2.46 1.88 0 14

Finally, the use of the attempted passive voice was investigated.

Inter-rater reliability was established by a Pearson correlation between the coding of two raters on a sample of fifty scripts. The correlation coefficient shows a strong relationship, r = .898, n = 50, p = .000.
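Agreement between two coders on a frequency count of this kind can be expressed as an ordinary Pearson correlation between their parallel codings of the same scripts. The sketch below illustrates the calculation with invented values for ten scripts; it is not the coding data used here.

# Inter-rater reliability as a Pearson correlation between two coders' counts
# for the same scripts (invented values for illustration).
from scipy import stats

rater_a = [0, 1, 2, 1, 0, 3, 2, 1, 4, 2]
rater_b = [0, 1, 2, 2, 0, 3, 1, 1, 4, 2]

r, p = stats.pearsonr(rater_a, rater_b)
print(f"r = {r:.3f}, p = {p:.3f}")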

Figure 45: Distribution of passives over overall sample and DELNA sublevels

The box plots (Figure 45) and the table below (Table 56) show that higher level writers used the passive more frequently, whilst hardly any writers at level 4 attempted this structure; however, the differences between the different levels of writing proficiency were very small on average.

Table 56: Descriptive statistics – Passives


DELNA level Mean SD Minimum Maximum
4 .33 .89 0 3
5 .80 1.09 0 5
6 1.03 1.25 0 6
7 1.05 1.25 0 8
8 1.38 1.70 0 5

An analysis of variance revealed no statistically significant differences between the different levels of writing, F (4, 596) = 2.37, p = .052.

Finally, it was of interest whether there was a relationship between the use of
markers of writer identity and the passive voice. It is conceivable, for instance,
that writers who use markers of writer identity (by projecting their own voice into
the text) use fewer passives. A correlation analysis was conducted which, contrary to this expectation, showed a positive relationship between these two variables, r = .304, n = 583, p = .000.
This means that writers who used more passives also tended to use more markers
of writer identity.

6.9 Content

The final category investigated was content. Content was divided into three sections, closely following the three sections of the prompts: data description, data interpretation and Content Part 3.

Part one, the description of data, was calculated as the percentage of information described.

Inter-rater reliability was established by having a second rater double code a sub-
set of fifty scripts. The relationship was significant, r = .821, n = 50, p = .000.

The mean for all scripts was 0.59, indicating that, on average, the writers included
just under 60% of the possible data. Whilst some writers did not attempt this sec-
tion of the task and therefore scored 0%, about 70 writers described all the pieces
of information deemed important by the expert writers and therefore scored 100%.
The largest number of writers (more than 150) described 50% of the data.
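In operational terms, each script's data-description score is the proportion of the information units identified in the expert writers' model answers that the script covers. A minimal sketch of this calculation is given below; both sets of information units are invented examples of coded content points.

# Proportion of expert-identified information units covered by a script.
# Both sets are invented examples of coded content points.
expert_units = {"overall upward trend", "peak in 1995",
                "decline after 2000", "gap between groups"}

script_units = {"overall upward trend", "gap between groups"}

coverage = len(script_units & expert_units) / len(expert_units)
print(f"Data description score: {coverage:.0%}")   # 50% for this example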

Figure 46: Distribution of proportion of data description over overall sample and DELNA
sublevels

The box plots in Figure 46 and Table 57 indicate that this variable, based on the
mean scores, splits the data set into clearly separate levels.

Table 57: Descriptive statistics - Data description


DELNA level Mean SD Minimum Maximum
4 .19 .07 .13 .33
5 .47 .25 .00 1.00
6 .58 .23 .00 1.00
7 .61 .21 .00 1.00
8 .81 .20 .13 1.00

A Welch test was performed to investigate differences between the groups. The analysis revealed statistically significant differences between the groups, F (4, 55.58), p = .032. However, no adjacent pairs were found to be statistically distinct by the Games-Howell procedure.

The second part of each writing task, the interpretation of data, was scored in terms of the number of reasons given for the facts described in the data.

An inter-rater reliability test established a strong correlation between the two cod-
ers, r = .811, n = 50, p = .000.

The data distribution over the five DELNA proficiency levels was investigated
using side-by-side box plots (Figure 47). The means (as seen in Table 58 below)
ranged from 1.6 to 4.5 reasons, showing a clear differentiation according to level.

A Welch test revealed statistically significant differences between the groups in-
volved, F (4, 57.26) = 5.78, p = .001. The Games-Howell procedure showed that
no adjacent band levels were statistically distinct from each other.

Figure 47: Distribution of interpretation of data over overall sample and DELNA sublevels

Table 58: Descriptive statistics - Interpretation of data


DELNA level Mean SD Minimum Maximum
4 1.64 .50 1.00 2.00
5 2.67 1.39 0.00 6.00
6 3.23 1.51 0.00 8.00
7 3.54 1.38 0.00 8.00
8 4.52 1.23 2.00 7.00

The final part of each prompt, Content Part 3, required the writer either to describe how the current situation will or can be changed in the future or to describe a similar situation in their own country. This part was again scored by giving a point for each proposition.

An inter-rater reliability check was undertaken for the coding of this variable. The resulting coefficient indicates strong reliability, r = .807, n = 50, p = .000.

When the number of propositions in part three of the prompt was plotted against
the overall DELNA score (Figure 48), it became clear that the data was separated
well by this variable. Descriptive statistics can be seen in Table 59 below.

Figure 48: Distribution of Content Part 3 over overall sample and DELNA sublevels

The analysis of variance revealed statistically significant differences between the groups involved in the analysis, F (4, 576) = 9.61, p = .000. The Games-Howell
procedure showed that the only adjoining band levels statistically distinct from
one another were levels 5 and 6.

Table 59: Descriptive statistics - Content Part 3


DELNA level Mean SD Minimum Maximum
4 .73 .90 0.00 2.00
5 1.68 1.31 0.00 6.00
6 2.28 1.54 0.00 7.00
7 2.62 1.55 0.00 9.00
8 4.00 1.22 2.00 6.00

6.10 Conclusion

The results in this chapter are based on the analysis of 601 writing scripts pro-
duced as part of the 2004 administration of DELNA. Each measure was plotted
against the different proficiency levels, providing a clear visual overview of the
distribution. Inferential statistics were presented for each structure. The analysis
revealed that the variables in Table 60 below successfully differentiated between
the different levels.

Table 60: Variables successful in differentiating between levels
Construct Measure
Accuracy Percentage error-free t-units
Fluency Number of self-corrections
Complexity Average word length
Sophisticated lexical words / total lexical words
Number of AWL words
Mechanics Paragraphing
Coherence Parallel progression
Direct sequential progression
Superstructure
Indirect progression
Unrelated progression
Coherence break
Cohesion Anaphoric pronominals – these, this
Linking devices – qualitative analysis
Reader/writer interaction Number of hedges
Content Percentage data supplied
Number of propositions (Part 2 and 3)

The next chapter discusses the findings presented above. There, relevant previous research is related to the current data. Based on the findings in this chapter, the new rating scale is developed.

---
Notes:
1. The n-size in each histogram differs because of missing values resulting from the analysis.
2. No writing scripts scored at level 9 were included in this analysis because, of the more than two thousand scripts, only three writing samples received a score of 9 from both raters. These three scripts were excluded from any further analysis. Scripts that were scored at level 9 by only one rater were rounded down as part of the calculation of the average of the two raters.
3. Percentages are represented as proportions of 1 in the data below.
4. A further analysis showed that if the variable is controlled for the number of words per script, the variable is even more discriminating between levels. For reasons of space it was not reproduced in this chapter.
5. See Note 4.
6. See Note 4.

Chapter 7: Discussion – Analysis of Writing Scripts

7.1 Introduction

The main aim of this study was to investigate whether a more detailed, empiri-
cally developed rating scale for writing would result in more reliable and valid
rater judgements in a diagnostic context than a more intuitively-developed rating
scale. During Phase 1, 601 writing scripts produced as part of a normal opera-
tional administration of the DELNA (Diagnostic English Language Needs As-
sessment) were analysed using detailed discourse analytic measures. The method-
ology and results of Phase 1 were presented in Chapters 5 and 6. This chapter pre-
sents a discussion of the first subsidiary research question:

Which discourse analytic measures are successful in distinguishing between writing samples at different DELNA writing levels?

In this chapter, key findings are summarized and discussed in relation to previous
literature. At the end of the discussion of each aspect of writing, the relevant new
trait scale is presented. The following principles were followed for the design of
the rating scales:

• Only measures that successfully distinguished between the different levels of writing were used in the rating scale (i.e. measures that were statistically significant and did not result in a u-shaped or n-shaped distribution)
• The measures selected needed to be usable by raters in a rating situation
• The differences between levels needed to be large enough to be detectable by raters
• The measures had to be reliable (as indicated by an inter-rater reliability measure). A reliability of over .80 was seen as acceptable
• Only measures that incorporated features that were found in most scripts were included in the scale
• If several measures were available for a certain feature of writing, the one that would be the easiest to apply in a rating situation and that had the best discrimination between levels was chosen
• Rating scales were designed based on either the numeric value of a measure or an approximation (e.g. 50% was represented as ‘half’)

7.2 Accuracy

A number of measures identified in the literature were explored for the analysis of
accuracy in the pilot study. All measures trialed discriminated successfully be-
tween the different levels. However, most of these measures were difficult to apply in the rating process or did not account for differences in text length. It was decided to use the percentage of error-free t-units in the main study.

This measure proved highly successful in distinguishing between the five different
band levels in the DELNA corpus although there was some overlap. The analysis
of variance showed that there were statistically significant differences between the
different levels of writing.

The measure of percentage of error-free t-units has been used repeatedly in the
literature. Wolfe-Quintero et al. (1998), in the most comprehensive review of studies on measures of accuracy, fluency and complexity, noted the varying success of the measure. Twelve studies found a significant relationship between pro-
ficiency level and the percentage of error-free t-units while eleven did not. In this
study, the measure proved successful with some overlap between levels. It is pos-
sible that the overlap was partly due to some learners’ accuracy not increasing in a
linear fashion (as was for example found by Henry, 1996).

One reason why this measure was so successful in discriminating between the dif-
ferent levels of writing ability could be the way the scripts were grouped into lev-
els (one of the limitations of this study). The scripts were classified according to
scores awarded on the basis of the existing DELNA rating scale. However, some
authors (e.g. Weigle, 2002) have shown that raters base their decisions mainly on
a holistic rating of writing scripts. This holistic score is often highly correlated
with the number of errors in a script. Raters seem to base their decisions on the
accuracy of a script, as this is a very noticeable feature of a writing script. There-
fore, it is possible that the ratings used for the groupings of the scripts in this
study were mainly based on a holistic impression of the accuracy of a script.

7.2.1 Trait scale: accuracy

The rating scale for accuracy was designed so that the raters did not have to actu-
ally count each error-free t-unit. Instead, it required them to estimate the propor-
tion of error-free t-units when reading a script. It was further decided that raters
did not need to be trained to identify t-units in this data because a brief analysis of
t-unit borders showed that these coincided in over 90% of the cases with sentence
breaks.

Although the analysis of the scripts only showed five distinct levels of accuracy
(because no scripts at level 9 were included in the analysis), a sixth level was
added to the trait scale of accuracy to acknowledge completely error-free scripts.
The rating scale for accuracy is shown in Table 61.

Table 61: Rating scale - Accuracy
9  All sentences error-free
8  Nearly all sentences error-free
7  About ¾ of sentences error-free
6  About half of sentences error-free
5  About ¼ of sentences error-free
4  Nearly no or no error-free sentences
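If the estimated proportion of error-free sentences were expressed as a number, the descriptors in Table 61 could be approximated by simple cut-off points, as in the sketch below. The cut-offs are one possible reading of ‘about ¾’, ‘about half’ and so on, not values prescribed by the scale itself.

# One possible numeric reading of the accuracy descriptors in Table 61.
# The cut-offs are illustrative approximations, not part of the scale.
def accuracy_band(error_free_ratio):
    if error_free_ratio >= 1.0:
        return 9      # all sentences error-free
    if error_free_ratio >= 0.85:
        return 8      # nearly all error-free
    if error_free_ratio >= 0.65:
        return 7      # about three quarters
    if error_free_ratio >= 0.40:
        return 6      # about half
    if error_free_ratio >= 0.15:
        return 5      # about one quarter
    return 4          # nearly none or none

print(accuracy_band(18 / 24))   # e.g. 18 of 24 sentences error-free -> band 7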

7.3 Fluency

7.3.1 Temporal fluency

Two types of fluency were investigated in both the pilot study and the main analy-
sis of the writing scripts: temporal and repair fluency. The measure chosen for
temporal fluency was the number of words produced within the time limit of 30
minutes. Although some doubt existed about this measure because of varying
findings of other research studies and the fact that not all students used the 30
minutes they were entitled to, this measure produced some promising results in
the pilot study. Therefore, the number of words was analysed in the main study.
Although the histogram showed large variation among the scripts in terms of the
number of words produced, this measure was not successful in distinguishing be-
tween the different proficiency levels.

Wolfe-Quintero et al. (1998), in their review of the literature, also found varying
results for this measure. Although ten studies found significant differences among
proficiency levels, seven did not. As in this study, Larsen-Freeman (1978; 1983)
and Henry (1996) found a ceiling effect around the higher levels or even a de-
crease at the advanced level. The findings of this study are also in line with Cum-
ming et al.’s (2005) investigation of TOEFL essays. The authors also failed to
find a significant difference between the two higher levels (levels 4 and 5). A
similar study looking at IELTS essays (Kennedy & Thorp, 2002) did not differen-
tiate between immediately adjacent levels, but looked only at differences between
essays at levels 4, 6 and 8. Although the authors fail to report means for each
level, the minimum and maximum number of words at each level also indicate a
large amount of overlap, even though the levels were not adjacent. Essays at level
4 ranged from 111 to 370 words, essays at level 6 from 184 to 485 words and es-
says at level 8 from 239 to 457 words. These ranges suggest that there was proba-
bly no statistical difference between the essays at higher levels. It could therefore
be argued that the number of words is more successful in distinguishing between
lower level writers, but is not a measure that can be expected to successfully dif-
ferentiate between students who have already been admitted to university.

7.3.1.1 Trait scale: temporal fluency

It was decided not to include temporal fluency in the rating scale because there
was little evidence from the analysis of the scripts that there are differences be-
tween the levels of writing in terms of the number of words that writers produce.

7.3.2 Repair fluency

The second measure of fluency was repair fluency, operationalised as the number
of self-corrections. This measure has not been applied to writing before but was
‘borrowed’ from research on speaking. The variable distinguished successfully
between the different proficiency levels but the differences between levels were a
lot less pronounced than for accuracy. The measure was included in the rating
scale but there is some doubt regarding its usefulness in the context of writing.
This will be discussed in more detail in the context of Research Question 2, in
light of the feedback from the raters.

7.3.2.1 Trait scale: repair fluency

On the basis of these findings, the rating scale for fluency was based only on the
variable ‘the number of self-corrections’. The scale largely followed the findings
from the analysis. The levels were slightly adjusted to allow for better distinctions
between bands. For example, band level 8 was designed to include no more than five self-corrections, although the analysis of band 8 scripts showed a mean of nearly seven. As with accuracy, a sixth level (level 9) was added to the scale to ac-
knowledge scripts with no self-corrections. The rating scale for repair fluency is
shown in Table 62 below.

Table 62: Rating scale - Repair fluency


9  No self-corrections
8  No more than 5 self-corrections
7  6-10 self-corrections
6  11-15 self-corrections
5  16-20 self-corrections
4  More than 20 self-corrections

Overall, it can be said that the area of fluency in writing is under-researched. If more appropriate and successful rating scale descriptors for fluency are to be de-
veloped, then this area needs more attention in the future. The area of second lan-
guage acquisition could contribute greatly to this endeavour.

7.4 Complexity

7.4.1 Grammatical complexity

Two different types of complexity were investigated in both the pilot study and
the main analysis of the writing scripts: grammatical and lexical complexity.
Grammatical complexity was operationalised as clauses per t-unit. Although the
measure was successful in the pilot study, it failed to distinguish between the five
levels of writing in the main analysis. Interestingly, Wolfe-Quintero et al. (1998)
found that although a number of studies in their review returned non-significant
results for this measure, it seemed to generally increase at least with overall profi-
ciency level. However, a more recent study also undertaken in an assessment con-
text, in this case TOEFL (Cumming et al., 2005), also returned non-significant
results for this measure. The authors report very little difference between the pro-
ficiency levels, with the means ranging from 1.5 to 1.8. These are slightly higher
and more varied than those found in this study (which found means ranging from
1.4 to 1.5), but present a similar picture to the current findings. It is possible that
the context under which the data are collected plays a role in this measure. The
students taking the DELNA assessment were aware that their writing was going to
be assessed. It is possible, therefore, that when students are in an assessment situa-
tion, they employ a play-it-safe method and focus more on the accuracy (and lexi-
cal complexity) of their writing at the expense of grammatical complexity. It is
interesting though that the complexity of sentences is regularly included in rating
scales of writing. If this and other studies show that writers do not differ greatly
from each other in terms of the complexity of their sentence structure when in an
assessment context, this measure should perhaps not be included in rating scales
in the future. It might be important to make raters and rating scale designers aware
of the limitation of this measure.

It can further be argued that not having a successful measure for grammatical
complexity is a limitation of this study. If time had allowed it, it would have been
useful to pursue other measures of grammatical complexity. A possibility for fur-
ther research would be measures of the number of passives per t-unit or complex
nominals per t-unit. However, Wolfe-Quintero et al.’s review shows that in previ-
ous research not many measures of grammatical complexity have been successful.

7.4.1.1 Trait scale: grammatical complexity

Based on these findings, the decision was made not to include this variable in the
rating scale.

7.4.2 Lexical complexity

The second type of complexity pursued was lexical complexity. Several measures
were examined in the pilot study and the three most promising measures were ex-
amined in the main analysis. These were sophisticated lexical words over total lexical words, average word length and the number of AWL words. All three meas-
ures were successful in distinguishing between the different levels of writing. The
measure of sophisticated lexical words over total lexical words was used in a longitudinal study by Laufer (1994). Like Laufer’s, this study was able to show that this measure differentiates between proficiency levels. The concern was, however, that it would be difficult for raters to use in the rating process. The average word length was also successful in distinguishing between the different proficiency levels, as it has been in other studies (e.g. Grant & Ginther, 2000). However, there was concern about whether raters would be able to judge the average word length when rating. Differences between the different proficiency levels were not pronounced enough to be detected by human raters examining a hand-written writing product. For this reason, the measure, although promising, was not included in the rating scale. The only measure incorporated in the scale was the number of AWL
words in a text. Although the original measure was the percentage of AWL words,
a brief investigation of the scripts in the sample showed that controlling for text
length in this manner made no difference to the result. It was therefore thought
that it might be easier for raters to look for the number of AWL words. No prior
research could be located for this measure, although it parallels the variable of so-
phisticated lexical words over total lexical words. There are, of course, several
problems with this measure. It does not control for students reusing the same word
on several occasions. It would possibly have been better to measure the number of
different AWL words. However, overall, the number of AWL words seems a
promising measure which might be usefully applied in other contexts.
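Counting AWL words amounts to checking each word in a script against the Academic Word List. The sketch below illustrates the idea with a tiny invented subset of the list; a full implementation would need the complete set of headwords and their family members (so that inflected and derived forms are also counted).

# Counting Academic Word List (AWL) items in a script.
# AWL_SAMPLE is a tiny invented subset; the full list has 570 word families.
import re

AWL_SAMPLE = {"analyse", "analysis", "data", "economy", "environment",
              "factor", "research", "significant", "theory"}

def count_awl_words(text, wordlist=AWL_SAMPLE):
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return sum(1 for token in tokens if token in wordlist)

sample = "The data suggest that economic factors play a significant role."
# Returns 2 with this reduced list; 'economic' and 'factors' would only be
# caught if word-family members were included.
print(count_awl_words(sample))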

7.4.2.1 Trait scale: lexical complexity

Two main considerations went into the design of the descriptors for lexical com-
plexity. Firstly, the variable used needed to be usable by raters in a testing situa-
tion. This excluded the average word length and sophisticated lexical words over
total lexical words because measuring these would be time-consuming. The num-
ber of AWL words was seen to be usable in a rating situation. Secondly, it was
decided that six levels of AWL words would be difficult to distinguish for raters.
Therefore, only four levels were created, joining levels 4 with 5 and 8 with 9. The
rating scale for lexical complexity can be seen in Table 63 below.

Table 63: Rating scale – Lexical complexity
8  Large number of words from academic wordlist (more than 20)
7  Between 12 and 20 AWL words
6  5-12 words from AWL
5  Less than 5 words from AWL

7.5 Mechanics

Three measures were used for the analysis of mechanics: spelling, punctuation
and paragraphing. The number of spelling mistakes was a promising measure, but
the only really worthwhile difference was found between levels 4 (with eight mis-
takes on average) and level 5 (with just under four errors on average). The other
four proficiency levels were very similar and probably not distinguishable by rat-
ers. No prior research was found that investigated spelling mistakes over different
proficiency levels. However, it was interesting to see that although there were
slight differences between the levels, this measure was not very successful in dif-
ferentiating them. One explanation why the number of spelling errors did not dif-
ferentiate between the different writing levels is that lower level writers know
fewer words and although they produce many mistakes when spelling these
words, in relative terms they can make only a certain number of mistakes. The
words they know are often just very simple, easily spelt words. Writers at higher
levels have access to a larger vocabulary and therefore the chances of misspelling
words also increases. For these reasons, higher level writers produce the same
number of mistakes as lower level writers.

It is therefore not surprising that the measure of the number of spelling mistakes
was not successful in identifying differences between writers at different levels
(except between levels 4 and 5). However, spelling is regularly included in rating
scales of writing. If there are indeed very few differences between writers in the
number of mistakes they produce, then it might be necessary to bring this fact to
the attention of rating scale developers. Further research in this area is clearly
necessary.

The number of punctuation errors did also not distinguish between the different
levels of writing. Very little research was identified on this measure. Mugharbil
(1999) was able to show in his study that the full stop (the only punctuation mark
that was examined in this study) is acquired first by learners. Therefore, it is pos-
sible that there were very few differences between the learners in this study be-
cause all had reached post-beginner level as they were already at university. It
might have been better to include comma errors into the analysis, as Mugharbil
was able to show that the correct usage of the comma is what differentiates higher and lower level learners. However, the comma was not included, as it would have
been difficult to achieve inter-rater reliability on this measure. Punctuation and
capitalisation, which were already identified in the pilot study as not being able to
differentiate between the learners, do not seem to be worthwhile measures to pur-
sue in the future.

The third measure used for the analysis of mechanics, was paragraphing. This
measure has not been used in this form in any previous studies and was created
specifically for the tasks used in the context of DELNA. A slightly similar analy-
sis of paragraphing in a study by Kennedy and Thorp (2002) failed to produce any
clear results between writers in the IELTS test, although they found that writers at
level 4 produced significantly more essays with only one paragraph than writers at
level 6. Although the measure used in this study distinguished between the differ-
ent proficiency levels, there are some problems with it. For example, it disregards
unnecessary paragraph breaks. That is, if a writer produces very short, two-
sentence paragraphs, this is not penalized. The measure also does not account for
the ordering of the information within a paragraph. So, whilst the measure can be
applied easily and seems to be successful in discriminating between different lev-
els of writing (as was shown by the analysis), it is very mechanical and has very
clear shortcomings. It would be useful if further studies could attempt to develop a
more sophisticated measure of paragraphing.

7.5.1 Trait scale: mechanics

Based on these findings, it was decided that spelling and punctuation would not be
included in the rating scale, as a clear differentiation between levels of writing
could not be shown. The trait scale for paragraphing can be found in Table 64 be-
low.

Table 64: Rating scale - Paragraphing


9  five paragraphs
8  four paragraphs
7  three paragraphs
6  two paragraphs
5  one paragraph

The first consideration when designing the trait scale was that although the
DELNA scale has six levels, for this new scale only five levels would be possible.
The histogram shows that only very small percentages of students in the sample
produced either one or five paragraphs and the majority produced three (nearly
half of the students). A decision had to be made as to which level to leave empty –
either the highest level (9) or the lowest level (4). As five paragraphs were seen as the perfect response, level 4 was left empty and the descriptors were scaled from
levels 5 to 9.

7.6 Coherence

Seven topical structure categories were investigated: three had been used previously in other studies (parallel progression, direct sequential progression and unrelated progression), two were taken from the literature but adapted to suit the data (indirect progression and extended progression) and two were newly developed (superstructure and coherence break).

The findings of the analysis of the scripts are generally in line with existing stud-
ies, showing that higher level writers use more direct sequential progression (as
was shown by Wu, 1997), less parallel progression (as shown by Burneikaité &
Zabiliúté, 2003; Schneider & Connor, 1990; Wu, 1997) and less unrelated pro-
gression (as found by Wu, 1997). As with Schneider and Connor, no difference
was found in the use of extended progression by higher and lower level writers.
The findings for the new categories followed the results of the pilot study. It was
shown that higher level writers use more superstructures and linkers to make their
writing coherent and make use of more indirect progression and fewer coherence
breaks.

The correlation coefficients that resulted from the correlation of the overall writ-
ing score with the different topical structure categories can be seen as rather weak.
There are two reasons for this. Firstly, the correlations are based on a large sample of scripts (601); with samples of this size, even weak relationships reach statistical significance, so modest coefficients are not unexpected. A second reason is that there are a number of intervening variables at play.
The final writing score is a product of a number of different aspects. Earlier in this
chapter, a case was made for the fact that raters often put more emphasis on more
explicit features of a piece of writing, for example, accuracy and vocabulary.
Therefore, lower correlation coefficients can be expected. Overall, the results for
the analysis of coherence are very promising and might be applicable to other
contexts and studies.

Although the correlations of the topical structure categories with the overall writ-
ing score (see Table 47) were all of more or less similar strength, it is possible that
a multiple regression analysis might show that some categories are more indica-
tive of high or low level writing. If some categories predict certain levels of writ-
ing more than others, then it would be possible to reduce the number of categories
that are included into the rating scale. This would probably simplify the rating
process for the raters and make rater training significantly easier. However, as the
idea of a multiple regression analysis only emerged after the design and trialling of the rating scale, it was not pursued but left as a possible avenue for further research.
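Such an analysis would regress the overall writing score on the frequencies of the topical structure categories and examine which coefficients carry the most weight. The sketch below illustrates the idea with invented counts for three of the categories, using an ordinary least-squares fit from numpy; it is not an analysis of the study data.

# Sketch of the multiple regression suggested above: overall writing score
# regressed on counts of topical structure categories (invented toy data).
import numpy as np

# Columns: parallel, sequential and unrelated progression (counts per script)
X = np.array([
    [5, 2, 4],
    [4, 3, 3],
    [3, 5, 2],
    [2, 6, 1],
    [1, 8, 0],
    [2, 7, 1],
])
y = np.array([4, 5, 6, 7, 8, 7])                   # overall writing scores

X_design = np.column_stack([np.ones(len(X)), X])   # add an intercept column
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("intercept and weights:", np.round(coef, 2))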

7.6.1 Trait scale: coherence

The rating scale that was based on these findings can be seen in Table 65 below.

Table 65: Rating scale – Coherence


8 Writer makes regular use of superstructures, sequential progression and possibly
indirect progression
Few instances of unrelated progression
No coherence breaks
7 Frequent: sequential progression, superstructure occurring more frequently
Infrequent: parallel progression
Possibly no coherence breaks
6 Mixture of most categories
Superstructure relatively rare
Few coherence breaks
5 As for level 4, but coherence might be achieved in stretches of discourse by overus-
ing parallel progression
Only some coherence breaks
4 Frequent: Unrelated progression, coherence breaks and some extended progression
Infrequent: sequential progression and superstructure

The design of the trait scale for coherence was more difficult than for other scales,
as the results for a number of categories had to be considered and synthesized.
Firstly, the number of band levels needed to be decided. Because the analysis of
the scripts was based only on levels 4 to 8, it was decided that these levels would
also be used for this scale. The two outer levels (levels 4 and 8/9) were described
first. Here, information was included for raters on what features might be most
commonly or least commonly expected. The central three levels were scaled
based on the findings of the analysis, after a detailed scrutiny of the box plots for
these levels.

7.7 Cohesion

Two aspects of cohesion were investigated in the main study: anaphoric pronomi-
nals and the number of linking devices. The anaphoric pronominals this and
these were shown to be used more by writers of higher proficiency, whilst the re-
mainder (they, them, it, their) were used more by lower level writers (as shown by
a negative correlation with the overall score). The same finding was made by
Banerjee and Franceschina (2006) in the context of the IELTS test. They found a strong increase in the use of this and these at higher levels. The reason why higher
level writers use more of these two demonstrative pronominals might be that it is
more difficult for lower level writers to refer anaphorically to ideas (which is the
main function of this and these).

Whilst the use of this and these produced a clear differentiation between the levels
of writing in terms of means (although with a lot of overlap in the distribution of
the levels), the number of linking devices (e.g. however, therefore) used was not
very strongly indicative of writing proficiency. However, it was interesting to ob-
serve that lower level writers produced slightly more of these devices than higher
level writers. It is also interesting that the findings in the literature are divided on
this. Most studies investigated the difference between native and non-native
speakers in the use of linking devices. Reid (1992) and Field and Yip (1992) both
found that non-native speakers overused these devices, although a large scale
study by Granger and Tyson (1996) was not able to confirm these findings. Ken-
nedy and Thorp (2002), however, in the context of IELTS, were able to show that
lower level writers used linking devices like ‘however’ and enumerative markers
like ‘firstly’ more frequently than writers rated at level 8. In the current study,
only a careful qualitative analysis of the type of devices used was able to differen-
tiate between writing levels (as was also suggested by Granger and Tyson, 1996).
Even this differentiation was not as discriminating as other measures used in this
analysis, and therefore resulted only in four levels in the new scale.

Lexical cohesion, as operationalised by Halliday and Hasan (1976), was not in-
cluded in the main analysis of the writing scripts. Counting lexical chains or the
length of chains proved too time-consuming to be included in the rating process.
The lack of a measure of lexical cohesion needs to be accepted as a limitation of
this study.

7.7.1 Trait scale: cohesion

The design of the trait scale for cohesion was the most difficult, because the
analysis of the frequency of use of linking devices did not produce clear findings
and the differences between the levels in the use of anaphoric pronominals were
small. It was decided not to include the pronominals used more by lower level
writers to avoid having too lengthy level descriptors. Because the analysis of the
number of linking devices was not as clear as hoped for, it was decided to include
the findings from the qualitative analysis. The analysis of the scripts did not make
it possible to easily distinguish between more than four levels and therefore levels
8 and 9 as well as levels 6 and 7 were combined. Table 66 shows the scale devel-
oped for this category.

Table 66: Rating scale – Cohesion
8 Connectives used sparingly and skilfully (not mechanically) compared to text
length, and often describe a relationship between ideas
Writers might use this/these to refer to ideas more than four times
7 Slight overuse of connectives compared to text length.
Connectives might be used mechanically (e.g. firstly, secondly, in conclusion)
One or two connectives might be misused
Some connectives skilfully used
This/these to refer to ideas possibly used up to four times
5 Overuse of connectives compared to text length
Connectives used are often simple (and, but, because).
Some might be used incorrectly.
This/these to refer to ideas used only once or twice
4 Overuse of connectives compared to text length
Connectives used are often simple (and, but, because).
Some might be used incorrectly.
This/these not or very rarely used.

7.8 Reader/writer interaction

Four different categories of reader/writer interaction were analysed: hedges, boosters, writer identity and attempted passive voice.
main analysis of the writing scripts showed that lower level writers use fewer
hedges than writers of higher proficiency. This in fact was by far the most dis-
criminating measure identified in this category. Very little research has looked at
differences over proficiency levels in the use of hedging. Most prior research has
focussed on the features of writing found in particular groups of writers (e.g. Chi-
nese L2 writers – Hu, Brown and Brown, 1982 or EFL writers – Bloor and Bloor,
1991) or a comparison between groups of writers (e.g. L1 and L2 students –
Hyland and Milton, 1997). Intaraprawat and Steffensen (1995), however, were
able to show that better L2 writers used twice as many hedges as poor writers. The
same finding was reported by Kennedy and Thorp (2002) in the context of the
IELTS writing task. All these findings, as well as the results of the current study,
suggest that including the category of hedges into the rating scale descriptors is
warranted, a practice not common in current scales. Further research on this topic
is desirable.

Another category of reader/writer interaction investigated in both the pilot study and the main analysis was boosters. A number of studies (Allison, 1995; Bloor &
Bloor, 1991; Hyland & Milton, 1997; Kennedy & Thorp, 2002) showed that L2
writers (especially at lower levels) overuse boosters in their writing. Similarly,
Intaraprawat and Steffensen (1995) were able to show that lower level ESL writers use more than double the number of boosters as higher level ESL writers. This
study, with a mixed cohort of L2 and L1 writers, was not able to show that lower
level writers used more boosters than higher level writers, as might be hypothe-
sised based on the findings of previous research. The pilot study indicated that this
might be the case but when the larger corpus was investigated, there was only a
very slight difference between the levels. It might be necessary to do a more fine-
grained, qualitative analysis of the type of boosters used to identify differences
between writers of higher and lower proficiency.

The third category of reader/writer interaction investigated was that of writer identity. The prior research findings on this topic were mixed. Hyland (2002a;
2002b) showed that L2 writers use fewer personal pronouns than L1 writers and
that this inevitably resulted in a loss of voice. On the other hand, Shaw and Liu
(1998) were able to show that as L2 students develop their writing, they slowly
move away from the use of personal pronouns and toward using passive verbs.
This study showed that the students investigated in this context used very few ex-
pressions of writer identity. There were furthermore no significant differences be-
tween the writers at different levels. It is possible that the particular genre (exposi-
tory writing) investigated in this study does not lend itself to the expression of
writer identity. Or it may be that the cohort of students investigated has been
taught not to use too many markers of writer identity.

The fourth category of reader/writer interaction investigated was the use of the
passive voice, a category related to writer identity. In line with what Shaw and Liu
(1998) and Banerjee and Franceschina (2006) found, it was seen that as the writ-
ing proficiency level increased, more instances of passive voice were found.
However, overall the frequency was very low. This again might be a feature of the
genre of the task. There was furthermore no negative correlation between the use
of passives and instances of writer identity, as might be expected. The interaction
between these two devices clearly warrants further research.

7.8.1 Trait scale: Reader-writer interaction

After this analysis, it was decided that the only measure of reader-writer interac-
tion that could be transferred into the rating scale was the measure of hedging.
The rating scale can be seen in Table 67 below.

Table 67: Rating scale – Reader-writer interaction


9  More than 9 hedging devices
8  7-8 hedging devices
7  5-6 hedging devices
6  3-4 hedging devices
5  1-2 hedging devices
4  No hedging devices

The analysis showed distinct levels in the number of hedging devices. The de-
scriptors were scaled to match the levels in the DELNA scale and to allow for
clear differentiation between levels.

7.9 Content

Because of a lack of objective measures for assessing content, an empirical method was developed. Expert writers were asked to produce ‘model answers’ to
the different prompts used in this study and the content was evaluated based on
these answers. The first part of the prompt, the description of the data, was evalu-
ated in terms of the percentage of information that was supplied by the DELNA
writer. There was a clear difference between the higher and lower level writers in
the corpus and a good spread across the different categories. Because no other re-
search was located on this topic, no comparison to existing literature was possible.
However, it seems that the measure developed for this study is promising as it dif-
ferentiated successfully between different levels of writing.

7.9.1 Trait scale: Data description

The trait scale for data description can be found in Table 68 below. It can be seen
that the descriptors were scaled to include all levels of the DELNA scale, ranging
from ‘data description not attempted’ to ‘all data described’. It was decided that
the level descriptors would be worded to only broadly represent the percentages
investigated in the quantitative analysis in Phase 1. The information in brackets in
most level descriptors was given to clarify the exact details of the data description
the raters should expect at each level.

Table 68: Rating scale – Data description


9 All data described (all trends and relevant figures)*
8 Most data described (all trends, some figures) (most trends, most figures)*
7 Data half described (all trends, no figures)*
6 Data half described (most trends, some figures)*
5 Data partly described (not all trends, no figures)
(some trends, some figures)*
4 Data description not attempted or incomprehensible*
* incorrectly supplied figures are not counted

The second and third parts of the prompts were evaluated in terms of the number
of reasons (or ideas) and the number of supporting ideas that writers supplied.
This was possible because these sections are clearly demarcated in the essays, and
because of the relatively short time limit. Ideas that did not relate to the topic and
were therefore also not found in the essays of the expert writers were not included in the count. These measures were able to discriminate successfully between the
different levels of writing, although for both sections there was a lot of overlap
between writers at levels 5 and 6. As for the description of the data, no literature
on similar measures was located. It is interesting to see, however, that better writ-
ers were able to produce more relevant ideas in the existing time limit. The reason
for this might be that less space in their working memory is taken up by producing
sentence level grammatical constructions and therefore more ideas can be pro-
duced in the time available.

7.9.2 Trait scale: Interpretation of data

The resulting scale can be found in Table 69 below. In this case, not all six levels
of the DELNA scale were used, because the analysis in Phase 1 did not provide
evidence for six levels. Only four levels were found, but a separate band at the
lowest level was added to provide for scripts that did not attempt this section of
the prompt.

Table 69: Rating scale – Interpretation of data


8  Five or more relevant reasons and/or supporting ideas
7  Four relevant reasons and/or supporting ideas
6  Two to three relevant reasons and/or supporting ideas
5  One relevant reason and/or supporting idea
4  No reasons provided

7.9.3 Trait scale: Content Part Three

As for the interpretation of data, the four levels found in the data were used in the rating scale. A separate level was added at the bottom of the scale to include the large number of scripts that did not attempt this section. The scale can be found in Table 70 below.

Table 70: Rating scale – Content Part 3


8  Four or more relevant ideas
7  Three relevant ideas
6  Two relevant ideas
5  One relevant idea
4  No ideas provided

7.10 Conclusion

Overall, it can be said that most measures investigated were able to discriminate
between the different proficiency levels. A clear limitation of this study is, how-
ever, the way the independent variable of writing proficiency was measured. Because no independent measure of writing ability (or language ability) was avail-
able, DELNA ratings were used as a basis on which the corpora of the different
sub-levels were created. However, the ratings are a product of the rating scale
used, which was the existing DELNA scale. Therefore, it is not clear if the rank-
ing of the candidates into the levels can be trusted. Since one criticism of the ex-
isting rating scale is that the descriptors are non-specific and therefore the ratings
can be seen as slightly unreliable, this is a problem. To alleviate this problem at
least to a certain extent, the ratings of the two raters were averaged. Therefore,
any erratic rating behaviour of individual raters was controlled for.

The aim of this section was to produce a theoretically-based and empirically-developed rating scale for diagnostic assessment. The categories and measures
used for the analysis of the scripts were based on current theories of proficiency
and writing (extracted from the taxonomy presented in Chapter 4), whilst the level
descriptors were developed on an empirical basis, therefore reflecting what actu-
ally happens in student performances. Because some minor changes to the rating
scale were undertaken after an initial trial of the scale, the completed rating scale
will be presented in Chapter 8. The aim of the next section is to validate this pilot
scale.

Chapter 8: Methodology – Validation of Rating Scale

8.1 Introduction

While the previous chapters presented the methodology, results and discussion of
the first phase of this study (the analysis of the writing scripts), the following
chapters provide the methodology, results and discussion chapters of the second
phase, the validation of the new scale. As was mentioned previously in Chapter 5,
the validation phase of this study involved two very different research designs.
The first part, the comparison of the measurement properties of the two different
rating scales, involved a quantitative methodology whilst the analysis of the ques-
tionnaires and interviews was conducted within a qualitative paradigm.

This chapter presents the methodology of the second phase of the study. First, an
overview of the design of the second phase is presented, after which the partici-
pants, instruments and procedures are described in more detail.

8.2 Design

The validation phase of the new rating scale took place in several stages. After
approval from the Human Participants Ethics Committee was obtained, raters
were recruited. This was done by contacting all DELNA raters via an email with
information about the study. Some raters immediately agreed to take part, whilst
others requested more details. The first ten raters who volunteered to participate
were recruited for the study. Therefore, raters were self-selected.
Raters first took part in a rater training session for the existing DELNA rating
scale. After this training session, each rater was given a rating pack of 100 scripts
to take home. They were asked to complete the rating of these scripts over a pe-
riod of three weeks in late January and early February 2006.

While the ten DELNA raters were completing their ratings, the new scale, a train-
ing manual and a questionnaire were trialed on a group of ten research students.
These students studied the training manual at home and were then asked to rate a
number of scripts at a plenary meeting. Some of these ten research students were
DELNA raters not taking part in the study; others had done no previous DELNA
rating although most had some experience of rating writing in other contexts. Dur-
ing the plenary session, the students were able to provide feedback on the scale
and the training manual. At the end of the session, they completed a trial ques-
tionnaire to provide further feedback on the rating scale. A number of changes
were made both to the scale and the training manual based on this feedback.

In early March 2006, the raters participating in the study were trained in using the
new scale. The same procedures were employed as during the rater training ses-
sion in January. Only eight raters were able to take part in this rater training and
the following rating round in May/June 2006. The eight raters filled in the ques-
tionnaire as soon as they completed their rating. The two remaining raters were
unavailable at the time of the training due to personal reasons. They were indi-
vidually trained during May 2006 (following the same procedures) and completed
the rating round during May and June 2006, after which they also completed the
questionnaire.

After these data collection procedures, a break of two months occurred, during
which the researcher analysed the data. During that time, the decision was made
to conduct in-depth follow-up interviews to elicit more detailed information from
the raters. These were undertaken in September 2006, after a refresher rating
round. The schedule for the validation activities outlined above can be found in
Table 70 below.

Table 70: Schedule of validation phase


Research activity Month and year
Recruitment of raters (N = 10) December 2005
Rater training – DELNA rating scale January 2006
Rating round – DELNA rating scale January/February 2006
Trial of new rating scale, training manual and questionnaire   Week 1 – February 2006
Rater training – new rating scale (Raters 1-8) Week 1 – March 2006
Rating round – new rating scale (Raters 1-8) March 2006 (three weeks)
Rater training – new scale (Raters 9 and 10) May 2006
Rating round – new scale (Raters 9 and 10) May/June 2006 (three weeks)
Questionnaires   When raters complete rating round (new scale) – March/June respectively
Refresher rating using both DELNA scale and new scale   September 2006
Interviews September 2006

Overall, the aim of the validation phase of this study was to keep all aspects cen-
tral to performance assessment (the raters, the tasks, the candidates and the rater
training) constant, whilst only the rating scale was varied. It was hoped that any resulting differences in the ratings would therefore be due to differences in the scales alone.

8.3 Participants

Three groups of participants took part in the validation phase of the study. The
first group were the students who produced the writing scripts rated by the raters.
The second group of participants were the research student raters who took part in
a trial of the rating scale, the training manual and the questionnaire. This trial is
described in more detail under Procedures later in this chapter. The third group of
participants was the group of raters. All three groups of participants are described
in more detail below.

8.3.1 The writers

One hundred scripts were chosen for the validation phase of the new scale. These
scripts were selected to represent as closely as possible the larger sample of
scripts in terms of marks awarded and background characteristics of the writers.
The student writers were not actively recruited, but had agreed, when doing the
DELNA assessment, that their scripts could be used for research purposes. Very
little information was available about the writers apart from information recorded
in a self-report questionnaire completed before the administration of the assess-
ment. Below is a summary of the information available.

Most of the one hundred writing samples selected were produced by students who
reported either an Asian language or English as their first language. A smaller
group of nearly ten percent had a European language other than English as their
L1, whilst only two students reported a Pacific Island language or Maori as their
first language. Two students fell into the ‘Other’ category. A summary of the stu-
dents’ first languages can be seen in Table 71 below.

Table 71: First languages of writers of scripts chosen for validation phase (self-report ques-
tionnaire)
L1 N
Asian 40
English 47
European 9
PI/Maori 2
Other 2
Total 100

No information was available about the ages of the students but it is probable that
most of the students were in their late teens or early twenties, as all except one
were enrolled in undergraduate programmes. The gender distribution seen in Ta-
ble 72 below shows that more females than males were in the sample, a trend also observed in the wider DELNA test taker population as well as the university as a
whole.

Table 72: Gender of writers of scripts chosen for validation phase (self-report questionnaire)
Gender N
Female 60
Male 40

Finally, the self-report questionnaire provided information on the programmes the students were enrolled in when taking the writing test (Table 73 below). The ma-
jority of the students were enrolled in an undergraduate programme in Engineer-
ing (39), followed by students from the Faculty of Architecture, Property, Plan-
ning and Fine Arts (17) and students from the Faculty of Education (17). Over ten
percent of the students were enrolled in the Faculty of Arts (11). The remainder
were enrolled in courses in the Faculty of Medical and Health Sciences (6), the
Faculty of Science (5), or were taking a conjoint degree (2). Three students did
not state the degree they were taking.

Table 73: Faculties of writers of scripts chosen for validation phase (self-report question-
naire)
Faculty N
Engineering 39
Architecture, Property, Planning and Fine Arts 17
Education 17
Arts 11
Medical and Health Sciences 6
Science 5
Conjoint 2
No information available 3
Total 100

8.3.2 The research student raters

To trial the new scale, training manual and questionnaire, ten research students
were recruited. These students were part of a group that met weekly to discuss
their research or other topics of relevance or interest to their study. All research
students were PhD candidates in the Department of Applied Language Studies
and Linguistics at the University of Auckland. All had a background in teaching
English as a second language and therefore experience in marking essays written
by students at different proficiency levels. Two of the research students were also
DELNA raters not taking part in the main validation phase. The procedures em-
ployed in the trial of these instruments will be described later in this chapter.

8.3.3 The raters

The DELNA raters involved in the current study were drawn from a larger pool of
raters who were all experienced teachers of English and/or English as a second
language. All raters had high levels of English language proficiency although not
all were native speakers of English. Some raters were certificated IELTS examin-
ers whereas others had gained experience of marking writing in other academic
contexts. Table 74 below presents background information about each rater taking
part in the study. Specific qualifications relating to Language Teaching are noted
with an ‘LT’ next to the qualification. Because the pool of DELNA raters was not
very large, the background information has been kept to a minimum and general-
ised, so that individual raters cannot be easily identified.

Table 74: Background information – raters in validation phase


Rater    Qualification    Teaching experience    Rating experience (including DELNA)
1 MA LT 5 years 5 years
2 PG Dip LT 5 years 2 years
3 MA 34 years 5 years
4 BA 36 years 9 years
5 MA 17 years 1 year
6 PG Dip LT 40 years 7 years
7 BEd 4 years 2 years
8 PG Dip LT 27 years 8 years
9 MA LT 3 years 2 years
10 PhD Arts 10 years 2 years

8.4 Instruments

Several instruments were used as part of the validation phase of this study. Firstly,
there were the one hundred writing scripts that were rated twice by the raters, first
using the existing rating scale and then using the new scale. Other instruments in-
cluded the two different rating scales, the rating sheets, the training manual, the
questionnaires and the interview questions used in the semi-structured interviews.
Each of these will be described in detail in the following sections.

8.4.1 Writing scripts

One hundred writing scripts were chosen from the larger pool of writing scripts
from the 2004 administration of DELNA. The scripts were chosen, as mentioned
in the section on the writers above, to represent the different DELNA levels and
the different background profiles of the candidates in the larger corpus. The dis-
tribution of the four different prompts (described in more detail in Chapter 5) used
in this study is shown in Table 75 below:

Table 75: Distribution of four different writing prompts in sample of one hundred scripts
Prompt N
1 22
2 20
3 38
4 20

The scripts were all photocopied from their original handwritten form, in such a
way that the students’ names and ID numbers could not be identified by the raters.
The scripts ranged from 173 to 450 words (deletions were not included in the
word count), with a mean of 261 words.

8.4.2 Rating scales

To compare the validity of the existing DELNA scale and the newly developed
scale, the raters rated one hundred scripts using both scales. The existing DELNA
rating scale can be found in Chapter 4 and the new scale can be found in this
chapter. Both were analytic rating scales; however, they differed in three ways.
The first and most obvious difference relates to the descriptor styles. Whilst the
existing rating scale had relative, vague descriptors which made use of adjectives
like ‘appropriate’ and ‘extensive’, the descriptors on the new scale were more
specific and mostly involved counting features of writing. Secondly, the DELNA
rating scale had level descriptors for nine categories (or traits) whilst the new
scale had descriptors for ten traits. A comparison of the traits of the two scales can
be seen in Table 76 below. Similar traits are noted in the same row of the table.
The third way in which the scales differed had to do with the number of levels as-
sociated with each trait. The DELNA scale had the same number of levels for
each trait (six band levels ranging from 4 to 9), whilst the new scale had varying
numbers of levels for different traits. Some categories had only four levels whilst
others had six. The reason for this was that the analysis of the writing scripts con-
ducted in Phase 1 of this research did not provide evidence for the same number
of levels for each trait scale. The number of band levels for each trait scale can
also be seen in Table 76 below.

Table 76: Comparison of traits and band levels in existing and new scale
DELNA scale (band levels)        New scale (band levels)
Grammatical accuracy (6)         Accuracy (6)
Sentence structure (6)           –
Vocabulary and spelling (6)      Lexical complexity (4)
Data description (6)             Data description (6)
Data interpretation (6)          Data interpretation (5)
Data – Part 3 (6)                Data – Part 3 (5)
Style (6)                        Hedging (6)
Organisation (6)                 Paragraphing (5)
–                                Coherence (5)
Cohesion (6)                     Cohesion (4)
–                                Repair fluency (6)

8.4.3 Rating sheets

For every assessment using DELNA, raters use a marking sheet based on the de-
scriptors for the current rating scale. Therefore, the same sheet was used in this
study when the existing rating scale was employed. A different marking sheet was
designed for the new scale, for two reasons. Firstly, for the purpose of this study,
the raters did not need to write any comments (as they usually do when using the
existing DELNA scale). Therefore, no space for comments was needed on the
marking sheet. Secondly, to save space, the traits were laid out as columns and the
scripts as rows. In this way, more ratings could fit on one page.

8.4.4 Training manual

Because the raters were all busy and could not put a large amount of time aside
for familiarizing themselves with the new scale, a training manual was produced
which could be studied at home before the rater training and so shorten the train-
ing session. In the manual, clear instructions were provided on how each trait was
to be rated. Traits that were more complicated to rate were further illustrated by
examples and practice exercises. For example, raters could practise identifying
words from the academic word list in a sample text or practise the identification of
the different topical structure analysis categories. At the end of the training man-
ual, the raters were provided with the correct answers.

8.4.5 Questionnaire

After the two rating rounds, a questionnaire was administered. The purpose of the
questionnaire was to elicit raters’ perceptions of the measurement efficacy and
usability of the new scale. Raters were asked to consider the categories in the new
scale, the band levels of each trait and the wording of the descriptors. Raters were
also encouraged to reflect on the rating process when using the scale. The ques-
tions can be found in Table 77 below.

Table 77: Questionnaire
1) What did you like about the scale?
2) Are there any categories missing that you find necessary?
3) Are there any categories that you think do not have the right number of levels?
4) Are there any categories in which you think the wording of the descriptors needs to be
changed?
5) Are there any categories that you found difficult to apply? If yes, please say why.
6) Did you at times use a holistic (overall) score to arrive at the scores for the different cate-
gories?
If not, did it bother you that you did not know what the final score for each script would come
out as? Did that in any way influence the way you rated (for example compared to the existing
DELNA scale)?
7) Please write specific comments that you have about each of the scale categories below.
You could for example write how you used them, any problems that you encountered that you
haven’t mentioned above; you can draw comparisons to the existing DELNA scale or anything
else that you want to mention.

Two versions of the questionnaire were created: a hard copy and an electronic
version, so that raters could choose the way in which they wanted to complete it
(see procedures).

8.4.6 Interview questions

Two months after the rating rounds, more in-depth interviews were conducted
with a subset of seven raters. The questions in the interviews resembled those
found in the questionnaire but in this case, the raters were asked about both scales.
Because the interviews were semi-structured, the exact interview questions varied
slightly from participant to participant. The guiding questions and broad topics
can be found in Table 78 below. For each participant, more specific questions
were created which reflected individual rating patterns noticed during the analy-
sis. For example, if a rater consistently over- or under-used certain band levels of
a trait scale, then the rater was asked more in-depth questions about that particular
trait scale.

Table 78: Interview topics
DELNA rating scale:
1. Do you think that the way we have broken the rating down into many specific parts
helps or hinders the process?
2. How would you change the rating scale?
a. in terms of the wording
b. in terms of the categories on the scale
c. in terms of the number of levels on the scale
3. Are there any categories which you find particularly good or problematic?

New rating scale:


4. What did you like about the scale?
5. Are there any categories missing that you find necessary?
6. Are there any categories that you think do not have the right number of levels?
7. Are there any categories in which you think the wording of the descriptors needs to be
changed?
8. Are there any categories that you found difficult to apply? If yes, please say why.

Comparison of DELNA and new scale:


9. Do you think there were any differences in your rating behaviour when you were using
the two different scales?
10. Did you at times use a holistic (overall) score to arrive at the scores for the different
categories when using either of the scales?
11. If not, did it bother you that you did not know what the final score for each script
would come out as when using the new scale?
12. Do you think your rating was more consistent or reliable with either one of the scales?
13. Has your rating behaviour when using the DELNA scale changed since using the new
scale?
14. Do you think one scale is more time-consuming than the other?

8.5 Procedures

8.5.1 Trial of new rating scale, training manual and questionnaires

Before having the raters use the new rating scale, training manual and question-
naire, a trial was conducted. The forum for this was the weekly departmental
meeting of research students which was thought to be suitable for a trial for the
following reasons. Firstly, some of the research students were DELNA raters who
were not part of the study. It was therefore considered fitting to trial the scale on
this group of people, because at least some members of the group had experience
with rating DELNA writing scripts. Because the pool of DELNA raters was lim-
ited, no more DELNA raters could be found for the trial of the materials. Sec-
ondly, the meeting of research students is designed to trial ideas or materials. All
members of the group were in the process of conducting doctoral research them-
selves and therefore provided time to discuss others’ work.

8.5.1.1 Trial procedures:

The research students were asked to read the training manual before the session
and rate five scripts at the research meeting. During the session, they were able to
ask any questions and to criticize the material in any way they wanted to. The rat-
ings of the five scripts were then discussed by the group and at the end of the ses-
sion they completed the questionnaire.

8.5.1.2 Trial outcomes:

The comments from the group were extremely helpful for revisions of all three
instruments. Firstly, the group suggested several changes to the training manual.
These changes had to do with clarifications about how some scale categories
should be used. For example, they suggested the provision of a more comprehen-
sive list of hedging devices and a list of Academic Word List (AWL) headwords.
The research students also suggested including a comprehensive list of linking
devices in the section on cohesion.
Several changes were also made to the rating scale itself. It became clear that
some raters had problems applying the descriptors for lexical complexity based on
the AWL words. Because of these problems, the descriptors were extended to in-
clude more general descriptions. The changed descriptors for lexical complexity
can be seen in Table 79 below.

Table 79: Rating scale – Lexical complexity


8   Large number of words from Academic Word List (more than 20) / vocabulary extensive – makes use of large number of sophisticated words
7   Between 12 and 20 AWL words / makes use of a number of sophisticated words
6   5–12 words from AWL / vocabulary limited / uses only some sophisticated words
5   Fewer than 5 words from AWL / uses only very basic vocabulary

Apart from this, the group noticed several minor spelling mistakes and made sug-
gestions about the layout of the scale. All these ideas were helpful and were taken
into consideration before the final scale was completed. The final scale can be
seen in Table 80 on the following pages.

Table 80: The final version of the new rating scale

Accuracy / Fluency / Complexity
9   Accuracy: All sentences error-free
    Fluency: No self-corrections
    Complexity (levels 9–8): Large number of words from Academic Word List (more than 20) / vocabulary extensive – makes use of large number of sophisticated words
8   Accuracy: Nearly all sentences error-free
    Fluency: No more than 5 self-corrections
7   Accuracy: About ¾ of sentences error-free
    Fluency: 6–10 self-corrections
    Complexity: Between 12 and 20 AWL words / makes use of a number of sophisticated words
6   Accuracy: About half of the sentences are error-free
    Fluency: 11–15 self-corrections
    Complexity: 5–12 words from AWL / vocabulary limited, uses only some sophisticated words
5   Accuracy: About ¼ of sentences error-free
    Fluency: 16–20 self-corrections
    Complexity (levels 5–4): Less than 5 words from AWL / uses only very basic vocabulary
4   Accuracy: Nearly no or no error-free sentences
    Fluency: More than 20 self-corrections

Mechanics / Reader-Writer interaction
9   Mechanics: 5 paragraphs
    Reader-Writer interaction: More than 9 hedging devices
8   Mechanics: 4 paragraphs
    Reader-Writer interaction: 7–9 hedging devices
7   Mechanics: 3 paragraphs
    Reader-Writer interaction: 5–6 hedging devices
6   Mechanics: 2 paragraphs
    Reader-Writer interaction: 3–4 hedging devices
5   Mechanics (levels 5–4): 1 paragraph
    Reader-Writer interaction: 1–2 hedging devices
4   Reader-Writer interaction: No hedging devices

Data description / Interpretation of data / Part 3 of task
9   Data description: All data described (all trends and relevant figures)*
    Interpretation of data (levels 9–8): Five or more relevant reasons and/or supporting ideas
    Part 3 of task (levels 9–8): Four or more relevant ideas
8   Data description: Most data described (all trends, some figures)*
7   Data description: Data half described (all trends, no figures)*
    Interpretation of data: Four relevant reasons and/or supporting ideas
    Part 3 of task: Three relevant ideas
6   Data description: Data half described (most trends, some figures)*
    Interpretation of data: Two to three relevant reasons and/or supporting ideas
    Part 3 of task: Two relevant ideas
5   Data description: Data partly described (not all trends, no figures) (some trends, some figures)*
    Interpretation of data: One relevant reason and/or supporting idea
    Part 3 of task: One relevant idea
4   Data description: Data description not attempted or incomprehensible*
    Interpretation of data: No reasons provided
    Part 3 of task: No ideas provided

Coherence / Cohesion
9–8  Coherence: Writer makes regular use of superstructures, sequential progression and possibly indirect progression. Few incidences of unrelated progression. No coherence breaks.
     Cohesion: Connectives used sparingly but skilfully (not mechanically) compared to text length, and often describe a relationship between ideas. Writer might use this/these to refer to ideas more than four times.
7    Coherence: Frequent sequential progression, superstructure occurring more frequently. Infrequent parallel progression. Possibly no coherence breaks.
     Cohesion (levels 7–6): Slight overuse of connectives compared to text length. Connectives might be used mechanically (e.g. firstly, secondly, in conclusion). One or two connectives might be misused. Some connectives skilfully used. This/these to refer to ideas possibly used up to four times or writer uses connectives rarely, but some ideas could be more skilfully connected.
6    Coherence: Mixture of most categories. Superstructure relatively rare. Few coherence breaks.
5    Coherence: As for level 4, but coherence might be achieved in stretches of discourse by overusing parallel progression. Only some coherence breaks.
     Cohesion: Overuse of connectives compared to text length. Connectives used are often simple (and, but, because). Some might be used incorrectly. This/these to refer to ideas only used once or twice or hardly any connectives used.
4    Coherence: Frequent: unrelated progression, coherence breaks and some extended progression. Infrequent: sequential progression and superstructure.
     Cohesion: Writer uses few connectives, there is little cohesion. This/these not or very rarely used.
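To illustrate how the counting-based descriptors above translate into band levels, the following minimal sketch (in Python, not part of the original study) maps an AWL word count onto the lexical complexity bands; the treatment of boundary counts such as exactly 12 or 20 AWL words is an assumption, as the descriptors overlap at these points.

    def lexical_complexity_band(awl_word_count):
        """Map a count of Academic Word List words onto the lexical complexity bands.
        Boundary handling (counts of exactly 12 or 20) is assumed, not specified."""
        if awl_word_count > 20:
            return 8   # large number of AWL words
        if awl_word_count >= 12:
            return 7   # between 12 and 20 AWL words
        if awl_word_count >= 5:
            return 6   # 5-12 AWL words
        return 5       # fewer than 5 AWL words

    print(lexical_complexity_band(14))  # 7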

In their responses to the trial questionnaire, all of the research students noted that
they found the scale very usable. None felt that any categories were missing,
although one thought that the scale should include descriptors on the use of passives.
Overall, the group responded positively to the scale.

8.5.2 Rater training

The procedures for rater training will be described in two sections.

8.5.2.1 Rater training: existing scale

For the existing rating scale, the raters were trained in a face-to-face training ses-
sion. In this session, the raters met in plenary and rated 12 scripts as a group. Af-
ter each script, their ratings were discussed and compared to the benchmark rat-
ings awarded to the scripts by four highly experienced raters. The rater training
session lasted for about two hours with a 15 minute break.

8.5.2.2 Rater training: new scale

The training procedures for the new scale were generally the same. Eight raters
read the training manual at home and then rated 12 scripts in a plenary session at
which the same procedures were employed as in the first training session. Two
raters, however, could not meet at the time of the plenary session for personal rea-
sons. These two raters were trained individually at a later date (see Table 70
above). The procedures for these sessions replicated those of the group session.

8.5.3 Data collection

8.5.3.1 Ratings of scripts using the two scales:

The one hundred writing scripts selected for the study were photocopied for each
of the ten raters and given a random ID number from one to one hundred. The
scripts were then put in random order into five envelopes of twenty scripts each,
with clear labelling. The rating packs also included a copy of the relevant rating scale and
marking sheet, the prompts, and for the second session a list of AWL headwords
and the training manual. The scripts included no information which could identify
the student writers.

As described earlier, the raters produced ratings of one hundred writing scripts
under two conditions, first using the existing rating scale and then using the new
rating scale. Ideally, a counterbalanced design should have been used, in which
half of the raters first produced the ratings using the existing scale, whilst the
other half of the raters first used the new scale before the two groups changed
over. However, this was not possible for practical reasons. Because the raters
were mostly busy teaching at the time of the study, the rating rounds had to be
arranged around the holidays and the semester breaks. In addition, several raters
had to leave the country around the middle of May 2006. As the DELNA rating
scale was pre-existing, the decision was made to ask raters to produce the ratings
using this scale early in 2006, before the start of the first semester. At this stage,
the new scale had not been completed or trialed. Therefore, all raters rated the
scripts using first the existing scale and then the new scale. It was assumed that
raters would not be able to remember details of individual scripts over the period
of two to three months between the two rating rounds and therefore no order ef-
fect was anticipated.

In each rating round, the raters were given all one hundred scripts at the same time
in five envelopes of twenty scripts each. The raters were instructed to rate no
more than ten scripts in one session to avoid fatigue, and were given three weeks
to complete the ratings. Once the scripts had been rated, the envelopes were
handed back to the researcher. The results were immediately entered onto an Ex-
cel spreadsheet.

The raters were paid the current DELNA rate per script, so that they would spend the
same amount of time on each script as they would under a regular administration of
the assessment.

8.5.3.2 Administration of questionnaires

At the end of the second rating round, the raters were asked to keep the new rating
scale and the training manual, and were provided with the questionnaire. They
then had three days to complete the questionnaire. This time limit was set so that
their memories would still be fresh. The raters were able either to complete the
paper version of the questionnaire or to request an electronic version via email,
which they could fill in on the computer and return to the researcher in the same way.

8.5.3.3 Administration of interviews

Three months after the second rating round, all raters were invited to participate in
interviews. Seven raters agreed to participate. Therefore the raters in the inter-
views were self-selected. Before the interviews, the raters rated five scripts using
both the existing and the new rating scale to refresh their memories. The inter-
views were undertaken in a quiet conference room at the university and digitally
recorded using an Olympus Digital Voice recorder WS-100. The interviews were
semi-structured and lasted for about 30 minutes. All interviews were later tran-
scribed broadly by the researcher using the Olympus DSS Player A-400 transcrip-
tion software. This software enables the transcription to be carried out with the
help of a foot pedal to stop, start and rewind the recording, reducing the transcrip-
tion time considerably.

The long break of three to four months between the last rating round and the in-
terviews occurred because all data needed to be analyzed first, so that more de-
tailed questions could be devised for each participant (see instrument section).

8.5.4 Data analysis

The analysis of the data will be described in three sub-sections. The first two sec-
tions describe the details of the analysis of the rating data, whilst the third section
briefly comments on the analysis of the interviews and questionnaires.

8.5.4.1 Rasch analysis

The rating data was analyzed using the multi-faceted Rasch measurement program
FACETS version 3.59.0 (Linacre, 1988, 1994, 2006; Linacre & Wright, 1993).
The basic Rasch model was first advanced by Rasch (1960; 1980) as a mathe-
matical representation of the interaction of person ability and item difficulty.
These two parameters were modeled on a common scale of log odds units or
‘logits’ (McNamara, 1996). This logit scale has the advantage that it is an interval
scale. It can therefore not only tell the researcher that one item is more difficult
than another, or that one person has more ability than another, but also how large
this difference is. The basic Rasch model was designed to be used for dichoto-
mously scored data (i.e. right/wrong scoring).
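By way of illustration only (this sketch is not part of the analyses reported here), the dichotomous model can be expressed in a few lines of Python; the ability and difficulty values below are invented.

    import math

    def rasch_probability(ability, difficulty):
        """Probability of success under the dichotomous Rasch model (values in logits)."""
        return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

    # A candidate one logit abler than the item is difficult succeeds about 73% of the time.
    print(round(rasch_probability(ability=1.0, difficulty=0.0), 2))  # 0.73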

The initial Rasch model was further extended into the Rating scale model by An-
drich (1978). This model not only expresses the overall difficulty of a particular
item but is also able to calculate the step difficulty between each step in a rating
scale. A further extension of the Rasch model was proposed in the early 1980s by
Masters and Wright (Masters, 1982; Wright & Masters, 1982). This model can
additionally handle items scored using a partial credit scoring system.

The Rasch model used in this analysis, the multi-faceted Rasch model, is an ex-
tension of both the rating scale model and the partial credit model. The multi-
faceted Rasch model proposed by Linacre (1989) was designed to include any
number of facets pertinent to the assessment situation. A typical basic multi-
faceted Rasch analysis would include candidates, items and raters as facets. The
researcher is able to analyze rating data by summarising overall rating patterns in
terms of group-level main effects for the raters, candidates, traits and any other
variables of the rating situation. In the analyses, the contribution of each facet is
separated out to determine if the various facets are functioning as intended. The
analysis further allows the researcher to look at individual-level effects of the
various elements within a facet (i.e. how individual raters, candidates, or traits in-
cluded in the analysis are performing).

The multi-faceted Rasch model is an additive linear model based on logistic trans-
formation of the observed ratings to a logit scale. In this model, the logit can be
viewed as the dependent variable, while the various facets function as independent
variables influencing these logits (Myford & Wolfe, 2003). When the analysis is
undertaken, the various aspects (or facets) are analysed simultaneously but inde-
pendently and are then calibrated onto the common logit scale. This makes it pos-
sible to measure rater severity on the same scale as candidate ability and trait dif-
ficulty and thus carry out comparisons between different facets.

For each element of each facet, the analysis provides a logit measure, a standard
error (which provides information about the precision of the calibration) and fit
indices (which supply information about how well the data of this element fit the
expectation of the measurement model).

The multi-faceted model is now the most general Rasch model and all other mod-
els can be derived from it (McNamara, 1996).

The form of the many-faceted Rasch model used in this study (a multi-faceted
version of the partial credit model) can be represented by the following mathe-
matical model:
log(P_nijk / P_nij(k-1)) = B_n – C_j – D_i – F_ik

where
P_nijk      = the probability of candidate n being awarded a rating of k when rated by rater j on item i
P_nij(k-1)  = the probability of candidate n being awarded a rating of k–1 when rated by rater j on item i
B_n         = the ability of candidate n
C_j         = the severity of rater j
D_i         = the difficulty of item i
F_ik        = the difficulty of achieving a score within a particular score category k on a particular item i.
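The following minimal sketch (not the FACETS estimation procedure) shows how the formula above yields category probabilities for a single candidate–rater–item combination; all parameter values are invented for illustration.

    import math

    def category_probabilities(ability, severity, difficulty, step_difficulties):
        """Category probabilities for one candidate-rater-item combination under the
        many-facet partial credit model. step_difficulties[k-1] plays the role of F_ik."""
        numerators = [1.0]            # bottom category
        cumulative = 0.0
        for f in step_difficulties:
            cumulative += ability - severity - difficulty - f
            numerators.append(math.exp(cumulative))
        total = sum(numerators)
        return [n / total for n in numerators]

    # Hypothetical values: an able candidate (1.5 logits), a slightly severe rater (0.3 logits).
    print(category_probabilities(1.5, 0.3, 0.0, step_difficulties=[-1.0, 0.0, 1.0]))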
To make the multi-faceted Rasch analysis used in this study more powerful, a
fully crossed design was chosen. That is, all ten raters rated the same one hundred
writing scripts on both occasions. Although such a fully crossed design is not nec-
essary for FACETS to run the analysis, it makes the analysis more stable and
therefore better conclusions can be drawn from the results (Myford & Wolfe,
2003).
As with any other set of statistical procedures, several assumptions need to be
considered before running a multi-faceted Rasch analysis. Some assumptions of
the Rasch model are shared with traditional assessment measures. It is, for example,
assumed that more able candidates will score higher than less able ones, i.e. that
scores increase as candidate ability increases. The Rasch model further assumes
‘local independence’, which means
that how candidates answer one item does not affect how they answer any other
item. The Rasch model also assumes ‘unidimensionality’. This means that an as-
sessment taps or captures a single underlying latent trait. For an extensive discus-
sion of the issue of unidimensionality, see McNamara (1996, p. 268 ff).

Multi-faceted Rasch analysis was chosen for this study because analytical tools
from classical test theory have several limitations. For example, an ANOVA-
based approach could be chosen to study group-level rater effects as well as
rater-effect interactions. However, ANOVA has the limitation that possible inter-
action effects can contaminate main effects, making the interpretation of the main
effects more difficult (Wild & Seber, 2000). As mentioned earlier, multi-faceted
Rasch measurement goes beyond the detection of main effects and interaction
effects, as it allows for the detection of individual level effects. In this respect,
multi-faceted Rasch measurement is superior to ANOVA-based approaches and
regression approaches (Myford & Wolfe, 2003).

Another approach possible when working with rating data is generalisability the-
ory (or G-theory). One limitation of G-theory, which is addressed in multi-faceted
Rasch measurement, is that although it identifies sources of variance attributed to
each facet and its interactions, the impact of such differences on the candidates’
scores during a particular examination is not corrected. Therefore, the candidates
receive the raw scores they earn from the raters they encounter, and not an ad-
justed raw score due to rater differences or other attributes of the examination, as
is produced in multi-faceted Rasch measurement.

FACETS (Linacre, 1988, 2006) makes it possible to analyze data based on an ana-
lytic rating scale both as a whole (to see the functionality of the rating scale as a
whole) or, by employing a partial credit model, with respect to each individual
trait scale. It is also possible to investigate the rating behavior of all raters in the
study as a group or individually or to investigate how each rater employs each in-
dividual trait scale.

For the purpose of the following results chapter, the rating behavior of all raters as
a group was investigated both when using the scale as a whole, and when utilizing
individual trait scales.

The rating behavior of individual raters was also analyzed (when using the scales
as a whole and the individual trait scales), but this is not presented in the results
section as it is not relevant to answering the research questions, which are con-
cerned with group rather than individual behaviour. The individual analysis was
undertaken merely for the purpose of developing more detailed questions for the
interviews.
Before the data was analyzed, a number of hypotheses were developed for com-
paring the two rating scales. Each of these hypotheses related to a group of statis-
tics generated by the FACETS program. These were:

1) discrimination of the rating scale (candidate separation)
2) rater separation
3) rater reliability
4) variation in ratings
5) scale step functionality

Why each of these was chosen and the hypothesis relating to each group of statis-
tics will be discussed in detail below.

8.5.4.1.1 Discrimination of the rating scale:

The first hypothesis was that a more discriminating rating scale can be seen as su-
perior. It is important for an assessment to be able to differentiate between candi-
dates. In the case of performance assessment, the tool that is used to achieve this
is the rating scale. The more levels of candidate ability a group of raters can dis-
cern with the help of a rating scale, the better the scale is functioning.

When a rating scale is analyzed, the candidate separation ratio is an excellent in-
dicator of the discrimination of the rating scale. The candidate separation ratio
measures the spread of candidates’ performances relative to their precision
(Fisher, 1992). According to Myford and Wolfe (2003, p.410), this separation is
expressed as a ratio of the ‘true’ standard deviation of ratee performance measures
over the average ratee standard error as in the equation G = True SD / RMSE,
where True SD is the standard deviation of the ratee performance measures (i.e.
true standard deviation = the observed standard deviation – average measurement
error), and RMSE is the root mean-square of the standard errors of the ratee per-
formance measures, or the statistical ‘average’ measurement error of the ratee
performance measures. The higher the separation ratio, the more discriminating the
rating scale is. FACETS reports two further measures of candidate separation, the
candidate fixed chi-square value and the reliability of the candidate separation.
However, neither of these provides any additional information beyond the separation
ratio, so they are not reported in the following chapter.
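FACETS reports the separation ratio directly; purely as an illustration of the calculation described above, and following the usual Rasch convention of subtracting the mean error variance from the observed variance of the measures, a sketch might look as follows (the input values are invented):

    import math

    def separation_ratio(measures, standard_errors):
        """G = 'true' SD of the measures divided by the RMSE of their standard errors."""
        n = len(measures)
        mean = sum(measures) / n
        observed_var = sum((m - mean) ** 2 for m in measures) / n
        rmse = math.sqrt(sum(se ** 2 for se in standard_errors) / n)
        true_sd = math.sqrt(max(observed_var - rmse ** 2, 0.0))
        return true_sd / rmse

    print(round(separation_ratio([-1.4, -0.2, 0.9, 2.1], [0.35, 0.30, 0.32, 0.40]), 2))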

8.5.4.1.2 Rater separation:

The next hypothesis made was that a well functioning rating scale would result in
small differences between raters in terms of their leniency and harshness as a
group. If a scale is functioning well, the raters will be able to discern the ability of
a candidate easily and do this in agreement with other raters. Thus, raters will not
vary greatly in terms of leniency and harshness. For this reason, a rating scale re-
sulting in a smaller rater separation ratio is seen to be superior. The rater separa-
tion ratio, like the candidate separation ratio, provides a measure of the spread of
the rater severity measures relative to the precision of those measures (Myford &
Wolfe, 2004). The higher the rater separation ratio, the more the raters differed in
terms of severity in their ratings.

8.5.4.1.3 Rater reliability:

The third hypothesis was that a necessary condition for validity of a rating scale is
rater reliability (Davies & Elder, 2005). A scale that results in higher levels of
rater reliability can be seen as superior.
FACETS provides two measures of rater reliability: (a) the rater point-biserial
correlation index (or single rater - rest of raters correlation), which is a measure of
how similarly the raters are ranking the candidates, and (b) the percentage of exact
rater agreement, which indicates how often raters awarded exactly the same score
as another rater in the sample. Both types of rater
reliability statistics were deemed necessary based on Stemler’s (2004) paper, in
which he cautions against the use of just one statistic to describe inter-rater reli-
ability.

The single rater – rest of raters (SR/RR) correlation, or point-biserial correlation,
summarizes the degree to which a particular rater’s ratings are consistent with the
ratings of the rest of the raters. Each individual rater’s ratings are compared to the
ratings of all the other raters in the analysis. The formula that denotes the exact
computation of the SR/RR correlation can be found in Myford and Wolfe (2003,
p.421-2). Myford and Wolfe note that SR/RR correlations of less than .30 are seen
to be low, whilst .70 or higher is seen to be high for rating scale data composed of
several categories. In this case, as we are first looking only at individual trait
scales, higher values can be expected. If a SR/RR correlation coefficient of a par-
ticular rater is very low, that indicates that the rater orders the candidates in a dif-
ferent way from the other raters. Because the raters in this study were analyzed as
a group rather than individually, the average SR/RR correlation coefficient for the
whole group is presented in each of the tables in Chapter 9.
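The exact computation is given in Myford and Wolfe (2003); as a rough illustration of the underlying idea only, the simplified sketch below correlates one rater's scores with the mean of the remaining raters' scores on the same scripts (this simplification is an assumption, not the FACETS formula, and the scores are invented).

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    def single_rater_rest_correlation(ratings, target):
        """Correlate one rater's scores with the mean score of all remaining raters."""
        others = [r for r in ratings if r != target]
        rest_means = [sum(ratings[r][i] for r in others) / len(others)
                      for i in range(len(ratings[target]))]
        return pearson(ratings[target], rest_means)

    scores = {"rater1": [6, 7, 5, 8], "rater2": [6, 6, 5, 8], "rater3": [7, 7, 4, 8]}
    print(round(single_rater_rest_correlation(scores, "rater1"), 2))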

Recent versions of FACETS (Linacre, 2006) also include a function to calculate the
percentage of exact agreement. This shows the percentage of times each rater
awarded exactly the same score as another rater. The figure reported in the tables
in the following chapter indicates the average percentage of exact agreement for
all ten raters in the group. It could be argued that exact agreement of raters could
also be achieved if a large number of raters utilized play-it-safe methods and
overused the inside categories of the rating scale. If that was the case, more raters
would be awarding 6s or 7s, and there would be a higher chance that agreement
would be achieved. However, this would also lower the discrimination of the scale
(as indicated by the candidate separation ratio reported above) and would also re-
sult in less variation in the ratings, as indicated by very low rater infit mean square
values (which will be discussed below).
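Again, FACETS reports this figure directly; the sketch below merely illustrates the underlying calculation, counting the proportion of rater pairs that award identical scores on each script, for invented data.

    from itertools import combinations

    def exact_agreement(ratings):
        """Percentage of rater-pair comparisons (per script) with identical scores.
        ratings maps each rater to a list of scores on the same scripts, in the same order."""
        raters = list(ratings)
        agree = total = 0
        for s in range(len(ratings[raters[0]])):
            for r1, r2 in combinations(raters, 2):
                total += 1
                agree += ratings[r1][s] == ratings[r2][s]
        return 100.0 * agree / total

    scores = {"rater1": [6, 7, 5], "rater2": [6, 6, 5], "rater3": [7, 6, 5]}
    print(round(exact_agreement(scores), 1))  # 55.6 for these invented scores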

8.5.4.1.4 Variation in ratings:

Because rating behaviour is a direct result of using a rating scale, it was further
contended that a better functioning rating scale would result in fewer raters rating
either inconsistently or overly consistently (by overusing the central categories of
the rating scale). The idea behind this was that if a rater is unsure what level to
award when using a rating scale, the rater might either rate inconsistently or resort
to a play-it-safe method and overuse the inner categories of a rating scale and
avoid the outside band levels.

The measure indicating variability in raters’ scores is the rater infit mean square
value. Rater infit mean square values have an expected value of 1 and can range
from 0 to infinity. The closer the calculated value is to 1, the closer the rater’s rat-
ings are to the expected ratings. Infit mean square values significantly higher than
1.3 (following McNamara, 1996 and Myford and Wolfe, 2000) denote ratings that
are further away from the expected ratings than the model predicts. This is a sign
that the rater in question is rating inconsistently and therefore showing too much
variation. This is called ‘misfit’. Similarly, values lower than .7 indicate that the
observed ratings are closer to the expected ratings than the Rasch model predicts.
This is called ‘overfit’. This could indicate that a rater is rating very consistently.
However, it is more likely that the rater concerned is overusing certain categories
of the rating scale, normally the inside values. This can be confirmed by scrutiniz-
ing the FACETS analysis of individual raters using the individual trait scales.
Only raters actually overusing the inside categories were included in the summary
report on rater variation in Chapter 9.
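As an illustration of how these cut-off values can be applied (the infit values below are invented), a simple flagging step might look like this:

    def classify_raters(infit_mean_squares, low=0.7, high=1.3):
        """Flag each rater as 'misfit' (too much variation), 'overfit' (too little) or 'ok'."""
        labels = {}
        for rater, infit in infit_mean_squares.items():
            if infit > high:
                labels[rater] = "misfit"
            elif infit < low:
                labels[rater] = "overfit"
            else:
                labels[rater] = "ok"
        return labels

    print(classify_raters({"rater1": 1.45, "rater2": 0.62, "rater3": 1.02}))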

8.5.4.1.5 Scale step functionality:

The final hypothesis was that a better functioning rating scale would result in bet-
ter scale step functionality. A rating scale is made up of a number of different
band levels. It is important for each level to function appropriately for the entire
scale to perform efficiently.

Linacre (1999) reports a number of statistics that need to be scrutinized when rat-
ing scale functionality is of interest. All features of scale functionality are reported
in a single table in the output of FACETS, entitled category statistics. Figure 49
below presents the category statistics for grammatical accuracy of the existing
scale (as an example). For the output to be valid (i.e. for FACETS to return reli-
able results), each scale category (band level) needs to include at least ten obser-
vations.

Figure 49: Scale category statistics: Grammatical accuracy – existing scale

This can be verified in the second column of the rating scale category statistics
table. Linacre (1999) further argues that it is important that the observations (as
seen in column 2) are regular (i.e. more or less normally distributed with only one
peak). He also suggests that the average measures (seen in column 5) should ad-
vance monotonically. These measures show the average logit value of the candi-
dates rated at each band level. The outfit means square measures (column 7) indi-
cate the difference between the observed average measure and the expected aver-
age measure in each category. The expected outfit mean square value is one. If a
band level displays an outfit mean square value of over 1.4, this indicates unex-
pected ‘noise’ in the category. The reason for this could be found in either the
scale, the candidates or the traits, and needs to be further investigated (Mike Lina-
cre, personal communication, May 2006). Finally, the step calibration measures
(column 8) are the rating scale category thresholds, the point at which a candidate
of that measure of ability has a 50% probability of being graded into either of the
adjacent band levels. These should advance monotonically and the steps should
advance by at least 1.0 (for a five level scale) and by less than 5.0 logits.
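These guidelines can be checked mechanically, as in the sketch below; the thresholds shown are invented, and the 1.0-logit minimum is, as noted above, the guideline Linacre gives for a five-level scale.

    def steps_advance_properly(thresholds, min_advance=1.0, max_advance=5.0):
        """True if successive step calibrations advance monotonically by at least
        min_advance and by less than max_advance logits."""
        gaps = [b - a for a, b in zip(thresholds, thresholds[1:])]
        return all(min_advance <= g < max_advance for g in gaps)

    print(steps_advance_properly([-3.2, -1.1, 0.9, 3.0]))  # True: gaps of 2.1, 2.0 and 2.1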

Each table presenting the statistics of two trait scales in the following chapter also
provides a short comment on how the different rating scale band levels (entitled
scale properties) were utilized. Usually this comment focuses on extremely un-
derused categories. Underutilization of levels can occur for two reasons. Firstly, it
could mean that the raters did not use that band level because the descriptors were
not clear to them or did not represent what is actually displayed in the writing
scripts. Alternatively, it could also mean that there were no student performances
at that level in the sample of scripts that the raters rated.

The last piece of information provided by FACETS that is relevant to scale step
functionality is the set of probability curves. Probability curves are a visual
representation of the rating scale category statistics. Figure 50 below
presents the probability curve for the trait scale ‘grammatical accuracy’ of the
existing DELNA rating scale (as an example). The horizontal axis represents the
candidate proficiency scale (in logits) and the vertical axis denotes the probability
of a score being awarded (from 0 to 1).

When examining probability curves, the chief concern is whether there is a sepa-
rate peak for each scale category probability curve and whether the curves appear
as an evenly spaced series of hills. If there are some categories (band levels) that
never become most probable (and therefore do not have separate peaks), then that
may suggest that one or more raters are experiencing problems when using the
rating scale. The intersection points of each scale curve are referred to as the rat-
ing scale category thresholds (which can also be found in Figure 49 above, the
rating scale category statistics). According to Myford and Wolfe (2003), a rating
scale category threshold represents the point at which a candidate has a 50%
probability of being rated in either of two adjacent categories, given that the
candidate falls in one of them (Andrich, 1998). Figure 50 below, for exam-
ple, shows that the band levels of the grammatical accuracy trait scale were more
or less evenly spaced, although level 8 had a slightly lower peak than the other
band levels.

Very few noteworthy problems were identified with the scale probability curves.
To limit the length of Chapter 9, any problems are noted in the relevant tables un-
der scale properties only.

Figure 50: Scale category probability curve: Grammatical accuracy – existing scale

8.5.4.2 Correlational analyses

To explore the number of underlying dimensions measured by the two rating
scales, an exploratory factor analysis was used. An exploratory factor analysis
was chosen over confirmatory techniques, as no pre-existing theory was available
on how the raters would use the two scales. It was rather of interest to explore,
without pre-conceived ideas, how the raters would be using the two scales and
which traits would pattern together on the basis of their ratings.
Principal axis factoring (or principal factor analysis) is a technique that can be
used for uncovering the underlying (latent) structure of interrelationships of a set
of observed variables. Put simply, principal factor analysis (PFA) reduces a large
set of variables to a small number of underlying factors. Although producing very
similar results to principal components analysis on most data sets (L. Wilkinson,
Blank, & Gruber, 1996), PFA was chosen over principal components analysis for
this study because it focuses only on the common variance among the variables
and therefore does not artificially inflate the factor loadings (Farhady, 1983;
Vollmer & Sang, 1983).

After choosing the appropriate factor analysis technique, it is important to determine
the number of factors. Several methods have been described in the literature
for deciding on the number of factors that should be retained in the analysis. First
of all, scree plots can be used in this decision. For these, Cattell (1966) argued that
the cut-off point for selecting factors should be the inflection of the curve. Sec-
ondly, the numeric values of the eigenvalues of the factors can be used. Kaiser
(1960), for example, recommended keeping all factors with eigenvalues greater
than 1. This criterion is based on the idea that the eigenvalues represent the
amount of variation explained by a factor and that an eigenvalue of 1 represents a
substantial amount of variation. More recently, however, Jolliffe (1986) suggested
that Kaiser’s criterion is too strict (see also Farhady, 1983) and that factors with
eigenvalues over 0.7 should be retained.
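As an illustration of these retention criteria (not of the analysis reported in the next chapter), the eigenvalues of the correlation matrix of the trait ratings can be inspected directly; in the sketch below the data array is an invented placeholder with scripts in rows and trait scores in columns.

    import numpy as np

    def factors_to_retain(data, threshold=1.0):
        """Count eigenvalues of the correlation matrix above the threshold
        (1.0 for Kaiser's criterion, 0.7 for Jolliffe's suggestion)."""
        corr = np.corrcoef(data, rowvar=False)            # variables (traits) in columns
        eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
        return int((eigenvalues > threshold).sum()), eigenvalues

    ratings = np.array([[6, 6, 7, 5], [7, 7, 7, 6], [5, 4, 5, 5], [8, 7, 8, 7]])  # invented
    n_factors, eigs = factors_to_retain(ratings)
    print(n_factors, np.round(eigs, 2))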

It is also essential to choose an appropriate rotation when conducting a factor
analysis. Rotation aims to make the output more comprehensible and is usually
necessary to facilitate the interpretation of factors. Varimax rotation, an orthogo-
nal rotation, was chosen (as suggested by Farhady, 1983). Varimax rotation aims
to maximize the variance of the squared factor loadings on all the variables in a
factor matrix. This results in a differentiation of the original variables on the ex-
tracted factors. Each factor will tend to have either large or small loadings of any
particular variable (A. Field, 2000).

PFA, like all inferential statistics, has several assumptions. First of all, the data set
should not include any extreme outliers. This was not a problem with this data set.

Further, none of the variables in the data should correlate too highly. SPSS in-
cludes a test for multicollinearity or singularity (the determinant). The results of
these tests are reported in the results section. Otherwise, PFA assumes linearity
and multivariate normality, assumptions which are both met by the data. It also
requires a large enough sample, which can be tested with the Kaiser-Meyer-Olkin
measure of sampling adequacy (A. Field, 2000).

Finally, to establish if the ratings based on the two scales are ranking the candi-
dates similarly, the candidate ability measures resulting from the two FACETS
analyses were correlated using a simple Pearson correlation.
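For example, with the candidate ability measures from the two FACETS analyses held in two equal-length lists, this correlation could be obtained as in the sketch below; the values shown are invented.

    from scipy.stats import pearsonr

    # Invented logit ability measures for five candidates under the two scales.
    abilities_existing_scale = [-1.2, -0.3, 0.4, 1.1, 2.0]
    abilities_new_scale = [-1.0, -0.5, 0.6, 1.3, 1.9]

    r, p_value = pearsonr(abilities_existing_scale, abilities_new_scale)
    print(round(r, 2))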

8.5.4.3 Analysis of questionnaires and interviews

Both questionnaires and interviews were saved as text files and then coded manu-
ally by the researcher. Some coding categories (or themes) were devised a priori
based on the questions, and some emerged during the coding process. To refine
the coding process, the questionnaires were first read thoroughly and categories
were identified. Then the data were grouped according to these themes. A second
researcher was asked to verify the selection of categories and code a subset of
three questionnaires.

The broad, overarching categories (or themes) identified for both questionnaires
and interviews can be found in Table 81 below.

Table 81: Categories identified when coding questionnaires and interviews


1) Themes relating to new scale
a) negative
b) positive
2) Themes relating to DELNA scale
a) problems
i) strategies for coping with problems
b) positive
3) Themes relating to both scales

Chapter 9: Results – Validation of Rating Scale

The following chapter presents the findings from the validation phase of the
study. The first section displays the findings of the quantitative analysis (Research
Question 2a), while the second part of the chapter presents the findings from the
questionnaires and interviews (Research Question 2b).

9.1 Comparison of the ratings for the two scales


The following section attempts to answer Research Question 2a:

Do the ratings produced using the two rating scales differ in terms of (a) the
discrimination between candidates, (b) rater spread and agreement, (c) vari-
ability in the ratings, (d) rating scale properties and (e) what the different
traits measure?

The results for Research Question 2a are presented in two parts to aid comprehen-
sion. Firstly, the analysis of the individual trait scales is presented. To compare
the two rating scales (the existing DELNA scale and the new scale), the results for
corresponding trait scales are presented together. After the results for the individ-
ual trait scales, the scales as a whole are compared.

9.1.1 Comparison of individual trait scales

As discussed in Chapter 8, FACETS computes a number of statistics relevant to
comparing the functionality of individual trait scales. These have been placed into
four groups. The first group pro-
vides information about how discriminating a rating scale is, i.e. how well it
spreads the candidates. The second group of statistics presents information about
rater reliability, in terms of both differences in severity and rater agreement. The
third group reports information on variability observed in the ratings. The fourth
group provides insight into rating scale functionality. Each of the trait scales will
be compared in terms of these four aspects.

9.1.1.1 Accuracy scales

The first two trait scales are those relating to accuracy (see Table 82 below). It can
be seen that the candidate separation ratio for the new scale was higher than that
for the existing DELNA scale, which suggests that the new scale was more dis-
criminating. The statistics indicating rater separation and reliability show that the
raters rated more similarly in terms of leniency and harshness when using the new
scale (indicated by the lower rater separation ratio) and ranked the candidates

more similarly (rater point biserial) and also chose the same band level of the rat-
ing scale more often (percentage exact agreement).

Table 82: Rating scale statistics for accuracy


DELNA scale - Grammatical accuracy New scale - Accuracy
Candidate discrimination: Candidate discrimination:
Candidate separation ratio: 4.68 Candidate separation ratio: 5.07
Rater separation and reliability: Rater separation and reliability:
Rater separation ratio: 5.85 Rater separation ratio: 3.71
Rater point biserial: .80 Rater point biserial: .91
% Exact agreement: 37.8% % Exact agreement: 46.8%
Variation in ratings: Variation in ratings:
% Raters infit high: 20% % Raters infit high: 0%
% Raters infit low: 10% % Raters infit low: 0%
Scale properties: Scale properties:
Scale: slight central tendency – 4 and 9 underused      Scale: 9 underused – otherwise well spread

Table 82 above also presents the percentage of unusually high or low infit mean
square values exhibited by the raters. If, for example, two of the ten raters dis-
played very high infit mean square values, then the table indicates that twenty
percent of raters showed this tendency. Whilst three raters displayed either unac-
ceptably high or low infit mean square values when using the DELNA scale, no
raters rated with too little or too much variation when applying the new scale for
accuracy.

Finally, the section on scale properties indicates that for the existing scale both
outside levels were underutilized, whilst for the new scale, only level 9 was un-
derutilized.

In summary, it can be argued that when the two accuracy scales were compared,
all indicators point to the fact that the new scale functioned better.

9.1.1.2 Vocabulary and spelling/lexical complexity scales

Table 83 below shows a comparison of the four groups of statistics for the two
rating scales focussing on lexis. In this case, the discrimination of the new scale
was only slightly greater than that of the existing scale. That the candidate separa-
tion of the new scale was higher than that of the existing rating scale is surprising
given that the new scale had two fewer band levels.

Table 83: Rating scale statistics for vocabulary/spelling and lexical complexity
DELNA scale – New scale –
Vocabulary and spelling Lexical complexity
Candidate discrimination: Candidate discrimination:
Candidate separation ratio: 4.38 Candidate separation ratio: 4.54
Rater separation and reliability: Rater separation and reliability:
Rater separation ratio: 3.48 Rater separation ratio: 6.65
Rater point biserial: .78 Rater point biserial: .85
% Exact agreement: 40.6% % Exact agreement: 49.7%
Variation in ratings: Variation in ratings:
% Raters infit high: 20% % Raters infit high: 10%
% Raters infit low: 20% % Raters infit low: 10%
Scale properties: Scale properties:
Scale: 4, 5 and 9 underused, strong central tendency      Scale: well spread
Probability curve: low peak for level 5

Because the discrimination of the two scales cannot be easily compared in this
way, Mike Linacre (personal communication, July 2006) offered the formula in
Equation 1 below (an application of the Spearman-Brown Prophecy formula to
this situation) to equate the separation ratios of two scales with differing band lev-
els.

Candidate separation (new scale) = √[(no. of levels in new scale – 1) / (no. of levels in old scale – 1)] × candidate separation (old scale)
Equation 1: Equation used to predict the candidate separation of two scales with differing
levels

If the empirical candidate separation ratio (new scale) found in the FACETS
analysis exceeds what the formula predicts, then the new scale is more discrimi-
nating. The formula predicts that, if converted to only four levels, the existing
scale would have a candidate separation ratio of only 3.39. Therefore, the
new scale was clearly more discriminating, even though it had fewer band levels.
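As a worked example, substituting the values from Table 83 into Equation 1 gives

    candidate separation (predicted, 4 levels) = √[(4 - 1)/(6 - 1)] × 4.38 = √0.6 × 4.38 ≈ 3.39,

which falls well below the observed candidate separation ratio of 4.54 for the new scale.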

As was found with the accuracy trait scales, both the rater point biserial correla-
tion coefficient and the exact agreement were higher for the new scale. However,
interestingly, the rater separation ratio indicates that the raters were more spread
out in terms of severity when using the new scale. So, although they seemed to be
ranking the candidates more similarly when using the new scale, the raters as a
group were more varied in terms of severity.
When using the new scale, fewer raters rated either inconsistently or overly con-
sistently (only 20% of raters compared to 40% of raters when applying the exist-
ing descriptors).

The raters strongly underused levels 4, 5 and 9 in the DELNA trait scale for vo-
cabulary and spelling. The category probability curve indicates a very low peak
for level 5. When the new scale for lexical complexity was used, the ratings were
well spread over all levels.

9.1.1.3 Sentence structure scale

The existing DELNA scale has level descriptors for sentence structure. The de-
scriptors in this scale refer both to accuracy and complexity of sentences. Accu-
racy of sentences is covered in the new scale in the trait scale for accuracy, whilst
the analysis of the writing scripts showed no differences in grammatical complex-
ity between the different levels of writing. Therefore, for completeness of the re-
sults section, the scale statistics for the DELNA sentence structure trait scale are
presented here.

Table 84: Rating scale statistics for sentence structure


DELNA scale – Sentence structure
Candidate discrimination:
Candidate separation ratio: 4.16
Rater separation and reliability:
Rater separation ratio: 3.90
Rater point biserial: .77
% Exact agreement: 40.2 %
Variation in ratings:
% Raters infit high: 20%
% Raters infit low: 20%
Scale properties:
Scale: central tendency, 4 and 9 underused

Table 84 above shows that the trait scale was discriminating. The rater separation
ratio indicates that the raters were not all rating exactly alike in terms of severity.
The single rater-rest of raters correlation (point biserial) was high, as was the ex-
act agreement. However, of the ten raters, four rated with either too much varia-
tion (inconsistently) or with too little variation (underusing the extreme levels of
the scale). An analysis of the use of the different band levels revealed that the
outer categories of this trait scale were generally underused by the raters.

9.1.1.4 Repair fluency scale

One trait scale not found in the existing rating scale, but included in the new scale,
was the trait scale for repair fluency. Table 85 below displays the rating scale sta-
tistics for this new scale.

Table 85: Rating scale statistics for repair fluency
New scale – Repair fluency
Candidate discrimination:
Candidate separation ratio: 5.82
Rater separation and reliability:
Rater separation ratio: 5.34
Rater point biserial: .93
% Exact agreement: 61.9%
Variation in ratings:
% Raters infit high: 20%
% Raters infit low: 20%
Scale properties:
Scale: 9 underused, otherwise functioned well
Distribution: multimodal distribution
Probability curve: level 8 very high peak

The candidate separation ratio indicates that the discrimination of this new scale
was high. The inter-rater reliability, as indicated by the point biserial correlation,
was also high (.93), more so than for all other trait scales examined so far. The
same can be said for the exact agreement (61.9%). The rater separation ratio,
however, indicates large differences in severity between the most severe and the
most lenient rater in the group.

Nearly half of the raters were identified as rating either inconsistently or too con-
sistently.

The band levels of the scale were generally well used, except level 9, which was
not utilized a lot by the raters. The scale category statistics indicate that there are a
number of problems with this trait scale. Firstly, not enough raters utilized level 9,
which might cause problems with the stability of the statistics measured for this
trait scale. Secondly, the distribution over the rating scale levels was not normal
with one peak, but multimodal with several peaks. Thirdly, the scale category
probability curve indicates problems with level 8 of the scale, which had a very
high peak.

9.1.1.5 Organisation/Paragraphing scales

The next section presents the findings for the trait scales associated with para-
graphing (Table 86 below). As was the case with the two trait scales for vocabu-
lary, the two rating scales focussing on paragraphing did not have the same num-
ber of band levels. Whilst the DELNA scale had six levels, the new rating scale
for paragraphing was designed with only five levels. To compare the candidate
separation ratios of the two scales and therefore the discrimination of the two

scales, the formula in Equation 1 would have to be used. However, it is
very clear from the candidate separation ratios in Table 86 below that the new
scale was a lot more discriminating. Also, the inter-rater reliability as indicated by
the rater point biserial coefficient and the exact agreement were higher and the
raters rated slightly more similarly in terms of severity.

Table 86: Rating scale statistics for organisation and paragraphing

DELNA scale – Organisation
Candidate discrimination:
Candidate separation ratio: 2.98
Rater separation and reliability:
Rater separation ratio: 3.66
Rater point biserial: .64
% Exact agreement: 41.7%
Variation in ratings:
% Raters infit high: 30%
% Raters infit low: 20%
Scale properties:
Scale: strong central tendency – 4, 5, 9 underused, skewed
Level 4: less than 10 observations

New scale – Paragraphing
Candidate discrimination:
Candidate separation ratio: 4.86
Rater separation and reliability:
Rater separation ratio: 3.08
Rater point biserial: .83
% Exact agreement: 58.5%
Variation in ratings:
% Raters infit high: 10%
% Raters infit low: 10%
Scale properties:
Scale: 5 and 9 underused

When using the existing scale, half of the raters could be identified as rating in-
consistently or with too little variation, whilst only 20% of the raters displayed
similar behaviour when using the new scale. For both scales, the outer categories
were underutilized, but more so when the existing scale was used as the basis for
the ratings.

9.1.1.6 Cohesion and coherence scales

Table 87 below shows the findings of the trait scales for cohesion and coherence.
The existing scale had no category for coherence, and therefore the trait scale for
coherence of the new scale is displayed in the same table for space reasons.

The results for the trait scales for cohesion were mixed. The candidate separation
ratio of the new cohesion trait scale was lower than that of the existing cohesion
scale. However, this was another trait scale that had fewer levels than the existing
scale (only four levels compared to the six levels of the existing scale). When the
formula in Equation 1 was applied, the candidate separation ratio of 3.18 of the
new scale had to be compared to a candidate separation ratio of 2.79 for the exist-

ing scale. Therefore, if the number of levels in the two scales were equivalent, the
new scale would be more discriminating.

Table 87: Rating scale statistics for cohesion and coherence trait scales

DELNA scale – Cohesion
Candidate discrimination:
Candidate separation ratio: 3.62
Rater separation and reliability:
Rater separation ratio: 5.53
Rater point biserial: .71
% Exact agreement: 37.9%
Variation in ratings:
% Raters infit high: 20%
% Raters infit low: 20%
Scale properties:
Scale: strong central tendency – 4, 5 and 9 underused, skewed
Level 4: less than 10 observations

New scale – Cohesion
Candidate discrimination:
Candidate separation ratio: 3.18
Rater separation and reliability:
Rater separation ratio: 4.12
Rater point biserial: .65
% Exact agreement: 51.5%
Variation in ratings:
% Raters infit high: 0%
% Raters infit low: 0%
Scale properties:
Scale: 4 underused, otherwise functioning well, slightly skewed

New scale – Coherence
Candidate discrimination:
Candidate separation ratio: 3.56
Rater separation and reliability:
Rater separation ratio: 4.62
Rater point biserial: .72
% Exact agreement: 36.1%
Variation in ratings:
% Raters infit high: 0%
% Raters infit low: 0%
Scale properties:
Scale: 4 underused, otherwise functioning well, slightly skewed

The raters rated slightly more similarly in terms of severity when using the new
scale. When the inter-rater reliability of the two scales was compared, the results
were mixed. The rater point biserial correlation coefficient (the single rater – rest
of rater correlation coefficient) of the new cohesion scale was slightly lower than
that for the existing cohesion trait scale (.65 and .71 respectively). However, the
percentage of exact agreement was considerably higher for the new scale (51.5%)
than for the existing scale (37.9%). The reason for these differences can be attrib-
uted to the different number of levels in the scales. If there are fewer levels in a
scale, the chance of raters choosing the same category is likely to be higher.
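This point can be illustrated with a small simulation of purely random ratings; the uniform-choice assumption is an illustrative simplification, not a model of the raters in this study:

import numpy as np

rng = np.random.default_rng(1)

# Probability that two raters pick the same band level purely by chance,
# assuming every level is equally likely, for scales of different lengths.
for n_levels in (4, 5, 6):
    first = rng.integers(0, n_levels, size=100_000)
    second = rng.integers(0, n_levels, size=100_000)
    print(n_levels, "levels:", np.mean(first == second))
# Roughly .25, .20 and .17: the shorter the scale, the higher the chance agreement.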

The variation in the raters’ ratings shows that nearly half of the raters rated with
too much or too little variation when applying the existing scale, whilst none fell
into this category when applying the new scale. With the existing scale, the raters
as a group displayed a central tendency effect, because levels 4, 5 and 9 were un-
derused, whilst when using the new cohesion trait scale level 4 was underused.
The distribution of the ratings based on the existing scale was skewed and fewer
than ten observations were recorded for level 4, which might cause problems for
the reliability of the results.

Table 87 above also displays the trait scale of coherence. This new scale had no
equivalent in the existing scale although raters were trained to apply the cohesion
scale as a coherence and cohesion scale. The coherence trait scale was as dis-
criminating as the existing cohesion trait scale (although it has one level less) and
also displayed a similar level of inter-rater reliability (as measured by the rater
point biserial correlation coefficient and the percentage of exact agreement).
However, the raters rated slightly more similarly in terms of severity and there
were no raters identified as rating with too little or too much variation when using
this new coherence trait scale. All levels of the scale were applied by the raters,
although level 4 was slightly underused.

9.1.1.7 Style/hedging scale

The next comparison between two rating scales focussed on the trait scales relat-
ing to style. The existing scale had descriptors pertaining to academic style in
general, whilst the new scale had descriptors only for the use of hedging because
the analysis of the writing scripts reported in Chapter 6 could find no other aspects
of academic style that clearly differentiated between the DELNA levels. Table 88
below displays the rating scale statistics for the two trait scales. The new scale for
hedging was clearly more discriminating (in this case both scales have six levels)
with a candidate separation ratio of 5.86 compared to the separation ratio of 3.32
for the existing scale.

Table 88: Rating scale statistics for style

DELNA scale – Style
Candidate discrimination:
Candidate separation ratio: 3.32
Rater separation and reliability:
Rater separation ratio: 5.56
Rater point biserial: .78
% Exact agreement: 37.2%
Variation in ratings:
% Raters infit high: 20%
% Raters infit low: 20%
Scale properties:
Scale: central tendency – 4, 5 and 9 underused, skewed
Level 4: less than 10 observations

New scale – Hedging
Candidate discrimination:
Candidate separation ratio: 5.86
Rater separation and reliability:
Rater separation ratio: 1.94
Rater point biserial: .89
% Exact agreement: 53.7%
Variation in ratings:
% Raters infit high: 10%
% Raters infit low: 10%
Scale properties:
Scale: 9 underused, skewed

The raters rated considerably more similarly in terms of severity when using the
new scale. Furthermore, both inter-rater reliability statistics were considerably
higher than those of the existing scale. Fewer raters rated with too much or too
little variation (only 20% of the raters compared to 40% of raters when applying
the existing scale). Closer scrutiny of the use of the different band levels showed
that the raters (as a group) displayed a strong central tendency effect when using
the existing scale – levels 4, 5 and 9 were underutilized. When using the new
scale, only level 9 was underused. So few instances of 4s were identified by the
raters when using the DELNA scale that the results of the FACETS analysis
might not be stable.

9.1.1.8 Content – data description scales

The final group of rating scale statistic comparisons focuses on the trait scales re-
lating to content. Firstly, the two trait scales for data description were compared
(see Table 89 below). Both trait scales for data description had six band levels.
The new scale was more discriminating as can be seen by the higher candidate
separation ratio.

Table 89: Rating scale statistics for data description

DELNA scale – Data description
Candidate discrimination:
Candidate separation ratio: 2.97
Rater separation and reliability:
Rater separation ratio: 2.98
Rater point biserial: .64
% Exact agreement: 33%
Variation in ratings:
% Raters infit high: 10%
% Raters infit low: 10%
Scale properties:
Scale: 4 and 9 underused

New scale – Data description
Candidate discrimination:
Candidate separation ratio: 4.49
Rater separation and reliability:
Rater separation ratio: 4.29
Rater point biserial: .80
% Exact agreement: 41.5%
Variation in ratings:
% Raters infit high: 0%
% Raters infit low: 0%
Scale properties:
Scale: 4 underused

When using the existing trait scale for data description, the raters rated more alike
in terms of leniency and harshness. The inter-rater reliability statistics were, how-
ever, in favour of the new scale. No raters were found to be rating with too much
or too little variation when applying the new scale, whilst 20% of raters were
identified in these categories when using the existing scale. When applying the
existing DELNA descriptors, the outer levels of the scale, levels 4 and 9 were un-
derutilized, whilst when the new scale was used, only level 4 was used rarely.

9.1.1.9 Content – interpretation of data scales

The next two trait scales compared in this analysis were the trait scales pertaining
to the interpretation of data (see Table 90 below). In this case, the new scale had
one level fewer than the existing scale (only five compared to six levels). But even
without applying the formula in Equation 1, it can be seen that the new
scale was more discriminating between candidates. Both the rater point biserial
correlation coefficient and the percentage of exact agreement were higher for the
new scale, whilst the rater separation ratio was almost identical for the two scales.

Table 90: Rating scale statistics for data interpretation

DELNA scale – Data interpretation
Candidate discrimination:
Candidate separation ratio: 4.06
Rater separation and reliability:
Rater separation ratio: 4.04
Rater point biserial: .76
% Exact agreement: 36.8%
Variation in ratings:
% Raters infit high: 20%
% Raters infit low: 10%
Scale properties:
Scale: central tendency – 4 and 9 underused

New scale – Data interpretation
Candidate discrimination:
Candidate separation ratio: 4.70
Rater separation and reliability:
Rater separation ratio: 4.03
Rater point biserial: .84
% Exact agreement: 54.3%
Variation in ratings:
% Raters infit high: 0%
% Raters infit low: 20%
Scale properties:
Scale: 4 underused, slightly skewed

The analysis of the variation in the ratings showed that two raters rated inconsis-
tently when applying the existing scale, whilst none fell in this category when ap-
plying the new scale. However, whereas only one rater rated with too little varia-
tion when using the existing scale, two raters displayed a central tendency effect
when applying the new scale. Overall, the two outside categories (band levels 4
and 9) were underused in the case of the existing scale, whilst level 4, the lowest
level, was underused when the new scale was employed. For both scales, the peaks
of band level 7 on the probability curves were low.

9.1.1.10 Content – Part three scales

The final two trait scales compared in this section were the two trait scales per-
taining to the rating of the content of part three of the prompt, in which the writers
were asked to either compare the data to the situation in their own country or ex-
tend the ideas developed in the interpretation section to the future. The summary
statistics for the two trait scales can be found in Table 91 below.

Table 91: Rating scale statistics for part three of the content

DELNA scale – Part three of content
Candidate discrimination:
Candidate separation ratio: 3.89
Rater separation and reliability:
Rater separation ratio: 4.19
Rater point biserial: .77
% Exact agreement: 36.1%
Variation in ratings:
% Raters infit high: 30%
% Raters infit low: 20%
Scale properties:
Scale: category 9 underused, skewed
Probability curve: very flat

New scale – Part three of content
Candidate discrimination:
Candidate separation ratio: 4.65
Rater separation and reliability:
Rater separation ratio: 3.66
Rater point biserial: .90
% Exact agreement: 57.5%
Variation in ratings:
% Raters infit high: 10%
% Raters infit low: 10%
Scale properties:
Scale: 4 underused, slightly skewed

As was the case with the two trait scales used for the interpretation of the data, the
number of band levels differed for these two trait scales. Again, the new scale had
only five levels, whilst the existing scale had, like all DELNA trait scales, six lev-
els. Without applying the formula in Equation 1 above, it can be seen that al-
though the new trait scale had a level less, it was more discriminating. Further-
more, the rater reliability and separation statistics show that the new rating scale
was applied more similarly by the raters. Both the rater point biserial correlation
coefficient and the percentage of exact agreement were higher for the new scale
(.90 and 57.5% respectively) than those for the existing scale (.77 and 36.1% re-
spectively) and the raters rated more alike in terms of severity when applying this
new scale.

Considerably fewer raters were identified to be rating with too much or too little
variation when applying the new scale (only 20% of raters) than the existing scale
(half of the raters). Finally, when the underutilized band levels on the trait scales
were established, the scales were very similar. In the case of the existing scale,
level 9, the highest level, was underused by the raters, whilst in the case of the
new scale, level 4, the lowest level, was used very little. The probability curve for
the existing scale was very flat.

9.1.2 Comparison of whole scales

After the individual trait scales were analysed and compared, it was of further in-
terest to explore how the two scales as a whole performed. Figures 51 and 52 be-
low present the Wright maps of the two rating scales.

Figure 51: Wright map of entire DELNA scale including all trait scales
Figure 52: Wright map of entire new rating scale including all trait scales

The left-hand column of each Wright map displays the logit values ranging from
positive values to negative values. The second column shows the candidates in the
sample indicated as asterisks. Higher ability candidates are plotted higher on the
logit scale, whilst lower ability candidates can be found lower on the logit scale.
The next column in the Wright map represents the raters, here indicated as num-
bers. Raters plotted higher on this map rated more severely than those plotted
lower on the map. The wide column in the centre of each Wright map shows the
traits in each rating scale. More difficult traits are plotted higher on the map than
easier traits. Finally, the narrow columns on the right of each Wright map repre-
sent the trait scales (with band levels) in the order they were entered into
FACETS. For the existing scale, these are from left to right: organisation (S1),
cohesion (S2), style (S3), data description (S4), data interpretation (S5), part three
of prompt (S6), sentence structure (S7), grammatical accuracy (S8) and vocabu-
lary/spelling (S9). For the new scale, these are from left to right: accuracy (S1),
repair fluency (S2), lexical complexity (S3), paragraphing (S4), hedging (S5), data
description (S6), data interpretation (S7), part three of prompt (S8), coherence
(S9) and cohesion (S10). These columns display the different band levels avail-
able to raters for each particular trait scale and how these were spread in terms of
difficulty. Myford and Wolfe (2003) describe the horizontal dotted line across a
column (rating scale threshold) as an indicator of the point at which the likelihood
of a candidate receiving the next higher rating begins to exceed the likelihood of a
candidate receiving the next lower rating. For example, a candidate pictured on
the same logit as a category boundary between two band levels has a 50% chance
of being awarded either of these two band levels.

When the two Wright maps were compared, the following observations could be
made. First of all, when the raters used the existing scale, the candidates were
more spread out, ranging over five logits. When the raters employed the new
scale, the candidates were spread over only three logits. Therefore, although most
individual trait scales on the new scale were more discriminating (as shown in the
section on individual trait scales earlier in this chapter), it seems that as a whole,
the existing scale was more discriminating. This is also confirmed by the first of
the rating scale statistics for the whole scale, the candidate separation ratio, dis-
played in Table 92 below.

It also became apparent that the raters were a lot less spread out when using the
new scale. Their severity measures (in logits) ranged from .25 (for the harshest
rater) to -.21 (for the most lenient rater), a range of less than half a logit. When
employing the existing scale, the raters were spread from .64 to -.74 logits, a
range of nearly one and a half logits. That the raters rated more similarly in terms
of severity could also be seen by the inter-rater reliability statistics in Table 92,
which showed that the exact agreement was higher when the new scale was used

(51.2%) than when the existing scale was applied (37.9%). The rater point biserial
correlation coefficient, however, was lower when the new scale was used.

Table 92: Rating scale statistics for entire existing and new rating scales

DELNA scale
Candidate discrimination:
Candidate separation ratio: 8.15
Rater separation and reliability:
Rater separation ratio: 8.67
Rater point biserial: .47
% Exact agreement: 37.9%
Variation in ratings:
% Raters infit high: 40%
% Raters infit low: 10%
Trait statistics:
Spread of trait measures: .78 to -.37
Trait separation: 9.14
Trait fit values: data and part three much over 1.3, no low
Scale properties:
Scale: central tendency, 4 and 9 underused

New scale
Candidate discrimination:
Candidate separation ratio: 5.34
Rater separation and reliability:
Rater separation ratio: 4.19
Rater point biserial: .38
% Exact agreement: 51.2%
Variation in ratings:
% Raters infit high: 0%
% Raters infit low: 0%
Trait statistics:
Spread of trait measures: .53 to -.76
Trait separation: 12.47
Trait fit values: repair fluency and data slightly high, lexis and coherence low
Scale properties:
Scale: 9 underused, otherwise good

Next, the number of raters displaying too much or too little variability in their
ratings was scrutinized. For the existing scale, half the raters fell into one of these
categories whilst no raters did for the new scale.

The scale property statistics indicate that on the existing scale the outer levels
were underused. Only level 9 on the new scale was not sufficiently utilized.

When the different traits were examined on the two Wright maps, it became clear
that the traits on the new scale were slightly more spread out in terms of difficulty,
ranging from .78 on the logit scale for repair fluency to -.74 for cohesion, a differ-
ence of one and a half logits. On the DELNA scale the traits spread from .78 (for
Data – part three) to -.37 (for style), a difference of just over one logit. In a crite-
rion-referenced situation as was the case when these rating scales were used, it is
not necessarily a problem to have a bunching up of traits around the zero logit
point, as is found in Figure 51 with the traits on the existing rating scale. How-
ever, it indicates that raters had difficulty distinguishing between the different
traits or that the traits were related or dependent on each other (Carol Myford,
personal communication, July 2006). The fact that the traits in Figure 52 (new

scale) were more spread out shows that the different traits were measuring differ-
ent aspects.

Apart from indicating the difficulty of the traits, the columns on the right of each
Wright map representing the different trait scales also provide a visual display of
how the different band levels were used by the raters as a group. Wider band lev-
els show a higher probability of a candidate being awarded that level. The outside
levels of the trait scales are indicated in brackets only, because they are infinitely
wide. When the two Wright maps were compared, it became clear that the rating
scale thresholds of the new scale were a lot less tidy. There were greater differ-
ences between the thresholds than when the existing scale was employed. This is
another indication that not all the traits in the new scale were measuring the same
aspects of writing. If the traits were not measuring the same underlying construct,
then this explains why both the candidate separation and the rater point biserial
correlation coefficient of the new scale were lower than those of the
existing scale.

To explore this idea further, the item correlations were scrutinized. Like the rater
point biserial correlation, the item point biserial correlation coefficient measures
how similar the different traits are in terms of what they are measuring. Each cor-
relation coefficient shows the correlation of that particular trait with all the other
traits in the rating scale. Table 93 below presents the different traits in each rating
scale with the associated point biserial correlation coefficient. The average corre-
lation coefficient for each scale can be found in the last row of the table.

Table 93: Item correlation coefficients for existing DELNA scale and new scale
DELNA scale Pt Biserial New scale Pt Biserial
Accuracy .55 Accuracy .52
Sentence structure .54 Repair fluency .27
Vocabulary and spelling .55 Lexical complexity .50
Cohesion .53 Coherence .47
Style .50 Cohesion .43
Organisation .39 Hedging .27
Data description .35 Paragraphing .07
Data interpretation .48 Data description .20
Data Part 3 .38 Data interpretation .29
Data Part 3 .32
Mean .47 Mean .33

The correlation coefficients in Table 93 above show that whilst the traits in the
existing scale seemed to be measuring similar aspects of writing, the traits in the
new scale were less closely related to each other. In particular, para-
graphing, hedging, repair fluency and all trait scales pertaining to content resulted
in very low correlation coefficients.

Because Table 93 above indicates only that the traits were measuring different
underlying abilities, but not how many distinct groups of traits underlay the
data, a principal axis factor analysis (or principal factor analysis – PFA)
was performed on the rating data of each of the two rating scales. This was done
because PFA is able to reduce a set of variables to a smaller number of underlying
factors. To ensure suitability of the data for a PFA, the determinant was calcu-
lated, which tests for multicollinearity or singularity. The determinant of the R-
matrix should be greater than 0.00001 (A. Field, 2000). In addition, the Kaiser-
Meyer-Olkin (KMO) measure of sample adequacy was calculated, which should
be greater than .5. The results of these can be found in Table 94 below.

Table 94: Determinants and KMO statistics for principal factor analyses
DELNA scale New scale
Determinant .001 .071
KMO .937 .814

The results in Table 94 above indicate that both data sets were suitable for a prin-
cipal factor analysis.
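A minimal sketch of these two suitability checks, assuming the ratings are held with one column per trait scale and assuming the factor_analyzer package is available (the DataFrame and its column names are placeholders, not the study's data):

import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_kmo

# Placeholder ratings: one row per rated script, one column per trait scale.
ratings = pd.DataFrame(
    np.random.default_rng(2).integers(4, 10, size=(200, 9)),
    columns=[f"trait_{i}" for i in range(1, 10)],
)

# Determinant of the correlation matrix (R-matrix); it should exceed 0.00001
# to rule out multicollinearity or singularity.
determinant = np.linalg.det(ratings.corr().to_numpy())

# Kaiser-Meyer-Olkin measure of sampling adequacy; it should exceed .5.
kmo_per_item, kmo_total = calculate_kmo(ratings)

print(f"Determinant: {determinant:.5f}, KMO: {kmo_total:.3f}")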

Table 95: Principal factor analysis: existing DELNA scale


Component Eigenvalue % of variance Cumulative %
1 5.803 64.472 64.472
2 .694 7.983 72.455
3 .657 7.691 80.146
4 .549 6.341 86.487
5 .415 4.756 91.243
6 .275 3.076 94.319
7 .209 2.326 96.645
8 .168 1.827 98.472
9 .138 1.528 100.000

PFA reduces the data in hand to a number of components, each with an eigenvalue
representing the amount of variance accounted for by that component. Components with
low eigenvalues are discarded from the analysis, as they are not seen to be con-

tributing enough to the overall variance. Table 95 (DELNA scale) and Table 96
(new scale) below show the results from the principal factor analysis.

Table 96: Principal factor analysis: new scale


Component Eigenvalue % of variance Cumulative %
1 3.434 34.34 34.34
2 1.276 12.76 47.10
3 1.154 11.54 58.64
4 .863 8.63 67.28
5 .817 8.17 75.44
6 .763 7.63 83.07
7 .577 5.77 88.85
8 .491 4.91 93.75
9 .389 3.89 97.64
10 .236 2.36 100.000

The scree plots, which provide a visual representation of the eigenvalues, can be
found in Figure 53 below.

Figure 53: Scree plots of principal factor analysis – DELNA scale (top) and new scale (bottom)

Both the scree plots and the tables displaying the results from the principal factor
analysis show that when the existing rating scale was analyzed, only one major
component was found. This component had an eigenvalue of 5.8 and accounted
for about 65% of the entire variance. All other eigenvalues were clearly below 1
(following Kaiser, 1960) and below .7 (following Jolliffe, 1986) and there was no
further leveling off point on the scree plot. When the new scale was analyzed,
however, the results were different. The principal factor analysis resulted in three
components with eigenvalues over 1, and a further three above 0.7 after which a
leveling off could be seen in the scree plot (resulting in six components).
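As a brief illustration, the two retention criteria can be applied directly to the eigenvalues reported for the new scale in Table 96:

# Eigenvalues reported for the new scale in Table 96.
eigenvalues = [3.434, 1.276, 1.154, 0.863, 0.817, 0.763, 0.577, 0.491, 0.389, 0.236]

kaiser = [e for e in eigenvalues if e > 1.0]    # Kaiser (1960): retain eigenvalues over 1
jolliffe = [e for e in eigenvalues if e > 0.7]  # Jolliffe (1986): retain eigenvalues over .7

print(len(kaiser), "components by the Kaiser criterion")      # 3
print(len(jolliffe), "components by the Jolliffe criterion")  # 6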

The next step in a PFA is to identify which variables load onto which component.
For this, a rotation of the data is necessary. However, because only one compo-
nent was identified for the existing scale, no factor loadings can be displayed. A
varimax rotation was chosen to facilitate the interpretation of the factors of the
new scale. A trait was considered to load on a factor if the loading was higher
than .4. The loadings on the six factors for the new scale
can be seen in Table 97 below.
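A minimal sketch of such a rotated solution using the FactorAnalyzer class from the factor_analyzer package (the extraction method chosen here is an approximation of the principal factor analysis reported, and the DataFrame with its column names is a placeholder rather than the study's rating data):

import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Placeholder ratings on the ten traits of the new scale, one row per script.
traits = ["accuracy", "repair_fluency", "lexical_complexity", "paragraphing",
          "hedging", "data_description", "data_interpretation", "content_part3",
          "coherence", "cohesion"]
ratings = pd.DataFrame(
    np.random.default_rng(3).integers(4, 10, size=(200, 10)), columns=traits
)

# Six-factor solution with a varimax rotation.
fa = FactorAnalyzer(n_factors=6, rotation="varimax", method="principal")
fa.fit(ratings)

loadings = pd.DataFrame(fa.loadings_, index=traits)
# Keep only loadings above the .4 threshold used in the text.
print(loadings.where(loadings.abs() > 0.4).round(2))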

Table 97: Loadings for principal factor analysis
Component
1 2 3 4 5 6
Accuracy .796 .119 -.045 .037 .168 -.035
Repair fluency .155 -.005 .067 -.062 .959 -.064
Lexical complexity .731 .112 .288 .116 .106 .004
Paragraphing .009 -.037 .072 .029 -.061 .992
Hedging .174 .945 .056 -.009 -.030 -.041
Data description .141 .017 .025 .971 -.062 .030
Interpretation of data .338 .448 -.340 .253 .241 .009
Content – Part 3 .269 .105 .850 .090 .139 .092
Coherence .867 .097 .009 .083 .016 .030
Cohesion .875 .089 .064 .064 .039 .021

The factor loadings for the six different components can be described in the fol-
lowing way. The largest factor, accounting for 34% of the variance was made up
of accuracy, lexical complexity, coherence and cohesion. This factor can be de-
scribed as a general writing ability factor. The second factor, which accounted for
a further 13% of the variance, was made up of hedging and interpretation of data.
This is, at first glance, an unusual factor. However, it can be argued that writers
need to make use of hedging in the section where the data is interpreted since the
writer is speculating rather than stating facts. For this reason, a writer who scored
high on hedging might also have put forward more ideas in this section of the es-
say. The third factor, which accounted for 12% of the variance, consisted of part
three of the content, the section in which writers are required to extend their ideas.
The fourth factor, which accounted for 9% of the variance, was another content
factor, the description of data. That all three parts of the content load on separate
factors shows that they were all measuring different aspects of content. Repair
fluency was the only measure that loaded on the fifth factor, which accounted for
another 8% of the variance. The last factor, which also accounted for 8% of the
variance, had only paragraphing loading on it. The six factors together accounted
for 83% of the entire variance of the score, whilst the single factor found in the
analysis of the existing rating scale accounted for only 64% of the variance.

It can therefore be argued that the ratings based on the new scale accounted for
not only more aspects of writing ability, but also for a larger amount of variation
of the scores. In other words, there was less unaccounted variance when the new
scale was used.

A further investigation was conducted into the score profiles produced by individ-
ual raters. The bunching up of traits around the zero logit on the existing scale and
the single factor found in the principal factor analysis described above, indicate
that raters probably displayed a halo effect when using the existing scale. There-
fore, the score profiles for each individual candidate were scrutinized. If a rater
awarded the same score seven or more times (of the nine or ten traits on the two
scales respectively), this was considered to be rating with a halo effect. Table 98
shows that raters displayed a halo effect in the case of nearly half the scripts when
employing the DELNA scale, whilst 12% of scripts were rated with a halo effect
when the new scale was employed.
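A small sketch of this operationalisation (the function and the example profiles are illustrative placeholders; a profile counts as showing a halo effect when one score is awarded on at least seven of the traits):

from collections import Counter

def shows_halo_effect(profile, threshold=7):
    """profile: the scores one rater awarded one script across all trait scales."""
    _, count = Counter(profile).most_common(1)[0]
    return count >= threshold

# One rater's profiles for three scripts on the ten traits of the new scale.
profiles = [
    [6, 6, 6, 6, 6, 6, 6, 7, 6, 6],  # same score nine times: halo effect
    [5, 6, 7, 6, 8, 5, 7, 6, 7, 6],  # no single score dominates: no halo effect
    [7, 7, 7, 7, 7, 7, 7, 6, 5, 7],  # same score eight times: halo effect
]
halo_rate = sum(shows_halo_effect(p) for p in profiles) / len(profiles)
print(f"{halo_rate:.0%} of these scripts rated with a halo effect")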

Table 98: Percentage of scripts rated with halo effect


Rater DELNA scale New scale
1 49% 13%
2 53% 21%
3 52% 11%
4 42% 7%
5 52% 10%
6 56% 25%
7 41% 10%
8 26% 5%
9 35% 11%
10 64% 11%
Mean 47% 12.4%

It was then of interest whether the two rating scales resulted in similar rankings of
the one hundred candidates. For this purpose, the candidate ability measures were
correlated by means of a Pearson correlation and a scatterplot was created to rep-
resent the results visually. The scatterplot (seen in Figure 67 below) indicates a
reasonable correlation between the two variables ‘old’ for the existing scale and
‘new’ for the new scale, although there was some scatter in the lower and higher
ability categories. The correlation coefficient showed a strong positive correlation,
r = .891, p < .001. This indicates that the two scales, when employed as a whole,
ranked the candidates reasonably similarly, especially around the middle ability
levels.
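A sketch of this final comparison, assuming the candidate ability measures from the two FACETS analyses are available as two arrays; the variable names follow the 'old' and 'new' labels mentioned for the scatterplot, and the randomly generated measures are placeholders only:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Candidate ability measures (logits) from the two FACETS analyses;
# the randomly generated values below are placeholders only.
rng = np.random.default_rng(4)
old = rng.normal(0, 1.2, size=100)              # existing DELNA scale
new = 0.6 * old + rng.normal(0, 0.3, size=100)  # new scale

r, p = pearsonr(old, new)
print(f"r = {r:.3f}, p = {p:.3f}")

plt.scatter(old, new)
plt.xlabel("Ability measure, existing scale (logits)")
plt.ylabel("Ability measure, new scale (logits)")
plt.show()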

9.1.3 Summary of quantitative findings

The findings described above show that, in general, the individual trait scales of
the newly developed rating scale were functioning better in a variety of respects
when compared to the existing scale. However, when the two scales as a whole
were analyzed, the existing scale resulted in a higher candidate discrimination. A

principal factor analysis showed that this can be explained by the number of dif-
ferent aspects that the two scales appeared to be measuring. The trait scales on the
existing scale generally appeared to be evaluating the same aspect of writing,
whilst a number of trait scales of the new scale seemed to be assessing different
information about candidates’ writing performance not measured by the existing
scale. The factor analysis of the new scale resulted in six factors before a leveling
off could be seen in the scree plot, whilst the factor analysis of the scores of the
existing scale only produced one large factor. All raters rated fewer scripts with a
halo effect when using the new scale than when using the DELNA scale. Overall,
the two scales ranked the candidates very similarly.

9.2 Raters’ perceptions of the rating scales

The following section, which is dedicated to Research Question 2b, will present the
qualitative findings based on the questionnaire and interview results:

What are raters’ perceptions of the two different rating scales for writing?

9.2.1 Questionnaire results

To answer this research question, the findings from the questionnaire will be re-
ported first. The questionnaire focussed on a number of questions ranging from
the raters’ overall impression of the new scale, and the rating process employed
when using the new scale, to questions about each trait scale individually. An im-
portant point to note here is that the questionnaire focussed only on the new scale.
Questions about the existing scale were asked only during the interviews reported
later in this chapter. The questionnaire questions can be found in Table 77 in
Chapter 8.

Because the answers provided to the questionnaire were generally very short and
very few themes emerged, the findings for each question will, where possible, be
provided in tables with illustrative excerpts.
The first question asked the raters what they liked about the new scale. Raters
mentioned that they thought it was good that the scale was ‘more objective’ (Rater
10), that it ‘gives more guidance’ (Rater 1) and that it made them ‘focus on the
underlying language skills’ (Rater 8). Five raters indicated that they thought that
the rating scale made the rating process easier, because ‘the categories were de-
scriptive’ (Rater 2), the scale made the rating process ‘more mechanical’ (Raters 4
and 6) and it was therefore ‘quick’ (Rater 5) and ‘easy to arrive at a score’ (Rater
7), as well as the fact that the scale was ‘clearly set out without any ambiguities’
(Rater 3). Finally, Rater 2 thought that this kind of ‘descriptive’ rating would

benefit the students in that ‘more specific information about where they did well,
and what they need to work on’ could be provided.

The next question in the questionnaire asked the raters if they thought any catego-
ries were missing in the new scale. The findings are summarized in Table 99 be-
low:

Table 99: Summary of answers to Question 2


Question 2: Are there any categories missing that you find necessary?
No: 5 raters Yes: 4 raters
Categories thought to be missing:
• category to penalize informal vocabulary, abbreviations
• overall assessment
• spelling, punctuation, capitalization
• academic style
• organisation within paragraphs
• quality of expression (clarity of ideas, choice of words, conciseness)
• strength of ideas (appropriacy, development)
• grammar
• sentence structure

Half of the raters thought that no categories were missing. The four raters who
listed aspects of writing which they thought the new scale did not account for,
however, produced a list of nine different categories.

The third question asked if the raters thought any of the trait scales had the wrong
number of band levels. Table 100 below presents an overview of the results from
that section:

Table 100: Summary of answers to Question 3


Question 3: Are there any categories that you think do not have the right number
of levels?
No: 4 raters Yes: 5 raters
• Paragraphing: 'even the poorer students gained high/fairly high levels in spite of me following the instructions. It was too easy to get 7, 8 or 9 for paragraphs of very little merit' (Rater 3)
• Lexical complexity: 'I had a problem with lexical complexity. I don't think the Academic Word List is comprehensive enough and that writers who use certain words esp. phrases not in the list were disadvantaged (e.g. the word 'job' appears on the list, but not 'career' or 'employment')' (Rater 7)
• Hedging
• Coherence
• Cohesion

Five of the raters thought that some of the categories did not have the right num-
ber of levels. Only two raters, however, supplied reasons for their choices. The
two comments provided by Rater 3 (about paragraphing) and Rater 7 (about lexi-
cal complexity) can be seen in Table 100 above.

The fourth question asked the raters if they thought the wording of any of the de-
scriptors needed changing. A brief summary of the results is presented in Table
101 below:

Table 101: Summary of answers to Question 4


Question 4: Are there any categories in which you think the wording of the de-
scriptors needs to be changed?
No: 7 raters Yes: 3 raters
• lexical complexity: I only used 4-7 for 99% (Rater 5)
• accuracy: I used mainly 8/9 or 4/5 (Rater 5)
• data description: 7 seems too high for all trends but no figures (Rater 8)

As Table 101 above shows, most raters were satisfied with the wording in the rat-
ing scale. Rater 5 criticized lexical complexity and accuracy for not allowing him
to use the whole range of levels, whilst Rater 8 criticized one particular level de-
scriptor for data description.

The fifth question asked the raters if they thought any of the descriptors were dif-
ficult to apply. All the raters stated that the descriptors for coherence were a little
difficult to apply, although four of the raters (Raters 1, 2, 8 and 9) mentioned that
they got used to using them with practice. Rater 5 also found accuracy, fluency
and paragraphing difficult to apply, but failed to give more specific reasons and
two raters (Raters 4 and 10) reported having problems applying the descriptors for
cohesion.

In the sixth question raters were asked if they at any time resorted to thinking of a
holistic (overall) score for a script to arrive at a band level for a particular trait
scale. Typical rater responses to this question can be found in Table 102 below.

Table 102: Summary of answers to Question 6


Question 6: Did you at times use a holistic (overall) score to arrive at the scores
for the different categories?

No: 7 raters
• 'I felt the scale didn't allow for that. To make a script fit to a descriptor, I had to actually use the descriptors as guidance, much more than I would do using the DELNA scale' (Rater 1)
Yes: 3 raters
• 'Occasionally for coherence' (Raters 7 and 8)
• 'I occasionally [did this] if I felt the scripts would rate too high on the scale, for example, high on paragraphing, accuracy, hedges, fluency, (the areas easier to get high marks), which would bring up the final mark, when I feel the student required more help. I think it is useful sometimes to have an idea about the final score' (Rater 2)
Did it bother you that you did not know what the final score for each script would come out as?
No: 10 raters
• 'liberating not always having to assess so subjectively' (Rater 4)
• 'I abandoned my idea of a script as 'roughly seven', for example, as I could see that scripts were coming out with considerable variation in scores' (Rater 10)

The raters were further asked if it bothered them not to know how the final score
for each script would be derived. Table 102 above shows that none of the raters
was concerned with the final score. Two illustrative quotes by Raters 4 and 10
exemplify this.

The final part of the questionnaire asked the raters to provide any specific com-
ments about the different trait scales. The remarks made about each trait scale are
summarized in Table 103 below.

Table 103: Summary of answers to Question 7


Accuracy
Positive remarks: 8 raters
• 'I felt this category was better than in the DELNA scale. I felt I was judging the students not on how many errors they had produced or how grave I thought their errors are, but rather on how much they produced correctly' (Rater 1)
• 'this way of marking is easier, clearer' (Rater 2)
Negative remarks: 2 raters
• 'somewhat unfair to the writers to assess someone with one error per sentence in the same way as someone with many errors per sentence' (Rater 4)
• 'no account of very short scripts taken' (Rater 3)
Repair fluency
Positive remarks: 6 raters
• 'I found this category easy to apply, although sometimes very good writers seemed to score low on this category' (Rater 1)
Negative remarks: 4 raters
• 'seemed like a 'difficult' category as this is not normally penalised in exams and tests' (Rater 10)
• 'repair fluency seemed to have little correlation to English ability' (Rater 5)

Lexical complexity
Positive remarks: 7 raters
• 'the Academic Word List helped and after a while I found I got better at applying this category. It made me look in much more detail at the vocabulary produced' (Rater 1)
• 'this category is very indicative of competence' (Rater 5)
Negative remarks: 3 raters
• 'I feel the graph and the topic are more relevant than Academic Word List. For example, has the student used language that is applicable to the task?' (Rater 2)
• 'too strict' (Rater 7)
Paragraphing
Positive remarks: 3 raters
• 'picked up many writers who did not have a clear grasp of paragraphing structure and of one main item per paragraph, and introduction and conclusion' (Rater 4)
• 'logical' (Rater 7)
Negative remarks: 5 raters
• 'illogical paragraphing not penalized' (Rater 3)
• 'easy category for students to score well in' (Rater 10)
No comments: 2 raters
Hedges
Positive comments: 6 raters
• 'this is a great addition as the previous scale lacked such subtlety' (Rater 7)
• 'good, interesting, effective' (Rater 8); Rater 1 noted that 'the scale was easy to follow'
Negative comments: 4 raters
• 'some good scripts managed with no hedging. Is it really necessary? Does it show lack of understanding of academic style to do without? Probably, yes. Many picked up hedging from the question. That should have given all the hint that it was necessary to remember to use it' (Rater 4)
• 'I was not quite satisfied with this category as I felt that it only measured explicit hedging devices' (Rater 10)
Content – data description
Positive comments: 8 raters
• 'it was very easy to apply. Clearer than the DELNA scale' (Rater 1)
• 'was useful to know what to look for as opposed to the DELNA scale' (Rater 2)
No comments: 2 raters
Content – interpretation of data and Part 3 (the comments were identical for both categories and are therefore reported together)
Positive comments: 8 raters
• 'the scale really helped – having a number of ideas made it easier to score' (Rater 7)
• 'this was an improvement on the DELNA – very specific and clear' (Rater 3)

Negative comments: 2 raters
• 'as the emphasis was on quantitative measurement in the descriptors, the quality of ideas, appropriateness, clarity of expression, depth of explanation and support, any repetition got ignored a little' (Rater 5)
• 'whether an idea was supported is what should get more marks. Two well-supported reasons should score higher than three unsupported reasons' (Rater 9)
Coherence
Positive comments: 1 rater
• 'it is doing the writers more justice in this way' (Rater 1)
Negative comments: 7 raters
• 'descriptors are long and sometimes confusing' (Rater 5)
• 'I had trouble coming to grips with coherence' (Rater 6)
No comments: 2 raters
Cohesion
Positive comments: 5 raters
• 'nicely specific and easier to make accurate judgments than the DELNA descriptors' (Rater 3)
Negative comments: 3 raters
• 'difficult to categorize with fairness. Connected to lexical complexity and coherence' (Rater 4)
No comments: 2 raters

Overall, Table 103 above indicates that there was general support for the catego-
ries of accuracy, lexical complexity and the three content trait scales. The remarks
were slightly more mixed, but still overall positive, for repair fluency, hedging
and cohesion. More raters were less in favour of the descriptors for paragraphing
and coherence.

Finally, Rater 10 wrote a noteworthy comment at the end of her questionnaire.
She remarked that 'I think I prefer the existing DELNA scale because I like to
mark on ‘gut instinct’ and to feel that a script is ‘probably a six’ etc. It was a little
disconcerting with the ‘new’ scale to feel that scores were varying widely for dif-
ferent categories for the same script. However, this may also be because I am
more used to marking with the ‘old’ scale, I am sure I could get used to the new
scale if it was introduced’.

9.2.1.1 Summary of questionnaire findings

In summary, it can be said that in general the new rating scale was perceived
positively. The raters were happy with the categories included in the scale, although
some raters made suggestions for other traits that could be added. There were also
some suggestions about levels and descriptors that needed changing, but no clear
consensus could be reached among the participants. The raters seemed to use the

new scale as intended, in an analytic fashion, by focussing on the individual cate-
gories, but some indicated that they reverted to holistic overall impressionistic rat-
ing when they found descriptors problematic or they did not agree with the cali-
bration of a scale. Most individual trait scales were positively received, but a
number of raters had reservations about the descriptors for coherence and para-
graphing.

9.2.2 Interview results

As mentioned in the methodology section, all interviews were conducted after the
analysis of the quantitative data from the rating rounds was completed. This was
done in the hope that some interesting interview topics might emerge from the
results of the data analysis. The broad interview topics can be seen in Table 78 in
Chapter 8.

The rater comments are presented in this section without any evaluation or discus-
sion. A discussion of the interview data can be found in the following chapter
(Chapter 10).

Several themes emerged from the analysis of the interviews. These themes have
been grouped into three broad categories for the presentation of results. Firstly,
themes emerging about the existing DELNA scale are discussed. Then, any
themes pertaining to the new scale are described. The final section presents a
number of more general themes relating to a comparison of the two scales.

9.2.2.1 Section 1: Themes relating to DELNA scale

This section is divided into two parts. The first looks at the problems that raters
experienced when using the DELNA scale and the strategies that they used to
cope with these problems, whilst the second section describes positive aspects of
the DELNA scale.
The most regularly emerging theme in the interviews was that raters often experi-
enced problems when using the DELNA scale. A broad outline of these problems
can be seen in the following list. Each will be discussed in turn:

• descriptors are vague
• descriptors mean different things to different raters
• descriptors mix several aspects of writing into one scale category
• difficulty in differentiating the scale categories from each other

One of the most commonly mentioned problems was that the raters thought the
descriptors were often too vague for them to easily arrive at a score. In the extract

below, for example, Rater 4 talked about the problems she encountered when de-
ciding on a score for Content part three. The sections in italics refer to the word-
ing in the DELNA descriptors. Any comments in square brackets were added by
the researcher to aid understanding and set the context if necessary:

Rater 4: [...] And here relevant and supported, I find that tricky support,
what exactly is support. Because sometimes it is actually, sometimes you
have a number of ideas but there is not much support for them and what is
sufficient. You see, I can, several times I thought that is sufficient, but oth-
ers have said there is not enough. You know, I think, oh well, that is okay
for a page essay or a page and a half or whatever you are supposed to
write. You just can’t, there is nothing specific there to hang things on. I
mean you get the idea and then you feel a bit mean if you just, if there is
just one idea and they might have supported it adequately, so is that idea
not sufficient, but it is certainly relevant and it is certainly expressed
clearly and it is certainly appropriately supported, but there might be only
one idea. Mmh and so where do you put it?

Problems with the vagueness of the DELNA descriptors were also reflected in the
comments by Raters 5 and 9 below:

Researcher: Would you like to change anything in the actual wording [of
the DELNA scale]?
Rater 5: [...] Sometimes I look at it [the descriptors] I’m going ‘what do
you mean by that?’ [...] You just kind of have to find a way around it
cause it’s not really descriptive enough, yeah.

A number of raters pointed directly to the adjectives used in the descriptors as the
source of the scale's vagueness. In the example below, Rater 10 talked about the de-
scriptors for vocabulary and spelling:

Rater 10: Well there’s always a bit of a problem with vocabulary and
spelling anyway in deciding you know the difference between extensive,
appropriate, adequate, limited and inadequate. So there’s sort of adverbial
[sic]. Yeah, it’s really just a sort of adverbial thingy anyway isn’t it so I
think I just go with gut instinct on that one. I probably would privilege vo-
cabulary I think over spelling also bearing in mind that they’re under
exam conditions
Researcher: Do these kinds of words like you just mentioned, are they a
problem in other categories as well for you?
Rater 10: Um, well I just end up going with my gut feeling. It’s quite simi-
lar I guess with things like, um, interpretation of data in part two with the

content where we go from wholly appropriate, sufficient or appropriate,
generally adequate, just adequate or often inaccurate.

Another problem that raters described was a concern that descriptors might mean
different things to different raters. This is, for example, reflected in Rater 4’s
comment about rating the category style:

Rater 4: I find that tricky, that some understanding or adequate understanding of academic style. I differ from others on that too. Mmh, and
some evidence. I think, you know, I think, yeah, that is okay and other
people have gone right down here [to a four], you know. Because they
think there is little understanding. But you know, there is usually in
DELNA candidates, there is usually some evidence of academic style. Just
what adequate means, you never really know.

Some raters also noted problems with several aspects of writing being mixed into
one scale category (as is the case, for example, with vocabulary and spelling).
Rater 3 raised this topic when she was asked what she would like to change about
the DELNA rating scale.

Rater 3: Well, I would say the vocabulary and spelling one when I think
about it now [...] I mean they could be a poor speller but have good vocab
and I’ve mentioned that, about the risk taking. It’s a matter then of judge-
ment, whether you can credit them and I think people do need to be cred-
ited otherwise you do have people being great spellers who never use
words that are very taxing.

Finally, raters sometimes had trouble differentiating rating scale categories from
each other. The extract below illustrates the problems raters seemed to experi-
ence, for example, in differentiating between the categories of sentence structure
and grammatical accuracy. Both scale categories refer to accuracy in the band de-
scriptors.

Researcher: Sentence structure, how do you differentiate between that and grammatical accuracy?
Rater 5: Yeah, tricky, (laughs) because there is some overlap so whether
you can categorize an error as being a grammatical one or mmh is it the
form of the verb, you got that wrong, it is just a grammatical thing and that
impacts on the sentence structure, so you know, mmh, are the basic parts
of the sentence there, the subject, verb, have they sort of made a mistake,
it is either the order of the sentence, that might be a sentence structure
thing, but then, no hang on that is just a grammatical thing, because it

doesn't have a major impact on the sentence structure overall, so maybe
that should be grammatical. So that is quite tricky. I find it hard to distin-
guish between those

Although most raters reported having problems deciding on band levels with the
DELNA scale, the methods of coping are quite different for different raters. A va-
riety of strategies (both conscious and subconscious) emerged from the inter-
views. These were as follows:

• assigning a global score
• rating with a halo effect
• rating with a central tendency effect
• comparing scripts to each other
• disregarding the DELNA descriptors
• asking for directions from the administrator

Each of these will be illustrated in turn by quotes from the interviews.

The first strategy that a number of raters referred to in their interviews was assigning a global score to a script, usually after the first reading. Rater 5 below describes his rating process:


Rater 5: Mmh, yeah, I always automatically think, this is a native speaker,
this is a non-native speaker. How well will this come across, will it be suf-
ficient for academic writing and then that is sort of borderline between six
and seven quite often and then is it a better seven or is it an eight or is it
less than a six, or is it five. Sort of that range between five and eight is
usually where things fall, you just got to decide which category each one
goes into.

One rater, Rater 10, provided a reason for preferring this global type of marking.
She is from a background where impressionistic rating is more commonly used:

Rater 10: [...] I am more used to impressionist marking, because I tend to mark more on literature actually, rather than language, so I am used to
saying I know this is a B+, so I am quite comfortable with this mode of
operating.

This holistic, overall type of rating often results in a halo effect, where a rater
awards the same score for a number of categories on the scale. Below, Rater 10

talked about awarding scores for the three categories grouped under fluency in the
DELNA scale, organisation, cohesion and style:

Rater 10: For style, again, I just tend to go with the gut instinct. And I sus-
pect I often tend to give the same grade or similar grades for cohesion and
style. Probably for the whole of fluency. [...] So in a way, it is almost like
giving a global mark for the three things in consideration. With, if some-
one had no paragraphing, but everything else was good, maybe a bit of
variation.

Apart from holistic rating and rating that results in a halo effect, some raters also
reported mainly using the inside categories of the rating scale when using the
DELNA scale, which results in a central tendency effect. This is illustrated by the
comment by Rater 5 below:

Researcher: Do you feel you use the whole range of scores for style?
Rater 5: Again, I think it is similar to [...], I think that is probably similar
for me throughout. I tend not to use four and nine. I tend to move within
the other four. Probably, the most common marks I am sure I use are five,
six and seven and sometimes an eight.

Other raters, when having problems assigning a score, resorted to the strategy of
comparing scripts to each other. This is shown in Rater 10’s comment below:

Rater 10: I think it’s a bit like the case we talked about below where we go
from appropriate, appropriately adequate, da da da, that they [the descrip-
tors] are a little bit vague, but it does seem to work out in practice that I
just go with my gut instinct and I guess it’s really about comparing differ-
ent scripts so that when you don’t have a lot of them you can say ok this
one, yeah.

Some raters seemed to clearly disregard the DELNA descriptors and override
them with their own impression of a script. Rater 10 (below) was talking about the
score she would award for organisation to a script that had no clear paragraph
breaks but was otherwise well organised. The DELNA descriptors recommend
awarding either a five or a six.

Rater 10: Mmh [...] well, I think I would, (sighs), looking at this it ought
to be a six, but it is possible particularly if I suspected that it was a native
speaker, and that it was someone that wasn’t so strong in academic writing
but actually had very good English, I might even go up to a seven, but I
[...] yeah, if I had other reservations about the language and stuff, then I
would give it a six or even a five if it is really bad. But if I was sort of
convinced by the writer in every other way, I might well push the score up
in a way not to pull them down. Just for the paragraphing.

The final strategy reported when dealing with vague descriptors was to ask the
administrator for directions. For example, Rater 9 described how she deals with
the category of organisation for scripts that have not answered the third part of the
prompt:

Rater 9: This was the direction from the administrator. If there is no para-
graph for part three, then it isn’t fluent and organised effectively.

Above, problems emerging out of the use of the existing rating scale were dis-
cussed. The following section turns to positive themes that emerged about the
DELNA scale.

Whilst some raters seemed to agonize over differentiating between the different
band levels in the DELNA scale, other raters reported that they found them very
easy to use and had no problems assigning levels. This is illustrated in a quote
from Rater 5:

Researcher: What is the difference between a five and a six for cohesion
Rater 5: It is just a matter of degree [...], this [level 5] would cause me a
lot of trouble reading, I would find it hard to read it ... But between here
[between five and six], this is when a non-native speaker is really strug-
gling, a five in cohesion is really struggling and it is really hard for me to
read it. [...] This [six] is like that, but not as bad. And then a little better
[for a seven]. So the difference between acceptable and not quite accept-
able and pretty bad, quite bad. Maybe I wouldn’t really go much between
eight and nine. I mean if it is really really well written, then a nine maybe,
if it really flows, no trouble reading it, wow, that is great.

Rater 9, as shown in the quote below, used her feeling of how a native speaker
would write as a measure or benchmark when deciding on a score for cohesion:

Rater 9: [...] if it was a native speaker for example, this would jump out at
you. .. This would be a kind of subliminal thing. On first reading, which I
would have already done for the content, I would have picked up on
whether it was effortless for me to follow the message. Effortless is differ-
ent from appropriate. Yeah, I think effortless is something you just notice
or not. And I don’t have to judge, I don’t have to go, oh, they used these
types cohesive devices, I just know. And this is, yeah, and then for this
one (eight), it is really not bad at all, like a first draft for example by a
good writer. And again, the middle area (sighs), excellent second language
writer (seven) who would cause me slight strain, but who is not a particu-
larly great writer, but has other things going. But again, that is an impres-
sion.

9.2.2.2 Section 2: Themes emerging about the new scale


The themes emerging about the new scale can also broadly be classified into posi-
tive and negative comments. Positive comments referred to the following topics:

• descriptors in the new scale were seen as more explicit
• a higher level of intra- and inter-rater reliability with the new scale
• changed rating behavior with the DELNA scale since taking part in the research project

As already reflected in the questionnaire results, the raters liked the fact that the
descriptors in the new scale were more explicit. Further evidence of this can be
found in the following extracts from the interviews:

Researcher: Do you feel you used the whole range there [accuracy in the
new scale]?
Rater 10: Yes, yes, I did. Yeah. I think I would be more likely to. Because
I thought I had something to actually back it up with, it had a clearer
guideline for what I was actually doing, so I was more confident for giv-
ing nines and fours. And I think also because I didn’t, I let go of the sense
of this is a seven, so I have to make it come out as a seven and I’d say,
well, sorry, if they have no error-free sentences they get a four and I don’t
care if it is something that might otherwise get a six or a seven and yes, if
they can write completely error-free then I can give them a nine. I have no
problems with that.

The idea of being able to arrive precisely at a score was also echoed in the com-
ment by Rater 7 below:

Rater 7: It is interesting, I found that it [the new scale] is quite different to
the DELNA one and it is quite amazing to be able to count things and say,
I know exactly which score to use now.

The more explicit criteria in the new scale resulted in a number of raters reporting
that it was not possible for them to use impressionistic marking when using the
new scale. It is important to remember that the raters did not know how the final
score would be derived for the new scale. The following quote is from the ques-
tionnaire administered immediately following the rating round using the new
scale. The raters were asked if they ever used a holistic score to arrive at a level
for a script.

Rater 1: I felt that the scale didn’t allow for that. To make the script fit to a
descriptor, I had to actually use the descriptors as guidance, much more
than I would do using the DELNA scale

Raters were also asked in their interviews which scale they thought would result
in higher intra- and inter-rater reliability for their rating. A typical rater response
can be seen in the extract below:

Researcher: With which scale do you think you are more self-consistent?
Rater 7: Probably with the new scale, because it is less subjective, you
know, you can say look there is five self-corrections whereas with the
DELNA one you have, you know, organisation it looks good today,
maybe next week I will think it is not.
Researcher: How about inter-rater reliability?
Rater 7: I think it is going to be more consistent with the new one. For the
same reason, because you can count, fluency, complexity, mechanics,
reader-writer interaction, content. Cohesion and coherence I am probably
a bit open to various ideas, but it is mostly, more than half of it, I am able
to say, no no look that is nine or that is eight.

One of the most unexpected themes emerging from the interviews was the fact
that almost all raters reported a changed rating behaviour since using the new
scale. The first rater interviewed (Rater 3) raised the topic and it was then in-
cluded in the interviews that followed. Here is what Rater 3 said:

Rater 3: Yeah. I found the first time round, there was definitely an im-
provement in my DELNA marking
Researcher: In what way?
Rater 3: It made me more aware, I hadn’t really thought about hedging
very much, I have to say, mmh, so that then I started to notice them, so
there is, it has had a very positive spin-off. It has pinpointed things, be-
cause the DELNA one is less specific, it is less specific, so this, the two
kind of go together quite nicely, this [the DELNA scale] pinpoints things.
But by marking with the new scale, it has, I have got in my mind now, I
can see hedging
Researcher: So maybe like a training scale?
Rater 3: Yeah, it definitely has been very useful. It is sort of more aware-
ness of things which I might have glossed over [...]

This very interesting idea of the scale being useful as a training tool will be dis-
cussed further in the following chapters.

Whilst all the comments about the new scale reported above shed a positive light
on the scale, some less positive comments were also made by the raters. These can
be grouped into three categories:

• aspects of writing not assessed by the new scale
• aspects of writing not assessed by the DELNA scale but included in the new scale
• information lost by the scale being too specific

The first group of comments points to aspects of writing which are not assessed by the new scale but which raters felt were missing. Rater 3, for example, challenged the fact that spelling was not included in the new scale:

Rater 3: But for spelling, the DELNA one actually considers spelling, that
suits me in that way, I think this is quite a ‘me’ scale
Researcher: Because you are used to it?
Rater 3: No, that is what I am saying, I would always be looking at spell-
ing, that would be one thing, it’s ideas, but spelling does really bug me

The second group of negative comments related to aspects of writing in the new
scale that are not included in the DELNA scale. The first comment below relates
to the inclusion of hedging in the new scale:

Rater 10: I just wasn’t sure with, I guess the hedging devices would be an
example, mmh, sometimes I might think it was actually a pretty good
script, but they just hadn’t put any hedging devices in and so I felt like I
was marking them down for something that they didn’t know they were
supposed to do. And that they could maybe produce a pretty good piece of
work without having hedging devices and no kind of account was taken.
So I guess this [the new scale] seemed a bit more rigid to me and maybe
not fitting each individual case.

Two raters criticized the inclusion of repair fluency in the new scale. Rater 5
talked about his own writing style as a comparison:

Researcher: So are you saying it is irrelevant to what we are measuring
how many self-corrections they are producing?
Rater 5: Yeah, we are trying to measure their ability to write well and
communicate ideas and answer the prompt in an academic way. So yeah, I
would say it is irrelevant. Maybe that is clouded by my own experience
writing. I am a very hesitant writer. Maybe that is a reflection of ability,
because people’s processes of writing go differently, their ability to write
a first draft quite well. Maybe it is a reflection of writing. But it doesn’t
seem to me from my experience.

Raters further criticized the fact that some information was lost because the de-
scriptors in the new scale were too specific. Rater 5, for example, argued that a
simple count of hedging devices could not capture variety and appropriateness:

Researcher: You said that, other than hedging, style wasn’t really consid-
ered.
Rater 5: Yeah, it does seem a bit limited. And then they might repeat the
same hedge and they might copy the one from the prompt and so they get
automatic points which I suppose is a strategy you can use when you are
doing academic writing, but quite often sort of non-native speakers will
rely on one or two hedges all the way through [...] Whereas the good writ-
ers will very sparingly use hedges but they will use them just right and
they will vary them. [...] So maybe something about variety of hedges and
appropriateness as well. I suppose that is similar [to the DELNA scale] it
sort of relies on the marker’s knowledge of English in a more kind of
global way sort of. But maybe that is the inter-rater reliability issue com-
ing up. So the DELNA scale allows me the flexibility to use my own
judgement about a script in all categories.

Whilst the first and second section above described themes arising out of the in-
terviews pertinent to the DELNA and the new scale separately, the third section
below looks at topics that emerged about both scales.

9.2.2.3 Section 3: Themes pertaining to both scales

Two themes relevant to both scales were identified in the interviews:


• Time taken when using each scale (practicality)
• Providing feedback to students

The first theme relates to the time taken when using the two scales. This question
was included in the interviews, as it emerged from the questionnaires. Interest-
ingly, the raters differed greatly in the time it took them to rate scripts with the
new scale. One rater, Rater 3, reported taking about three to four times longer
when using the new scale. On the other hand, Rater 5 reported rating a lot faster
when using the new scale. Most raters, however, suggested that it took only
slightly longer with the new scale and that the time taken was within reason. The
comment by Rater 4 reflects the sentiment of most raters interviewed:

Researcher: Did it take you longer to use the new scale?
Rater 4: Mmh, well, I was trying to do it carefully, well I try to do it care-
fully with the DELNA scale too. I can’t remember. There was no obvious
difference. I got faster with the new scale. And every time I come to either
of them, I have to go through it again, so I don’t think either of them was
quicker than the other really.

The final theme pertaining to both scales is providing feedback to students. This
question was included in the interview after the analysis of the quantitative data
revealed that the new scale seemed to be measuring more aspects of writing. As
DELNA is a diagnostic assessment, it was thought that it would be interesting to
see which scale the raters would perceive as more appropriately capturing the dif-
ferent aspects of writing and therefore being more useful for providing feedback
to students. Not all raters were able to answer this question, probably because this
is not an aspect of the assessment situation they are usually confronted with. Rater
9, however, had experience working in the English Language Self Access Centre
(ELSAC), a facility which is often recommended to DELNA candidates, and she
thought in her interview that the DELNA scale might be more useful for provid-
ing feedback:

Researcher: So in terms of feedback, the DELNA scale is more useful?
Rater 9: Yeah, summing up, I think that students would respond very well
to the evaluative, the more emotional type vocabulary type vocab that is
used in the descriptors and they would find it easy to understand what
those meant. But on the other hand, great to say to them, you have only
used one hedge or two hedges or something like that.

Rater 10, another rater with experience in working at ELSAC, however, thought
that the new scale might be more useful:

Researcher: Which scale do you think is more useful to give students
feedback on their writing?
Rater 10: That is interesting. That is a bit hard to comment because I actu-
ally don’t know what the global score would look like for the new scale.
Researcher: Think about more detailed feedback.
Rater 10: [...] I am actually not sure if I could differentiate that. I think
they would probably both be of similar usefulness in terms of feedback.
Like if a student is going to be told you need to come up with more rea-
sons in your argument, that would probably emerge out of [...] I guess the
new scale might offer some more explicit feedback, I guess we might
really be saying, you need to have more than one or two reasons, whereas
here it might say your interpretation was not adequate. So I guess this
might provide some more concrete things to give to students and the data
description one. Yeah, again, if you could say you described the trends,
but you didn’t put any figures in, that might be better than saying your
data description was inadequate. So I think if you could get some preci-
sion out of this [the new scale], and things like the vocabulary, that would
be easier to back that up. And certainly the accuracy.

Overall, no clear consensus was reached on this question on the basis of the inter-
views.

9.2.3 Summary of qualitative findings

The first section presented themes that emerged about the DELNA scale. Most
raters reported experiencing some problems when using the DELNA scale. The
interviews also brought a number of strategies to light that were used when raters
encountered problems. A small number of raters, however, preferred the DELNA
descriptors.

The second section illustrated the themes that emerged about the new scale. These
were divided into positive and negative aspects. The raters found the band descriptors more explicit because, for many of the categories, they were able to count features, and they noted that this probably resulted in an increase in intra- and inter-rater reliability. They also noted a positive spin-off on their rating behavior. All negative comments related to aspects that differ from the existing scale, be it aspects missing from the new scale or aspects included in the new scale but not normally found in other scales.

The third section above reported broader themes relevant to both scales.
Raters differed in the time they took to apply the two scales. Finally, raters dif-
fered in their opinions about which scale might be more useful for providing
feedback to students.

9.3 Conclusion
Chapter 9 presented the results in response to research questions 2a and 2b. The
following chapter, Chapter 10, will attempt to answer the overarching research
question guiding this study.

---
Notes:
1 One rater failed to answer this question.

Chapter 10: Discussion – Validation of Rating
Scale

This chapter focuses on the validation phase of the new scale. The previous two
chapters, Chapters 8 and 9, presented the methodology and results of this phase,
whilst this chapter presents the discussion of the findings.

Although the results in Chapter 9 were reported in two parts (divided into research
questions 2a and 2b), the discussion in this chapter will focus on answering the
overarching research question:

To what extent is a theoretically-based and empirically-developed rating
scale of academic writing more valid for diagnostic assessment than an exist-
ing, intuitively developed rating scale?

The aim of this chapter is to build a validity argument. In contrast to other studies
that aim to validate one test or measure, this study set out to compare the validity
of two rating scales. Validity is therefore established through a comparison of the
two scales. To determine which rating scale is more valid, Bachman’s (2005) and
Bachman and Palmer’s (forthcoming) Assessment Use Argument was used as a
basis. To facilitate the comparison, a table was created in which the relative valid-
ity of the two scales was noted. The empty grid can be seen in Table 104 below.

Table 104: Table used to establish rating scale usefulness


Justification/evidence        New scale    DELNA scale    Scale which provides more
                                                          evidence for Assessment
                                                          Use Argument
Construct validity
Reliability
Authenticity
Impact (Test consequences)
Practicality

As can be seen from the rows in Table 104, to establish validity, Bachman and
Palmer’s (1996; forthcoming) facets of test usefulness were used as guidelines.
The authors define test usefulness in terms of six aspects: construct validity, reli-
ability, authenticity, interactiveness, impact and practicality. These will provide
the structure of this chapter. It is important to point out that the facets of test usefulness used in the table above were designed for the validation of entire tests and
not rating scales. However, because most aspects can be modified and applied to
rating scale validation, the decision was made to follow this framework. Because
interactiveness cannot be established with respect to rating scales, this concept has
been excluded from any further discussion. To guide the discussion of the remain-
ing five facets of test usefulness, a number of warrants have been formulated. A
warrant is a statement which has been devised to represent an ideal situation and
the discussion will establish how closely each rating scale reflects this.

10.1 Construct validity

Bachman and Palmer (1996) define construct validity as ‘the meaningfulness and
appropriateness of the interpretations that we make on the basis of test scores’ (p.
21). According to Weigle (2002), construct validation refers to the process of de-
termining whether a test is actually measuring what it is intended to measure. To
establish construct validity for a rating scale, we need to understand what the pur-
pose and the context of an assessment are, and whether the rating scale is helping
raters to arrive at scores which represent the abilities in question. A variety of
types of evidence can be used to establish construct validity. Of the types men-
tioned by Chapelle (1998), content analysis and empirical investigation will be
used. Three warrants focussing on construct validity have been formulated. The
first two warrants will employ content analysis and empirical investigation, while
the third warrant will involve a consideration of the procedures used during rating
scale development.

10.1.1 Warrant 1: The scale provides the intended assessment outcome appropri-
ate to purpose and context and the raters perceive the scale as representing the
construct adequately

DELNA is a diagnostic assessment system. To establish construct validity for a
rating scale used for diagnostic assessment, we need to turn to the limited litera-
ture on diagnostic assessment. Alderson (2005), as reported in Chapter 2 of the
literature review, compiled a list of features which distinguish diagnostic tests
from other types of tests.
Four of Alderson’s 18 statements are central to rating scales and rating scale de-
velopment. These are shown in Table 105 below.

Table 105: Extract from Alderson’s (2005) features of diagnostic tests


1. Diagnostic tests are designed to identify strengths and weaknesses in a learner’s
knowledge and use of language.
2. Diagnostic tests should enable a detailed analysis and report of responses to items or
tasks.
3. Diagnostic tests thus give detailed feedback which can be acted upon.
4. Diagnostic tests are more likely to be [...] focussed on specific elements than on
global abilities.

Each of these four statements will now be discussed in turn.
Alderson’s first statement calls for diagnostic assessments to identify strengths
and weaknesses in a learner’s knowledge and use of language. Both rating scales
compared in this study were analytic scales and were designed to identify
strengths and weaknesses in different aspects of the learners’ writing ability.
However, the principal factor analysis showed that the new scale distinguished six
different writing factors, accounting for 83% of the variance, whilst the current
DELNA scale resulted in one large factor accounting for 64% of the variance.
Therefore, it could be argued that the new scale was more successful in identify-
ing different strengths and weaknesses as well as accounting for more variance in
the final score. This result also shows that test takers’ abilities in different areas of
writing performance do not develop in parallel (as has been suggested by Young
1995 and Perkins and Gass 1996) but rather develop at different rates and at different times.
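
By way of illustration only, the kind of factor-structure check discussed here can be sketched in a few lines of code: a pronounced halo effect shows up as a single dominant factor in the inter-category correlation matrix of the ratings, whereas diagnostically useful ratings spread the variance over several factors. The sketch below is not the analysis used in this study; it simply eigen-decomposes the correlation matrix (a principal-components view), and the category names and scores in it are hypothetical.

    # Illustrative sketch only (hypothetical data): inspecting the factor
    # structure of analytic ratings for signs of a halo effect.
    import numpy as np
    import pandas as pd

    def variance_explained(ratings: pd.DataFrame) -> pd.Series:
        """Proportion of variance carried by each factor of the
        inter-category correlation matrix, largest first."""
        corr = ratings.corr().to_numpy()
        eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted, largest first
        return pd.Series(eigenvalues / eigenvalues.sum(),
                         index=[f"factor_{i + 1}" for i in range(len(eigenvalues))])

    # One row per script, one column per rating category (hypothetical scores).
    ratings = pd.DataFrame({
        "accuracy":         [5, 6, 7, 6, 8, 4],
        "lexical":          [5, 6, 7, 7, 8, 4],
        "coherence":        [6, 6, 7, 6, 8, 5],
        "hedging":          [4, 7, 5, 8, 6, 4],
        "data_description": [7, 5, 6, 8, 7, 5],
    })
    print(variance_explained(ratings).round(2))
    # A first factor close to 1.0 suggests a strong halo effect; a flatter
    # profile suggests the categories capture distinct aspects of writing.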

The main reason that the ratings based on the DELNA scale resulted in only one
factor was the halo effect displayed by most raters. Although developed as an ana-
lytic scale, the existing scale seemed to lend itself to a more holistic approach to
rating. It is possible that, as hypothesized in this study, the rating scale descriptors do not offer raters sufficient information on which to base their decisions
and so raters resort to a global impression when awarding scores. This then would
explain why, when using the empirically developed new scale with its more de-
tailed descriptors, the raters were able to discern distinct aspects of a candidate’s
writing ability.

Some studies have in fact found that raters display halo effects only when encoun-
tering problems in the rating process (e.g. Lumley, 2002; Vaughan, 1991). Lum-
ley, for example, found that when raters could not identify certain features in the
descriptors, they would resort to more global, impressionistic type rating. This
study suggests that the halo effect and impressionistic type marking might be
more widespread than has so far been reported. The halo effect is usually seen in
the literature as being a rater effect that needs to be reduced or even eliminated by
rater training. However, Cascio (1982, cited in Myford and Wolfe, 2003, p. 396)
notes that the halo effect is ‘doggedly resistant to extinction’. This study has
shown that simply providing raters with more explicit scoring criteria can signifi-
cantly reduce this effect. It could therefore be argued that the halo effect is not
necessarily only a rater effect, but also a rating scale effect.

However, what was not established in this study was whether the raters rated ana-
lytically because they were unfamiliar with the new scale. It is possible that ex-
tended use of the new scale might also result in more holistic rating behavior. This
point will be taken up in the suggestions for further research in the following
chapter.

As mentioned above, when using the new scale, the raters were able to determine
six different factors of writing ability. However, it is also important to examine
the usefulness of these six factors. The first factor, made up of accuracy, lexical
complexity, coherence and cohesion, can be described as a general writing factor.
This is a very useful factor for students to receive feedback on, as it seems to be a
good indicator of overall writing ability.

The second factor, made up of hedging and interpretation of data, is a factor that
lower level writers often struggle with. They are often able to describe data pro-
vided in a graph or table, but struggle to interpret the information and to present it
in the appropriate style (e.g. by using hedging devices). This would again, be a
useful factor for students to receive separate feedback on.

The third factor represented content – Part 3, where students are asked to extend
their ideas. Because the interpretation of data and Part 3 of content load on sepa-
rate factors, it can be argued that students who perform well in one part might
struggle with the other section. This could mean that the time-limit is not suffi-
cient to enable good students to receive high marks on both sections. Further re-
search into this is clearly necessary.

The fourth factor represents the description of the data. From the factor analysis, it
is clear that very different skills are required from learners when describing data
and interpreting them. Therefore, it is again a useful factor to report separately to
students. These results also provide evidence that averaging the different catego-
ries relating to content into one score results in a loss of important information.

The fifth factor represents repair fluency. The usefulness of this measure is doubt-
ful because the quantitative findings were not as convincing as those of other
scales and the raters were generally not convinced of the efficacy of this scale.
However, it is of course possible to argue that copious self-corrections are dis-
tracting to the reader. Even though more and more writing is produced on the
computer, students are expected to produce hand-written answers in exams and
too many self-corrections might distract from their writing. Therefore, although
this measure did not produce very convincing quantitative results, it can be argued
that it could be useful.

The final factor, factor 6, consists of paragraphing. Although the quantitative
analysis showed that this scale was functioning well, a number of raters objected
to the scale because it was too simplistic and did not account for organisation
within a paragraph. Also, if the scale was publicly available, achieving a high score on paragraphing would be easy for students, as formatting their writing into the five paragraphs would be simple. It would therefore be beneficial to develop a
more meaningful scale for paragraphing, but until then, the current scale descrip-
tors are arguably of some use to students.

Alderson’s (2005) second and third statements assert that diagnostic assessments
should enable a detailed analysis and report of responses to tasks and that this
feedback should be in a form that can be acted upon. Both rating scales lend
themselves to a detailed report of a candidate’s performance. However, as evident
in the quantitative analysis, if the raters at times resort to a holistic impression to
guide their marking when using the DELNA scale, this will reduce the amount of
detail that can be provided to students. If most scores are, for example, centred
around the middle of the scale range, then this information is less useful to stu-
dents than if they are presented with a more jagged profile of some higher and
some lower scores which therefore affords a clear indication of which aspects of
their writing they need to focus on.

For both scales, it is unclear to what extent the statements in the scales are of use
to students. It might not help much for a student to know that he or she has ‘little
understanding of academic style’, whilst the more detailed descriptors in the new
scale - ‘less than four hedges used’ - might be more informative. Technical vo-
cabulary, like the term ‘hedges’, would of course have to be defined for students.
The descriptors in the DELNA scale are often very broad and general and might
therefore not provide enough detail to be useful as a basis for instruction. If the
feedback is to be useful to stakeholders, it needs to be detailed and in a form that
can be understood by candidates and their future instructors. Therefore, it might
be useful for testing centres to develop two different types of feedback; one which
is designed solely for students who might have less metalinguistic knowledge and
the other for potential instructors with more technical vocabulary. This idea will
be further developed in the section on practical implications in Chapter 11.

Alderson’s fourth statement asserts that diagnostic tests are more likely to be fo-
cussed on specific elements rather than on global abilities. If a diagnostic test of
writing is aimed at focussing on specific elements, then this needs to be reflected
in the rating scale. Therefore, the descriptors need to lend themselves to isolating
more detailed aspects of a writing performance. The descriptors of the new scale
were more focussed on specific elements of writing because they were based on
discourse analytic measures. The band descriptors on the DELNA scale generally
reflect more global abilities, with vaguer, more general band descriptors. It was
interesting to observe, however, that the raters thought that, because of the more
specific descriptors, important information was lost. Some raters even remarked in
their questionnaires that they thought a scale with band descriptors for an overall
assessment was missing.

The way the scores are reported is also important. It is not effective to use an ana-
lytic scale and then average the scores when reporting back to stakeholders (as is
currently the case with the existing scale), because this will result in a more global
impression of the performance and important information is therefore lost. Cur-
rently the writing scores are reported to test takers as one averaged score with
brief accompanying descriptions about their performance in fluency, content and
form. Academic departments only receive one averaged score. As described in
Chapter 5, students also receive a recommendation on where to receive appropri-
ate help for their level of English proficiency. This could be either the English
Language Self-Access Centre, the Student Learning Centre, or the advice might
be that they should enrol in ESOL credit papers if their scores are found to be suf-
ficiently low. None of this advice, however, focusses on details of their writing
performance. In this way, the current practice is more representative of profi-
ciency tests or placement tests.

To arrive at a truly diagnostic assessment of writing, all categories in an analytic
scale need to be reported back to stakeholders individually, otherwise the diagnos-
tic power of the assessment is lost. On the other hand, if there is no value in
stakeholders knowing the subscores of a writing assessment, then there is very
little use reporting these (Alderson, 1991). This, then, would result in a test that is
not diagnostic, because detailed feedback is one of the key elements of diagnostic
assessment.

Finally, it was also important to establish the stakeholders’ perceptions of the effi-
cacy of the two scales for diagnostic assessment. Only the raters’ opinions were
determined. Raters’ perceptions of the scale usefulness are important as they pro-
vide one perspective on the construct validity of the scale. They are, for example,
able to judge whether the writing construct is adequately represented by the scale.
Just as important as the raters’ perceptions of the usefulness of the scale for diag-
nostic assessment would have been the test takers’ views, as well as the judge-
ments of stakeholders, such as teachers of the students. These were, however, not
canvassed as the scope of this study did not allow for this.

Raters were asked during the interviews which scale they thought might be more
useful for providing feedback to learners. Not all raters commented on this topic.
The raters who were able to answer this question were divided on this issue. Some
raters thought that the DELNA descriptors were more useful as the basis for feed-
back, whilst others considered the new descriptors to be better.

In the course of the interviews and questionnaires it became apparent that most
raters treated DELNA as a proficiency or placement test rather than a diagnostic
assessment. For example, Rater 10 wrote in her questionnaire: I think I prefer the
existing DELNA scale because I like to mark on ‘gut instinct’ and to feel that a
script is ‘probably a six’ etc. It was a little disconcerting with the ‘new’ scale to
feel that scores were varying widely for different categories for the same script.
Similarly, Rater 5 mentioned in his interview: I notice these things [features of
form] as I am reading through, but I try not to focus too much on them. I try to go
for broad ideas and sort of are they answering the question. Are they communi-
cating to me what they need to communicate first of all. And how well do they do
that. Also, some raters suggested in their questionnaires that they would have
liked to see descriptors assessing the overall quality of a script. It seems therefore
that the purpose of the assessment was not clear to them. The findings of this
study suggest that raters need to be made aware of the purpose of the assessment
in their training sessions, so that they recognize the importance of rating each as-
pect of writing separately. This might result in raters displaying less of the halo or
central tendency effects.

Summarizing the evidence for Warrant 1, it can be said that the new scale is able
to provide more information about the strengths and weaknesses of learners, as
more different aspects of writing ability were distinguished and a larger amount of
variance could be explained. It is therefore better equipped to form the basis of
detailed feedback profiles. The descriptors of the new scale also focus more on
specific elements of the writing product than the more general descriptors in the
DELNA scale. The raters’ perceptions of the efficacy of the two scales were di-
vided but slightly in favour of the new scale. However, there was evidence that
the raters were not aware of the purpose of the assessment; a number of their
comments showed that they were treating the assessment as a proficiency test.

10.1.2 Warrant 2: The trait scales successfully discriminate between test takers
and raters report that the scales are functioning adequately

Although the main focus of this section is the discrimination power of the scales,
aspects, such as reliability, that contribute to the discrimination of the scales are also discussed.
The discrimination power of the whole scale is less important in the context of a
diagnostic assessment, as results should be reported back for each trait individu-
ally. In this section, the focus is therefore on the trait scales and the raters’ percep-
tions of these. After the discussion of the individual scales, the raters’ perceptions
of the validity of the scales as a whole will be considered.

The trait scales on the new scale generally resulted in a higher candidate separa-
tion ratio, which means that they were able to discriminate between more levels of
candidate ability. The main reason for the increased candidate separation ratio was
the fact that the raters were rating more similarly to each other and were using
more levels on the rating scale. If raters rate with large differences, their ratings
cancel each other out and this reduces the candidate separation ratio and therefore
also the validity of the assessment.
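
For readers unfamiliar with the statistic, the separation ratio reported by multi-faceted Rasch programs is conventionally the ratio of the 'true' spread of the measures to their average measurement error, and the separation reliability and the number of statistically distinct strata follow directly from it. The sketch below simply restates these standard Rasch formulas for illustration; the input values are hypothetical and it is not output from the software used in this study.

    # Standard Rasch separation statistics, restated for illustration only.
    import math

    def separation_statistics(observed_sd: float, rmse: float) -> dict:
        """Separation ratio, separation reliability and strata for a set of
        measures with observed spread `observed_sd` and mean error `rmse`."""
        true_sd = math.sqrt(max(observed_sd ** 2 - rmse ** 2, 0.0))
        g = true_sd / rmse                    # separation ratio
        reliability = g ** 2 / (1 + g ** 2)   # separation reliability
        strata = (4 * g + 1) / 3              # statistically distinct levels
        return {"separation": g, "reliability": reliability, "strata": strata}

    # Hypothetical example: a wider true spread relative to measurement error
    # means the scale discriminates more levels of candidate ability.
    print(separation_statistics(observed_sd=1.20, rmse=0.45))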

Overall, this study was able to show that the trait scales of the new scale func-
tioned better in most aspects than the DELNA trait scales. However, there were
some exceptions. For example, the discrimination (as measured by the candidate
separation ratio) was slightly lower for the new trait scales of cohesion and coher-
ence if the formula for equating the number of band levels (see Chapter 9) was not
applied. There are two possible explanations for these results. Firstly, all the trait
scales on the new scale were found to be inferior to the existing trait scale where
the focus was on features of writing are difficult to describe in precise detail. For
example, cohesion and especially coherence are aspects of the writing product
which are inherently difficult to quantify and to grade. It is therefore possible that
some aspects of writing lend themselves more to the type of descriptors used in
this study, whilst others are just as successfully applied even if they are not em-
pirically-based. Secondly, it is possible that more training and more experi-
ence in using these new descriptors might help raters in rating these traits more
reliably.
To provide further evidence for the warrant above, each trait scale on the new
scale will be evaluated individually, with evidence taken from both the statistical
analysis and the rater questionnaires and interviews. Based on these findings, rec-
ommendations for future revisions of the new scale will be made.

10.1.2.1 Accuracy:

Most raters remarked positively about the category of accuracy. However, there
was some disapproval of the measure of the percentage of error-free t-units. The
criticism raised reflects similar criticisms in theoretical discussions of this meas-
ure (see for example Wolfe-Quintero et al., 1998). One rater was not convinced
that it was fair that writers with one error per t-unit should be penalized in the
same way as writers with many errors per t-unit, a criticism also raised by re-
searchers such as Bardovi-Harlig and Bofman (1989). Overall, though, it could be
said that the rating scale category of accuracy functioned well in terms of its quan-
titative aspects and was also generally well perceived by the raters. It should
therefore be adopted in any future use of the rating scale. It might be useful to col-
lapse levels 8 and 9 on the new scale as the top band level was underused, espe-
cially since level 9 was created without any empirical basis. Collapsing the two
top levels to read ‘nearly all or all error-free sentences’ would also more accu-
rately mirror the lowest level, which reads ‘nearly no or no error-free sentences’.
However, Myford and Wolfe (2004) caution against collapsing adjacent band lev-
els. It is possible that a larger sample size would have shown that the top band is
in fact necessary.
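
The criticism can be made concrete with a small, purely hypothetical example: under the percentage of error-free t-units, a writer with exactly one error in every t-unit receives the same value as a writer with several errors in every t-unit, although an errors-per-t-unit count would separate them. The error annotations below are invented for illustration.

    # Hypothetical illustration of the criticism raised above: percentage of
    # error-free t-units versus errors per t-unit.
    def percent_error_free(errors_per_t_unit):
        return 100 * sum(1 for e in errors_per_t_unit if e == 0) / len(errors_per_t_unit)

    def mean_errors_per_t_unit(errors_per_t_unit):
        return sum(errors_per_t_unit) / len(errors_per_t_unit)

    writer_a = [1, 1, 1, 1, 1]   # one error in every t-unit
    writer_b = [4, 3, 5, 2, 4]   # several errors in every t-unit

    for label, script in (("Writer A", writer_a), ("Writer B", writer_b)):
        print(label,
              f"error-free t-units: {percent_error_free(script):.0f}%,",
              f"errors per t-unit: {mean_errors_per_t_unit(script):.1f}")
    # Both writers score 0% error-free t-units, but Writer B averages far more
    # errors per t-unit -- the distinction the measure cannot make.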

10.1.2.2 Repair fluency:

The quantitative findings for the category of repair fluency were rather mixed.
Although it resulted in very high candidate discrimination and rater reliability, the
raters differed substantially from each other in terms of leniency and harshness
and nearly half of the raters were shown to be rating with too much or too little
variation. Raters’ comments collected as part of the questionnaire were divided.
About half the raters made positive comments. However, these related generally
to the ease of use of the category rather than its validity. Four raters were not con-
vinced of the efficacy of this scale. One rater suggested that there was little corre-
lation between the results for this category and English ability. Another rater sug-
gested that writers should be encouraged to self-correct rather than be penalized
for too many self-corrections. The reasons for these problems with the category of
repair fluency lie in its origins in speaking. A breakdown in fluency in speaking is
likely to be more problematic than in writing. A writer has time to self-correct
without immediately influencing the reader. Also, while speech occurs in real
time, it is not clear when a self-correction occurs in writing purely on the basis of
the writing product. If the self-correction was made during the initial writing
process, then this could be considered a breakdown in fluency. However, if the
correction occurs as a result of a revision at a later time, it should not be consid-
ered a breakdown in fluency. Whilst all the previous points made focussed on the
writer, it is also possible to view repair fluency from a reader’s perspective. Copi-
ous self-corrections could be seen to have a negative influence on someone read-
ing a text.

Overall, there are arguments for, as well as against, keeping this trait scale. Exces-
sive self-corrections might be distracting to the reader, but a case can also be
made for encouraging self-corrections as these provide evidence of writers’ revi-
sion processes. The quantitative findings of this trait scale were mixed, so if it is
reused in future administrations of the assessment, further revisions are clearly
necessary. If the scale is deleted, then this leaves the problem that no suitable
measure of fluency is available and this could be seen as a weakness of the scale.
However, it could also be argued that fluency is a lot less vital in writing than it is
in speech and thus does not need to be assessed.

10.1.2.3 Lexical complexity:

The quantitative findings for lexical complexity were generally positive when
compared to the vocabulary and spelling category on the existing scale. All as-
pects that were compared resulted in better values, except for the rater separation
ratio, which was slightly higher for the new scale, indicating that the raters were
spread more in terms of leniency and harshness. It seems that raters were able to
rank candidates more similarly when using the new descriptors but they differed
from each other in severity. The reason for this could lie in the fact that the de-
scriptors consisted of two parts: one focussed raters on the Academic Word List,
whilst the other part was more general. This second part was designed for raters
experiencing problems identifying words from the Academic Word List. It is pos-
sible that raters who used one type of descriptor produced more lenient ratings
than raters using the other type of descriptor. It would have been useful to ask rat-
ers which type of descriptor they used, to see if this explained the differences
among them.

The questionnaire comments on the trait scale of lexical complexity were gener-
ally positive. One rater for example remarked that this category is very indicative
of competence. This is also what the analysis of the writing scripts during Phase 1
suggested. This finding is further supported by other research. Loewen and Ellis
(2004), for example, were able to show that vocabulary knowledge is a good pre-
dictor of academic success as measured by grade point average (GPA), especially
if only the GPAs for language and writing-rich courses were used. A similar find-
ing was reported by Elder and von Randow (2002).

Overall, it seems that the trait scale of lexical complexity functions successfully
and therefore should be included if the new scale is adopted. All band levels on
the scales were sufficiently used by the raters, so that there is no reason to change
the number of levels.

10.1.2.4 Paragraphing:

The quantitative results for the category of paragraphing clearly favoured the new
scale. On the other hand, the qualitative results were not quite so clear-cut. Three
raters commented positively, but five raters were less convinced. They remarked
for example that this category did not account for the ordering of information
within paragraphs. This needs to be acknowledged as a clear weakness of the trait
scale. However it is very difficult to design a more detailed scale for paragraphing
without returning to more open-ended, vague descriptors. No previous research
measuring paragraphing empirically was available and it is clear that more work
in this area is necessary. On a purely mechanical basis, the scale functioned well
enough to be included in the new scale. It is not clear, though, if the number of
levels should be changed. The analysis showed that both outer levels were slightly
underused. It might be necessary to collect more data to see if there is any empiri-
cal basis for collapsing any levels.

10.1.2.5 Hedging:

The category of hedging performed well when the quantitative data was analyzed,
outperforming the existing rating scale of style in all aspects. The raters’ com-
ments in the questionnaire were also generally positive, although some raters
thought that a script could be highly successful without hedging devices. It is clear
that the category of hedging provides a substantially narrower picture of a writer’s
academic style than its broader counterpart in the DELNA scale. The vaguer de-
scriptors in the DELNA scale, however, resulted in a central tendency effect.
Hardly any raters used the outside scale categories. This was possibly the case be-
cause raters did not know what specific features to focus on. In Phase 1 of this
study, several aspects of style were pursued, but the only one that successfully
discriminated between the levels was the category of hedging. Further research
resulting in the detailed description of academic style is necessary. In the interviews, a number of raters suggested that whilst lexical complexity focusses on academic vocabulary, a good discriminator of academic style would be the use (or non-use) of informal vocabulary. Although slightly subjective, this is an avenue
that might be worthwhile pursuing further. The category of hedging, although
functioning well, is clearly just one aspect of academic style. Future revisions of
the scale will hopefully include a wider variety of features of academic style. For
example, it might be interesting to investigate whether the category of voice used
by Cumming et al. (2005) is a meaningful measure for the type of writing genre
investigated in this study. In terms of the number of band levels, the highest level
(band 9) was slightly underused. Future research should investigate if it should be
combined with band level 8.
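
Because the hedging descriptors are essentially count-based, the limitation raised by Rater 5 in the interviews (that a simple count cannot capture variety or appropriateness) can be shown with a small sketch that counts hedging tokens as well as distinct hedge types. The hedge inventory below is a short hypothetical sample, not the inventory underlying the scale descriptors.

    # Illustrative sketch: counting hedging devices as raw tokens and as
    # distinct types. HEDGES is a small hypothetical sample inventory.
    import re

    HEDGES = {"may", "might", "could", "perhaps", "possibly", "appears",
              "seems", "likely", "suggests", "tends"}

    def hedge_counts(text: str):
        tokens = re.findall(r"[a-z]+", text.lower())
        used = [t for t in tokens if t in HEDGES]
        return len(used), len(set(used))       # (tokens, distinct types)

    script = ("The results may be due to sampling error and may reflect "
              "seasonal variation; it seems likely that demand may fall.")
    tokens, types = hedge_counts(script)
    print(f"{tokens} hedging tokens, {types} distinct hedges")
    # A purely count-based descriptor treats repeated use of 'may' in the same
    # way as varied, well-chosen hedging -- the limitation noted by the raters.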

10.1.2.6 Content – data description:

The quantitative findings for the comparison of the two scales of data description
favored the new scale, although the raters were further spread in terms of leniency
and harshness. All raters were generally positively disposed towards this scale and
it should therefore be included in any future use of the new scale. Level 4 on the
scale was underused. However, an argument could be made for keeping this level
in the scale to cover instances in which candidates misread the question and do
not describe the data. If further use of the scale shows that this level is underused,
then it should perhaps be collapsed into the next higher level.

10.1.2.7 Content – Interpretation and Part 3:

All categories in the quantitative comparison of the interpretation of data and con-
tent Part 3 of the two rating scales pointed to the two trait scales in the new scale
functioning better than the equivalent trait scales in the DELNA scale. The ques-
tionnaire results showed that almost all the raters were positively disposed to-
wards these new trait scales. One rater, however, preferred using the existing trait
scales because it left him more room to bring in his own knowledge and experi-
ence. He suggested that this more quantitative measurement of content was not
able to evaluate the quality of ideas, their appropriateness and the clarity of ex-
pression or the depth of explanation and support. As was the case with other scale
categories, it is clear that the way of measuring content used in the new trait scales
has its limitations. To be able to arrive at a more reliable judgment, more explicit
categories had to be developed. Aspects such as the clarity of the expression of
ideas are very subjective and were therefore not included in the analysis. In both
the content categories, level 4 was slightly underused. Further use of the scale
needs to establish if these categories need to be combined with the next higher
level.

10.1.2.8 Coherence:

The quantitative findings for coherence were mixed. The new scale was not more
discriminating, but the rater separation was lower; that is, the raters differed less
in their severity and fewer raters were found to be rating with too little or too
much variation. A problem with the trait scale can however be found in the raters’
responses to the questionnaire. Almost all the raters found the category of coher-
ence too difficult or too time-consuming to use. Some raters stated that they got
used to the category as they marked more scripts, so there is some evidence that
the trait scale might become more usable with more training and experience.
However, since the data were collected, another suggestion has been put forward.
It might be useful to undertake a multiple regression analysis with the different
categories used for the analysis of coherence as independent variables and with
the average DELNA score as the dependent variable. In this way, it might be pos-
sible to identify two or three of the seven categories used in the new scale, which
are more indicative of writing ability than others. If, for example, raters could fo-
cus only on the categories of superstructure, coherence breaks and direct sequen-
tial progression, this might make the rating task substantially easier. This would
also mean that a more simplified rating scale could be designed based on the find-
ings of the multiple regression analysis. All raters were asked as part of the inter-
views if they thought a simplified version of the coherence scale might be useful.
All seven raters thought that this might make training and use of this scale suffi-
ciently easier for the scale to be useful in future administrations of the assessment.
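
The regression suggested here could, in principle, take the following form; the sketch below uses the statsmodels package, the input file name is hypothetical, and only the three coherence categories named above are spelled out (the remaining categories would be added in the same way).

    # One possible form of the multiple regression suggested above; the input
    # file and column names are hypothetical placeholders.
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("coherence_measures.csv")   # hypothetical input file

    predictors = ["superstructure", "coherence_breaks",
                  "direct_sequential_progression"]
    # ... the remaining coherence categories would be added to this list.
    X = sm.add_constant(df[predictors])
    model = sm.OLS(df["average_delna_score"], X).fit()
    print(model.summary())
    # Categories with small, non-significant coefficients would be candidates
    # for dropping from a simplified coherence scale.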

Compared to other rating scales in use for coherence, which have been criticized
for being too vague (Watson Todd et al., 2004), this approach to rating scale de-
velopment promises to provide raters with more guidance in the rating process. It
is, however, also possible that aspects like coherence, as suggested above, are not
suitable for scales which aim to provide more explicit descriptors. It is possible
that the two very different scales of coherence compared in this study will always
result in very similar statistical findings, because an aspect of writing like coher-
ence will always defeat more detailed description. In terms of the number of band
levels, level 4 was slightly underused in the new scale. More research is necessary
before any decision on the band levels is possible.

10.1.2.9 Cohesion:

The quantitative findings for cohesion were very similar to those for coherence
except that the raters did not rank the candidates more similarly than with the ex-
isting scale. The questionnaire comments were generally positive. Five raters
mentioned that they found the new descriptors easier to use than the existing ones.
Three raters were less convinced, with one commenting that it is difficult to di-
vorce cohesion from other aspects of writing, such as coherence and lexical com-
plexity. The descriptors for cohesion clearly need to be further revised. None of
the measures used was sufficiently clear in discriminating between the writers.
However, some aspects that might be more successful in distinguishing between
different writing ability levels, like lexical cohesion and cohesive chains (Halliday
& Hasan, 1976; Hoey, 1991; Neuner, 1987; Reynolds, 1995, 1996) are impractical
as they cannot be easily rated. Level 4 of the new scale was underused, but this is
an improvement on the existing scale where levels 4, 5 and 9 were underused.
Overall, the trait scale for cohesion seems promising, but more research is clearly
necessary so that the band descriptors can be refined.

10.1.2.10 Raters’ perceptions of the whole scale:

When considering the scales as a whole, the raters’ comments were mixed. Most
raters reported encountering problems when using the DELNA descriptors. Al-
most all of these comments were related to the descriptors being too vague or non-
specific for raters to be able to easily decide on a score. One reason mentioned in
this respect is the use of adjectives like ‘extensive’, ‘appropriate’ or ‘adequate’.
Raters were very aware that these could mean different things to different raters.
This problem has also been pointed out by a number of researchers (for example
Brindley, 1998; Mickan, 2003; Upshur & Turner, 1995; Watson Todd et al.,
2004). Furthermore, recent evidence from think-aloud protocols of the rating
process lends support to the fact that raters struggle with vague descriptors. Smith
(2000), for example, found that raters had ‘difficulty interpreting and applying
some of the relativistic terminology used to describe performances’ (p. 186).
Shaw (2002) noted that about a third of the raters he interviewed reported prob-
lems when using the criteria but he did not specify what specific problems they
encountered. Similarly, Claire (cited in Mickan, 2003) reported that raters regu-
larly debate the rating scale descriptors in rater training sessions and describe
problems in applying descriptors with terms like ‘appropriately’. It can therefore
be said that there is a growing body of research available that supports the results
obtained from the interviews. Raters do often seem to find the descriptors vague and
consider this to be a problem.

Some raters did not report any problems when using the DELNA descriptors.
However, these raters usually gave some indication in their interviews that they
were rating mostly holistically, and therefore were probably not aware of the need
for a diagnostic assessment to report back individual scores to test takers. It is
possible that rater background played a role in their perception of the descriptors.
Two raters mentioned in their respective interviews that they preferred the non-
specific descriptor style of the DELNA descriptors because their background was
in English literature and they were accustomed to more holistic type rating. That
ESL trained teachers and English Faculty staff rate differently has previously been
shown in studies conducted by O’Loughlin (1993), Song and Caruso (1996) and
Sweedler-Brown (1993). Similarly, because most of the raters have gained their rating experience in the context of proficiency tests (e.g. IELTS), it is possible that specific instructions during rater induction training need to focus on the differences between these two assessment types.

The rater comments about the new scale as a whole were generally positive. The
raters liked the fact that the level descriptors were more explicit and objective and
provided more guidance than the descriptors raters were used to. They reported
that it was much easier for them to ‘let go’ of impressionistic marking.

A number of criticisms of the new scale emerged from the interviews. Some as-
pects of writing were found to be missing from the new scale. Some of these as-
pects mentioned by raters were excluded from the scale based on the findings in
Phase 1 of this study. These were, for example, spelling, punctuation, capitalisa-
tion, sentence structure (as operationalised in the grammatical complexity measures) and certain aspects of academic style. Raters had not been briefed on how the new scale was designed and therefore did not know that these categories had been excluded based on empirical findings. It might have been useful to inform raters about
the scale development process before they used the scale so that they understood
why these categories were excluded. This awareness might help raters in the rat-
ing process.

Raters also suggested that a number of categories appeared to be missing. These had
not been included in the scale as they were found difficult to operationalize. These
were, for example, strength of ideas (appropriacy and development) and quality of
expression (clarity of ideas, conciseness). These aspects are areas which raters can
currently include when using the existing descriptors. However, what is not clear
with more general descriptors such as the ones used in the DELNA scale, is
whether raters end up focussing on the same aspects of writing. There was some
evidence in the interviews that raters focussed on very different features when us-
ing the same category on the rating scale. For example, when rating academic
style, some raters focussed more on non-academic vocabulary whilst others were
more irritated by persistent use of markers of writer identity such as ‘I’. It could
therefore be argued that categories that could be interpreted in a variety of ways
by raters add to the unreliability that has been observed in performance assess-
ments. Others might suggest that it is the role of rater training to counteract any of
these differences. Some of the extracts from the interviews suggest, though, that
raters are very fixed in their views of the writing product and it might not be pos-
sible to change the rating behavior of all raters to achieve high levels of rater reli-
ability. Similarly, authors such as Huot (1990) might argue that forcing raters to
discard their valuable personal experience and background might reduce the valid-
ity of a test. However, I would like to suggest that although raters’ personal ex-
perience and background are important in the rating process, achieving a certain
level of reliability is important for validity. This reliability has been shown to be
difficult to achieve with more conventional rating scales (see for example Cason
& Cason, 1984) and it might therefore be necessary to guide the rating process by
using more detailed, empirically-based descriptors.

Similarly, some raters felt that important aspects of writing were lost because the
new scale descriptors were too specific. For example, the only aspect of academic
style included in the new scale was that of hedging. The DELNA rating scale,
however, has more open-ended descriptors for style, allowing raters to award or
penalize a variety of aspects in this category. Some raters noted that the vagueness
of the descriptors allowed them to look at a more complete picture of the style of a
writing script, whilst hedging is very narrow and not necessary for a successful
script. This, of course, is a valid criticism which can be extended to a number of
features in the new scale. It could therefore be argued that the empirically devel-
oped descriptors take a narrower view of writing, not giving a true representation
of what raters take into account when using less rigid descriptors. Similar criti-
cisms were levelled at analytic rating scales when compared to more holistic type
of ratings by authors such as Huot (1990), Charney (1984) and Barrit, Stock and
Clarke (1986).

However, it is also possible to see this from a different point of view. The nature
of the more general descriptors allows different raters to focus on a variety of as-
pects in the same category. In the case of style, for example, it is possible that one
rater focusses typically on hedging whilst another rater is more concerned with
penalizing informal vocabulary. These differences will very likely result in low-
ered reliability. It would be hard to imagine how these differences could be elimi-
nated to arrive at more reliable ratings (which are of course the basis for any va-
lidity argument). Also, as discussed above in the section on Warrant 1, Alderson
(2005) suggested that raters should focus on more specific, rather than global
abilities, when diagnosing writing ability.

This discussion of Warrant 2, focussing on the discrimination power of the trait scales, has shown that, on the whole, the new trait scales were more discriminat-
ing than the existing ones and the raters experienced very few problems using
them.

10.1.3 Warrant 3: The rating scale descriptors reflect current applied linguistics
theory as well as research

Messick (1989) argued that the construct validation process includes the collec-
tion of empirical evidence (which was discussed in Warrant 1 and 2 above) and a
theoretical rationale. Warrant 3 will now be considered in terms of the theoretical
underpinnings of the two rating scales. This warrant is important for the validity
of rating scales in all assessment contexts and can also be found in Alderson’s
(2005) list of aspects of diagnostic assessment.

The existing DELNA scale was based on pre-existing rating scales and has been
further developed according to the intuitions of administrators and raters involved
in the DELNA assessment. No information is available on the theoretical basis for
the DELNA scale, but it was adopted from a context other than the one it is cur-
rently used for. Overall, Fulcher (2003) would describe the existing DELNA scale
as an intuitively developed scale (see Chapter 3, p. X).

The categories in the new scale, however, were based on a taxonomy derived
from our understanding of language and/or writing development (as was described
in Chapter 4). A taxonomy was necessary as currently no theory of writing is
available. As a result, a number of different models were used and a taxonomy
was established. The descriptors in the new scale were developed empirically,
based on the investigation of writing samples collected in the context of the
DELNA assessment. This is important, as it shows that the descriptors are based
on actual performance and therefore closely represent what actually happens in
writing scripts. Therefore, the new scale is based both on linguistic theory as well as on research (empirical investigation). Thus, it could be argued that the new
scale has more construct validity than the DELNA descriptors because it more
closely reflects current applied linguistics theory and is based on an empirical in-
vestigation.

10.1.4 Summary of construct validity

Table 106 below summarizes the three warrants relating to construct validity dis-
cussed above. The final column shows that in this specific context, the new scale
has more construct validity as it was able to discern more aspects of writing which
can be reported back to stakeholders as diagnostic information and it resulted in
higher discrimination on the trait scales. Most raters found the new scale easier to
use. Finally, it was established that the new scale more closely reflects current theory and that its descriptors, being based on actual student performance, add to its construct validity.

Table 106: Summary table for warrants relating to construct validity

Construct validity
Warrant 1: The scale provides the intended assessment outcome appropriate to purpose and context and the raters perceive the scale as representing the construct adequately
   New scale: Six separate aspects of writing were distinguished
   DELNA scale: Only one major writing factor was established; raters often resorted to halo effect
   Scale which provides more evidence for Assessment Use Argument: New scale
Warrant 2: The trait scales successfully discriminate between test takers and raters report that scale is functioning adequately
   New scale: Discrimination generally higher than DELNA scale; raters’ perceptions generally positive, but mixed
   DELNA scale: Discrimination generally lower than new scale; raters report problems using scale and often do not seem to understand the diagnostic purpose of their rating
   Scale which provides more evidence for Assessment Use Argument: Mixed, but slightly in favour of the new scale
Warrant 3: The rating scale descriptors reflect current applied linguistics theory as well as research
   New scale: Based on taxonomy of writing and rating models; descriptors empirically developed
   DELNA scale: Basis of categories not clear; descriptors intuitively improved over years
   Scale which provides more evidence for Assessment Use Argument: New scale

10.2 Reliability

A crucial aspect of reliability in rating scale development relates to the extent to which test takers receive the same score from one rater to the next. Warrant 4 be-
low relates to reliability.

10.2.1 Warrant 4: Raters rate reliably and interchangeably when using the scale

Historically, reliability was considered to be separate from validity (Chapelle, 1999). However, since Messick’s more inclusive definition of validity, reliability is now seen as one facet of validity, that is, a necessary but not sufficient condition for validity (Davies & Elder, 2005). In other words, validity cannot be
achieved in a writing assessment, if the ratings are unreliable. However, reliability
on its own does not guarantee validity.

A number of different types of reliability estimates and factors contributing to reliability were measured as part of the FACETS analysis. Firstly, the rater separa-
tion ratio was generally lower for the trait scales on the new scale, which indi-
cated that the difference between the harshest and the most lenient rater was
smaller than for the corresponding existing trait scale. Nonetheless, in all cases,
the differences between the harshest and most lenient rater were still found to be
statistically significant. Myford and Wolfe (2004) did note, however, that very
small differences between raters are shown to be statistically significant by the
FACETS program. It is interesting to observe that even trait scales on the new
scale, which constrain raters to counting certain aspects of writing, at times still
resulted in unexpectedly large differences in the scores awarded. This shows that
even a very detailed empirically devised rating scale is not able to completely
eliminate the influence that raters and their background introduce into the rating
situation. This is consistent with the views of other authors who have developed
models of measurement error in performance testing (see for example Fulcher,
2003; McNamara, 1996; Skehan, 1998a).

The second aspect of reliability investigated was the rater point biserial coefficient
(or single-rater/rest-of-rater correlation). The pattern that emerged was that raters generally ranked candidates more similarly when using the new trait scales, result-
ing in a higher rater point biserial value. A high rater point biserial for a trait scale
directly results in higher candidate discrimination, arguably a necessary condition
for a valid rating scale.
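To illustrate how such a single-rater/rest-of-raters correlation can be computed, the short Python sketch below compares each rater’s scores with the mean of the remaining raters’ scores. This is only an approximation of the statistic reported by FACETS, and the example ratings are invented for illustration:

import numpy as np

def single_rater_rest_correlations(ratings):
    # ratings: array of shape (n_candidates, n_raters); each column holds
    # one rater's scores on a given trait scale
    ratings = np.asarray(ratings, dtype=float)
    correlations = []
    for r in range(ratings.shape[1]):
        rest_mean = np.delete(ratings, r, axis=1).mean(axis=1)
        correlations.append(np.corrcoef(ratings[:, r], rest_mean)[0, 1])
    return correlations

# Invented example: five candidates rated by three raters on a 1-9 band scale
example = [[6, 5, 6], [7, 7, 8], [4, 5, 4], [8, 7, 8], [5, 5, 6]]
print(single_rater_rest_correlations(example))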

The third rater reliability statistic measured was the percentage of exact agree-
ment. This is higher if more raters choose exactly the same scale categories. Gen-
erally, the percentage of exact agreement was higher when the new trait scales
were applied, except for coherence and cohesion. The percentage of exact agree-
ment of raters is a variable that has only recently been introduced into the
FACETS output (Myford & Wolfe, 2004). This measure needs to be considered
with some caution and is probably less meaningful than the other rater statistics,
for two reasons. Firstly, exact agreement can be achieved if raters avoid the outer
scale categories and tend to mainly award the inner band levels. This rating be-
havior has been described as the central tendency effect (Landy & Farr, 1983;
Myford & Wolfe, 2003) and is not desirable. However, if high exact agreement is
achieved because of a central tendency effect, then this should inevitably result in
a lower candidate separation and more raters displaying a low infit mean square
value. This was generally not observed in the case of the new scales. Another rea-
son for the high exact agreement could be because a rating scale has few band
levels. Therefore, the chance of two raters awarding the same score level is much
higher with fewer levels to choose from. If the raters were choosing band levels
purely by chance, without referring to any descriptors, the percentage agreement
for the rating scale with fewer categories would inevitably be higher. Therefore,
unless the trait scales that are compared have the same number of band levels, the
measure of percentage of exact agreement is difficult to compare. In the case of
this study, some trait scales on the new scale had fewer scale categories than the
existing trait scales they were compared to.
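The arithmetic behind this argument can be made concrete with the small sketch below; it assumes, purely for illustration, that two raters choose band levels uniformly at random, and the band-level counts shown are hypothetical rather than taken from the two scales compared in this study:

# Expected exact agreement of two raters choosing band levels uniformly at
# random: 1 divided by the number of band levels on the trait scale.
for levels in (4, 6, 9):
    print(f"{levels} band levels: chance exact agreement = {1 / levels:.1%}")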

Another aspect which also contributes to reliability is the percentage of raters who
rated with too much or too little variation (compared to what the FACETS pro-
gram would predict) and whose ratings therefore resulted in very high or low infit
mean square values. The percentage of raters identified to be rating with too much
or too little variation was found to be significantly lower when raters used the em-
pirically developed new scale. A possible reason for this lies in the raters’ reaction
when confronted with scale descriptors that do not offer enough information on
which to make a defensible decision on a band level. It is possible that raters react
to this in two different ways. Some raters might, when not sure what band level to
award to a script, choose a play-it-safe method and mainly award scores that are
in the inner levels of the rating scales. Some evidence of this was found in the in-
terviews and questionnaires of the raters. Anastasi (1988) points out that if raters
avoid using the extreme categories of the rating scale, this reduces the discrimination power of a rating scale. Other raters, when struggling to decide on a band
level, might however attempt to use most levels on the rating scale, and although
they might be quite self-assured about which band level to award when rating, the
resulting ratings might be slightly erratic and therefore inconsistent.

The results of this study seem to suggest that such rating behavior is not only the
result of rater background or individual characteristics that can be alleviated by
training, but might be the product of rating scale descriptors which are not de-
tailed enough to provide a solid basis for the raters in the rating process. There-
fore, it could be argued that Lumley’s (2002; 2005) suggestion that the rating
scale is mainly an inanimate object with little influence on the rating process
might have to be revisited. Because all aspects of the rating situation except the
scale were kept stable in this study, the findings suggest that the rating scale does
have a significant influence on rater behavior. Interestingly, Myford and Wolfe
(2003; 2004) suggest that to reduce both the central tendency effect and inconsis-
tencies in raters, the scale categories need to be defined more precisely so that rat-
ers will have a better idea of what the different band levels mean. This is exactly
what was attempted in this study, with some success it seems.

Overall, the results of the FACETS analysis suggest higher reliability in the rat-
ings based on the new scale. A summary of the findings relating to Warrant 4 can
be found in Table 107 below.

Table 107: Summary of evidence relating to reliability

Reliability
Warrant 4: Raters rate reliably and interchangeably when using the scale
   New scale: Rater reliability generally higher than with DELNA scale
   DELNA scale: Rater reliability generally lower than with new scale
   Scale which provides more evidence for Assessment Use Argument: New scale

10.3 Authenticity

Bachman and Palmer (1996) define authenticity as ‘the degree of correspondence of the characteristics of a given language test task to the features of a target lan-
guage use task’ (p. 23). Authenticity is important as it links the test task to the
idea of generalizability. In the case of this study, it is not the aim to evaluate the
validity of the task but rather to compare the relative validity of two rating scales. Authenticity is therefore a way of assessing the extent to which the score interpre-
tations generalize beyond performance on the assessment to language use in the
target language use (TLU) domain. To establish the authenticity of a rating scale,
we need to consider if what raters are doing is representative of how readers in the
TLU domain would approach a piece of writing. Warrant 5 reflects the degree of
authenticity of the rating scale.

10.3.1 Warrant 5: The scale reflects as much as possible how writing is perceived
by readers in the TLU domain.

No data was collected as evidence for this warrant; however a discussion of this
issue is necessary. Weigle (1998) argues that rating scales are inevitably a reduc-
tion of the construct being measured. Because of this, some authors have argued
that holistic rating is more authentic because it mirrors more closely the natural
process of reading (e.g. White, 1995). On the other hand, we again need to be
mindful of the purpose of this assessment. Alderson (2005), as discussed earlier in
this chapter, pointed out that a diagnostic test should give detailed feedback and
focus on specific, rather than global abilities. By its nature, therefore, diagnosis
reduces authenticity.

Table 108: Summary of evidence relating to authenticity

Authenticity
Warrant 5: The scale reflects as much as possible how writing is perceived by readers in the TLU domain
   New scale: Raters focus more on details, which is appropriate in a diagnostic context
   DELNA scale: Raters read more holistically, which is less suitable to diagnosis
   Scale which provides more evidence for Assessment Use Argument: New scale

The DELNA scale, which often results in more global ratings, is for that reason
not necessarily appropriate for a diagnostic context. When assessing diagnosti-
cally, we are less interested in the generalizability of the ratings to a context be-
yond the assessment context than in identifying detailed strengths and weaknesses
of learners. Therefore, the authenticity of the rating scale might be less of an issue in diagnos-
tic assessment than it is in proficiency tests. Table 108 above summarizes the dis-
cussion of this aspect of test usefulness.

10.4 Impact (test consequences)

According to Bachman and Palmer (1996), the impact of test use operates at two
levels: a micro level which is concerned with effects on individuals (stakeholders)
and a macro level which is concerned with effects on the educational system or
society. Because DELNA is generally considered a low-stakes test, impact will
only be considered at the micro level. Three warrants have been formulated about
individuals who could potentially be affected: the test takers, other stakeholders
and the raters.

10.4.1 Warrant 6: The feedback test takers receive is relevant, complete and
meaningful

No data to support this warrant was collected in the context of this study, as no
test takers were interviewed. Therefore, any suggestions made in this section are
purely speculative. It has been argued earlier in this chapter that the feedback pro-
vided to test takers would be more detailed based on the ratings with the new
scale, as the raters were able to discern more aspects of writing ability. Therefore,
it is possible to assume that the feedback based on the new scale would provide a
more detailed picture of the learners’ strengths and weaknesses. The feedback
might also be more meaningful as the descriptors are more concrete and test tak-
ers could therefore act upon them. However, neither rating scale was designed to
be directly used as feedback (as was mentioned earlier) and therefore more mean-
ingful descriptors need to be designed if students are truly meant to benefit from
the feedback (see Chapter 11).

10.4.2 Warrant 7: The test scores and feedback are perceived as relevant, com-
plete and meaningful by other stakeholders

Other stakeholders who might be impacted upon are teachers (or staff in self-
access labs) and the departments of test takers. More specific feedback on learn-
ers’ difficulties might lead to the introduction of language tutorials in certain con-
tent courses as well as teachers being able to offer more specific help. Again, al-
though this is an important issue for the validity of the scale, no data was col-
lected, so no conclusions can be drawn.

10.4.3 Warrant 8: The impact on raters is positive

Although the concept of impact is generally seen to be external to tests, a type of internal impact was observed as part of this study: the impact of the scale on the rating behaviour of the raters. This was an unexpected outcome which only came to light in the interviews. A number of raters reported that their
rating behavior changed after using the new scale. Rater 3 mentioned that her
awareness was raised by the more explicit descriptors in the new scale. Another
rater, Rater 5, also pointed out that the more detailed descriptors were able to clar-
ify the differences between two rating scale categories on the DELNA scale.
These results suggest that even if there are reasons for not adopting the new scale
in practice, it might be possible to use it successfully as a training tool for raters.
A possible objection to using more explicit descriptors is that such rating scales might increase the time taken to rate. This was not a problem in this
study, but if it was a concern, assessment centers could provide very detailed de-
scriptors during training and use less descriptive rating scales in operational situa-
tions in the hope that raters would transfer the information from the training ses-
sion into the rating situation as suggested by Weir (1990). The training scale could
be kept separately in a training manual, so that raters have the chance to refer to it
whenever they are not sure how to use certain level descriptors. This could be ac-
companied by scripts with matching examples.

Table 109: Summary of evidence relating to impact

Impact (test consequences)
Warrant 6: The feedback test takers receive is relevant, complete and meaningful
   New scale: No data was collected
   DELNA scale: No data was collected
   Scale which provides more evidence for Assessment Use Argument: N/A
Warrant 7: The test scores and feedback are perceived as relevant, complete and meaningful by other stakeholders
   New scale: No data was collected
   DELNA scale: No data was collected
   Scale which provides more evidence for Assessment Use Argument: N/A
Warrant 8: The impact on raters is positive
   New scale: Use of new scale (during operation or training) raised awareness
   DELNA scale: No aspects came to light in study (but question was not directly investigated)
   Scale which provides more evidence for Assessment Use Argument: Possibly new scale

The three warrants pertaining to impact can be seen in Table 109 above. No data was collected from test takers or from stakeholders other than the raters, although speculatively we could expect the new scale to be more useful for these groups. The evidence relating to Warrant 8 suggested that the new scale had a positive impact on the raters.

10.5 Practicality

Bachman and Palmer (1996) define practicality as the ratio of available resources to required resources: if this ratio is equal to or larger than one, test development and use can be seen as practical; if it is lower than one, they are not practical. The authors list three types of relevant resources that need to be examined. The first is human resources, which include test developers, raters, test administrators and clerical support. The second is material resources, which can be divided into space (e.g. rooms for test development and administration), equipment (e.g. computers, software) and materials (e.g. paper, library resources). The final resource is time. Bachman and Palmer (1996) argue that all the resources mentioned above are ultimately a function of the financial budget. In the case of the new scale, two aspects of practicality need to be considered. First of all, it is important to consider whether the use of the scale is practical. This will be discussed under Warrant 9 below, drawing on the findings described in the previous chapter. Secondly, it needs to be considered whether the development of an empirically-developed rating scale is practical. This will be considered under Warrant 10, where the evidence comes from the process of scale development itself.
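Stated as a formula (the notation is mine, not Bachman and Palmer’s), the definition of practicality given at the start of this section amounts to:

\[
\text{practicality} = \frac{\text{available resources}}{\text{required resources}},
\qquad \text{practical if} \geq 1,\ \text{not practical if} < 1.
\]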

10.5.1 Warrant 9: The scale use is practical

Raters seemed to differ greatly in the time they took when using the new scale.
The majority of raters, however, agreed that the time taken to rate a script with the
new scale was slightly longer than that for the DELNA scale, but that this was
within reason. A significant increase in the time taken by raters might mean that
testing centers would have to pay more to their staff, which might not be viable (a
concern raised by North, 2003). Almost all raters agreed that the additional time
taken was not a threat to the practicality of the assessment. In fact, a number of
them seemed to agree that if the category of coherence were to be simplified, very
little or no additional time would be needed in comparison to the existing descrip-
tors.

10.5.2 Warrant 10: The scale development is practical

It is also important to consider if the development of the two scales in question is practical. The two rating scales were developed in very different ways. The DELNA scale was developed based on a pre-existing scale taken from another context and has since been continuously changed based on rater feedback. The new scale, on the other hand, was developed based on an analysis of writing performances, which was extremely time-consuming. Therefore, the development of
the DELNA scale can be considered to be more practical. Also, the scale is avail-
able when the assessment is first administered, while the new scale could only be
developed after sufficient test taker performances were available.

Table 110: Summary of evidence relating to practicality

Practicality
Warrant 9: The scale use is practical
   New scale: Marginally more time-consuming to use than DELNA scale
   DELNA scale: Marginally quicker to use than new scale
   Scale which provides more evidence for Assessment Use Argument: DELNA scale (but no great difference)
Warrant 10: The scale development is practical
   New scale: More time-consuming to develop
   DELNA scale: Less time-consuming as it evolves over years
   Scale which provides more evidence for Assessment Use Argument: DELNA scale

Overall, it can be said that in terms of practicality the arguments were in favor of
the DELNA scale (see Table 110 above).

10.6 Conclusion

Table 111 below presents a summary of the warrants investigated for the Assess-
ment Use Argument of the two rating scales.

Table 111: Summary of evidence for Assessment Use Argument

Construct validity
Warrant 1: The scale provides the intended assessment outcome appropriate to purpose and context and the raters perceive the scale as representing the construct adequately
   New scale: Six separate aspects of writing are distinguished
   DELNA scale: Only one major writing factor is established; raters often resort to halo effect
   Scale which provides more evidence for Assessment Use Argument: New scale
Warrant 2: The trait scales successfully discriminate between test takers and raters report that scale is functioning adequately
   New scale: Discrimination generally higher than DELNA scale; raters’ perceptions generally positive, but mixed
   DELNA scale: Discrimination generally lower than new scale; raters report problems using scale and often do not rate diagnostically
   Scale which provides more evidence for Assessment Use Argument: Mixed, but possibly slightly in favour of the new scale
Warrant 3: The rating scale descriptors reflect current applied linguistics theory as well as research
   New scale: Based on taxonomy of writing and rating models; descriptors empirically developed
   DELNA scale: Basis of categories not clear; descriptors intuitively improved over years
   Scale which provides more evidence for Assessment Use Argument: New scale
Reliability
Warrant 4: Raters rate reliably and interchangeably when using the scale
   New scale: Rater reliability generally higher than with DELNA scale
   DELNA scale: Rater reliability generally lower than with new scale
   Scale which provides more evidence for Assessment Use Argument: New scale
Authenticity
Warrant 5: The scale reflects as much as possible how writing is perceived by readers in the TLU domain
   New scale: Raters focus more on details, which is appropriate in a diagnostic context
   DELNA scale: Raters read more holistically, which is less suitable to diagnosis
   Scale which provides more evidence for Assessment Use Argument: New scale
Impact (test consequences)
Warrant 6: The feedback test takers receive is relevant, complete and meaningful
   New scale: No data was collected
   DELNA scale: No data was collected
   Scale which provides more evidence for Assessment Use Argument: N/A
Warrant 7: The test scores and feedback are perceived as relevant, complete and meaningful by other stakeholders
   New scale: No data was collected
   DELNA scale: No data was collected
   Scale which provides more evidence for Assessment Use Argument: N/A
Warrant 8: The impact on raters is positive
   New scale: Use of new scale (during operation or training) raised awareness
   DELNA scale: No aspects came to light in study (but question was not directly investigated)
   Scale which provides more evidence for Assessment Use Argument: Possibly new scale
Practicality
Warrant 9: The scale use is practical
   New scale: Marginally more time-consuming to use than DELNA scale
   DELNA scale: Marginally quicker to use than new scale
   Scale which provides more evidence for Assessment Use Argument: DELNA scale (but no great difference)
Warrant 10: The scale development is practical
   New scale: More time-consuming to develop
   DELNA scale: Less time-consuming as it evolves over years
   Scale which provides more evidence for Assessment Use Argument: DELNA scale

Returning to the overarching research question which guided the discussion in this
chapter, the following conclusions can be drawn. Bachman and Palmer (1996)
suggest that ‘the most important consideration in designing and developing a lan-
guage test is the use for which it was intended’ (p.17). We need, therefore, to re-
member that the purpose of this test is to provide detailed diagnostic information
to the stakeholders on test takers’ writing ability. Most warrants provided evi-
dence in favour of the new scale. Warrant 1, which focuses on the construct valid-
ity of the assessment and pays special attention to the context of the test, is espe-
cially important.

However, not all the warrants above favoured the new scale. For example, the
warrants relating to practicality were in favour of the DELNA scale. Also, I was
only able to speculate on the impact (test consequences) of the two scales, as no
data was collected to support Warrants 6 and 7. But, as Weigle (2002) argues, it is
impossible to maximise all of the aspects described above. The task of the test de-
veloper is to determine an appropriate balance among the qualities in a specific
situation. Since each context is different, the importance of each quality of test
usefulness discussed above varies from situation to situation. Test developers
should therefore strive to maximize overall usefulness given the constraints of a
particular context, rather than try to maximize all qualities. In the context of
DELNA, a diagnostic assessment, it could be argued that the warrants relating to
construct validity are the most central (as is the case in most assessment situa-
tions). Authenticity and test consequences are possibly less important, considering
that diagnostic tests are generally regarded as low-stakes for the students (Alder-
son, 2005). Practicality is always a crucial consideration, but wherever possible,
construct validity should not be sacrificed simply to ensure practicality.

Overall, the new scale has been shown to generally function more validly and re-
liably in the diagnostic context that it was trialled in than the pre-existing scale.

Chapter 11: Conclusion

11.1 Introduction

The overall aim of this study was to establish whether an empirically-developed rating scale with detailed level descriptors would be more valid in a diagnostic
context than a rating scale representative of those commonly used in proficiency
tests. Based on a taxonomy of a number of theories and models of language abil-
ity, writing development and rater decision-making, eight constructs were chosen
which served as the traits for the rating scale. During Phase 1 of the study, 601
writing scripts at five different proficiency levels were analysed using discourse
analytic measures representing these eight traits. Based on the findings, scale de-
scriptors for ten trait scales were formulated. In the second phase of the study, the
validation phase, ten experienced DELNA raters rated the same one hundred writ-
ing scripts, first using the existing DELNA rating scale and then using the new
scale. The resulting ratings were analysed using a range of statistical analyses (in-
cluding multi-faceted Rasch analysis and principal factor analysis). To supple-
ment the quantitative findings with qualitative data, the raters were asked to com-
plete a questionnaire and a subset of seven raters was interviewed.

I will first summarize my main findings and then discuss their theoretical and
practical implications and limitations. Finally, I will outline some potentially fruit-
ful areas of future research.

11.2 Summary of findings

Based on the taxonomy described above, eight constructs were chosen as the basis
for the traits in the rating scale. These were: accuracy, fluency, complexity, me-
chanics, reader-writer interaction, content, coherence and cohesion. The aim of
Phase 1 was to identify discourse analytic measures for each of these constructs
that were able to successfully differentiate between writing scripts at different
DELNA band levels and that were at the same time sufficiently simple to be used
by raters during the rating process.

Table 112 below shows the discourse analytic measures which were chosen as
suitable measures for the design of the rating scale.

Table 112: Discourse analytic measures included in rating scale

Accuracy: Percentage of error-free t-units
Fluency: Number of self-corrections
Complexity (lexical): Number of AWL words
Mechanics: Number of paragraphs from five-paragraph model
Reader-writer interaction: Number of hedging devices
Content: Percentage of data described; number of reasons and supporting ideas in Parts 2 and 3 of content
Coherence: Categories from topical structure analysis
Cohesion: These/this; number of linking devices in combination with qualitative analysis of linking devices
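To illustrate how a measure such as the number of hedging devices could be operationalised for automated checking, the small Python sketch below counts matches against a word list. The list shown is a hypothetical fragment for illustration only; it is not the inventory of hedging devices used in this study:

import re

# Hypothetical fragment of a hedging inventory (not the study's actual list)
HEDGES = ["might", "may", "could", "perhaps", "possibly",
          "it seems that", "it is likely that", "one can assume that"]

def count_hedges(text):
    # Count case-insensitive, whole-word/phrase occurrences of each hedge
    text = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(h) + r"\b", text)) for h in HEDGES)

sample = "It seems that the trend might continue, and one could perhaps expect growth."
print(count_hedges(sample))  # 4: 'it seems that', 'might', 'could', 'perhaps'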

The analysis during Phase 1 also identified several measures which were not able
to discriminate between the five DELNA proficiency levels or that proved too
complex for inclusion in the scale. These measures (listed in Table 113 below)
were therefore not used in the rating scale.

Table 113: Discourse analytic measures not included in rating scales

Fluency: Number of words; excluded as not discriminating
Complexity (grammatical): Clauses per t-unit; excluded as not discriminating
Complexity (lexical): Average word length; excluded as too complex for the rating process
Complexity (lexical): Sophisticated words over total lexical words; excluded as too complex for the rating process
Mechanics: Number of spelling mistakes; excluded as not discriminating
Mechanics: Number of punctuation mistakes; excluded as not discriminating
Reader/writer identity: Number of markers of writer identity; excluded as not discriminating
Reader/writer identity: Number of boosters; excluded as not discriminating
Reader/writer identity: Number of passives; excluded as not discriminating and because a large number of writers did not use passives

The measures in Table 112 above were used to develop the new rating scale, which was then validated in Phase 2.

The quantitative findings of Phase 2 suggested that, on the whole, the individual
trait scales on the new scale outperformed the trait scales on the pre-existing scale
in a number of categories: they generally resulted in higher candidate discrimina-
tion, higher rater reliability and smaller differences between the raters in terms of
severity. There were by and large also fewer raters identified as rating with too
much or too little variation when using the new scale.

When the ratings based on the scales as a whole were analysed, it became clear
that raters had awarded very similar scores across the different trait categories
when using the DELNA scale. When the new scale was applied, the rating pro-
files were more jagged, suggesting that the different categories on the scale were
measuring different aspects of writing. This was confirmed by a principal factor
analysis which suggested that the ratings based on the DELNA scale only resulted
in one main factor (accounting for 64% of the variance of the writing score),
whilst those based on the new scale resulted in six factors which accounted for
83% of the variance.
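For readers who wish to run a comparable check on their own rating data, the sketch below shows one way of inspecting how score variance distributes across components. It uses principal component analysis from scikit-learn as a rough stand-in for the principal factor analysis reported here, and the rating matrix is randomly generated rather than taken from the study:

import numpy as np
from sklearn.decomposition import PCA

# Invented rating matrix: rows are writing scripts, columns are trait-scale scores
rng = np.random.default_rng(0)
ratings = rng.integers(4, 10, size=(100, 10)).astype(float)

pca = PCA().fit(ratings)

# Proportion of variance explained by each component; in the study, the DELNA
# ratings produced one dominant factor (about 64% of the variance), whereas the
# new scale produced six factors accounting for about 83% together.
print(pca.explained_variance_ratio_.round(2))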

The qualitative findings showed that the raters experienced problems when as-
signing scores with the DELNA descriptors. This was mainly because of the
vagueness of the level descriptors which often did not provide enough information
to the raters. Raters reported a number of strategies they had developed to deal
with these problems. It was furthermore clear from the interviews that raters were
not sufficiently aware that rating in the context of diagnostic assessment requires a different type of rating process than, for example, rating a proficiency test – an issue which needs to be addressed in rater training. When asked about the
new scale, the raters generally reported that they enjoyed being provided with
more objective level descriptors because these provided more information to back
up their ratings.

11.3 Implications

The current study has a number of theoretical and practical implications, which
will be discussed in the following section.

11.3.1 Theoretical implications

The theoretical implications of this study relate to models of performance assessment, models of rater decision-making processes, rating scale classification, and rater reliability in performance assessment. Each of these will be discussed in turn
below.

11.3.1.1 Models of performance assessment

The first theoretical implication resulting from this study relates to the current
models of performance assessment. In the literature review, a number of models
of writing performance were discussed (see Chapter 2). Earlier models were de-
veloped by Kenyon (1992), McNamara (1996) and Skehan (1998a). The latest
model available is by Fulcher (2003). Most of these models were conceived in the
context of oral performance assessment and were therefore adapted for the pur-
pose of this study.
The findings from this study suggest a number of possible additions to Fulcher’s
model. The expanded model can be seen in Figure 53 below.

Figure 53: Expanded model of performance assessment (after Fulcher, 2003)

A number of features have been added to this model to reflect the findings of this
study. New variables are indicated in bold font and striped boxes, and new arrows
are indicated in bold font. Firstly, ‘scale development method’ was included in the
model as an additional variable. This variable has a direct influence on the rating
scale, as was shown in this study. An arrow was also added from ‘orientation /
scoring philosophy’ to ‘scale development method’ as the scoring philosophy is
likely to influence the scale development method. Scale developers might choose
a certain method to develop a rating scale because of an underlying scoring phi-
losophy, but they could also change this scoring philosophy based on the resulting rating scale as indicated by the double-sided arrow linking scoring philosophy and
the rating scale.

A variable called ‘raters’ attitudes’ was also added. This variable influences the
raters and their use of the rating scale. Raters’ attitudes are influenced by their un-
derstanding of the context of the assessment, the purpose of the assessment and
the expected assessment outcomes. An arrow was included between this new vari-
able and the rater characteristics as these can inform and shape the raters’ atti-
tudes. This study, for example, showed that raters were not aware of the impor-
tance of rating for diagnosis. This had a direct influence on the raters’ use of the
scale descriptors and their rating process. Most raters also had a background in
rating a proficiency test and often seemed to transfer this approach to rating into
the context of DELNA, a diagnostic assessment. This understanding should of
course be shaped in rater training (as indicated by an arrow). This same arrow is
double-sided because the raters’ understanding of the purpose of the assessment
can also have a direct influence on the rater training (and its effectiveness).

Fulcher’s (2003) model of test performance included a variable he called ‘score and inferences about the test taker’. In the expanded model, this variable has been
divided into two: ‘assessment outcomes (score reporting)’ and ‘inferences about
test takers’. Both of these are directly affected by the interaction between the rat-
ing scale and the test takers’ performance. The assessment outcomes are further-
more directly influenced by the construct definition (which in turn is closely re-
lated to the context of the assessment, although this is not reflected in the model
for space reasons). The assessment outcomes directly influence the infer-
ences that can be made about test takers. Additionally, inferences that certain
stakeholders need to make about test takers influence the assessment outcomes (in
terms of how scores are reported – see practical implications).

Like Fulcher’s model, this expanded model is also seen as provisional and requir-
ing further research.

11.3.1.2 Models of rater decision-making processes

The second theoretical implication is related to the models of the rating process
available in the literature. Several models (e.g. Erdosy, 2000; Milanovic et al.,
1996; Sakyi, 2000) have shown that raters differ in their typical rating procedures.
Several different rating approaches have been identified. These include the fol-
lowing:
• the read-through-once-then-scan approach
• the performance criteria-focussed approach
• the first-impression-dominates approach

In this study, most raters were identified as following ‘the first-impression-
dominates approach’ when using the existing DELNA descriptors. However,
when using the new rating scale, the raters seemed to follow ‘the performance
criteria-focussed approach’. Because the same raters were used in this study, this
may mean that existing models need to be either expanded or adjusted to allow for
the role the rating scale seems to play in the rating approach that raters choose.
More specific rating scale descriptors seem to be conducive to ‘the performance
criteria-focussed approach’ whilst vaguer, broader level descriptors seem to in-
duce raters to be guided by their first impression.

Existing models of the rating process were formulated on the basis of think-aloud
protocols. It is quite possible (as has been generally acknowledged) that think-
aloud protocols might result in a Hawthorne effect among the raters, suggesting
that raters are more careful when rating under observation than they would normally be. Further doubts about think-aloud protocols have also been raised by recent studies by Barkaoui (2007a, 2007b). In the study described here, no think-aloud protocols
were collected, but raters reported in their interviews some of the strategies that
they employed when rating with the existing rating scale descriptors.

11.3.1.3 Rating scale classification

The third theoretical implication refers to the classification of rating scale types
commonly found in the literature. Weigle (2002), in her comprehensive summary
of the different types of rating scales, distinguished holistic and analytic rating
scales and listed the differences in a table (see Table 4 in Chapter 3). Weigle (as
well as authors such as Bachman & Palmer, 1996; Cohen, 1994; Fulcher, 2003;
Grabe & Kaplan, 1996; Hyland, 2003; Kroll, 1998) described these two types of
rating scales as being distinct from each other. However, this study seems to suggest that although these two scale types are different, it is also necessary to distinguish two types of analytic scales: less detailed, a priori developed scales and more detailed, empirically-developed scales. It is possible that further research will be able to describe different degrees of analyticity in rating scales. Therefore, Weigle’s summary table can be expanded in the following manner (Table 114):

Table 114: Extension of Weigle’s (2002) table to include empirically-developed analytic scales

Reliability
   Holistic scale: Lower than analytic but still acceptable.
   Analytic scale (intuitively developed): Higher than holistic.
   Analytic scale (empirically developed): Higher than intuitively developed analytic scales.
Construct Validity
   Holistic scale: Assumes that all relevant aspects of writing develop at the same rate and can thus be captured in a single score; holistic scores correlate with superficial aspects such as length and handwriting.
   Analytic scale (intuitively developed): More appropriate as different aspects of writing ability develop at different rates, but raters might rate with a halo effect.
   Analytic scale (empirically developed): Higher construct validity as based on real student performance; assumes that different aspects of writing ability develop at different speeds.
Practicality
   Holistic scale: Relatively fast and easy.
   Analytic scale (intuitively developed): Time-consuming; expensive.
   Analytic scale (empirically developed): Time-consuming; most expensive.
Impact
   Holistic scale: Single score may mask an uneven writing profile, may be misleading for placement and may not provide enough relevant information for diagnostic purposes.
   Analytic scale (intuitively developed): More scales can provide useful diagnostic information for placement, instruction and diagnosis, but might be used holistically by raters; useful for rater training.
   Analytic scale (empirically developed): Provides even more diagnostic information than an intuitively developed analytic scale; especially useful for rater training.
Authenticity
   Holistic scale: White (1985) argues that reading holistically is a more natural process than reading analytically.
   Analytic scale (intuitively developed): Raters may read holistically and adjust analytic scores to match their holistic impression.
   Analytic scale (empirically developed): Raters assess each aspect individually.

Researchers and practitioners need to be made aware of the differences between analytic scales and need to be careful when making decisions about the type of
scale to use or the development method to adopt.

11.3.1.4 Rater reliability in performance assessment

The fourth theoretical implication has to do with how rater reliability can be
achieved in performance assessment. A number of researchers have focussed on
counteracting rater effects through extensive rater training and restandardization
sessions (see for example Elder et al., 2007; Elder et al., 2005; McIntyre, 1993;
Weigle, 1994a, 1994b, 1998; Wigglesworth, 1993). But whilst the rating scale has
been acknowledged repeatedly as a source of measurement error (Fulcher, 2003;
McNamara, 1996; Skehan, 1998a), very little time and energy seems to have been
invested in reducing this type of measurement error. The quantitative results for
the comparison of the individual trait scales show that introducing descriptors
based on discourse analytic measures can be very successful in reducing meas-
urement error and therefore adding to the validity and reliability of a writing as-
sessment.

11.3.2 Practical implications

This study has three practical implications. These relate to the weighting of
scores, rater training, and rating scale development and score reporting. Each will
be discussed in turn below.

11.3.2.1 Weighting of trait categories

One issue that was not covered in this study is the question of the weighting of
categories. Depending on the context that a rating scale is used in, it might be sen-
sible to give extra weight to certain rating scale categories. For example, it could
be argued that in an assessment in which both grammatical accuracy and spelling
are evaluated, it might make sense to give more weighting to grammatical accu-
racy as correct spelling is arguably less important (especially as a large number of
writers nowadays use computers). However, the weight of categories might be
less important in diagnostic contexts, as the categories are reported individually to
maximise the feedback to stakeholders. Nevertheless, it might be useful for stu-
dents to understand which aspects of writing are particularly important for their
further success at university. This is arguably different for different disciplines.
Students taking language-rich subjects, for example, need to be able to organise
their writing to a much greater extent than students in the sciences who mainly
have to write technical reports which often follow prescribed formats. Similarly,
students at post-graduate level involved in writing theses and dissertations need to
produce much lengthier discourse than most undergraduate students. It might
therefore be argued that the emphasis of the feedback should differ for different
students. Feedback could also vary according to the proficiency level of students.
However, this might not be practical.

If scores for different categories need to be averaged to provide more accessible
information to groups of stakeholders (e.g. the academic departments of test tak-
ers), it might be useful to draw on the factor loadings as a basis for the weightings.
For example, factor 1 (accuracy, lexical complexity, coherence and cohesion) ac-
counted for 33 percent of the variance, whilst factor 2 (hedging and interpretation
of data) accounted for only 13 percent of the variance. Therefore, factor 1 could
be given more weighting in the final score, if averaging was desired in that con-
text. For this, either the percentage variance could be used, or the score in each
category could be multiplied by the eigenvalue of that factor. However, in the
feedback for test takers the scores should not be averaged.
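As a sketch of the variance-based option, the short calculation below weights two factor-level scores by the percentages of variance reported above; the scores themselves are invented for the example:

# Factor 1 (accuracy, lexical complexity, coherence, cohesion): 33% of variance
# Factor 2 (hedging and interpretation of data): 13% of variance
factor_scores = {"factor_1": 6.5, "factor_2": 5.0}   # invented band scores
variance_weights = {"factor_1": 0.33, "factor_2": 0.13}

weighted_average = (sum(factor_scores[f] * variance_weights[f] for f in factor_scores)
                    / sum(variance_weights.values()))
print(round(weighted_average, 2))  # 6.08 with these invented scores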

11.3.2.2 Rater training

Two issues relating to rater training emerged from the findings of this study.
Firstly, as suggested in the previous chapter, even if a rating scale such as the new
scale is not used under operational rating conditions, there is some evidence from
the interviews that raters profited from using the more detailed new descriptors
because it made them more aware of what aspects of writing to look for and how
to differentiate rating scale categories and band levels. The new scale seemed to
force raters to rate analytically. A more detailed rating scale could be used during
training and could also be kept on the premises where the rating is undertaken for
raters to refer to when experiencing problems. Similarly, raters could be invited to
regularly review the more detailed descriptors in their own time after a break in
rating.

Raters also need to be clear about the purpose of an assessment they are involved
in. This study showed that raters seem to be transferring their rating experience
from proficiency tests to the context of diagnostic assessment, which has a differ-
ent aim and focus. Raters, therefore, need to be made aware of the differences be-
tween these types of assessment during their initial induction training and this idea
needs to be repeated in every rater training session. Raters need to understand that
their scores are not averaged and not used as a basis for student placement, but
rather serve as feedback to students to help them identify their strengths and
weaknesses in order to help them identify the appropriate support structures avail-
able at university.

11.3.2.3 Rating scale development and score reporting

The final practical implication relates to score reporting. Firstly, some more gen-
eral considerations will be discussed. The second part of this section will then provide some suggestions as to how scores could be reported for diagnostic assessment in an English for academic purposes (EAP) context, such as the one
which provided the context for this study.

Table 115 below presents several different purposes for which writing tests are
administered. The second column of the table provides a brief, general definition
of each test type. The third and fourth columns indicate that depending on the
purpose of the writing test, the rating scale in use might need to be different and
that the score will be reported in a different manner. For example, whilst for a
proficiency test it might be less important if the rating scale in use is holistic or
analytic (as long as it results in reliable ratings), the rating scale used in diagnostic
assessment would need to be analytic and at the same time should provide a dif-
ferentiated score profile. The need for these different types of scales is a conse-
quence of the way the scores are reported. Results of tests of writing proficiency
are usually only reported as one averaged score but, as Alderson (2005) suggests,
the score profiles of diagnostic tests should be as detailed as possible and there-
fore any averaging of scores is not desirable.

Table 115: Rating scales and score reporting for different types of writing assessment

Proficiency test
   Definition: Designed to test general writing ability of students.
   Rating scale: Holistic or analytic.
   Score reporting: One averaged score.
Placement test
   Definition: Designed to be used to place students in a specific writing course.
   Rating scale: Usually holistic.
   Score reporting: Usually one averaged score, but scores might be grouped according to the focus of the courses.
Achievement test
   Definition: Designed to show that a student has achieved a certain standard on a writing course.
   Rating scale: Depends on the focus of the writing course.
   Score reporting: Depends on the focus of the writing course.
Diagnostic test
   Definition: Designed to identify strengths and weaknesses in writing ability; designed to provide detailed feedback which students can act upon; designed to focus on specific rather than global abilities.
   Rating scale: Needs to be analytic and needs to result in differentiated scores across traits.
   Score reporting: In detail; separate for each trait rating scale.

The rating scale used for a placement test would differ depending on the context.
If, for example, it is more important to establish the overall writing ability of a
student to place test takers in courses, then a holistic scale (or an analytic scale)
which results in one overall (averaged) score would be needed. However, if the
courses have different foci in terms of the aspects of writing they concentrate on,
then an analytic scale might be more useful in differentiating between test takers.
This might mean that the score is not necessarily reported as one overall score, but
as a group of scores. For example, if one course focuses more on sentence level
features of writing (e.g. grammatical accuracy and mechanics) and another on
overall text organisation in different writing genres, then the scores might be re-
ported in two parts, rather than one averaged score. Finally, the rating scale used
in achievement tests and the way the score is reported depends on the focus of the
writing course under consideration.

The following section proposes more specific types of feedback that could be pro-
vided to different stakeholder groups in the context of a diagnostic EAP writing
assessment such as the one that provided the background for this study. Three
groups of stakeholders will be considered: the test takers, the academic English
tutors of the candidates and the content teachers or the academic departments of
the students.

11.3.2.3.1 Feedback to test takers

In the case of the test takers, the feedback should be clear, written in sufficiently simple language and detailed enough that it can be acted upon. The feedback for each student
should contain advice on the test taker’s strengths and weaknesses in all the areas
assessed.

For example, the feedback on the student’s ability in the category of hedging
could read as follows (Figure 69 below):

In academic writing, writers usually tone down the strength of the claims they make. They
would use phrases and words like ‘might’, ‘it is likely that’, ‘one can assume that’ and ‘it
seems that’. A good writer might use several of these phrases or words to soften what he or
she is saying. You used only a few in the entire essay.
Figure 69: Feedback on hedging for test takers

Apart from detailed explanations, such as the one on hedging presented above, it
might also be useful to provide students with a visual presentation of their per-
formance across different traits. An example of this can be seen in Figure 54 be-
low. In this way students can easily see which aspects of writing require more ur-
gent attention.

Figure 54: Visual feedback for test takers
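A profile chart of this kind could be produced with a few lines of plotting code, as in the sketch below; the trait labels follow the new scale, but the scores shown are placeholders rather than any test taker’s actual results:

import matplotlib.pyplot as plt

# Placeholder band scores for one test taker (not actual DELNA results)
traits = ["Accuracy", "Fluency", "Complexity", "Mechanics",
          "Hedging", "Content", "Coherence", "Cohesion"]
scores = [4, 7, 4, 8, 3, 6, 5, 6]

plt.barh(traits, scores)
plt.xlabel("Band level")
plt.title("Writing profile across trait scales")
plt.tight_layout()
plt.savefig("writing_profile.png")  # or plt.show() in an interactive session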

Next to the feedback for each category of the rating scale, the test taker should
find a summary and a recommendation of what steps to take to improve any
weaknesses found. This could take the form of an action plan for the student. A
complete feedback portfolio for the student represented in the graph above might look as follows (see Table 116 below).

Table 116: Feedback profile for test takers

Accuracy
   Comment on your writing: Only about a quarter of your sentences were error-free. The raters noted that some of your most common errors were: sentences with no verbs, problems with selecting the correct tense, missing plural –s.
   Action plan: You should enrol in an ESOL credit course. ESOL 100 might be the most appropriate course to help you with your problems with accuracy. In your own time, you could also seek help at the English Language Self-Access Centre (ELSAC). When you make an appointment there, the tutor will guide you to the best resources available on campus.
Fluency
   Comment on your writing: You made many self-corrections during your writing. This is very distracting for the reader. However, remember that this is not a problem when you are writing on the computer.
   Action plan: You should plan more before you write. It is, of course, useful to self-correct your writing, but it is a good idea not to cross out too much because you don’t want to confuse your reader. It might be helpful for you to practice your writing under a time limit at home, so that when you are writing, for example, in an exam situation, you are used to writing under pressure and you are still able to plan without compromising the quality of your essay.
Complexity
   Comment on your writing: You used very little academic vocabulary in your writing.
   Action plan: You should enrol in an ESOL credit course. ESOL 100 might be the most appropriate course to help you with your problems with vocabulary. In your own time, you could also seek help at the English Language Self-Access Centre (ELSAC). When you make an appointment there, the tutor will guide you to the best resources available on campus. There are also several useful self-help books available in the library which you can use to study academic vocabulary in your own time. See, for example, the following book: (give example) and you can access the following website from home (give example).
Mechanics
   Comment on your writing: Your paragraphing was very good. Your essay had a clear introduction and conclusion and the different sections of the essay were well differentiated by paragraphs.
   Action plan: Well done, no action needed.
Reader/writer interaction
   Comment on your writing: In academic writing, writers usually tone down the strength of the claims they make. They would use phrases and words like ‘might’, ‘it is likely that’, ‘one can assume that’ and ‘it seems that’. A good writer might use several of these phrases or words to soften down what he or she is saying. You used only a few in the entire essay.
   Action plan: You should enrol in an ESOL credit course. ESOL 100 might be the most appropriate course to help you to improve your hedging. In your own time, you could also seek help at the English Language Self-Access Centre (ELSAC). When you make an appointment there, the tutor will guide you to the best resources available on campus. Attached to this feedback is a list of possible hedging devices. It might also be useful to have a look at the model answer attached to this feedback to give you an idea when it is appropriate to hedge.
Data description
   Comment on your writing: You described all the main trends in the data provided to you, but you failed to back this up with any relevant figures.
   Action plan: Imagine the reader does not have the data you are describing. You need to make sure they can understand the information by your description alone. You should have a look at the model answer provided. You can get further help with this by taking ESOL credit courses, by enrolling in Engwrit (English Writing) or going to ELSAC.
Data interpretation
   Comment on your writing: You provided enough ideas and support for your ideas in this section.
   Action plan: Well done, no further action required.
Data extension of ideas
   Comment on your writing: You provided enough ideas and support for your ideas in this section.
   Action plan: Well done, no further action required.
Coherence
The topics of your sentences The best way to improve your coherence in writing is to
often did not link back to enrol in an ESOL credit course like ESOL 100 or ESOL
previous sentences or you 101.
attempted to link back, but
failed by using incorrect
linking devices.
Cohesion
You overused simple linking The best way to improve your cohesion in writing is to
devices like ‘and’, ‘but’, enrol in an ESOL credit course like ESOL 100 or ESOL
‘because’. 101. You could also go to the ELSAC for help and there
are many self-help books and websites available which
the tutor at the ELSAC can show you. Have a look at the
linking devices used in the model answer attached to
this action plan.

Table 116: Feedback profile for test takers (continued)

Because this detailed feedback could possibly be confusing to students, it is also
important to include an overall recommendation of what students should focus on
most or what they should focus on first.

11.3.2.3.2 Feedback to academic English tutors

The second group of stakeholders that should receive feedback are the English
tutors of the test takers. These tutors might be employed in a self-access lab at the
university or they might teach an ESOL credit paper. The feedback to the teachers
should ideally also be clear and detailed, but the language used can be more so-
phisticated and include metalinguistic terms. Keeping with the example of hedging
used above, the feedback about hedging to a tutor might read as follows (Figure 55):

Whilst we would expect a writer at a high proficiency level to possibly use a number of
hedging devices in this type of writing, this test taker used hardly any.
Figure 55: Feedback on hedging for academic English tutors

An overall recommendation at the end of the feedback could summarize the in-
formation and ask the tutor to focus attention on certain weaknesses.

11.3.2.3.3 Feedback to content teachers and academic departments

The third group of stakeholders who regularly receive feedback in diagnostic as-
sessment are the university departments. In this case, very detailed feedback is of
less use. It is, however, of interest to the departments how their student cohorts as
a whole perform. This might help them to schedule language-focussed content
tutorials, which could then focus on common weaknesses in their group of test
takers. In this case, the feedback could possibly focus on each individual category
in the rating scale and summarize the behaviour of all students in the group.
Again, keeping with the example of hedging, the feedback for a cohort of students
could read (Figure 56):

We expect students to qualify the claims they make by using phrases like ‘it seems that’, ‘…
might be the reason’ or ‘it is likely that’. These phrases are referred to as hedging devices.
Of the twenty learners in your group, the majority did not use sufficient hedging devices in
their writing assessment. Whilst we would expect several hedging devices, most of your
learners used only a few. There were five learners who used no hedging devices at all.
Figure 56: Feedback on hedging for content teachers and academic departments

It might also be useful for this group of stakeholders to receive visual feedback on
the performance of the whole group. Figure 57 below shows what such a graph
might look like. Each grey bar represents the range of the whole group in terms of
test scores, whilst the black crossbars represent the mean of the group. In this
way, the content teachers get a visual impression of how spread out their group is in
terms of abilities and with which aspects of writing the group as a whole had the
greatest problems. Specialist terms such as cohesion and coherence
would have to be explained separately.

Figure 57: Visual group feedback profile for content teachers and departments
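
To make the construction of such a graph concrete, the following is a minimal sketch in Python with matplotlib (an assumption made purely for illustration; it is not the tool used in the study). The category names and band scores are invented.

import matplotlib.pyplot as plt

# Hypothetical band scores for a small cohort, one list per rating scale category.
categories = ["Accuracy", "Fluency", "Complexity", "Mechanics", "Coherence", "Cohesion"]
group_scores = [
    [4, 5, 6, 6, 7, 8],
    [5, 5, 6, 7, 7, 9],
    [3, 4, 5, 6, 6, 7],
    [6, 7, 7, 8, 8, 9],
    [4, 4, 5, 5, 6, 7],
    [3, 5, 5, 6, 7, 8],
]

fig, ax = plt.subplots()
for i, scores in enumerate(group_scores):
    low, high = min(scores), max(scores)
    mean = sum(scores) / len(scores)
    # Grey bar: the range of the whole group for this category.
    ax.bar(i, high - low, bottom=low, width=0.5, color="lightgrey")
    # Black crossbar: the mean of the group.
    ax.hlines(mean, i - 0.25, i + 0.25, color="black", linewidth=3)

ax.set_xticks(range(len(categories)))
ax.set_xticklabels(categories, rotation=45, ha="right")
ax.set_ylabel("Band score")
ax.set_title("Group profile across rating scale categories")
plt.tight_layout()
plt.show()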

Another possible option for feedback to departments and content teachers is to
average some of the scores of the assessment into related categories. For example,
one category could give information about language ability (which could for ex-
ample include categories such as accuracy and lexical complexity), another cate-
gory could be organisation (which would be made up of skills necessary in writ-
ing such as paragraphing and coherence and cohesion). A final category could
give information about students’ ability in the content categories in the rating
scale1. In this way, departments would receive a clear overview of the abilities of
their cohorts and a clear direction on whether more help was needed with lan-
guage tutorials or with writing tutorials.
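
As a minimal sketch of this kind of aggregation (the groupings and scores below are purely illustrative and are not the actual DELNA categories or data), the trait scores could be averaged as follows:

# Hypothetical trait scores for one cohort or test taker, on a common band scale.
trait_scores = {
    "accuracy": 5, "lexical complexity": 6,
    "paragraphing": 7, "coherence": 5, "cohesion": 6,
    "data description": 6, "data interpretation": 7,
}

# Broader reporting categories of the kind suggested above.
reporting_categories = {
    "language": ["accuracy", "lexical complexity"],
    "organisation": ["paragraphing", "coherence", "cohesion"],
    "content": ["data description", "data interpretation"],
}

for category, traits in reporting_categories.items():
    average = sum(trait_scores[t] for t in traits) / len(traits)
    print(f"{category}: {average:.1f}")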

Special consideration needs to be given to the trait category of content. Feedback
on performance in this category is less useful to stakeholders, as it is less likely to
generalize to other types of writing at university. However, this does not mean
that stakeholders should not receive feedback on these trait categories. Academic
departments and content teachers might find it useful to know that students had
problems generating content, even if the content of the DELNA task is not neces-
sarily representative of assignments or exams in the students’ academic depart-
ments. Similarly, students might find it informative to know that they were not
able to describe the data provided in a satisfactory manner or that they did not
generate sufficient ideas.

Not all of the ideas proposed above may be practical, but they provide a frame-
work for developing best practice in providing feedback in a diagnostic assess-
ment context.

11.4 Limitations

Although the study reported in this book was carefully designed, several short-
comings must be acknowledged. These can broadly be divided into two groups.
The first group relates to limitations in the analysis of the 601 writing scripts and
the development of the resulting new scale. The second group of limitations is as-
sociated with the validation phase of the new scale.

11.4.1 Limitations of Phase 1

The first limitation of Phase 1 of this study, and at the same time one of the major
limitations of this study, is the manner in which the five different writing levels
used as the basis for the analysis of the writing scripts were established. Because
no independent measure of writing ability was available to form the five different
sub-corpora, average ratings from the administration of the DELNA writing as-
sessment were used. These are of course based on the pre-existing scale. This
means that the existing rating scale had a direct influence on the development of
the new scale. Two factors may have reduced this potential influence. Firstly, the
ratings of the two raters were averaged in an attempt to reduce the influence of
any extreme individual raters. Secondly, the overall score was chosen, in the hope
that this might represent a general writing ability score. The findings from the fac-
tor analysis in Chapter 9 suggest that because raters’ scores seem to be influenced
by a holistic, global impression of writers’ ability, the overall DELNA writing
score is affected by the scale only to a certain extent. It is therefore hoped that this
flaw in the research design only had a limited effect on the outcome of this study.

The second limitation of Phase 1 is that it was not possible to establish any suit-
able measures for a number of categories in the analysis. One aspect which is piv-
otal to the assessment of performance as opposed to competence (North, 2003) is
fluency. However, although two measures of fluency which could be applied to
the writing product were identified in the literature review and included in the
analysis of the scripts, neither seemed to be very useful in the assessment of writ-
ing. The number of words produced, a measure of temporal fluency, was not suc-
cessful in discriminating between the five levels of writing (and therefore not in-
cluded in the new scale), whilst the number of self-corrections, a measure of re-
pair fluency, was criticized by the raters. Similarly, the new scale lacked a meas-
ure of grammatical complexity, arguably one of the more serious limitations of
the scale. More time might have resulted in the successful exploration of other

potential measures, which could be more fruitfully applied in assessment situa-
tions. For example, the number of passives or complex nominals per t-unit might
be promising measures. Whilst a measure for paragraphing was identified, a num-
ber of raters criticized the simplicity of this measure. Further research might be
able to develop a discourse analytic measure of paragraphing which is able to as-
sess a wider representation of paragraphing, including the organisation within
paragraphs. Further limitations of the scale include the complexity of the measures
of coherence and the lack both of a measure of lexical cohesion and of more varied
measures of academic style (which was assessed only by the number of hedging
devices used). Apart from these limitations with the scale categories, there were
also criticisms of missing scale categories. Some raters noted that there were no
categories in the new scale which could be applied to the inappropriate use of in-
formal vocabulary and abbreviations.

A third limitation relating to the first phase of this research is the fact that the rat-
ing scale was based on a specific task type. This of course limits the generalizabil-
ity of the resulting ratings to other contexts. Messick (1994) describes the aim of
performance testing as not being to measure particular performances but to be
able to deduce competence from those performances. Fulcher (1996c) argued that
aspects of performance based on a general theory of language ability (such as a
model of communicative competence would provide) result in greater gener-
alizability of a test. As soon as task effects are built into the descriptors, the rat-
ings are dependent on the tasks that the writers performed and therefore result in a
context dependent measure that does not generalize.

However, Chalhoub-Deville (1995), in the context of an empirically-derived rating
scale for oral performance, was able to demonstrate a large amount of overlap
between her framework and the framework developed by another research group
(Hinofotis, Bailey, & Stern, 1981) in a different context. She therefore argued
that, although this did not establish generalizability, at least validity was estab-
lished.

A case for generalizability can also be made from the perspective of the task in
use. If the task generally represents what is expected in the target use domain (in
this case the academic setting), then this would establish generalizability even if
the scale is based on this task alone. Research into faculties’ perceptions of which
tasks best represent the writing expected in their target language domain has
shown that there are major differences depending on the content area (e.g. S.
Smith, 2003), thus showing that arriving at a task type suitable for all disciplines is
difficult and potentially impossible.

A fourth limitation relating to the first phase of this research is the fact that the
raters were not directly involved in scale development. Although the scale was
developed on an empirical basis, raters could have been given a voice in the scale
development process. For example, the category of coherence could have been
simplified earlier in the scale development process if raters had been asked to trial
some sample performances. Besides getting them involved in the actual develop-
ment of the scale, it might have proven useful to explain to them certain decisions
that were made in the development process prior to the validation phase. It is pos-
sible that raters would then have understood perfectly well why, for example, spelling
was not included in the scale and would therefore have approached the entire scale
differently.

The last limitation relating to Phase 1 is the fact that the new scale (as well as the
existing DELNA scale) is used to assess the performance of both native and non-
native speakers of English. It could be argued that these two groups of candidates
experience very different problems and should therefore not be assessed using a
common assessment instrument (see for example Elder, 1995). On the other hand,
it is very difficult to establish the language background of students. In the context
of DELNA, the first language of students is established based on a self-report
questionnaire. Anecdotal evidence suggests that a number of students report Eng-
lish as their first language even if this is not the case. Similarly, there are nowa-
days more and more students who learn English from a very early age (for exam-
ple in Singapore or India), but who would experience very different problems with
their language proficiency than, for example, students who grew up and underwent
their schooling in a country like New Zealand, Australia, the United Kingdom or
the United States of America. For practical reasons, it seems almost inevitable that
the scale used needs to be applicable to all students, regardless of their back-
ground.

The limitations listed above all related to the first phase of this study, the analysis
of the writing scripts and the scale development phase. Now, we turn to shortcom-
ings of the second phase.

11.4.2 Limitations of Phase 2

The first limitation, and possibly the most important one, is the fact that the design
of the second phase of this study was not counterbalanced. As mentioned earlier
in Chapter 8, all ten raters first rated all one hundred scripts using the DELNA
scale and only then rated the hundred scripts using the new scale. Ideally, how-
ever, half of the raters should have used the new scale first and then the DELNA
scale, and the other half of the raters should have rated the scripts using the scales
in the opposite order. However, this was not possible for practical reasons. The

raters were only able to rate such large numbers of scripts at certain times during
the year, which constrained the design possibilities. There was
some indication from the interviews, however, that this order had no influence on
the outcome of the study, because the raters were used to rating with the existing
scale. Had the new scale been used first, by contrast, this might have influenced
the findings, as a number of raters reported changing their rating
behaviour after using the new descriptors.
scripts that were rated, it is highly unlikely that raters were able to remember in-
dividual scripts from the first rating round to the second (two to three months
later).

A second limitation of Phase 2 relates to sample size. Only ten raters were used
and these raters rated only one hundred scripts each. The number of raters and
scripts had to be kept to these limits for financial reasons. Although two research
grants were secured to reimburse raters for the time they spent rating, these were
able to cover the expenses for only ten raters. Experts in multi-faceted Rasch
measurement suggest that a larger number of raters and scripts would have re-
turned even more stable results (Mike Linacre and Carol Myford, personal com-
munications). In particular, the scale category statistics would be more trustwor-
thy. As mentioned in Chapter 9, each band level should be used by at least ten rat-
ers for FACETS to return stable results. This was generally, but not always, the
case. Because this problem was anticipated, a fully crossed design was chosen,
which meant that all raters rated all one hundred scripts under both conditions.
This is generally regarded as helping to improve the stability of the statistics re-
turned by FACETS (Myford & Wolfe, 2003).
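
As an aside, "fully crossed" here simply means that every rater rated every script under both scale conditions. A minimal sketch of how such a design can be enumerated (the rater and script labels are hypothetical; only the counts are taken from the study):

from itertools import product

raters = [f"rater_{i}" for i in range(1, 11)]      # ten raters, as in Phase 2
scripts = [f"script_{i}" for i in range(1, 101)]   # one hundred scripts per rater
scales = ["existing DELNA scale", "new empirically-developed scale"]

# Every combination of rater, script and scale condition receives a rating.
design = list(product(raters, scripts, scales))
print(len(design))  # 10 x 100 x 2 = 2,000 ratings in total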

11.5 Suggestions for future research

The findings of the current study point to various directions for future research,
which again can be divided into two groups: those relating to Phase 1 of the cur-
rent study and those relating to the second phase.

11.5.1 Suggestions for future research: Phase 1

The first group of suggestions for further research follow from the shortcomings
identified relating to the scale development. No suitable measure was, for exam-
ple, found for fluency in writing, which is an integral part of assessing perform-
ance (North, 2003). More research is necessary to establish what factors contrib-
ute to creating fluency in writing (especially in the writing product) and how these
factors are different to aspects of fluency in speech (as have been described by
Skehan, 2003). Similarly, as this and a number of other studies (e.g. Cumming et
al., 2005) have shown that writers do not necessarily produce more subordination

in an assessment context, more detailed research is necessary to establish if there
are other measures of grammatical complexity which could be successfully incor-
porated into a rating scale for writing. Further research might also be able to es-
tablish a less mechanical measure of paragraphing. As suggested previously, a
multiple regression analysis of all the measures of topical structure used in this
study might have been able to ascertain the measures most indicative of writing
ability. This analysis should be conducted in the future and trialled on another
group of raters. Finally, a measure of lexical cohesion should be sought in future
developments of the scale as well as more measures of reader-writer identity.

11.5.2 Suggestions for future research: Phase 2

A second group of suggestions for future research relates to the second phase of
this study.
Firstly, it might be useful to conduct think-aloud protocols of raters employing the
two very different rating scales used in this study. Although the quantitative rat-
ings give an indication of differences between the rating processes employed by
raters, and anecdotal evidence was collected from the raters during the interviews,
it might be fruitful to obtain an insight into the cognitive processes of raters dur-
ing the rating procedure. However, some doubt has been cast on the validity of
think-aloud protocols (e.g. Barkaoui, 2007a, 2007b; Stratman & Hamp-Lyons,
1994). It is not entirely clear if the online accounts of raters provide a complete
picture of their thought-processes, if the rating process is fundamentally altered in
the process of providing a think-aloud protocol, if raters are even able to describe
their thought processes or if these remain inaccessible. If think-aloud protocols are
seen to provide valid evidence of the rating process, then it might be useful to
compare the different processes raters follow when using the two different scales.
Findings from this could inform future scale development and rater training.
It would also be useful to establish if the raters rated more analytically because
they were unfamiliar with the new scale. Research needs to ascertain if certain
rating behaviour is associated with specific scale types, or if raters shift their rat-
ing patterns over time irrespective of what type of scale they are using.

A further area of possible future research could be a comparison of the rating be-
haviour of experienced and less experienced raters when employing the two
scales. Although not a focus of this study, there was some evidence that less ex-
perienced raters preferred using the more detailed, empirically-developed scale,
whilst more experienced raters preferred the less descriptive intuitively-developed
scale. If this is in fact the case, then it might be useful to employ more detailed
descriptors in rater induction sessions. Similarly, there was also some evidence
that professional background played a role in raters’ perceptions of the two scales.
Raters coming from a background in English as a first language teaching rather

than English for Speakers of Other Languages (ESOL) seemed more accustomed
to rating holistically and therefore preferred the pre-existing descriptors. It is pos-
sible that raters with a background in English rather than ESOL need to be trained
not to rate globally when assessing in a diagnostic context. However, not enough
data was collected on either of these groups of raters and further research is there-
fore necessary.

Furthermore, it would be interesting to investigate the perceptions of a number
of different stakeholder groups about the feedback they would like to receive.
Face validity is one facet of validity evidence, and in this case the only stake-
holder group questioned was the raters.
departments and staff in service providers such as the English Language Self-
Access Centre and the Student Learning Centre, have very different opinions from
the raters regarding what type of feedback about the students’ writing is desirable
and useful. Similarly, apart from obtaining stakeholder opinions relating to the
desired type and usefulness of feedback, it would be valuable to conduct a study
evaluating the usefulness of this type of feedback in an empirical manner. For ex-
ample, the feedback proposed in this chapter might be perceived as useful by test
takers, but might in fact not help them to improve their writing. A carefully de-
signed study investigating the effect of this type of feedback both on the discourse
produced by candidates, as well as on the effect on subsequent ratings by raters,
could potentially be very interesting.

11.6 Concluding remark

This study set out to explore a way to develop rating scales empirically, so that the
descriptors reflect more closely what happens in students’ performance. Although
the scale development method explored here is not the only possible way to de-
velop empirically-based descriptors, the resulting rating scale was shown to pro-
vide raters with a more explicit basis on which to base their rating decisions than
the more commonly used intuitively-developed rating scales. Not only were the
ratings on individual trait scales of the new scale more reliable and discriminating,
but the resulting score profiles were also more differentiated, because raters were able to
distinguish more aspects of the writers’ performance.
argued that the type of scale developed in this study is more suitable in a diagnos-
tic context, where the aim is to provide students with feedback about their
strengths and weaknesses.

---
Notes:
1 A section specifically considering the feedback on the performance of content can be found below.

Appendix 1:

Structures investigated for writer identity, hedges, boosters and cohesion

Writer Identity
I, you, we, us, our, me, mine, yours, my, your
Hedges
Can, could, may, might, perhaps, maybe, possible/possibly, suppose/supposed, I think, I
feel, sometimes, seem, relative/relatively, would, appear, probably, possibility, fairly,
usually, tend, hardly, more or less, should, suggest, indicate, potential/ly, assume, gener-
ally, about, believe, hypothesise, likely, speculate, estimate, doubt (used without a nega-
tive), presume
Boosters
Certain/ly, clear/ly, I know, definite/ly, fact, obvious/ly, sure/ly, like/ly, significant/ly,
enormous/ly, no/never, a lot, really, main/ly, very, extremely, at last, major, always,
demonstrate, substantially, will, all, many, apparent, evident, doubt (used in negative
sense, i.e. no doubt), doubtless, indeed, of course
Cohesion – anaphoric pronominals
this, that, these, those, it, he, she, its, her, him, his, me, their, them, they, there, here, the
former, the latter
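
As an illustration of how lists such as these could be applied to a script, the following is a minimal sketch in Python. It reproduces only a few items from each list and matches them as whole words or phrases, which is only a rough approximation of the text analysis reported in the study; the sample sentence is invented.

import re

HEDGES = ["might", "perhaps", "it is likely that", "seem", "probably", "suggest"]
BOOSTERS = ["certainly", "clearly", "definitely", "of course", "no doubt"]
WRITER_IDENTITY = ["i", "you", "we", "us", "our", "me", "my", "your"]

def count_devices(text, devices):
    """Count occurrences of each device in the text, using word boundaries."""
    lowered = text.lower()
    return {d: len(re.findall(r"\b" + re.escape(d) + r"\b", lowered)) for d in devices}

sample = "It is likely that the trend will continue. We might perhaps see a decline."
print(count_devices(sample, HEDGES))
print(count_devices(sample, WRITER_IDENTITY))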

References
Alderson, C. (1991). Bands and scores. In C. Alderson & B. North (Eds.), Language test-
ing in the 1990s: The communicative legacy. London: Modern English Publica-
tions/British Council/Macmillan.
Alderson, C. (2005). Diagnosing foreign language proficiency. The interface between
learning and assessment. London: Continuum.
Alderson, C., & Clapham, C. (1992). Applied linguistics and language testing: A case
study. Applied Linguistics, 13(2), 149-167.
Alderson, C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evalua-
tion. Cambridge: Cambridge University Press.
Allison, D. (1995). Assertions and alternatives: Helping ESL undergraduates extend their
choice in academic writing. Journal of Second Language Writing, 4, 1-16.
Anastasi, A. (1988). Psychological testing. New York: Macmillan.
Anderson, M. J. (2001). A new method for non-parametric multivariate analysis of vari-
ance. Austral Ecology, 26, 32-46.
Anderson, M. J. (2005). PERMANOVA: a FORTRAN computer program for permuta-
tional multivariate analysis of variance. Department of Statistics: University of
Auckland, New Zealand.
Andrich, D. (1978). A general form of Rasch's extended logistic model for partial credit
scoring. Applied Measurement in Education, 4, 363-378.
Andrich, D. (1998). Threshold, steps, and rating scale conceptualization. Rasch Meas-
urement: Transactions of the Rasch Measurement SIG, 12, 648.
Arnaud, P. J. L. (1992). Objective lexical and grammatical characteristics of L2 written
compositions and the validity of separate-component tests. In P. J. L. Arnaud &
H. Bejoint (Eds.), Vocabulary and applied linguistics. London: McMillan.
Bacha, N. (2001). Writing evaluation: What can analytic versus holistic essay scoring tell
us? System, 29, 371-383.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford
University Press.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment
Quarterly, 2(1), 1-34.
Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and
rater judgements in a performance test of foreign language speaking. Language
Testing, 12, 238-252.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford
University Press.
Bachman, L. F., & Palmer, A. S. (forthcoming). Language assessment in practice. Ox-
ford: Oxford University Press.
Bachman, L. F., & Savignon, S. J. (1986). The evaluation of communicative language
proficiency: A critique of the ACTFL Oral Interview. Modern Language Journal,
70, 380-390.

Bamberg, B. (1983). What makes a text coherent? College Composition and Communica-
tion, 34(4), 417-429.
Banerjee, J., & Franceschina, F. (2006). Documenting features of written language pro-
duction typical at different IELTS band score levels. Paper presented at the
Workshop sponsored by the European Science Foundation entitled 'Bridging the
gap between research on second language acquisition and research on language
testing', Amsterdam, February 2006.
Bardovi-Harlig, K., & Bofman, T. (1989). Attainment of syntactic and morphological
accuracy by advanced language learners. Studies in Second Language Acquisi-
tion, 11, 17-34.
Barkaoui, K. (2007a). Effects of thinking aloud on ESL essay rater performance: A Fac-
ets analysis. Paper presented at the Language Testing Research Colloquium, Bar-
celona, June 2007.
Barkaoui, K. (2007b). Raters' perceptions of the effects of thinking aloud on their ESL
essay rating performance: A qualitative study. Paper presented at the Annual
conference of the American Association of Applied Linguistics: Costa Mesa, CA,
April 2007.
Barlow, M. (2002). MonoConc Pro 2.2. Houston: Athelstan.
Barnwell, D. (1989). 'Naive' native speakers and judgements of oral proficiency in Span-
ish. Language Testing, 6, 152-163.
Barritt, L., Stock, P., & Clarke, F. (1986). Researching practice: Evaluating assessment
essays. College Composition and Communication, 37, 315-327.
Bereiter, C. (1980). Development in writing. In L. W. Gregg & E. R. Steinberg (Eds.),
Cognitive Processes in Writing. Hillsdale, NJ: Lawrence Erlbaum.
Bloor, M., & Bloor, T. (1991). Cultural expectations and socio-pragmatic failure in aca-
demic writing. In P. Adams, B. Heaton & P. Howarth (Eds.), Academic writing in
a second language: Essays on research and pedagogy. Norwood, NJ: Ablex.
Brindley, G. (1991). Defining language ability: The criteria for criteria. In S. Anivan
(Ed.), Current developments in language testing. Singapore: SEAMEO Regional
Language Centre.
Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F.
Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition
and language testing research. Cambridge: Cambridge University Press.
Brown, A. (1995). The effect of rater variables in the development of an occupation-
specific language performance test. Language Testing, 12, 1-15.
Brown, A. (2000). An investigation of the rating process in the IELTS oral interview.
IELTS Research Reports, 3.
Brown, A. (2003). Legibility and the rating of second language writing: An investigation
of the rating of handwritten and word-processed IELTS task two essays. IELTS
Research Reports, 4, 131-151.
Brown, J. D., Hildgers, T., & Marsella, J. (1991). Essay prompts and topics. Minimizing
the effect of mean differences. Written Communication, 8, 533-556.

Burneikaité, N., & Zabiliúté, J. (2003). Information structuring in learner texts: A possi-
ble relationship between the topical structure and the holistic evaluation of
learner essays. Studies about Language, 4, 1-11.
Burstein, J., Kukich, K., Wolff, S., Chi, L., Chodorow, M., Braden-Harder, L., et al.
(1998). Automated scoring using a hybrid feature identification technique. Paper
presented at the Annual Meeting of the Association of Computational Linguis-
tics, Montreal, Canada.
Canale, M. (1983). From communicative competence to communicative language peda-
gogy. In J. C. Richards & R. Schmidt (Eds.), Language and communication (pp.
2-27). London, UK: Longman.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to sec-
ond language teaching and testing. Applied Linguistics, 1, 1-47.
Carlson, S., Bridgeman, B., Camp, R., & Waanders, J. (1985). Relationship of admission
test scores to writing performance of native and nonnative speakers of English
(TOEFL Research Report No.19). Princeton, NJ: Educational Testing Service.
Carrell, P. (1988). Interactive text processing: Implications for ESL/second language
reading classrooms. In P. Carrell, J. Devine & D. Eskey (Eds.), Interactive ap-
proaches to second language reading. New York, Cambridge: Cambridge Uni-
versity Press.
Carroll, J. B. (1968). The psychology of language testing. In A. Davies (Ed.), Language
testing symposium: A psycholinguistic approach. Oxford: Oxford University
Press.
Carson, J. G. (2001). Second language writing and second language acquisition. In T.
Silva & P. K. Matsuda (Eds.), On second language writing. Mahwah, NJ: Law-
rence Erlbaum Associates.
Cascio, W. F. (1982). Applied psychology in personal management. Reston, VA: Reston
Publishing Company.
Cason, G. J., & Cason, C. L. (1984). A deterministic theory of clinical performance rat-
ing. Evaluation and the Health Professions, 7, 221-247.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral
Research, 1, 245-276.
Chalhoub-Deville, M. (1995). Theoretical models, assessment frameworks and test con-
struction. Language Testing, 14(1), 3-22.
Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA research. In L. F.
Bachman & A. Cohen (Eds.), Interfaces between second language acquisition
and language testing research. Cambridge: Cambridge University Press.
Chapelle, C. A. (1999). Validity in language assessment. Annual Review of Applied Lin-
guistics, 19, 254-272.
Charney, D. (1984). The validity of using holistic scoring to evaluate writing: A critical
overview. Research in the Teaching of English, 18, 65-81.
Cheng, X., & Steffensen, M. S. (1996). Metadiscourse: A technique for improving stu-
dent writing. Research in the Teaching of English, 30(2), 149-181.

Chenoweth, N. A., & Hayes, J. R. (2001). Fluency in writing. Generating text in L1 and
L2. Written Communication, 18(1), 80-98.
Cheville, J. (2004). Automated scoring technologies and the rising influence of error.
English Journal, 93(4), 47-52.
Chiang, S. (1999). Assessing grammatical and textual features in L2 writing samples: The
case of French as a Foreign Language. The Modern Language Journal, 83, 219-
232.
Chiang, S. (2003). The importance of cohesive conditions to perceptions of writing qual-
ity at the early stages of foreign language learning. System, 31, 471-484.
Chodorow, M., & Burstein, J. (2004). Beyond essay length: Evaluating e-rater's per-
formance on TOEFL essays. TOEFL Research Report No. RR-73, ETS RR-04-
04. Princeton, NJ: Educational Testing Service.
Clarkson, R., & Jensen, M.-T. (1995). Assessing achievement in English for professional
employment programs. In G. Brindley (Ed.), Language Assessment in Action.
Sydney: National Centre for English Language Teaching and Research.
Cobb, T. (2002). Web Vocabprofile. Retrieved 12 December 2005, from
http://www.lextutor.ca/vp/
Cohen, A. D. (1994). Assessing language ability in the classroom (2nd ed.). Boston, Mas-
sachusetts: Heinle and Heinle Publishers.
Congdon, P. J., & McQueen, J. (2000). The stability of rater severity in large-scale as-
sessment programs. Journal of Educational Measurement, 37(2), 163-178.
Connor-Linton, J. (1995). Crosscultural comparison of writing standards: American ESL
and Japanese EFL. World Englishes, 14, 99-115.
Connor, U. (1990). Linguistic/rhetorical measures for international persuasive student
writing. Research in the Teaching of English, 24(1), 67-87.
Connor, U., & Farmer, F. (1990). The teaching of topical structure analysis as a revision
strategy for ESL writers. In B. Kroll (Ed.), Second language writing: Research
insights for the classroom. Cambridge: Cambridge University Press.
Connor, U., & Mbaye, A. (2002). Discourse approaches to writing assessment. Annual
Review of Applied Linguistics, 22, 263-278.
Cooper, T. C. (1976). Measuring written syntactic patterns of second language learners of
German. The Journal of Educational Research, 69, 176-183.
Corbel, C. (1995). Exrater: a knowledge-based system for language assessors. In G.
Brindley (Ed.), Language assessment in action. Sydney: National Centre for Eng-
lish Language Teaching and Research.
Council of Europe. (2001). Common European Framework of Reference for Languages:
Learning, teaching, assessment. Cambridge: Cambridge University Press.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213-238.
Crismore, A., Markkanen, R., & Steffensen, M. S. (1993). Metadiscourse in persuasive
writing: A study of texts written by American and Finnish university students.
Written Communication, 10, 39-71.

Crookes, G. (1989). Planning and interlanguage variation. Studies in Second Language
Acquisition, 11, 183-199.
Crowhurst, M. (1987). Cohesion in argument and narration at three grade levels. Re-
search in the Teaching of English, 21(2), 185-201.
Cumming, A. (1990). Expertise in evaluating second language compositions. Language
Testing, 7, 31-51.
Cumming, A. (1997). The testing of writing in a second language. In C. Clapham & D.
Corson (Eds.), Encyclopedia of Language and Education (Vol. 7: Language Test-
ing and Assessment). Dordrecht: Kluwer Academic Publishers.
Cumming, A. (1998). Theoretical perspectives on writing. Annual Review of Applied Lin-
guistics, 18, 61-78.
Cumming, A., Kantor, R., Baba, K., Erdosy, U., Eouanzoui, K., & James, M. (2005). Dif-
ferences in written discourse in independent and integrated prototype tasks for
next generation TOEFL. Assessing Writing, 10(1), 1-75.
Cumming, A., Kantor, R., & Powers, D. E. (2001). Scoring TOEFL essays and TOEFL
2000 prototype writing tasks: An investigation into raters' decision making and
development of a preliminary analytic framework. TOEFL Monograph Series 22.
Princeton, New Jersey: Educational Testing Service.
Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating
ESL/EFL writing tasks: A descriptive framework. The Modern Language Jour-
nal, 86, 67-96.
Cumming, A., & Mellow, D. (1996). An investigation into the validity of written indica-
tors of second language proficiency. In A. Cumming & R. Berwick (Eds.), Vali-
dation in Language Testing. Clevedon (England), Philadelphia: Multilingual
Matters.
Cumming, A., & Riazi, A. (2000). Building models of adult second-language writing in-
struction. Learning and Instruction, 10, 55-71.
Davidson, F. (1993). Statistical support for Training in ESL Composition Rating. In L.
Hamp-Lyons (Ed.), Assessing Second Language Writing in Academic Contexts.
Norwood, NJ: Ablex Publishing Company.
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Diction-
ary of Language Testing. Cambridge: Cambridge University Press.
Davies, A., & Elder, C. (2005). Validity and validation in language testing. In E. Hinkel
(Ed.), Handbook of research in second language teaching and learning. Mah-
wah, NJ: Lawrence Erlbaum.
Edgeworth, F. Y. (1888). The statistics of examinations. Journal of the Royal Statistical
Society, 51, 599-635.
Elder, C. (1993). How do subject specialists construe classroom language proficiency?
Language Testing, 10(3), 235-254.
Elder, C. (1995). The effect of language background on 'foreign' language test perform-
ance: Problems of classification and measurement. Language Testing Update, 17,
34-36.

Elder, C. (2003). The DELNA initiative at the University of Auckland. TESOLANZ
Newsletter, 12(1), 15-16.
Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater re-
sponses to an online rater training program. Language Testing, 24(1), 37-64.
Elder, C., & Erlam, R. (2001). Development and validation of the diagnostic English lan-
guage needs assessment (DELNA): Final Report. Auckland: University of Auck-
land, Department of Applied Language Studies and Linguistics.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to
enhance rater training: Does it work? Language Assessment Quarterly, 2(3), 175-
196.
Elder, C., & von Randow, J. (2002). Report on the 2002 Pilot of DELNA at the University
of Auckland. Auckland: University of Auckland, Department of Applied Lan-
guage Studies and Linguistics.
Ellis, R. (1987). Interlanguage variability in narrative discourse: Style shifting in the use
of the past tense. Studies in Second Language Acquisition, 18, 1-20.
Ellis, R., & Barkhuizen, G. (2005). Analyzing learner language. Oxford: Oxford Univer-
sity Press.
Ellis, R., & Yuan, F. (2004). The effects of planning on fluency, complexity, and accu-
racy in second language narrative writing. Studies in Second Language Acquisi-
tion, 26, 59-84.
Engber, C. (1995). The relationship of lexical proficiency to the quality of ESL composi-
tions. Journal of Second Language Writing, 4, 139-155.
Erdosy, U. (2000). Exploring the establishment of scoring criteria for writing ability in a
second language: The influence of background factors on variability in the deci-
sion-making processes of four experienced raters of ESL composition. Unpub-
lished MA thesis, University of Toronto.
Farhady, H. (1983). On the plausibility of the unitary language proficiency factor. In J.
W. Oller (Ed.), Issues in language testing research. Rowley, Mass. : Newbury
House.
Field, A. (2000). Discovering statistics using SPSS for Windows. London: SAGE Publica-
tions.
Field, Y., & Yip, L. (1992). A comparison of internal cohesive conjunction in the English
essay writing of Cantonese speakers and native speakers of English. RELC Jour-
nal, 23(1), 15-28.
Fisher, W. P. (1992). Reliability statistics. Rasch Measurement: Transactions of the
Rasch Measurement SIG, 6(3), 238.
Fitzgerald, J., & Spiegel, D. (1986). Textual cohesion and coherence in children's writing.
Research in the Teaching of English, 20, 263-280.
Flahive, D. E., & Gerlach Snow, B. (1980). Measures of syntactic complexity in evaluat-
ing ESL compositions. In J. W. Oller & K. Perkins (Eds.), Research in Language
Testing. Rowley, Mass.: Newbury House.

Foltz, P. W., Laham, D., & Landauer, T. K. (2003). Automated essay scoring: Applica-
tions to educational technology. from Http://www.psych.nmsu.edu/~pfoltz/re-
prints/Edmedia99.html
Foster, P., & Skehan, P. (1996). The influence of planning and task type on second lan-
guage performance. Studies in Second Language Acquisition, 18, 299-323.
Frase, L., Faletti, J., Ginther, L., & Grant, L. (1999). Computer analysis of the TOEFL
Test of Written English. Research report No. 64. Princeton, NJ: Educational Test-
ing Service.
Freedman, S. W., & Calfee, R. C. (1983). Holistic assessment of writing: Experimental
design and cognitive theory. In J. Mosenthal, L. Tamor & S. Walmsley (Eds.),
Research in writing: Principles and methods. New York: Longman.
Friedlander, A. (1990). Composing in English: Effects of a first language on writing in
English as a Second Language. In B. Kroll (Ed.), Second langauge writing: Re-
search insights for the classroom (pp. 109-125). Cambridge: Cambridge Univer-
sity Press.
Fulcher, G. (1987). Tests of oral performance: The need for data-based criteria. ELT
Journal, 41(4), 287-291.
Fulcher, G. (1995). Variable competence in second language acquisition: A problem for
research methodology? System, 23(1), 25-33.
Fulcher, G. (1996a). Does thick description lead to smart tests? A data-based approach to
rating scale construction. Language Testing, 13(2), 208-238.
Fulcher, G. (1996b). Invalidating validity claims for the ACTFL Oral rating scale. Sys-
tem, 24(2), 163-172.
Fulcher, G. (1996c). Testing tasks: Issues in task design and the group oral. Language
Testing, 13, 23-51.
Fulcher, G. (2003). Testing second language speaking. London: Pearson Longman.
Grabe, W., & Kaplan, R. B. (1996). Theory and practice of writing. New York: Long-
man.
Granger, S., & Tyson, S. (1996). Connector usage in the English essay writing of native
and non native EFL speakers of English. World Englishes, 15(1), 17-27.
Grant, L., & Ginther, L. (2000). Using computer-tagged linguistic features to describe L2
writing differences. Journal of Second Language Writing, 9(2), 123-145.
Grierson, J. (1995). Classroom-based assessment in intensive English centres. In G.
Brindley (Ed.), Language assessment in action. Sydney: National Centre for Eng-
lish Language Teaching and Research.
Halliday, M. A. K. (1985). An introduction to functional grammar. London: Arnold.
Halliday, M. A. K. (1994). An introduction to functional grammar. London: Edward Ar-
nold.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.
Hamp-Lyons, L. (1990). Second language writing: Assessment issues. In B. Kroll (Ed.),
Second language writing: Research insights for the classroom. New York: Cam-
bridge University Press.

Hamp-Lyons, L. (2001). Fourth generation writing assessment. In T. Silva & P. K. Ma-
tsuda (Eds.), On second language writing. Mahwah, NJ: Lawrence Erlbaum As-
sociates.
Hamp-Lyons, L. (2003). Writing teachers as assessors of writing. In B. Kroll (Ed.), Ex-
ploring the dynamics of second language writing. Cambridge: Cambridge Uni-
versity Press.
Harley, B., & King, M. L. (1989). Verb lexis in the written composition of young L2
learners. Studies in Second Language Acquisition, 11, 415-439.
Hasan, R. (1984). Coherence and cohesive harmony. In J. Flood (Ed.), Understanding
reading comprehension. Newark, DE: International Reading Association.
Hawkey, R. (2001). Towards a common scale to describe L2 writing performance. Cam-
bridge Research Notes, 5, 9-13.
Hawkey, R., & Barker, F. (2004). Developing a common scale for the assessment of writ-
ing. Assessing Writing, 9(2), 122-159.
Heatly, A., & Nation, P. (1994). Range [Computer program, available at
http://vuw.ac.nz/lals/]: Victoria University of Wellington, NZ.
Henry, K. (1996). Early L2 writing development: A study of autobiographical essays by
university-level students of Russian. Modern Language Journal, 80, 309-326.
Hill, K. (1997). Who should be the judge? The use of non-native speakers as raters on a
test of English as an international language. Paper presented at the Current de-
velopments and alternatives in language assessment: Proceedings of LTRC 96,
Jyvaskyla, Finland: University of Jyvaskyla and University of Tampere.
Hinkel, E. (2003). Simplicity without elegance: Features of sentences in L1 and L2 aca-
demic texts. TESOL Quarterly, 37(2), 275-300.
Hinofotis, F. B., Bailey, K., & Stern, S. L. (1981). Section II. Empirical research. In A. S.
Palmer, P. Groot & G. Trosper (Eds.), The construct validation of tests of com-
municative competence (pp. 106-126). Washington, DC: TESOL.
Hirano, K. (1991). The effect of audience on the efficacy of objective measures of EFL
proficiency in Japanese university students. Annual Review of English Language
Education in Japan, 2, 21-30.
Hoenisch, S. (1996). The theory and method of topical structure analysis. Retrieved 30
April 2007, from http://www.criticism.com/da/tsa-method.php
Hoey, M. (1991). Patterns of lexis in text. Oxford: Oxford University Press.
Homburg, T. J. (1984). Holistic evaluation of ESL composition: Can it be validated ob-
jectively? TESOL Quarterly, 18, 87-107.
Hu, Z., Brown, D., & Brown, L. (1982). Some linguistic differences in the written Eng-
lish of Chinese and Australian students. Language Learning and Communication,
1(1), 39-49.
Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge: Cambridge Uni-
versity Press.
Hunt, K. W. (1965). Grammatical structures written at three grade levels. Unpublished
manuscript, Champaign, IL: NCTE.

Huot, B. (1990). Reliability, validity, and holistic scoring: What we know, what we need
to know. College Composition and Communication, 41(2), 201-213.
Hyland, K. (1996a). Talking to the academy: Forms of hedging in science research arti-
cles. Written Communication, 13(1), 251-281.
Hyland, K. (1996b). Writing without conviction? Hedging in science research articles.
Applied Linguistics, 17(4), 433-454.
Hyland, K. (1998). Hedging in scientific research articles. Amsterdam: John Benjamins.
Hyland, K. (2000a). Hedges, boosters and lexical invisibility: Noticing modifiers in aca-
demic texts. Language Awareness, 9(4), 179-301.
Hyland, K. (2000b). 'It might be suggested that...': Academic hedging and student writing.
Australian Review of Applied Linguistics, 16, 83-97.
Hyland, K. (2002a). Authority and invisibility: Authorial identity in academic writing.
Journal of Pragmatics, 34, 1091-1112.
Hyland, K. (2002b). Options of identity in academic writing. ELT Journal, 56(4), 351-
358.
Hyland, K. (2003). Second language writing. Cambridge: Cambridge University Press.
Hyland, K., & Milton, J. (1997). Hedging in L1 and L2 student writing. Journal of Sec-
ond Language Writing, 6(2), 183-296.
Hymes, D. H. (1967). Models of the interaction of language and social setting. Journal of
Social Issues, 23(2), 8-38.
Hymes, D. H. (1972). On communicative competence. In J. Holmes (Ed.), Sociolinguis-
tics: Selected readings. Harmondsworth, Middlesex: Penguin.
Ingram, D. E. (1995). Scales. Melbourne Papers in Language Testing, 4(2), 12-29.
Intaraprawat, P., & Steffensen, M. S. (1995). The use of metadiscourse in good and poor
ESL essays. Journal of Second Language Writing, 4(3), 253-272.
Ishikawa, S. (1995). Objective measurement of low-proficiency EFL narrative writing.
Journal of Second Language Writing, 4, 51-70.
Ivanic, R. (1998). Writing and identity: The discoursal construction of identity in aca-
demic writing. Amsterdam: Benjamins.
Ivanic, R., & Weldon, S. (1999). Researching the writer-reader relationship. In C. N.
Candlin & K. Hyland (Eds.), Writing: Texts, processes and practices. London:
Longman.
Iwashita, N., McNamara, T., & Elder, C. (2001). Can we predict task difficulty in an Oral
Proficiency Test? Exploring the Potential of an Information-Processing Approach
to Task Design. Language Learning, 51(3), 401-436.
Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hughey, J. (1981). Testing ESL
composition: A practical approach. Rowley, MA: Newbury House.
Jafarpur, A. (1991). Cohesiveness as a basis for evaluating compositions. System, 19(4),
459-465.
Jamieson, J. (2005). Trends in computer-based second language assessment. Annual Re-
view of Applied Linguistics, 25, 228-242.

Jarvis, S. (2002). Short texts, best-fitting curves and new measures of lexical diversity.
Language Testing, 19(1), 57-84.
Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educa-
tional and Psychological Measurement, 20, 141-151.
Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin,
112(3), 527-535.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educa-
tional Measurement: Issues and Practice, 12, 5-17.
Kawata, K. (1992). Evaluation of free English composition. CASELE Research Bulletin,
22, 305-313.
Kellogg, R. T. (1996). A model of working memory in writing. In C. M. Levy & S.
Ransdell (Eds.), The science of writing. Theories, methods, individual differ-
ences, and applications. Mahwah, NJ: Lawrence Erlbaum.
Kennedy, C., & Thorp, D. (2002). A corpus-based investigation of linguistic responses to
an IELTS academic writing task: University of Birmingham.
Kenyon, D. (1992). An investigation of the validity of the demands of tasks on perform-
ance-based tasks of oral proficiency: Paper presented at the Language Testing
Research Colloquium, Vancouver, Canada.
Kepner, C. (1991). An experiment in the relationship of types of written feedback to the
development of second-language writing skills. Modern Language Journal, 75,
305-313.
Kintsch, W., & Keenan, J. (1973). Reading rate and retention as a function of the number
of propositions in the base structure of sentences. Cognitive Psychology, 5, 257-
274.
Kintsch, W., & van Dijk, T. A. (1978). Toward a model of text comprehension and pro-
duction. Psychological Review, 85, 363-394.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How
does it compare with face-to-face training? Assessing Writing, 12, 26-43.
Kobayashi, H., & Rinnert, C. (1996). Factors affecting composition evaluation in an EFL
context: Cultural rhetorical pattern and readers' background. Language Learning,
46(3), 397-437.
Koponen, M., & Riggenbach, H. (2000). Overview: Varying perspectives on fluency. In
H. Riggenbach (Ed.), Perspectives on fluency. Ann Arbor: The University of
Michigan Press.
Kroll, B. (1998). Assessing writing abilities. Annual Review of Applied Linguistics, 18,
219-240.
Lado, R. (1961). Language testing. New York: McGraw-Hill.
Landy, F. J., & Farr, J. L. (1983). The measurement of work performance: Methods, the-
ory, and application. San Diego, CA: Academic Press.
Lantolf, J. P., & Frawley, W. (1985). Oral proficiency testing: A critical analysis. Modern
Language Journal, 69(4), 337-345.

Larkey, L. S. (1998). Automatic essay grading using text categorisation techniques. Paper
presented at the Twenty first annual international ACM SIGIR conference on re-
search and development in information retrieval, Melbourne, Australia.
Larsen-Freeman, D. (1978). Implications of the morpheme studies for second language
acquisition. Review of Applied Linguistics, 39-40, 93-102.
Larsen-Freeman, D. (1983). Assessing global second language proficiency. In H. W.
Seliger & M. H. Long (Eds.), Classroom-oriented research in second language
acquisition. Rowley, MA: Newbury House.
Larsen-Freeman, D., & Strom, V. (1977). The construction of a second language acquisi-
tion index of development. Language Learning, 27, 123-134.
Laufer, B. (1994). The lexical profile of second language writing: Does it change over
time? RELC Journal, 25, 21-33.
Lautamatti, L. (1987). Observations on the development of the topic of simplified dis-
course. In U. Connor & R. B. Kaplan (Eds.), Writing across languages: Analysis
of L2 text. Reading, MA: Addison-Wesley.
Lautamatti, L. (1990). Coherence in Spoken and Written Discourse. In U. Connor & A.
M. Johns (Eds.), Coherence in Writing. Research and Pedagogical Perspectives.
Alexandria, Virginia: Teachers of English to Speakers of Other Languages.
Lee, I. (2002a). Helping students develop coherence in writing. English Teaching Forum,
July 2002, 32-39.
Lee, I. (2002b). Teaching coherence to ESL students: a classroom inquiry. Journal of
Second Language Writing, 11, 135-159.
Lennon, P. (1991). Error: Some problems of definition, identification and distinction. Ap-
plied Linguistics, 12, 180-196.
Linacre, J. M. (1988). FACETS: A computer program for the analysis of multi-faceted
data. Chicago: MESA Press.
Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago, IL: MESA Press.
Linacre, J. M. (1994). A user's guide to FACETS: Rasch measurement computer pro-
gram. Chicago: MESA Press.
Linacre, J. M. (1999). Investigating rating scale category utility. Journal of Outcome
Measurement, 3(2), 103-122.
Linacre, J. M. (2006). Facets Rasch measurement computer program. Chicago: Win-
steps.com.
Linacre, J. M., & Wright, B. D. (1993). A user's guide to FACETS (Version 2.6). Chi-
cago, IL: MESA Press.
Liu, D. (2000). Writing cohesion. Using content lexical ties in ESOL. Forum, 38(1), 28-
36.
Lloyd-Jones, R. (1977). Primary trait scoring. In C. R. Cooper & L. Odell (Eds.), Evalu-
ating writing. New York: National Council of Teachers of English.
Loewen, S., & Ellis, R. (2004). The relationship between English vocabulary knowledge
and the academic success of second language university students. New Zealand
Studies in Applied Linguistics, 10(1), 1-29.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really
mean to the raters? Language Testing, 19(3), 246-276.
Lumley, T. (2005). Assessing second language writing. The rater's perspective. Frank-
furt: Peter Lang.
Lumley, T., & McNamara, T. (1995). Rater characteristics and rater bias: Implications for
training. Language Testing, 12(1), 54-71.
Lunt, H., Morton, J., & Wigglesworth, G. (1994). Rater behaviour in performance test-
ing: Evaluating the effect of bias feedback. University of Melbourne: July.
Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
Mackey, A., & Gass, S. (2005). Second language research: Methodology and design.
Mahwah, NJ: Lawrence Erlbaum.
Madsen, H. S. (1983). Techniques in testing. Oxford: Oxford University Press.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
McArdle, B. H., & Anderson, M. J. (2001). Fitting multivariate models to community
data: a comment on distance-based redundancy analysis. Ecology, 82, 290-297.
McIntyre, P. N. (1993). The importance and effectiveness of moderation training on the
reliability of teachers' assessments of ESL writing samples. Unpublished MA
thesis, University of Melbourne.
McKay, P. (1995). Developing ESL proficiency descriptions for the school context: The
NLLIA ESL bandscales. In G. Brindley (Ed.), Language assessment in action.
Sydney: National Centre for English Language Teaching and Research.
McNamara, T. (1996). Measuring second language performance. Harlow, Essex: Pearson
Education.
McNamara, T. (2002). Discourse and assessment. Annual Review of Applied Linguistics,
22, 221-242.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Oxford: Basil Blackwell.
Mehnert, U. (1998). The effects of different length of time for planning on second lan-
guage performance. Studies in Second Language Acquisition, 20, 83-106.
Meisel, J., Clahsen, H., & Pienemann, M. (1981). On determining developmental stages
in natural second language acquisition. Studies in Second Language Acquisition,
3, 109-135.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.).
New York: Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of per-
formance assessments. Educational Researcher, 23(2), 13-23.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13,
241-256.
Mickan, P. (2003). 'What's your score?' An investigation into language descriptors for
rating written performance. Canberra: IELTS Australia.
Milanovic, M., Saville, N., Pollitt, A., & Cook, A. (1995). Developing rating scales for
CASE: Theoretical concerns and analyses. In A. Cumming & R. Berwick (Eds.),
Validation in language testing. Clevedon: Multilingual Matters.
Milanovic, M., Saville, N., & Shen, S. (1996). A study of the decision-making behaviour
of composition markers. In M. Milanovic & N. Saville (Eds.), Studies in Lan-
guage Testing 3: Performance, cognition and assessment. Cambridge: Cam-
bridge University Press.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our
capacity for processing information. Psychological Review, 63(2), 81-97.
Mislevy, R. J. (1993). Foundations of a new test theory. In N. Frederiksen, R. J. Mislevy
& I. I. Bejar (Eds.), Test theory for a new generation of tests. Hillsdale, N.J: Law-
rence Erlbaum Associates.
Mislevy, R. J. (1995). Test theory and language learning assessment. Language Testing,
12(3), 341-369.
Mislevy, R. J. (1996). Test theory reconceived. Journal of Educational Measurement, 33,
379-416.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of assessment
arguments. Measurement: Interdisciplinary Research and Perspectives, 1, 3-62.
Monroe, J. H. (1975). Measuring and enhancing syntactic fluency in French. The French
Review, 48, 1023-1031.
Moussavi, S. A. (2002). An encyclopedic dictionary of language testing (3rd ed.). Taiwan: Tung Hua Book Company.
Mugharbil, H. (1999). Second language learners' punctuation: Acquisition and aware-
ness. Unpublished PhD dissertation, University of Southern California.
Myford, C. M. (2002). Investigating design features of descriptive graphic rating scales.
Applied Measurement in Education, 15(2), 187-215.
Myford, C. M., & Wolfe, E. W. (2000). Monitoring sources of variability within the Test
of Spoken English assessment system. Princeton, NJ: Educational Testing Ser-
vice.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386-422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189-227.
Neuner, J. L. (1987). Cohesive ties and chains in good and poor freshman essays. Re-
search in the Teaching of English, 21(1), 92-105.
North, B. (1995). The development of a common framework scale of descriptors of lan-
guage proficiency based on a theory of measurement. System, 23(4), 445-465.
North, B. (2003). Scales for rating language performance: Descriptive models, formula-
tion styles, and presentation formats. TOEFL Monograph 24. Princeton: Educa-
tional Testing Service.
North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales.
Language Testing, 15(2), 217-263.
Nunan, D. (1992). Research methods in language learning. Cambridge: Cambridge Uni-
versity Press.
O'Loughlin, K. (1993). The assessment of writing by English and ESL teachers. Unpublished manuscript, Cambridge, August 1993.
O'Loughlin, K., & Wigglesworth, G. (2003). Task design in IELTS academic writing Task 1: The effect of quantity and manner of presentation of information on candidate writing. IELTS Research Reports, 4, 89-129.
Oller, J. W. (1983). Evidence for a general language proficiency factor: An expectancy
grammar. In J. W. Oller (Ed.), Issues in language testing research. Rowley, Mas-
sachusetts: Newbury House.
Oller, J. W., & Hinofotis, F. B. (1980). Two mutually exclusive hypotheses about second
language ability: Indivisible or partially divisible competence. In J. W. Oller &
K. Perkins (Eds.), Research in Language Testing. Rowley, Massachusetts: New-
bury House.
Ortega, L. (1999). Planning and focus on form in L2 oral performance. Studies in Second
Language Acquisition, 21, 109-148.
Page, E. B. (1994). Computer grading of student prose, using modern concepts and soft-
ware. Journal of Experimental Education, 62, 127-142.
Pennington, M., & So, S. (1993). Comparing writing process and product across two lan-
guages: A study of six Singaporean university student writers. Journal of Second
Language Writing, 2, 41-63.
Perkins, K. (1980). Using objective methods of attained writing proficiency to discrimi-
nate among holistic evaluations. TESOL Quarterly, 14(1), 61-69.
Perkins, K., & Gass, S. (1996). An investigation of patterns of discontinuous learning:
Implications for ESL measurement. Language Testing, 13(1), 63-82.
Perkins, K., & Leahy, R. (1980). Using objective measures of composition to compare
native and non-native compositions. In R. Silverstein (Ed.), Occasional Papers in
Linguistics, No. 6. Carbondale: Southern Illinois University.
Perren, G. E. (1968). Testing spoken language: Some unresolved problems. In A. Davies (Ed.), Language Testing Symposium. Oxford: Oxford University Press.
Pienemann, M., Johnston, M., & Brindley, G. (1988). Constructing an acquisition-based
procedure for second language assessment. Studies in Second Language Acquisi-
tion, 10, 217-234.
Polio, C. G. (1997). Measures of linguistic accuracy in second language writing research.
Language Learning, 47(1), 101-143.
Polio, C. G. (2001). Research methodology in second language writing research: The case
of text-based studies. In T. Silva & P. K. Matsuda (Eds.), On Second Language
Writing. Mahwah, NJ: Lawrence Erlbaum Associates.
Pollitt, A., Hutchinson, C., Entwistle, N., & DeLuca, C. (1985). What makes exam questions difficult? An analysis of 'O' grade questions and answers (Research Report for Teachers No. 2). Edinburgh: Scottish Academic Press.
Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic
& N. Saville (Eds.), Performance testing, cognition and assessment: Selected pa-
pers from the 15th Language Testing Research Colloquium (LTRC), Cambridge
and Arnhem (Vol. 3). Cambridge: Cambridge University Press.
Porter, D. (1991). Affective factors in the assessment of oral interaction: Gender and
status. In S. Anivan (Ed.), Current developments in language testing. Singapore:
SEAMEO Regional Language Centre.
Powers, D. E., Burstein, J., Chodorow, M., Fowles, M. E., & Kukich, K. (2000). Compar-
ing the validity of automated and human essay scoring. GRE Board Research
Report No. 98-08a, ETS RR-00-10. Princeton, NJ: Educational Testing Service.
Quellmalz, E. S., Capell, F., & Chou, C. P. (1982). Effects of discourse and response
mode on the measurement of writing competence. Journal of Educational Meas-
urement, 19, 241-258.
Raffaldini, T. (1988). The use of situation tests as measures of communicative compe-
tence. Studies in Second Language Acquisition, 10(2), 197-216.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press.
Reid, J. (1992). A computer text analysis of four cohesion devices in English discourse
by native and nonnative writers. Journal of Second Language Writing, 1(2), 79-
107.
Reid, J. (1993). Teaching ESL writing. Boston, Massachusetts: Prentice Hall.
Reynolds, D. W. (1995). Repetition in non-native speaker writing. More than quantity.
Studies in Second Language Acquisition, 17, 185-209.
Reynolds, D. W. (1996). Repetition in second language writing. Unpublished PhD disser-
tation, Indiana University.
Roberts, R., & Kreuz, R. J. (1993). Nonstandard discourse and its coherence. Discourse
Processes, 16, 451-464.
Sakyi, A. A. (2000). Validation of holistic scoring for ESL writing assessment: How rat-
ers evaluate compositions. In A. J. Kunnan (Ed.), Fairness and validation in lan-
guage assessment: Selected papers from the 19th Language Testing Research
Colloquium, Orlando, Florida. Cambridge: Cambridge University Press.
Schmidt, R. (1992). Psychological mechanisms underlying second language fluency. Studies in Second Language Acquisition, 14, 357-385.
Schneider, M., & Connor, U. (1990). Analyzing topical structure in ESL essays. Studies
in Second Language Acquisition, 12(4), 411-427.
Scholz, G., Hendricks, D., Spurling, R., Johnson, M., & Vandenburg, L. (1980). Is lan-
guage ability divisible or unitary? A factor analysis of 22 English language profi-
ciency tests. In J. W. Oller & K. Perkins (Eds.), Research in language testing.
Rowley, Mass.: Newbury House.
Seliger, H. W., & Shohamy, E. (1989). Second language research methods. Oxford: Ox-
ford University Press.
Sharma, A. (1980). Syntactic maturity: Assessing writing proficiency in a second lan-
guage. In R. Silverstein (Ed.), Occasional Papers in Linguistics, No.6. Carbon-
dale: Southern Illinois University.
Shaw, P., & Liu, E. T.-K. (1998). What develops in the development of second-language
writing. Applied Linguistics, 19, 225-254.
Shaw, S. D. (2002). IELTS writing: Revising assessment criteria and scales (Phase 1).
Cambridge Research Notes, 9, 16-18.
Shaw, S. D. (2003). Legibility and the rating of second language writing: The effect on
examiners when assessing handwritten and word-processed scripts. Cambridge
Research Notes, 11, 7-15.
Shaw, S. D. (2004). Automated writing assessment: A review of four conceptual models.
Cambridge Research Notes, 17, 13-18.
Shermis, M. D., & Burstein, J. (Eds.). (2003). Automated essay scoring: A cross-
disciplinary perspective. Mahwah, N.J.: Lawrence Erlbaum.
Shohamy, E. (1998). How can language testing and SLA benefit from each other? The
case of discourse. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between
SLA and Language Testing Research. Cambridge: Cambridge University Press.
Shohamy, E., Gordon, C. M., & Kraemer, R. (1992). The effects of raters' background
and training on the reliability of direct writing tests. The Modern Language Jour-
nal, 76(1), 27-33.
Skehan, P. (1996). A framework for the implementation of task-based instruction. Ap-
plied Linguistics, 17, 38-62.
Skehan, P. (1998a). A cognitive approach to language learning. Oxford: Oxford Univer-
sity Press.
Skehan, P. (1998b). Task-based instruction. Annual Review of Applied Linguistics, 18,
268-286.
Skehan, P. (2001). Tasks and language performance assessment. In M. Bygate, P. Skehan
& M. Swain (Eds.), Researching pedagogic tasks: Second language learning,
teaching and testing. Harlow: Longman.
Skehan, P. (2003). Task-based instruction. Language Teaching, 36, 1-14.
Skehan, P., & Foster, P. (1997). Task type and processing conditions as influences on
foreign language performance. Language Teaching Research, 1(3), 185-211.
Sloan, C., & McGinnis, I. (1982). The effect of handwriting on teachers' grading of high
school essays. Journal of the Association for the Study of Perception, 17(2), 15-
21.
Smith, D. (2000). Rater judgments in the direct assessment of competency-based second
language writing ability. In G. Brindley (Ed.), Studies in immigrant English lan-
guage assessment. Sydney: National Centre for English Language Teaching and
Research.
Smith, S. (2003). Standards for academic writing: Are they common within and across
disciplines? Unpublished MA thesis, University of Auckland, New Zealand.
Song, B., & Caruso, I. (1996). Do English and ESL faculty differ in evaluating the essays
of native English-speaking and ESL students? Journal of Second Language Writ-
ing, 5(2), 163-182.
Spolsky, B. (1981). The gentle art of diagnostic testing. Paper presented at the Interuniversitaere Sprachtestgruppe Workshop on Diagnostic Testing, 15 December, Hasensprungmuehle.
Spolsky, B. (1992). The gentle art of diagnostic testing revisited. In E. Shohamy & R. E.
Walton (Eds.), Language assessment for feedback: Testing and other strategies.
Dubuque, Iowa: Kendall/Hunt Publishing Company.
Stahl, A., & Lunz, M. E. (1992). Judge performance reports. Paper presented at the an-
nual meeting of the American Educational Research Association: San Francisco,
CA.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement ap-
proaches to estimating interrater reliability. Practical Assessment, Research and
Evaluation, 9(4), 1-20.
Stratman, J., & Hamp-Lyons, L. (1994). Reactivity in concurrent think-aloud protocols:
Issues for research. In P. Smagorinsky (Ed.), Speaking about writing: Reflections
on research methodology. Thousand Oaks, CA: Sage.
Sunderland, J. (1995). Gender and language testing. Language Testing Update, 17, 24-35.
Sweedler-Brown, C. (1993). ESL essay evaluation: The influence of sentence-level and
rhetorical features. Journal of Second Language Writing, 2(1), 3-17.
Tapia, E. (1993). Cognitive demand as a factor in interlanguage syntax: A study in topics
and texts. Indiana University.
Tavakoli, P., & Skehan, P. (2005). Strategic planning, task structure, and performance
testing. In R. Ellis (Ed.), Planning and task performance in a second language.
Oxford: Oxford University Press.
Tedick, D. J. (1990). ESL writing assessment: Subject-matter knowledge and its impact on performance. English for Specific Purposes, 9, 123-143.
Tomita, Y. (1990). T-unit o mochiita kokosei no jiyu eisaku bun noryoku nosokutei (As-
sessing the writing ability of high school students with the use of t-units). Step
Bulletin, 2, 14-28.
Tribble, C. (1996). Writing. Oxford: Oxford University Press.
Tsang, W. K. (1996). Comparing the effects of reading and writing on writing perform-
ance. Applied Linguistics, 17, 210-233.
Turner, C. E. (2000). Listening to the voices of rating scale developers: Identifying sali-
ent features for second language performance assessment. The Canadian Modern
Language Review, 56(4), 555-584.
Turner, C. E., & Upshur, J. A. (2002). Rating scales derived from student samples: Ef-
fects of the scale maker and the student sample on scale content and student
scores. TESOL Quarterly, 36(1), 49-70.
Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language
tests. ELT Journal, 49(1), 3-12.
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language
speaking ability: Test method and learner discourse. Language Testing, 16(1),
82-111.
Vande Kopple, W. J. (1983). Something old, something new: Functional Sentence Per-
spective. Research in the Teaching of English, 17, 85-99.
Vande Kopple, W. J. (1985). Some exploratory discourse on metadiscourse. College
Composition and Communication, 36, 82-93.
Vande Kopple, W. J. (1986). Given and new information and some aspects of the struc-
tures, semantics, and pragmatics of written texts. In C. Cooper & S. Greenbaum
(Eds.), Studying writing: Linguistic approaches. London: Sage.
Vann, R. J. (1979). Oral and written syntactic relationships in second language learning.
In C. Yorio, K. Perkins & J. Schachter (Eds.), On TESOL '79: The learner in fo-
cus. Washington, D.C.: TESOL.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater's mind? In L. Hamp-
Lyons (Ed.), Assessing Second Language Writing in Academic Contexts. Nor-
wood, New Jersey: Ablex Publishing Corporation.
Vollmer, H. J., & Sang, F. (1983). Competing hypotheses about second language ability: A plea for caution. In J. W. Oller (Ed.), Issues in language testing research. Rowley, Mass.: Newbury House.
Watson Todd, R. (1998). Topic-based analysis of classroom discourse. System, 26, 303-
318.
Watson Todd, R., Thienpermpool, P., & Keyuravong, S. (2004). Measuring the coherence
of writing using topic-based analysis. Assessing Writing, 9, 85-104.
Weigle, S. C. (1994a). Effects of training on raters of English as a second language com-
positions: Quantitative and qualitative approaches. Unpublished PhD disserta-
tion, University of California, Los Angeles.
Weigle, S. C. (1994b). Effects of training on raters of ESL compositions. Language Test-
ing, 11(2), 197-223.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing,
15(2), 263-287.
Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.
Weigle, S. C., Lu, Y., & Baker, A. (2007). Validation of automated essay scoring for ESL
writers. Paper presented at the Language Testing Research Colloquium, Barce-
lona, June 2007.
Weir, C. J. (1990). Communicative language testing. New Jersey: Prentice Hall Regents.
White, E. M. (1985). Teaching and assessing writing. San Francisco: Jossey-Bass Inc.
White, E. M. (1995). An apologia for the timed impromptu essay test. College Composi-
tion and Communication, 46, 30-45.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consis-
tency in assessing oral interaction. Language Testing, 10(3), 305-323.
Wigglesworth, G. (1997). An investigation of planning time and proficiency level on oral
test discourse. Language Testing, 14, 85-106.
Wigglesworth, G. (2000). Influences on performance in task-based oral assessments. In
M. Bygate, P. Skehan & M. Swain (Eds.), Researching pedagogic tasks: Second
language learning, teaching and testing. Harlow: Longman.
Wild, C., & Seber, G. (2000). Chance encounters. A first course in data analysis and in-
ference. New York: John Wiley & Sons, Inc.
Wilkinson, A. (1983). Assessing language development: The Crediton project. In A. Freedman, I. Pringle & J. Yalden (Eds.), Learning to write: First language/second language. New York: Longman.
Wilkinson, L., Blank, G., & Gruber, C. (1996). Desktop data analysis with SYSTAT. Up-
per Saddle River, NJ: Prentice-Hall.
Witte, S. (1983a). Topical structure analysis and revision: An exploratory study. College
Composition and Communication, 34(3), 313-341.
Witte, S. (1983b). Topical structure and writing quality: Some possible text-based expla-
nations of readers' judgments of students' writing. Visible Language, 17, 177-205.
Witte, S., & Faigley, L. (1981). Cohesion, coherence and writing quality. College Com-
position and Communication, 32(2), 189-204.
Wolfe-Quintero, K., Inagaki, S., & Kim, H.-Y. (1998). Second language development in
writing: Measures of fluency, accuracy and complexity. Technical Report No. 17.
Honolulu, HI: University of Hawai'i Press.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wu, J. (1997). Topical structure analysis of English as a second language (ESL) texts
written by college South-east Asian refugee students. Unpublished PhD disserta-
tion, University of Minnesota.
Young, R. (1995). Discontinuous interlanguage development and its implications for oral
proficiency rating scales. Applied Language Learning, 6, 13-26.
Yuan, F., & Ellis, R. (2003). The effects of pre-task planning and on-line planning on
fluency, complexity and accuracy in L2 monologic oral production. Applied Lin-
guistics, 24, 1-27.
Series editors: Rüdiger Grotjahn and Günther Sigott
Vol. 1 Günther Sigott: Towards Identifying the C-Test Construct. 2004.
Vol. 2 Carsten Röver: Testing ESL Pragmatics. Development and Validation of a Web-Based Assessment Battery. 2005.
Vol. 3 Tom Lumley: Assessing Second Language Writing. The Rater’s Perspective. 2005.
Vol. 4 Annie Brown: Interviewer Variability in Oral Proficiency Interviews. 2005.
Vol. 5 Jianda Liu: Measuring Interlanguage Pragmatic Knowledge of EFL Learners. 2006.
Vol. 6 Rüdiger Grotjahn (Hrsg./ed.): Der C-Test: Theorie, Empirie, Anwendungen/The C-Test:
Theory, Empirical Research, Applications. 2006.
Vol. 7 Vivien Berry: Personality Differences and Oral Test Performance. 2007.
Vol. 8 John O’Dwyer: Formative Evaluation for Organisational Learning. A Case Study of the Management of a Process of Curriculum Development. 2008.
Vol. 9 Aek Phakiti: Strategic Competence and EFL Reading Test Performance. A Structural
Equation Modeling Approach. 2007.
Vol. 10 Gábor Szabó: Applying Item Response Theory in Language Test Item Bank Building.
2008.
Vol. 11 John M. Norris: Validity Evaluation in Language Assessment. 2008.
Vol. 12 Barry O’Sullivan: Modelling Performance in Tests of Spoken Language. 2008.
Vol. 13 Annie Brown / Kathryn Hill (eds.): Tasks and Criteria in Performance Assessment. Pro-
ceedings of the 28th Language Testing Research Colloquium. 2009.
Vol. 14 Ildikó Csépes: Measuring Oral Proficiency through Paired-Task Performance. 2009.
Vol. 15 Dina Tsagari: The Complexity of Test Washback. An Empirical Study. 2009.
Vol. 16 Spiros Papageorgiou: Setting Performance Standards in Europe. The Judges’ Contribution
to Relating Language Examinations to the Common European Framework of Reference.
2009.
Vol. 17 Ute Knoch: Diagnostic Writing Assessment. The Development and Validation of a Rating
Scale. 2009.
www.peterlang.de