Guarding the Guardians
To ensure a fair assessment of test candidates' language production, two procedures are typically recommended in the field of language testing: the use of a common rating scale and the training of raters. In performance testing, these two procedures have traditionally been viewed as means of increasing the reliability of test scores. In line with recent developments, this book argues that the rater variable affects not only reliability but also the very construct validity of test scores. The study compares the scoring of four groups of raters: laypeople (i.e. non-linguists), teachers without formal rater training, trained raters, and a group of very experienced raters. The results reveal differences between the groups both in the scores they assign and in the reasons they give for those scores, showing that "quantitative similarities in ratings may mask significant qualitative differences in the reasons for those ratings" (Connor-Linton 1995: 99).