Validity and validation

The Cambridge English test development cycle

The Cambridge English test development cycle (fully described in chapter 2 of Weir and Milanovic (2003), on the process of test development and revision within Cambridge English) starts with an appraisal of the intended context of use, the relationship of a new exam to existing examinations and frameworks, and consultation with stakeholders (Perceived Need). Once a need is defined, planning takes place to establish a clear picture of who the potential candidates are likely to be and who the users of test results will be (Planning Phase). Initial test specifications are then produced, linking needs to the requirements of test usefulness (validity, reliability, impact and practicality) and to frameworks of reference such as the CEFR. Initial decisions are made with regard to item types, text features, range of topics, etc. Sample materials are then written and stakeholder feedback is sought (Design Phase). Trialling then takes place, in which concrete views and significant evidence are collected to demonstrate test usefulness and alignment of level to the framework of reference, e.g. the CEFR (Development Phase). The test development model is cyclical by default, and in this phase it is still possible to make changes and go back to the design and planning phases.

Once the test specifications reach their final form, test materials are written, test papers are constructed and the test goes live, i.e. it is administered via our test centre network. Objective papers are scored mechanically, through clerical marking or a combination of both, and performance papers are rated by expert examiners (Operational Phase). Grading and post-examination review and analysis then take place. Results of live administrations are monitored across a number of years by obtaining regular feedback from stakeholders as well as carrying out instrumental research to investigate various aspects of candidate and examiner performance. Improvements are then carried out where needed (Monitoring Phase). Existing examinations are revised or updated where necessary to reflect developments in language learning, pedagogy and assessment, as well as the evolution of the framework of reference (Evaluation and Revision Phase). Revision or update means going back to the beginning of the cycle, that is, the perceived need.

The Manual’s stages: Specification, Standardisation and Empirical Validation

The Manual outlines the alignment process as a set of stages: Specification, Standardisation and Empirical Validation, with Familiarisation as an essential component of the first two procedures.

Specification involves ‘mapping the coverage of the examination in relation to the categories of the CEF’ (2003:6). This corresponds to the Planning and Design phases of the Cambridge English test development model. These phases involve subject officers, exam managers, validation officers, and our network of chairs, item writing teams, and examiners. Task design and scale construction for performance tests now include explicit CEFR reference. This is documented in research publications (see Galaczi and Ffrench 2007 on revised speaking assessment scales for Main Suite and BEC), in examiner instruction booklets and in item writer guidelines, and is fed back to examiners and item writers via training and coordination sessions. The objectives and the content of the examination are described in the publicly available handbooks for teachers.

A major series of studies on the constructs of Writing, Reading, Speaking and Listening is being documented through the CUP/UCLES Studies in Language Testing (SILT) series. The approach maps exam specifications onto the CEFR, showing how criterial differences between CEFR levels are exemplified in terms of both the cognitive and the context parameters elicited by the given tasks. A socio-cognitive approach to test validation is used here.

Standardisation involves ‘achieving and implementing a common understanding of the meaning of the CEF levels’ (2003:7). The Manual states that this involves: (a) Training professionals in a common interpretation of the CEF levels, using productive skills samples and calibrated receptive skills items which are already standardised to the CEF; (b) Benchmarking, where the agreement reached at Training is applied to the assessment of local performance samples; (c) Standard-setting, where the cut-off scores for the test CEF level(s) are set.

These correspond to processes at different stages of the Cambridge English test development cycle. For objectively marked skills, the stability of the measurement scale to which all exam levels relate is achieved by an item banking methodology. This happens in the Development Phase, where new items are pretested and calibrated, using anchor items to monitor exam difficulty. The calibrated items are then stored in the Cambridge English Local Item Banking System (LIBS), where each item has a known difficulty. Test papers can therefore be constructed to a target difficulty on the CEFR A2-C2 continuum and graded to a high degree of precision. This is better described as standard-maintaining than as standard-setting, given that the standard is a stable one which is carried forward. The current rationale for the standard of the objective papers owes something to an essentially normative view of skill profiles in a European context (as, probably, does the CEFR), and something to the progression depicted by the common measurement scale, which can be represented as a rational ‘ladder’ of learning objectives. Standardisation for the examiner-marked performance skills also happens at the Operational and Monitoring phases.
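The idea of carrying a standard forward through anchor items can be illustrated with a minimal sketch. It is not the LIBS procedure: the text does not say which measurement model or linking method is used, so the sketch assumes difficulties expressed on a logit scale and a simple mean-shift link on the anchors. The function names, item identifiers and values are all hypothetical.

```python
from statistics import mean

def link_to_bank_scale(pretest, bank, anchor_ids):
    """Place pretest difficulties on the bank scale via a mean shift on anchors."""
    # Average offset between bank and pretest estimates over the anchor items.
    shift = mean(bank[i] - pretest[i] for i in anchor_ids)
    return {item: d + shift for item, d in pretest.items()}

def within_target(item_ids, difficulties, target, tolerance=0.1):
    """Check whether a draft paper's mean difficulty is close to the target."""
    return abs(mean(difficulties[i] for i in item_ids) - target) <= tolerance

# Hypothetical difficulties in logits (illustrative values only).
bank = {"A1": -0.50, "A2": 0.25, "A3": 1.00}  # anchors already in the bank
pretest = {"A1": -0.75, "A2": 0.00, "A3": 0.75, "N1": 0.50, "N2": -1.25}

linked = link_to_bank_scale(pretest, bank, ["A1", "A2", "A3"])
# New items, now expressed on the same scale as the bank.
new_items = {k: v for k, v in linked.items() if k not in bank}
```

Because the new items end up on the same scale as the bank, a paper assembled from them can be checked against a target difficulty (here with the hypothetical `within_target` helper) without setting a fresh standard each session.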

Where human raters are involved, we follow a rigorous system of recruitment, induction, training, coordination, monitoring and evaluation (RITCME). Obligatory standardisation of writing examiners and general markers takes place prior to every marking session, and the writing samples used are evaluated by the most senior examiners for the paper. Standardisation of oral examiners takes place once a year prior to the main administration session. The video samples of performances used are rated by the most experienced Senior Team Leaders and Team Leaders, who represent a wide range of countries and familiarity with level. The marks provided are then subject to quantitative (SPSS and FACETS) and qualitative analysis before being approved for standardisation purposes.
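As a rough illustration of the kind of check that comparing examiners against approved marks makes possible, the sketch below computes how far one examiner's marks sit from the approved marks on a set of standardisation samples. It is a deliberately simple descriptive summary, not the SPSS or FACETS analysis mentioned above; the function name, sample identifiers and marks are hypothetical.

```python
from statistics import mean

def marking_tendency(examiner_marks, approved_marks):
    """Summarise an examiner's deviation from approved marks on shared samples."""
    diffs = [examiner_marks[s] - approved_marks[s] for s in approved_marks]
    return {
        # Positive bias: more lenient than the approved marks; negative: harsher.
        "bias": mean(diffs),
        # Average size of disagreement, regardless of direction.
        "mean_abs_diff": mean(abs(d) for d in diffs),
    }

# Hypothetical marks for four standardisation samples.
approved = {"S1": 3.0, "S2": 4.0, "S3": 2.5, "S4": 5.0}
examiner = {"S1": 3.5, "S2": 4.0, "S3": 3.0, "S4": 5.0}

tendency = marking_tendency(examiner, approved)
```

A summary of this kind only flags a tendency for follow-up; it does not separate examiner severity from task or candidate effects in the way a many-facet analysis would.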

It should be noted that some of the samples used in training and benchmarking events formed what is currently known as the Council of Europe DVD for oral performance and receptive skills.

The Manual treats Familiarisation as an essential component of the Specification and Standardisation stages. The current training and standardisation processes used by Cambridge English can be seen as familiarisation with the Cambridge levels, with explicit CEFR reference being introduced as appropriate. There is another sense in which familiarisation with the CEFR is timely for Cambridge English: staff or external experts whose current frame of reference may not extend beyond the Cambridge levels, or whose daily duties do not require them to look at developments further afield, will benefit from a better understanding of the CEFR. This is being addressed in a variety of ways, including an induction worksheet and a face-to-face workshop. Future plans for Familiarisation activities include the use of Cambridge English annual seminars and network meetings as well as a self-access training course.

Empirical Validation involves the ‘collection and analysis of data on (a) task performance and item characteristics, (b) test quality, (c) learner performance and test scores, (d) rater behaviour and (e) the adequacy of standards set’ (2003:9). The Manual goes on to say that empirical validation proceeds on two levels: internal and external validation. This relates to a very wide range of activities at the Development, Operational and Monitoring Phases of the test cycle, as well as research projects to look at specific issues. With regard to internal validation, a variety of activities take place:

Statistical analyses of objective items before (pretest) and after live sessions. This includes the use of anchor tests, and information about candidates gathered each session via candidate information sheets.
Qualitative analysis of Writing and Speaking tasks before (trialling) and after live sessions, which is documented in examiner, senior team leader and annual validation reports.
Statistical analysis of Writing examiners' marking tendencies and monitoring via the Team Leader system throughout the entire marking period, in addition to a systematic ‘marks collection’ exercise in the Speaking co-ordination and monitoring process in live sessions.
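The text does not specify which statistics are computed for objective items. As one common possibility, the sketch below shows a classical item analysis: each item's facility (proportion correct) and a discrimination index taken as the correlation between the item score and the score on the rest of the paper. The function names and the response matrix are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

def item_analysis(responses):
    """responses: one list of 0/1 item scores per candidate."""
    totals = [sum(r) for r in responses]
    results = []
    for j in range(len(responses[0])):
        scores = [r[j] for r in responses]
        # Rest score excludes the item itself so it does not inflate the index.
        rest = [t - s for t, s in zip(totals, scores)]
        facility = sum(scores) / len(scores)
        results.append((facility, pearson(scores, rest)))
    return results

# Hypothetical responses: five candidates, three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
results = item_analysis(responses)  # list of (facility, discrimination) pairs
```

In practice such indices would be computed on far larger samples; with only a handful of candidates, as here, the values are purely illustrative.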

With regard to external validation, a major project was carried out in 1998-2000 using the ALTE Can Do scales (Jones 2000, 2001, 2002), providing a strong empirical link between test performance and perceived real-world language skills, as well as between the Cambridge English exam levels and the CEFR scales.

Conclusion

Cambridge English can produce a variety of kinds of evidence of how its tests link to the CEFR, and of the quality of the processes which strengthen this link. In its current draft form at least, the Manual envisages such a variety: users applying it rationally and selectively, ‘contributing to a body of knowledge and experience’ and ‘adding to the compendium of suggested techniques’. That is, there should be different ways of constructing an argument of alignment. As the regulatory function of the CEFR gathers pace there is a risk that the Manual will become a more prescriptive, rubber-stamping procedure, which would be to the detriment of language testing and users of the results.

At present (the beginning of 2008), the major challenge for European language testers is to begin to look explicitly at direct cross-language comparison. A major multilingual benchmarking event for Speaking will take place in June, hosted by C.I.E.P. in Sèvres. This will need new methodologies and new kinds of evidence, but it provides the best hope of a better answer to the question: ‘Is my B1 your B1?’ (Professor J. Charles Alderson)

References

Council of Europe (2003) Relating language examinations to the CEFR. Manual; Preliminary Pilot Version.
Council of Europe (2004) Reference supplement to the Preliminary Pilot Version of the Manual for Relating language examinations to the CEFR.
Galaczi, E and Ffrench, A (2007) Developing revised assessment scales for Main Suite and BEC Speaking tests, Research Notes 30.
Jones, N (2000) Background to the validation of the ALTE ‘Can-do’ project and the revised Common European Framework, Research Notes 2, 11–13.
Jones, N (2001) The ALTE Can Do Project and the role of measurement in constructing a proficiency framework, Research Notes 5, 5–8.
Jones, N (2002) Relating the ALTE Framework to the Common European Framework of Reference, in Council of Europe, Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Case Studies, Strasbourg: Council of Europe Publishing, 167–183.
Taylor, L and Jones, N (2006) Cambridge ESOL exams and the Common European Framework of Reference (CEFR), Research Notes 24.
Weir, C J and Milanovic, M (Eds) (2003) Continuity and Innovation: The History of the CPE 1913-2002, Studies in Language Testing 15, Cambridge: CUP/UCLES.