Cambridge Assessment English
Validity is the extent to which an assessment can be shown to reflect accurately the test taker’s true level of ability. Our continuous programme of product development and validation is evidence-based and draws on the expertise of our research and validation teams in fields including applied linguistics, educational measurement and statistical analysis.
Cambridge Assessment English is a founder member of ALTE, which was established in 1990 and now has 34 member organisations throughout Europe. ALTE is a collaborative association that works to raise standards and increase coherence in European language qualifications, and provides a forum for discussion and collaboration through its regular conferences and meetings.
ALTE has developed its own quality management system, including procedures for auditing member organisations. The ALTE Code of Practice (1994) and ALTE Principles of Good Practice (2001) are available from www.alte.org.
Members of ALTE play a key role in the development of BULATS and in the SurveyLang project which delivered the European Survey on Language Competences for the European Commission in 2012.
For information on the ALTE audit system see Saville, N (2010) Auditing the quality profile: from code of practice to standards, Research Notes 39, 24–28.
The draft Manual for Relating Language Examinations to the CEFR advocates the use of external empirical validation as part of its procedures for exam–CEFR alignment. Cambridge Assessment English conducted a major external validation project in 1998–2000 using the ALTE Can Do scales (Jones 2000, 2001, 2002). The research project provided a strong empirical link between test performance and perceived real-world language skills, as well as between the Cambridge English exam levels and the CEFR scales.
The current system of Cambridge English exam levels has developed over nearly a century, starting in 1913 with the most advanced level, Cambridge English: Proficiency, now associated with CEFR Level C2. The system evolved as the need for new exam levels was recognised – that is, in response to the needs of particular groups of learners. Up to the 1990s the exams were designed, developed and administered without much support from statistics. Item writers, teachers and publishers shared an understanding of the levels, rooted in their understanding of the learners, and the system worked well for practical purposes.
However, when at the beginning of the 1990s Cambridge Assessment English began to address seriously the reliability and consistency of its assessments, it was clear that better statistical underpinning of the levels system was needed.
The methodology gaining wider adoption at the time was item banking, an application of item response theory (IRT) (Bond and Fox 2001, Wright and Stone 1979). Item banking involves assembling a bank of calibrated items – that is, items of known difficulty. Designs for collecting response data ensure a link across items at all levels, so that a single measurement scale can be constructed. This scale relates different testing events within a single frame of reference, greatly facilitating the development and consistent application of standards (see Figure 1).
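The idea of a single scale for both items and learners can be illustrated with the Rasch model (Wright and Stone 1979), under which the probability of a correct response depends only on the difference between learner ability and item difficulty, both expressed as logits on the same scale. The following is a minimal sketch; the item names, difficulty values and ability value are purely illustrative, not drawn from any Cambridge English item bank:

```python
import math

def rasch_probability(ability, difficulty):
    """Rasch model: P(correct) = exp(ability - difficulty)
    / (1 + exp(ability - difficulty)), with both parameters in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A tiny calibrated "bank": each item's difficulty sits on the common scale.
item_bank = {"item_A": -1.0, "item_B": 0.0, "item_C": 1.5}

# Because learner ability is expressed on the same scale, expected
# performance on any banked item can be predicted directly.
learner_ability = 0.5
for name, difficulty in sorted(item_bank.items()):
    p = rasch_probability(learner_ability, difficulty)
    print(f"{name}: P(correct) = {p:.2f}")
```

Once items are calibrated in this way, any test assembled from the bank reports results on the same measurement scale, which is what allows different testing events to be related within a single frame of reference.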
Item banking is applicable to tests which are objectively marked, so that item response data can be collected – for Cambridge English exams this means the Reading, Listening and the Use of English papers. The Cambridge English Common Scale – a single measurement scale covering all the Cambridge levels – was thus constructed with reference to these skills. Common Scales have also been published for writing and speaking, based on qualitative analysis of the features of these performance skills at different levels (see Hawkey and Barker 2004 for a discussion of this).
As Figure 1 shows, constructing a single measurement scale requires all the item response data to be linked in some way. Two ways of achieving this are common person linking, where a group of learners might for example take test papers at two different levels, and common item linking, where different tests contain some items in common. This is the basic approach used in pretesting, where each pretest is administered together with an anchor test of already calibrated material.
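The common item linking described above can be sketched as a simple scale adjustment: when a pretest shares anchor items with the bank, the offset between the two difficulty scales is estimated from those shared items and applied to the new items. The sketch below uses mean-mean linking with invented item identifiers and logit values; operational calibration uses full IRT estimation rather than a bare mean shift:

```python
def link_to_bank(new_form, bank, anchor_ids):
    """Estimate the constant shift between a new pretest's difficulty
    scale and the bank scale from the anchor items common to both,
    then re-express every new-form item on the bank scale."""
    shift = sum(bank[i] - new_form[i] for i in anchor_ids) / len(anchor_ids)
    return {item: d + shift for item, d in new_form.items()}

# Difficulties (logits) estimated in two separate calibrations.
bank     = {"a1": -0.5, "a2": 0.7}               # anchor items, bank scale
new_form = {"a1": -1.2, "a2": 0.0, "n1": 0.4}    # pretest scale
linked = link_to_bank(new_form, bank, ["a1", "a2"])
print(linked["n1"])  # the new item, now on the common scale
```

After linking, the previously uncalibrated item `n1` has a difficulty on the common scale and can join the bank, which is how each round of pretesting extends the single frame of reference.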
Bond, T G and Fox, C M (2001) Applying the Rasch model, Mahwah, NJ: Lawrence Erlbaum Associates.
Hawkey, R and Barker, F (2004) Developing a Common Scale for the Assessment of Writing, Assessing Writing 9 (2).
Jones, N (2000) Background to the validation of the ALTE ‘Can-do’ project and the revised Common European Framework, Research Notes 2, 11–13.
Jones, N (2001) The ALTE Can Do Project and the role of measurement in constructing a proficiency framework, Research Notes 5, 5–8.
Jones, N (2002) Relating the ALTE Framework to the Common European Framework of Reference, in Council of Europe, Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Case Studies, Strasbourg: Council of Europe Publishing, 167–183.
Wright, B D and Stone, M H (1979) Best test design, Chicago, IL: MESA Press.
The Cambridge Assessment English testing cycle and the draft Manual for Relating Language Examinations to the Common European Framework of Reference
The Common European Framework of Reference for Languages (CEFR) was published by the Council of Europe in 2001 as a common basis for the description and elaboration of learning, teaching and assessment. It has within a short period of time become highly influential in Europe and beyond.
As a result, many language testers now seek to align their exams to the CEFR. The Council of Europe has attempted to facilitate this by providing a ‘toolkit’ of resources, including a draft pilot Manual for Relating Language Examinations to the CEFR, a technical reference supplement to this and exemplar materials illustrating the CEFR levels (Council of Europe 2003, 2004).
Cambridge Assessment English is one of many language testers to have piloted the Manual, and in December 2007 hosted a seminar in Cambridge on behalf of the Council of Europe where many of these case studies were reported, and experiences and recommendations shared.
One issue to emerge from the seminar was what kind of evidence is relevant to demonstrating alignment, and how it can be collected. The Manual envisages a specific alignment project being organised, possibly on a one-off basis: participants are trained to carry out a set of activities, and reports are generated which constitute the evidential outcomes. From the viewpoint of Cambridge Assessment English, however, the alignment of tests to the CEFR, being a key aspect of their validity, should not be a one-off exercise, but rather integrated into every stage of the design and administration cycle.
Read more about the Cambridge Assessment English test development cycle, where we explain the relationship between these two approaches, seeing how we can build an argument based on the high quality of our existing processes, while generating evidence in line with the aims of the Manual. Bringing explicit CEFR reference into our current processes and documentation is work in progress, co-ordinated for practical purposes with revisions and updates. However, the close developmental links between Cambridge English exam levels and the CEFR already attest to a clear connection.
Taylor and Jones (2006) Cambridge ESOL exams and the Common European Framework of Reference (CEFR), Research Notes 24, 2–5.