Quality and accountability

Our uncompromising commitment to quality

Our staff – the largest dedicated research team of any UK-based language assessment organisation – are our greatest asset in delivering our commitment to excellence. Our rigorous systems of quality are subject to independent checks and meet international standards, providing accountability and giving confidence to those who rely on our exams.

Quality management and validation in language assessment

The Cambridge English Principles of Good Practice outline the systems and processes that drive our search for excellence and continuous improvement. While these systems involve complex research and technology, the underlying philosophy is simple:

Validity – are our exams an authentic test of real-life English?
Reliability – do our exams behave consistently and fairly?
Impact – do our assessments have a positive effect on teaching and learning?
Practicality – do our assessments meet learners’ needs within available resources?
Quality – how we plan, deliver and check that we provide excellence in all of these fields.

We have published Principles of Good Practice to:

make our claims for quality assessment transparent
give teachers, researchers, policy makers and others interested in our exams further insight into our processes
share best practice with the wider world of language learning and assessment.

Exam reliability data

In Principles of Good Practice we state our commitment to providing users with data that will allow them to evaluate for themselves the reliability of our exams (appendix, Reliability section F). That data can be found below in the Reporting Reliability section.

The tools and analysis used to develop these figures are also listed for those unfamiliar with analysing and reporting test reliability.

Reliability and validity are the two most important properties of a test. They form part of the Cambridge English VRIPQ approach as described in the Principles of Good Practice booklet. It is a general principle that in any testing situation one needs to maximise validity and reliability to produce the most useful results for test users, within existing practical constraints.

Cambridge English takes the view that reliability is an integral component of validity; there can be no validity without reliability. Hence any approach to estimating reliability must reflect potential sources of evidence for the construct validity of the tests.

Reliability (normally expressed as a figure between 0 and 1) indicates the replicability of test scores when a test is given twice or more to the same group of people, when two tests constructed in the same manner are given to the same group of people, or when the same performance is marked independently by two different examiners. The expectation is that candidates would receive nearly the same results on all occasions. If the candidates’ results are consistent on all occasions, the test is said to be reliable; the degree of score consistency is therefore a measure of the reliability of the test.

There are various ways to estimate the reliability of an exam. Most Cambridge English exams have two main types of component: objective papers and performance papers. Objective papers are the ones that do not require human judgement for their scoring, i.e. tests of reading comprehension, listening comprehension and use of English. The scores achieved in these sub-components are simply calculated by adding up the total number of correct responses to each section. The reliability estimates for these papers are calculated using a statistic called Cronbach’s Alpha. The closer the Alpha is to 1, the more reliable the test is.
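As a rough illustration of the statistic named above, the following Python sketch computes Cronbach’s Alpha from a small matrix of item scores using the standard textbook formula. The function name and the sample data are hypothetical, and this is not a description of the tooling Cambridge English actually uses.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's Alpha for a list of candidate rows.

    item_scores: one row per candidate, one column per item
    (for example 1 for a correct response, 0 for an incorrect one).
    """
    k = len(item_scores[0])                      # number of items
    columns = list(zip(*item_scores))            # scores grouped by item
    item_variances = [variance(col) for col in columns]
    totals = [sum(row) for row in item_scores]   # raw score per candidate
    total_variance = variance(totals)
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 5 candidates, 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

With this invented data the function returns an Alpha of about 0.8. Real objective papers have far more items and candidates, so the sample is only meant to show the arithmetic.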

Writing performance is usually marked by one human rater; however, a selection of responses is also marked by a second or third rater. We use this sample of responses, marked by more than one examiner, to estimate reliability for writing by calculating a statistic called Gwet’s AC2, which is an estimate of inter-rater reliability.

For speaking, the Feldt Reliability Test is applied. This can be used when the score of a test is the sum of scores given by two raters or judges. We use it to assess reliability for speaking, as almost all Cambridge English speaking tests use a paired format structure where two Oral Examiners assess the performance of the candidates.

What is common to all these methods is a scale which ranges between 0 and 1, very similar to the Alpha used for objective papers.

Scores from the sub-components of a qualification are reported on the Cambridge English Scale (CES). These sub-component CES scores are used to calculate a candidate’s overall score, also reported on the CES, and it is this which determines a candidate’s grade, and CEFR level where relevant. While it is worth having a measure of the reliability of each sub-component, what matters most to candidates and test users is the reliability of the overall score. We use the standard error of measurement (SEM) of the sub-components, together with the standard deviation of the overall CES scores, to calculate the reliability of the overall score.
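One way such a calculation could work is sketched below in Python. It rests on two assumptions that the text does not state: that the overall score is the simple average of the sub-component CES scores, and that measurement errors in the sub-components are independent. The function names and the overall standard deviation of 13.6 are hypothetical; the sub-component SEMs are the A2 Key figures from Table 1.

```python
import math


def overall_sem(component_sems):
    """SEM of an average of components, assuming independent errors."""
    k = len(component_sems)
    return math.sqrt(sum(s * s for s in component_sems)) / k


def overall_reliability(component_sems, overall_sd):
    """Reliability as 1 minus the ratio of error variance to score variance."""
    sem = overall_sem(component_sems)
    return 1 - (sem / overall_sd) ** 2


# SEMs for Reading, Writing, Listening, Speaking (Table 1, A2 Key).
a2_key_sems = [6.01, 2.84, 6.29, 2.40]
hypothetical_overall_sd = 13.6  # assumed value, not taken from the source

print(round(overall_sem(a2_key_sems), 2))
print(round(overall_reliability(a2_key_sems, hypothetical_overall_sd), 2))
```

Under these assumptions the overall SEM comes out at roughly 2.37, close to the 2.36 reported for the A2 Key total score, and the assumed standard deviation gives a reliability of about 0.97. The small difference is consistent with rounding in the published sub-component figures, but the sketch should not be read as the exact procedure used.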

SEM is not a separate approach to estimating reliability, but rather a different way of reporting it. Language testing is subject to the influence of many factors that are not relevant to the ability being measured. Such irrelevant factors contribute to what is called ‘measurement error’. The SEM expresses reliability in the units of the test score. While reliability refers to a group of test takers, the SEM shows the impact of reliability on the likely score of an individual: it indicates how close a test taker’s score is likely to be to their ‘true score’, to within some stated probability. For example, where a candidate receives an overall CES score of 186 with an SEM of 2.5, there is a high probability that their true score lies between 181 and 191, that is, within two SEMs of the reported score. Test users can take this margin into account in their decision making.
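The relationship between reliability and SEM, and the score band in the example above, can be sketched in Python as follows. The sketch uses the standard formula SEM = SD × √(1 − reliability) and assumes that ‘a high probability’ corresponds to a band of two SEMs either side of the reported score, which is what the 181 to 191 range implies. The function names are hypothetical.

```python
import math


def sem_from_reliability(sd, reliability):
    """Standard error of measurement from score SD and reliability."""
    return sd * math.sqrt(1 - reliability)


def score_band(score, sem, n_sems=2):
    """Range within n_sems standard errors of the reported score."""
    margin = n_sems * sem
    return (score - margin, score + margin)


# The example from the text: CES score 186 with an SEM of 2.5.
low, high = score_band(186, 2.5)
print(low, high)  # 181.0 191.0
```

If measurement error is roughly normally distributed, a band of two SEMs covers the true score about 95% of the time; a narrower band of one SEM would cover it about 68% of the time.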

Tables 1–10 below report typical reliability and SEM figures for Cambridge English exams for 2024.

Components: The reliability figures for objective and performance papers are based on raw scores. SEM figures are based on CES scores for Cambridge English Qualifications and raw scores for TKT and YLE.

The tables below show that reliability is typically above 0.8 across all components. SEM is around 5 to 9 CES points for most objective components and around 2 to 3 for performance components, as well as for TKT and YLE (where SEM is in raw score points).

Overall score: The overall reliability for these exams is above 0.90 and the SEM is around 2.5. These figures demonstrate a high degree of trustworthiness in the overall CES scores reported.

Table 1: Cambridge English: A2 Key (KET)

Component	Reliability	SEM (CES)
Reading	0.88	6.01
Writing	0.90	2.84
Listening	0.87	6.29
Speaking	0.96	2.40
Total score	0.97	2.36

Table 2: Cambridge English: A2 Key for Schools (KET for Schools)

Component	Reliability	SEM (CES)
Reading	0.87	6.02
Writing	0.92	2.61
Listening	0.86	6.44
Speaking	0.95	2.26
Total score	0.96	2.37

Table 3: Cambridge English: B1 Preliminary (PET)

Component	Reliability	SEM (CES)
Reading	0.88	5.69
Writing	0.93	2.17
Listening	0.84	6.44
Speaking	0.96	2.14
Total score	0.95	2.28

Table 4: Cambridge English: B1 Preliminary for Schools (PET for Schools)

Component	Reliability	SEM (CES)
Reading	0.89	5.58
Writing	0.93	2.17
Listening	0.85	6.33
Speaking	0.96	2.03
Total score	0.96	2.24

Table 5: Cambridge English: B2 First (FCE)

Component	Reliability	SEM (CES)
Reading	0.83	6.75
Writing	0.94	1.64
Use of English	0.83	8.50
Listening	0.85	6.33
Speaking	0.95	2.06
Total score	0.94	2.57

Table 6: Cambridge English: B2 First for Schools (FCE for Schools)

Component	Reliability	SEM (CES)
Reading	0.83	6.83
Writing	0.91	2.27
Use of English	0.83	8.32
Listening	0.83	6.34
Speaking	0.95	1.98
Total score	0.94	2.57

Table 7: Cambridge English: C1 Advanced (CAE)

Component	Reliability	SEM (CES)
Reading	0.83	7.07
Writing	0.94	2.61
Use of English	0.80	9.17
Listening	0.82	7.32
Speaking	0.96	1.98
Total score	0.94	2.82

Table 8: Cambridge English: C2 Proficiency (CPE)

Component	Reliability	SEM (CES)
Reading	0.77	10.10
Writing	0.89	3.40
Use of English	0.77	11.76
Listening	0.78	9.95
Speaking	0.96	2.20
Total score	0.91	3.77

Table 9: Cambridge English: Young Learners

Paper	Reliability	SEM (raw score)
Pre A1 Starters Listening	0.84	1.48
Pre A1 Starters Reading and Writing	0.85	1.72
A1 Movers Listening	0.87	1.80
A1 Movers Reading and Writing	0.87	2.49
A2 Flyers Listening	0.86	1.72
A2 Flyers Reading and Writing	0.92	2.61

Table 10: TKT (Teaching Knowledge Test)

Module	Reliability	SEM (raw score)
Module 1	0.92	3.69
Module 2	0.92	3.68
Module 3	0.91	3.71
TKT: CLIL	0.93	3.37
TKT: YL	0.93	3.33
