Reliability and validity are the two most important properties of a test. They form part of the Cambridge English VRIPQ approach as described in the Principles of Good Practice booklet. It is a general principle that in any testing situation one needs to maximise validity and reliability to produce the most useful results for test users, within existing practical constraints.
Cambridge English takes the view that reliability is an integral component of validity; there can be no validity without reliability. Hence any approach to estimating reliability must reflect potential sources of evidence for the construct validity of the tests.
Reliability, normally expressed as a figure between 0 and 1, indicates the replicability of test scores when a test is given two or more times to the same group of people, or when two tests constructed in the same way are given to the same group. The expectation is that candidates would receive nearly the same results on each occasion. If candidates’ results are consistent across occasions, the test is said to be reliable; the degree of score consistency is therefore a measure of the reliability of the test.
There are various ways to estimate the reliability of an exam. Most Cambridge English exams have two main types of component: objective papers and performance papers. Objective papers are those that do not require human judgement for their scoring, i.e. tests of reading comprehension, listening comprehension and use of English. Scores on these sub-tests are calculated simply by adding up the number of correct responses in each section. The reliability estimates for these papers are calculated using a statistic called Cronbach’s Alpha. The closer the Alpha is to 1, the more reliable the test section is.
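As an illustration, the sketch below applies the standard Cronbach’s Alpha formula, alpha = k/(k-1) * (1 - (sum of item variances)/(variance of total scores)), to a small hypothetical matrix of item scores. The data and function name are illustrative only; this is not Cambridge English’s scoring software.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's Alpha for a (candidates x items) matrix of item scores."""
    k = item_scores.shape[1]                         # number of items
    item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 5 candidates answering 4 dichotomous items (1 = correct)
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
print(round(cronbach_alpha(scores), 2))  # 0.8 for this toy data set
```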
Performance papers, by contrast, involve the judgement of human raters. Almost all Cambridge English Speaking tests use a paired format in which two Oral Examiners assess the performance of the candidates.
In Writing tests each candidate’s performance is usually marked by one human rater, with a sample of scripts marked by a second or third marker. When two examiners mark a performance test, we use the Pearson correlation between the two examiners’ marks as a measure of the consistency of the ratings. When this is not the case, or where a sample of performances is marked by more than one examiner, we use a statistic called the g-coefficient, which is derived from Generalizability theory. What is common to all these methods is that they are expressed on a scale from 0 to 1, very similar to the Alpha used for objective papers.
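As a brief illustration, the Pearson correlation between two examiners’ marks can be computed as below; the marks shown are hypothetical. (Estimating a g-coefficient requires a fuller variance-components analysis under Generalizability theory and is not sketched here.)

```python
import numpy as np

# Hypothetical marks awarded by two Oral Examiners to the same ten candidates
examiner_1 = np.array([4.5, 3.0, 5.0, 2.5, 4.0, 3.5, 5.0, 2.0, 4.5, 3.0])
examiner_2 = np.array([4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.5, 2.5, 4.0, 3.5])

# Pearson correlation: how consistently the two examiners rank the candidates
r = np.corrcoef(examiner_1, examiner_2)[0, 1]
print(f"Inter-rater correlation: {r:.2f}")
```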
The decision to pass or fail a candidate is almost always taken at the syllabus level for Cambridge English exams. That means the overall score on the test is a composite of the scores on its subcomponents, and it is this score which is reported to candidates. The score is reported in the range 0 to 100 by scaling raw scores to standardised scores. While it is worth having a measure of the reliability of each test component, what matters most to candidates and test users is the overall reliability of the whole syllabus. This is called the composite reliability.
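One standard classical test theory result gives the reliability of a weighted sum of component scores as 1 - (sum of weighted component error variances)/(variance of the composite), assuming uncorrelated errors across components. The sketch below implements this general technique with hypothetical component figures; it is not necessarily the exact procedure used for Cambridge English exams.

```python
import numpy as np

def composite_reliability(weights, sds, reliabilities, corr):
    """Reliability of a weighted sum of component scores (classical test theory).

    weights       : scaling weight applied to each component
    sds           : standard deviation of each component's scores
    reliabilities : reliability estimate of each component
    corr          : correlation matrix between component scores
    """
    w, s, r = map(np.asarray, (weights, sds, reliabilities))
    cov = np.outer(w * s, w * s) * np.asarray(corr)  # covariances of weighted parts
    composite_var = cov.sum()                        # variance of the composite
    error_var = np.sum((w * s) ** 2 * (1 - r))       # summed error variances
    return 1 - error_var / composite_var

# Hypothetical four-component exam with equal weights
rel = composite_reliability(
    weights=[1, 1, 1, 1],
    sds=[3.1, 2.8, 2.5, 2.9],
    reliabilities=[0.88, 0.85, 0.84, 0.86],
    corr=np.array([[1.0, 0.6, 0.6, 0.6],
                   [0.6, 1.0, 0.6, 0.6],
                   [0.6, 0.6, 1.0, 0.6],
                   [0.6, 0.6, 0.6, 1.0]]),
)
print(f"Composite reliability: {rel:.2f}")  # about 0.95 for these figures
```

Note how the composite reliability here exceeds that of any single component: combining components averages out some of their individual measurement error, which is why the total-score reliabilities reported below are higher than the component figures.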
The standard error of measurement (SEM) is not a separate approach to estimating reliability, but rather a different way of reporting it. Language testing is subject to the influence of many factors that are not relevant to the ability being measured; such irrelevant factors contribute to what is called ‘measurement error’. The SEM is a transformation of reliability onto the scale of test scores. While reliability refers to a group of test takers, the SEM shows the impact of reliability on the likely score of an individual: it indicates how close a test taker’s score is likely to be to their ‘true score’, to within some stated probability. For example, where a candidate receives a score of 67 on a test with an SEM of 3, there is a high probability (roughly 68%, corresponding to one SEM either side of the observed score) that their true score is between 64 and 70. This is a very useful piece of information that test users can use in their decision making.
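Under classical test theory the SEM is computed as SEM = SD * sqrt(1 - reliability), where SD is the standard deviation of observed scores. A minimal sketch, using hypothetical figures chosen to match the worked example above:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement from score SD and reliability."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical paper: score SD of 9.5 and reliability of 0.90
print(f"SEM = {sem(9.5, 0.90):.2f}")   # approximately 3.0

# Band around an observed score of 67 with SEM 3 (the example in the text):
# +/- 1 SEM covers the true score with roughly 68% probability,
# assuming normally distributed measurement error.
score, s = 67, 3
print(f"Likely true score: {score - s} to {score + s}")
```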
Tables 1–12 below report typical reliability and SEM figures for Cambridge English exams for 2010.
Components: The reliability figures for objective papers are based on raw scores. Speaking is based on inter-rater correlation and Writing is based on g-coefficients. SEM figures are based on raw scores.
Total score: As can be seen from the tables below, the composite reliability for these exams is above 0.90 and the SEM is around 3. These figures demonstrate a high degree of trustworthiness in the overall scores reported.
Table 1: Cambridge English: Key (Key English Test, KET)

| Component | Reliability | SEM |
|---|---|---|
| Reading/Writing | 0.90 | 3.12 |
| Listening | 0.86 | 1.78 |
| Speaking | 0.87 | 2.40 |
| Total score | 0.95 | 3.42 |
Table 2: Cambridge English: Key for Schools (Key English Test, KET, for Schools)

| Component | Reliability | SEM |
|---|---|---|
| Reading/Writing | 0.91 | 2.81 |
| Listening | 0.85 | 1.77 |
| Speaking | 0.86 | 2.39 |
| Total score | 0.95 | 3.23 |
Table 3: Cambridge English: Preliminary (Preliminary English Test, PET)

| Component | Reliability | SEM |
|---|---|---|
| Reading/Writing | 0.88 | 2.25 |
| Listening | 0.77 | 2.14 |
| Speaking | 0.84 | 1.63 |
| Total score | 0.92 | 3.39 |
Table 4: Cambridge English: Preliminary for Schools (Preliminary English Test, PET, for Schools)

| Component | Reliability | SEM |
|---|---|---|
| Reading/Writing | 0.89 | 2.21 |
| Listening | 0.82 | 2.03 |
| Speaking | 0.85 | 1.59 |
| Total score | 0.93 | 3.28 |
Table 5: Cambridge English: First (First Certificate in English, FCE)

| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.80 | 3.61 |
| Writing | 0.84 | 1.39 |
| Use of English | 0.84 | 3.18 |
| Listening | 0.81 | 2.16 |
| Speaking | 0.84 | 1.50 |
| Total score | 0.94 | 2.78 |
Note: Figures for Cambridge English: First for Schools are the same as for Cambridge English: First.
Table 6: Cambridge English: Advanced (Certificate in Advanced English, CAE)

| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.80 | 3.94 |
| Writing | 0.79 | 1.78 |
| Use of English | 0.83 | 3.72 |
| Listening | 0.73 | 2.33 |
| Speaking | 0.82 | 4.31 |
| Total score | 0.93 | 2.89 |
Table 7: Cambridge English: Proficiency (Certificate of Proficiency in English, CPE)

| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.79 | 4.37 |
| Writing | 0.73 | 2.24 |
| Use of English | 0.84 | 4.12 |
| Listening | 0.74 | 2.18 |
| Speaking | 0.85 | 4.47 |
| Total score | 0.92 | 2.88 |
Table 8: Cambridge English: Business Preliminary (Business English Certificate, BEC, Preliminary)

| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.85 | 2.76 |
| Writing | 0.84 | 2.01 |
| Listening | 0.82 | 2.21 |
| Speaking | 0.88 | 1.41 |
| Total score | 0.94 | 3.15 |
Table 9: Cambridge English: Business Vantage (Business English Certificate, BEC, Vantage)

| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.88 | 2.87 |
| Writing | 0.70 | 2.25 |
| Listening | 0.83 | 2.08 |
| Speaking | 0.82 | 1.66 |
| Total score | 0.93 | 3.31 |
Table 10: Cambridge English: Business Higher (Business English Certificate, BEC, Higher)

| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.85 | 3.08 |
| Writing | 0.71 | 1.96 |
| Listening | 0.83 | 2.32 |
| Speaking | 0.81 | 1.67 |
| Total score | 0.93 | 3.10 |
Table 11: Cambridge English: Young Learners (Young Learners English, YLE)

| Component | Reliability | SEM |
|---|---|---|
| Starters Listening | 0.76 | 1.30 |
| Starters Reading and Writing | 0.83 | 1.69 |
| Movers Listening | 0.81 | 1.65 |
| Movers Reading and Writing | 0.87 | 2.32 |
| Flyers Listening | 0.82 | 1.70 |
| Flyers Reading and Writing | 0.90 | 2.49 |
Table 12: TKT (Teaching Knowledge Test)

| Component | Reliability | SEM |
|---|---|---|
| Module 1 | 0.90 | 3.37 |
| Module 2 | 0.94 | 3.44 |
| Module 3 | 0.90 | 3.50 |
| CLIL | 0.91 | 3.62 |
| KAL | 0.95 | 3.50 |