In defence of norm-referencing

A couple of weeks ago Ofqual published their consultation on new GCSE grades. A lot of the media debate has focussed on the new 1-9 grading structure, but tucked away in the consultation document there is a lot of very interesting information about how examiners make judgments.

I’ve written before on this blog about the difference between norm-referencing and criterion-referencing. Briefly, norm-referencing is when you allocate a fixed percentage of grades each year. (Update – Dylan Wiliam has pointed out in the comments that this is not the correct definition of norm-referencing – see here for his comment.) Each year, the top 10% get A grades, the next 10% B, and so on. It’s a zero-sum game: only a certain number of pupils can get the top grade, and a certain number have to get the lowest grade. This seems intrinsically unfair because however hard individuals work and however highly they achieve, they are not really judged on the merits of their own work but on how it stacks up against the work of those around them. More than x% of pupils might be performing brilliantly, but this system cannot recognise them. It seems much fairer to set out what you want pupils to know and be able to do in order to achieve a certain grade, and to award the grade if they meet those criteria. That’s criterion-referencing.
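To make the fixed-percentage mechanics concrete, here is a minimal sketch of that kind of allocation. The band shares and candidate names are invented for illustration; real awarding involves far more than a rank-and-cut:

```python
def allocate_grades(scores, bands=((0.10, "A"), (0.10, "B"), (0.20, "C"), (0.60, "D"))):
    """Rank candidates by score and hand out grades in fixed proportions.

    scores: dict mapping candidate name -> mark.
    bands:  (share of cohort, grade) pairs, best grade first.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    grades, i, n = {}, 0, len(scores)
    for share, grade in bands:
        take = round(share * n)
        for name, _ in ranked[i:i + take]:
            grades[name] = grade
        i += take
    # Any remainder left over from rounding falls into the lowest band.
    for name, _ in ranked[i:]:
        grades[name] = bands[-1][1]
    return grades
```

Note the zero-sum property the post describes: a candidate's grade depends entirely on rank within the cohort, so one pupil moving up a band necessarily pushes another down.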

The old O-level allocated fixed percentages of grades, and when it was abolished, the new GCSE was supposed to be criterion-referenced. I say ‘supposed’ because, whilst criterion-referencing sounds much fairer and better, in practice it is fiendishly difficult, and so ‘pure’ criterion-referencing has never really been implemented. Criteria have to be interpreted in the form of tests and questions, and it is exceptionally hard to create tests, or even questions, of comparable difficulty year after year – even in seemingly ‘objective’ subjects like maths or science.

We are not the only country to have this problem. The Ofqual report references the very interesting example of New Zealand. Their attempt at pure criterion referencing in 2005 led to serious problems. A New Zealand academic wrote this report about it, which includes a number of interesting points.

Taken at face value, criterion-referenced assessment appears to have much to recommend it (the performance demonstrated is a well-specified task open to interpretation) and norm-referencing very little to recommend it (the level of performance must be gauged from the relative position obtained), nevertheless, there are difficulties that make the introduction of criterion-referenced assessment in areas like reading, mathematics, and so on, much less smooth than this view might lead one to anticipate.

Likewise, in his book Measuring Up (which I reviewed in three parts here, here and here), the American assessment expert Daniel Koretz outlines some of the flaws of criterion-referenced assessments. The basic flaw at the heart of criterion-referencing may be that we are ill-equipped to make absolute judgments. In the words of Donald Laming, ‘there is no absolute judgment. All judgments are comparisons of one thing with another.’

As a result, our system has never been purely criterion-referenced. Tim Oates says this of the system we use at the moment:

‘In fact, we don’t really have a clear term for the approach that we actually use. ‘Weak-criterion referencing’ has been suggested: judgement about students meeting a standard, mixed with statistical information about what kinds of pupils took the examination.’

Ofqual are proposing to continue with this approach, but to improve it. I support their direction of travel, but I wonder if they couldn’t have gone a bit further – say, for example, actually reintroducing fixed grades.

One argument against fixed allocations of grades is that it won’t allow you to recognise genuine improvement in the system – or indeed genuine decline. If the top x% always get the top grade, you have no idea if standards are improving or declining. However, this argument no longer holds water because Ofqual are proposing to bring in a national reference test:

 The performance of the students who take the test will provide a useful additional source of information about the performance of the cohort (rather than individual students) for exam boards awarding new GCSEs. If, overall, students’ performance in the reference test improves on previous years (or indeed declines) this may provide evidence to support changing the proportion of students in the national cohort achieving higher or lower GCSE grades in that year. At present such objective and independent evidence is not available when GCSE awards are made.

I think the reference test is an excellent idea. Ideally, in the long-term it could assume the burden of seeing if overall standards are improving, leaving GCSEs free to measure the performance of individual pupils. In that case, why not have fixed grades for GCSEs? Alan Smithers makes a similar point in the Guardian here.

One reason why Ofqual might not have wanted to reintroduce fixed allocations of grades at the moment is because, despite all the real technical flaws with criterion-referencing which I have outlined above, there is still an element of hostility to norm-referencing amongst many educationalists. In my experience, I sense that many people think that norm-referencing is ‘ideological’ – that the only people who advocate it are those who want to force pupils to compete against each other.

Nothing could be further from the truth. Norm-referencing has some basic technical advantages which make it a sensible and pragmatic choice. The Finnish system, for example, which is often seen as being opposed to the ideas of competition and pupil ranking, has a norm-referenced final exam where the ‘number of top grades and failed grades in each exam is approximately 5 percent.’ Not only that, but as the example of New Zealand shows, those countries who have experimented with fully criterion-referenced exams have faced serious problems. If we refuse to acknowledge the genuine strengths of norm-referencing, we risk closing down many promising solutions to assessment problems.

16 responses to “In defence of norm-referencing”

  1. This is a really helpful blog Daisy, and you’re spot on that referencing could solve the most glaring issue. I’m interested in the control of percentage failures too – it’s something people forget: if you can limit top grades, you can limit bottom ones too.

    A bigger problem is how norm-referencing melds with the accountability agenda. If there will only ever be a finite number of grades available can we expect schools to keep passingincreasing floor targets? implies a ‘staticness’ of children’s abilities which may mean schools say “what CA we do? Our kids came in far behind and so the number of places at the

  2. Gah! Mobile phone problems…

    That second paragraph should read:

    A bigger problem is how norm-referencing melds with the accountability agenda. If there is only a finite number of grades available, can we really expect schools to keep passing increasing floor targets? The reference testing implies a ‘staticness’ of results – i.e. a cohort can only ever get what their ability revealed at an earlier point. A school with a group of students who are far behind is expected not just to get them to the point of higher-ability students but also to leapfrog them and knock them down into a lower % band. For ‘accountability’ I think that’s tough – though I actually think it’s probably a more honest reflection of what schools are facing.

  3. Keith Turvey says:

    Interesting blog Daisy which highlights fairly I think the issues and nuances of different positions. The other issue I think with norm referencing is learners’ and parents’ perceptions and the more general purposes of education. You note that in a norm referenced system it can ‘seem[s] intrinsically unfair because however hard an individual works and however highly they achieve, they are not really going to be judged on the merits of their own work but on how it stacks up against those around them.’ This issue is of great concern to both parents and learners I believe. It is not only ‘educationalists’ who would be concerned about this and I don’t think this can be ignored.

    Also I wonder how ‘absolutely’ reliable a reference test could be.

  4. kalinski1970 says:

    I agree with Laura that this makes complete sense, but you then have an issue with accountability. For example, if the new Progress 8 expects anyone who averages a 4b to gain 8 C grades, and the new floor target for primaries is 80% at 4b, then a lot of secondaries are going to be in trouble getting 80% of pupils to 8 C grades. To be honest, a lot of primaries are also going to be in trouble!

  5. What you call norm-referencing is generally called “grading on the curve” here. In our math department we practice a sort of after-the-fact curve that combines professional judgement with cutoffs. We complete all our marking and set the cutoffs immediately after grading the final exam. The philosophy (if anyone ever bothers to think about it) goes like this: we’re professional mathematicians, and thus qualified to make expert judgements of the quality of students’ work relative to the standards of our field and the expectations of the course in question. However, it is problematic to apply such “expert judgements” for (say) 1200 students across 10 sections of the course, with 10 different professors making judgements on their own classes – and to do so consistently.

    So instead, our judgement is “pooled”: we collectively stare at the distribution of raw numerical term marks and consider that data in light of what we’ve seen while grading the work. What does it “feel” like? Should there be a lot of A’s? Was the exam too hard? Should the pass line be lowered to accommodate? Together we decide, and cutoffs reflecting our judgement of “the herd” are agreed upon by consensus. In the end the distribution could be almost anything, according to how we regard the overall performance of the group. But no individuals are considered. Once the cutoffs are set, we once more separate the data by class and assign grades to individuals according to those cutoffs.

    This system is far from perfect, and as ripe for abuse as any other. Yet I’ve never seen a system in which I’ve had more confidence of both fairness and adherence to general disciplinary standards. I’ve always assumed that this system was the default one. But more and more I’m finding it’s not.

    • I think a system like the one I describe could be adapted to your “norm-referenced” system using something like your “reference test”, which could be standardized against other years to prevent drift. That test should be a very simple one that keys questions to certain benchmarks understood to anchor key outcomes of the course, making it easy to judge equivalence. This could be used to establish cutoffs for norm-referencing. So the exact percentages would change from year to year according to the cohort’s performance on the reference test.

      An alternative would be to use a fixed “curve” but to publish student grades along with the reference test score for the cohort. Thus a B in a cohort in which the reference score was (let us say) 80% would be a stronger credential than a B in a cohort with a score of 65%.
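The final step of the maths-department process described in this comment – agree grade boundaries collectively, then apply them mechanically to individuals – can be sketched as follows. The boundary marks here are invented for illustration:

```python
def assign_grade(mark, cutoffs):
    """Apply consensus cutoffs to one student's mark.

    cutoffs: list of (minimum mark, grade) pairs, highest boundary first.
    A mark below every boundary fails.
    """
    for minimum, grade in cutoffs:
        if mark >= minimum:
            return grade
    return "F"

# Hypothetical boundaries agreed by the department after viewing the
# pooled distribution of raw marks.
agreed = [(85, "A"), (70, "B"), (55, "C"), (40, "D")]
```

The point of the two-stage design is that judgement is exercised once, over the pooled distribution, while individual grading is then purely mechanical and so consistent across all sections.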

  6. Allocating grades on the basis of pre-determined quotas is not norm-referencing. In norm-referencing, students are compared to some well defined group of individuals who took the test at some earlier time. The classic example is the old US Scholastic Aptitude Test (SAT), in which, up until it was re-normed in the 1990s, every single student who took the test was compared with a group of male, college-bound, eastern-seaboard students who took the prototype test in 1941. What Daisy is describing here is cohort-referencing. Here’s an easy way to tell if you are taking a norm-referenced test or a cohort-referenced test. If sabotaging the performance of your neighbour helps your chances, you are taking a cohort-referenced test. In a norm-referenced test, your neighbour’s performance is irrelevant—it’s you against the norm group.

    It is also worth noting that strictly speaking, there’s no such thing as a norm-referenced test. What is norm-referenced is the interpretation of the data generated by the assessment. Some tests are designed to support norm-referenced interpretations, and it’s OK, as a short hand, to call these “norm-referenced tests” provided you realize that the same test can support norm-referenced, criterion-referenced, and cohort-referenced inferences.

    Daisy also quotes Tim Oates as saying that “we don’t really have a clear term for the approach that we actually use. ‘Weak-criterion referencing’ has been suggested: judgement about students meeting a standard, mixed with statistical information about what kinds of pupils took the examination.” Well, I don’t know if Tim has seen my work on construct-referenced assessment and dislikes it, or if he doesn’t know about it, but I have been arguing for over 20 years that all assessments are, in fact, construct-referenced—they work to the extent that they support inferences relating to shared constructs of quality that exist in the heads of those who do the assessing and the interpreting. A 1994 paper on this can be found in the Papers tab on my web-site.

  7. As someone who took exams at the end of the norm-referenced O-level and A-level era, I think that you have missed out a significant challenge with the process, which is that there is no possibility of comparing students in different year groups. That was not an issue when A-levels were used as a sifting measure for university, but once people started using them on job applications it was not possible to compare ‘absolute’ achievement between different cohorts of students.

    There was also the issue, because exam boards covered a smaller geographical area, that what was the top 5% in one area, wasn’t comparable to the top 5% in another area.

  8. Sounds like contextualised cohort-referencing (Stringer 2011) to me. I wrote about this recently. I wrote:
    Stringer describes how prior data, in the form of ‘mean concurrent attainment’ in GCSE Maths, English and possibly Science, would be used to allow outcomes in subjects to increase or decrease with the ability profile of the entry. I would argue that this makes the assumption that it is possible to view examinations from a predictive perspective, whereby students who are awarded the same grade must ‘on average’ be the same in terms of the extent to which their attainments predict their future success (Newton 2012).
    I would also argue that it changes the nature of what examination results tell us. When Tomlinson (2002) produced the ‘Inquiry into A level standards’ he claimed that in addition to ‘ranking’ students A Levels “must also contain an assurance that students have acquired the specific skills and knowledge that they need in order to embark on their chosen degree course”. Even with a clear set of criteria there is the danger that this aspect of examination results would be lost through contextualised cohort referencing.

    There’s an interesting publication by Cambridge Assessment – Research Matters Special Issue 2 – which has an article ‘A level pass rates and the enduring myth of norm-referencing’. I don’t know about O levels, but the article demonstrates that norm-referencing didn’t happen at A level, which might displease traditionalists.

  9. This article does not seem to deal with the fundamental objection to norm-referencing: that it assumes there is no difference between cohorts from year to year.

    For example, you could have a particularly strong cohort, for whatever reason, whereby a learner does not make the top ten per cent of learners, whereas he or she in many other years would have done so.

    In effect you have a sliding percentage figure being re-interpreted into fixed grades.

    If there is genuine improvement then learners with a ten year age gap could be competing in a market place with the same grade whereas in reality they achieved completely different scores.

    I’m not against norm-referencing, but this article has not dealt with the significant issues that arise as a consequence of it and the unfairness it can cause.

    The problem is not about identifying improvements or otherwise but fundamental unfairness.

  10. […] This is spawned from this publication asking for feedback from OFQUAL about the new GCSEs; from this blog post by Daisy Christodoulou. […]

  11. Ros says:

    At the school where I used to teach in the last millennium, we operated a grading-on-the-curve (rather than strictly norm-referenced) system. All internal exam results were scaled to give a mean of 65 and a standard deviation of (I think) 15 in each subject in each year group. It made it very easy to see how a pupil was doing in, say, French compared to science, and it also gave a good general sense of where the pupil was in relation to the rest of the year group. I really, really liked the system. The only problem, of course, was that half the teaching staff, most parents and many of the students didn’t understand it, despite the school’s best attempts to explain it.

    But for external exams, no one actually needs to understand the system, and I think it would be a huge improvement on the current system. It would make it much easier for universities, for example, not to have to keep revising their grades for offers. And not to have to try to distinguish between increasing numbers of students with a clean sweep of A*s.
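The scaling Ros describes is a standard linear rescaling of z-scores. A rough sketch, assuming her figures of a mean of 65 and a standard deviation of 15 (the raw marks below are made up):

```python
import statistics

def rescale(marks, target_mean=65.0, target_sd=15.0):
    """Linearly rescale raw marks so the group has the given mean and SD.

    Each mark is converted to a z-score (distance from the group mean in
    standard deviations) and then mapped onto the target scale.
    """
    mean = statistics.mean(marks)
    sd = statistics.pstdev(marks)  # population SD of this cohort
    return [target_mean + target_sd * (m - mean) / sd for m in marks]
```

Because every subject is mapped onto the same scale, a 72 in French and a 58 in science are directly comparable as positions within the year group, which is exactly the property Ros valued.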

  12. […] water. Educational assessment tends to be norm-referenced (for reasons Daisy Christodoulou explores here) but assessments of professional performance are almost invariably criterion-referenced in order to […]

  13. […] theory criterion referencing is obviously a fairer system, but it has proved very difficult to put fully into practice. Exam papers vary in relative ‘easiness’ or ‘hardness’ from year […]

  14. […] Norm/cohort referencing tells you I was in the upper 10% of the peer-group ability range for a few things, though a f*wit in Eng Lit. The modern equivalent holds no such truth – the upper 10% still exists, but you can’t identify it this way. […]
