Norm-referencing and criterion-referencing

Posted on 03-11-2013

This is an excellent article by Tim Oates which looks at the different ways of awarding grades in national exams. In particular, it looks at norm-referencing and criterion-referencing.

As Tim Oates notes, there seems something intrinsically unfair about norm-referencing, where a fixed percentage of grades is allocated each year: the top 10% get A grades, the next 10% get Bs, and so on. It seems unfair on the individual because, however hard they work and however highly they achieve, they are not really judged on the merits of their own work but on how it stacks up against those around them. More than 10% of pupils might be performing brilliantly, but this system cannot recognise them all. It also has wider problems – it makes life hard for universities and employers who want to compare candidates from different years, as there is no guarantee that an A grade in one year required the same level of performance as an A grade in another.

And whilst norm-referencing seems to be the cure for grade inflation, theoretically it isn’t. Of course, in a narrow sense it is – no more than 10% can get the top grade. But when people worry about grade inflation, they are not actually worried about grade inflation per se – they are worried about unjustified grade inflation. They aren’t worried about more people getting better grades; they are worried about more people getting better grades to which they are not entitled. This problem is theoretically just as acute with norm-referencing. Norm-referencing is a relative judgment of performance which is not anchored to absolute performance criteria, so if performance deteriorates over time, the system cannot respond: the top 10% still get the top grades, even if their performance is not that good in absolute terms. It’s fair to note that in practice this rarely happens, but theoretically it is possible.
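To make the mechanism concrete, here is a minimal sketch of norm-referenced grading in Python. The pupil names, scores, grade bands and function name are my own illustrative assumptions, not anything from Tim Oates’s article or a real awarding process – the point is simply that grades follow rank, not any absolute standard.

```python
# A minimal sketch of norm-referenced grading: grades are allocated by rank
# within the cohort, not against any absolute standard. The bands and scores
# below are illustrative assumptions, not real exam data.

def norm_reference(scores, bands=(("A", 0.10), ("B", 0.10), ("C", 0.20))):
    """Assign grades to the top fractions of the cohort; everyone else gets 'U'."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    grades = {name: "U" for name in scores}
    index = 0
    for grade, fraction in bands:
        quota = round(fraction * len(ranked))
        for name, _ in ranked[index:index + quota]:
            grades[name] = grade
        index += quota
    return grades

# Whether Asha gets an A depends entirely on how the rest of the cohort did:
cohort = {"Asha": 91, "Ben": 88, "Cara": 85, "Dev": 60, "Eli": 55,
          "Fay": 52, "Gus": 48, "Hana": 45, "Ivo": 40, "Jo": 30}
print(norm_reference(cohort))
```

Note that if every score in the cohort dropped by 30 marks, the output would be identical – which is exactly the “cannot respond to deteriorating performance” problem described above.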

For all these reasons, it seemed obvious to me when I began teaching that criterion-referenced exams were fairer AND more accurate. Criterion-referencing is where you set the standards in advance and then measure whether an individual candidate has reached them. What I liked about this was that it offered an absolute measure of performance. It anchored our judgments in objective, external criteria rather than fuzzy and changeable comparisons. If a pupil asked how to achieve the next grade, you could give them some actually meaningful advice, as opposed to saying ‘you need to do better than Jenkins Minor’. Theoretically you could end up with 100% of pupils achieving A grades, but that would only be because of real improvement in the system – because everyone was meeting the absolute A-grade criteria. Criterion-referencing would give you an accurate measure of how individuals and the system were performing.

However, criterion-referencing has not proved as successful as this would suggest. The problem, as Tim Oates notes in what I think is the crucial part of his article, is that it is not as easy to create the absolute criteria or standards as we think it is.

The first issue is the slippery nature of standards. Even a well-crafted statement of what you need to get an A grade can be loaded with subjectivity – even in subjects such as science. It’s genuinely hard to know how difficult a specific exam is.

Of course, the simple solution would be to give every pupil exactly the same question paper every year. That would be the fairest and most accurate way of testing performance. Everyone would literally have sat the same test. But for obvious reasons you can’t do that. This, I think, is the crucial problem with criterion-referencing. As an abstract, intellectual idea it is perfect. It is the scientific fair test: you carry out your intervention and then test everyone in the same way. But in the messy real world, you cannot reuse that test. Once it has been taken, its content is out in the open, so it cannot fairly be taken again by another cohort, who could simply prepare for the known questions.

Examiners are in an extremely difficult situation. Year after year, they have to create tests of comparable difficulty to the previous year’s that are nevertheless sufficiently different from them. If they keep the test close in structure and format to the year before, it will be easier to compare across years, but it will also be easier for the succeeding cohort to do well at the test. If they depart more significantly from the previous structure and format, they ensure that the succeeding cohort doesn’t gain an unfair advantage, but they also make it much harder to compare between the two years. The problem here is that public exams have so many different purposes. Amongst many other things, they have to make judgments about the individuals sitting that particular test, but also compare those individuals with the candidates who sat past tests.

When I began teaching, I had the idea that there was an objective method of measuring the relative difficulty of different questions. I particularly thought this was the case in subjects like science and mathematics. When I saw those little marks in boxes after questions on exam papers, I assumed there was some scientific method of calculating them. But this does not seem to be the case. It would be nice to think that all questions with the same deep structure but different surface features are of equal difficulty. Then you could keep a question’s deep structure and simply change its surface features. But changing the surface features does affect the difficulty of a question. To give an obvious and well-known example, most people can solve the Tower of Hanoi problem, but they can’t solve the same problem when it is presented as a word problem.

As Tim Oates notes, it’s because of the problem of determining the specific difficulty of an exam that examiners don’t rely entirely on judgment against criteria – instead, they combine judgment about students meeting a standard ‘with statistical information about what kinds of pupils took the examination.’ So whilst criterion-referencing appeared to be an elegant solution to some of the problems of norm-referencing, it has proved very hard to implement and has generated many problems of its own. Harry Webb writes well about this here. Like me, he started out a fan of criterion-referencing, but now says ‘experience has shown me problems which I now conclude are so fundamental that I cannot see how it can work effectively.’

Ultimately, these are complex and tricky issues with no simple answers. Here are the last words from Tim Oates’s article:

It’s no surprise that an infallible means of nailing standards with absolute precision has not yet been developed.
Setting and maintaining standards in exams is not a simple business. Our exams are of extremely high quality but a lot is being asked of them. The way that they are used in selection and accountability demands a level of precision that is well-nigh impossible to deliver in an inexpensive and practical way.

If all this feels like an unacceptable end to this explanation, it reflects reality. Examiners, exam-board officials and the regulator all work hard to write high-quality exam papers, to be fair and accurate, and make evidence-informed decisions. And they are doing so through a period of great change.