Comparative judgment: 21st century assessment

Posted on 15-11-2015

In my previous posts I have looked at some of the flaws in traditional teacher assessment and assessments of character. This post is much more positive: it’s about an assessment innovation that really works.

One of the good things about multiple-choice and short answer questions is that they offer very high levels of reliability. They have clear right and wrong answers; one marker will give you exactly the same mark as another; and you can cover large chunks of the syllabus in a short amount of time, reducing the chance that a high or low score is down to a student getting lucky or unlucky with the questions that came up. One of the bad things about MCQs is that they often do not reflect the more realistic, real-world tasks pupils might go on to encounter, such as essays and projects. The problem with real-world tasks, however, is that they are fiendishly hard to mark reliably: it is much less likely that two markers will always agree on the grades they award. So you end up with a bit of a trade-off: MCQs give you high reliability, but sacrifice a bit of validity. Essays allow you to make valid inferences about things you are really interested in, but you trade off reliability. And you have to be careful: trade off too much validity and you have highly reliable scores that don’t tell you anything anyone is interested in. Trade off too much reliability and the inferences you make are no longer valid either.

One way of dealing with this problem has been to keep the real world tasks, and to write quite prescriptive mark schemes. However, this runs into problems of its own: it reduces the real-world aspect of the task, and ends up stereotyping pupils’ responses to the task. Genuinely brilliant and original responses to the task fail because they don’t meet the rubric, while responses that have been heavily coached achieve top grades because they tick all the boxes. Again, we achieve a higher degree of reliability, but the reliable scores we have do not allow us to make valid inferences about the things we really care about (see the Esse Quam Videri blog on this here). I have seen this problem a lot in national exams, and I think that these kinds of exams are actually more flawed than the often-maligned multiple choice questions. Real world tasks with highly prescriptive mark schemes are incredibly easy to game. Multiple-choice and short answer questions are actually not as easy to game, and do have high levels of predictive validity. I think the problem people have with MCQs is that they just ‘look’ wrong. Because they look so artificial, people have a hard time believing that they really can tell you something about how pupils will do on authentic tasks. But they can, and they do, and I would prefer them to authentic tasks that either don’t deliver reliability, or deliver reliability in such a way that compromises their validity.

Still, even a supporter of MCQs like me has to acknowledge – as I always have – that in subjects like English and history, you would not want an entire exam to be composed of MCQs and short answer questions. You would want some extended writing too. In the past, I have always accepted that the marking of such extended writing has to involve some of the trade-offs and difficult decisions outlined above. I’ve also always accepted that it has to be a relatively time-consuming process, involving human markers, extensive training, and frequent moderation.

However, a couple of years ago I heard about a new method of assessment called comparative judgment, which offers an elegant solution to the problem of assessing tasks such as essays and projects. Instead of writing prescriptive mark schemes, training markers in their use, and getting them to mark a batch of essays or tasks and then come back together to moderate, comparative judgment simply asks an examiner to make a series of judgments about pairs of tasks. Take the example of an essay on Romeo and Juliet: with comparative judgment, the examiner looks at two essays, and decides which one is better. Then they look at another pair, and decide which one is better. And so on. It is relatively quick and easy to make such judgments – much easier and quicker than marking one individual essay. The organisation No More Marking offer their comparative judgment engine online here for free. You can upload essays or tasks to it, and set up the judging process according to your needs.

Let’s suppose you have 100 essays that need marking, and five teachers to do the marking. If each teacher commits to making 100 paired judgments, you will have a total of 500 judgments, which means each essay is seen about ten times on average. These judgments are enough for the comparative judgment engine to work out the rank order of all of the essays, and associate a score with each one. In the words of the No More Marking CJ guide here: ‘when many such pairings are shown to many assessors the decision data can be statistically modelled to generate a score for each student.’ If you want your score to be a GCSE grade or other kind of national benchmark, then you can include a handful of pre-graded essays in your original 100. You will then be able to see how many essays did better than the C-grade sample, how many better than the B-grade sample, and so on. This method of marking also allows you to see how accurate each marker is. Again, in the words of the guide: ‘the statistical modelling also produces quality control measures, such as checking the consistency of the assessors. Research has shown the comparative judgement approach produces reliable and valid outcomes for assessing the open-ended mathematical work of primary, secondary and even undergraduate students.’
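To make the 'statistical modelling' a little more concrete, here is a minimal sketch of how a pile of 'this one is better' decisions can be turned into a score per essay. It assumes a Bradley-Terry model, which is a common choice for paired comparisons; the post does not say which model the No More Marking engine actually uses, so treat this as an illustration rather than a description of their software. The function name `fit_bradley_terry`, the virtual 'average essay' used for smoothing, and the toy judgments are all made up for the example.

```python
import math
from collections import defaultdict


def fit_bradley_terry(n_essays, judgments, iterations=500):
    """Estimate a score per essay from (winner, loser) index pairs.

    Assumption of this sketch: a Bradley-Terry model, fitted with the
    standard iterative (minorise-maximise) update. Returns log-strengths
    centred on zero, so a higher score means a better essay.
    """
    wins = [0.0] * n_essays
    # opponents[i][j] = number of times essays i and j were compared
    opponents = [defaultdict(int) for _ in range(n_essays)]
    for winner, loser in judgments:
        wins[winner] += 1.0
        opponents[winner][loser] += 1
        opponents[loser][winner] += 1

    strength = [1.0] * n_essays
    for _ in range(iterations):
        updated = []
        for i in range(n_essays):
            # Smoothing (an assumption, not part of the basic model): every
            # essay has one extra comparison against a virtual essay of
            # strength 1, counted as half a win. This keeps scores finite
            # for essays that never win or never lose.
            denominator = 1.0 / (strength[i] + 1.0)
            for j, count in opponents[i].items():
                denominator += count / (strength[i] + strength[j])
            updated.append((wins[i] + 0.5) / denominator)
        strength = updated

    log_strength = [math.log(s) for s in strength]
    mean = sum(log_strength) / n_essays
    return [value - mean for value in log_strength]


# Toy data: three essays (0, 1, 2) and eight judgments, each written as
# (index of the essay judged better, index of the other essay).
toy_judgments = [
    (0, 1), (0, 1), (1, 0),   # essay 0 usually beats essay 1
    (1, 2), (1, 2), (2, 1),   # essay 1 usually beats essay 2
    (0, 2), (0, 2),           # essay 0 always beats essay 2
]
scores = fit_bradley_terry(3, toy_judgments)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
print(ranking)
```

With the 100 essays and 500 judgments from the example above, the same function would be called as `fit_bradley_terry(100, judgments)`. If a few pre-graded essays were included among the 100, their scores would mark the grade boundaries, and every other essay could be placed relative to them.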

I have done some trial judging with No More Marking, and at first, it feels a bit like voodoo. If, like most English teachers, you are used to laboriously marking dozens of essays against reams of criteria, then looking at two essays and answering the question ‘which is the better essay?’ feels a bit wrong – and far too easy. But it works. Part of the reason why it works is that it offers a way of measuring tacit knowledge. It takes advantage of the fact that amongst most experts in a subject, there is agreement on what quality looks like, even if it is not possible to define such quality in words. It eliminates the rubric and essentially replaces it with an algorithm. The advantage of this is that it also eliminates the problem of teaching to the rubric: to go back to our examples at the start, if a pupil produced a brilliant but completely unexpected response, they wouldn’t be penalised, and if a pupil produced a mediocre essay that ticked all the boxes, they wouldn’t get the top mark. And instead of teaching pupils by sharing the rubric with them, we can teach pupils by sharing other pupils’ essays with them – far more effective, as generally examples define quality more clearly than rubrics.

Ofqual have already used this method of assessment for a big piece of research on the relative difficulty of maths questions. The No More Marking website has case studies of how schools and other organisations are using it. I think it has huge potential at primary school, where it could reduce a lot of the burden and administration around moderating writing assessments at KS1 & KS2. On the No More Marking website, it says that ‘Comparative Judgement is the 21st Century alternative to the 18th Century practice of marking.’ I am generally sceptical of anything in education describing itself as ‘21st century’, but in this case, it’s justified. I really think CJ is the future, and in 10 or 15 years’ time, we will look back at rubrics the way Marty McFly looks at Reagan’s acting.
