In defence of norm-referencing

A couple of weeks ago Ofqual published their consultation on new GCSE grades. A lot of the media debate has focussed on the new 1-9 grading structure, but tucked away in the consultation document there is a lot of very interesting information about how examiners make judgments.

I’ve written before on this blog about the difference between norm-referencing and criterion-referencing. Briefly, norm-referencing is when you allocate a fixed percentage of grades each year. (Update – Dylan Wiliam has pointed out in the comments that this is not the correct definition of norm-referencing – see here for his comment.) Each year, the top 10% get A grades, the next 10% B, and so on. It’s a zero-sum game: only a certain number of pupils can get the top grade, and a certain number have to get the lowest grade. This seems intrinsically unfair because however hard individuals work and however highly they achieve, they are not really judged on the merits of their own work but on how it stacks up against the work of those around them. More than x% of pupils might be performing brilliantly, but this system cannot recognise them. It seems much fairer to set out what you want pupils to know and be able to do in order to achieve a certain grade, and to award the grade if they meet those criteria. That’s criterion-referencing.
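To make the fixed-percentage mechanics concrete, here is a minimal sketch of that kind of allocation. The band shares and candidate names are invented for illustration; real awarding involves far more than a rank-and-cut:

```python
def allocate_grades(scores, bands=((0.10, "A"), (0.10, "B"), (0.20, "C"), (0.60, "D"))):
    """Rank candidates by score and hand out grades in fixed proportions.

    scores: dict mapping candidate name -> mark.
    bands:  (share of cohort, grade) pairs, best grade first.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    grades, i, n = {}, 0, len(scores)
    for share, grade in bands:
        take = round(share * n)
        for name, _ in ranked[i:i + take]:
            grades[name] = grade
        i += take
    # Any remainder left over from rounding falls into the lowest band.
    for name, _ in ranked[i:]:
        grades[name] = bands[-1][1]
    return grades
```

Note the zero-sum property the post describes: a candidate's grade depends entirely on rank within the cohort, so one pupil moving up a band necessarily pushes another down.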

The old O-level allocated fixed percentages of grades, and when it was abolished, the new GCSE was supposed to be criterion-referenced. I say ‘supposed’ because, whilst criterion-referencing sounds much fairer and better, in practice it is fiendishly difficult, and so ‘pure’ criterion-referencing has never really been implemented. Criteria have to be interpreted in the form of tests and questions, and it is exceptionally hard to create tests, or even questions, of comparable difficulty year after year – even in seemingly ‘objective’ subjects like maths or science.

We are not the only country to have this problem. The Ofqual report references the very interesting example of New Zealand. Their attempt at pure criterion referencing in 2005 led to serious problems. A New Zealand academic wrote this report about it, which includes a number of interesting points.

Taken at face value, criterion-referenced assessment appears to have much to recommend it (the performance demonstrated is a well-specified task open to interpretation) and norm-referencing very little to recommend it (the level of performance must be gauged from the relative position obtained), nevertheless, there are difficulties that make the introduction of criterion-referenced assessment in areas like reading, mathematics, and so on, much less smooth than this view might lead one to anticipate.

Likewise, in his book Measuring Up (which I reviewed in three parts here, here and here), the American assessment expert Daniel Koretz outlines some of the flaws of criterion-referenced assessments. The basic flaw at the heart of criterion-referencing may be that we are ill-equipped to make absolute judgments. In the words of Donald Laming, ‘there is no absolute judgment. All judgments are comparisons of one thing with another.’

As a result, our system has never been purely criterion-referenced. Tim Oates says this of the system we use at the moment:

‘In fact, we don’t really have a clear term for the approach that we actually use. ‘Weak-criterion referencing’ has been suggested: judgement about students meeting a standard, mixed with statistical information about what kinds of pupils took the examination.’

Ofqual are proposing to continue with this approach, but to improve it. I support their direction of travel, but I wonder if they couldn’t have gone a bit further – say, for example, actually reintroducing fixed grades.

One argument against fixed allocations of grades is that it won’t allow you to recognise genuine improvement in the system – or indeed genuine decline. If the top x% always get the top grade, you have no idea if standards are improving or declining. However, this argument no longer holds water because Ofqual are proposing to bring in a national reference test:

 The performance of the students who take the test will provide a useful additional source of information about the performance of the cohort (rather than individual students) for exam boards awarding new GCSEs. If, overall, students’ performance in the reference test improves on previous years (or indeed declines) this may provide evidence to support changing the proportion of students in the national cohort achieving higher or lower GCSE grades in that year. At present such objective and independent evidence is not available when GCSE awards are made.

I think the reference test is an excellent idea. Ideally, in the long-term it could assume the burden of seeing if overall standards are improving, leaving GCSEs free to measure the performance of individual pupils. In that case, why not have fixed grades for GCSEs? Alan Smithers makes a similar point in the Guardian here.

One reason why Ofqual might not have wanted to reintroduce fixed allocations of grades at the moment is because, despite all the real technical flaws with criterion-referencing which I have outlined above, there is still an element of hostility to norm-referencing amongst many educationalists. In my experience, I sense that many people think that norm-referencing is ‘ideological’ – that the only people who advocate it are those who want to force pupils to compete against each other.

Nothing could be further from the truth. Norm-referencing has some basic technical advantages which make it a sensible and pragmatic choice. The Finnish system, for example, which is often seen as being opposed to the ideas of competition and pupil ranking, has a norm-referenced final exam where the ‘number of top grades and failed grades in each exam is approximately 5 percent.’ Not only that, but as the example of New Zealand shows, those countries who have experimented with fully criterion-referenced exams have faced serious problems. If we refuse to acknowledge the genuine strengths of norm-referencing, we risk closing down many promising solutions to assessment problems.

16 responses to “In defence of norm-referencing”

  1. This is a really helpful blog Daisy, and you’re spot on that referencing could solve the most glaring issue. I’m interested in the control of percentage failures too – it’s something people forget: if you can limit top grades, you can limit bottom ones too.

    A bigger problem is how norm-referencing melds with the accountability agenda. If there will only ever be a finite number of grades available can we expect schools to keep passingincreasing floor targets? implies a ‘staticness’ of children’s abilities which may mean schools say “what CA we do? Our kids came in far behind and so the number of places at the

  2. Gah! Mobile phone problems…

    That second paragraph should read:

    A bigger problem is how norm-referencing melds with the accountability agenda. If there is only a finite number of grades available, can we really expect schools to keep passing increasing floor targets? The reference testing implies a ‘staticness’ of results – i.e. a cohort can only ever get what their ability revealed at an earlier point. A school with a group of students who are far behind is expected not just to get them to the point of higher-ability students but also to leapfrog them and knock them down into a lower % band. For ‘accountability’ I think that’s tough – though I actually think it’s probably a more honest reflection of what schools are facing.

  3. Keith Turvey says:

    Interesting blog Daisy which highlights fairly I think the issues and nuances of different positions. The other issue I think with norm referencing is learners’ and parents’ perceptions and the more general purposes of education. You note that in a norm referenced system it can ‘seem[s] intrinsically unfair because however hard an individual works and however highly they achieve, they are not really going to be judged on the merits of their own work but on how it stacks up against those around them.’ This issue is of great concern to both parents and learners I believe. It is not only ‘educationalists’ who would be concerned about this and I don’t think this can be ignored.

    Also I wonder how ‘absolutely’ reliable a reference test could be.

  4. kalinski1970 says:

    I agree with Laura that this makes complete sense, but you then have an issue with accountability. For example, if the new Progress 8 expects anyone who averages a 4b to gain 8 C grades, and the new floor target for primaries is 80% at 4b, then a lot of secondaries are going to be in trouble getting 80% of pupils to 8 C grades. To be honest, a lot of primaries are also going to be in trouble!

  5. What you call norm-referencing is generally called “grading on the curve” here. In our math department we practice a sort of after-the-fact curve that combines professional judgement with cutoffs. We complete all our marking and set the cutoffs immediately after grading the final exam. The philosophy (if anyone ever bothers to think about it) goes like this: we’re professional mathematicians, and thus qualified to make expert judgements of the quality of students’ work relative to the standards of our field and the expectations of the course in question. However, it is problematic to apply such “expert judgements” for (say) 1200 students across 10 sections of the course, with 10 different professors making judgements on their own classes – and to do so consistently.

    So instead, our judgement is “pooled”: we collectively stare at the distribution of raw numerical term marks and consider that data in light of what we’ve seen while grading the work. What does it “feel” like? Should there be a lot of A’s? Was the exam too hard? Should the pass line be lowered to accommodate? Together we decide, and cutoffs reflecting our judgement of “the herd” are agreed upon by consensus. In the end the distribution could be almost anything, according to how we regard the overall performance of the group. But no individuals are considered. Once the cutoffs are set, we once more separate the data by class and assign grades to individuals according to those cutoffs.

    This system is far from perfect, and as ripe for abuse as any other. Yet I’ve never seen a system in which I’ve had more confidence of both fairness and adherence to general disciplinary standards. I’ve always assumed that this system was the default one. But more and more I’m finding it’s not.

    • I think a system like the one I describe could be adapted to your “norm-referenced” system using something like your “reference test”, which could be standardized against other years to prevent drift. That test should be a very simple one that keys questions to certain benchmarks understood to anchor key outcomes of the course, making it easy to judge equivalence. This could be used to establish cutoffs for norm-referencing. So the exact percentages would change from year to year according to the cohort’s performance on the reference test.

      An alternative would be to use a fixed “curve” but to publish student grades along with the reference test score for the cohort. Thus a B in a cohort in which the reference score was (let us say) 80% would be a stronger credential than a B in a cohort with a score of 65%.
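The final step of the maths-department process described in this comment – agree grade boundaries collectively, then apply them mechanically to individuals – can be sketched as follows. The boundary marks here are invented for illustration:

```python
def assign_grade(mark, cutoffs):
    """Apply consensus cutoffs to one student's mark.

    cutoffs: list of (minimum mark, grade) pairs, highest boundary first.
    A mark below every boundary fails.
    """
    for minimum, grade in cutoffs:
        if mark >= minimum:
            return grade
    return "F"

# Hypothetical boundaries agreed by the department after viewing the
# pooled distribution of raw marks.
agreed = [(85, "A"), (70, "B"), (55, "C"), (40, "D")]
```

The point of the two-stage design is that judgement is exercised once, over the pooled distribution, while individual grading is then purely mechanical and so consistent across all sections.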

  6. Allocating grades on the basis of pre-determined quotas is not norm-referencing. In norm-referencing, students are compared to some well defined group of individuals who took the test at some earlier time. The classic example is the old US Scholastic Aptitude Test (SAT), in which, up until it was re-normed in the 1990s, every single student who took the test was compared with a group of male, college-bound, eastern-seaboard students who took the prototype test in 1941. What Daisy is describing here is cohort-referencing. Here’s an easy way to tell if you are taking a norm-referenced test or a cohort-referenced test. If sabotaging the performance of your neighbour helps your chances, you are taking a cohort-referenced test. In a norm-referenced test, your neighbour’s performance is irrelevant—it’s you against the norm group.

    It is also worth noting that strictly speaking, there’s no such thing as a norm-referenced test. What is norm-referenced is the interpretation of the data generated by the assessment. Some tests are designed to support norm-referenced interpretations, and it’s OK, as a short hand, to call these “norm-referenced tests” provided you realize that the same test can support norm-referenced, criterion-referenced, and cohort-referenced inferences.

    Daisy also quotes Tim Oates as saying that “we don’t really have a clear term for the approach that we actually use. ‘Weak-criterion referencing’ has been suggested: judgement about students meeting a standard, mixed with statistical information about what kinds of pupils took the examination.” Well, I don’t know if Tim has seen my work on construct-referenced assessment and dislikes it, or if he doesn’t know about it, but I have been arguing for over 20 years that all assessments are, in fact, construct-referenced—they work to the extent that they support inferences relating to shared constructs of quality that exist in the heads of those who do the assessing and the interpreting. A 1994 paper on this can be found in the Papers tab on my web-site.

  7. As someone who took exams at the end of the norm-referenced O-level and A-level era, I think that you have missed out a significant challenge with the process, which is that there is no possibility of comparing students in different year groups. That was not an issue when A-levels were used as a sifting measure for university, but once people started using them on job applications it was not possible to compare ‘absolute’ achievement between different cohorts of students.

    There was also the issue, because exam boards covered a smaller geographical area, that what was the top 5% in one area, wasn’t comparable to the top 5% in another area.

  8. Sounds like contextualised cohort-referencing (Stringer 2011) to me. I wrote about this recently. I wrote:
    Stringer describes how prior data, in the form of ‘mean concurrent attainment’ in GCSE Maths, English and possibly Science, would be used to allow outcomes in subjects to increase or decrease with the ability profile of the entry. I would argue that this makes the assumption that it is possible to view examinations from a predictive perspective, whereby students who are awarded the same grade must ‘on average’ be the same in terms of the extent to which their attainments predict their future success (Newton 2012).
    I would also argue that it changes the nature of what examination results tell us. When Tomlinson (2002) produced the ‘Inquiry into A level standards’ he claimed that in addition to ‘ranking’ students A Levels “must also contain an assurance that students have acquired the specific skills and knowledge that they need in order to embark on their chosen degree course”. Even with a clear set of criteria there is the danger that this aspect of examination results would be lost through contextualised cohort referencing.

    There’s an interesting publication by Cambridge Assessment – Research Matters Special Issue 2 – which has an article ‘A level pass rates and the enduring myth of norm-referencing’. I don’t know about O levels, but the article demonstrates that norm-referencing didn’t happen at A level, which might displease traditionalists.

  9. This article does not seem to deal with the fundamental objection to norm-referencing: that it assumes there is no difference between cohorts from year to year.

    For example, you could have a particularly strong cohort, for whatever reason, whereby a learner does not make the top ten per cent of learners, whereas he or she in many other years would have done so.

    In effect you have a sliding percentage figure being re-interpreted into fixed grades.

    If there is genuine improvement then learners with a ten year age gap could be competing in a market place with the same grade whereas in reality they achieved completely different scores.

    I’m not against norm-referencing, but this article has not dealt with the significant issues that arise as a consequence of it and the unfairness it can cause.

    The problem is not about identifying improvements or otherwise but fundamental unfairness.

  10. […] This is spawned from this publication asking for feedback from OFQUAL about the new GCSEs; from this blog post by Daisy Christodoulou. […]

  11. Ros says:

    At the school where I used to teach in the last millennium, we operated a grading-on-the-curve (rather than strictly norm-referenced) system. All internal exam results were scaled to give a mean of 65 and a standard deviation of (I think) 15 in each subject in each year group. It made it very easy to see how a pupil was doing in, say, French compared to science, and it also gave a good general sense of where the pupil was in relation to the rest of the year group. I really, really liked the system. The only problem, of course, was that half the teaching staff, most parents and many of the students didn’t understand it, despite the school’s best attempts to explain it.

    But for external exams, no one actually needs to understand the system, and I think it would be a huge improvement on the current system. It would make it much easier for universities, for example, not to have to keep revising their grades for offers. And not to have to try to distinguish between increasing numbers of students with a clean sweep of A*s.
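The scaling Ros describes is a standard linear rescaling of z-scores. A rough sketch, assuming her figures of a mean of 65 and a standard deviation of 15 (the raw marks below are made up):

```python
import statistics

def rescale(marks, target_mean=65.0, target_sd=15.0):
    """Linearly rescale raw marks so the group has the given mean and SD.

    Each mark is converted to a z-score (distance from the group mean in
    standard deviations) and then mapped onto the target scale.
    """
    mean = statistics.mean(marks)
    sd = statistics.pstdev(marks)  # population SD of this cohort
    return [target_mean + target_sd * (m - mean) / sd for m in marks]
```

Because every subject is mapped onto the same scale, a 72 in French and a 58 in science are directly comparable as positions within the year group, which is exactly the property Ros valued.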

  12. […] water. Educational assessment tends to be norm-referenced (for reasons Daisy Christodoulou explores here) but assessments of professional performance are almost invariably criterion-referenced in order to […]

  13. […] theory criterion referencing is obviously a fairer system, but it has proved very difficult to put fully into practice. Exam papers vary in relative ‘easiness’ or ‘hardness’ from year […]

  14. […] Norm/cohort referencing tells you I was in the upper 10% of the peer-group ability range for a few things, though a f*wit in Eng Lit. The modern equivalent holds no such truth – the upper 10% still exists, but you can’t identify it this way. […]
