Problems with performance descriptors

A primary teacher friend recently told me of some games she and her colleagues used to play with national curriculum levels. They would take a Michael Morpurgo novel and mark it using an APP grid, or they would take a pupil’s work and see how many different levels they could justify it receiving. These are jokes, but they reveal serious flaws. National curriculum levels are, and always have been, vague and unhelpful.

For example, compare:

Pupils’ writing is confident and shows appropriate and imaginative choices of style in a range of forms.

Pupils’ writing in a range of forms is lively and thoughtful.

The first is a description of performance at level 7, the second at level 4. That’s what I mean about vague and unhelpful, and that’s why my friend was able to justify the same piece of work receiving several different levels.

However, what is frustrating is that many of the replacements for national curriculum levels rely on precisely the same kind of vague performance descriptions. In fact, in many conversations I have with people, they cannot even begin to imagine an assessment system that doesn’t use some form of descriptor. For many people, descriptors simply are assessment, and if a school is to create its own assessment system, then the first – and possibly last – step must surely involve the creation of a new set of descriptors. Unfortunately, the truth is very different: as I’ve written here, descriptors do not give us a common language but the illusion of a common language. They can’t be relied on to deliver accuracy or precision about how pupils are doing. In this post, I will recap the problems with descriptors; in the next, I will suggest some alternatives.

First, Tim Oates shows here that creating accurate prose descriptions of performance, even in subjects like maths and science, is fiendishly difficult.

Even a well-crafted statement of what you need to get an A grade can be loaded with subjectivity – even in subjects such as science. It’s genuinely hard to know how difficult a specific exam is.

Second, Dylan Wiliam shows here in Principled Assessment Design that even very precise descriptors can be interpreted in completely different ways.

Even in subjects like mathematics, criteria have a degree of plasticity. For example, a statement like ‘Can compare two fractions to identify which is larger’ sounds precise, but whether students can do this or not depends on which fractions are selected. The Concepts in Secondary Mathematics and Science (CSMS) project investigated the achievement of a nationally representative group of secondary school students, and found out that when the fractions concerned were 3/7 and 5/7 then around 90% of 14-year-olds answered correctly, but when more typical fractions, such as 3/4 and 4/5 were used, then 75% answered correctly. However, where the fractions concerned were 5/7 and 5/9 then only around 15% answered correctly (Hart, 1981).

Finally, Paul Bambrick-Santoyo makes a very similar point in Driven by Data. I’ve abridged the below extract.

To illustrate this, take a basic standard taken from middle school math:

Understand and use ratios, proportions and percents in a variety of situations.

To understand why a standard like this one creates difficulties, consider the following premise. Six different teachers could each define one of the following six questions as a valid attempt to assess the standard of percent of a number. Each could argue that the chosen assessment question is aligned to the state standard and is an adequate measure of student mastery:

Identify 50% of 20.

Identify 67% of 81.

Shawn got 7 correct answers out of 10 possible answers on his science test. What percent of questions did he get correct?

J.J. Redick was on pace to set an NCAA record in career free throw percentage. Leading into the NCAA tournament in 2004, he made 97 of 104 free throw attempts. What percentage of free throws did he make?

Bambrick-Santoyo goes on to give examples of two even more difficult questions. As with the Dylan Wiliam example, we can see that whilst 90-95% of pupils might get the first question right, many fewer would get the last one right.

The problems with the vagueness and inaccuracy of descriptors are not just a problem with the national curriculum levels. It is a problem associated with all forms of prose descriptors of performance. The problem is not a minor technical one that can be solved by better descriptor drafting, or more creative and thoughtful use of a thesaurus. It is a fundamental flaw. I worry when I see people poring over dictionaries trying to find the precise word that denotes performance in between ‘effective’ and ‘original’. You might find the word, but it won’t deliver the precision you want from it. Similarly, the words ‘emerging’, ‘expected’ and ‘exceeding’ might seem like they offer clear and precise definitions, but in practice, they won’t.

So if the solution is not better descriptors, what is the answer? Very briefly, the answer is for the performance standards to be given meaning through a) questions and b) pupil work. I will expand on this in a later post.

18 responses to “Problems with performance descriptors”

  1. ijstock says:

    You have simply alighted on the fundamental problem of trying to define the indefinable. Learning is a holistic phenomenon that defies close definition. In humanities subjects such as mine level descriptors never did make linear sense – and it has caused all sorts of trouble.

  2. […] current answer is to describe what the journey might look like. This is a path fraught with danger. Daisy Christodoulou makes clear that performance descriptors are, in the main, a nonsense. As she […]

  3. Again you zero in on a key problem. This has been a big problem over here. Descriptors have also befuddled parents because of the difficulty in interpreting them. They sometimes seem blurred by the same rhetorical devices politicians use to equivocate. I was on one ministry-run committee years ago in which the wording and labels for descriptors were discussed because there were complaints that the already-bland and barely meaningful descriptors in use were “stigmatizing”. The committee brainstormed alternatives for some time, but could not do any better: every set of terms someone would come up with would be dissected as just as stigmatizing as the ones in use (though I confess in some cases I could not even discern which ones described higher performance). It seems the problem was that one cannot describe performance of two different students without suggesting that one has … erm … performed better than the other. I’m thinking: do we want to assess, or not?

  4. On the face of it, Hart’s relation of success rates is difficult to believe: “However, where the fractions concerned were 5/7 and 5/9 then only around 15% answered correctly”, which suggests that students were doing far worse than random guessing (which would yield 50% correct answers). Hart is, however, capable in his handling of data and I presume he is using analysis that infers success rate by compensating for the expected success from random guessing (not hard to do) on this and on the previous datum reported. Just thought I’d mention this in case anyone comes away scratching their head from this. I must add that I have not looked at the Hart paper.

  5. 3rsplus says:

    The problems with the vagueness and inaccuracy of descriptors are not just a problem with the national curriculum levels. It is a problem associated with all forms of prose descriptors of performance. The problem is not a minor technical one that can be solved by better descriptor drafting
    All true. Please bring your expansion on “questions” and “pupil work” to front burner ASAP. Whether discarding “level descriptors” is a step forward, sideways, or even backward remains to be seen.

  6. julietgreen says:

    It’s good to see this mythology begin to be unpicked by bloggers like yourself and other writers. I had many frustrating years trying to explain to colleagues that a requirement for ‘accurate assessment of pupils’ writing’ was a non-starter and that ‘moderation’ did nothing more than confirm group biases. The level descriptors, as you point out, were ambiguous at the least and not obviously progressive. The teaching objectives of the new curriculum are not assessment criteria and nor should they ever be – but are even less likely to lead to any kind of sane judgement. We’re currently using them to this very end and laughing grimly as we do so. We know it’s a nonsense and we don’t know how to extricate ourselves.

  7. […] This is a follow-up to my blog from last week about performance descriptors. […]

  8. Another hilarious thing about the prose descriptors that have been used, even at GCSE, is that if one really took them seriously, one would be describing the performance of an exceptional undergraduate student. In practice, it simply becomes a case of asking the exam board what they mean by all this ambiguous prose. They will then give exemplars of work which supposedly corresponds to the prose description. Quite how they do remains a mystery. But the pretence is maintained that objectivity is being promoted.

  9. […] cloaking claptrap in the grandeur of hidden mysteries. Nowhere is it better exemplified than in performance descriptors, the mysteries of which can only be unlocked by those who have been initiated into the exam […]

  10. […] us a shared language: actually, as I argue here, they create the illusion of a shared language. I’ve also suggested two possible alternatives: criteria must be defined not through prose but through (1) questions and […]

  11. […] Christodoulou has done an awful lot of work on this. She notes that descriptors are very vague and lack precision or accuracy: “descriptors do […]

  12. […] or ‘understanding’ is entirely subjective. Daisy Christodoulou has written about this problem with criterion-referencing, citing both Tim Oates and Dylan Wiliam on the issue of the […]

  13. […] This taken from Dylan Wiliam’s Principled Assessment Design and helpfully discussed further by Daisy Christodoulou. […]

  14. Daisy, if every dept is doing something different – how do you monitor across the school? Most schools really are just replacing levels with levels.

  15. […] ‘progress’ (i.e. getting better at something). If you are not, then a very quick introduction is this blog post by Daisy Christodoulou, and you can read a greater elaboration on the ideas there in her book Making Good […]

  16. […] critiques of the adverb soup that we create when we develop rubrics and criteria sheets or use prose descriptors to define levels are entertaining. She suggests using comparative judgement may provide an […]

  17. […] between papers and year on year. For more info on reliability issues and marking using descriptors this piece from Daisy Christodoulou might be a useful starting […]

  18. […] Christodoulou’s discussion of the problems of performance descriptors here, and conveying tacit […]
