Problems with performance descriptors

Posted on 22-05-2015

A primary teacher friend recently told me of some games she and her colleagues used to play with national curriculum levels. They would take a Michael Morpurgo novel and mark it using an APP grid, or they would take a pupil’s work and see how many different levels they could justify it receiving. These are jokes, but they reveal serious flaws. National curriculum levels are, and always have been, vague and unhelpful.

For example, compare:

Pupils’ writing is confident and shows appropriate and imaginative choices of style in a range of forms.

Pupils’ writing in a range of forms is lively and thoughtful.

The first is a description of performance at level 7, the second at level 4. That’s what I mean about vague and unhelpful, and that’s why my friend was able to justify the same piece of work receiving several different levels.

However, what is frustrating is that many of the replacements for national curriculum levels rely on precisely the same kind of vague performance descriptions. In fact, in many conversations I have with people, they cannot even begin to imagine an assessment system that doesn’t use some form of descriptor. For many people, descriptors simply are assessment, and if a school is to create its own assessment system, then the first – and possibly last – step must surely involve the creation of a new set of descriptors.  Unfortunately, the truth is very different: as I’ve written here, descriptors do not give us a common language but the illusion of a common language. They can’t be relied on to deliver accuracy or precision about how pupils are doing. In this post, I will recap the problems with descriptors; in the next, I will suggest some alternatives.

First, Tim Oates shows here that creating accurate prose descriptions of performance, even in subjects like maths and science, is fiendishly difficult.

Even a well-crafted statement of what you need to get an A grade can be loaded with subjectivity – even in subjects such as science. It’s genuinely hard to know how difficult a specific exam is.

Second, Dylan Wiliam shows here in Principled Assessment Design that even very precise descriptors can be interpreted in completely different ways.

Even in subjects like mathematics, criteria have a degree of plasticity. For example, a statement like ‘Can compare two fractions to identify which is larger’ sounds precise, but whether students can do this or not depends on which fractions are selected. The Concepts in Secondary Mathematics and Science (CSMS) project investigated the achievement of a nationally representative group of secondary school students, and found out that when the fractions concerned were 3/7  and 5/7  then around 90% of 14-year-olds answered correctly, but when more typical fractions, such as 3/4 and 4/5  were used, then 75% answered correctly. However, where the fractions concerned were 5/7 and 5/9 then only around 15% answered correctly (Hart, 1981).

Finally, Paul Bambrick-Santoyo makes a very similar point in Driven by Data. I’ve abridged the below extract.

 To illustrate this, take a basic standard taken from middle school math:

Understand and use ratios, proportions and percents in a variety of situations.

To understand why a standard like this one creates difficulties, consider the following premise. Six different teachers could each define one of the following six questions as a valid attempt to assess the standard of percent of a number. Each could argue that the chosen assessment question is aligned to the state standard and is an adequate measure of student mastery:

Identify 50% of 20.

Identify 67% of 81

Shawn got 7 correct answers out of 10 possible answers on his science test. What percent of questions did he get correct?

J.J Redick was on pace to set an NCAA record in career free throw percentage. Leading into the NCAA tournament in 2004, he made 97 of 104 free throw attempts. What percentage of free throws did he make?

Bambrick-Santoyo goes on to give examples of two even more difficult questions. As with the Dylan William example, we can see that whilst 90-95% of pupils might get the first question right, many fewer would get the last one right.

The problems with the vagueness and inaccuracy of descriptors are not just a problem with the national curriculum levels. It is a problem associated with all forms of prose descriptors of performance. The problem is not a minor technical one that can be solved by better descriptor drafting, or more creative and thoughtful use of a thesaurus. It is a fundamental flaw. I worry when I see people poring over dictionaries trying to find the precise word that denotes performance in between ‘effective’ and ‘original’. You might find the word, but it won’t deliver the precision you want from it. Similarly, the words ’emerging’, ‘expected’ and ‘exceeding’ might seem like they offer clear and precise definitions, but in practice, they won’t.

So if the solution is not better descriptors, what is the answer? Very briefly, the answer is for the performance standards to be given meaning through a) questions and b) pupil work. I will expand on this in a later post.