Assessment is difficult, but it is not mysterious

Posted on 31-05-2015

This is a follow-up to my blog from last week about performance descriptors.

In that blog, I made three basic points: 1) that we have conflated assessment and prose performance descriptors, with the result that people assume the latter basically is the former; 2) that prose performance descriptors are very unhelpful because they can be interpreted in so many different ways; and 3) that there are other ways of assessing.

In response, David Didau wrote this post, in which he agreed with a lot of the things I said. I was pleased by this, because I really admire David’s work and think he has done great things in bringing important research to a wider audience. However, I was completely baffled by the end of the post, and I am going to explain why – if I’m a bit harsh, I’m sorry, and none of this changes the fact that I think most of David’s work is fantastic.

After agreeing with me about the vagueness of prose performance descriptors, he then suggested that as a replacement for prose performance descriptors, schools should use…wait for it…prose performance descriptors! Here is his example.

[Image: English assessment grid]


I am really astonished by this. The above grid has all the flaws of national curriculum levels, and offers no improvement. It reproduces all the errors I discuss in my previous post. To take just one example, what does a challenging assumption about the cultural value of a text look like? Come to think of it, what does a text look like? A pupil might be able to make a challenging assumption about the value of a shampoo advert, or a limerick, or a piece of graffiti, but struggle to make one about the value of War and Peace, a newspaper article, or an unseen poem. One teacher might interpret this criterion in the context of a short poem pupils have studied before, in which case many of their pupils might achieve it, whilst another might interpret it in the context of a lengthy unseen poem, in which case many of their pupils will not achieve it. The parts on inference are particularly baffling. We know that inference is not a formal skill: we know that pupils (and indeed adults) can make great inferences about a text about baseball, and poor ones about sentences like ‘I believed him when he said he had a lake house, until he said it was forty feet from the shore at high tide.’ (Both those examples from Dan Willingham – see here for more from him about inference and reading). In short, the above grid will result in teachers collecting ‘junk data’ of the type Bodil Isaksen discusses here.

In the rest of the post, David appears to be suggesting that such an approach is OK just as long as we accept that it has flaws and can never be precise. In his words, ‘You can never hope for precision from performance descriptors, but then precision will always be impossible to achieve.’ In making this argument, David has basically proven my first point, which is that we have become so used to prose performance descriptors that we have come to assume that they are assessment and that few alternatives are possible. Of course, if performance descriptors were the only way we could assess, then perhaps we would just have to accept the imprecision. But they aren’t! There are other ways, ways which offer far greater precision and accuracy. One example of a method that is far more precise is standardised tests. It’s true that they may not be perfectly precise, but there is still a world of difference between them and performance descriptors. Let me give a very concrete example of this: back in 1995, Peter Pumfrey gave a standardised reading test to a group of 7-year-old pupils who had all been assessed as level 2. On this test, their reading ages ranged from 5.7 to 12.9.

And this, in short, is why I care so much about this, and why I think it is so important. There are pupils out there who are really struggling with basic skills. Flawed assessment procedures are hiding that fact from them and their teachers, and therefore stopping them getting the help they need to improve. Worse, the kinds of assessment processes which would help to identify these problems are being completely ignored.

It’s as though you saw one person try to measure the length of a room by taking a guess, and another try to measure the same distance using a measuring tape marked with centimetres. Then, because neither method can give you a measure to three decimal places, you conclude that ‘all measurement is imprecise and fundamentally mysterious’, so you’ll just use your best guess. Well, OK, both methods may be imprecise, but the latter method is far less so than the former, and you will be much better advised to buy a carpet based on it.

Complexity is not the same as mystery. I worry that by saying that assessment is mysterious and that it is very difficult to get a handle on how pupils are doing, we legitimise a woolly approach where anything goes because we can’t really measure anything anyway. We can do a lot better than we are doing at the moment, and one of the first things we can do is to stop depending so much on generic prose descriptors.

I realise that this leaves open the question David posed at the start of his article – ‘OK smart arse, what should we do?’ In my last post, and in others from the past, I’ve repeatedly argued that the better approach is to define criteria in terms of a) actual questions / test papers and b) actual pupil work. For example, back in December 2013 I wrote here that ‘we don’t get a shared language through abstract criteria. We get a shared language through teaching shared content, doing shared tasks, and sharing pupil work with colleagues.’ I realise I need to expand on these points further, and I will do so in my next blog post.
