Assessment alternatives 2: using pupil work instead of criteria

In my last few blog posts, I’ve looked at the problems with performance descriptors such as national curriculum levels. I’ve suggested two alternatives: defining these performance descriptors in terms of 1) questions and 2) example work. I discussed the use of questions here, and in this post I’ll discuss the use of pupil work.

Take the criterion: ‘Can compare two fractions to see which is bigger’. If we define this in terms of a series of closed questions – which is bigger: 5/7 or 5/9?; which is bigger: 1/7 or 5/7? – it gives us more clarity about exactly what the criterion means. It also means we can calculate the percentage of pupils who get each question right and use that as a guide to the relative difficulty of each question.

Clearly, this approach won’t work for more open questions. Take the criterion: ‘Can explain how language, structure and form contribute to writers’ presentation of ideas, themes and settings’. There are ways we can interpret this criterion in terms of a closed question – see my posts on multiple choice questions here. But it is very likely that we would also want to interpret it in terms of an open question, for example, ‘How does Dickens create suspense in chapter 1 of Great Expectations?’ We then need some prose descriptions telling us how to mark it. Here are some from the latest AQA English Literature spec.

Band 5 – sophisticated, perceptive
sophisticated analysis of a wide range of aspects of language supported by impressive use of textual detail

Band 4 – assured, developed
assured analysis of a wide range of aspects of language supported by convincing use of textual detail

Band 3 – clear, consistent
clear understanding of a range of aspects of language supported by relevant and appropriate textual detail

Band 2 – some demonstration
some familiarity with obvious features of language supported by some relevant textual detail

Band 1 – limited demonstration
limited awareness of obvious features of language

So in this case, defining the question hasn’t actually moved us much further on, as there is no simple right or wrong answer to this question. We are still stuck with the vague prose descriptors – this time in a format where the change of a couple of adjectives (‘impressive’ is better than ‘convincing’ is better than ‘relevant and appropriate’) is enough to represent a whole band’s difference. I’ve written about this adverb problem here, and Christine Counsell has shown how you can cut up a lot of these descriptors and jumble them up, and even experienced practitioners can’t put them back together in the right order again.

So how can we define these in a more concrete and meaningful way? A common response is to say that essays are just vaguer than closed questions and we just have to accept that. I accept that in the case of these types of open question, we will always have lower levels of marking reliability than in the case of closed questions like ‘What is 5/9 of 45?’ However, I still think there is a way we can help to define this criterion a bit more precisely. That way is to define the above band descriptors in terms of example pupil work. Instead of spending a lot of time excavating the etymological differences between ‘sophisticated’ and ‘assured’, we can look at a sample piece of work that has been agreed to be sophisticated, and one that has been agreed to be assured. The more samples of work, the better, and reading and discussing these would form a great activity for teacher training and professional development.

So again, we have something that sits behind the criterion, giving it meaning. Again, it would be very powerful if this could be done at a national scale – imagine a national bank of exemplar work by pupils of different ages, in different subjects, and at different grades. But even if it were not possible to do this nationally, it would still be valuable at a school level. Departments could build up exemplar work for all of their frequent essay titles, and use them in subsequent years to help with marking and moderation meetings. Just as creating questions is useful for teaching and learning, so the collection of samples of pupil work is helpful pedagogically too. Tom Sherrington gives an idea of what this might look like in this blog here.

Many departments and exam boards already do things like this, of course, and I suspect many teachers of subjects like English and History will tell you that moderation meetings are some of the most useful professional development you can do. The best moderation meetings I have been a part of have been structured around this kind of discussion and comparison of pupil work. I can remember one particularly productive meeting where we discussed how the different ways that pupils had expressed their thoughts actually affected the quality of the analysis itself. The discussions in that meeting formed the basis of this blog post, about the value of teaching grammar.

However, not all the moderation meetings I have attended have been as productive as this. The less useful type are those where discussion always focusses on finer points of the rubric. Often these meetings can descend into fairly sterile and unresolvable arguments about whether an essay is ‘thoughtful’ or ‘sophisticated’, or what ratio of sophisticated analysis to unsophisticated incoherence is needed to justify an overall judgment of ‘sophisticated’. (‘It’s a level 2 for AF6 technical accuracy, but a level 8 for AF1 imagination – so overall it deserves a level 4’).

So, if we accept the principle that the criteria in mark schemes need interpreting through exemplars, then I would suggest that discussions in moderation meetings should focus more on comparison of essays to other essays and exemplars, and less on comparison of essays to the mark scheme.

criteria vs exemplars

Just as with the criterion / question comparison, this does not mean that we have to get rid of the criteria. It means that we have to define the words of the criteria in terms of the sophistication of actual work, not the sophistication, or more often the sophistry, of the assessor’s interpretation of the mark scheme.

There are two interesting pieces of theory which have informed what I’ve written above. The first is about how humans are very bad at making absolute judgments, like those against criteria. We are much better at making comparative judgments. The No More Marking website has a great practical demonstration of this, as well as a bibliography. The second is the work of Michael Polanyi and Thomas Kuhn on tacit knowledge, and on the limitations of prose descriptions. Dylan Wiliam has written about the application of this to assessment in Embedded Formative Assessment.
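The comparative-judgment idea can be made concrete. One standard statistical model for turning many pairwise ‘which of these two essays is better?’ decisions into a rank order is the Bradley–Terry model. (I don’t know exactly which model No More Marking uses, so treat this as an illustrative sketch, not their actual method; the essay names and judgement data below are invented.)

```python
from collections import defaultdict

# Each tuple records one paired comparison: (winner, loser).
# Assessors never give absolute marks - only 'this one is better'.
judgements = [
    ("essay_A", "essay_B"),
    ("essay_A", "essay_B"),
    ("essay_A", "essay_C"),
    ("essay_B", "essay_C"),
    ("essay_B", "essay_C"),
    ("essay_C", "essay_A"),
]

essays = sorted({e for pair in judgements for e in pair})
wins = defaultdict(int)         # total wins per essay
pair_counts = defaultdict(int)  # number of comparisons per unordered pair
for winner, loser in judgements:
    wins[winner] += 1
    pair_counts[frozenset((winner, loser))] += 1

# Fit Bradley-Terry strengths by simple fixed-point (minorise-maximise)
# iteration, normalising so the strengths sum to 1.
strength = {e: 1.0 for e in essays}
for _ in range(200):
    new = {}
    for e in essays:
        denom = sum(
            pair_counts[frozenset((e, other))] / (strength[e] + strength[other])
            for other in essays
            if other != e
        )
        new[e] = wins[e] / denom if denom else strength[e]
    total = sum(new.values())
    strength = {e: s / total for e, s in new.items()}

for e in sorted(essays, key=strength.get, reverse=True):
    print(f"{e}: {strength[e]:.3f}")
```

The point is that the assessors only ever make comparative judgments – no criteria, no absolute marks – and the model recovers a consistent rank order, with relative ‘quality’ scores, from those judgments alone.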

Over the next couple of weeks I’ll try to blog a bit more about both of these topics.

Responses to “Assessment alternatives 2: using pupil work instead of criteria”

  1. I have written a fair bit about the idea that standards exist in the minds of communities of assessors, rather than in performance descriptions. Here is a paper from the 1990s:

  2. We use our blogging system for exactly this purpose. The students’ work – all contained on an individual site – moves forward with them, and our methods of applying marking criteria are based on a combination of model exemplars and worked examples of student work.

    It’s certainly driving impressive outcomes in our setting – so I’d concur with the message in this article.


  3. Thanks Dylan. This is really interesting. Can I ask what you think the balance should be between authentic tasks of the type described in your article, and more closed tasks like multiple choice questions? I would always want to include open essay questions in assessments of English, history, etc., and I can see the value of them in maths from your article, but ultimately I don’t see how they can ever be as reliable as more closed questions, so I would always want to include a substantial element of these in any assessment.

    • I think we need to see closed questions addressing preliminary aspects of the curriculum and open questions addressing the higher order questions. This reflects an understanding of Bloom as putting knowledge not, as you argued in the Seven Myths, Daisy, at the *bottom* of the priority list, but at the *beginning* of the learning cycle. It will help achieve some consensus in assessing the essays consistently if one can establish whether the student knows what they are talking about, before seeing whether they can mould that knowledge into an argument.

      I think you should be careful about moving too heavily to closed questions, which, if taken too far, will sacrifice validity on the altar of reliability.

      To repeat the point I make in my long comment below, reliability can be increased by more sampling. That is what embedded assessment (e.g. piggy-backing on formative assessment for summative purposes) will offer, especially when combined with analytics. This, in my view, is one of the big wins promised by ed-tech, which can crunch subjective judgments just as much as outputs from multiple choice tests, offering reliability *and* validity against our full range of learning objectives.

  4. Dick Schutz says:

    This is all well and good, but it’s been said since the earliest days of educational measurement – for at least 50 years. Nothing wrong with saying it again, except that the discourse doesn’t provide an “alternative to performance descriptors such as national curriculum levels.” “Levels” are an important concern because the DfE has discarded them in future KS test reporting. [Note: Quit reading at this point and correct me if I’m wrong so far.]

    The “problems” are inherent when instructional assessment is divorced from instruction, and the instruction is a black box with known and unknown buggy characteristics. Under those conditions, whatever the characteristics and interpretations of the measurement/assessment/evaluation, it’s after the fact, and since its focus is on the student rather than on the instruction, it provides no self-corrective information for either the students or the instruction.

    God didn’t command that we divorce measurement and instruction in schooling or that we use only multiple-choice test items and constructed responses (now elevated to authentic tasks) as instrumentation. (As a matter of fact, in a well constructed test, the results of the two item types will correlate as highly as the reliability of the items permits, so the distinction is consequentially trivial.)

    With today’s information technology, it’s economically feasible to operationally implement instructional signal/noise detection that Dylan flagged in 1994–a useful perspective in “instructional intelligence” operations, which I’ve talked about elsewhere. It seems to me it would be more productive to get on with that endeavor rather than trying to make do with anachronistic testing theory and technology.

  5. I agree with your drift: exemplification rather than definition.

    But I don’t think there is any substantive difference between your exemplification by questions and your exemplification by answers. The answer, in the second case, is meaningless without the question; and the question, in the case of closed questions, gives a definition of the answer. So in both cases, you are in fact exemplifying the learning objective by a combination of question *and* answer. The only difference is that in the first case there is one acceptable answer and in the second there are many.

    Although I do not entirely understand what Dick Schutz is saying, I agree this does not provide an alternative to “levels”. It provides an alternative to definitions. You still need the criteria, but they will be seen as descriptions and not definitions.

    I agree with Dylan that “meaning” lies in the consistent understanding and application of the community. But I do not think that to say this is to say anything significant. The question is *how* you achieve consistent understanding in the community. Adding a definition to the OED is one plausible approach – albeit one that we have decided is ineffective in our case.

    And I am not at all convinced by the approach Dylan describes in his paper. This suggests that you can rely on moderation processes in which ” the domain of assessment is never defined explicitly” but markers just do the marking and…

    “when [the marker] has made an assessment, feedback is given by an ‘expert’ as to whether the assessment agrees with the expert assessment. The process of marking different pieces of work continues until the teacher demonstrates that she has converged on the correct marking standard, at which point she is ‘accredited’ as a marker for some fixed period of time.”

    The poor marker is subjected to a game of blind man’s bluff, in which she has to guess what is in the mind of some “expert” – appropriately enough encased in inverted commas. Being an admirer of Plato, I do not believe in anyone’s claim to expertise unless they can give a defensible, rational (and explicit) account of themselves. As Dylan summarises the method, “it is not necessary for the raters (or anybody else) to know what they are doing, only that they do it right”. But how does anyone know that it is right if no-one has articulated what it is? All you know is that it has been approved. It is a method that reminds me of what was known in 1930s Germany as “working towards the Fuhrer”.

    And I don’t think we have addressed the real problem, which is that the community is a very large one and moderation discussions typically occur around fairly small tables. Where, then, is the long-stop, where is the ultimate arbiter of meaning?

    1. We have rejected the rubric, the dictionary definition.

    2. I, at least, have rejected Dylan’s “expert”.

    3. I would also warn against the exemplar that becomes elevated into a reference: “do it like this”. The whole point of the concrete exemplar is that it elucidates without defining. The true objective remains abstract.

    Nor have we defined what we mean by “learning objective” which is a relative term. It says, “this is what we are aiming at” – but what sort of thing *is* it that we are aiming at?

    I have already suggested why I think Dylan’s term, “construct”, is unhelpful in an educational context. My proposal is “capability”, which I see as a certain sort of disposition. A disposition is a tendency to produce a certain sort of performance. The exemplar is just one instance of that *certain sort of performance*. And it is very important that it remains just one instance, and is not confused with the definition of the capability itself.

    To guard against this danger, I would suggest that the exemplars are:
    (a) many;
    (b) like American Presidents, limited to a fixed term, after which they should be replaced.

    Two more questions:

    1. who is it that produces the exemplars?
    2. what is the difference between an exemplar and an appropriately marked question?

    I suspect that there is an important role in all of this for data analytics – and in two senses:

    1. In assessing the consistency with which a learning objective is interpreted.
    2. In assessing the relative merits (i.e. the validity) of different learning objectives, which may be stewarded by different issuing (awarding?) bodies.

  6. […] Assessment alternatives 2: Using pupil work instead of criteria  […]

  7. […] last few posts have all been about the flaws in using criteria, and alternatives to using them. In my last post, I mentioned that there were two pieces of theory which had influenced my thinking on this: […]
